Splunking with Cisco

I have recently installed Splunk and have spent a bit of time getting it to work with Cisco IOS devices, so here are a few things I have found that may be useful to first-time Splunkers, or that just make it a bit more useful.

My plan for Splunk is to monitor faults, track ongoing performance issues and manage inventory. Note that I have only configured this on ISR routers and switches; I haven’t tried it on any ASAs, voice gateways, servers or WLCs/APs.

Basic Setup

First, if you haven’t already, download and install Splunk. You can use the trial license while you are trying it out. Once Splunk is installed, add the Cisco app, then open the app and click Help.

The Help section has some pretty good documentation on configuring the logging basics and on configuring call-home to send inventory data to Splunk.

A few caveats

Here are a few gotchas I have noticed.

  • If you have a switch running 12.2(50) or 12.2(55) code, you can’t configure call-home. 12.2(58) works, and so does any 15 code. 6500s running 12.2(33) also work.
  • Older versions of 6500 IOS also don’t allow you to configure call-home from an interface within a VRF. I’m not quite sure where this was fixed, but 12.2(33)SXH5 doesn’t work and 12.2(33)SXJ10 does, so it is somewhere between the two.
  • If you are running VRFs you will need to change the logging host configuration to:
logging host [Splunk IP] vrf [MANAGEMENT VRF]
logging source-interface [interface] vrf [MANAGEMENT VRF]
  • Having working DNS for your call-home and logging source interfaces makes everything much nicer.
  • If you want to see the link up/down status on 6500s you will need to include the command:
logging event link-status default

See here for more details.

Once you’ve done that, you should see the various pre-built dashboards start to fill up with data.
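If you are rolling the logging changes from the caveats out to a lot of devices, it can help to generate the lines rather than type them by hand. The Python below is only a minimal sketch: `render_logging_config` and its parameters are hypothetical names, the address, interface and VRF in the usage are placeholders, and it just builds the IOS commands already listed above without touching any device.

```python
from typing import List, Optional


def render_logging_config(
    splunk_ip: str,
    source_interface: str,
    vrf: Optional[str] = None,
    is_6500: bool = False,
) -> List[str]:
    """Build the IOS logging lines from the caveats (hypothetical helper)."""
    if vrf:
        # VRF-aware versions of the logging host and source-interface commands.
        lines = [
            f"logging host {splunk_ip} vrf {vrf}",
            f"logging source-interface {source_interface} vrf {vrf}",
        ]
    else:
        # Plain global-table versions when no management VRF is used.
        lines = [
            f"logging host {splunk_ip}",
            f"logging source-interface {source_interface}",
        ]
    if is_6500:
        # Needed on 6500s to get link up/down events into Splunk.
        lines.append("logging event link-status default")
    return lines


for line in render_logging_config("192.0.2.10", "Vlan100", vrf="MGMT", is_6500=True):
    print(line)
# logging host 192.0.2.10 vrf MGMT
# logging source-interface Vlan100 vrf MGMT
# logging event link-status default
```

Keeping it as a plain list of strings means the output can be pasted into a session or fed to whatever config tool you already use.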

IOS Version Issues

One of the first things I noticed was the IOS versions weren’t being picked up correctly for:

  • ISR routers (2951, 3845 and 4321 were the ones I noticed) running either 12 or 15 code
  • 3850 switches running 3.6.x or 3.7.x

Pretty much every device also has version 2.0 associated with it, and a large number of devices show an “unknown” IOS version.

6500s, 3750s running 15.x or 12.2(58) code, ASR920s and switches running 15.2.x code all seemed to work fine, apart from the duplicate 2.0 version listed against them.

To fix this, you need to either edit the existing “Software version” attribute or create a new one; I created a new one. To do this, click Settings -> Data Models, go to Cisco_IOS_Event and click Edit Objects. If you have acceleration turned on for the data model, you will need to turn it off temporarily.

From here click Add Attribute -> Regular Expression.

Looking at the call-home data, you can see that the default “Software version” attribute also picks up other fields within the call-home message. That is where the 2.0 values come from, and why some 3850s get picked up as 15.6 etc. instead of 3.6 or 3.7. The best regex I have so far, which matches all the devices I have seen, is:

.*Version.*(?<ios_version>((03|15|12)\.[0-9][\w\.\(\)\-]*))((</rme)|(,* RELEASE)|( RELEASE))
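If you want to sanity-check the regex outside Splunk before adding the attribute, the sketch below runs it with Python's `re` module. Two assumptions to flag: Python spells a named group `(?P<name>...)` rather than the PCRE-style `(?<name>...)` that Splunk accepts, so only the group syntax has been translated; and the sample lines are hypothetical version banners in the usual IOS `show version` style, not captured call-home output.

```python
import re
from typing import Optional

# The regex from the post, with (?<ios_version>...) rewritten as
# (?P<ios_version>...) for Python; everything else is unchanged.
IOS_VERSION_RE = re.compile(
    r".*Version.*(?P<ios_version>((03|15|12)\.[0-9][\w\.\(\)\-]*))"
    r"((</rme)|(,* RELEASE)|( RELEASE))"
)


def extract_ios_version(line: str) -> Optional[str]:
    """Return the ios_version capture for one line, or None if it doesn't match."""
    match = IOS_VERSION_RE.search(line)
    return match.group("ios_version") if match else None


# Hypothetical sample lines for illustration only.
samples = [
    "Cisco IOS Software, C3750E Software (C3750E-UNIVERSALK9-M), "
    "Version 15.2(4)E7, RELEASE SOFTWARE (fc2)",
    "Cisco IOS Software, IOS-XE Software, Catalyst L3 Switch Software "
    "(CAT3K_CAA-UNIVERSALK9-M), Version 03.06.05E RELEASE SOFTWARE (fc2)",
    "Version 2.0",
]

for sample in samples:
    print(extract_ios_version(sample))
# 15.2(4)E7
# 03.06.05E
# None
```

The bare "Version 2.0" line returns nothing because the capture has to start with 03, 15 or 12 and be followed by a RELEASE or `</rme` terminator.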

If you are running any 16 code, you will obviously need to update the regex to match it.
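One way to do that, presumably, is to add 16 to the major-version alternation, so `(03|15|12)` becomes `(03|15|12|16)`. The Python sketch below builds the pattern from a list of major versions so you can test the change before editing the attribute. This is an untested assumption on my part for real 16.x call-home output; `build_version_regex`, `ios_version` and the sample banner are hypothetical, and the named group again uses Python's `(?P<name>...)` syntax.

```python
import re
from typing import Iterable, Optional


def build_version_regex(majors: Iterable[str]) -> "re.Pattern[str]":
    """Rebuild the ios_version regex with a configurable list of major versions.

    ["03", "15", "12"] reproduces the pattern from the post; adding "16"
    is the assumed change for 16.x code.
    """
    alternation = "|".join(re.escape(major) for major in majors)
    return re.compile(
        r".*Version.*(?P<ios_version>((" + alternation + r")\.[0-9][\w\.\(\)\-]*))"
        r"((</rme)|(,* RELEASE)|( RELEASE))"
    )


def ios_version(pattern: "re.Pattern[str]", line: str) -> Optional[str]:
    match = pattern.search(line)
    return match.group("ios_version") if match else None


# Hypothetical 16.x banner line, for illustration only.
line_16 = ("Cisco IOS Software, Catalyst L3 Switch Software (CAT9K_IOSXE), "
           "Version 16.9.4, RELEASE SOFTWARE (fc2)")

print(ios_version(build_version_regex(["03", "15", "12"]), line_16))        # None
print(ios_version(build_version_regex(["03", "15", "12", "16"]), line_16))  # 16.9.4
```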

This should get rid of all the unknowns and random 2.0s, and it seems to work with the call-home format of every device I have seen so far.

Diagnostic Messages – Cleaning it up a bit

On the overview dashboard there is a diagnostic messages table. This table doesn’t seem to capture everything it’s supposed to, and it can get crowded with hundreds of near-identical messages. For example, if you are getting high SP utilization, you might get hundreds of messages like this:

CPU util(5sec): SP=35% RP=12% Traffic=1%
CPU util(5sec): SP=37% RP=11% Traffic=1%
CPU util(5sec): SP=34% RP=12% Traffic=1%

etc

This isn’t very useful: the fact that the CPU is high is important, but the exact value probably doesn’t matter. So you can group them all together to get just this:

CPU util(5sec): SP=X% RP=X% Traffic=X%

On the dashboard, click Edit -> Edit Panels. Go down to the diagnostic message table, click the magnifying glass -> edit search string. This is what I have:

(eventtype="cisco_ios-diag" OR (eventtype="cisco_ios-ios" AND facility="CONST_DIAG"))
| eval eventcode=facility + "-" + severity_id + "-" + mnemonic
| rex field=message_text mode=sed "s/([\:\=])(\d+)/\1X/g"
| stats count AS Count, latest(_time) AS _time, latest(severity_id) AS severity_id by host, eventcode, message_text
| lookup cisco_ios_severity severity_id
| sort +severity_id,-Count
| table _time, host, eventcode, message_text, severity_id_and_name, Count
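To make the logic of that search easier to follow, here is a rough Python equivalent of what it does to the events: build `eventcode`, mask the numbers in `message_text`, then count and keep the latest time per host/eventcode/message, sorted by severity (lowest number, i.e. most severe, first) and then by count. This is only an illustration of the pipeline, not how Splunk works internally. The `Event` type, the `summarize` function and the sample events are hypothetical, and the `eventtype` filter, the severity lookup and the final `table` are left out.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, int, int, float]  # host, eventcode, message, severity_id, Count, _time


@dataclass
class Event:
    # Hypothetical stand-in for the fields the search reads from each event.
    time: float
    host: str
    facility: str
    severity_id: int
    mnemonic: str
    message_text: str


def summarize(events: List[Event]) -> List[Row]:
    groups: Dict[Tuple[str, str, str], List[Event]] = defaultdict(list)
    for e in events:
        # eval eventcode=facility + "-" + severity_id + "-" + mnemonic
        eventcode = f"{e.facility}-{e.severity_id}-{e.mnemonic}"
        # rex mode=sed: mask digits that directly follow ":" or "="
        message = re.sub(r"([:=])(\d+)", r"\1X", e.message_text)
        groups[(e.host, eventcode, message)].append(e)

    rows: List[Row] = []
    for (host, eventcode, message), members in groups.items():
        # stats count, latest(_time), latest(severity_id) by host, eventcode, message_text
        latest = max(members, key=lambda ev: ev.time)
        rows.append((host, eventcode, message, latest.severity_id, len(members), latest.time))

    # sort +severity_id,-Count: most severe first, then the noisiest messages
    rows.sort(key=lambda row: (row[3], -row[4]))
    return rows


# Hypothetical events; "SAMPLE" is a placeholder mnemonic.
events = [
    Event(1.0, "sw1", "CONST_DIAG", 4, "SAMPLE", "CPU util(5sec): SP=35% RP=12% Traffic=1%"),
    Event(2.0, "sw1", "CONST_DIAG", 4, "SAMPLE", "CPU util(5sec): SP=37% RP=11% Traffic=1%"),
    Event(3.0, "sw1", "CONST_DIAG", 4, "SAMPLE", "CPU util(5sec): SP=34% RP=12% Traffic=1%"),
    Event(4.0, "sw1", "CONST_DIAG", 3, "SAMPLE", "Example error message"),
]

for row in summarize(events):
    print(row)
# ('sw1', 'CONST_DIAG-3-SAMPLE', 'Example error message', 3, 1, 4.0)
# ('sw1', 'CONST_DIAG-4-SAMPLE', 'CPU util(5sec): SP=X% RP=X% Traffic=X%', 4, 3, 3.0)
```

The three CPU messages collapse into one row with a count of 3, which is exactly the decluttering effect the masking is there for.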

This part:

sed "s/([\:\=])(\d+)/\1X/g"

is what replaces any number that directly follows a “:” or “=” with an “X”, so that otherwise identical messages are grouped together. That may or may not be useful to you, but I think it cuts down on the clutter.
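If you want to check what the substitution will do to a message before changing the dashboard, the same pattern works with Python's `re.sub`. The sketch below is an illustration with made-up messages and a hypothetical `mask_numbers` helper. Reading the pattern, one side effect is worth knowing about: it masks any digits that directly follow a colon or equals sign, so parts of timestamps and `IP:port` pairs get collapsed too, while digits before the colon are left alone.

```python
import re

# Python equivalent of: rex field=message_text mode=sed "s/([\:\=])(\d+)/\1X/g"
# Any run of digits directly after ":" or "=" becomes "X".
MASK = re.compile(r"([:=])(\d+)")


def mask_numbers(message: str) -> str:
    return MASK.sub(r"\1X", message)


print(mask_numbers("CPU util(5sec): SP=35% RP=12% Traffic=1%"))
# CPU util(5sec): SP=X% RP=X% Traffic=X%
print(mask_numbers("Link down at 10:42:07 to 192.0.2.1:514"))
# Link down at 10:X:X to 192.0.2.1:X
```

Note that "(5sec)" survives because the 5 follows "(", not ":" or "=".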

I think that’s about it for this post. I’ve also created a couple of graphs to monitor high temperature warnings and to break down IOS versions by model type; I’ll post those shortly.

If you have any other cool graphs or tweaks feel free to add them below!

Tom

This entry was posted in Config, IOS, Network Monitoring, Splunk by Tom.
