Question about recent network outage

CodeBlox Member Posts: 1,363 ■■■■□□□□□□
We have a 6509 as our core switch. One of our three 48-port gigabit blades failed, causing a catastrophic outage. When I walked into the NOC I found a single red light and all of the switchport lights off. The problem blade also seemed to be locking up the entire core switch, preventing me from getting in via telnet or console. Has anyone ever heard of a single blade (48 1Gb ports) locking up an entire switch? I'm just wondering if there could have been more than one issue. Things have normalized since then with a blade from our old core switch. We plan to cut over tomorrow to the replacement that was flash shipped.
Currently reading: Network Warrior, Unix Network Programming by Richard Stevens

Comments

  • networker050184 Mod Posts: 11,962
    I've seen plenty of situations like this. Could be many things: a jacked up CEF table, the backplane, the supervisor getting whacked out. Probably no way to know for sure unless you have a crash file for Cisco to look at (which is unlikely since, from what I'm understanding, the SUP didn't crash). Even then they will probably just tell you it got into a corrupted state from the line card failing....
    An expert is a man who has made all the mistakes which can be made.
  • CodeBlox Member Posts: 1,363 ■■■■□□□□□□
    The SUPs didn't crash, but in the chaos I was asked to reboot the entire core switch, which means the logs from that time are no longer there. Couldn't console into it. Is there anything else I should check, or should I leave it at "the blade failed"?
    Currently reading: Network Warrior, Unix Network Programming by Richard Stevens
  • inscom.brigade Member Posts: 400 ■■■□□□□□□□
    Try doing a show tech-support and open a case with Cisco.
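    A rough sketch of grabbing that output once you can get back into the box (the disk0: filesystem name here is an assumption - it varies by supervisor and platform):

        terminal length 0     ! turn off paging so the output isn't chopped up
        show tech-support     ! capture this with your terminal emulator's session logging
        show logging          ! grab whatever is still in the log buffer
        dir disk0:            ! check for any crashinfo files worth attaching to the case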
  • networker050184 Mod Posts: 11,962
    You can open a case with Cisco, but honestly I really doubt you'll get much out of it without any information to give them. A show tech probably won't have anything either; it's more of a snapshot of the router's current state and log messages.
    An expert is a man who has made all the mistakes which can be made.
  • inscom.brigade Member Posts: 400 ■■■□□□□□□□
    we had "several", 4506 e chassis with blade trouble and power supply trouble. ( i could elaborate but that would take a page itself), after opening a case with cisco and giving them a show tech-su they came back with "their" IOS, had a bug and that we needed to upgrade the IOS.
    If you have all the time in the world to look at that, you'll figure out what's wrong your self right. But if you need to attend to things at hand while your paying for a service contract putting cisco to work for you, is a good idea?
  • networker050184 Mod Posts: 11,962
    Well yeah, if it's an IOS bug that is one thing, but if it's just a case of a router getting into an inconsistent state you aren't likely to find anything.
    An expert is a man who has made all the mistakes which can be made.
  • Iristheangel Mod Posts: 4,133
    Time to set up a Syslog server and some SNMP traps! At least in that scenario, all your logs would not have been lost upon reboot.

    I had a somewhat similar situation about a month ago where I was asked to reboot the core. I was on-call for the week and I got the call early on a Sunday morning that our entire data center was down and the CIO and his boss had already noticed. I jumped in the car in my pajamas and flip flops and zoomed off to the data center. The console wasn't gummed up but it was close to it. I was getting HSRP and EIGRP flaps blasting my console and the CPU utilization on both cores was at 100%. Logging into that mess was a process. I'll spare you the gory details, but the CIO ended up showing up to the datacenter 20 minutes into my troubleshooting and was like "If we reboot them, is it possible it will fix the problem?" I basically was like "Very very very very doubtful. Plus we lose the logs, so if the problem goes away we run the risk of it occurring again at a worse time than a non-business day if we don't root cause it." The CIO was insistent about me rebooting it anyways, so I did.

    In my situation, the problem came right back, so the loss of logs didn't really matter as much and we were able to find the problem (an STP issue - port channels were flapping on one of the access layer switches), but it sure lit some fires under some people's butts to get a better logging and alerting solution in place. We ended up adjusting and pointing SNMP traps to our newer SolarWinds server, adding logging for the emergency, alert, critical, and error levels and pointing it towards SolarWinds, adding out-of-band terminal access, and adding out-of-band alerting.

    My point is that it may suck to have lost all your logs and to feel like there's a potential underlying issue that could go off at any point, but it's a great opportunity to pitch a syslog and SNMP server to the business. It'll be pretty invaluable in the future if you can get them to approve getting it going.
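
    If it helps as a starting point, the trap side is only a few lines of config. This is just a sketch - the collector address and community string below are placeholders, and the exact trap keywords available vary by platform and IOS version:

        snmp-server community REPLACE-ME ro               ! placeholder read-only community string
        snmp-server host 10.1.1.50 version 2c REPLACE-ME  ! placeholder collector/NMS address
        snmp-server enable traps snmp linkdown linkup     ! basic interface up/down traps
        snmp-server enable traps envmon                   ! power, fan, and temperature alerts
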
    BS, MS, and CCIE #50931
    Blog: www.network-node.com
  • PCSPreston Users Awaiting Email Confirmation Posts: 127
    I would think that could be IOS image corruption. Hard to tell at this point, however.
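    One quick sanity check if you suspect a bad image - just a sketch, and the filesystem and filename below are placeholders for whatever the SUP is actually booting from:

        show version                           ! confirm which image actually booted
        verify /md5 disk0:<image-file>.bin     ! compare the hash against the one on Cisco's download page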
  • CodeBlox Member Posts: 1,363 ■■■■□□□□□□
    I think this definitely merits sending the logs to our SolarWinds Orion server via Syslog. Will be putting that in place on Monday.
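    Something along these lines should cover the switch side (a sketch only - the collector address and source interface are placeholders for whatever we actually use):

        service timestamps log datetime msec localtime  ! make each message's timestamp useful
        logging buffered 64000                          ! keep a decent local buffer as well
        logging host 10.1.1.50                          ! placeholder address of the Orion/syslog collector
        logging trap errors                             ! ship severity 0-3 (emergencies through errors)
        logging source-interface Loopback0              ! placeholder; pick a stable source interface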
    Currently reading: Network Warrior, Unix Network Programming by Richard Stevens