
Strange issue with 4500 switches

bermovick Member Posts: 1,135 ■■■■□□□□□□
Am running into a strange problem every couple months where I work.

Our new building here has a pair of 4510s on each floor (2 floors, 4 total). Each 4510 has dual 6000W power supplies and dual Sup7 cards.

After a couple months of uptime, some of the ports on the blade cards stop working. Normally it acts like a port has gone bad: it fails to come up when an end device is plugged in - the indicator light for the port stays out, status is down (notconnect), etc. When this happens you'll usually see about 4-6 ports in a row on the blade with the issue. Reloading later corrects it. Yesterday I had a port go nuts and fill the CAM table. It wasn't a compromised end device, and when I re-enabled the port it stayed notconnect (but the indicator light stayed ON whether a cable was in or not). At the same time, one of the fiber ports between floors reported itself as down/notconnect. A reload fixed both of those as well.
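Next time it happens I plan to grab some state before reloading, since the reload wipes the evidence. Something along these lines (the interface names here are just examples, not the actual ports):

    show mac address-table count               ! per-VLAN CAM usage, to spot the flood
    show interfaces Gi2/33 status              ! notconnect vs. err-disabled
    show interfaces Gi2/33 counters errors     ! CRCs/runts that would hint at a PHY problem
    show logging | include 2/33                ! link-flap or diag messages for the port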

Our team lead thinks this is a power issue (I'm not that strong here - the racks are running on 224(?) volts rather than 240, is what I think he said), but nothing jumps out at me in the show power / show power detail output. I'm not ruling it out, since reloading WOULD change how much power is being pulled, I suppose. I'm not as sure as he is, though, since (and this was new to me!) reloading with dual supervisors seems to only reload the active supervisor; the one in hot standby takes over during the process, so you aren't really resetting the entire chassis the way you do with a fixed-config switch. I'm wondering more if it's something with the Sup7, since switching to the standby seems to fix it.
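If the sup switchover is really what clears it, that's testable directly without a full reload. These are from the 4500 command reference as I remember it - double-check the exact syntax on your IOS-XE release:

    show redundancy                   ! confirm SSO and which sup is active
    show module                       ! per-slot status - anything not "Ok"?
    redundancy force-switchover       ! swap to the standby sup only
    redundancy reload shelf           ! reload the entire chassis, both sups

If a plain force-switchover brings the dead ports back the same way a reload does, that points pretty hard at the active Sup7 rather than at power.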

I'm probably going to configure a kron job to reload all 4 on the first of each month, but I hate doing that and calling it done (what is this, Windows?), so I'm checking whether anyone else has seen anything similar or has ideas on other things to check while I continue to google as well.
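One gotcha if I do go the scheduled-reload route: kron can't answer the [confirm] prompt that reload throws, so the usual workaround is an EEM applet on a cron timer instead. A rough sketch (the applet name and schedule are placeholders, and I'd test it on one chassis first):

    event manager applet MONTHLY-RELOAD
     event timer cron cron-entry "0 3 1 * *"
     action 1.0 cli command "enable"
     action 2.0 cli command "write memory"
     action 3.0 cli command "reload" pattern "confirm"
     action 4.0 cli command "y"

Swapping the reload for redundancy reload shelf should make sure both sups actually restart.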
Latest Completed: CISSP

Current goal: Dunno

Comments

    DevilWAH Member Posts: 2,997 ■■■■■■■■□□
    Is the 4500 the same as the 6500 series when it comes to power? (I haven't used the 4500 in a while.) With the 6500, if it runs low on power it starts carrying out the following:

    First it shuts down PoE on individual ports, starting at the highest port on the highest-numbered module; if it is still short on power after all PoE is disabled, it shuts down whole modules, from the highest slot number to the lowest.

    So if it is a power issue where you don't have enough power, you would expect the same ports to be affected, and in the same order.
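    Either way, if the chassis were shedding power you'd expect it to show up in the power status rather than as silently dead ports. Worth checking (these should all exist on the 4500 too, though the output differs from the 6500):

        show power                   ! budget vs. allocation per supply
        show power detail            ! per-module power allocation
        show power inline            ! per-port PoE draw, if PoE is in play
        show module                  ! a powered-down module shows up here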
    • If you can't explain it simply, you don't understand it well enough. Albert Einstein
    • An arrow can only be shot by pulling it backward. So when life is dragging you back with difficulties, it means it's going to launch you into something great. So just focus and keep aiming.
    instant000 Member Posts: 1,745
    Disclaimer: I realize that this issue is not exactly like your issue, and I don't have a strong background in electricity, so I could be using the wrong terms. :D

    I wouldn't discount the idea that there may be some intermittent issue with the hardware, or a power-related issue.

    I worked last year at a company with tons of 4500s deployed throughout their buildings. One day I came in and received an alert about a PoE line card. Further investigation revealed that the card had apparently failed. I opened a TAC case, since reseating the card didn't clear the problem, whereas a new line card from the spares worked just fine. I was thinking it could have been fried by voltage feeding back into it; I remember once in Iraq seeing a standalone PoE switch get fried because an officer was going around plugging power supplies into the phones, thinking it would make them "work better" (anyone who has been to the sandbox knows how haphazard and crazy the power is there).

    So, it is with this background knowledge that I went into the TAC case.

    I queried Cisco about the particular error that occurred, sharing all my theories on what could have caused it, but we could not locate a root cause. I was thinking it was possibly the power in the facility, or the cards, or the chassis .... I was convinced that something specific was the root cause.

    After consulting with TAC on the case for a few minutes, they decided to send out another line card. OK, cool. I set up the RMA and got to work on other tickets.

    Later that morning, the regular employees showed up. One of the network admins clued me in on the problem they had been having with the 4500s, since this was my first time seeing it while I was there. Intermittently, they would get line card failures on their 4500s. They were unable to isolate it to any particular chassis, and could not correlate it to a power surge or any other such event. It was so bad that they kept a log of the serial numbers that had been turned back in to Cisco over the issue. He informed me that the reason there was a "new" spare line card on hand was how regularly this problem occurred.

    Each time I opened a ticket (during the few months I was there, I had to work several more tickets for this issue), I would refer to all the prior serial numbers that had been submitted for the same problem.

    Cisco's answer? Send out more line cards.

    I dunno. I tend to think the problem is either the line cards failing, something plugged into the line cards somehow short-circuiting them, or just dirty power in the building. Since the intermittent problem specifically affects only line cards, I lean towards something plugged into the cards, or just a bad lot of cards.

    Currently Working: CCIE R&S
    LinkedIn: http://www.linkedin.com/in/lewislampkin (Please connect: Just say you're from TechExams.Net!)
    bermovick Member Posts: 1,135 ■■■■□□□□□□
    That's useful to know, but if the 4500 does work the same way, that's not quite what's happening here.

    PoE has been disabled since no devices on the network require it.
    Ethernet ports tend to drop in groups, but they aren't the highest ports or modules. It'll be a random blade and a seemingly random group of ports on it - ports 33, 35, 37, 39, as an example - as if each section of ports is connected to a single controller chip on the blade and one of those is what temporarily fails.

    I dunno. It's hard to gather useful information when it happens this infrequently.
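    One way to get data without waiting for the next failure might be the online (GOLD) diagnostics, if the Sup7 image supports them - I haven't leaned on these much, so treat the syntax as approximate and the module number as a placeholder:

        show diagnostic content module 3          ! lists available tests, flags disruptive ones
        show diagnostic result module 3 detail    ! bootup and health-monitoring results
        diagnostic start module 3 test all        ! on-demand run - mind the disruptive tests

    If a stub ASIC on a blade really is wedging, a failed per-port loopback test in the results would be a nice smoking gun to hand to TAC.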
    Latest Completed: CISSP

    Current goal: Dunno