My First Live Broadcast Storm (a long one)

DANMOH009 Member Posts: 241
So I've read about broadcast storms and labbed them out, but today I was faced with one on a live network, and let me tell you, it was a nightmare!

I work for a managed service provider, and the company I was supporting was remote. Just so you have an idea of the network: the company has two main sites, Site A and Site B. Each site has a router (a 1921, which connects to their VPN cloud via a leased line) and a 3560 L3 switch (the two 3560s connect the sites together over a 1Gbps private line). We manage these devices as a company, but there were a few more in the mix, as I later found out.

Now, I still consider myself a noob, so I'm just going to talk you through what I did. I'd like to know what other engineers' approaches would be on this one, so feel free to tell me where you would have gone differently or what I could have done more efficiently.

The ticket got raised saying that sites within their VPN cloud, and also Site A, could not access servers on Site B. My first step was to log into the VPN cloud and see if I could communicate with Site B's 1921 and 3560. Both came back with successful ping results. Hmmm. At this point I rang the customer for more information.

After speaking to the customer, it was reported that the service was intermittent, and I asked for the IP addresses of the servers on site. I then tried pinging these from the VPN and was getting some lost packets. Next I jumped onto the L3 switch at Site B. I could access it fine, but the switch was a little laggy; at one point my terminal screen froze for a minute and then came back to life.

iBGP was running between the 1921 at Site B and the L3 switch. I ran 'show ip bgp summary' on the 1921 (Site B) and noticed that BGP had flapped. A 'show logging' suggested that a teardown notification had gone to the L3 switch to bring down the BGP peering!
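
For anyone following along at home, these are roughly the checks I mean (the peer address here is made up for the example, not the customer's):

    show ip bgp summary                                    ! Up/Down column shows how long the peering has been up
    show logging | include BGP                             ! the %BGP-5-ADJCHANGE messages show each flap
    show ip bgp neighbors 10.0.0.2 | include Last reset    ! reason the last session was torn down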

This explained why access was intermittent: BGP was flapping! But why?? So I jumped back onto the L3 switch (over the private line, as BGP was being a pain).

I ran 'show processes cpu history' (amazing command!). This showed me that the CPU was hitting a constant 99/100%, which explained why BGP was flapping: the CPU was so busy it couldn't respond to the BGP keepalives, and as the peer didn't get a response it tore down the peering.
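
If you ever need to dig into what is actually eating the CPU, this is the sort of thing I mean (nothing here is output from the customer's switch, just the general commands):

    show processes cpu history                   ! ASCII graphs of CPU over the last 60 seconds / 60 minutes / 72 hours
    show processes cpu sorted | exclude 0.00     ! which processes are actually busy
    ! in the '99%/80%' style figure at the top, the number after the slash is time spent at
    ! interrupt level - when that is high, the CPU is mostly busy handling punted traffic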

At this point I tried changing the timers, hoping that if I set the keepalive and hold time to 10 and 300 it would keep BGP up longer (I think I did this wrong, as it didn't work, so I just moved on). Honestly, by this time I was panicking. I suggested rebooting the equipment (praying that this would somehow fix the weird CPU usage). That didn't work either.
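
For reference, this is roughly what I was trying to do (the AS number and neighbour address are made up for the example). The timers are negotiated when the session comes up, so the session also has to be cleared before new values take effect - which is probably where I went wrong:

    conf t
    router bgp 65000
     neighbor 10.0.0.2 timers 10 300             ! keepalive 10s, hold time 300s
    end
    clear ip bgp 10.0.0.2                        ! hard reset so the new timers actually get negotiated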

I then finally looked at the counters on the interface, and the broadcasts were HUGE! For example, if 10,000 packets had gone across the interface, around 9,500 of them were broadcasts.
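
If you've never looked at this before, this is the sort of check I mean (the interface name is just an example):

    show interfaces GigabitEthernet0/1 | include packets input|broadcasts
    ! compare the 'packets input' total with the 'Received ... broadcasts' figure -
    ! in a storm the broadcasts make up nearly all of the input traffic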

The remote engineer I was speaking to (our customer's IT/networking guy) was at Site A, so I asked him to start making his way to Site B!

After he arrived, it turned out that the 1921 connected to a packet shaper, which then connected to an HP switch; the 3560 also connected to this HP switch. Basically the HP switch was the centre of the site, and I think everything was connected into it: servers, other switches (Netgears), all sorts!

(Almost at the end - for all you guys still reading, hang in there.)

Now, if I'm going to be honest, I wasn't really sure what to do; I just knew it was something on the LAN. At this point I could have told the customer that the fault wasn't with the devices we supported and sent him on his way (the thought did cross my mind), but we were in this together, so I had to think of something.

I had two plans of action.

Get the customer to console into the HP switch, run a 'show interfaces' and send me the output! I asked - he couldn't console into it. :(

Next was to disconnect each port on the HP switch one by one while I cleared the counters and checked how much the broadcasts increased. So I tried this! The results were not what I expected: the broadcasts just kept increasing. For some reason we kept coming back to one particular Netgear switch that was daisy-chained off the HP (I think it had caused a few problems in the past - not a broadcast storm, but something else that was a bit dodgy).
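
In practice that looked roughly like this on our 3560 (interface name made up, as before):

    clear counters GigabitEthernet0/1            ! zero the counters (it asks you to confirm)
    ! unplug one port on the HP, wait a minute or two, then:
    show interfaces GigabitEthernet0/1 | include broadcasts
    ! if the broadcast count is still racing up, plug that port back in and move to the next one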

After removing this Netgear, waiting, and then finally rebooting, a miracle happened: BGP stayed up for at least 11 minutes (the longest it had been all morning) and the broadcast numbers had dropped - they were now increasing in their hundreds, not their 20,000s!

AND THAT'S IT! IT WAS GONE!


So this is probably my lengthiest post, but I did want to share. I'd also like to know what others would do in this scenario.

Thanks for reading.

Comments

  • networker050184 Mod Posts: 11,962
    Look at the interface counters earlier in the process!

    BGP timers are negotiated when the session is formed so changing them won't have an effect until the neighbor relationship reforms.
    An expert is a man who has made all the mistakes which can be made.
  • Dieg0M Member Posts: 861
    First thing is to check that the interfaces towards your BGP peer are clean. Usually excessive ignored and no buffer counters will indicate a broadcast storm. From there, configure broadcast storm-control to isolate the port. Once you've identified the port that is err-disabled, investigate.
    Follow my CCDE journey at www.routingnull0.com
  • DANMOH009 Member Posts: 241
    Dieg0M wrote: »
    configure broadcast storm-control to isolate the port. Once you've identified the port that is err-disabled, investigate.

    I'm not sure if I could have done this, because the devices I managed were the Cisco router and switch, which each had one interface going into an HP ProCurve (this was the device that had everything connected to it). I think this was where the broadcast storm was happening, and I had no access to the HP ProCurve.

    I gather that really I should check the interfaces as a priority before anything else, so thanks for the heads up. Reckon I need to read a bit more about runts/throttles etc. so I get a better understanding.

    Thanks for the advice though guys, really appreciate it.
  • Binaryhero Member Posts: 29 ■□□□□□□□□□
    It's always the HP switch that no one knows the password to. ;)
    I don't think there's anything wrong with the way you were troubleshooting.
    The only thing you could have done differently was keep your cool better, but that's something that comes with experience, so it will come.

    I'd say the number one error people make in these kinds of situations (myself included) is that they start panicking, their ability to think logically narrows, and assumptions start being made.

    So good job, and very nice that you are reviewing your actions afterwards - that's really the key to growth!
  • d4nz1g Member Posts: 464
    It seems you did a great job pointing out the problem! Congratz!

    Tell them to use some tools against L2 issues so this doesn't happen in the future (loop guard, broadcast storm control, etc.).
  • DANMOH009 Member Posts: 241
    Thanks guys.

    And yeah, I do panic a bit from time to time, Binary :) Like you said, fingers crossed it will get better with time. I've also jotted down a rough example of that storm-control config below, mostly for my own reference.
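
    For anyone finding this thread later, this is roughly the storm-control / loop guard config Dieg0M and d4nz1g are talking about, as I understand it on a Catalyst 3560 (the interface name and the 1% threshold are just examples, not anything from the customer's kit):

        conf t
        interface GigabitEthernet0/1
         storm-control broadcast level 1.00        ! drop broadcast traffic above 1% of the port's bandwidth
         storm-control action shutdown             ! err-disable the port instead of just dropping the excess
        exit
        errdisable recovery cause storm-control    ! optional - bring the port back up automatically after the recovery interval
        spanning-tree loopguard default            ! loop guard globally, for the STP side of things
        end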