Just witnessed my first routing loop

W Stewart · March 2013

Not that big of a deal, but it's not something I've come across at previous jobs since I've never worked at a big data center with a complex network setup. I'm a jr sys admin, so I'm more of a server guy, but I've got a pretty decent amount of networking knowledge. I initially got a "time to live exceeded" message when pinging a few servers on a specific subnet, and we pretty much figured it was due to a change that one of the networking guys made in order to fix another problem. It got me thinking about what was causing the message. After doing a little reading I ran a traceroute and verified that it was a routing loop. Not that important, but it's nice to see what one looks like.
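For anyone wondering why a loop shows up as "time to live exceeded": each router decrements the packet's TTL before forwarding, and a packet bouncing between two routers runs the TTL down to zero instead of arriving. Here is a rough Python sketch of that idea. The router names and next-hop tables are made up for illustration and are not from the actual network in question.

```python
def forward(next_hop, start, destination, ttl=64):
    """Follow next-hop entries until the packet is delivered or its TTL expires.

    next_hop maps a router name to the next hop it uses for this destination.
    These tables are hypothetical, not real router configs.
    """
    path = [start]
    router = start
    while router != destination:
        ttl -= 1  # every router decrements TTL before forwarding
        if ttl == 0:
            # The router drops the packet and reports ICMP "time exceeded".
            return path, f"time to live exceeded at {router}"
        router = next_hop[router]
        path.append(router)
    return path, "delivered"


# Healthy case: A hands off to B, and B reaches the server.
print(forward({"A": "B", "B": "server"}, "A", "server", ttl=4))

# Looped case: A and B each think the other is the way to the server.
print(forward({"A": "B", "B": "A"}, "A", "server", ttl=4))
```

With a small TTL the looped case gives up after a few bounces between A and B, which is what ping reports back.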

nerdydad · March 2013

Wow, haven't seen one in production, and I've only seen it once by accident in a lab, many times when it was configured to loop. Were they redistributing between protocols? I have seen some broadcast storms when STP failed, that's pretty awesome except that it was in a production network.

networker050184 · March 2013

I've seen them plenty of times in production. Once you get a large enough network with a lot of people making changes it's bound to happen!

nerdydad · March 2013

That is where solid processes pay off. I'm not saying mistakes never happen, but by the time we make any changes on the network, those changes have been through a gauntlet of more experienced eyes looking at them. We often complain about all the process involved in even the simplest of changes, but in the end it is in the name of network stability. We also were doing very little redistribution. As we move to a different internal routing protocol I have noticed asymmetrical routing, but fortunately no loops. Being on the build side of the house, I hope to never see one in our current network, as it will mean that I or one of my coworkers caused it, otherwise it would be solved by operations.

Mrock4 · March 2013

I have seen a fair share of loops- most were indeed a misconfiguration. It's never good to see:

tracert 1.1.1.1
1- 10.25.1.1
2- 10.16.19.5
3- 10.25.1.1
4- 10.16.19.5
5- 10.25.1.1

lol...the good news is 99% of the time the issue is with that second hop (which sends the route back), so often times a look at the routing table there can go a long way.

Congrats on your first loop!
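If you want to spot that pattern in a script rather than by eye, here is a minimal Python sketch using the hop list from the trace above. The function name `find_loop` is just something made up for this example, and it assumes you have already parsed the traceroute output into a list of addresses, with "*" for hops that timed out.

```python
def find_loop(hops):
    """Return (first_index, repeat_index) for the first repeated hop, or None."""
    seen = {}
    for i, addr in enumerate(hops):
        if addr == "*":  # timed-out hop, nothing to compare
            continue
        if addr in seen:
            return seen[addr], i
        seen[addr] = i
    return None


hops = ["10.25.1.1", "10.16.19.5", "10.25.1.1", "10.16.19.5", "10.25.1.1"]
loop = find_loop(hops)
if loop:
    first, repeat = loop
    # The hop just before the repeat is the one that sent the packet back.
    print(f"hop {repeat + 1} repeats hop {first + 1}; "
          f"check the routing table on {hops[repeat - 1]}")
```

For this trace it points at 10.16.19.5, the second hop, which matches the rule of thumb above. A repeated address is only a strong hint, so it is still worth checking the routing table by hand.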

VAHokie56 · March 2013

networker050184 wrote: »

I've seen them plenty of times in production. Once you get a large enough network with a lot of people making changes it's bound to happen!

I caused one! On my first real network gig I had a project to set up dual EPLs for a site-to-site connection. Ran them into a couple of 6524s and peered them up via BGP to the main site... long story short, I learned a lot about route tagging and route maps that night at 4am on the fly... I miss my rookie days =P

Xyro · March 2013

I have mixed sentiments about ever seeing 1. I think it would be cool, but I also have a feeling I'll be the 1 everyone expects to fix it lol.

Congrats on the 1st loop though!

CodeBlox · March 2013

I've witnessed it out at our sites in Utah. I found that reloading one of the routers invokes a temporary EIGRP routing loop that lasts for about 5 minutes.

Mrock4 · March 2013

Seeing a loop isn't too bad- once you see it you've found the problem

W Stewart · March 2013

Networking didn't give us the details on what exactly happened. I figured it might have been misconfiguration but one of my co-workers thought that the server might have been sending spam and they intentionally looped it to cut it off from the network. I believe it was an entire subnet with at least two different customers though so that may not have been the case.

W Stewart · March 2013

CodeBlox wrote: »

I've witnessed it out at our sites in Utah. I found that reloading one of the routers invokes a temporary EIGRP routing loop that lasts for about 5 minutes.

That could have been it. When we called networking they said the issue should mitigate itself eventually so it's very possible that this is what happened. They seemed very convinced in the notes that the issue had already been resolved.

RouteMyPacket · March 2013

Pfft!

I have seen one in production alright...nothing like losing half a building due to one. God forbid someone configure STP on a switch. lol

phoeneous · March 2013

Mrock4 wrote: »

I have seen a fair share of loops- most were indeed a misconfiguration. It's never good to see:

tracert 1.1.1.1
1- 10.25.1.1
2- 10.16.19.5
3- 10.25.1.1
4- 10.16.19.5
5- 10.25.1.1

lol...the good news is 99% of the time the issue is with that second hop (which sends the route back), so often times a look at the routing table there can go a long way.

Congrats on your first loop!

This just happened to me last week! Misconfig in gre tunnel. Fun times!

networker050184 · March 2013

nerdydad wrote: »

That is where solid processes pay off. I'm not saying mistakes never happen, but by the time we make any changes on the network, those changes have been through a gauntlet of more experienced eyes looking at them. We often complain about all the process involved in even the simplest of changes, but in the end it is in the name of network stability. We also were doing very little redistribution. As we move to a different internal routing protocol I have noticed asymmetrical routing, but fortunately no loops. Being on the build side of the house, I hope to never see one in our current network, as it will mean that I or one of my coworkers caused it, otherwise it would be solved by operations.

I understand the whole change control process, but it can't be perfect. If I'm reviewing one change and another guy is reviewing another, we don't know about the other change, and they could end up causing an issue if both are done. Can't have one person see everything. The change control process usually takes weeks at a time to get written, reviewed and then completed. Other things could have changed in that time where the change would have been flawless if not for some traffic reroute due to a bad circuit etc. Things happen!

Mrock4 wrote: »

Seeing a loop isn't too bad- once you see it you've found the problem

Yeah, not seeing the loop is when you get in trouble!

ShamPOW · March 2013

I saw a really interesting one just last week. A bit of setup..........I work for an ISP. This customer was a remote site for a VERY large chain of stores. We had assigned a /30 block to them, binding their username/WAN interface to their first useable IP of that subnet.

The issue arose because it looks like their equipment was configured to expect their SECOND useable on that WAN interface.

Here's where it got tricky, since I was actually talking to an offsite 3rd party IT group who had no real idea what was going on out there as far as equipment and configuration. Their WAN interface was pulling the first useable IP. Traceroutes to the second useable would route to that IP, bounce around a couple of hops on a private 10.x.x.x network, hit an IP that belonged to a /16 block owned by this chain of stores, go through Time Warner, Level 3, then BACK to my ISP, and BACK to the site ad infinitum (or 30 hops). I suspect that /16 was their VPN to the corporate office, but I haven't covered VPNs all that much yet so I could be completely wrong there.

I saved a screenshot of the traceroute for posterity.
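For anyone fuzzy on what "first useable" and "second useable" mean in a /30, here is a small Python sketch using the standard `ipaddress` module. The block 192.0.2.0/30 is a documentation range picked purely for illustration, not the customer's real assignment.

```python
import ipaddress

# Hypothetical /30 from the documentation range, not the real customer block.
block = ipaddress.ip_network("192.0.2.0/30")

# A /30 has four addresses: network, two usable hosts, and broadcast.
usable = list(block.hosts())
first_usable, second_usable = usable

print("network:      ", block.network_address)
print("first usable: ", first_usable)
print("second usable:", second_usable)
print("broadcast:    ", block.broadcast_address)
```

With only two host addresses to choose from, gear that expects the second one while the WAN interface is bound to the first is an easy mismatch to make.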

nerdydad · March 2013

networker050184 wrote: »

I understand the whole change control process, but it can't be perfect. If I'm reviewing one change and another guy is reviewing another, we don't know about the other change, and they could end up causing an issue if both are done. Can't have one person see everything. The change control process usually takes weeks at a time to get written, reviewed and then completed. Other things could have changed in that time where the change would have been flawless if not for some traffic reroute due to a bad circuit etc. Things happen!

Absolutely. Our process involves multiple types of calls depending on the type of change. A dedicated team within operations reviews everything, and when they look at the devices that you are making the changes on, they can see every other change that has been associated with that device. It's not foolproof and stuff happens, but I have been surprised by some of the things they have caught. I mean, in the end a routing loop is easily discovered, and once it is discovered, it is usually easily mitigated unless you have really crazy amounts of redistribution going on.
