EIGRP Flapping...

CodeBlox · August 2013

On our network at work, we have two remote sites peered over EIGRP. I'm not sure how long it's been happening, but the neighborship has been flapping up and down. Running debug eigrp packet hello, I can see that the hellos (multicast) are sent and received just fine. The updates (unicast) are what appear not to work. The same debug shows them with the retry counter incrementing all the way up to 16, then the neighborship breaks and the routes are lost. The neighbors reconverge and it happens all over again.

The neighbors are peered on SVIs. One thing I noticed is that on one neighbor the MTU is set to 1500, while on the other it's set to 1504. The site with the MTU of 1504 has QinQ in place, and I believe the larger MTU is necessary for that to work. Could this MTU mismatch cause the issue I am seeing? The neighborship forms but updates constantly retry. Q Cnt stays at 1 and RTO is 5000.
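For anyone following the MTU arithmetic in this post, here is a rough sketch of where the extra 4 bytes come from. It assumes standard Ethernet II framing (14-byte header, 4-byte FCS) and 4 bytes per 802.1Q tag. The function name is made up for illustration, and how a given platform counts these bytes in its MTU setting varies, so treat this as arithmetic only, not as a statement about this particular config.

```python
# Illustrative arithmetic only: on-the-wire frame size for a full-size IP
# packet carried with 0, 1, or 2 VLAN tags. Names here are hypothetical.
ETH_HEADER = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
DOT1Q_TAG = 4    # one 802.1Q tag; QinQ stacks a second (outer) tag
ETH_FCS = 4      # frame check sequence


def frame_size(ip_packet_size: int, vlan_tags: int) -> int:
    """Return the Ethernet frame size in bytes, FCS included."""
    return ETH_HEADER + vlan_tags * DOT1Q_TAG + ip_packet_size + ETH_FCS


if __name__ == "__main__":
    for tags, label in ((0, "untagged"), (1, "802.1Q"), (2, "QinQ")):
        print(f"{label:9s} {frame_size(1500, tags)} bytes")
```

Each added tag costs 4 bytes, which is why a switch doing QinQ tunneling is commonly given 4 bytes of extra headroom.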

instant000 · August 2013

Hrm.

Just to confirm that it is not something else, these neighbors can ping each other prior to setting up EIGRP, right?

Have you tried configuring a neighbor command in order to get EIGRP to use unicast updates versus multicast (in case some security guy is blocking the multicast traffic)?

I hope these suggestions give you some ideas:

1. confirm the neighbor is always reachable
2. try unicast versus multicast

Hrm ... my connection to cisco.com is down right now, I was going to look over the EIGRP FAQs to see if this issue had surfaced before ... and I need to be getting to bed, so I can't be bothered to lab this up right now...I thought EIGRP supported neighbor statement for unicast, but my mind is in "go to bed" mode right now.

I hope these ideas help, though.

EDIT:

I see that I MISREAD your post. Apparently, the unicast is what is not working!

Disregard everything I said, and allow me to get some sleep.

CodeBlox · August 2013

No worries

I am going to further investigate the QinQ config tomorrow... QinQ might be transparent to us, and if it is, I'm gonna remove the system-wide MTU setting, which requires a reboot of the core :P It might be a default setting, but I'm gonna set it to 1500 like the other end and see what happens.

Additionally, not sure if it's a reliable test but I cannot ping the other end if I set the MTU to 1504 with the df bit set. I can otherwise ping it though. The issue happens roughly every minute (16 update retries). I started a ping of 15000 packets and none of them dropped from end to end.
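A small sketch of why the two ping tests above can disagree. It assumes the ping size is the full IP datagram size and that some hop on the path has a 1500-byte IP MTU. The function, the MTU list, and the 100-byte small-ping size are all hypothetical values chosen for illustration, not taken from the actual network.

```python
from typing import Sequence


def ping_outcome(datagram_size: int, path_mtus: Sequence[int], df_bit: bool) -> str:
    """Classify what happens to one IP datagram crossing the given hops."""
    bottleneck = min(path_mtus)  # the smallest MTU on the path decides
    if datagram_size <= bottleneck:
        return "delivered"
    if df_bit:
        # Too big for some hop and not allowed to be fragmented.
        return "dropped"
    return "fragmented"


if __name__ == "__main__":
    path = [1504, 1500]  # assumed: one end at 1504, the other at 1500
    print(ping_outcome(1504, path, df_bit=True))   # large ping, DF set
    print(ping_outcome(1504, path, df_bit=False))  # large ping, DF clear
    print(ping_outcome(100, path, df_bit=True))    # small ping
```

Under these assumptions, small pings never come near the MTU limit, so thousands of them passing says little about what happens to full-size packets.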

instant000 · August 2013

A crazy idea I was wondering about was this: if the updates are sent from each end with the don't-fragment bit set, then any update larger than the path MTU would never be received on the other end.

Of course, if there was a packet capture, it would put this question to rest. I've not been able to find a document that simply stated this, and won't be able to lab (to prove this actually occurs) until later. [Have other "work" to do. LOL.]

Let me know how it goes today. Quite curious now.

networker050184 · August 2013

Sounds like an MTU mismatch could be the cause. You have the MTU set to 1504 on the SVI? You shouldn't need it there.

Mrock4 · August 2013

networker050184 wrote: »

Sounds like an MTU mismatch could be the cause. You have the MTU set to 1504 on the SVI? You shouldn't need it there.

Ding ding! I think MTU is your issue here at first glance.

phoeneous · August 2013

Along with mtu mismatch, take a look at your interface statistics.

vanquish23 · August 2013

We had EIGRP issues with the 4500 switches that serve our VoIP/LAN in the office, and ended up restarting the switches to correct the issue.

networker050184 · August 2013

You sound like an MS admin, vanquish23.

RouteMyPacket · August 2013

lulz

CodeBlox · August 2013

Update on this... The carrier had an incident today, and now there is an issue in the carrier's Layer 2 service that has caused the neighborship to totally break. Even though they're on the same subnet, they can't ping each other now :P Once they resolve that, I can continue to investigate this. I'd be surprised if this issue doesn't exist after they resolve this outage. I say that because I called them last week about this and was told that there were no issues on their end.

EDIT: They restored their service so I'll continue to troubleshoot when back in the office since the issue still exists.

RouteMyPacket · August 2013

CodeBlox wrote: »

Update on this... The carrier had an incident today, and now there is an issue in the carrier's Layer 2 service that has caused the neighborship to totally break. Even though they're on the same subnet, they can't ping each other now :P Once they resolve that, I can continue to investigate this. I'd be surprised if this issue doesn't exist after they resolve this outage. I say that because I called them last week about this and was told that there were no issues on their end.

EDIT: They restored their service so I'll continue to troubleshoot when back in the office since the issue still exists.

Keep us updated, I would like to see what you find. Were you able to verify MTU?

jamesp1983 · August 2013

I'm anxious to hear about this as well...

CodeBlox · September 2013

This has been resolved... Two things here. First, somebody set the MTU to 1400 on one of the VLAN interfaces, which caused problems accessing certain websites: they either didn't load at all or took a really long time. I found and removed this from the config and that cleared up.

Second, I brought the interface down administratively to force traffic another way. Since it came back up, the issue is no longer happening. Very strange indeed, but Q Cnt is now 0 and RTO is 200 like normal. Testing with the MTU set back to 1400, I cannot replicate the EIGRP issue -_-

I assure you though, looking at the logs, it was happening for weeks.

instant000 · September 2013

Did I read that correctly?

Putting the "if I was there" hat on:

You really cannot afford to have customers suffering because Cisco Bob wanted to try that new "ip mtu" command that he learned from a YouTube video.

Basically, it is not good to have changes occur without approval.

It appears that your network has an issue with change control. You might want to confirm that you're logging configuration changes, so that you can catch the rogue admin in the act next time.

Also, go ahead and let them know that it is enabled. This might discourage unauthorized changes in the future.
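As a hedged illustration of catching changes from the logs: Cisco IOS emits a %SYS-5-CONFIG_I syslog message when someone leaves configuration mode. The sketch below pulls the user and source out of such lines. The sample line, username, and address are made up, and the regular expression assumes the common "Configured from console by USER on LINE (ADDRESS)" wording, which may differ between releases or log collectors.

```python
import re
from typing import Optional

# Assumed message wording; adjust the pattern to match what your devices send.
CONFIG_CHANGE = re.compile(
    r"%SYS-5-CONFIG_I: Configured from (?P<source>\S+) by (?P<user>\S+) "
    r"on (?P<line>\S+) \((?P<address>[^)]+)\)"
)


def parse_config_change(log_line: str) -> Optional[dict]:
    """Return who changed the config and from where, or None if no match."""
    match = CONFIG_CHANGE.search(log_line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the poster's network.
    sample = (
        "Sep  3 14:02:11: %SYS-5-CONFIG_I: "
        "Configured from console by jdoe on vty0 (192.0.2.10)"
    )
    print(parse_config_change(sample))
```

Feeding a syslog export through something like this gives a quick who-and-when list to line up against approved change requests.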

Edit: I use the term Cisco Bob to poke fun at Microsoft Bob. In the old days when I took Microsoft tests, there was this guy Bob who was always having issues. Didn't everyone call him "Microsoft Bob"?

Edit2: Wow, I tried to look up Microsoft Bob, and came upon this product that looked absolutely horrible. They tried to be user-friendly, but it looked like something for kids.
http://toastytech.com/guis/bob.html

CodeBlox · September 2013

Lol! I have an idea of who dun it, the logs show me that too in Orion. I couldn't agree more about the change management part. Folks have brought up that problem here actually. We very recently started a change management process. Problem with that is, we are a small shop and usually the person submitting the request is the ONLY SME on the subject of their change. Only thing other folks COULD do is go "Ok...?"

I learned something interesting about HTTPS traffic in all of this... It comes with the don't-fragment bit set in the IP header.

Mrock4 · September 2013

If you haven't already, read up on Path MTU Discovery (PMTUD), since it applies to your situation. It's a good read for future reference and it ties in with the don't-fragment bit you mentioned:

Resolve IP Fragmentation, MTU, MSS, and PMTUD Issues with GRE and IPSEC - Cisco Systems

Glad the problem is resolved.
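A rough sketch of the PMTUD situation described in this thread, assuming 20-byte IPv4 and 20-byte TCP headers with no options. The function names are hypothetical. It shows why hosts that derive their MSS from a 1500-byte MTU send packets that a 1400-byte hop cannot forward when the don't-fragment bit is set.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def tcp_mss(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given IP MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def segment_fits(mss: int, hop_mtu: int) -> bool:
    """True if a full-size segment with this MSS fits through the hop."""
    return mss + IPV4_HEADER + TCP_HEADER <= hop_mtu


if __name__ == "__main__":
    host_mss = tcp_mss(1500)  # what hosts on a 1500-byte LAN would advertise
    print("host MSS:", host_mss)
    print("fits a 1400-byte hop:", segment_fits(host_mss, 1400))
    print("MSS that would fit:", tcp_mss(1400))
```

With DF set, the 1400-byte hop has to drop the full-size segment and return an ICMP "fragmentation needed" message. If that message never reaches the sender, the connection can stall, which would be consistent with pages loading slowly or not at all.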
