Cisco 3750X (IPBase) running OSPF Point-to-Multipoint

Monkerz · November 2013

I ran into a problem after it was decided that "we" were going to add another site to a MetroE ring using a stack of 3750s.

Prior to this deployment, this particular MetroE consisted of seven Cisco 6509s: two with 6Gb connections, three with 2Gb, and two with 1Gb. In order to resolve a routing issue between this MetroE and one of our dark fiber rings, OSPF point-to-multipoint was utilized. The issue was resolved and I moved on to another project.

Fast forward to today, and I'm scratching my head. I have deployed a stack of six 3750Xs running IPBase. Upon connecting via an interface configured as OSPF point-to-multipoint, a FULL adjacency formed with only four of the seven other nodes within the same MetroE. Two of the nodes were stuck in EXCHANGE, whilst the last would bounce from EXSTART to DOWN.

I cleared the OSPF process and found the exact same issue. I rebooted the stack and got the same adjacency ratio, but a couple of the nodes that were showing FULL before were now stuck in EXCHANGE, and those that were stuck in EXCHANGE before were now FULL. It was getting late, so I decided to finish up the hardware install and head home for the night to revisit in the morning. I powered down the stack to keep from filling up our syslog server with errors.

The next morning I powered up the stack and to my amazement found all seven neighbors were showing FULL. Just to be safe I cleared the ospf process again and found 4 FULL / 2 EXCHANGE / 1 EXSTART. I decided to activate the 60 day trial of IPServices on the stack to see if that was the problem. Upon rebooting in IPServices I found 3 FULL / 3 EXCHANGE / 1 EXSTART. Needless to say I downgraded back to IPBase.

Things I've Noticed:
1. With OSPF enabled and adjacencies in FULL, EXCHANGE and EXSTART, if I ping a node currently in EXSTART or EXCHANGE I get a 77-87% success rate using 100 and 1500 byte packets, with the df-bit both set and cleared. But with the WAN interface passive, pings succeed just fine.

2. The host routes (this being a point-to-multipoint network type) for nodes in EXCHANGE or EXSTART bounce off a FULL neighbor. Say I am testing from 10.0.0.139, and this node has a FULL adjacency with 10.0.0.130 and an EXCHANGE adjacency with 10.0.0.135. An OSPF route for 10.0.0.135/32 with a next hop of 10.0.0.130 will be sitting in 10.0.0.139's routing table.

3. Sniffing OSPF traffic entering and leaving the WAN interface of this 3750 stack shows ICMP TTL exceeded messages from the neighbors that are not yet FULL (curious if the host route is causing this).

I have tried quite a few different things, most not mentioned here because I am sitting on my couch sipping on a beer and I frankly cannot remember everything I have done. I have configured this site to run via static routes (ugh) for now while I figure out what is happening.

Anyone have any ideas?

phoeneous · November 2013

How are the areas configured?
What type of devices are the other nodes?
Is your wan plugged into the master of the stack?
What debugs have you run?

CodeBlox · November 2013

Sounds like a fun problem to troubleshoot

I am going to think about it, but... have you checked MTU on all links? It should match. If I'm not mistaken, a mismatch won't prevent neighbors from reaching Init/2-Way but will cause problems at the EXSTART step. Your problem doesn't seem consistent, though. You say you came in one day and everything had a full adjacency, huh? What does debug ip ospf adj say about the neighbors that never go FULL?

Monkerz · November 2013

This MetroE resides completely in area 0.

Seven 6509s and one stack of 3750x.

The WAN connection is plugged into Gi1/0/1 on sw1, which is the master.

I don't have the output, as it is logged to my laptop, which is at work. But I've run every OSPF debug I could find. I will post what I've logged in the morning.

Monkerz · November 2013

phoeneous wrote: »

How are the areas configured?
What type of devices are the other nodes?
Is your wan plugged into the master of the stack?
What debugs have you run?

MTU is consistently 1500; that was my first check. In my mind, pinging at 1500 bytes with the df-bit set would have revealed a problem. Am I wrong for thinking that?

I did notice when I was reviewing the "ip ospf adj" debug that there may be a master/slave argument going on between the 3750 and the nodes in EXCHANGE. I would see the output "First DBD and we are not the slave" or something to that effect. Also tons of retransmitted DBDs.

AwesomeGarrett · November 2013

You could always configure SPAN and get a packet capture to see what's going on. I'm gonna go out on a limb here: with the information provided, I would guess either a service provider issue or an IOS bug. That is assuming everything is configured correctly.

phoeneous · November 2013

For testing purposes, can you unstack the stack? I've heard of people having sporadic issues with 3750X stacks, so it's worth a shot. Also, what IOS?

Monkerz · November 2013

I am currently running 12.2(55)SE3, but have also tried 12.2(5...)SE2.

I can attempt to unstack tomorrow pending my availability.

Monkerz · November 2013

Proxy-arp, need I explain more? I feel so dumb right now...

I started troubleshooting from the top down, but didn't quite get low enough to resolve the issue. I opened a TAC case and was watching as the engineer typed at the speed of light within my console. Then suddenly, as I saw the output of 'show ip arp', it just clicked: "those IPs shouldn't have the same MAC." I expressed my concern to the engineer, who confirmed my suspicion. I entered static ARP entries, cleared the ARP cache and cleared the OSPF process again. I entered 'show ip ospf nei' and felt this wave of relief as I saw 7 FULL adjacencies.

Going to schedule a change management and disable proxy-arp on all nodes within this ring tonight.

Thank you for everyone's suggestions. Really love this brotherhood/sisterhood of tech nuts.

Thanks again,

Monkz
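
For anyone who wants to script the check that cracked this (several IPs in 'show ip arp' resolving to one MAC), here is a minimal Python sketch. It assumes the usual IOS column layout for 'show ip arp'; the sample table and its MAC addresses are made up for illustration and are not output from this network. Several IPs behind one MAC can be legitimate (secondary addresses, for example), so treat a hit as a prompt to look closer, not as proof of proxy-arp.

```python
import re
from collections import defaultdict

# Matches data rows such as:
# Internet  10.0.0.130   12   0011.2233.4401  ARPA   GigabitEthernet1/0/1
ARP_LINE = re.compile(
    r"^Internet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+\S+\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s"
)


def shared_macs(arp_output: str) -> dict[str, list[str]]:
    """Return {mac: [ips]} for every MAC that answers for more than one IP."""
    by_mac = defaultdict(list)
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line.strip())
        if match:  # header and incomplete entries simply do not match
            by_mac[match.group("mac").lower()].append(match.group("ip"))
    return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}


# Hypothetical sample, shaped like 'show ip arp' output.
SAMPLE = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.0.0.130              12   0011.2233.4401  ARPA   GigabitEthernet1/0/1
Internet  10.0.0.135               3   0011.2233.4401  ARPA   GigabitEthernet1/0/1
Internet  10.0.0.139               -   0011.2233.4409  ARPA   GigabitEthernet1/0/1
"""

if __name__ == "__main__":
    for mac, ips in shared_macs(SAMPLE).items():
        print(f"{mac} answers for: {', '.join(ips)}")
```

Paste real 'show ip arp' output in place of SAMPLE to run it against a live table.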

networker050184 · November 2013

Curious why you are using a point-to-multipoint setup here?

Monkerz · November 2013

Because of the different connection speeds into the provider ring. Without it, a site with a 6G connection into the ring would see every neighbor at the cost equivalent of 6G (on this network that would be a cost of 6), plus whatever the cost is from the neighbor to the actual network.

This is a problem, as some connections are 6G but others are 2G and 1G. Using point-to-multipoint, I can use neighbor statements to specify the actual cost to each neighbor (the neighbor's cost into the provider ring). So a neighbor with a 2G link into the provider ring would be seen from a node with a 6G connection into the ring at a cost of 20, the equivalent of 2G (given a reference bandwidth of 40G).

We had load balancing issues with three connections into this provider ring. Two of them were 6G and one was 1G. All three were sitting at the same distance from one of our data centers, which is connected via one of our dark fiber rings. Other nodes on the provider ring were seeing the three paths to this data center as equal cost, even though the links were of unequal speed.
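
The numbers above line up with the usual OSPF cost formula (reference bandwidth divided by link bandwidth, truncated to an integer, minimum 1). A minimal Python sketch, assuming the 40G reference bandwidth mentioned in this post; the function name is just for illustration, and the 1G figure is derived from the formula, not quoted from the thread:

```python
def ospf_cost(link_mbps: int, reference_mbps: int = 40_000) -> int:
    """OSPF interface cost: reference bandwidth / link bandwidth.

    The result is truncated to an integer and never drops below 1.
    The 40,000 Mbps default reflects the 40G reference bandwidth
    described in the thread.
    """
    if link_mbps <= 0:
        raise ValueError("link bandwidth must be positive")
    return max(1, reference_mbps // link_mbps)


if __name__ == "__main__":
    for gbps in (6, 2, 1):
        print(f"{gbps}G link -> cost {ospf_cost(gbps * 1000)}")
```

With a single interface cost, a 6G site would reach every neighbor at cost 6. The per-neighbor costs on the point-to-multipoint interface are what let the same site see a 2G neighbor at 20 and, by the same formula, a 1G neighbor at 40.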

networker050184 · November 2013

Cool. I've used something similar in the past for a setup like this. Thanks for the explanation!

phoeneous · November 2013

Monkerz wrote: »

Proxy-arp, need I explain more?

Proxy arp strikes again!

Glad you got it figured out.
