Going insane over this ASA/Nexus issue
Hello,
I posted here about this before, but the problem persists. Let's get a quick lay of the land.
Two Nexus 5500s at the core, serving as the gateway for all VLANs. From there, a port channel runs to a 3650 stack, which in turn has a port channel to a DMVPN hub, a port channel to an ASA 5545 (with SourceFire), and a port channel to a voice router.
The Nexus core has a BGP neighbor relationship with the edge router, from which it learns a default route pointing to the edge router's IP (with a static route toward the firewall for that edge router IP). Our ISP connection is FINE; BGP does not drop, nor do my IP SLA monitors see any issues. The SLA monitor on the ASA to the internet also remains up - always. However, we experience problems where people cannot get to the internet. You cannot even ping out, so it is not a DNS issue. My access switch, which goes through the core to reach the internet, also shows failures on an IP SLA monitor pinging 8.8.8.8. A traceroute from my workstation shows it reaching the core, then it dies (the ASA does report itself as a hop, by the way).
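For reference, the probe on the access switch is nothing fancy; it's roughly this (IOS syntax, with the probe number, source SVI, and frequency shown as placeholders rather than my exact config):

    ! probe from the access switch toward 8.8.8.8
    ip sla 10
     icmp-echo 8.8.8.8 source-interface Vlan10
     frequency 10
    ip sla schedule 10 life forever start-time now
    !
    ! success/failure counts show the pings dying during the event
    show ip sla statistics 10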
With me so far? Great. No, I do not believe this is a physical cabling issue. Our voice traffic traverses the same port channel as data to the 3650 stack, and there are no phone issues (I personally made calls to prove this during the "outage"). People can also VPN in from the OUTSIDE successfully, but they cannot reach internal resources (even though people on the inside can reach those same resources).
With me still? It's as if the inside interface of the firewall is not receiving or processing traffic. I have captures set up on the ASA, and they remain at 0 bytes. The captures are set to match ICMP and TCP traffic to a specific site. When the issue occurs, I try to ping and to open a telnet session on port 80 from my workstation, and both fail while the captures stay at 0 bytes. The captures themselves do work, because they fill up whenever the issue is not present.
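For context, the captures are along these lines (the capture names and the 203.0.113.10 address are placeholders standing in for the real site):

    ! one capture for ICMP, one for TCP/80, both on the inside interface
    capture CAPIN_ICMP interface inside match icmp any host 203.0.113.10
    capture CAPIN_WEB interface inside match tcp any host 203.0.113.10 eq 80
    !
    ! during the event both still show 0 bytes captured
    show capture
    show capture CAPIN_ICMP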
So what the heck? My BGP relationships between the core and the edge router STAY UP. Default Cisco timers of 60 and 180, and the outage most definitely lasted 10 minutes (you can easily see this in the utilization graphs, by the way). So I was thinking ARP - maybe the ARP entry on the Nexus for the firewall expires, and for some reason the firewall is not responding to the ARP request, causing grief. I have not been able to verify this theory, but I mostly consider it debunked already, because my BGP relationship stays up - and those keepalives go through the firewall. I now have an SLA monitor on the ASA constantly pinging the core, and an SLA monitor on the access switch constantly pinging the inside interface of the firewall.
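The new monitor on the ASA is roughly this (ASA syntax; the probe ID and the core SVI address are placeholders):

    ! ASA pings the core's SVI out the inside interface
    sla monitor 20
     type echo protocol ipIcmpEcho 10.0.0.1 interface inside
     frequency 10
    sla monitor schedule 20 life forever start-time now
    !
    show sla monitor operational-state 20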
This is not a bandwidth issue, and seemingly not a physical cabling issue. It is sporadic and short-lived. I have cleared interface counters everywhere and checked them again: no drops, no errors. No HSRP changes between the core switches that would cause weird issues. I'm at a loss here; I really have no idea what to even look for now. I wondered if something is using the same IP as the firewall, causing ARP issues, but I don't think so, as that VLAN doesn't even exist at the access layer. Plus, how could it be an ARP issue if my BGP neighbor does not go down?
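To try to rule that out anyway, my plan during the next event is to compare the MAC the core has in ARP for the firewall against the ASA's actual inside MAC, roughly like this (the IP and MAC shown are placeholders):

    ! on the Nexus core (NX-OS): which MAC is currently tied to the ASA's inside IP, and where is it learned?
    show ip arp 10.0.0.2
    show mac address-table address aaaa.bbbb.cccc
    !
    ! on the ASA: the real inside MAC, for comparison
    show interface inside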
The only thing I can think of doing is setting up a packet capture on the 3650 stack, to make sure the core is passing traffic. I don't know why my captures say 0 bytes; does that mean nothing is arriving on the ASA's inside interface at all? Is this SourceFire module doing something? I see no events from my IP address, but there is a ton of traffic to New Relic being blacklisted. Not sure if that would trigger some sort of weird issue or what. Thoughts?
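For the 3650 capture I was thinking a plain SPAN session mirroring the ASA-facing port channel to a laptop running Wireshark, something like this (session number and interface names are placeholders):

    ! mirror the ASA-facing port channel to a port with a capture laptop on it
    monitor session 1 source interface Port-channel2 both
    monitor session 1 destination interface GigabitEthernet1/0/48
    !
    show monitor session 1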
Edit: The phones work, like I said, so obviously the core routes traffic fine. From the charts, it's about a 4-5 minute decline to 0 bandwidth utilization on the ASA; it's not abrupt. We do have logging enabled on the ASA, and I did not see anything interesting, but I will be scanning the logs again.
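When I go back through the logs I'll filter around the outage window, roughly like this (the workstation IP is a placeholder):

    ! look for anything involving the test workstation, then anything at warning severity or worse
    show logging | include 192.0.2.50
    show logging | include %ASA-[1-4]-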