Going insane over this ASA/Nexus issue
Hello,
I posted here about this before, but the problem persists. Let's get a quick lay of the land.
Two Nexus 5500s at the core, serving as the gateway for all VLANs. From there, a port channel runs to a 3650 stack, which in turn has a port channel to a DMVPN hub, a port channel to an ASA 5545 (with SourceFire), and a port channel to a voice router.
The Nexus core has a BGP neighbor relationship with the edge router, from which it learns a default route pointing to the edge router's IP (with a static route toward the firewall for that edge router IP). Our ISP connection is FINE; BGP does not drop, nor do my IP SLA monitors see any issues. The SLA monitor on the ASA to the internet also remains up - always. However, we experience problems where people cannot get to the internet. You cannot even ping out, so it is not a DNS issue. My access switch, which goes through the core to reach the internet, also shows failures on an IP SLA monitor pinging 8.8.8.8. A traceroute from my workstation shows it reaching the core, then it dies (the ASA does report itself as a hop, by the way).
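For reference, the probe on the access switch is nothing fancy; it's roughly this (IOS syntax, with the probe number, source SVI, and frequency shown as placeholders rather than my exact config):

    ! probe from the access switch toward 8.8.8.8
    ip sla 10
     icmp-echo 8.8.8.8 source-interface Vlan10
     frequency 10
    ip sla schedule 10 life forever start-time now
    !
    ! success/failure counts show the pings dying during the event
    show ip sla statistics 10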
With me so far? Great. No, I do not believe this is a physical cabling issue. Our voice traffic traverses the same port channel as data to the 3650 stack, and there are no phone issues (I personally made calls to prove this during the "outage"). People can also VPN in from the OUTSIDE successfully, but they cannot reach internal resources (even though people on the inside can reach those same resources).
With me still? It's as if the inside interface of the firewall is not receiving or processing traffic. I have captures set up on the ASA, and they remain at 0 bytes. The captures are set to match ICMP and TCP traffic to a specific site. When the issue occurs, I try to ping and to open a telnet session on port 80 from my workstation, and both fail while the captures stay at 0 bytes. The captures themselves do work, because they fill up whenever the issue is not present.
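For context, the captures are along these lines (the capture names and the 203.0.113.10 address are placeholders standing in for the real site):

    ! one capture for ICMP, one for TCP/80, both on the inside interface
    capture CAPIN_ICMP interface inside match icmp any host 203.0.113.10
    capture CAPIN_WEB interface inside match tcp any host 203.0.113.10 eq 80
    !
    ! during the event both still show 0 bytes captured
    show capture
    show capture CAPIN_ICMP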
So what the heck? My BGP relationships between the core and the edge router STAY UP. Default Cisco timers of 60 and 180, and the outage most definitely lasted 10 minutes (you can easily see this in the utilization graphs, by the way). So I was thinking ARP - maybe the ARP entry on the Nexus for the firewall expires, and for some reason the firewall is not responding to the ARP request, causing grief. I have not been able to verify this theory, but I mostly consider it debunked already, because my BGP relationship stays up - and those keepalives go through the firewall. I now have an SLA monitor on the ASA constantly pinging the core, and an SLA monitor on the access switch constantly pinging the inside interface of the firewall.
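The new monitor on the ASA is roughly this (ASA syntax; the probe ID and the core SVI address are placeholders):

    ! ASA pings the core's SVI out the inside interface
    sla monitor 20
     type echo protocol ipIcmpEcho 10.0.0.1 interface inside
     frequency 10
    sla monitor schedule 20 life forever start-time now
    !
    show sla monitor operational-state 20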
This is not a bandwidth issue, and seemingly not a physical cabling issue. It is sporadic and short-lived. I have cleared interface counters everywhere and checked them again: no drops, no errors. No HSRP changes between the core switches that would cause weird issues. I'm at a loss here; I really have no idea what to even look for now. I wondered if something is using the same IP as the firewall, causing ARP issues, but I don't think so, as that VLAN doesn't even exist at the access layer. Plus, how could it be an ARP issue if my BGP neighbor does not go down?
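To try to rule that out anyway, my plan during the next event is to compare the MAC the core has in ARP for the firewall against the ASA's actual inside MAC, roughly like this (the IP and MAC shown are placeholders):

    ! on the Nexus core (NX-OS): which MAC is currently tied to the ASA's inside IP, and where is it learned?
    show ip arp 10.0.0.2
    show mac address-table address aaaa.bbbb.cccc
    !
    ! on the ASA: the real inside MAC, for comparison
    show interface inside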
The only thing I can think of doing is setting up a packet capture on the 3650 stack, to make sure the core is passing traffic. I don't know why my captures say 0 bytes; does that mean nothing is arriving on the ASA's inside interface at all? Is this SourceFire module doing something? I see no events from my IP address, but there is a ton of traffic to New Relic being blacklisted. Not sure if that would trigger some sort of weird issue or what. Thoughts?
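For the 3650 capture I was thinking a plain SPAN session mirroring the ASA-facing port channel to a laptop running Wireshark, something like this (session number and interface names are placeholders):

    ! mirror the ASA-facing port channel to a port with a capture laptop on it
    monitor session 1 source interface Port-channel2 both
    monitor session 1 destination interface GigabitEthernet1/0/48
    !
    show monitor session 1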
Edit: The phones work, like I said, so obviously the core routes traffic fine. From the charts, it's about a 4-5 minute decline to 0 bandwidth utilization on the ASA; it's not abrupt. We do have logging enabled on the ASA, and I did not see anything interesting, but I will be scanning the logs again.
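When I go back through the logs I'll filter around the outage window, roughly like this (the workstation IP is a placeholder):

    ! look for anything involving the test workstation, then anything at warning severity or worse
    show logging | include 192.0.2.50
    show logging | include %ASA-[1-4]-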