Weird ASA issues?

hurricane1091 · May 2017

Hello folks,

I started a new job last week and am already dealing with frustrating issues. Every day the internet seems to stop working at HQ. First, let me explain the simple setup. Two Nexus switches at the core appear to serve as the default gateway for all VLANs. These connect to an aggregation switch, and off of this switch are a few things, most importantly an ASA 5545 and the internet border router (IBR).

Where I am at is this - the IBR does NOT drop the BGP relationship with our ISP. No packet drops or i/o errors. In fact, I can ping out to 8.8.8.8 from it fine (I connect to the IBR through an OOB switch, so not going through the firewall). The core has a BGP relationship with the IBR through the ASA, but this relationship stays up as well, and the core's default route to the internet ends up pointing to the ASA (the ASA has a static default route pointing to the IBR). So from the core, I cannot ping 8.8.8.8 during these outages (but again, I can ping 8.8.8.8 from the IBR). No drops or i/o errors on the switch interfaces connecting to anything. Very strange, right?

No historical logging is enabled on the ASA, but I think this needs to change. During this outage I noticed on the live stats in ASDM that the connections spike to 1000 (with an arrow pointing upward, so who knows how high it actually spikes, but this model handles 30000 connections IIRC). There are drops on the outside (normal) and inside (normal maybe?) interfaces on the ASA, but the only rule from inside to outside is a permit IP any (no web proxies making requests on the clients' behalf).

I see nothing interesting in the logs, and I'm really at a loss here. Someone else took a look at the log server and nothing appears to be interesting there either. Any ideas?

Looking at my monitoring, I can directly see that the amount of traffic dipped to next to nothing twice this morning, at the exact times we were having issues. Something has to be wrong here.

mayhem87 · May 2017

Have you tried pinging out from the ASA during the drop time? Can you confirm the ARP entries on the CORE, ASA, and IBR for the next hops, compared to when it's working? How long are the drops? I'm assuming they are short enough (under 3 minutes) that the BGP hold timer isn't hit. Have you tried taking packet captures on the ASA to confirm the traffic? I.e., do you see your pings hit the inside of the ASA, versus actually exiting the interface toward the IBR and never returning?

hurricane1091 · May 2017

The drops seem to be less than 5 minutes. I am somewhat ruling out the BGP scenario with the ISP, since it seems the ISP is using non-Cisco gear and the hold time is 90 seconds, and the outage seems to last longer than that while BGP remains up. I've also been on the core and the IBR at the same time, and found the IBR could ping while the core could not.

The drops do not last long, and it has been hard to get data during the window when they occur. Looking at my monitoring, I see the same drop of in/out traffic on our IBR as I do on the ASA. I lean towards the ASA being the culprit because traffic should still be coming into the inside interface even if the IBR is having connectivity issues, but it seems like that is not happening. I could buy the fact that traffic dips on the IBR if the FW is killing off all active connections. Since then I have looked at the ASA logging for the specific times when the outage was occurring, and I am seeing builds/teardowns like normal, which makes no sense. Plenty of connection builds, but monitoring says no traffic coming in/out?

Thanks for the reply, by the way. Firewalls are not my strong suit. Neither is wireless. Of course these are both things seemingly having issues immediately, and nothing that's broken is actually in my wheelhouse.

Edit - Just thinking about things, if the FW was completely blocking traffic, the BGP relationship between the core and IBR should go down. That hold timer should be 3 minutes though, and the outages are right around that long, so it's possible the ASA still causes issues but that BGP relationship stays up. I could potentially lower the timers to something like 20 60 (keepalive 20, hold 60), which could make the problem more obvious and point some fingers at the ASA further? I assume this would take down BGP briefly, so I should probably hold off on that.
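
To put rough numbers on that reasoning, here is a small Python sketch. It assumes the outage blocks every BGP message in both directions, that the last message arrived at most one keepalive interval before the outage began, and it ignores TCP retransmission delay once traffic resumes. The function name `bgp_survival` is made up for illustration, and the defaults are the usual Cisco 60/180 timers.

```python
def bgp_survival(outage_s, keepalive_s=60, hold_s=180):
    """Classify whether a BGP session should ride out a total blackout.

    Simplified model (assumption): the hold timer expires once hold_s seconds
    pass with no message received, and the last message before the outage
    arrived somewhere between 0 and keepalive_s seconds before it started.
    """
    if outage_s >= hold_s:
        # Even a message received right before the outage is too old by now.
        return "drops"
    if outage_s < hold_s - keepalive_s:
        # Even the worst-case gap before the outage cannot reach hold_s.
        return "survives"
    return "depends on timing"


# Default timers (60/180) versus the proposed 20/60 for a ~2.5 minute outage.
print(bgp_survival(150))          # depends on timing
print(bgp_survival(150, 20, 60))  # drops
```

Under these assumptions, an outage of two to three minutes can slip past a 180 second hold timer, while 20/60 timers would drop the session for any total blackout of a minute or more.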

mayhem87 · May 2017

The BGP timing in your edit is what I was looking at. I believe your issue is between the CORE - ASA - IBR, and since your BGP relationship isn't dropping, that somewhat suggests the outage is less than 3 minutes. Honestly my first go-to in a situation like this would be packet captures on the firewall and checking ARP entries. The packet capture would be definitive in telling you where the problem is. You could even do two at the same time.

I would do a packet capture on your inside interface looking for your IP, or some test machine IP that is running a continuous ping to a destination that wouldn't likely be used by your company. Then I would take a second packet capture on the outside, and once the issue happens you can stop them and look at both. There might be some other things to look at, but when it comes to weird intermittent issues my de facto standard is a packet capture.

Personally I use the CLI for ASA captures unless for some reason I need to get them off the box. An easy way would just be:
"cap capin interface <inside> match ip host <your ip or test box> host <destination>"
"cap capout interface <outside> match ip any host <destination>"
The second one is different because I don't know where you are NATing, and matching on the destination alone is probably easier to look at. You could limit these to just ICMP, but this is just a quick capture.

To show captures:
show cap capin
show cap capout

To turn off:
no cap capin
no cap capout

Also I don't think that the ASA is rejecting all connections. This could be some PAT exhaustion as well or something else. I still question the ARP as well.
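
For the continuous ping on the test machine, a rough Python sketch like the one below could log when replies stop and start, so the outage window can be lined up against the two captures. The flags passed to `ping` are the Linux ones and may need adjusting on other systems; the function names and the 192.0.2.1 destination are placeholders, not anything from this network.

```python
import subprocess
import time


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (first_fail, last_fail) windows."""
    windows = []
    start = None
    last_fail = None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts  # first failed ping of a new window
            last_fail = ts
        elif start is not None:
            windows.append((start, last_fail))  # a reply closes the window
            start = None
    if start is not None:
        windows.append((start, last_fail))  # still failing at the end
    return windows


def ping_once(host):
    """Send one ping; True if a reply came back (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(host, count, interval_s=1.0):
    """Ping host count times and return the outage windows that were seen."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return outage_windows(samples)


# Example (placeholder destination): about an hour of one-second pings.
# for start, end in monitor("192.0.2.1", 3600):
#     print(time.ctime(start), "->", time.ctime(end))
```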

hurricane1091 · May 2017

I am not too familiar with doing packet captures on the ASA, but I have done packet traces, which seemingly use a similar format. The only thing is, if I do this capture, I probably won't notice anything until an issue occurs. Do I really want to run a packet capture for potentially hours on end?

My monitoring does show the traffic dips down to nearly zero for both inbound and outbound traffic. We're not running a ton of connections, so there shouldn't be a PAT exhaustion (I'm not even sure what that limit would be, but it's probably quite large).

In terms of the ARP issue, what exactly am I looking for? I've done a "show arp" and I see some IP and MAC address mappings, with a time next to each which is seemingly a seconds counter. The weird thing is that the ARP entry for the default gateway IP keeps resetting after the count (seconds?) gets to 1 or 2. No other entry is doing this.

Also seeing this: Dropped blocks in ARP: 653251

mayhem87 · May 2017

By limiting the packet capture you should be fine. I wouldn't do a capture of any any, as that may overrun your box, but I run limited captures in production often.

In terms of ARP, you are looking for something taking over. I.e., maybe you have proxy ARP "on" for a device and it is answering with its own MAC address for your ASA IP. Basically, just make sure the MAC addresses you are sending traffic to actually belong to the devices you mean to send traffic to. I've seen code upgrades enable features that were off before and lead to new issues. Also, not sure if you have HA set up on the firewalls, but you might want to make sure they aren't constantly failing over.
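
If you save the "show arp" output once while things work and once during an outage, a quick Python sketch like this could flag any IP whose MAC changed. It assumes each line looks like "<interface> <ip> <mac> <age>", so check that against what your box actually prints; the function names and the sample addresses are made up for illustration.

```python
def parse_arp(text):
    """Parse lines shaped like '<interface> <ip> <mac> <age>' into {ip: mac}.

    The column layout is an assumption; adjust the field indexes if the
    device prints something different.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            table[fields[1]] = fields[2].lower()
    return table


def changed_macs(good, bad):
    """Return {ip: (mac_when_working, mac_during_outage)} for differing entries."""
    return {
        ip: (good[ip], bad[ip])
        for ip in good
        if ip in bad and good[ip] != bad[ip]
    }


# Made-up sample data using documentation addresses.
working = parse_arp("inside 192.0.2.1 0011.2233.4455 12\ninside 192.0.2.2 0011.2233.4466 30")
outage = parse_arp("inside 192.0.2.1 00aa.bbcc.ddee 1\ninside 192.0.2.2 0011.2233.4466 45")
print(changed_macs(working, outage))
```

Any IP that shows up in the result, especially a next hop, is worth chasing down on the switches to see which port that second MAC lives on.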

hurricane1091 · May 2017

No HA set up, and the code has been on here for a while (uptime is 2 months, so at least that long).

I don't understand what this dropped blocks counter is all about, so that could be something, I guess. Not sure why the gateway router has a time of 0 for the ARP entry to the ISP, unless they are sending gratuitous ARPs constantly.
