Packet drops
Hello guys,
I have a network set up and am facing a few problems with it. I would appreciate any suggestions for improving it.
Devices
Core Switch = 4506-E
Access Switches = 2960
Connectivity
All access switches connect to the core over a single 1G fiber uplink each.
Management
VLAN 100
Core Switch = 192.168.100.1/24
Access Switches = 192.168.100.5/24 onwards
Data (internet)
VLANs 10 to 30 are used
each VLAN serves 5 access switches
VLAN 10 = 192.168.10.0/24 through VLAN 30 = 192.168.30.0/24
DHCP Pools configured on Core Switch
Default routers for the different VLANs are configured on the core switch (192.168.10.1, 192.168.11.1, ..., 192.168.30.1)
The default gateway configured on all access switches is 192.168.100.1 (the management IP address of the core). Should I keep this as the default gateway, or should I change it according to the VLAN (for example, if an access switch is in VLAN 10, ip default-gateway 192.168.10.1)?
DHCP snooping, spanning-tree PortFast, and storm control are configured.
Problem
1) When I connect a computer to any access switch port, I see a few packet drops after every 20 to 25 responses.
2) From the core side, when I try to telnet to an access switch, some switches take 3 to 4 tries to respond. If I ping them they will not respond, and if I use the show mac address-table command I cannot see the MAC of the switch on the management VLAN. After 3 or 4 tries I get a ping response, am able to telnet, and can see the MAC on that core switch port. What can be the reason for this issue?
Any suggestions for improving the network? Anything else I can implement on it?
Thank you
Comments
-
Dieg0M: You posted the same thread in the CCIE forums. Please provide configs for both devices if you want a definitive answer, and refrain from double posting.
-
razam: OK, will take care of that next time.
Here is a sample of the access switch configuration:
configure terminal
hostname <...>
!
vlan 100
 name Management-VLAN
vlan 10
 name Data-Vlan
!
interface vlan 100
 description ##Management-Interface##
 ip address 192.168.100.x 255.255.255.0
 no shutdown
 exit
!
ip dhcp snooping
ip dhcp snooping vlan 10
!
interface range fastEthernet 0/1 - 24/48
 description ##TO-END-USERS##
 switchport mode access
 switchport access vlan 10
 speed auto
 duplex auto
 spanning-tree portfast
 spanning-tree bpdufilter enable
 no shutdown
 no ip dhcp snooping trust
 ip dhcp snooping limit rate 70
 storm-control broadcast level 30.00 10.00
 storm-control action shutdown
 exit
!
interface gigabitEthernet 0/1
 description ##Uplink-to-Core-Switch##
 switchport mode trunk
 switchport trunk allowed vlan 10-30,100
 no shutdown
 ip dhcp snooping trust
 exit
!
service password-encryption
ip default-gateway 192.168.100.1
errdisable recovery cause all
errdisable recovery interval 30
vtp mode transparent
no ip domain lookup
ntp server 192.168.100.1
clock timezone gmt 3
!
line console 0
 login local
 exit
line vty 0 4
 transport input telnet
 login local
end
write
razam: On the core side, the configuration is:
1) Interface configuration for the ports connected to the access switches:
interface range gigabitEthernet 2/0/1 - 48
 description <.....>
 switchport mode trunk
 ip dhcp snooping trust
2) DHCP pools are created:
ip dhcp pool DATA_vlan10
 network 192.168.10.0 255.255.255.0
 default-router 192.168.10.1
 dns-server <......>
 domain-name <.....>
3) Interface VLAN (SVI) configuration:
interface vlan 10
 ip address 192.168.10.1 255.255.255.0
4) DHCP snooping is enabled:
ip dhcp snooping
ip dhcp snooping vlan 10-30
razam: I have checked the interface statistics; there are no errors on the interfaces, not even a single error on any interface.
"show interfaces gigabitEthernet 0/x counters errors"
I have run this command on both the access side and the core side; there are no errors.
I also checked one more command on the access switch side; here are its results:
bldg3#show interfaces gigabitEthernet 0/1 transceiver
ITU Channel not available (Wavelength not available),
Transceiver is internally calibrated.
If device is externally calibrated, only calibrated values are printed.
++ : high alarm, + : high warning, - : low warning, -- : low alarm.
NA or N/A: not applicable, Tx: transmit, Rx: receive.
mA: milliamperes, dBm: decibels (milliwatts).
                                     Optical    Optical
           Temperature  Voltage      Tx Power   Rx Power
Port       (Celsius)    (Volts)      (dBm)      (dBm)
---------  -----------  -----------  ---------  ---------
Gi0/1        33.7         3.30         -5.1       -9.8
Dieg0M: Is this a production network? Is it possible you are encountering broadcast storms and the error-recovery mechanism is re-enabling those ports over and over again? That would match the packet-loss cycle you are describing. Enable logging at debug level and give us a show log for a 10-minute period (a minimal sketch of that follows this post).
Thank you.
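A minimal sketch of what is being asked for here (the buffer size is an arbitrary example):

! on the access switch: keep debug-level messages in the local buffer
logging buffered 64000 debugging
! timestamp them so events can be correlated
service timestamps log datetime msec
! after roughly 10 minutes of normal traffic, collect:
show logging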
razam: This network is set up for a residential area.
There are around 130 access switches. The 4506-E core switch has 3 slots with 48 Gig SFP ports per slot, and all 130 access switches are connected to this single core.
Some users connect their own access points to the ports and use 3 to 4 devices in their rooms.
I think I need to modify the errdisable recovery configuration; currently, if an interface gets into an err-disabled state due to any cause, the recovery time is set to 30 seconds.
I will get the debug output soon and share it with you.
Thank you.
razam: I would like to confirm one thing here: if there is a broadcast storm, will it shut down the port, or will it put it into an err-disabled state?
I have used the command storm-control action shutdown.
EdTheLad: So you have 3 x 48 gig links from your core switch to access switches, these are all trunks, and no allowed list is configured on the core switch. This means you have 22 VLANs x 130 ports = 2860 spanning-tree instances on the core switch, and that is a minimum; if you have any additional VLANs configured in the VLAN database on the core switch, multiply them by 130 and add to 2860.
The first thing to do is add a VLAN allowed list to all trunk ports on the core switch (a rough sketch follows this post). It's possible you have exceeded the BPDU limit on the switch; if some random BPDUs don't get generated by the core, that would manifest as the problem you're seeing, with ports going blocking and forwarding, etc.
Check CPU utilization and see what STP stats are available on the switches. Check the logs of the access switches to see if ports are transitioning spanning-tree states.
If this is the issue, do you really need spanning-tree on the access switches? It sounds like you have a hub-and-spoke environment, so without redundancy I don't see the need.
Might be something completely different, but that's what I'd look at given the info you provided.
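A rough illustration of the two suggestions above (port and VLAN numbers are placeholders, not taken from the thread): prune each core trunk to only the VLANs its access switch needs, then check CPU and spanning-tree activity.

! core side: this uplink serves an access switch whose users sit in VLAN 10
interface gigabitEthernet 2/0/1
 switchport trunk allowed vlan 10,100
!
! quick checks for CPU load and spanning-tree churn
show processes cpu history
show spanning-tree summary totals
show spanning-tree detail | include ieee|occurr|from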
deth1k: EdTheLad, where did you get 22 VLANs from, and how is that 2860 STP instances? He has 3 VLANs, which would be 3 STP instances regardless of the number of access switches being used; it's PVST (the clue is in the name).
Razam, configure down-when-looped on your access switches. Also, how many end hosts do you have? Did you check your TCAM utilisation for the number of MACs? I also suggest you run a SPAN session and sniff one of the core uplinks (see the sketch after this post); you might have an ARP storm. Check CPU usage.
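A minimal sketch of the SPAN session mentioned above, run on the core (the source uplink and the destination port with a sniffer attached are assumptions):

monitor session 1 source interface gigabitEthernet 2/0/1 both
monitor session 1 destination interface gigabitEthernet 2/0/48
! verify the session
show monitor session 1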
EdTheLad: VLANs 10-30 = 21 VLANs
VLAN 100 = 1 VLAN
Total = 22 VLANs
Each core port carries a minimum of 22 VLANs; since no VLAN allowed list is configured, it could be a lot more. I meant BPDU instances per chassis: each core port will send a BPDU per VLAN, which means the switch has to support sending and receiving 130 x 22 BPDUs. If the switch can't handle this, BPDUs won't be sent and remote blocked ports will go forwarding, causing a temporary loop, etc.
deth1k: Meh, can't read; I could swear it was 10, 30. Either way you will have an instance per VLAN in the database, so 22 instances in total, well, 23 including VLAN 1. I'd be more worried about TCAM usage, as the 2960 has an 8K limit. Also remove the BPDU filter and enable root guard (a sketch follows this post); those cheap access points also run STP, so all sorts of problems could come from them. Storm control would drop broadcast storms but takes a hit on the CPU, so that is something to consider too.
"sh platform tcam utilization" would be nice to check.
deth1k: By the way, Razam, why not move away from the flat L2 model and run an individual VLAN to each access switch, with all of them in a common subnet using "ip unnumbered" on an SVI (a rough sketch of the idea follows)? Are there any reasons for so many VLANs being spanned to each access switch?
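Only a sketch of the idea under assumed names and addressing, not a tested design; host-route creation and DHCP behaviour with ip unnumbered SVIs have platform-specific caveats worth reading up on. The shared subnet lives on one interface on the core, and each access switch gets its own VLAN whose SVI borrows that address.

! core: shared user subnet held on a loopback (assumed addressing)
interface Loopback0
 ip address 192.168.50.1 255.255.255.0
!
! one VLAN per access switch, no per-VLAN subnetting
vlan 201
 name ACCESS-SW-01
interface Vlan201
 ip unnumbered Loopback0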
-
Dieg0M: Quoting razam: "if there is a broadcast storm, will it shut down the port, or will it put it into an err-disabled state? I have used the command storm-control action shutdown."
The shutdown action will put it in err-disable mode (two show commands for checking this follow this post). See: Catalyst 2950 and Catalyst 2955 Switch Software Configuration Guide, 12.1(22)EA7 - Configuring Port-Based Traffic Control [Cisco Catalyst 2950 Series Switches] - Cisco Systems.
Ed, there will only be 21 PVST+ instances. I still need a show log at debugging level to give more information.
Thank you.
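For reference, two standard IOS show commands that make this visible:

! list ports that are currently err-disabled and the triggering cause
show interfaces status err-disabled
! list which causes auto-recovery is enabled for, and the recovery timer
show errdisable recovery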
EdTheLad: Quoting Dieg0M: "Ed, there will only be 21 PVST+ instances. I still need a show log at debugging level to give more information."
I know there are 21 PVST+ instances, but the number of instances is not the issue; the number of BPDUs is the issue. If you had 2 ports and 100 instances, it would be similar to 2 instances and 100 ports. The hard work done by the switch is generating the BPDUs; I've seen large switching networks melt because switches had issues processing BPDUs. I'd be pretty sure the issue he is seeing is due to the core switch; otherwise he would have seen it before getting to 130 access sites.
razam: Thank you all for your suggestions.
@EdTheLad
I have now allowed only 2 VLANs per port from the core side: VLAN 100 and one data VLAN.
CPU utilization is 50%; for the last 60 seconds, last 60 minutes, and last 72 hours it all shows 50%. I will test the performance from the access switch side tomorrow and share an update.
@Dieg0M
Thank you for sharing the article. The next thing I'll do is modify my err-disable configuration so that it does not recover a port that has gone down because of a broadcast storm.
I will run the debug command and share the results.
@deth1k
I'll check the TCAM utilization tomorrow and share the result with you. I will implement root guard on all the switches; there is a possibility that an access point or other network device an end user connects tries to become the root. You suggested one thing:
"why not move away from flat L2 model and run individual vlan to each access switch and have them in common subnet using "ip unnumbered" on an SVI? Are there any reasons for so many vlans spanned to each access switch?"
Can you please share any article or give some more input on implementing it?
Once again, thank you for your suggestions.
razam: After yesterday's modification of the VLAN allowed list on the trunk interfaces, I checked the performance of the core switch today: CPU utilization dropped from 50% to 15%, a big improvement.
Before, if we connected a computer to any access switch port, the response time from the gateway configured on the core 4500 used to be 100 ms; now it is 1 or 2 ms.
Today I have also modified my err-disable recovery configuration. Before, it recovered any cause after 30 seconds; now, if a port goes into the err-disabled state because of storm-control broadcast, it will not be restored (a sketch of this change is below). This way I will get to know which users' access points are violating the network traffic limits.
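A sketch of what that change could look like (the causes re-enabled here are examples only; the thread does not list the exact configuration used):

! drop the blanket auto-recovery that was restoring every cause after 30 s
no errdisable recovery cause all
! re-enable auto-recovery only for causes that should still clear on their own
errdisable recovery cause link-flap
errdisable recovery cause bpduguard
errdisable recovery interval 300
! storm-control is deliberately left out, so a port disabled by a broadcast
! storm stays down until someone shuts / no-shuts it manually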
@deth1k
Please see the results of show platform tcam utilization below, taken from one of the access switches.
CAM Utilization for ASIC# 0                 Max            Used
                                            Masks/Values   Masks/Values
 Unicast mac addresses:                     1040/8320      15/38
 IPv4 IGMP groups + multicast routes:       56/448         7/28
 IPv4 unicast routes:                       0/0            0/0
 IPv4 policy based routing aces:            0/0            0/0
 IPv4 qos aces:                             384/384        260/260
 IPv4 security aces:                        384/384        39/39
EdTheLad: Yup, as I expected, a BPDU issue; you shouldn't see any broadcast storms anymore. I don't agree with your err-disable modification, though; it means your users will be offline until someone intervenes. Why not just set up a syslog server and monitor which ports go err-disabled (a minimal sketch follows this post)?
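A minimal sketch of that approach (the syslog server address is an assumption):

! on each switch: export log messages to a central syslog server
logging host 192.168.100.250
logging trap informational
! err-disable events then show up centrally as %PM-4-ERR_DISABLE messages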
-
Dieg0M: Quoting EdTheLad: "Yup, as I expected, a BPDU issue; you shouldn't see any broadcast storms anymore. I don't agree with your err-disable modification; it means your users will be offline until someone intervenes. Why not just set up a syslog server and monitor which ports go err-disabled?"
-
EdTheLad: If the broadcast-storm threshold is set low enough, it won't bring the network down, but isolating a customer until somebody manually brings the port back online might not fit with his company policy regarding SLAs, etc., so it's better he thinks about this before blindly configuring it. Using a threshold that is just a little higher than normal customer usage and below a rate that can affect the rest of the network is the best solution.
-
Dieg0M: Quoting EdTheLad: "If the broadcast storm threshold is set low enough it won't bring the network down, but isolating a customer until somebody manually brings the port online might not fit with his company policy regarding SLAs."
True, but it is much harder to pinpoint the root cause of an intermittent problem than an ongoing one (a port that is down and staying down). If the packet drops were more spaced out (say, a couple of hours apart), the user might not even notice the interface resets, and the problem could be present without you knowing about it. I don't know about you, but we don't care about interfaces going up/down at the access layer; it happens all the time.
EdTheLad: If nobody notices there is a problem, there's not a problem, lol. But in the case you highlight above, what would happen is a port ends up in err-disabled mode, a customer is down, and nobody knows why. The first thing that will happen is somebody toggles the port to see if everything is OK; it will be fine, and 2 hours later it will fail again. To troubleshoot this you will more than likely need the port up, and perform a SPAN, etc., to see what traffic is causing the problem. So in my opinion it would be better to monitor ports using syslog or traps; if a broadcast threshold is exceeded constantly, span the port and notify the customer of the issue, rather than have them calling you (a sketch of that threshold-plus-trap approach follows this post).
Anyway, I think we've given razam enough info for him to decide which way he prefers to go.
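To make the alternative concrete (the threshold numbers are placeholders, to be tuned against real per-port usage):

interface range fastEthernet 0/1 - 24
 ! drop broadcast above 5% of line rate, resume below 2%, but keep the port up
 storm-control broadcast level 5.00 2.00
 ! send a trap/log message instead of err-disabling the port
 storm-control action trap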
Dieg0M: Quoting EdTheLad: "If nobody notices there is a problem, there's not a problem lol. But in the case you highlight above, what would happen is a port is in err-disabled mode, a customer is down, nobody knows why. First thing that will happen is somebody will toggle the port and see if everything is ok, it will be fine and 2 hours later it will fail again."
-
EdTheLad: So we'll agree to disagree. Firstly, you don't know how his network is managed; as an engineer I look at the worst-case scenario and work around that. If using err-disable were the ideal solution it would be the default; because it's traffic-affecting and can be controlled with thresholds, Cisco have chosen the less aggressive approach. I really can't understand where you're coming from: a customer has a broadcast storm which, when it hits a threshold, has its traffic transmitted only up to that threshold. The unicast threshold would be set to whatever the customer is permitted, and broadcast would be set to a much lower level, which means it can't affect the rest of the network. A trap or syslog message is sent as soon as the threshold is reached, so you are aware of the issue and troubleshooting can begin. Maybe the storm is a random event every 6 hours for 2 minutes; you are diagnosing and troubleshooting the issue before the customer is even aware of it. You find that the traffic is all sourced from the same MAC address, at which point you shut down the port and forward your findings to the company.
But you prefer to have the port shut down and affect a customer without any clue as to what's causing the issue. The customer is calling and complaining, your boss is asking you to resolve the issue ASAP, etc. Maybe that's just how you roll!
Dieg0M: Let's agree to disagree. There are arguments for and against, and I think this was informative for the OP either way.
-
razam: Hello guys,
I recently checked the CPU utilization and it is showing 50% on average. I checked "show processes cpu"; it is the "ARP Input" process that is driving the CPU so high. What can be done to resolve this issue?
EdTheLad: First find out why you have a lot of ARP traffic. What is the timeout set on the ARP cache? You aren't routing to the Ethernet interface rather than to a next-hop IP, are you?
You'll probably have to span the port and find out where the requests are coming from (a few checks that follow from this are sketched below).
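A few standard IOS checks that follow from this advice (VLAN 10 is just an example SVI; 14400 seconds, i.e. 4 hours, is the default ARP timeout):

! confirm ARP Input really is the top process
show processes cpu sorted
! the ARP statistics block in this output shows request/reply counts
show ip traffic
! see the ARP timeout on an SVI (printed in the interface output)
show interfaces vlan 10 | include ARP
! and raise it if it has been lowered, so the cache is rebuilt less often
interface vlan 10
 arp timeout 14400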