Packet drops
Hello guys,
I have a network set up and am facing a few problems with it. I would appreciate any suggestions for improving it.
Devices
Core Switch = 4506-E
Access Switches = 2960
Connectivity
All access switches connect to the core over a single 1G fiber uplink each.
Management
VLAN 100
Core Switch = 192.168.100.1/24
Access Switches = 192.168.100.5/24 onwards
Data (internet)
VLANs 10 to 30 are used
each VLAN serves 5 access switches
VLAN 10 = 192.168.10.0/24 through VLAN 30 = 192.168.30.0/24
DHCP Pools configured on Core Switch
Default routers for the different VLANs are configured on the core switch (192.168.10.1, 192.168.11.1, ..., 192.168.30.1)
The default gateway configured on all access switches is 192.168.100.1 (the management IP address of the core). Should I keep this as the default gateway, or should I change it according to the VLAN (for example, if an access switch is in VLAN 10, ip default-gateway 192.168.10.1)?
DHCP snooping, spanning-tree PortFast, and storm control are configured.
Problem
1) When I connect a computer to any access switch port, I see a few packet drops after every 20 to 25 responses.
2) From the core side, when I try to telnet to an access switch, some switches take 3 to 4 tries to respond. If I ping them they will not respond, and if I use the show mac address-table command I cannot see the MAC of the switch on the management VLAN. After 3 or 4 tries I get a ping response, am able to telnet, and can see the MAC on that core switch port. What can be the reason for this issue?
Any suggestions for improving the network? Anything else I can implement on it?
Thank you
Comments
-
Dieg0M: You posted the same thread in the CCIE forums. Please provide configs for both devices if you want a definitive answer, and refrain from double posting.
-
razam: OK, will take care of that next time.
Here is a sample of the access switch configuration:
configure terminal
hostname <...>
!
vlan 100
 name Management-VLAN
vlan 10
 name Data-Vlan
!
interface vlan 100
 description ##Management-Interface##
 ip address 192.168.100.x 255.255.255.0
 no shutdown
 exit
!
ip dhcp snooping
ip dhcp snooping vlan 10
!
interface range fastEthernet 0/1 - 24/48
 description ##TO-END-USERS##
 switchport mode access
 switchport access vlan 10
 speed auto
 duplex auto
 spanning-tree portfast
 spanning-tree bpdufilter enable
 no shutdown
 no ip dhcp snooping trust
 ip dhcp snooping limit rate 70
 storm-control broadcast level 30.00 10.00
 storm-control action shutdown
 exit
!
interface gigabitEthernet 0/1
 description ##Uplink-to-Core-Switch##
 switchport mode trunk
 switchport trunk allowed vlan 10-30,100
 no shutdown
 ip dhcp snooping trust
 exit
!
service password-encryption
ip default-gateway 192.168.100.1
errdisable recovery cause all
errdisable recovery interval 30
vtp mode transparent
no ip domain lookup
ntp server 192.168.100.1
clock timezone gmt 3
!
line console 0
 login local
 exit
line vty 0 4
 transport input telnet
 login local
end
write
razam: On the core side, the configuration is:
1) Interface configuration for the ports connected to the access switches:
interface range gigabitEthernet 2/0/1 - 48
 description <.....>
 switchport mode trunk
 ip dhcp snooping trust
2) DHCP pools are created:
ip dhcp pool DATA_vlan10
 network 192.168.10.0 255.255.255.0
 default-router 192.168.10.1
 dns-server <......>
 domain-name <.....>
3) Interface VLAN (SVI) configuration:
interface vlan 10
 ip address 192.168.10.1 255.255.255.0
4) DHCP snooping is enabled:
ip dhcp snooping
ip dhcp snooping vlan 10-30
razam: I have checked the interface statistics; there are no errors on the interfaces, not even a single error on any interface.
"show interfaces gigabitEthernet 0/x counters errors"
I have run this command on both the access side and the core side; there are no errors.
I also checked one more command on the access switch side; here are its results:
bldg3#show interfaces gigabitEthernet 0/1 transceiver
ITU Channel not available (Wavelength not available),
Transceiver is internally calibrated.
If device is externally calibrated, only calibrated values are printed.
++ : high alarm, + : high warning, - : low warning, -- : low alarm.
NA or N/A: not applicable, Tx: transmit, Rx: receive.
mA: milliamperes, dBm: decibels (milliwatts).
                                     Optical    Optical
           Temperature  Voltage      Tx Power   Rx Power
Port       (Celsius)    (Volts)      (dBm)      (dBm)
---------  -----------  -----------  ---------  ---------
Gi0/1        33.7         3.30         -5.1       -9.8
Dieg0M: Is this a production network? Is it possible you are encountering broadcast storms and the error-recovery mechanism is re-enabling those ports over and over again? That would match the packet-loss cycle you are describing. Enable logging at debug level and give us a show log for a 10-minute period (a minimal sketch of that follows this post).
Thank you.
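A minimal sketch of what is being asked for here (the buffer size is an arbitrary example):

! on the access switch: keep debug-level messages in the local buffer
logging buffered 64000 debugging
! timestamp them so events can be correlated
service timestamps log datetime msec
! after roughly 10 minutes of normal traffic, collect:
show logging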
razam: This network is set up for a residential area.
There are around 130 access switches. The 4506-E core switch has 3 slots with 48 Gig SFP ports per slot, and all 130 access switches are connected to this single core.
Some users connect their own access points to the ports and use 3 to 4 devices in their rooms.
I think I need to modify the errdisable recovery configuration; currently, if an interface gets into an err-disabled state due to any cause, the recovery time is set to 30 seconds.
I will get the debug output soon and share it with you.
Thank you.
razam: I would like to confirm one thing here: if there is a broadcast storm, will it shut down the port, or will it put it into an err-disabled state?
I have used the command storm-control action shutdown.
EdTheLad: So you have 3 x 48 gig links from your core switch to access switches, these are all trunks, and no allowed list is configured on the core switch. This means you have 22 VLANs x 130 ports = 2860 spanning-tree instances on the core switch, and that is a minimum; if you have any additional VLANs configured in the VLAN database on the core switch, multiply them by 130 and add to 2860.
The first thing to do is add a VLAN allowed list to all trunk ports on the core switch (a rough sketch follows this post). It's possible you have exceeded the BPDU limit on the switch; if some random BPDUs don't get generated by the core, that would manifest as the problem you're seeing, with ports going blocking and forwarding, etc.
Check CPU utilization and see what STP stats are available on the switches. Check the logs of the access switches to see if ports are transitioning spanning-tree states.
If this is the issue, do you really need spanning-tree on the access switches? It sounds like you have a hub-and-spoke environment, so without redundancy I don't see the need.
Might be something completely different, but that's what I'd look at given the info you provided.
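A rough illustration of the two suggestions above (port and VLAN numbers are placeholders, not taken from the thread): prune each core trunk to only the VLANs its access switch needs, then check CPU and spanning-tree activity.

! core side: this uplink serves an access switch whose users sit in VLAN 10
interface gigabitEthernet 2/0/1
 switchport trunk allowed vlan 10,100
!
! quick checks for CPU load and spanning-tree churn
show processes cpu history
show spanning-tree summary totals
show spanning-tree detail | include ieee|occurr|from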
deth1k: EdTheLad, where did you get 22 VLANs from, and how is that 2860 STP instances? He has 3 VLANs, which would be 3 STP instances regardless of the number of access switches being used; it's PVST (the clue is in the name).
Razam, configure down-when-looped on your access switches. Also, how many end hosts do you have? Did you check your TCAM utilisation for the number of MACs? I also suggest you run a SPAN session and sniff one of the core uplinks (see the sketch after this post); you might have an ARP storm. Check CPU usage.
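A minimal sketch of the SPAN session mentioned above, run on the core (the source uplink and the destination port with a sniffer attached are assumptions):

monitor session 1 source interface gigabitEthernet 2/0/1 both
monitor session 1 destination interface gigabitEthernet 2/0/48
! verify the session
show monitor session 1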
EdTheLad: VLANs 10-30 = 21 VLANs
VLAN 100 = 1 VLAN
Total = 22 VLANs
Each core port carries a minimum of 22 VLANs; since no VLAN allowed list is configured, it could be a lot more. I meant BPDU instances per chassis: each core port will send a BPDU per VLAN, which means the switch has to support sending and receiving 130 x 22 BPDUs. If the switch can't handle this, BPDUs won't be sent and remote blocked ports will go forwarding, causing a temporary loop, etc.
deth1k: Meh, can't read; I could swear it was 10, 30. Either way you will have an instance per VLAN in the database, so 22 instances in total, well, 23 including VLAN 1. I'd be more worried about TCAM usage, as the 2960 has an 8K limit. Also remove the BPDU filter and enable root guard (a sketch follows this post); those cheap access points also run STP, so all sorts of problems could come from them. Storm control would drop broadcast storms but takes a hit on the CPU, so that is something to consider too.
"sh platform tcam utilization" would be nice to check.
deth1k: By the way, Razam, why not move away from the flat L2 model and run an individual VLAN to each access switch, with all of them in a common subnet using "ip unnumbered" on an SVI (a rough sketch of the idea follows)? Are there any reasons for so many VLANs being spanned to each access switch?
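Only a sketch of the idea under assumed names and addressing, not a tested design; host-route creation and DHCP behaviour with ip unnumbered SVIs have platform-specific caveats worth reading up on. The shared subnet lives on one interface on the core, and each access switch gets its own VLAN whose SVI borrows that address.

! core: shared user subnet held on a loopback (assumed addressing)
interface Loopback0
 ip address 192.168.50.1 255.255.255.0
!
! one VLAN per access switch, no per-VLAN subnetting
vlan 201
 name ACCESS-SW-01
interface Vlan201
 ip unnumbered Loopback0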
-
Dieg0M: Quoting razam: "if there is a broadcast storm, will it shut down the port, or will it put it into an err-disabled state? I have used the command storm-control action shutdown."
The shutdown action will put it in err-disable mode (two show commands for checking this follow this post). See: Catalyst 2950 and Catalyst 2955 Switch Software Configuration Guide, 12.1(22)EA7 - Configuring Port-Based Traffic Control [Cisco Catalyst 2950 Series Switches] - Cisco Systems.
Ed, there will only be 21 PVST+ instances. I still need a show log at debugging level to give more information.
Thank you.
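For reference, two standard IOS show commands that make this visible:

! list ports that are currently err-disabled and the triggering cause
show interfaces status err-disabled
! list which causes auto-recovery is enabled for, and the recovery timer
show errdisable recovery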
EdTheLad: Quoting Dieg0M: "Ed, there will only be 21 PVST+ instances. I still need a show log at debugging level to give more information."
I know there are 21 PVST+ instances, but the number of instances is not the issue; the number of BPDUs is the issue. If you had 2 ports and 100 instances, it would be similar to 2 instances and 100 ports. The hard work done by the switch is generating the BPDUs; I've seen large switching networks melt because switches had issues processing BPDUs. I'd be pretty sure the issue he is seeing is due to the core switch; otherwise he would have seen it before getting to 130 access sites.
razam: Thank you all for your suggestions.
@EdTheLad
I have now allowed only 2 VLANs per port from the core side: VLAN 100 and one data VLAN.
CPU utilization is 50%; for the last 60 seconds, last 60 minutes, and last 72 hours it all shows 50%. I will test the performance from the access switch side tomorrow and share an update.
@Dieg0M
Thank you for sharing the article. The next thing I'll do is modify my err-disable configuration so that it does not recover a port that has gone down because of a broadcast storm.
I will run the debug command and share the results.
@deth1k
I'll check the TCAM utilization tomorrow and share the result with you. I will implement root guard on all the switches; there is a possibility that an access point or other network device an end user connects tries to become the root. You suggested one thing:
"why not move away from flat L2 model and run individual vlan to each access switch and have them in common subnet using "ip unnumbered" on an SVI? Are there any reasons for so many vlans spanned to each access switch?"
Can you please share any article or give some more input on implementing it?
Once again, thank you for your suggestions.
razam: After yesterday's modification of the VLAN allowed list on the trunk interfaces, I checked the performance of the core switch today: CPU utilization dropped from 50% to 15%, a big improvement.
Before, if we connected a computer to any access switch port, the response time from the gateway configured on the core 4500 used to be 100 ms; now it is 1 or 2 ms.
Today I have also modified my err-disable recovery configuration. Before, it recovered any cause after 30 seconds; now, if a port goes into the err-disabled state because of storm-control broadcast, it will not be restored (a sketch of this change is below). This way I will get to know which users' access points are violating the network traffic limits.
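A sketch of what that change could look like (the causes re-enabled here are examples only; the thread does not list the exact configuration used):

! drop the blanket auto-recovery that was restoring every cause after 30 s
no errdisable recovery cause all
! re-enable auto-recovery only for causes that should still clear on their own
errdisable recovery cause link-flap
errdisable recovery cause bpduguard
errdisable recovery interval 300
! storm-control is deliberately left out, so a port disabled by a broadcast
! storm stays down until someone shuts / no-shuts it manually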
@deth1k
Please see the results of show platform tcam utilization below, taken from one of the access switches.
CAM Utilization for ASIC# 0                 Max            Used
                                            Masks/Values   Masks/Values
 Unicast mac addresses:                     1040/8320      15/38
 IPv4 IGMP groups + multicast routes:       56/448         7/28
 IPv4 unicast routes:                       0/0            0/0
 IPv4 policy based routing aces:            0/0            0/0
 IPv4 qos aces:                             384/384        260/260
 IPv4 security aces:                        384/384        39/39
EdTheLad: Yup, as I expected, a BPDU issue; you shouldn't see any broadcast storms anymore. I don't agree with your err-disable modification, though; it means your users will be offline until someone intervenes. Why not just set up a syslog server and monitor which ports go err-disabled (a minimal sketch follows this post)?
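A minimal sketch of that approach (the syslog server address is an assumption):

! on each switch: export log messages to a central syslog server
logging host 192.168.100.250
logging trap informational
! err-disable events then show up centrally as %PM-4-ERR_DISABLE messages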
-
Dieg0M: Quoting EdTheLad: "Yup, as I expected, a BPDU issue; you shouldn't see any broadcast storms anymore. I don't agree with your err-disable modification; it means your users will be offline until someone intervenes. Why not just set up a syslog server and monitor which ports go err-disabled?"
-
EdTheLad: If the broadcast-storm threshold is set low enough, it won't bring the network down, but isolating a customer until somebody manually brings the port back online might not fit with his company policy regarding SLAs, etc., so it's better he thinks about this before blindly configuring it. Using a threshold that is just a little higher than normal customer usage and below a rate that can affect the rest of the network is the best solution.
-
Dieg0M: Quoting EdTheLad: "If the broadcast storm threshold is set low enough it won't bring the network down, but isolating a customer until somebody manually brings the port online might not fit with his company policy regarding SLAs."
True, but it is much harder to pinpoint the root cause of an intermittent problem than an ongoing one (a port that is down and staying down). If the packet drops were more spaced out (say, a couple of hours apart), the user might not even notice the interface resets, and the problem could be present without you knowing about it. I don't know about you, but we don't care about interfaces going up/down at the access layer; it happens all the time.
EdTheLad: If nobody notices there is a problem, there's not a problem, lol. But in the case you highlight above, what would happen is a port ends up in err-disabled mode, a customer is down, and nobody knows why. The first thing that will happen is somebody toggles the port to see if everything is OK; it will be fine, and 2 hours later it will fail again. To troubleshoot this you will more than likely need the port up, and perform a SPAN, etc., to see what traffic is causing the problem. So in my opinion it would be better to monitor ports using syslog or traps; if a broadcast threshold is exceeded constantly, span the port and notify the customer of the issue, rather than have them calling you (a sketch of that threshold-plus-trap approach follows this post).
Anyway, I think we've given razam enough info for him to decide which way he prefers to go.
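To make the alternative concrete (the threshold numbers are placeholders, to be tuned against real per-port usage):

interface range fastEthernet 0/1 - 24
 ! drop broadcast above 5% of line rate, resume below 2%, but keep the port up
 storm-control broadcast level 5.00 2.00
 ! send a trap/log message instead of err-disabling the port
 storm-control action trap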
Dieg0M: Quoting EdTheLad: "If nobody notices there is a problem, there's not a problem lol. But in the case you highlight above, what would happen is a port is in err-disabled mode, a customer is down, nobody knows why. First thing that will happen is somebody will toggle the port and see if everything is ok, it will be fine and 2 hours later it will fail again."
-
EdTheLad: So we'll agree to disagree. Firstly, you don't know how his network is managed; as an engineer I look at the worst-case scenario and work around that. If using err-disable were the ideal solution it would be the default; because it's traffic-affecting and can be controlled with thresholds, Cisco have chosen the less aggressive approach. I really can't understand where you're coming from: a customer has a broadcast storm which, when it hits a threshold, has its traffic transmitted only up to that threshold. The unicast threshold would be set to whatever the customer is permitted, and broadcast would be set to a much lower level, which means it can't affect the rest of the network. A trap or syslog message is sent as soon as the threshold is reached, so you are aware of the issue and troubleshooting can begin. Maybe the storm is a random event every 6 hours for 2 minutes; you are diagnosing and troubleshooting the issue before the customer is even aware of it. You find that the traffic is all sourced from the same MAC address, at which point you shut down the port and forward your findings to the company.
But you prefer to have the port shut down and affect a customer without any clue as to what's causing the issue. The customer is calling and complaining, your boss is asking you to resolve the issue ASAP, etc. Maybe that's just how you roll!
Dieg0M: Let's agree to disagree. There are arguments for and against, and I think this was informative for the OP either way.
-
razam: Hello guys,
I recently checked the CPU utilization and it is showing 50% on average. I checked "show processes cpu"; it is the "ARP Input" process that is driving the CPU so high. What can be done to resolve this issue?
EdTheLad: First find out why you have a lot of ARP traffic. What is the timeout set on the ARP cache? You aren't routing to the Ethernet interface rather than to a next-hop IP, are you?
You'll probably have to span the port and find out where the requests are coming from (a few checks that follow from this are sketched below).
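A few standard IOS checks that follow from this advice (VLAN 10 is just an example SVI; 14400 seconds, i.e. 4 hours, is the default ARP timeout):

! confirm ARP Input really is the top process
show processes cpu sorted
! the ARP statistics block in this output shows request/reply counts
show ip traffic
! see the ARP timeout on an SVI (printed in the interface output)
show interfaces vlan 10 | include ARP
! and raise it if it has been lowered, so the cache is rebuilt less often
interface vlan 10
 arp timeout 14400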