How to troubleshoot intermittent interVLAN routing issues

--chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
I have run into an issue that pops up 1-2x a week for 40 to 180 minutes. It appears as though all routing between VLANs just drops out. I can ping hosts within the same VLAN, but I am unable to ping hosts in other VLANs. This of course means internet access is down as well. (I have tested that: the ISP connection past the demarc point is good, and traffic also gets through the firewall. The connection "drops" out at the next hop past the firewall, which is a Dell 3548 L3 switch.)

If I change my laptop's IP config to match a different VLAN, then I can ping the hosts in that VLAN, but again I cannot ping past the gateway.

So far this issue has popped up twice and it is starting to give me gray hair. I know just enough about switching/routing to start working on the issue, but my core knowledge really does not extend into troubleshooting.

Any suggestions on where to begin with this issue? Any debugs that could be suggested?

Comments

  • VAHokie56VAHokie56 Member Posts: 783
    So I am guessing the Dell switch is doing all the routing for the LAN subnets and maybe has a default route pointing to your internet router or FW?
    .ιlι..ιlι.
    CISCO
    "A flute without holes, is not a flute. A donut without a hole, is a Danish" - Ty Webb
    Reading:NX-OS and Cisco Nexus Switching: Next-Generation Data Center Architectures
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    VAHokie56 wrote: »
    So I am guessing the Dell switch is doing all the routing for the LAN subnets and maybe has a default route pointing to your internet router or FW?

    Yes; thank you for verbalizing what I could not.

    The Dell L3 has a default route of 0.0.0.0/0 via 10.1.0.254, which is the firewall/internet gateway. Knowing this, then, the L3 switch has to have at least one trunk port with sub-interfaces on it that act as the gateway for each VLAN, right?
  • VAHokie56VAHokie56 Member Posts: 783
    I think it is more likely (and mind you, I have never worked on a Dell switch) that it has SVIs for each VLAN, making them routes local to the switch, or "connected", and they are in the routing table. The link to the router, I would guess, is a routed switch port off the Dell with an IP address. If you want to PM me the config I can try to help; just make sure you star out any passwords or public IPs.
    I really think it sounds like the switch is taking a fart every so often, though, and it most likely is not a config issue. I assume the Dell can log system messages; do you have any from the time in question?
  • it_consultantit_consultant Member Posts: 1,903
    ^^ This ^^

    Chris, you are thinking of a router on a stick, which is old school. Most likely the Dells have a virtual router, which makes this unnecessary. In my stack, which isn't Dell, I have a couple of virtual routers (in Brocade it is called a VE, or 'virtual ethernet') which I configure in each VLAN. When I configure those they automatically route for the VLAN to other VEs in the stack, assuming that there is at least one untagged (or 'access') port active on the VLAN. Plugged into the firewall is a port which is untagged in one of those VLANs, and then there is a simple 0.0.0.0 0.0.0.0 route to the firewall port. I think the Dells are configured similarly.

    Seeing the running config will help greatly.
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    This makes sense. SVIs, right?

    This is the running config of the Dell 3548 L3, which is the next hop after the firewall:

    SW_SOUTH# show run
    spanning-tree mode rstp
    interface range ethernet e(30-47)
    spanning-tree portfast
    exit
    interface ethernet e12
    description DELL_PC_NEW_AUGUST_2014
    exit
    interface ethernet e26
    description SNIFFER_COMPUTER
    exit
    interface ethernet e27
    description GRANDSTREAM
    exit
    interface ethernet e29
    description FIREPANEL
    exit
    interface range ethernet e(30-31,33)
    description PHONES
    exit
    interface ethernet e32
    description PHONE_R_X112
    exit
    interface range ethernet e(34-47)
    description FIELDPOINT
    exit
    interface ethernet e48
    description RUCKUS_AP__828
    exit
    interface ethernet g2
    description LINK_TO_NO
    exit
    interface ethernet g3
    description LINK_TO_W
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport mode trunk
    exit
    vlan database
    vlan 8,21,52,100-101,104,112,120,128,900,999
    exit
    interface range ethernet e(34-47)
    switchport access vlan 8
    exit
    interface range ethernet e(30-33)
    switchport trunk native vlan 8
    exit
    interface range ethernet e48,g(2,4)
    switchport trunk allowed vlan add 8
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 21
    exit
    interface ethernet e27
    switchport access vlan 52
    exit
    interface range ethernet e48,g(2,4)
    switchport trunk allowed vlan add 52
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 100
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 101
    exit
    interface ethernet e48
    switchport trunk native vlan 104
    exit
    interface range ethernet e(30-33),g(2,4)
    switchport trunk allowed vlan add 104
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 112
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 120
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 128
    exit
    interface ethernet g4
    switchport trunk allowed vlan add 900
    exit
    interface ethernet g3
    switchport access vlan 999
    exit
    interface range ethernet e(30-33,48),g(2,4)
    switchport trunk allowed vlan add 999
    exit
    interface vlan 101
    name SWITCH_MGMT
    exit
    interface vlan 900
    name FIELDPOINT
    exit
    voice vlan oui-table add 0001e3 Siemens_AG_phone________
    voice vlan oui-table add 00036b Cisco_phone_____________
    voice vlan oui-table add 00096e Avaya___________________
    voice vlan oui-table add 000fe2 H3C_Aolynk______________
    voice vlan oui-table add 0060b9 Philips_and_NEC_AG_phone
    voice vlan oui-table add 0080f0 PANASONIC______________
    voice vlan oui-table add 00d01e Pingtel_phone___________
    voice vlan oui-table add 00e075 Polycom/Veritel_phone___
    voice vlan oui-table add 00e0bb 3Com_phone______________
    voice vlan id 52
    voice vlan cos 5
    interface ethernet e30
    voice vlan enable
    exit
    interface ethernet e31
    voice vlan enable
    exit
    interface ethernet e32
    voice vlan enable
    exit
    interface ethernet e33
    voice vlan enable
    exit
    interface vlan 999
    ip address 10.1.0.240 255.255.255.128
    exit
    interface vlan 101
    ip address 10.1.101.32 255.255.255.0
    exit
    interface vlan 1
    ip address 192.168.1.240 255.255.255.0
    exit
    interface vlan 900
    ip address 192.168.50.253 255.255.255.0
    exit
    ip default-gateway 10.1.101.1
    hostname SW_SOUTH
    logging 10.1.0.10
    username -- password -- level 15 encrypted
    snmp-server community -- rw view --
    clock timezone -5 zone EDT
    clock source sntp
    sntp unicast client enable
    sntp unicast client poll
    sntp anycast client enable
    sntp broadcast client enable
    sntp server -- poll
    sntp server 2610:20:6f15:15::27 poll
    ip name-server 8.8.8.8
    snmp-server set rlEventsDeleteEvents rlEventsDeleteEvents 1


    Default settings:
    Service tag:


    SW version


    Fast Ethernet Ports
    ==========================
    no shutdown
    speed 100
    duplex full
    negotiation
    flow-control off
    mdix auto
    no back-pressure


    Gigabit Ethernet Ports
    =============================
    no shutdown
    speed 1000
    duplex full
    negotiation
    flow-control off
    mdix auto
    no back-pressure


    interface vlan 1
    interface port-channel 1 - 15


    spanning-tree
    spanning-tree mode STP


    qos basic
    qos trust cos

    ___________________
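One detail in the running config above can be cross-checked offline. The following is only a minimal sketch using Python's standard `ipaddress` module: the SVI addresses and the two gateway addresses are copied from this thread, while the helper name `svi_for` is made up for illustration. It reports which VLAN interface, if any, sits on the same subnet as the 10.1.0.254 firewall mentioned earlier and as the 10.1.101.1 address on the `ip default-gateway` line.

```python
import ipaddress

# SVI addresses copied from the running config above (VLAN ID -> address/mask).
SVIS = {
    999: "10.1.0.240/255.255.255.128",
    101: "10.1.101.32/255.255.255.0",
    1: "192.168.1.240/255.255.255.0",
    900: "192.168.50.253/255.255.255.0",
}


def svi_for(next_hop):
    """Return (vlan_id, network) for the SVI whose subnet contains next_hop.

    Returns None when no configured SVI is on the same subnet.
    `svi_for` is a hypothetical helper, not a switch command.
    """
    target = ipaddress.ip_address(next_hop)
    for vlan_id, addr in SVIS.items():
        network = ipaddress.ip_interface(addr).network
        if target in network:
            return vlan_id, network
    return None


if __name__ == "__main__":
    checks = [("firewall", "10.1.0.254"), ("ip default-gateway", "10.1.101.1")]
    for label, hop in checks:
        print(label, hop, "->", svi_for(hop))
```

Under these assumptions the firewall address falls inside VLAN 999's subnet, while the `ip default-gateway` address falls inside VLAN 101 (SWITCH_MGMT). Whether this switch uses `ip default-gateway` for routed traffic or only for its own management traffic is not settled anywhere in the thread, so it may be worth comparing against the switch's actual routing table.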



    Interesting note:

    I was talking to an employee who sits near the network closet that holds this location's firewall, the L3 switch (next hop after the firewall) and another L2 switch. He said the APC battery backup that all of this is plugged into went off (audible alarm) yesterday. He went in and pushed the button to make the noise stop. He said this turned off everything (firewall, L3 and L2 switch), so he pushed the button again and everything came back up. He said this was at about 2:10 PM, which is the exact time the network came "back up" yesterday. Coincidence?
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    I checked for the logs, but because the end user power-cycled it yesterday there is nothing from before the power-cycle event. When I view the logs, all I can see is it powering up, ports checking in, then my remote connection getting created...
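Since the switch's own log did not survive the power cycle, one option is to keep an independent reachability record on a host that stays up. The sketch below is an illustration only, not something anyone in the thread ran: `outage_windows` and `ping_once` are hypothetical helper names, and the ping flags assume the stock `ping` utility (`-c` on Linux/macOS, `-n` on Windows), whose exit code is only a rough signal of reachability.

```python
import subprocess
import sys
from datetime import datetime, timedelta


def outage_windows(samples):
    """Collapse (timestamp, reachable) samples into (start, end) outage windows.

    An outage starts at the first failed sample and ends at the next
    successful one. An outage still open at the end gets end=None.
    """
    windows = []
    start = None
    for ts, reachable in samples:
        if not reachable and start is None:
            start = ts
        elif reachable and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def ping_once(host):
    """Send one echo request with the system ping; True if it exited with 0."""
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    result = subprocess.run(
        ["ping", count_flag, "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `ping_once` on a fixed interval against a gateway in another VLAN, appending `(datetime.now(), result)` to a list or file, and passing the samples to `outage_windows` would give start and end times that can be compared with UPS events or with messages on the syslog server named on the `logging` line of the config.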
  • it_consultantit_consultant Member Posts: 1,903
    I have a couple of suggestions based on what I see. When I initially looked at this I thought this was a Dell stack created by using the stacking ports available on the Dell, but what I am seeing (because of the trunk ports 30-33 etc.) is that you are probably plugging an aggregation or access layer into these switches. Now I am thinking STP and/or loops introduced at the access layer.

    This may not solve the problem, but I would convert to 802.1w RSTP across the switching fabric. I would declare all of the access ports as spanning tree access mode (it might be a different term in Dell), which will essentially prevent a switch added on an access port from participating in STP. I would then declare your trunk ports as spanning tree point-to-point links, which are ports allowed to participate in spanning tree topology updates. I would then put a BPDU filter on all the access ports, which will prevent a looped switch at someone's desk from spewing a ton of BPDUs into your aggregation and core.

    Doing that will at least prevent one of the most common scenarios that cause "network down" situations. Beyond that, I would take a look at the release notes for the newer Dell firmware (yours is from 2009) and see if a problem similar to yours is called out as a defect. This actually happened to me: I have hardened switches which would go to 100% utilization after a PCI scan. The next firmware release fixed that defect. My core switches went to 100% utilization if you hit a certain keystroke in the console; the VRRP service took all the CPU even though we weren't using it. Sometimes there are good things to be found in those releases.
  • VAHokie56VAHokie56 Member Posts: 783
    Is it just one L2 switch? It kind of sounds like the APC maybe died and nuked your L3 switch (bye bye inter-VLAN routing). The L2 switch at this point would still allow communication to stuff on the same subnet.
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    VAHokie56 wrote: »
    Is it just one L2 switch? It kind of sounds like the APC maybe died and nuked your L3 switch (bye bye inter-VLAN routing). The L2 switch at this point would still allow communication to stuff on the same subnet.

    I left out some network details to limit the scope of my original question.

    Your suggestion would be logical, but the L3 switch is one of 3 and there are 2 L2 switches in there as well (all of this is spread across 3 buildings). The APC, one L3 and the firewall are in building 2. I was in building 1 with another L3 switch and was also unable to ping out of any VLAN. Same story in building 3.
  • VAHokie56VAHokie56 Member Posts: 783
    It doesn't really make sense for there to be three L3 switches IMO... two I can live with for FHRPs, but I don't see any HSRP or GLBP on that Dell switch (my Dell skills suck, though). Is it possible you have three L3 switches but only one (the one you posted the config from) is actually handling routing? And if so, is the one above connected to the jacked-up APC?
  • phoeneousphoeneous Member Posts: 2,333 ■■■■■■■□□□
    1. Is this a new design?
    2. Did the problem just start happening or has it been happening since day one?
    3. Have you checked the logs in the UPS?

    Always start at the physical layer, which in this case would be the UPS. Plug the UPS comm cable into a PC and install the UPS software so you can check it. L2 or L3 troubleshooting won't do much if power is wonky.
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    VAHokie56 wrote: »
    It doesn't really make sense for there to be three L3 switches IMO... two I can live with for FHRPs, but I don't see any HSRP or GLBP on that Dell switch (my Dell skills suck, though). Is it possible you have three L3 switches but only one (the one you posted the config from) is actually handling routing? And if so, is the one above connected to the jacked-up APC?

    See the quick and dirty physical diagram I made; does this configuration make sense now? The L3 we have been discussing is not the only L3 that performs routing.

    Yes, the config posted above is the one actually connected to the APC that had an alarm go off.

    [Attached diagram: t6w9ax.jpg]
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    phoeneous wrote: »
    1. Is this a new design?
    2. Did the problem just start happening or has it been happening since day one?
    3. Have you checked the logs in the UPS?

    Always start at the physical layer, which in this case would be the UPS. Plug the UPS comm cable into a PC and install the UPS software so you can check it. L2 or L3 troubleshooting won't do much if power is wonky.


    1) No, this design has been in place since August of this year.
    2) This issue started on the 6th of this month.
    3) I had not thought to check the logs in the UPS; I will on Monday, though. That's a great idea, thanks!
  • VAHokie56VAHokie56 Member Posts: 783
    lol this is so confusing... OK, so if there is L3 in every building then there has to be routing between them also. That means an L3 boundary between each building, meaning that VLANs are pretty much local to each building as well, so there would be no way users in different buildings could be on the same subnet. Would you say that's an accurate statement? Now, if you tell me you have L3 trunks between the buildings and are doing routing and L2 across them all, I would say rip it all out and start over, because that's ugly. Also, which building has the bad APC?
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    VAHokie56 wrote: »
    lol this is so confusing... OK, so if there is L3 in every building then there has to be routing between them also. That means an L3 boundary between each building, meaning that VLANs are pretty much local to each building as well, so there would be no way users in different buildings could be on the same subnet. Would you say that's an accurate statement? Now, if you tell me you have L3 trunks between the buildings and are doing routing and L2 across them all, I would say rip it all out and start over, because that's ugly. Also, which building has the bad APC?

    Yes, buildings 1, 2 and 3 each have an L3 switch. Building 2 has a 3560, so if you're more comfortable with Cisco I could do a sh run on that and post it here. The L3s have static routes that point to each other.

    There are two VLANs that are in all three buildings (VoIP and "admin", aka office PCs), i.e. some users in buildings 1, 2 and 3 are in the same VLAN.

    Building 1 has the "bad" APC which I will be looking into tomorrow.

    I agree on rebuilding the network; it's very complicated and it does not need to be. We have customers 3-4X this size on 1 to 3 VLANs, not a dozen. However, the customer has short-term goals that need to be met and that is not one of them.
  • it_consultantit_consultant Member Posts: 1,903
    It would surprise me if the routing was the problem, we would expect that to be a constant problem. I am still at spanning tree and people looping switches but it is also possible that you have a VLAN leak. Since you are running a configuration where some VLANs are end to end and some are not I wonder if the VLAN config is messed up on the links somewhere. When you say "switchport mode trunk" and then declare the allowed VLANs, that doesn't necessarily remove all other VLANs from using the trunk. It might in Dell but my experience is that once you say "mode trunk" then all VLANs are allowed to cross the port (hence the term 'trunk') and you prune. From what I see the pruning is half done.
  • phoeneousphoeneous Member Posts: 2,333 ■■■■■■■□□□
    it_consultant wrote: »
    When you say "switchport mode trunk" and then declare the allowed VLANs, that doesn't necessarily remove all other VLANs from using the trunk.

    Huh? That's exactly what that command does.

    Catalyst 2960 and 2960-S Software Configuration Guide, 12.2(55)SE - Configuring VLANs [Cisco Catalyst 2960 Series Switches] - Cisco
  • it_consultantit_consultant Member Posts: 1,903
    He is not using a Cisco - well, one of them is a Cisco, the rest are Dells. BTW, the reference you sent actually doesn't say that. Reference:

    "By default, a trunk port sends traffic to and receives traffic from all VLANs. All VLAN IDs, 1 to 4094, are allowed on each trunk. However, you can remove VLANs from the allowed list, preventing traffic from those VLANs from passing over the trunk. To restrict the traffic a trunk carries, use the switchport trunk allowed vlan remove vlan-list interface configuration command to remove specific VLANs from the allowed list."

    In his pasted config he does not have the 'allowed vlan remove xxx', he just has 'allowed vlan xxx', hence my conclusion that the trunks are not properly pruned. What I am unsure of is whether, in the absence of the 'remove vlan-list' command, explicitly allowing certain VLANs implicitly blocks the other VLANs that are allowed by default. I do not know that to be true.
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    it_consultant wrote: »
    He is not using a Cisco - well, one of them is a Cisco, the rest are Dells. BTW, the reference you sent actually doesn't say that. Reference:

    "By default, a trunk port sends traffic to and receives traffic from all VLANs. All VLAN IDs, 1 to 4094, are allowed on each trunk. However, you can remove VLANs from the allowed list, preventing traffic from those VLANs from passing over the trunk. To restrict the traffic a trunk carries, use the switchport trunk allowed vlan remove vlan-list interface configuration command to remove specific VLANs from the allowed list."

    In his pasted config he does not have the 'allowed vlan remove xxx', he just has 'allowed vlan xxx', hence my conclusion that the trunks are not properly pruned. What I am unsure of is whether, in the absence of the 'remove vlan-list' command, explicitly allowing certain VLANs implicitly blocks the other VLANs that are allowed by default. I do not know that to be true.

    I too am still focusing on the noisy APC and possible VLAN issues. I am more focused on L1 than on L2/L3 because this network had been fine for 3 months with no changes at all, then two weeks ago it had that outage (no outage since either, knock on wood).

    I had food poisoning or something yesterday, but today I will be able to check out the APC's logs. Hopefully that has a big red flag in it.
  • phoeneousphoeneous Member Posts: 2,333 ■■■■■■■□□□
    it_consultant wrote: »
    He is not using a Cisco - well, one of them is a Cisco, the rest are Dells. BTW, the reference you sent actually doesn't say that. Reference:

    "By default, a trunk port sends traffic to and receives traffic from all VLANs. All VLAN IDs, 1 to 4094, are allowed on each trunk. However, you can remove VLANs from the allowed list, preventing traffic from those VLANs from passing over the trunk. To restrict the traffic a trunk carries, use the switchport trunk allowed vlan remove vlan-list interface configuration command to remove specific VLANs from the allowed list."

    In his pasted config he does not have the 'allowed vlan remove xxx', he just has 'allowed vlan xxx', hence my conclusion that the trunks are not properly pruned. What I am unsure of is whether, in the absence of the 'remove vlan-list' command, explicitly allowing certain VLANs implicitly blocks the other VLANs that are allowed by default. I do not know that to be true.


    The reason add and remove are not in the config is that they're dynamic commands. Sure, 1 to 4094 are allowed on the trunks by default until you specify otherwise. Why else would the allowed vlan command even exist if you had no control over which VLANs traversed the trunk?

    Lab it up, capture some traffic, you'll see.
  • it_consultantit_consultant Member Posts: 1,903
    On my switches you see both the adding of allowed VLANs and the removing of the non-allowed VLANs; which we should all be doing for at least VLAN 1, which OP doesn't have. However, the idea that we are implicitly denying other VLANs because we have allowed some VLANs is not evident to me unless I see it documented. To me this is like having an allow rule in a firewall without a catch all deny any any at the end of the ruleset. I am not saying you are wrong I am saying that the documentation does not tell me that allowing specific VLANs then denies all others; unless I just didn't see it.
  • networker050184networker050184 Mod Posts: 11,962 Mod
    Phoeneous is correct. When you specifically allow certain VLANs the rest are removed on a Cisco switch. Just do a show interface trunk command to see for yourself. It will list all allowed VLANs. Not sure about Dell though.
    An expert is a man who has made all the mistakes which can be made.
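The disagreement above comes down to whether an allowed-VLAN command replaces the list or edits it. As a rough illustration, the toy Python model below follows the Cisco behaviour described in the quoted configuration guide and in the moderator's reply: a trunk starts with VLANs 1-4094 allowed, a plain allowed list replaces it, and `add`/`remove` modify whatever is currently there. The `TrunkPort` class is hypothetical, and nothing here is verified against the Dell 3548, whose trunk defaults may differ.

```python
ALL_VLANS = frozenset(range(1, 4095))  # VLAN IDs 1-4094


class TrunkPort:
    """Toy model of a Cisco-style trunk allowed-VLAN list (not Dell-specific)."""

    def __init__(self):
        # By default the trunk carries every VLAN.
        self.allowed = set(ALL_VLANS)

    def set_allowed(self, vlans):
        """Model `switchport trunk allowed vlan <list>`: replace the whole list."""
        self.allowed = set(vlans) & ALL_VLANS

    def add(self, vlans):
        """Model `... allowed vlan add <list>`: extend the current list."""
        self.allowed |= set(vlans) & ALL_VLANS

    def remove(self, vlans):
        """Model `... allowed vlan remove <list>`: prune the current list."""
        self.allowed -= set(vlans)


if __name__ == "__main__":
    replaced = TrunkPort()
    replaced.set_allowed({8, 21})
    print(len(replaced.allowed))  # only the listed VLANs remain

    added = TrunkPort()
    added.add({8, 21})
    print(len(added.allowed))  # adding to the default "all" list changes nothing
```

In this model, a plain allowed list implicitly drops everything else, while `add` on an untouched trunk leaves all VLANs allowed. Which of those starting points applies to the `add` lines in the Dell config depends on that platform's trunk default, which the thread leaves open.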
  • it_consultantit_consultant Member Posts: 1,903
    Believe it or not, I do not have a Cisco switch sitting around! I have HP and Brocade, so I believe you. I do it both ways on my switches; maybe I don't have to. There is more than one way to skin a cat and, I suppose, more than one way to prune a trunk.
  • --chris----chris-- Member Posts: 1,518 ■■■■■□□□□□
    The APC's logging is hosed; its time and date appear to reset every 24 hours. Every log entry is dated Dec 31 1999.

    I am putting in a new battery backup tomorrow, and sorting out the other garbage as we move along.

    Thanks to all the other contributors, I learned that pruning VLAN trunks can be done a few ways :)