QoS Mistake

Claymoore Member Posts: 1,637
Yeah, I screwed up, but at least I will admit it. For over a year I have been telling the staff at work that you can't hurt our network because we are running QoS. Every time they wanted to blame the network for a problem, or expressed concerns about bandwidth or lag and how it would affect our VoIP system, I would say 'no worries' - we are running QoS.

Today I found out we weren't...

About a year ago (1 month after I passed my CCNA and 3 months after I inherited the network) I got tired of the port shortage and cabling mess that came from plugging phones and PCs into different ports on our 5 Cisco 4507s. I grabbed our test 3550 switch and went to work testing the commands necessary to allow our phones and PCs to share the same ports. That wasn't very hard, but I was having trouble deciding how best to implement QoS. Everything I was reading was pushing me toward 'auto qos voip', so I ran that idea past the consultants who were supporting our phone system. They supported using auto QoS, and when I wrote up the template for the interface commands they approved the whole plan. I made all the interface changes, we eliminated about 400 patch cords, and everything was fine for a year.
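
For context, the per-port template ended up looking something like this. This is only a sketch for a 3550-class switch - the interface and VLAN numbers here are invented:

    interface FastEthernet0/1
     switchport mode access
     switchport access vlan 10       ! data VLAN for the PC
     switchport voice vlan 100       ! auxiliary VLAN for the phone
     auto qos voip cisco-phone       ! generates the QoS config; trusts markings only while a Cisco phone is detected via CDP
     spanning-tree portfast

The cisco-phone keyword matters: if the phone is unplugged and a PC goes straight into the port, the switch stops trusting whatever that PC marks.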

Fast-forward to the end of February, when the first of our oft-delayed dual 100 Mb Metro Ethernet connections came online to replace the overpriced OC-3 connection to our DR site, where all our 800 numbers terminate. This implementation also included 2 new multilayer 4948 switches to replace the 2970 switch and 7204 router at our DR site. The only routers we have left are the 2 3825s (one at each location) handling our voice traffic. 2 weeks later we fired up 2 more 4948s (one of them 10 Gb) to support the new IP-based NS3-80 EMC SAN that we are implementing. Included in this new SAN implementation are some NFS mounts we are testing that are connected and backed up over the normal data network (our NDMP license hasn't arrived yet). Pile on top of that a server crash that forced an impromptu migration to a new SQL cluster, which then lost its connection to the backup VLAN, so all of its data (about 3.5 TB) was backed up over the data VLAN. In a very short time we changed how we moved a lot of data - some routed, some not - through our network. However, I didn't expect any problems because we were running QoS. I also thought the big data transfers were occurring after hours, when any brief burst of traffic would not affect our call centers.

Except we weren't really running QoS, and we started having problems. The sub and pub (our CallManager publisher and subscriber) would lose connectivity and the phones would unregister and re-register. We were getting 'temp fail' messages quite frequently. The phones were marking the frames correctly, but those tags were useless because I didn't use the command 'qos trust dscp' on all of my trunk and EtherChannel ports connecting my switches. The engineer assigned to the TAC case we opened pointed this out. Without that command, all the QoS tags set by the phones are re-marked to zero as soon as the frames enter the next switch, because once QoS is enabled globally, any untrusted port rewrites the markings on incoming traffic. I did not know that. I thought that once the tag was set, the QoS information would stay with the frame the entire time it was in my network. I then ran some Wireshark captures to verify that what he said was true - sure enough, no QoS tags were in the frames received by my phone.
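
For anyone who inherits a similar network, here is the shape of the fix. This is only a sketch - the interface numbers are made up, and the exact syntax varies by platform (mls qos trust dscp on the 3550-style switches, qos trust dscp on the 4500s):

    ! 3550/3560-style switch: enable QoS globally, then trust DSCP on every inter-switch link
    mls qos
    interface GigabitEthernet0/1
     description Trunk to 4507
     mls qos trust dscp              ! preserve the DSCP values the phones set
    !
    ! Catalyst 4500/4948 equivalent
    qos
    interface GigabitEthernet1/1
     qos trust dscp
    !
    ! Verify what a port actually trusts:
    !   show mls qos interface GigabitEthernet0/1   (3550)
    !   show qos interface GigabitEthernet1/1       (4500)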

Enabling QoS on all those connections is an important step, and I missed it. I don't know if I just skipped that information in the documentation and books I was referencing when I did the design, or if it's not well documented. Either way, I screwed up and our voice guys were taking the heat for it. I enabled QoS on those ports and set the IPCC, Unity and CM switch ports to trust the DSCP tags set by the servers. Enabling QoS on the 3825s was easy with the SDM - 4 mouse clicks sent 56 commands to the routers. More Wireshark captures proved that the phones are sending and receiving QoS tags - CoS 3 for call control and CoS 5 for voice traffic.
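
For the curious, the commands SDM pushes are essentially a Modular QoS CLI (MQC) policy. Below is a simplified sketch of that style of policy, not the exact 56 commands SDM generated - the class names, percentages, and interface are illustrative:

    class-map match-any VOICE
     match ip dscp ef                ! voice media (CoS 5 / DSCP 46)
    class-map match-any CALL-CONTROL
     match ip dscp cs3               ! call signaling (CoS 3 / DSCP 24)
     match ip dscp af31              ! legacy call-signaling marking
    !
    policy-map WAN-EDGE
     class VOICE
      priority percent 33            ! strict-priority LLQ queue for voice
     class CALL-CONTROL
      bandwidth percent 5            ! guaranteed bandwidth for signaling
     class class-default
      fair-queue                     ! everything else shares what remains
    !
    interface GigabitEthernet0/0
     service-policy output WAN-EDGE

On the capture side, a Wireshark display filter such as ip.dsfield.dscp == 46 is a quick way to confirm that EF-marked voice traffic is actually arriving with its marking intact.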

At 5 PM we recapped the day's events to make sure everyone was on the same page. Our Unix admin was involved this time, and he informed us that he has been testing about 500 GB of NFS-mounted data - data that is synchronized between the new and old SAN using scripts that run at noon and midnight. That, combined with the changes in our backup network, showed just how much it takes to overload a 4507 without QoS.

Comments

  • Luckycharms Member Posts: 267
    Looks like you learned your lesson... (though it was an extremely brutal, ego-busting lesson)...
    But don't feel bad - QoS problems are among the most common issues I see in VoIP networks. Auto QoS is a godsend if you truly know what it does and how to shape it after it builds your maps for you. But congratulations on fixing a problem that it sounds like you spent a fair amount of time troubleshooting.
    The quality of a book is never equated to the number of words it contains. -- And neither should a man be by the number of certifications or degrees he has earned.
  • mikej412 Member Posts: 10,086 ■■■■■■■■■■
    Claymoore wrote:
    I screwed up and our voice guys were taking the heat for it.
    I'd have to say this one was their fault -- a CCVP should have known not to take your word for the QoS configuration, and would have verified it and caught the problem.
    :mike: Cisco Certifications -- Collect the Entire Set!
  • Luckycharms Member Posts: 267
    Agreed... and what exactly are these people doing besides MACs (moves, adds, and changes), if you are the one doing the QoS architecture and maintenance? I mean, any decent consulting firm with a vested interest in the company would have looked at MOS scores or whatever other monitoring was happening and known something wasn't right. And how often are you guys doing network traffic analysis? But I am not here to rag on anyone... Good job on getting the problem fixed.
    The quality of a book is never equated to the number of words it contains. -- And neither should a man be by the number of certifications or degrees he has earned.
  • Claymoore Member Posts: 1,637
    The good news is that enabling QoS properly appears to have fixed our disconnect problems. We have had no errors since I made the change at 12:30 yesterday, including when our backup jobs kicked off last night.

    The bad news is that I am not sure when (if ever) a CCVP has been involved in our VoIP system beyond the week after it was turned on. It's possible that a couple of the engineers provided by the consulting company (who replaced the firm who performed the implementation) were CCVPs, but I don't know for sure. Our in-house person has no Cisco certifications whatsoever. That's not entirely his fault, because the only training he has had was about a week of a CCVP boot camp so he could learn IPCC scripting. Our manager is his backup, and he has no voice training either. They primarily handle MACs and support of our 50 or so call center scripts. We perform back-office administration for insurance companies, so multiple clients with different business units means we have a lot of call centers.

    More bad news - we have no network monitoring tools (besides Wireshark). I have been asking for them for about 18 months (since I first inherited the network) and keep getting turned down. I used the 30-day free trial of Orion and the SolarWinds Engineer's Toolset to build out the monitoring infrastructure, but when it came time to purchase licenses they balked. The only information I can get is from the SDM or averages from various show commands. As you can imagine, no tools and no training makes identifying and solving problems very difficult.
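
    In the meantime, the built-in counters are about the only visibility we have. A few examples of what the CLI can give you (the interface names here are placeholders):

        ! Per-queue drop counters on a 3550-class switch
        show mls qos interface GigabitEthernet0/1 statistics
        ! Per-class offered rates and drops once a service policy is attached on the 3825s
        show policy-map interface GigabitEthernet0/0
        ! Quick interface load check without a poller
        show interfaces GigabitEthernet0/0 | include load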
  • mikej412 Member Posts: 10,086 ■■■■■■■■■■
    Claymoore wrote:
    the firm who performed the implementation
    Ah!! Probably a company that jumped into the VoIP business back in the days when they would hire anyone who could spell C-C-V-P. And don't even get me started on "consultants"...

    While QoS covers a lot of material, on the Voice side it should be burned into your brain during study that QoS for Voice is end-to-end.

    But the important thing here is that you found the problem and fixed it -- and were nice enough to let everyone know about it. :D
    :mike: Cisco Certifications -- Collect the Entire Set!
  • mikeeo Member Posts: 71 ■■□□□□□□□□
    Now you know why the CCIE R&S lab has up to 14 pts on QoS :)
  • Slowhand Mod Posts: 5,161
    mikeeo wrote:
    Now you know why the CCIE R&S lab has up to 14 pts on QoS :)
    Heh, heh... that's right. Just remember, a CCIE is an engineer who has broken the network in every way possible, not just the usual ones.

    Free Microsoft Training: Microsoft Learn
    Free PowerShell Resources: Top PowerShell Blogs
    Free DevOps/Azure Resources: Visual Studio Dev Essentials

    Let it never be said that I didn't do the very least I could do.