High CPU ready and poor performance on an SQL VM

Essendon
Been having performance issues with a CRM environment that is housed in 3 VMs.

Application VM1: 2 vCPUs and 8 GB RAM
Application VM2: 2 vCPUs and 8 GB RAM
SQL database VM: 4 vCPUs and 12 GB RAM

Users are experiencing performance issues when running reports, which, as you'd imagine, run against the database. They reckon the system slows down considerably when they do this. There are about 400 users accessing this system; I don't know how many of them use it concurrently. I have also had reports of slowness when nobody is generating reports. The CRM sysadmin has reported 100% CPU utilization while reports are being generated and 80-90% utilization at other times.

I had a look at the SQL VM's advanced performance charts, and Ready times are quite high (almost 10%). The screenshot shows the average to be 1732 ms, and I have seen it higher than 2200 ms. Rounding 1732 up to 1800 over the 20-second (20000 ms) real-time sample interval gives (1800 / 20000) × 100 = 9%. But spread across 4 vCPUs, that's only ~2.25% per vCPU. That's not high, is it?
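A minimal Python sketch of that conversion, assuming that the VM-level Ready value in a real-time chart is accumulated over one 20-second sample and summed across all of the VM's vCPUs (the usual reading of the VM-level summation counter). The function name `cpu_ready_percent` is hypothetical.

```python
def cpu_ready_percent(ready_ms: float, vcpus: int = 1, interval_s: float = 20.0):
    """Convert a CPU Ready summation (milliseconds) into percentages.

    Assumes ready_ms was accumulated over one sample of interval_s seconds
    (20 s for vCenter real-time charts) and summed across all vCPUs.
    Returns (percent for the whole VM, average percent per vCPU).
    """
    interval_ms = interval_s * 1000.0
    vm_pct = ready_ms / interval_ms * 100.0
    return vm_pct, vm_pct / vcpus


# The SQL VM's average (1732 ms) and peak (2200 ms) from this thread, 4 vCPUs.
for sample in (1732, 2200):
    vm_pct, per_vcpu = cpu_ready_percent(sample, vcpus=4)
    print(f"{sample} ms -> VM: {vm_pct:.1f}%, per vCPU: {per_vcpu:.1f}%")
```

For the 1732 ms average this works out to about 8.7% for the VM and about 2.2% per vCPU; the 2200 ms peak is 11% for the VM, or 2.75% per vCPU. The sketch reports both numbers because the question is really which one to compare against a threshold.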



One host runs all 3 VMs; there are 3 other hosts in the cluster. Each host has 2 sockets with 6 cores per socket and HT enabled. This host runs five VMs with 4 vCPUs, ten VMs with 2 vCPUs, and nine VMs with 1 vCPU. Going by what I read in the article linked below, these larger VMs are running on a host that has lots of smaller VMs.

CPU Ready Time in VMware and How to Interpret its Real Meaning - Jonathan Kehayias

The CPU ready times on the other two VMs are very low, around 1-2%. Disk latency for the host is low; GAVG values are about 2.5 ms.

I intend to do the following. What do the smart cats at TE recommend?

- Decrease the vCPU count on all 3 VMs: bring the SQL VM down to 2 vCPUs and the other two down to 1 vCPU each.
- Put in a DRS rule to separate these VMs onto different hosts. This may or may not be a good idea, because the other hosts are running similar workloads.
- Introduce a culture shift away from believing that more vCPUs are better. The CRM sysadmin wanted me to give him 12 vCPUs!

It's difficult to obtain downtime for this environment, so do I bring down the vCPU count on all VMs in one hit, or just do the SQL VM for now and see how it goes? That's given that the ready times for the other two VMs aren't high.
NSX, NSX, more NSX..

Blog >> http://virtual10.com

Comments

  • dave330i
    If I understand correctly, you have a single host running:

    5x VMs with 4 vCPUs
    10x VMs with 2 vCPUs
    9x VMs with 1 vCPU

    Among those VMs, you have your SQL as well?

    Adding up the vCPUs (5×4 + 10×2 + 9×1), you've got 49 vCPUs. Your host has 2x 6 cores = 12 pCPUs, so your consolidation ratio is ~4 vCPU:1 pCPU, which is high considering the number of multi-vCPU VMs you're running.
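    To double-check that arithmetic, here is a small Python sketch. The inventory comes straight from the thread; the helper name `consolidation_ratio` is hypothetical, and counting Hyper-Threads as logical CPUs only gives a rough, optimistic figure, since a Hyper-Threaded sibling is not a full core.

```python
def consolidation_ratio(vms, sockets, cores_per_socket, threads_per_core=1):
    """Return (total vCPUs, physical CPUs counted, vCPU:pCPU ratio).

    vms is a list of (number_of_vms, vcpus_per_vm) tuples.
    """
    total_vcpus = sum(count * vcpus for count, vcpus in vms)
    pcpus = sockets * cores_per_socket * threads_per_core
    return total_vcpus, pcpus, total_vcpus / pcpus


# Host from the thread: 5x 4-vCPU, 10x 2-vCPU and 9x 1-vCPU VMs on 2 x 6 cores.
host_vms = [(5, 4), (10, 2), (9, 1)]
print(consolidation_ratio(host_vms, sockets=2, cores_per_socket=6))
# Counting Hyper-Threads as logical CPUs halves the ratio on paper only.
print(consolidation_ratio(host_vms, sockets=2, cores_per_socket=6, threads_per_core=2))
```

    That gives 49 vCPUs on 12 cores, roughly 4.1:1, or about 2:1 if you count all 24 hardware threads.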

    You could lower the vCPU count on the SQL VM, but considering its utilization is near max, I wouldn't do it. I'd recommend giving your SQL VM higher CPU shares.
    2018 Certification Goals: Maybe VMware Sales Cert
    "Simplify, then add lightness" -Colin Chapman
  • instant000
    Well, based on what you say, and what the article says, it looks like an issue of oversubscription.

    Your SQL VM probably needs to be scheduled onto the physical CPUs frequently, since it's probably one of your beefier VMs.

    Unfortunately, with so many vCPUs already allocated, the other VMs also get their share of scheduling time, which in turn can make your SQL VM "wait in line" more than it would like to.

    Based on the situation you describe, decreasing the number of vCPUs should have a positive effect.

    Also, adding more vCPUs cannot help the situation; it just adds more oversubscription.

    You should be able to explain it like so (depending on who you have to talk to):

    It's like there are only so many checkout registers at the supermarket (however many processing cores you have).

    The vCPUs just take up spots in line and wait for a chance to check out. As soon as they check out, they circle around and get back in line again. If you add more and more vCPUs, you only make the queues longer.
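    The checkout analogy can be turned into a toy Python model. It is deliberately crude and is not how the ESXi scheduler actually works: it assumes every vCPU is always runnable and that the scheduler simply rotates them through the physical cores in round-robin order. Under those assumptions, once vCPUs outnumber cores, each vCPU waits a fraction 1 - pCPUs/vCPUs of the time. The function name `toy_ready_fraction` is hypothetical.

```python
from collections import deque


def toy_ready_fraction(num_vcpus: int, num_pcpus: int, ticks: int = 1000) -> float:
    """Average fraction of ticks a vCPU spends waiting in line.

    Toy assumptions: every vCPU always wants to run, and each tick the
    first num_pcpus vCPUs in the queue run, then rejoin the back of it.
    """
    queue = deque(range(num_vcpus))
    waiting_ticks = 0
    for _ in range(ticks):
        running = [queue.popleft() for _ in range(min(num_pcpus, len(queue)))]
        waiting_ticks += len(queue)   # everyone still in line waited this tick
        queue.extend(running)         # the ones that ran get back in line
    return waiting_ticks / (num_vcpus * ticks)


for vcpus in (12, 24, 49):
    print(vcpus, "vCPUs on 12 cores:", round(toy_ready_fraction(vcpus, 12), 2))
```

    In this toy model, 12 vCPUs on 12 cores never wait, 24 wait half the time, and 49 wait about 76% of the time. Real VMs are idle much of the time, which is why real ready times are far lower; the point is only the direction of the effect.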

    Hope this helps.
    Currently Working: CCIE R&S
    LinkedIn: http://www.linkedin.com/in/lewislampkin (Please connect: Just say you're from TechExams.Net!)
  • Essendon
    Thanks for taking the time to reply, Dave. You did understand correctly, and the SQL VM is amongst those VMs. I also had a look at the CPU utilization on the SQL VM (because I didn't entirely believe the sysadmin); check out the following picture:

    After looking at the above picture, do you still recommend giving the VM more shares? The weird thing is, he's shown me screenshots of 100% CPU utilization when he generated reports last Friday, but the above graph doesn't show a spike in usage. Can you shed some light on this too, please, mate? Why doesn't the above picture reflect the high CPU usage on the VM that the sysadmin saw?

    P.S. The attachment is of the host. I didn't mean to attach it, and I can't delete it for some reason.
  • jibbajabba
    Were the drives used by SQL formatted with a 64K cluster size and properly aligned? (Win2008 R2 aligns partitions automatically, but the default cluster size is still sub-optimal for SQL.)
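    A quick way to sanity-check both points, sketched in Python. On Windows, `wmic partition get Name, StartingOffset` reports each partition's starting offset in bytes, and `fsutil fsinfo ntfsinfo D:` reports "Bytes Per Cluster"; the sketch only applies the arithmetic to those two numbers. The 64 KB target is the commonly cited recommendation for SQL Server data and log volumes, and the function name `check_sql_volume` is hypothetical.

```python
KB = 1024


def check_sql_volume(starting_offset: int, bytes_per_cluster: int,
                     boundary: int = 64 * KB) -> list:
    """Return a list of problems found for a SQL Server data/log volume.

    starting_offset:   partition starting offset in bytes
    bytes_per_cluster: NTFS allocation unit size in bytes
    boundary:          alignment/cluster target (64 KB assumed here)
    """
    problems = []
    if starting_offset % boundary != 0:
        problems.append(f"partition offset {starting_offset} is not a multiple of {boundary}")
    if bytes_per_cluster != boundary:
        problems.append(f"cluster size is {bytes_per_cluster} bytes, not {boundary}")
    return problems


# Typical Windows Server 2008+ layout: 1 MB offset (aligned) but 4 KB clusters.
print(check_sql_volume(1024 * KB, 4 * KB))
# Windows Server 2003 default: 63 sectors x 512 bytes = 32,256 bytes (misaligned).
print(check_sql_volume(63 * 512, 64 * KB))
```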

    Disk layout shouldn't affect CPU, but it's worth mentioning. Anyway, back to CPU: your admin probably saw the 100% CPU inside the VM rather than in the vSphere Client's performance chart. As you know, you cannot really trust CPU/memory utilization reported from inside the VM, as the guest OS is simply not aware of overheads, oversubscription, or scheduling...

    Here is an article I've sent out a few times to people wondering the exact same thing:

    Inaccuracy of In-guest Performance Counters « vPivot

    My own knowledge base made public: http://open902.com :p
  • dave330i
    There are several methods to reduce CPU ready time.

    1. Reduce the number of vCPUs on the SQL VM to ease scheduling. The SQL VM is spiking to 90%, so that's not really an option.
    2. Reduce the number of vCPUs in other VMs. Take a look at your 4 vCPU VMs to see if they really need them. If not, reduce accordingly.
    3. Change the CPU shares to High on the SQL VM (assuming the other VMs are at Normal).
    4. Set a CPU reservation on the SQL VM.
    5. vMotion other CPU-intensive VMs to other hosts. DRS has probably balanced your cluster, but it doesn't hurt to check.
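    For option 3, a rough Python sketch of how shares divide CPU when VMs are contending. It uses vSphere's preset CPU share values (Low = 500, Normal = 1000, High = 2000 per vCPU) and ignores reservations, limits, idle VMs, and the fact that a VM can never use more than its vCPUs' worth of cores. The three-VM host and its 12000 MHz capacity are hypothetical, as is the function name `entitlements`.

```python
# vSphere preset CPU share values, per vCPU.
SHARES_PER_VCPU = {"low": 500, "normal": 1000, "high": 2000}


def entitlements(vms: dict, host_mhz: float) -> dict:
    """Split host_mhz between VMs in proportion to their CPU shares.

    vms maps a VM name to (vcpus, share_level). Assumes every VM wants
    more CPU than it can get, i.e. the host is fully contended.
    """
    shares = {name: vcpus * SHARES_PER_VCPU[level]
              for name, (vcpus, level) in vms.items()}
    total = sum(shares.values())
    return {name: host_mhz * s / total for name, s in shares.items()}


vms = {"sql": (4, "normal"), "app1": (2, "normal"), "app2": (2, "normal")}
print(entitlements(vms, host_mhz=12000))
vms["sql"] = (4, "high")
print(entitlements(vms, host_mhz=12000))
```

    In this made-up example, raising the SQL VM from Normal to High moves its slice of a contended 12000 MHz from 6000 MHz to 8000 MHz. Shares only matter under contention; on an idle host they change nothing.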
  • blargoe
    Someone needs to actually look at what the OS and the application (SQL) are reporting as well. If, for example, the OS is spending lots of time in the "SYSTEM" process, then the OS is spending too much time on kernel-related work: drivers, antivirus, hardware issues (virtual or physical), etc. A SQL DBA needs to look at the SQL server from a performance point of view, or call Microsoft in and get their thoughts. What is SQL spending most of its time doing? What is the page life expectancy? Are the queries being produced by the application optimized? Are the databases themselves optimized?

    From a VMware perspective:

    Have you double-checked that you do not have any unintended or forgotten limits or reservations set on any of the VMs on this host?

    How do the overall CPU stats at the host level look?

    Any swapping/ballooning? Anything else aside from CPU that might be an issue?
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • it_consultant
    I think you are on the wrong track here. I am sure I am not the only one who has dealt with "the database goes slowly when people are running reports"; almost 100% of the time it is because the report is locking tables. You could have 40 processors (like my main DB does) and you will still have slowness if you're running a report off an active database. Try forking off a copy of your database (using replication or log shipping) to a read-only copy and run your reports from that database.
  • Essendon
    Thank you for the valuable suggestions! This is why TE rocks!!

    From the guest OS's perspective, I cannot log in to find out whether SQL is doing some funky stuff (this is a cloud environment). But I have engaged the customer's DBA to investigate this for me. Good suggestions there, it_consultant and blargoe; I've passed your ideas on to the DBA and I'll take it from there.

    From a VMware perspective, the host is not under CPU contention. There are no limits or reservations on any VMs on the host; these were among the first things I checked when I got this ticket. There's been no swapping/ballooning either. DRS is balancing the cluster, no worries there. I had a look at the other SMP VMs, and all have similarly high CPU ready times. All have 4 vCPUs and don't need them.

    I'll wait until the DBA comes back with an in-depth analysis of the SQL side of things. I'll see what I can do about reducing the number of vCPUs on the other VMs. Due to the way the environment is built, I cannot do this without asking/consulting/coercing scores of people here.

    I'll update this thread when I can.
  • dave330i
    Essendon wrote: »
    I'll see what I can do about reducing the number of vCPU's on the other VM's. Due to the way the environment is built, I cannot do this without asking/consulting/coercing scores of people here.

    Have fun explaining CPU scheduling. ;)
  • QHalo
    This type of thread makes me appreciate that I have full autonomy when it comes to the VMware environment where I work. I only have to explain CPU scheduling to my boss who already understands it. :)

    vCOps could help you make your case as well, if you have it.
  • dave330i
    QHalo wrote: »
    This type of thread makes me appreciate that I have full autonomy when it comes to the VMware environment where I work. I only have to explain CPU scheduling to my boss who already understands it. :)

    vCOPS could help you show reasoning as well. If you have it.

    vCOps Foundation is part of vSphere now.
  • QHalo
    Looks like he'd need at least Standard to get any of the resource optimization components though.
  • Essendon
    Yeah, I did install the basic version of vCOps and didn't get what I needed. From what I've heard, HP's Cloud Matrix does similar stuff; I'll begin investigating whether that's indeed the case (we're an HP shop).