Admission Control - # Host Failures vs. % Cluster Resources

I started a new job this week, and I've been doing exhaustive discovery of my new surroundings. From a vSphere point of view, I'm used to fewer hosts than I'm working with here. However, I ran across a configuration that has me scratching my head as to why it was set up this way.

So, we have 13 hosts running ESXi 4.1. All hosts are managed by the same vCenter server in a single HA cluster, are identically configured (dual-socket, 6-core CPUs with 72GB RAM), and are all connected to the same datastores.

For admission control, if I have it enabled, I'm used to either using a single failover host or setting a number of tolerated host failures. In this case, the admission control policy is set to reserve 51% of the cluster's resources. On top of that, almost every VM has a memory reservation equal to its allocated memory, while guest memory utilization in the VMs is running at 10-20% of allocated memory in most cases. Since the reservations are in place, the consumed capacity is getting pretty close to the capacity reserved for failover, so if the status quo is maintained, more hosts will be required soon.
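As a rough sketch of what that 51% means for this cluster, the arithmetic below uses the figures from the post (13 hosts, 72GB each). It is a simplification: it looks at memory only and ignores per-VM overhead, and the function name `failover_reserve` is made up for illustration.

```python
# Simplified percentage-based admission control arithmetic (memory only).
# Assumption: per-VM memory overhead and CPU are ignored.

def failover_reserve(hosts, ram_per_host_gb, failover_percent):
    """Return (total, reserved, usable) cluster memory in GB."""
    total = hosts * ram_per_host_gb
    reserved = total * failover_percent / 100
    usable = total - reserved
    return total, reserved, usable


total, reserved, usable = failover_reserve(hosts=13, ram_per_host_gb=72, failover_percent=51)
print(f"Total cluster RAM:     {total} GB")
print(f"Reserved for failover: {reserved:.1f} GB (about {reserved / 72:.1f} hosts' worth)")
print(f"Left for reservations: {usable:.1f} GB")
```

Because every VM reserves its full allocation, the "left for reservations" figure is effectively the ceiling on total allocated VM memory, no matter how little the guests actually use.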

With this many hosts in the cluster, would it be normal to use such a large % of capacity reserved for failover? This seems like extreme overkill to me. Because of the reservations, I can see not using the # of host failures tolerated policy (slot size would be set equal to the VM with the largest memory reservation + overhead)... but then again, I don't see the point of having all the reservations either. I'm seriously shaking my head at that one, but I fear that the decision to make changes there will be out of my hands.

Thanks,

Blargoe

Comments

jmritenour

I'm not one to criticize how an engineer/admin set up an environment without actually seeing it for myself. Some things just don't play nice in a virtual environment unless they have a memory reservation - like Java, for instance. VMware KB: Best practices for running Java in a virtual machine. One of our customers encountered this on several Red Hat VMs, where the VMs would swap memory to disk even though there was plenty of RAM free on the host - setting the memory reservation solved it.

So yeah, it seems like overkill, and something I would probably try to avoid myself, but there might be a very good reason for it.

blargoe

The "good reason" is the "just in case" theory because of people having no backbone to push back. The same reason why we also have over 60 Domain Admins in AD right now. It's mostly just regular stuff like print servers, web apps, SharePoint, all utility apps or front-end apps, no databases at all. Even the templates are set to have all allocated memory reserved (with limit equal to the reservation). I understand there could be some one-off situations where it may be appropriate to have reservations set, but EVERY VM is provisioned this way. It seems unusual to me.

The real question was about the Admission Control, though. I'm just not familiar with it being set up in this way, and was wondering if this is more typical than I think it is.

jmritenour

Ah, well, that's pretty stupid then.

The percentage of resources admission control policy is actually considered a best practice by VMware, as it offers the most flexibility in host & VM sizing. The host failures tolerated policy is great if you're dealing with an environment with similarly sized VMs. Otherwise, the "slots" are based on the largest CPU/memory reservation configured in the cluster.
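To illustrate why one large reservation hurts the slot-based policy, here is a minimal sketch of the slot calculation. Everything in it is an assumption made for illustration: the VM names, reservations, overhead values and host clock speed are invented, the 256 MHz default CPU slot is what I understand vSphere 4.x to use when a VM has no CPU reservation, and the real HA algorithm has more detail than this.

```python
from dataclasses import dataclass

DEFAULT_CPU_SLOT_MHZ = 256  # assumed default for VMs with no CPU reservation


@dataclass
class VM:
    name: str
    cpu_reservation_mhz: int
    mem_reservation_mb: int
    mem_overhead_mb: int


def slot_size(vms):
    """Slot = largest CPU reservation and largest (memory reservation + overhead)."""
    cpu = max(vm.cpu_reservation_mhz or DEFAULT_CPU_SLOT_MHZ for vm in vms)
    mem = max(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in vms)
    return cpu, mem


def slots_per_host(host_cpu_mhz, host_mem_mb, slot):
    """A host holds as many slots as its scarcer resource allows."""
    cpu_slot, mem_slot = slot
    return min(host_cpu_mhz // cpu_slot, host_mem_mb // mem_slot)


# Hypothetical powered-on VMs: many small ones and a single large one.
vms = [
    VM("print01", 0, 2048, 100),
    VM("web01", 0, 4096, 150),
    VM("big01", 0, 16384, 300),
]
slot = slot_size(vms)
# Hypothetical host: 12 cores at 2660 MHz, 72GB RAM.
print("slot (MHz, MB):", slot)
print("slots per host:", slots_per_host(12 * 2660, 72 * 1024, slot))
```

With these made-up numbers the single 16GB reservation sets the memory slot for the whole cluster, so each host is counted as holding only a handful of slots even though most VMs are far smaller.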

vSphere Documentation Center

jibbajabba

If they do use reservations then it makes sense that they are using % rather than hosts, I think. With 51% I agree - it might be overkill, but since every VM, including powered-off VMs and templates, has reservations, they probably just want to make sure that they have enough resources no matter what - if that is what they want, then I am sure they are aware of the fact that they need to add hosts soon. Depending on your colleagues (whether they feel like you are stepping on their toes), you could always mention that.

I, for example, started a new job as well, and being a triple VCP, the first thing you obviously do is check out the cluster. I noticed that one cluster was not just overcommitted (2 hosts, single host failure allowed), meaning that they don't have enough RAM for HA, but the two hosts are also from completely different Xeon families (no DRS anyway) ...

Well, I dropped the hint and they said "yepp we know, it is what the customer wants" - so there you go

ptilsen

blargoe wrote: »

The same reason why we also have over 60 Domain Admins in AD right now.

I just threw up in my mouth.

blargoe

ptilsen wrote: »

I just threw up in my mouth.

lol....

jibbajabba

ptilsen wrote: »

I just threw up in my mouth.

<Dr Pepper>
What's the worst that could happen
</obvious>

dave330i

It sounds like they built the VMs like you would build physical machines. The 51% reserve is nuts. Maybe if every VM is tier 1 I might think about doing that, but probably not.

I'm curious, do they have their SANs set up similarly?

blargoe

dave330i wrote: »

It sounds like they built the VMs like you would build physical machines. The 51% reserve is nuts. Maybe if every VM is tier 1 I might think about doing that, but probably not.

I'm curious, do they have their SANs set up similarly?

I haven't been given access to the SANs yet, but I wouldn't be surprised if there was some of that going on too. My manager recently got control over the infrastructure team (previously he was storage and DBA) and he seems to be more of a best practices kind of guy, but his boss is the total opposite. I have a feeling the storage is going to be ridiculous when I look at it. They basically can't delete anything because of federal regulations that we fall under. Even when employees have left the company, their computer is imaged and saved forever.

Back to the VMs... I was thinking 51% had to be way off the mark from where it should be. I wonder if he set it that high when he had fewer hosts in the cluster and forgot to go back and change it? After talking to the other admin about other stuff today, I think this setup (including the reservations) was basically a mandate from higher-ups, when the company's production was just getting started... production cannot be interrupted or degraded, ever, and no changes can ever happen on the servers, ever. Whoever was in charge in the beginning didn't even allow Windows Updates for the first several years because they were so averse to changes.

bertieb

blargoe wrote: »

Whoever was in charge in the beginning didn't even allow Windows Updates for the first several years because they were so averse to changes.

There's a genius right there!
Apart from these oddities hope the new gig is going well.

blargoe

bertieb wrote: »

There's a genius right there!
Apart from these oddities hope the new gig is going well.

I think overcoming those initial oddities is part of the reason why the company made a big, sudden push to add Sr. level IT people to the staff... they were initially focused on making sure their engineering and design stayed online all the time, and now that that end of the business is stable I think current leadership is trying to repent of past sins.

I think this will still be a good situation for me. I figure I will either quit in 3 months or stay for 25 years and retire.

`ariel

Sorry to bring this thread back, but I was reading about VMware HA admission control and found this: admission control « Cloud-Buddy

According to the article, 51% is a bad design. The article is about vSphere 5, but I think it applies to vSphere 4 too.

JBrown

Blargoe, any updates on this? Were the adjustments made or did you find out what was the reason for setting it up that way?

blargoe wrote: »

I think overcoming those initial oddities is part of the reason why the company made a big, sudden push to add Sr. level IT people to the staff... they were initially focused on making sure their engineering and design stayed online all the time, and now that that end of the business is stable I think current leadership is trying to repent of past sins.

I think this will still be a good situation for me. I figure I will either quit in 3 months or stay for 25 years and retire.

QHalo

The only reason for 51% admission control would be if you had a two-node cluster. Everything that I've seen and read shows that the percentage of cluster resources reserved for failover should basically be 100 divided by the number of nodes in the cluster. This doesn't take things like reservations into account, but it is a good starting point.
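That rule of thumb can be sketched as a one-line calculation. This is only the starting-point arithmetic described above, assuming identically sized hosts; the function name `failover_percent` and the extension to more than one host failure are illustrative additions, and rounding up to a whole percent is a choice made here, not something the thread specifies.

```python
def failover_percent(num_hosts, host_failures=1):
    """Whole-number percentage covering `host_failures` identical hosts, rounded up."""
    if not 0 < host_failures < num_hosts:
        raise ValueError("host_failures must be between 1 and num_hosts - 1")
    # Integer ceiling of 100 * host_failures / num_hosts.
    return (100 * host_failures + num_hosts - 1) // num_hosts


for hosts in (2, 13):
    print(f"{hosts} hosts, 1 failure tolerated: {failover_percent(hosts)}%")
print(f"13 hosts, 2 failures tolerated: {failover_percent(13, 2)}%")
```

For the 13-host cluster in this thread, a single tolerated host failure works out to a single-digit percentage, which is why 51% looks so far off.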

blargoe

The guy that set it all up was long gone before I started with the company and no one else knows enough about VMware to know why he might have set it up that way. We have these hosts split between two BladeCenter chassis and the only thing I can think of is maybe he wanted to reserve enough capacity to be able to lose an entire BladeCenter chassis. But some of the other decisions that he made just make me think he was just randomly changing settings and didn't know what he was doing.
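The chassis theory can be checked with similar arithmetic. The thread does not say how the 13 hosts are divided between the two BladeCenter chassis, so the 7/6 split below is purely hypothetical, as is the function name; the sketch also assumes identical hosts and ignores reservations.

```python
def chassis_failover_percent(hosts_per_chassis):
    """Percentage of cluster capacity lost if the fullest chassis fails, rounded up."""
    total = sum(hosts_per_chassis)
    worst = max(hosts_per_chassis)
    # Integer ceiling of 100 * worst / total.
    return (100 * worst + total - 1) // total


# Hypothetical split of the 13 hosts across two chassis.
split = [7, 6]
print(f"Reserve needed to survive losing the fuller chassis: {chassis_failover_percent(split)}%")
```

Any two-chassis layout puts this figure at 50% or higher, so a setting around the halfway mark is at least in the range this theory would predict.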

I set it to something more reasonable (25%). I'm finally getting to redo the environment soon, thankfully. I'm re-installing everything so there will be no traces left of the original configuration other than whatever may have been set at the VM level that I haven't already changed.