HA failing

DevilWAH Member Posts: 2,997
I get a lot of "insufficient resources to fail over ..... The host can't access virtual machine components" errors.

Within the HA settings I have "do not reserve failover capacity" selected (for testing).

And the guests are running on vSAN.

The event logs give a long list of reasons why it might not be working, but is there any way to see what is actually going on and what might be causing it? Which resource can't the host support?
  • If you can't explain it simply, you don't understand it well enough. Albert Einstein
  • An arrow can only be shot by pulling it backward. So when life is dragging you back with difficulties. It means that its going to launch you into something great. So just focus and keep aiming.

Comments

  • DPG Member Posts: 780
    Can all of the hosts see the storage?
  • DevilWAH Member Posts: 2,997
    Hi,

    Well yes, it's vSAN and all of the hosts take part; there are no errors on this. And I know that when you have vSAN, hosts don't do datastore monitoring, as this is dealt with via vSAN heartbeats.
  • Lexluethar Member Posts: 516
    So do you have admission control disabled on the cluster? Sounds like Admission control is enabled and you have an admission control policy set that your hosts cannot support.
  • DevilWAH Member Posts: 2,997
    Lexluethar wrote: »
    So do you have admission control disabled on the cluster? Sounds like Admission control is enabled and you have an admission control policy set that your hosts cannot support.

    As mentioned in my first post I have "do not reserve fail over capacity" for admission control.

    The question is which policy is causing the failure: there are none set up, and the error does not say what it is. We run hardware-backed graphics, which could also be an issue (although all hosts support it). But are there any logs that will say "HA failed because of X subsystem"?
  • Lexluethar Member Posts: 516
    Gotcha - sorry, I did read that, but the way you worded it was that you didn't reserve anything for failover capacity - that's not the same as having admission control disabled. So you have the admission control option disabled for the cluster, got it.

    Is just one host within the cluster causing this issue, or is it at the cluster level? I've had issues before where I received odd messages like that after a failure. The fix was either to put the host into maintenance mode and take it out again, or to remove the host from the cluster and then add it back.
  • iBrokeIT Member Posts: 1,318
    https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.troubleshooting.doc_50%2FGUID-A7C83499-FA81-4893-8376-A4B91AC81FA7.html

    Do all the hosts in your cluster have the HA state on the summary tab listed as "Connected"?

    What does the "View Resource Distribution Chart" look like on the Summary tab of the cluster? / Are there any free slots?

    Are you using cpu/ram reservation?
    2019: GPEN | GCFE | GXPN | GICSP | CySA+ 
    2020: GCIP | GCIA 
    2021: GRID | GDSA | Pentest+ 
    2022: GMON | GDAT
    2023: GREM  | GSE | GCFA

    WGU BS IT-NA | SANS Grad Cert: PT&EH | SANS Grad Cert: ICS Security | SANS Grad Cert: Cyber Defense Ops SANS Grad Cert: Incident Response
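
For the slot and reservation questions above, here is a minimal sketch of how slot counts are usually worked out. It only matters when admission control uses the slot-based "host failures cluster tolerates" policy. The function names, the flat 100 MB memory overhead and the host figures are all hypothetical; the 32 MHz default CPU slot is the commonly documented default and should be treated as an assumption for your version.

```python
from typing import Sequence, Tuple


def slot_size(cpu_reservations_mhz: Sequence[int],
              mem_reservations_mb: Sequence[int],
              default_cpu_mhz: int = 32,      # assumed default when no VM reserves CPU
              mem_overhead_mb: int = 100      # hypothetical flat per-VM overhead
              ) -> Tuple[int, int]:
    """Return (cpu_slot_mhz, mem_slot_mb) from the largest reservations."""
    cpu_slot = max(max(cpu_reservations_mhz, default=0), default_cpu_mhz)
    mem_slot = max(mem_reservations_mb, default=0) + mem_overhead_mb
    return cpu_slot, mem_slot


def slots_per_host(host_cpu_mhz: int, host_mem_mb: int,
                   cpu_slot_mhz: int, mem_slot_mb: int) -> int:
    """A host offers as many slots as its scarcer resource allows."""
    return min(host_cpu_mhz // cpu_slot_mhz, host_mem_mb // mem_slot_mb)


# Hypothetical cluster: one VM reserves 500 MHz and 2048 MB, the other reserves nothing.
cpu_slot, mem_slot = slot_size([0, 500], [0, 2048])
print(cpu_slot, mem_slot)                                   # 500 2148
print(slots_per_host(20000, 65536, cpu_slot, mem_slot))     # 30
```

The point of the sketch is that a single large reservation inflates the slot size for every VM, which is why the reservation question matters when slots look exhausted.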
  • markulous Member Posts: 2,394
    Lexluethar wrote: »
    Gotcha - sorry, I did read that, but the way you worded it was that you didn't reserve anything for failover capacity - that's not the same as having admission control disabled. So you have the admission control option disabled for the cluster, got it.

    Is just one host within the cluster causing this issue, or is it at the cluster level? I've had issues before where I received odd messages like that after a failure. The fix was either to put the host into maintenance mode and take it out again, or to remove the host from the cluster and then add it back.

    I would recommend doing this. Then disable HA and re-enable it. That message really is misleading, because you can overcommit resources quite a bit and it's not going to fail like that.
  • blargoe Member Posts: 4,174
    DevilWAH wrote: »
    I get a lot of "insufficient resources to fail over ..... The host can't access virtual machine components"

    Just to clarify, is this a real production environment, or something you are testing? You "get a lot of" these messages: is it planned/intentional failover, or is it unintended HA attempts that you are troubleshooting?

    If it is some kind of mismatch between hosts on available datastores, configured networks, graphics cards, etc., then vMotion between these hosts for the VMs in question should not be possible. Is that actually working?
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • DevilWAH Member Posts: 2,997
    This is a live environment, and it has happened a few times now: one host fails and the machines fail to move over.

    Sorry, HA is enabled on the cluster, but I have disabled "reserve capacity".

    Now if I disable HA the guests will boot, but not while it is on.

    All hosts are identical: same CPU, same datastores, and the network is configured identically on distributed switches. Oh, and HA is showing as OK on all hosts. And we don't have any CPU or memory reservations.

    Turning it off is not an option, as the point is to ensure guests come back up in the event of a failed host.
  • Lexluethar Member Posts: 516
    Okay, now I'm confused, sorry. So you have HA enabled, which makes sense - what specifically do you have enabled under Admission Control? Is it enabled or disabled? Sounds like you have Admission Control enabled. I don't know about your capacity requirements, but I've found that in small clusters (say 4 or fewer hosts) it's better to disable admission control unless you have an abundance of resources in that cluster. Otherwise, when a host fails, it will NOT allow you to start that VM unless the host is back online (sounds like your issue).
  • markulous Member Posts: 2,394
    vCenter also may not be synced properly, so removing the host from the cluster and re-adding it, then disabling and re-enabling HA on the cluster, may sync everything back up.
  • DevilWAH Member Posts: 2,997
    Lexluethar wrote: »
    Okay, now I'm confused, sorry. So you have HA enabled, which makes sense - what specifically do you have enabled under Admission Control? Is it enabled or disabled? Sounds like you have Admission Control enabled. I don't know about your capacity requirements, but I've found that in small clusters (say 4 or fewer hosts) it's better to disable admission control unless you have an abundance of resources in that cluster. Otherwise, when a host fails, it will NOT allow you to start that VM unless the host is back online (sounds like your issue).

    OK, sorry, I decided to wait till I was in front of the console :)

    HA is turned on,
    Host monitoring is enabled.
    Host hardware Monitoring - VM component protection (protect against storage connectivity loss) is unchecked

    Admission control is Disabled.
  • blargoe Member Posts: 4,174
    What version/update are you running in vCenter and ESXi?
  • DevilWAH Member Posts: 2,997
    Hi,

    Version 6.

    Oh, I believe I have found the issue after reading a few KB articles. It seems that some bright spark has mounted NFS stores that are using the same VMkernel adapter as vMotion and vSAN! And a number of hosts seem to randomly refuse to join the HA cluster as it is enabled/disabled.

    So, as the NFS stores don't need to be there, I was going to delete them, but this throws an error that the store is busy! I hate people who play with things they don't understand, then come running when they break it.
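
A check like the one described above (spotting a VMkernel adapter that carries several kinds of traffic) can be sketched as follows. This is only an illustration under stated assumptions: the adapter names, the hand-entered traffic mapping and the function name are hypothetical, and nothing here queries vCenter or ESXi.

```python
from typing import Dict, List, Set


def shared_vmkernel_adapters(traffic: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return the adapters that carry more than one traffic type."""
    return {
        vmk: sorted(kinds)
        for vmk, kinds in traffic.items()
        if len(kinds) > 1
    }


# Hypothetical inventory for one host, typed in by hand from the host's network settings.
host_traffic = {
    "vmk0": {"management"},
    "vmk1": {"vmotion", "vsan", "nfs"},
}

print(shared_vmkernel_adapters(host_traffic))
# {'vmk1': ['nfs', 'vmotion', 'vsan']}
```

Running the same check over each host's mapping gives a quick list of adapters worth separating before re-testing HA.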
  • DevilWAH Member Posts: 2,997
    Oh, the gits had also put the log files onto one of the "temporary" NFS shares!

    Go away for a week, come back, and spend the first few days sorting out the mess that has been made.
  • Lexluethar Member Posts: 516
    Haha, yeah, those log files should definitely be going to a dedicated location (DAS or on your SAN). Not configured = temp location.

    So when you go to unmount the NFS volume - what specific section do you get the red X on?