failed host didn't initiate HA

slinuxuzer Member Posts: 665 ■■■■□□□□□□
So I had an interesting failure today. I have a two-host cluster running 5.0 Enterprise Plus with NFS storage (NetApp) on ProLiant DL380s. I started getting alerts for failed servers, logged into vSphere, and found my host had lost almost all network connectivity: all storage was showing unavailable, I couldn't get CDP off any of my uplinks, and iLO was inaccessible. At that point I determined that the hardware had died and HA didn't restart my VMs. I right-clicked the host in question in vSphere and rebooted it; it rebooted and then triggered HA immediately.

So I wanted to poll some opinions here about how best to mitigate this failure scenario in the future. Today I don't restart VMs based on guest OS (GOS) heartbeats.

In your opinion, what is the best way to automate recovery for a host that's lost everything except its management network?

Thanks in advance all.

Comments

  • Essendon Member Posts: 4,546 ■■■■■■■■■■
    I believe the default isolation address is the default gateway of the management network, so if you were to change the isolation address to something else, you could HA your VMs over to another host. I don't know how to automate the recovery, though, if the isolation address is left at the default.
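
    Setting a custom isolation address can be scripted. A rough, untested pyVmomi sketch of the idea (the vCenter name, credentials, cluster name "Prod", and the 10.0.10.5 address are all placeholders to swap for your environment):

        # Point HA isolation detection at a custom address instead of the mgmt gateway
        import ssl
        from pyVim.connect import SmartConnect, Disconnect
        from pyVmomi import vim

        ctx = ssl._create_unverified_context()  # lab shortcut; validate certs in production
        si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                          pwd="password", sslContext=ctx)
        content = si.RetrieveContent()

        # Find the HA cluster by name
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == "Prod")

        # das.isolationaddress0 supplies the address HA pings when it suspects isolation;
        # das.usedefaultisolationaddress=false stops it from pinging the mgmt gateway
        spec = vim.cluster.ConfigSpecEx()
        spec.dasConfig = vim.cluster.DasConfigInfo()
        spec.dasConfig.option = [
            vim.option.OptionValue(key="das.isolationaddress0", value="10.0.10.5"),
            vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
        ]
        cluster.ReconfigureComputeResource_Task(spec, modify=True)
        Disconnect(si)
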
    NSX, NSX, more NSX..

    Blog >> http://virtual10.com
  • dave330i Member Posts: 2,091 ■■■■■■■■■■
    With IP-based storage, you can switch the default isolation address to the SAN IP(s). Usually the mgmt VLAN and SAN VLAN are different, so the mgmt GW would have to route traffic to the SAN VLAN. This would test both the mgmt GW and SAN connectivity. The drawback is that the SAN VLAN now has to be routable. You could lock it down so that only the ESXi hosts have access to the SAN VLAN.

    An alternative to making the SAN VLAN routable is to enable mgmt traffic on the SAN uplink. The drawback is that you're not testing the mgmt GW.
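
    That second option can also be scripted. A rough, untested pyVmomi sketch (it assumes "vmk1" is the NFS/storage vmkernel port on every host, which you'd want to verify first; the vCenter name and credentials are placeholders):

        # Enable management traffic on the storage vmkernel port of each host
        import ssl
        from pyVim.connect import SmartConnect, Disconnect
        from pyVmomi import vim

        ctx = ssl._create_unverified_context()  # lab shortcut; validate certs in production
        si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                          pwd="password", sslContext=ctx)
        content = si.RetrieveContent()

        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.HostSystem], True)
        for host in view.view:
            # Tag vmk1 for management traffic in addition to its storage role, so HA
            # heartbeats and isolation pings have a second network path to use
            host.configManager.virtualNicManager.SelectVnicForNicType("management", "vmk1")
        Disconnect(si)
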
    2018 Certification Goals: Maybe VMware Sales Cert
    "Simplify, then add lightness" -Colin Chapman
  • slinuxuzer Member Posts: 665 ■■■■□□□□□□
    dave330i wrote: »
    With IP-based storage, you can switch the default isolation address to the SAN IP(s). Usually the mgmt VLAN and SAN VLAN are different, so the mgmt GW would have to route traffic to the SAN VLAN. This would test both the mgmt GW and SAN connectivity. The drawback is that the SAN VLAN now has to be routable. You could lock it down so that only the ESXi hosts have access to the SAN VLAN.

    An alternative to making the SAN VLAN routable is to enable mgmt traffic on the SAN uplink. The drawback is that you're not testing the mgmt GW.

    Awesome, that's kind of what I figured. Do you know if using GOS heartbeats will restart my VMs in the event I lose storage? I've never tested this.
    dave330i Member Posts: 2,091 ■■■■■■■■■■
    slinuxuzer wrote: »
    Awesome, that's kind of what I figured. Do you know if using GOS heartbeats will restart my VMs in the event I lose storage? I've never tested this.

    It should try to restart, but it will fail since the host can't access its storage. It won't initiate an HA failover.
    2018 Certification Goals: Maybe VMware Sales Cert
    "Simplify, then add lightness" -Colin Chapman
  • blargoe Member Posts: 4,174 ■■■■■■■■■□
    Not knowing the details of your environment, is it possible that your hosts only detected a "partition" rather than an "isolation" event?

    Have you checked out Duncan Epping's HA deep dive article?

    vSphere High Availability (HA) Technical Deepdive - Yellow Bricks
    IT guy since 12/00

    Recent: 11/2019 - RHCSA (RHEL 7); 2/2019 - Updated VCP to 6.5 (just a few days before VMware discontinued the re-cert policy...)
    Working on: RHCE/Ansible
    Future: Probably continued Red Hat Immersion, Possibly VCAP Design, or maybe a completely different path. Depends on job demands...
  • TheProf Users Awaiting Email Confirmation Posts: 331 ■■■■□□□□□□
    Another option could be adding a second host isolation address.
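
    With a second address configured, the host only declares itself isolated if none of the configured addresses respond. A rough, untested pyVmomi sketch (cluster name "Prod" and the 10.0.20.5 address are placeholders; das.usedefaultisolationaddress is left at its default, so the mgmt gateway is still checked as well):

        # Add a second isolation address alongside the default gateway check
        import ssl
        from pyVim.connect import SmartConnect, Disconnect
        from pyVmomi import vim

        ctx = ssl._create_unverified_context()  # lab shortcut; validate certs in production
        si = SmartConnect(host="vcenter.local", user="administrator@vsphere.local",
                          pwd="password", sslContext=ctx)
        content = si.RetrieveContent()

        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.ClusterComputeResource], True)
        cluster = next(c for c in view.view if c.name == "Prod")

        spec = vim.cluster.ConfigSpecEx()
        spec.dasConfig = vim.cluster.DasConfigInfo()
        spec.dasConfig.option = [
            # Pinged in addition to the default gateway during isolation detection
            vim.option.OptionValue(key="das.isolationaddress1", value="10.0.20.5"),
        ]
        cluster.ReconfigureComputeResource_Task(spec, modify=True)
        Disconnect(si)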