vSphere HA Host Isolation Response in a SimpliVity Design
In this post I am going to look at a few typical SimpliVity deployment topologies and provide a recommendation for the Host Isolation Response setting, along with my reasoning behind the choice. Remember that SimpliVity sits below the hypervisor, so even though I am writing specifically about SimpliVity deployments, the same concepts transfer easily to other types of infrastructure.
By default, vSphere High Availability (HA) uses the default gateway of the Management VMkernel to determine host isolation. Additional isolation addresses can, and should, be set to reduce the possibility of hosts being falsely declared isolated. Details on setting multiple isolation addresses and using an isolation address other than the default gateway can be found in VMware KB 1002117: https://kb.vmware.com/kb/1002117
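As a rough illustration of what the KB describes, the sketch below builds the HA advanced options for a list of isolation addresses. The option names `das.isolationaddress0` through `das.isolationaddress9` and `das.usedefaultisolationaddress` are the vSphere HA advanced options; the function name and the example IP addresses are hypothetical, and the resulting values would still need to be entered in the cluster's HA advanced options by whatever means you normally use.

```python
def build_isolation_options(addresses, use_default_gateway=False):
    """Return a dict of vSphere HA advanced options for isolation addresses.

    The helper name is hypothetical; only the option keys come from vSphere HA.
    """
    if len(addresses) > 10:
        # Numbered options run from das.isolationaddress0 to das.isolationaddress9.
        raise ValueError("at most 10 isolation addresses can be configured")

    options = {
        # "false" tells HA not to use the management default gateway.
        "das.usedefaultisolationaddress": "true" if use_default_gateway else "false",
    }
    for index, address in enumerate(addresses):
        options[f"das.isolationaddress{index}"] = address
    return options


# Example addresses are placeholders from the documentation range.
example_options = build_isolation_options(["192.0.2.1", "192.0.2.2"])
for key, value in sorted(example_options.items()):
    print(f"{key} = {value}")
```

Keeping the default gateway disabled only makes sense when the listed addresses are at least as reliable as the gateway itself.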
A host becomes isolated if the HA agent is unable to access any other hosts in the cluster and if it is unable to ping the configured isolation addresses. The host is still running and VMs can still be running on the host, but the host no longer has connectivity to the networks it tests. The vSphere 6 documentation provides more details on host isolation.
vSphere HA can be configured to respond to a host becoming isolated in 3 ways:
- Disabled (Leave powered on) – This is the default setting
- Shut down and restart VMs – Attempt to gracefully shut down VMs and restart them on a non-isolated host.
- Power off and restart VMs – Power off VMs and restart them on a non-isolated host.
A default host isolation response is set for the cluster in the vSphere HA configuration and this is set to Disabled by default.
Each virtual machine can be configured with a host isolation response that differs from the cluster default. When vSphere HA is configured in a SimpliVity environment, the Host Isolation Response for the SimpliVity OmniCube Virtual Controller (OVC) VM should be configured to Disabled so that it is not shut down during an isolation event.
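The relationship between the cluster default and a per-VM override can be pictured with a small sketch. This is only a model of the lookup described above, not vSphere code; the VM names and the function name are made up for illustration.

```python
DISABLED = "Disabled (Leave powered on)"
SHUT_DOWN = "Shut down and restart VMs"


def effective_isolation_response(vm_name, cluster_default, overrides):
    """Return the per-VM override if one exists, otherwise the cluster default."""
    return overrides.get(vm_name, cluster_default)


# Hypothetical VM names; the OVC carries an override so it stays powered on.
vm_overrides = {"OVC-node-01": DISABLED}

for vm in ("OVC-node-01", "app-server-01"):
    print(vm, "->", effective_isolation_response(vm, SHUT_DOWN, vm_overrides))
```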
Datastore heartbeating is a second mechanism used to determine whether a host is isolated or failed. If the host cannot reach other hosts in the cluster or its isolation addresses, but is still able to write to storage for heartbeating, the host will be determined to be up but isolated. This is also a factor in choosing which isolation response to use.
The following table outlines how vSphere HA determines the state of a host:
HA Agent Reachable | Isolation Addresses Reachable | Datastore Accessible (Heartbeat) | HA Event |
---|---|---|---|
Yes | N/A (isolation addresses are only tested if HA agent connectivity is lost) | Yes | No HA event. Host poweron file indicates it is not isolated. |
No | Yes | Yes | No HA event. Host poweron file indicates it is not isolated. |
No | No | Yes | Host is isolated. The isolated host updates its poweron file to indicate it is isolated, and HA triggers the isolation response. |
No | No | No | Host is failed. HA will restart VMs on surviving hosts. |
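The rows of the table above can be written as a small decision function. This is a sketch of the table only, not of the HA agent's actual implementation; the function name and the returned strings are my own, and combinations the table does not list are reported as such rather than guessed.

```python
def classify_host_state(agent_reachable, isolation_reachable, datastore_heartbeat):
    """Map the three checks from the table to the resulting HA outcome."""
    if agent_reachable:
        # Isolation addresses are not tested while the HA agent is reachable.
        return "no HA event" if datastore_heartbeat else "not covered by the table"
    if isolation_reachable and datastore_heartbeat:
        return "no HA event"
    if not isolation_reachable and datastore_heartbeat:
        return "isolated"
    if not isolation_reachable and not datastore_heartbeat:
        return "failed"
    return "not covered by the table"


# Walk through the four rows of the table.
rows = [
    (True, False, True),
    (False, True, True),
    (False, False, True),
    (False, False, False),
]
for row in rows:
    print(row, "->", classify_host_state(*row))
```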
This first example is typical of a small SimpliVity deployment. We call this a 2+ since there are 2 SimpliVity nodes deployed in a datacenter. With a 2+ SimpliVity deployment there is no requirement for 10 GbE switching; the nodes are directly connected for SimpliVity Storage traffic. Management and virtual machine traffic share 1 GbE connectivity.
With this deployment, if both uplinks to a single host fail, the host will become isolated. Storage is still accessible for datastore heartbeating and the VMs will remain running on the host, but the VMs will not be accessible over the network. For this reason the Response for Host Isolation on the cluster should be set to Shut down and restart VMs or Power off and restart VMs.
If Shut down and restart VMs is selected and a host becomes isolated, the virtual machines running on the host will be gracefully shut down and restarted on the non-isolated host. This can lengthen the time it takes to restart VMs, but it ensures VMs are gracefully shut down before they are restarted on a non-isolated host. If Power off and restart VMs is used, VMs will be powered off immediately and then restarted.
When configuring isolation addresses it is important to not use an address on the SimpliVity Storage network since it is likely this network will remain available even if the management and VM Networks are not, in which case the host would not be identified as isolated.
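A quick sanity check along these lines can catch an isolation address that sits on the storage network. The sketch uses Python's standard `ipaddress` module; the subnet and addresses are placeholders, and the function name is hypothetical.

```python
import ipaddress


def addresses_on_storage_network(isolation_addresses, storage_cidr):
    """Return the isolation addresses that fall inside the storage subnet."""
    storage_network = ipaddress.ip_network(storage_cidr)
    return [
        address
        for address in isolation_addresses
        if ipaddress.ip_address(address) in storage_network
    ]


# Placeholder values: 192.0.2.0/24 stands in for the SimpliVity Storage network.
candidates = ["198.51.100.1", "192.0.2.10"]
offending = addresses_on_storage_network(candidates, "192.0.2.0/24")
if offending:
    print("Do not use as isolation addresses:", offending)
```

The same check could be repeated against the OVC management addresses, which should not be used as isolation addresses either.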
vSphere HA Settings for a 2+ Direct Connect SimpliVity Deployment:
- vSphere HA: Enabled
- Response to Host Isolation: Shut down and restart VMs or Power off and restart VMs
- Datastore for Heartbeating: Automatically select datastores accessible from host
Another common SimpliVity deployment is an all 10 GbE deployment. In this topology all network traffic (VM Network, Management, and Storage) is logically separated but is converged on the same physical 10 GbE NICs and 10 GbE switch fabric. In this deployment if a host becomes isolated it is likely management networks, virtual machine networks, and storage networks will be impacted.
If the network is only partially impacted and a host becomes isolated (not able to reach other hosts in the cluster and not able to reach any of its configured isolation addresses), the Response for Host Isolation should be set to Shut down and restart VMs or Power off and restart VMs. If the host is determined to be failed (unable to reach other hosts in the cluster, unable to reach the isolation addresses, and unable to write datastore heartbeats), vSphere HA will restart the VMs on the surviving hosts in the cluster.
vSphere HA Settings for an all 10 GbE SimpliVity Deployment:
- vSphere HA: Enabled
- Response to Host Isolation: Shut down and restart VMs or Power off and restart VMs
- Datastore for Heartbeating: Automatically select datastores accessible from host
The final SimpliVity deployment topology example, which is common in enterprise deployments, is physical separation of traffic types. Management Networks, VM Networks, and SimpliVity Storage networks are physically separated. In this example the Management Network is on one physical switch fabric with NICs dedicated to management. SimpliVity storage and VM networks are converged, logically separated, and sharing physical 10 GbE switch fabric and 10 GbE NICs.
Things can get a bit complicated here. If the Management Network is used to determine isolation, a management network failure is unlikely to affect the other networks. If a host becomes isolated because the management network is unavailable, but datastore heartbeats are still received, virtual machine workloads will continue to run and remain accessible. In this case the Response for Host Isolation should be set to Disabled, so that virtual machines on an isolated host remain powered on and are not restarted on other hosts. If datastore heartbeats are no longer received from a host, the host will be identified as failed and its virtual machines will be restarted on other available hosts.
vSphere HA Settings for a SimpliVity Deployment with physical network separation:
- vSphere HA: Enabled
- Response to Host Isolation: Disabled (Leave powered on)
- Datastore for Heartbeating: Automatically select datastores accessible from host
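The three recommendations above come down to one question: when the management network is lost, do the VMs lose their network too? The sketch below encodes that reasoning for the topologies in this post. The topology labels and function name are my own shorthand, and real designs may need a different answer depending on availability requirements.

```python
DISABLED = "Disabled (Leave powered on)"
RESTART = "Shut down and restart VMs or Power off and restart VMs"

# True means VM traffic shares physical connectivity with management,
# so an isolated host most likely has unreachable VMs as well.
VM_NETWORK_SHARES_FATE = {
    "2+ direct connect": True,
    "all 10 GbE": True,
    "physical network separation": False,
}


def recommend_isolation_response(topology):
    """Return the isolation response suggested in this post for a topology."""
    if VM_NETWORK_SHARES_FATE[topology]:
        return RESTART
    return DISABLED


for name in VM_NETWORK_SHARES_FATE:
    print(f"{name}: {recommend_isolation_response(name)}")
```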
These are just a few examples of configuring vSphere HA Isolation Responses in a SimpliVity environment. As with any deployment, the business availability requirements will play a large part in determining when, how, and how fast virtual machine workloads should fail over in an HA event. There may be other configurations based on different availability requirements.
SimpliVity vSphere HA Configuration Tips:
- Ensure Management Networks, VM Networks, and SimpliVity Storage Networks are highly available with no single points of failure.
- Set VM Override for Response to Host Isolation on OVC VMs to Disabled.
- Do not use isolation addresses on the SimpliVity Storage network.
- Do not use OVC management addresses as isolation addresses.
- Configure vSphere HA Datastore for Heartbeating to Automatically select datastores accessible from host.
- Choose a Response to Host Isolation which ensures virtual machines will be available during an isolation event.
For a deeper dive into the fundamentals of vSphere HA and how it functions, check out @DuncanYP’s VMware vSphere HA Deepdive.