HA is VMware's ability to power on VMs on another ESX host when one or more hosts in a cluster fail. Some of the things you need to know about VMware HA are:
- HA Calculations: HA is based on what is called a “slot”. A slot is basically a reserved space for running a VM in the cluster. Two factors determine a slot: the CPU and RAM reserved for a VM. By default, VMware HA sets the slot size to the largest CPU and RAM reservation of any powered-on VM, as shown in the diagram below. HA uses the slot size to determine how many slots are available in the cluster to run VMs during an HA event. The default selection can be changed by modifying the following advanced attributes:
das.slotMemInMB – Overrides the memory value (in MB) used for slot calculations
das.slotCpuInMHz – Overrides the CPU value (in MHz) used for slot calculations
- The first 5 hosts in a cluster are designated as primary nodes for HA. One of the primary nodes is selected as the active primary.
- The HA agent is installed on each ESX host and polls for a heartbeat every second by default.
- A failure is detected if no heartbeat is received within 15 seconds.
- Isolation: after 12 seconds without heartbeats, a host checks its isolation network; if that check also fails, the host is designated as isolated. The Service Console default gateway (on ESX) or the VMkernel default gateway (on ESXi) is pinged to detect isolation. You can add other destinations to the detection process by setting das.isolationaddress.
- Admission Control Policy: the ability to reserve resources for HA. You can use the host and VM Restart Priority settings to control the admission policy. VM Restart Priority levels are defined as:
- High – First to start. Use for database servers that will provide data for applications.
- Medium – Start second. Application servers that consume data in the database and provide results on web pages.
- Low – Start last. Web servers that receive user requests, pass queries to application servers, and return results to users.
- Disable admission control: If you select this option, virtual machines can, for example, be powered on even if that causes insufficient failover capacity. When this is done, no warnings are presented and the cluster does not turn red. If a cluster has insufficient failover capacity, VMware HA can still perform failovers, and it uses the VM Restart Priority setting to determine which virtual machines to power on first.
- Enable PortFast (Cisco) on the switches as a best practice to bypass spanning tree delays.
- VM Monitoring – HA also has the ability to monitor VMs but requires that VMware Tools be installed and running. Occasionally, virtual machines that are still functioning properly stop sending heartbeats. To avoid unnecessarily resetting such virtual machines, the VM Monitoring service also monitors a virtual machine's I/O activity. If no heartbeats are received within the failure interval, the I/O stats interval (a cluster-level attribute) is checked. The I/O stats interval determines if any disk or network activity has occurred for the virtual machine during the previous two minutes (120 seconds). If not, the virtual machine is reset. This default value (120 seconds) can be changed using the advanced attribute das.iostatsInterval.
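The VM Monitoring decision described above can be sketched in a few lines of Python. This is only an illustration of the logic, not VMware code: the function name and its parameters are made up for the example, and the failure interval is passed in explicitly because its value depends on the monitoring sensitivity configured for the cluster. Only the 120-second default comes from the notes above.

```python
def should_reset_vm(seconds_since_heartbeat: float,
                    seconds_since_io: float,
                    failure_interval: float,
                    iostats_interval: float = 120.0) -> bool:
    """Return True if VM Monitoring would reset the VM (illustrative model).

    failure_interval  -- hypothetical parameter: how long heartbeats may be missing
    iostats_interval  -- models das.iostatsInterval (default 120 seconds)
    """
    # Heartbeats are still arriving within the failure interval: the VM is healthy.
    if seconds_since_heartbeat < failure_interval:
        return False
    # Heartbeats stopped: reset only if there was also no disk or network
    # activity during the I/O stats interval.
    return seconds_since_io >= iostats_interval


# Heartbeats missing for 45 s and no I/O for 200 s: the VM is reset.
print(should_reset_vm(45, 200, failure_interval=30))
# Heartbeats missing, but I/O was seen 10 s ago: the VM is left alone.
print(should_reset_vm(45, 10, failure_interval=30))
```

Raising `iostats_interval` makes the check more conservative: a VM that has shown any I/O within that longer window is not reset even when its heartbeats have stopped.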
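To make the slot calculation from the HA Calculations item more concrete, here is a rough Python sketch. It is a simplified model under stated assumptions, not how HA is implemented: the `VM` and `Host` classes, the sample numbers, and the function names are all hypothetical, memory overhead is ignored, and the das.slotCpuInMHz / das.slotMemInMB overrides are modelled as upper bounds on the slot size.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VM:
    name: str
    cpu_mhz: int   # CPU figure HA counts for this VM (illustrative)
    mem_mb: int    # memory figure HA counts for this VM (illustrative)


@dataclass
class Host:
    name: str
    cpu_mhz: int   # CPU capacity available for VMs
    mem_mb: int    # memory capacity available for VMs


def slot_size(vms: List[VM],
              cpu_cap: Optional[int] = None,
              mem_cap: Optional[int] = None) -> Tuple[int, int]:
    """Slot = largest CPU and largest memory value across all VMs.

    cpu_cap / mem_cap model das.slotCpuInMHz / das.slotMemInMB as
    upper bounds (an assumption made for this sketch).
    """
    cpu = max(vm.cpu_mhz for vm in vms)
    mem = max(vm.mem_mb for vm in vms)
    if cpu_cap is not None:
        cpu = min(cpu, cpu_cap)
    if mem_cap is not None:
        mem = min(mem, mem_cap)
    return cpu, mem


def slots_per_host(host: Host, slot: Tuple[int, int]) -> int:
    """A host holds as many slots as its scarcer resource allows."""
    cpu, mem = slot
    return min(host.cpu_mhz // cpu, host.mem_mb // mem)


def cluster_slots(hosts: List[Host], slot: Tuple[int, int],
                  host_failures: int = 0) -> int:
    """Total slots left after losing the `host_failures` largest hosts."""
    per_host = sorted(slots_per_host(h, slot) for h in hosts)
    keep = max(0, len(per_host) - host_failures)
    return sum(per_host[:keep])


vms = [VM("db", 2000, 4096), VM("app", 1000, 2048), VM("web", 500, 1024)]
hosts = [Host("esx1", 10000, 32768), Host("esx2", 10000, 32768),
         Host("esx3", 8000, 16384)]

slot = slot_size(vms)                              # (2000, 4096)
print(cluster_slots(hosts, slot))                  # 14 slots in total
print(cluster_slots(hosts, slot, host_failures=1)) # 9 slots if the largest host fails
```

In this made-up cluster a single large VM sets the slot size for everyone; capping the slot with `slot_size(vms, cpu_cap=1000, mem_cap=2048)` doubles the total slot count to 28, which is the effect the advanced attributes are meant to give.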