High Availability (HA)
VMware's high availability has a simple goal: if an ESXi server crashes, all the VMs running on that server go down with it; the other ESXi servers within the cluster detect this and restart those VMs on the remaining ESXi servers. Once the failed ESXi server has been repaired and is back online, the cluster will be rebalanced if DRS is enabled and set to automatic. The features regarding HA are covered below.
Normally in a cluster you have a redundancy of N+1 or more; for example, if you need five ESXi servers to support your environment, then you should have six ESXi servers within the cluster. The additional ESXi server covers you during a server crash or when you need to update/repair a server, so there is no degradation of your services.
If you have previous experience of clusters you will have heard of split-brain. Basically this means that one or more ESXi servers become orphaned from the cluster due to network issues; such a server is also known as an isolated host. The problem with the split is that each part thinks it is the real cluster. VMware's default behavior is that the isolated host powers off all its VMs, so the locks on the VM files become available for other ESXi servers to use. So how does an ESXi server know that it is isolated? You can configure a default gateway, and if the server cannot reach this gateway then there is a problem; you can also use an alternative IP device as a ping source (an isolation address). Try to make sure that you have redundancy built into your Service Console/management network (multiple NICs or even a second Service Console).
With version 4 you must have at least one management port enabled for HA for the HA agent to start without producing error messages; try to set up this port on the most reliable network.
To configure a management port, follow below
Management port | Confirm that you have set up the management network on each of the ESXi servers that you wish to use with HA |
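As a quick sanity check, the sketch below uses pyVmomi (the vSphere Python SDK) to list which VMkernel interfaces are tagged for management on each host in a cluster. The vCenter address, credentials and inventory path ("DC1/host/Cluster01") are placeholders for illustration only, not values from this environment.

```python
# Sketch: list the VMkernel NICs tagged for "management" on each host in a cluster.
# vCenter address, credentials and the inventory path are placeholders/assumptions.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com",                        # placeholder vCenter
                  user="administrator@vsphere.local",
                  pwd="password",
                  sslContext=ssl._create_unverified_context())  # lab only: skip cert checks
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

for host in cluster.host:
    net_cfg = host.configManager.virtualNicManager.QueryNetConfig("management")
    selected = set(net_cfg.selectedVnic or [])
    vnics = [v.device for v in (net_cfg.candidateVnic or []) if v.key in selected]
    print(f"{host.name}: management vmkernel ports -> {vnics or 'NONE'}")
```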
Now that you have a Management Port Group set up you are ready to configure HA. You should have a cluster already set up (if not then see my Clustering and DRS section); select the cluster, then the configure tab and then vSphere Availability. When you first look at HA there will be a number of alerts indicating that some configuration needs to be completed.
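The same settings can also be driven through the vSphere API. As a minimal sketch using pyVmomi (connection details and the inventory path are placeholders), simply turning HA on for a cluster looks roughly like this:

```python
# Sketch: enable HA (the DAS service) on a cluster via pyVmomi.
# vCenter address, credentials and inventory path are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

# Enable HA with host monitoring; all other settings keep their defaults here
das = vim.cluster.DasConfigInfo(enabled=True, hostMonitoring="enabled")
spec = vim.cluster.ConfigSpecEx(dasConfig=das)
task = cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
print("Reconfigure task started:", task.info.key)
```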
There are two HA configurations that you can set up: normal HA and Proactive HA. We will start with normal HA. The first screen deals with Failures and Responses, where there are a number of conditions you can set: what happens on a host failure (ESXi failure), what to do with a host that is isolated, VM Component Protection such as Permanent Device Loss (PDL) and All-Paths-Down (APD), and VM monitoring.
Permanent Device Loss (PDL) is where the storage array reports that a device is permanently gone (for example a LUN has been removed), whereas All-Paths-Down (APD) is where the host loses all paths to the device and does not know whether the loss is temporary or permanent; for each of these you can choose whether affected VMs are simply flagged with a warning or restarted on another host.
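To give an idea of how these responses map onto the API, here is a rough pyVmomi sketch (placeholder connection details again) setting the host failure restart priority, the isolation response, the PDL/APD responses and VM monitoring; the values shown are examples, not recommendations:

```python
# Sketch: configure HA "Failures and Responses" settings via pyVmomi.
# Connection details and the chosen values are placeholders/examples.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

vm_defaults = vim.cluster.DasVmSettings(
    restartPriority="medium",                  # response to a host failure
    isolationResponse="powerOff",              # response when a host is isolated
    vmComponentProtectionSettings=vim.cluster.VmComponentProtectionSettings(
        vmStorageProtectionForPDL="restartAggressive",   # Permanent Device Loss
        vmStorageProtectionForAPD="restartConservative"  # All-Paths-Down
    ),
    vmToolsMonitoringSettings=vim.cluster.VmToolsMonitoringSettings(
        enabled=True, vmMonitoring="vmMonitoringOnly"    # VM monitoring via VMware Tools heartbeats
    )
)

das = vim.cluster.DasConfigInfo(
    enabled=True,
    hostMonitoring="enabled",
    vmMonitoring="vmMonitoringOnly",
    vmComponentProtecting="enabled",           # turn on VM Component Protection (PDL/APD)
    defaultVmSettings=vm_defaults
)
cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```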
The next section, Admission Control, is a policy that ensures failover capacity within a cluster, for example how many ESXi server failures the cluster can tolerate, how much capacity is reserved to cover a lost ESXi server, etc. Most of the options are self-explanatory and are determined by how many ESXi servers are within the cluster.
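As a sketch of how admission control might be set programmatically (placeholder connection details; the failover level of one host is just an example), reserving capacity for a single host failure could look like:

```python
# Sketch: enable admission control tolerating one host failure (pyVmomi).
# Connection details and the failover level are placeholders/examples.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

das = vim.cluster.DasConfigInfo(
    enabled=True,
    admissionControlEnabled=True,
    # "Host failures cluster tolerates" style policy; a percentage-based policy
    # (FailoverResourcesAdmissionControlPolicy) is another option.
    admissionControlPolicy=vim.cluster.FailoverLevelAdmissionControlPolicy(
        failoverLevel=1)
)
cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```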
Next we come to heartbeat datastores. These datastores are used to monitor the ESXi servers and VMs when a network issue occurs, which helps with split-brain situations where ESXi servers become detached from the cluster. Two datastores are used, and you have a few options: either let vCenter select them automatically or specify which ones to use.
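A rough pyVmomi sketch of the heartbeat datastore policy follows (placeholders again; "Datastore01" is an assumed datastore name):

```python
# Sketch: prefer a specific heartbeat datastore but let vCenter add more if needed.
# Connection details and the datastore name are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

preferred = [ds for ds in cluster.datastore if ds.name == "Datastore01"]

das = vim.cluster.DasConfigInfo(
    enabled=True,
    # allFeasibleDsWithUserPreference = use the specified list and complement
    # automatically if needed; other values are allFeasibleDs and userSelectedDs.
    hBDatastoreCandidatePolicy="allFeasibleDsWithUserPreference",
    heartbeatDatastore=preferred
)
cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```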
Lastly we have the advanced options, which are parameters you can set for the cluster; this is where VMware may ask you to add entries that are beyond the normal configuration.
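For example, the additional isolation addresses mentioned earlier are set through these advanced options; a pyVmomi sketch (placeholder connection details and IP address) might look like:

```python
# Sketch: add HA advanced options, here an extra isolation address (pyVmomi).
# Connection details and the IP address are placeholders.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

das = vim.cluster.DasConfigInfo(
    enabled=True,
    option=[
        vim.option.OptionValue(key="das.isolationaddress0", value="192.168.1.254"),
        vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    ]
)
cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True)
```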
Next we have the Proactive HA section, which allows vCenter to take a more proactive approach to handling issues within the cluster. The first screen allows you to either take a manual approach or let vCenter handle the situation automatically, moving VMs to surviving ESXi servers within the cluster. The remediation setting determines what happens to partially degraded hosts.
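Proactive HA is exposed in the API as the InfraUpdateHa configuration; the sketch below (placeholder connection details, and intended as an illustration of the structure rather than a tested recipe) enables it in automated mode with quarantine-mode remediation for partially degraded hosts:

```python
# Sketch: enable Proactive HA (InfraUpdateHa) in automated mode via pyVmomi.
# Connection details are placeholders; check the API reference for the enum
# values supported by your vCenter version.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vc.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
cluster = si.content.searchIndex.FindByInventoryPath("DC1/host/Cluster01")

proactive = vim.cluster.InfraUpdateHaConfigInfo(
    enabled=True,
    behavior="Automated",                  # or "Manual"
    moderateRemediation="QuarantineMode",  # partially degraded hosts
    severeRemediation="MaintenanceMode"    # severely degraded hosts
)
cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(infraUpdateHaConfig=proactive), modify=True)
```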
Last we come to the Providers screen. Providers appear when their corresponding vSphere Web Client plugin has been installed, and they monitor every host in the cluster; for example, if you have installed ESXi on a Dell server you should see the Dell provider. These providers watch the specific hardware for failure conditions and communicate with vCenter to warn that issues may occur shortly, so that action can be taken before the ESXi server fails; in other words, a controlled action is taken before a hardware failure occurs. In my case I am using nested ESXi servers, so no provider exists as this is an unsupported configuration.
Once you have configured the various options you are looking to get the HA/Proactive HA screen showing as much green as possible.
The methods to test a cluster are
Try to test as much as you can before the solution goes live, as it will be a lot harder, and far riskier, to test all the HA/Proactive HA features once live customers' VMs are running.