Fault Tolerance (FT)

VMware Fault Tolerance (FT) is a new feature. At its heart is the record/play feature, which began as a programmer's debugging tool: with record/play you can capture all the virtual interrupts that take place inside a VM. This means the recording can be redirected in real time to another VM on a different ESXi server, so the two ESXi servers replay the same events and stay in a synchronous state. This is known as lockstep technology and relies on an attribute of modern CPUs. VMware is working with Intel and AMD to offer support for this feature, which is known as vLockstep.

Fault Tolerance has some advantages and disadvantages:

Advantages
  • Offers real-time protection for VMs
  • Avoids end users being affected by downtime or hardware failure
  • Provides seamless failover without affecting the user's client application
  • Works for all VMs regardless of the software state (stateful or stateless)
  • Protects systems that cannot be given fault tolerance or HA using other vendors' technologies
Disadvantages
  • FT requires modern CPUs that have the lockstep attribute
  • VMware recommends a maximum of 8 VMs (4 primaries and 4 secondaries) per ESXi server
  • The secondary VM consumes CPU and memory resources, but is only used when a failure occurs
  • There is a network overhead to maintain the FT logging network (50 Kbps for each FT-protected VM)
  • Currently FT protects only VMs with one vCPU
  • CPU speeds should not vary too much (less than 400 MHz) between ESXi servers
  • There are many features that you cannot use with FT: VMDirectPath I/O, VM clustering, snapshots, SVMotion, DRS
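
The numeric limits in the list above lend themselves to a quick pre-flight check. The following is a minimal sketch in plain Python, not a VMware API: the function names and inputs are made up for illustration, and the thresholds are simply the figures quoted above.

```python
# Illustrative pre-flight check built only from the limits quoted above.
# Nothing here talks to vCenter; all names are hypothetical.

MAX_FT_VMS_PER_HOST = 8      # recommended maximum (primaries plus secondaries)
MAX_VCPUS = 1                # FT currently protects single-vCPU VMs only
MAX_CPU_DELTA_MHZ = 400      # CPU speeds should differ by less than this
FT_LOGGING_KBPS_PER_VM = 50  # quoted logging overhead per protected VM


def ft_problems(vcpus, primary_mhz, secondary_mhz, ft_vms_on_host):
    """Return a list of reasons why a VM should not be FT protected."""
    problems = []
    if vcpus > MAX_VCPUS:
        problems.append("VM has %d vCPUs, FT supports %d" % (vcpus, MAX_VCPUS))
    if abs(primary_mhz - secondary_mhz) >= MAX_CPU_DELTA_MHZ:
        problems.append("host CPU speeds differ by 400 MHz or more")
    if ft_vms_on_host >= MAX_FT_VMS_PER_HOST:
        problems.append("host already runs the recommended maximum of FT VMs")
    return problems


def ft_logging_kbps(protected_vms):
    """Estimate total FT logging traffic in Kbps for a number of protected VMs."""
    return protected_vms * FT_LOGGING_KBPS_PER_VM
```

An empty list from ft_problems means none of the quoted limits is exceeded; it says nothing about whether the CPU itself supports lockstep, which still has to be checked with VMware.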

Bear in mind that this is new technology, and I presume that many of the disadvantages will be addressed as it matures. You can work around some of them today, for example by using affinity rules to prevent the nodes of a specific multi-node system from residing on the same ESXi server.
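
To make the affinity-rule workaround concrete, here is a small sketch in plain Python that checks a planned placement against a group of VMs that should be kept apart. It only models the idea and is not the rule engine VMware uses; the function name and the VM names in the usage note are hypothetical.

```python
from collections import defaultdict


def anti_affinity_violations(placement, group):
    """Find hosts that hold more than one VM from a keep-apart group.

    placement: dict mapping VM name -> ESXi server name
    group:     VM names that should each run on a different server
    Returns a dict {server: [vm, ...]} containing only the offending servers.
    """
    by_host = defaultdict(list)
    for vm in group:
        if vm in placement:
            by_host[placement[vm]].append(vm)
    return {host: sorted(vms) for host, vms in by_host.items() if len(vms) > 1}
```

For example, with node1 and node2 both placed on vmware1 and node3 on vmware2, the function reports vmware1 as holding two members of the group.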

There are a number of requirements that you need to meet to enable FT

Configuring FT

CPU compatibility is the most challenging aspect of getting FT working. Support is currently limited, but new CPUs should support the lockstep feature as they reach the market. Check with VMware to see if your CPU is supported; I generally just try to enable FT and see whether I get any error messages. Follow the steps below to enable and configure FT.

Enabling FT

First, confirm that certificate management has been set up. This enhances security by making sure the ESXi server is not spoofed; if ESXi servers are added to vCenter with just a username and password, without this certificate check, VMware FT will not start correctly. From the home page -> Administration -> "vCenter server settings" you get the screen below; make sure "vCenter requires verified host SSL certificates" is ticked.

Make sure both VMotion and HA are working, then create a VMkernel port group for FT logging. Every ESXi server requires an additional IP address for this port group. When creating the port group, make sure you select "Use this port group for Fault Tolerance logging".
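
The same tick box can in principle be set through the vSphere API. The sketch below is hedged: it assumes host is a pyVmomi vim.HostSystem object that you have already looked up, and that the VMkernel adapter (vmk1 is only an example name) already exists with its own IP address. The helper name is made up; SelectVnicForNicType and the "faultToleranceLogging" NIC type come from the vSphere API's virtual NIC manager.

```python
FT_LOGGING_NIC_TYPE = "faultToleranceLogging"


def enable_ft_logging(host, vmk_device):
    """Mark an existing VMkernel adapter (for example "vmk1") for FT logging.

    host is assumed to be a pyVmomi vim.HostSystem obtained elsewhere;
    this helper does not create the port group or assign the IP address.
    """
    nic_manager = host.configManager.virtualNicManager
    nic_manager.SelectVnicForNicType(FT_LOGGING_NIC_TYPE, vmk_device)
```

This would need to be run once per ESXi server, as each server needs its own FT logging adapter.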

Hopefully you should end up with something like below

Check that the VM's disks are thick provisioned: select the VM -> select "Edit settings" -> then select each disk and check the Disk Provisioning type. In the screenshot below this virtual disk is type thick. You can convert thin disks to thick to make them compatible with FT.
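
Clicking through every disk gets tedious on VMs with many disks, so here is a rough sketch of the same check in Python. It assumes vm is a pyVmomi vim.VirtualMachine object obtained elsewhere; the helper name is hypothetical. It relies on the thinProvisioned flag that the vSphere API exposes on flat-file disk backings.

```python
def thin_disks(vm):
    """Return the labels of a VM's thin-provisioned virtual disks.

    vm is assumed to be a pyVmomi vim.VirtualMachine. Devices without a
    backing, or whose backing has no thinProvisioned flag, are skipped.
    """
    labels = []
    for device in vm.config.hardware.device:
        backing = getattr(device, "backing", None)
        if getattr(backing, "thinProvisioned", False):
            labels.append(device.deviceInfo.label)
    return labels
```

An empty result means no disk reported itself as thin; any labels returned are the disks to convert before turning on FT.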

Finally we can enable FT on a VM, right-click on the VM -> select Fault Tolerance -> select "turn on Fault Tolerance"

You will see the warning message below, regarding disk provisioning and other information

Here I get two warnings. The first is that this VM has two vCPUs (remember you can only have one vCPU); the second is that my hardware (HP DC7800) is not compatible. However, we will continue.

I can double-check whether the hardware is compatible by selecting the ESXi server. In the General panel you should see "Host Configured for FT" with a small speech bubble at the end; click the speech bubble and you get the "Fault Tolerance requirement error messages". As you can see, my HP DC7800s are not compatible.

After removing one vCPU from the VM I tried again; you can watch the progress in the "recent tasks" window

Although my hardware does not support FT, VMware happily configures it for this VM. Once it is configured, selecting the VM shows an extra "Fault Tolerance" panel. The VM is not running, as it will not let me start it due to the hardware compatibility problem.

If you select the cluster and then the "virtual machines" tab, you will notice that there are two linux01 VMs, the primary and the secondary

Looking at each ESXi server in the "Fault Tolerance" panel you can see which one is the primary and the secondary

vmware1
vmware2

Lastly, you can migrate, disable Fault Tolerance, or turn off Fault Tolerance for this VM

If I had got this working, you could have started the VM on both the primary and the secondary; you would not be able to interact with the secondary VM.

If I get the chance to set up FT on compatible hardware I will revisit this section.