Sunday, February 19, 2012

vSphere HA: isolation, partition and failure

I needed to internalise the above, so I made this diagram. The actual process is certainly more comprehensive than this, but I need to keep the chart simple so I can remember it.

The main lesson for me is that isolation is no longer a big deal in 5.0, thanks to datastore heartbeating.

 

Below is the process for a single slave host. It starts when the slave stops receiving the heartbeat from its master, which it should receive every second.

 

Below is the process from the master's point of view. There are two processes here, because a master carries two roles: besides being a master, it also runs VMs just like any slave. A master receives heartbeats from all its slaves, so if it receives a heartbeat from even one of them, the master is not isolated. It might be partitioned, but it is definitely not isolated.
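The master's reasoning about its own network state can be written down as a small rule. This is only a sketch to aid memory, not how the FDM agent is actually implemented. The names `MasterState` and `classify_master` are made up for illustration, and the "no heartbeat at all means isolation is suspected" branch is an assumption, since the text above only covers the case where at least one heartbeat arrives.

```python
from enum import Enum


class MasterState(Enum):
    ALL_SLAVES_HEARD = "heartbeats received from every slave"
    NOT_ISOLATED_MAYBE_PARTITIONED = "some heartbeats missing: maybe partitioned, not isolated"
    ISOLATION_SUSPECTED = "no heartbeats at all (assumption: isolation must be checked)"


def classify_master(expected_slaves, heard_from):
    """Classify the master's own state from the slave heartbeats it received.

    expected_slaves: names of the slaves the master should hear from.
    heard_from: names of the slaves whose heartbeat actually arrived.
    """
    expected = set(expected_slaves)
    heard = set(heard_from) & expected  # ignore heartbeats from unknown hosts

    if not heard:
        # Not covered by the text above; labelled as an assumption.
        # Also reached when there are no slaves to hear from.
        return MasterState.ISOLATION_SUSPECTED
    if heard == expected:
        return MasterState.ALL_SLAVES_HEARD
    # At least one heartbeat arrived, so the master is definitely not isolated.
    return MasterState.NOT_ISOLATED_MAYBE_PARTITIONED


if __name__ == "__main__":
    slaves = ["esx02", "esx03", "esx04"]
    print(classify_master(slaves, ["esx02"]).name)
```

The key point is the last branch: a single heartbeat from any slave is enough to rule out isolation, even if the other slaves are silent.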

The bottom process above covers the failure of a slave. If a master does not receive a heartbeat from a slave, it checks that slave's datastore heartbeat file. It also pings the host's IP address. The ping and the network heartbeat go through the same network, so if the ping works but there is no network heartbeat, the problem is a failure of the FDM agent itself.

If the slave has not updated its datastore heartbeat file, the slave is assumed dead.
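The master's checks on a silent slave can be summarised as a small decision function. Again, this is a simplified sketch under stated assumptions, not VMware's code. The names `SlaveVerdict` and `classify_slave` are hypothetical. Two things are assumptions rather than statements from the text above: the order of the checks (ping is evaluated before the datastore heartbeat), and the verdict for a slave that does not answer ping but still updates its datastore heartbeat file (treated here as isolated or partitioned).

```python
from enum import Enum


class SlaveVerdict(Enum):
    ALIVE = "network heartbeat received"
    AGENT_FAILED = "ping works but no network heartbeat: FDM agent failure"
    ISOLATED_OR_PARTITIONED = "only the datastore heartbeat is updated (assumption)"
    DEAD = "no network heartbeat, no ping, datastore heartbeat not updated"


def classify_slave(network_heartbeat, ping_ok, datastore_heartbeat_updated):
    """Return the master's verdict on one slave from three boolean observations."""
    if network_heartbeat:
        return SlaveVerdict.ALIVE
    if ping_ok:
        # Ping and heartbeat share the same network, so the network is fine
        # and the agent that sends the heartbeat is the likely culprit.
        return SlaveVerdict.AGENT_FAILED
    if datastore_heartbeat_updated:
        # Assumption: the host is still running but unreachable over the network.
        return SlaveVerdict.ISOLATED_OR_PARTITIONED
    return SlaveVerdict.DEAD


if __name__ == "__main__":
    verdict = classify_slave(
        network_heartbeat=False,
        ping_ok=False,
        datastore_heartbeat_updated=False,
    )
    print(verdict.name)
```

Only the last case, where every signal is gone, leads the master to treat the slave as dead.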
