How VSAN Handles a Disk or Host Failure

In a traditional storage system, such as an iSCSI server, a failure or loss of access to the server can leave you unable to reach your data, and rebuilding the cluster could take hours or days. A VSAN-based storage system is designed to avoid this outcome by keeping data available when a component fails. One of the biggest challenges facing a SAN (or any hypervisor-based storage system) is the effect of host failure. The host running a VM could fail at any time. A host has several components that can fail, and each of them must be handled properly. Typically, a host failure is treated differently from a disk failure in terms of its impact on the storage system itself.

Host Failures

Host failure is a particular concern for virtual machines that run critical systems (e.g., Exchange) on virtual hardware. It is not a normal event and can be very disruptive. Let’s walk through how a traditional SAN handles the failure of a host.

  • The hypervisor detects the host failure. The hypervisor must check that the disk is valid, and if it is, the disk is handed over to the storage array. The storage array can check the disk and remove any invalid data (e.g., old data from previous VMs, bad sectors) before a new VM is started. The storage array must also notify the hypervisor that the host has failed and that the SAN disk will be resynced to a new host. When the hypervisor receives this information, it notifies any systems that depend on the state of the host (e.g., virtual machines) so that they can make appropriate changes.
  • The VM is stopped and its disk moves to a new host. The storage array notifies the hypervisor that the host has failed. The hypervisor notifies any VM running on the host and stops it, then switches the SAN disk to the new host. At this point, the storage array needs to resync the SAN disk, which may involve a significant amount of data. The resync will likely follow the same process used to sync a new VM. This is a good example of how a host failure can affect a SAN: the resync will be slow and will likely degrade the performance of the entire system.
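To get a feel for why the resync step dominates, the sequence above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Disk, resync_hours, and fail_over are hypothetical, and the model assumes the whole used capacity is copied at a steady rate, which a real array may not do.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    used_gib: float  # data that must be copied during a resync (assumed: all of it)


def resync_hours(disk: Disk, throughput_mib_s: float) -> float:
    """Estimate how long a full resync of `disk` takes at a steady rate."""
    if throughput_mib_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = disk.used_gib * 1024 / throughput_mib_s  # GiB -> MiB, then divide by rate
    return seconds / 3600


def fail_over(vms_on_host: list[str], disk: Disk, new_host: str,
              throughput_mib_s: float) -> list[str]:
    """Return the ordered steps taken after a host failure (illustrative only)."""
    steps = [f"stop {vm}" for vm in vms_on_host]          # hypervisor stops the VMs
    steps.append(f"switch {disk.name} to {new_host}")      # disk moves to the new host
    hours = resync_hours(disk, throughput_mib_s)
    steps.append(f"resync {disk.name} (~{hours:.1f} h)")   # array resyncs the disk
    return steps
```

With these assumptions, a disk holding 2 TiB (2048 GiB) resynced at 200 MiB/s would take roughly 2.9 hours, during which the array is busy copying data.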

How Does VSAN Technology Deal With the Failure of a Disk or Host?

How does VMware’s Virtual SAN (VSAN) technology deal with the failure of a disk or host? You might be surprised to learn that it does not repair the failed component. VMware makes it clear that the Virtual SAN product is designed to tolerate the failure of disks or hosts, but tolerating a failure is not the same as ignoring it: the failed disk or host still needs attention.

VSAN cannot bring a failed disk or a removed host back into service on its own; the hardware still has to be repaired or replaced. However, as long as the failure stays within what the cluster is configured to tolerate, the failure of a disk or host does not result in a service interruption, and any data that VSAN can still read is not destroyed in the process. The failure of a disk or host configured in a VSAN cluster or VSAN file system does not result in the loss of the VMs that were using that disk or host.

VSAN continues to offer its features and support functions even if a disk or host in the cluster fails. If a disk or host in a VSAN cluster cannot read or store data, VSAN attempts to access another storage pool to which that disk or host is attached. If VSAN determines that the other storage pool cannot accommodate the disk or host, it stops trying to access the disk or host altogether.

When VSAN detects the failure of a storage pool, it reads the pool, requests that the pool be reallocated to another pool, and then automatically re-registers the pool with a new, healthy disk or host. The reallocation moves the data stored by the pool to the healthy disk or host. VSAN does not detect the failure of a storage pool that is no longer part of a cluster or file system. Handled this way, the failure of a storage pool is not expected to interrupt service or cause data loss.
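The reallocation step can be pictured as a small placement problem. The sketch below is not VSAN's actual algorithm; it is a minimal model, with hypothetical names (Pool, reallocate), that moves each object from a failed pool to the healthy pool with the most free space and leaves an object in place when nothing can hold it.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    capacity_gib: int
    healthy: bool = True
    objects: dict[str, int] = field(default_factory=dict)  # object name -> size in GiB

    @property
    def free_gib(self) -> int:
        return self.capacity_gib - sum(self.objects.values())


def reallocate(failed: Pool, candidates: list[Pool]) -> dict[str, str]:
    """Move objects off `failed`; return {object name: destination pool name}."""
    placement: dict[str, str] = {}
    # Place the largest objects first so they are not starved of space.
    for obj, size in sorted(failed.objects.items(), key=lambda kv: -kv[1]):
        targets = [p for p in candidates
                   if p.healthy and p is not failed and p.free_gib >= size]
        if not targets:
            continue  # no healthy pool has room: the object stays where it is
        dest = max(targets, key=lambda p: p.free_gib)
        dest.objects[obj] = size
        del failed.objects[obj]
        placement[obj] = dest.name
    return placement
```

Choosing the emptiest healthy pool is just one simple policy for the sketch; a real system would also weigh fault domains and ongoing load.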

VSAN also detects the failure of a storage pool, disk, or host for which a redundant or spare disk or host is configured. It attempts to use the spare to complete the read or write operations that the failed disk or host was performing, until either the spare can no longer read or write data or the service interruption threshold has been reached. At that point, VSAN uses another disk or host in the cluster or file system. If the spare cannot complete the read or write operation, VSAN falls back to the remaining disk or host only if the other disks or hosts cannot access the storage pool.
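The fallback order described above (spare first, then another disk or host, then the remaining one as a last resort) can be written as a short decision function. This is a sketch of that ordering only; Device, pick_device, and the interruption counter are hypothetical names introduced for illustration, not VSAN interfaces.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Device:
    name: str
    healthy: bool = True     # can it still read and write data?
    sees_pool: bool = True   # can it reach the storage pool?


def pick_device(spare: Device, others: Sequence[Device], remaining: Device,
                interruptions: int, threshold: int) -> Optional[Device]:
    """Choose which device should complete the pending read or write."""
    # 1. Prefer the spare while it works and the threshold has not been reached.
    if spare.healthy and interruptions < threshold:
        return spare
    # 2. Otherwise use another device in the cluster that can reach the pool.
    for dev in others:
        if dev.healthy and dev.sees_pool:
            return dev
    # 3. Use the remaining device only when none of the others can reach the pool.
    if remaining.healthy:
        return remaining
    return None  # nothing can serve the operation
```

Returning None makes the "no device available" case explicit, so a caller has to decide what to do when the cluster has run out of options.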

If a disk or host failure causes data loss, the data is not rewritten to the same storage pool unless the disk or host is still part of the cluster or file system. When the failure of a disk or host is detected, VSAN attempts to reallocate the data stored by the pool to another pool and then automatically re-registers the pool with the new, healthy disk or host.

Final Say!

Knowing how VSAN handles different types of system failure is the basis of VMware vSAN troubleshooting. In failure mode, VSAN stores a copy of the affected storage pool’s data on another storage pool, at a location and time that VSAN determines. Both the time and the location of the copy are written to the log, so you can locate the data easily and review the actions the system took.
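As a rough picture of how such a log could be searched, the sketch below writes one JSON line per copy and then looks up where and when a given pool was copied. The record layout and the names copy_record and find_copies are assumptions made for this example; they do not reflect the actual format of VSAN logs.

```python
import json
from datetime import datetime, timezone


def copy_record(pool: str, destination: str, when: datetime) -> str:
    """Serialize one 'copy made' event as a JSON log line (hypothetical format)."""
    return json.dumps({
        "event": "pool_copy",
        "pool": pool,
        "destination": destination,
        "time": when.astimezone(timezone.utc).isoformat(),
    })


def find_copies(log_lines: list[str], pool: str) -> list[tuple[str, str]]:
    """Return (time, destination) for every copy of `pool` found in the log."""
    hits = []
    for line in log_lines:
        entry = json.loads(line)
        if entry.get("event") == "pool_copy" and entry.get("pool") == pool:
            hits.append((entry["time"], entry["destination"]))
    return hits
```

Recording the time in UTC keeps entries from hosts in different time zones comparable when you reconstruct the order of events.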