I have been an avid VMware VSAN user for two years now, and whilst I am incredibly impressed by how it performs at scale, I feel there are three considerations that need to be clarified with regards to how VSAN works, and how its technology has a profound impact on the usability of two and three-node deployments.
Consideration One: Recovery from failure
In VSAN, replication (ignoring erasure coding for now) is based on a RAID1/RAID10 methodology: there are two components (copies) of each object, residing on physically separate hosts (I'm ignoring disk stripes because they are not applicable in two and three-node deployments). A third host is used as the witness, which is simply the arbitrator that prevents split-brain from occurring.
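The mirror-plus-witness layout boils down to a simple quorum rule: an object stays accessible while more than half of its component votes are reachable. Here is a minimal sketch of that rule (my own illustration, not VMware code - the component names are made up):

```python
# Conceptual sketch (not VMware code): an FTT=1 object stays accessible
# while more than half of its component votes are reachable.

def object_accessible(components_online):
    """components_online: dict of component name -> bool (reachable?)."""
    votes = len(components_online)
    return sum(components_online.values()) > votes / 2

# Two data replicas plus a witness, one per host:
layout = {"replica_a": True, "replica_b": True, "witness": True}
print(object_accessible(layout))   # all three hosts up -> True

layout["replica_b"] = False        # one host fails
print(object_accessible(layout))   # 2 of 3 votes -> still True

layout["witness"] = False          # a second failure
print(object_accessible(layout))   # quorum lost -> False
```

This is why the witness matters: with only two data copies, losing one host would otherwise leave an unresolvable 1-vs-1 tie.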
As you can see from the picture below, in a three-node cluster, these two components and witness align nicely to the three hosts. In a two-node cluster the witness is a virtual ESXi host running elsewhere.
So what happens when a host fails? Well, the good news is that the object (VMDK) remains online, as there is a secondary copy available from which read/write access can occur. If the host recovers within 60 minutes, nothing much happens: the out-of-date component resyncs its missing data and things carry on as usual. However, if the host remains offline for more than 60 minutes, the secondary component gets marked as "degraded" and therefore cannot be brought back online.
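The behaviour above can be modelled as a tiny state machine. This is my own illustration of the article's description, not VMware code; the 60-minute constant mirrors vSAN's default repair delay (the `VSAN.ClomRepairDelay` advanced setting):

```python
# Illustrative model (not VMware code) of the 60-minute repair delay:
# a component on an unreachable host is recoverable while the timer
# runs, and needs a full rebuild elsewhere once it expires.

CLOM_REPAIR_DELAY_MIN = 60  # vSAN default repair delay, in minutes

def component_state(minutes_offline):
    if minutes_offline == 0:
        return "active"
    if minutes_offline <= CLOM_REPAIR_DELAY_MIN:
        return "absent"     # host returns in time: delta resync only
    return "degraded"       # past the window: full rebuild required
```

So a host that comes back at minute 59 triggers a cheap delta resync, while one that comes back at minute 61 triggers the painful rebuild path described next.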
No problem you say, VSAN has the ability to "repair" the missing objects (recreate them) - akin to a RAID rebuild, right?
Well, not so...
You see, VSAN won't simply overwrite the now-degraded components. Instead it requires a totally separate host onto which it will begin the replica recreation process. But there is no fourth host - this is a three-node (or even a two-node) cluster - so nothing happens. The VMs keep running with a single component (effectively RAID0).
So, how do you fix this?
It's not as simple (or elegant) as you might think. It requires the administrator to create a new storage profile with FTT=0, and then to manually assign ALL VMs on that cluster to that storage profile (thereby removing all data redundancy for the duration). Once that is done and all VMs are back "in compliance", the admin then needs to manually reassign ALL VMs back to the original FTT=1 storage profile, and then wait for all replicas to be recreated (which can take several hours). And remember, while the replicas are being recreated, there is no data redundancy. The cleanup of the outdated replicas is also manual, so you need to go through the datastore browser and delete these orphaned files.
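To make the ordering (and the redundancy gap) explicit, here is the workflow as a sketch. The helpers `assign_policy` and `is_compliant` are hypothetical stand-ins, not real vSAN or PowerCLI calls:

```python
# Sketch of the manual recovery workflow. assign_policy/is_compliant
# are hypothetical stand-ins used only to make the ordering explicit.

def rebuild_replicas(vms, assign_policy, is_compliant):
    # 1. Move every VM to an FTT=0 profile (all redundancy removed).
    for vm in vms:
        assign_policy(vm, "FTT-0")
    # 2. Wait until every VM is back "in compliance" with FTT=0.
    assert all(is_compliant(vm, "FTT-0") for vm in vms)
    # 3. Reassign the original FTT=1 profile and wait for the resync.
    #    There is NO redundancy until this completes, and the orphaned
    #    replica files still need deleting by hand afterwards.
    for vm in vms:
        assign_policy(vm, "FTT-1")

# Tiny in-memory simulation of the policy engine:
policies = {}
rebuild_replicas(
    ["vcenter01", "app01"],
    assign_policy=lambda vm, p: policies.__setitem__(vm, p),
    is_compliant=lambda vm, p: policies.get(vm) == p,
)
print(policies)  # every VM ends back on the FTT-1 profile
```

Note that every VM in the cluster makes a round trip through FTT=0, even VMs whose replicas were never damaged.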
I do wonder - why would VSAN not just move the witness role to the recovered host, and then use the old witness host as the rebuild target? Seems logical to me. Maybe someone from VSAN engineering knows...
Consideration Two: Maintenance Mode
In a VSAN cluster, when taking a host offline for maintenance, it is important to select the correct data migration option. "Ensure accessibility" is pretty much the default behaviour, and guarantees that objects will not be taken offline as a result of shutting down the host (really only an issue for FTT=0). "Full data migration" is required if the host will be offline for an extended period of time (i.e. over 60 minutes), or if you want to ensure that FTT=1 protection is retained during the maintenance window.
OK, so what's the issue?
Linked to Consideration One: in a two or three-node VSAN cluster, it is impossible to select "Full data migration", as this requires an additional host. And whilst "Ensure accessibility" works and the objects remain online, if the host is offline for more than 60 minutes then, regardless of the maintenance mode setting, the objects on the host being maintained will come back in a degraded state (and the manual work described above is then needed to recreate them).
My thoughts here are that if a two or three-node configuration is detected, a warning should be generated stating these considerations.
Consideration Three: vCenter on VSAN
There is much publicity about how to bootstrap VSAN so that it can be operational on a single host, allowing for the installation of vCenter on the VSAN datastore. Whilst this is an awesome concept, and is being used extensively in Management Cluster and ROBO deployments, there is one big issue that is only briefly touched on by William Lam, and that is how to do a clean shutdown of the cluster.
When shutting down a VSAN cluster with vCenter running on it, the only way to achieve this is to drop to CLI and execute commands to stop objects being marked missing/degraded.
If you are running a cluster in this mode, I would strongly recommend testing this process - both on the same network as the cluster (simulating being physically in the DC), and remotely via VPN or whatever other means of access you have.
The need for a clean shutdown is even more important if there are environmental issues occurring that may have already caused a failure of one host in the cluster; objects may already be in a degraded/missing state, so not performing a clean shutdown would most definitely cause serious issues.
I ask myself - if vCenter on VSAN is such an unusual deployment, why would VMware recommend that VSAN is a perfect use case for Management Clusters? And if it is an expected deployment method, why would they not address the clean shutdown in the GUI (after all, a simple timed/coordinated shutdown would fix this: change the data recovery setting on VSAN, send shutdown command to each host concurrently, timed delay to 5 minutes, then send shutdown command to vCenter VM).
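The coordinated shutdown I'm proposing is easy to express in code. Everything here is hypothetical (the `FakeHost`/`FakeVM` classes are stand-ins, not a real vSAN API) - the point is purely the ordering:

```python
# Sketch of the coordinated shutdown proposed above. All classes and
# methods are hypothetical stand-ins; only the ordering matters.

class FakeVM:
    def __init__(self, name, log): self.name, self.log = name, log
    def shutdown(self): self.log.append(f"shutdown vm {self.name}")

class FakeHost:
    def __init__(self, name, log): self.name, self.log = name, log
    def schedule_shutdown(self, after_min):
        self.log.append(f"host {self.name} powers off in {after_min} min")

def clean_cluster_shutdown(hosts, vcenter_vm, log, delay_min=5):
    # 1. Change the vSAN data recovery setting so components are not
    #    marked missing/degraded while the cluster is down.
    log.append("raise vSAN repair delay")
    # 2. Send a delayed shutdown to every host concurrently.
    for host in hosts:
        host.schedule_shutdown(after_min=delay_min)
    # 3. Shut vCenter down inside that window, before its host goes.
    vcenter_vm.shutdown()

log = []
hosts = [FakeHost(n, log) for n in ("esx01", "esx02", "esx03")]
clean_cluster_shutdown(hosts, FakeVM("vcenter01", log), log)
print(log[-1])  # the vCenter VM is the last thing shut down
```

Nothing in this sequence needs a human at a CLI, which is rather the point: it could be a single button in the GUI.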
I think that VMware are doing themselves a disservice by claiming that VSAN is perfect for ROBO and three-node clusters. These considerations are what I would call "critical", as they have a direct impact on the stability and durability of the platform. Customers looking to deploy two and three-node environments should be made aware of them so that they can put the required operational processes in place to ensure the reliability of the platform.
Oh, and for the record, I am running a three-node management cluster, and I have experienced these issues first hand. I am now about to place an order for a fourth node, as I cannot accept the risks I am exposing myself to in a three-node deployment.
FYI - all three of these considerations are covered in VMware documentation; however, they are buried in the fine print of a very large book.