While designing a vSAN based platform for a customer recently, I was looking into NVMe based ReadyNodes that needed to run ESXi 6.7 U3. While looking up a specific Dell rackmount configuration in the VMware Compatibility Guide, something caught my interest: the lines at the bottom of the node's configuration, which say “NVMe surprise removal is not supported at ESXi 6.7x ReadyNodes”.
Why does it matter
So the question arises: what exactly is NVMe surprise removal, and why does it matter? Dell wrote a good white paper on the subject: NVMe Hot-Plug on Dell EMC PowerEdge servers running VMware vSphere or vSAN.
First, let’s get on the same page about the terminology and the scenarios for removing drives from, and adding drives to, a running system.
- Surprise removal: Removing a device from the system without notifying the system beforehand.
- Hot insertion: Connecting an NVMe device while the VMware ESXi operating system is up and running.
- Orderly removal: Removing an NVMe device after completing the pre-requisites, such as suspending all processes accessing the device.
- Hot swap: Replacement of an NVMe SSD with a new SSD (from the same or different vendor) while the host is up and running. Hot swap is a surprise removal followed by a hot insertion operation with a different NVMe device.
This terminology makes sense. Surprise removal and hot swapping are tasks that an operations team performs often, and frankly they are very common with SAS based systems. The issue is that NVMe surprise removal is not supported in ESXi versions prior to 7.0b (build 16324942).
On an unsupported ESXi version, a host could PSOD (purple screen of death) when an NVMe drive is surprise removed and re-inserted within 1 minute. Frequent surprise removals could also lead to other problems and corner-case issues.
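To check a host against that cutoff, a minimal sketch in Python could look like the one below. It assumes the version string has the form `VMware ESXi 7.0.0 build-16324942`; the function names are hypothetical and not part of any VMware tooling. The version is compared before the build number, on the assumption that a 6.7 patch released after 7.0b can carry a higher build number than 7.0b itself.

```python
import re
from typing import Tuple

# Cutoff taken from the text above: ESXi 7.0b, build 16324942.
MIN_VERSION = (7, 0, 0)
MIN_BUILD = 16324942


def parse_esxi_version(banner: str) -> Tuple[Tuple[int, int, int], int]:
    """Parse a string such as 'VMware ESXi 7.0.0 build-16324942'.

    The string format is an assumption; adjust the pattern if your
    hosts report their version differently.
    """
    match = re.search(r"ESXi (\d+)\.(\d+)\.(\d+) build-(\d+)", banner)
    if match is None:
        raise ValueError(f"unrecognised version string: {banner!r}")
    major, minor, patch, build = (int(g) for g in match.groups())
    return (major, minor, patch), build


def supports_surprise_removal(banner: str) -> bool:
    """Return True if the host is at ESXi 7.0b (build 16324942) or later."""
    version, build = parse_esxi_version(banner)
    if version != MIN_VERSION:
        # Older release lines never qualify; newer ones always do.
        return version > MIN_VERSION
    # Same 7.0.0 release line: the build number decides.
    return build >= MIN_BUILD
```

For example, `supports_surprise_removal("VMware ESXi 6.7.0 build-16713306")` returns `False` even though the build number is higher than the cutoff, because the host is still on the 6.7 release line.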
The Dell white paper has a table with a good overview of the supported operations on NVMe drives. As far as I know, all of these operations are supported on SAS and SATA drives, because the server's host bus adapter (HBA) takes care of them. The main difference between NVMe and SAS is that NVMe devices are attached directly to the processor's PCIe bus and do not sit behind an HBA.
Preventing mentioned issues
With almost 20 years of experience in operations, one thing that stands out to me is that recovering from an issue (even one as simple as a disk failure) should be dead simple. Once checklists or runbooks come into play, steps get forgotten or misinterpreted, which leads to more issues or even downtime or data loss.
That is the main reason I wrote this post: besides being an interesting subject, it shows that operational tasks should be simple. If they are not, issues arise sooner or later.
In general, I see a couple of options to prevent issues with surprise removal or hot swapping of NVMe drives on ESXi versions before 7.0b. They depend on whether the systems have already been purchased and whether vSphere 7 is already in use. It is important to note that this advice is my personal opinion.
| NVMe ReadyNode already purchased | ESXi version | Advice |
|---|---|---|
| No | Planning to use v6.x for a while | Buy SAS based ReadyNodes |
| No | Using v7 | Upgrade to 7.0b or later |
| Yes | Still using v6.x | Consider upgrading to 7.0b+ or train staff |
| Yes | Using v7 | Upgrade to 7.0b or later |
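For anyone who wants to bake this into a planning script, the table above can be expressed as a small lookup. This is only a sketch of my personal advice; the function and parameter names are made up for illustration.

```python
# Keys are (NVMe ReadyNode already purchased, already on vSphere 7).
# The values are copied from the advice table above.
ADVICE = {
    (False, False): "Buy SAS based ReadyNodes",
    (False, True): "Upgrade to 7.0b or later",
    (True, False): "Consider upgrading to 7.0b+ or train staff",
    (True, True): "Upgrade to 7.0b or later",
}


def readynode_advice(nvme_purchased: bool, on_vsphere7: bool) -> str:
    """Look up the advice for a given purchase and version situation."""
    return ADVICE[(nvme_purchased, on_vsphere7)]


if __name__ == "__main__":
    # Example: NVMe nodes are already in the rack, but the cluster is on 6.x.
    print(readynode_advice(nvme_purchased=True, on_vsphere7=False))
```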
If you have the possibility to use all-NVMe vSAN ReadyNodes, go for it. NVMe drives are much faster than SAS drives, and certainly faster than those inferior SATA drives. The main reasons to go for NVMe drives are lower CPU overhead, increased bandwidth, lower latency and a huge I/O queue per drive.
That does not mean SAS based systems cannot do their job, far from it. Equipped with good drives and HBAs, they go well over 100K IOPS per node.
Getting back on topic: from my point of view, it is important to know which drive removal options are available and which of them are supported in vSphere. If you’re going to use vSphere 6.x for a while, consider buying SAS based ReadyNodes.
If your organization has already moved to vSphere 7, be sure to run a version that supports surprise removal and hot-plugging (7.0b or later), since that will prevent issues, downtime and maybe even data loss.