VMware Front Experience: How to disable Storage I/O Control for an unavailable datastore

If you run your VMs on FC based storage then you occasionally have the necessity to unmap a storage LUN from one or more ESX(i) hosts (e.g. when retiring a storage array or re-organizing your storage layout). It is important to do this in the right way using the procedures that are documented in

KB1029786 for ESX(i) 4.1
KB2004605 for ESXi 5.0

The procedure is very complex for vSphere 4.1, but fortunately is has become much easier with vSphere 5.0.

What happens if you fail to follow these procedures and just unpresent the LUN on the storage array, so that the hosts can not access it anymore? The ESX(i) hosts will detect an APD (all paths down) condition in this case, and - particularly ESX(i) 4.1 hosts - can become very unhappy about this (see KB1016626 and KB1030980). Again, ESXi 5.0 hosts are much more resilient to APD conditions: they will eventually turn them into PDL (= permanent device loss, s. KB2004684) conditions and will completely recover from the LUN loss after rescanning the HBAs ... unless you have Storage I/O control (SIOC) enabled on the lost datastores ... and this is what happened to us today :-( The vmkernel.log files were flooded with the following messages, because SIOC was trying to access the lost datastore:

Permanent Device Loss (PDL) with SIOC enabled

Now the problem is: You cannot just disable SIOC on a lost datastore using the vSphere client - you should have done this before unmapping the storage LUN!

One way to recover from this situation is to reboot any of the affected hosts. However, I really wanted to save the time to put all the hosts in maintenance mode and reboot them one after the other. So I looked for a way to forcibly unmount the datastore directly on the host via esxcli or similar. All the ways that are documented in the VMware KB articles and docs did not work in my case, but I finally stumbled over this wonderful blog post by William Lam:

Does SIOC actually require Enterprise Plus and vCenter?

It is a bit old and refers to ESX(i) 4.1, but it is still valid for ESXi 5.0! Here William describes a way to enable (and disable!) SIOC for a datastore directly on a host without using (even without having available!) a vCenter server.

And it is really easy: All you need to know is the device ID of the datastore/LUN. In ESXi 5.0 you can find it out by using the command

# esxcli storage vmfs extent list

It will list all datastores with their labels and device IDs (starting with naa.). And then you can use the following vsish command to disable SIOC for the device

# vsish -e set /storage/scsifw/devices/<naa-id>/iormState 1496

That's it! After waiting a few seconds the affected datastore suddenly disappeared from the list of mounted datastores in the vSphere client, and the VMkernel.log error messages also stopped.

Please note: vsish is a powerful but largely undocumented utility to query and set VMkernel parameters. It is only available in an ESXi local or remote shell. William has quite a few posts about it on his virtuallyGhetto blog. This time it saved us the trouble and time of rebooting a whole cluster of hosts ...

VMware Front Experience

How to disable Storage I/O Control for an unavailable datastore

No comments:

Post a Comment