The procedure is very complex for vSphere 4.1, but fortunately is has become much easier with vSphere 5.0.
What happens if you fail to follow these procedures and just unpresent the LUN on the storage array, so that the hosts can not access it anymore? The ESX(i) hosts will detect an APD (all paths down) condition in this case, and - particularly ESX(i) 4.1 hosts - can become very unhappy about this (see KB1016626 and KB1030980). Again, ESXi 5.0 hosts are much more resilient to APD conditions: they will eventually turn them into PDL (= permanent device loss, s. KB2004684) conditions and will completely recover from the LUN loss after rescanning the HBAs ... unless you have Storage I/O control (SIOC) enabled on the lost datastores ... and this is what happened to us today :-( The vmkernel.log files were flooded with the following messages, because SIOC was trying to access the lost datastore:
Permanent Device Loss (PDL) with SIOC enabled |
Now the problem is: You cannot just disable SIOC on a lost datastore using the vSphere client - you should have done this before unmapping the storage LUN!
One way to recover from this situation is to reboot any of the affected hosts. However, I really wanted to save the time to put all the hosts in maintenance mode and reboot them one after the other. So I looked for a way to forcibly unmount the datastore directly on the host via esxcli or similar. All the ways that are documented in the VMware KB articles and docs did not work in my case, but I finally stumbled over this wonderful blog post by William Lam:
Does SIOC actually require Enterprise Plus and vCenter?
It is a bit old and refers to ESX(i) 4.1, but it is still valid for ESXi 5.0! Here William describes a way to enable (and disable!) SIOC for a datastore directly on a host without using (even without having available!) a vCenter server.
And it is really easy: All you need to know is the device ID of the datastore/LUN. In ESXi 5.0 you can find it out by using the command
# esxcli storage vmfs extent list
It will list all datastores with their labels and device IDs (starting with naa.). And then you can use the following vsish command to disable SIOC for the device
# vsish -e set /storage/scsifw/devices/<naa-id>/iormState 1496
That's it! After waiting a few seconds the affected datastore suddenly disappeared from the list of mounted datastores in the vSphere client, and the VMkernel.log error messages also stopped.
Please note: vsish is a powerful but largely undocumented utility to query and set VMkernel parameters. It is only available in an ESXi local or remote shell. William has quite a few posts about it on his virtuallyGhetto blog. This time it saved us the trouble and time of rebooting a whole cluster of hosts ...
Please note: vsish is a powerful but largely undocumented utility to query and set VMkernel parameters. It is only available in an ESXi local or remote shell. William has quite a few posts about it on his virtuallyGhetto blog. This time it saved us the trouble and time of rebooting a whole cluster of hosts ...
No comments:
Post a Comment
***** All comments will be moderated! *****
- Please post only comments or questions that are related to this post's contents!
- Advertising and link spamming will not be tolerated!