VMFS Volumes Missing!?!?!

Here’s the scenario:

After performing maintenance on an ESX server (patches, storage re-scan, reboot), VMFS volumes are no longer visible, even though the hosting LUN can be seen on the Storage Adapters page of the ESX Configuration tab.  Most VMware administrators will see this play out at some point; I saw it in one of my environments today and figured I should make a note of the steps required to correct the issue.

Typically, the root cause is a change on the storage array that causes the h(id) of the LUN(s) in question to change.  The trigger could be an array firmware update, a LUN removal and re-creation, or a RAID/LUN reconfiguration.  When a rescan next takes place on the ESX storage adapters (started manually, at reboot, etc.), the new h(id) is observed.  Because it does not match the previously observed ID, the LUN is tagged as a snapshot LUN and access to that LUN is disabled.

Diagnosis of this problem is fairly easy.  In addition to the behavior I have described, as observed through the Virtual Center Client, the problem can also be confirmed through the ESX command line.

To diagnose this issue from the console, view the vmkernel log by issuing the following command: tail -f /var/log/vmkernel

You will see messages in the log similar to the following:

Jun  2 16:01:29 esx04 vmkernel: 0:00:31:14.543 cpu3:1039)ALERT: LVM: 4482: vml.0200020000600a0b80005add7800000a494a1d0be6313732362d33:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.
Jun  2 16:01:29 esx04 vmkernel: 0:00:31:14.552 cpu3:1039)LVM: 5579: Device vml.0200010000600a0b80005add7800000a474a1d0bc8313732362d33:1 detected to be a snapshot:
Jun  2 16:01:29 tccesx04 vmkernel: 0:00:31:14.552 cpu3:1039)LVM: 5586:   queried disk ID: <type 2, len 22, lun 1, devType 0, scsi 6, h(id) 5103533129706062046>
Jun  2 16:01:29 esx04 vmkernel: 0:00:31:14.552 cpu3:1039)LVM: 5593:   on-disk disk ID: <type 2, len 22, lun 1, devType 0, scsi 6, h(id) 2153359415130143165>
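
To pull just the relevant lines out of a busy log, a small filter along these lines may help.  This is a minimal sketch: the function names find_snapshot_luns and show_hids are hypothetical, and the patterns are based only on the sample messages above, so other ESX builds may word them differently.

```shell
# Hypothetical helpers; patterns are taken from the sample log lines above.
# Both default to /var/log/vmkernel when no file is given.

# Print the LVM lines that report a LUN as a possible snapshot.
find_snapshot_luns() {
    grep -E 'may be snapshot|detected to be a snapshot' "${1:-/var/log/vmkernel}"
}

# Print the queried and on-disk h(id) values so they can be compared.
show_hids() {
    sed -n \
        -e 's/.*queried disk ID:.*h(id) \([0-9][0-9]*\)>.*/queried \1/p' \
        -e 's/.*on-disk disk ID:.*h(id) \([0-9][0-9]*\)>.*/on-disk \1/p' \
        "${1:-/var/log/vmkernel}"
}
```

Run against the sample messages above, show_hids would print "queried 5103533129706062046" followed by "on-disk 2153359415130143165".  The mismatch between those two values is what gets the LUN tagged as a snapshot.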

After confirming that this is indeed the problem you are experiencing, stop and take a deep breath.  The fix is easy, but you need to take steps before fixing it to prevent further damage.  If you are lucky, the problem has only manifested itself on one ESX server (and hopefully that server was not hosting any VMs because you put it into maintenance mode).  Prevent your other ESX servers from rescanning storage – don’t reboot them, don’t manually rescan, don’t update them.

If the affected ESX server was hosting running VMs, HA (if licensed and properly configured) should have kicked in and restarted the VMs on another node in the ESX cluster.

If multiple ESX servers (or all of them) are affected, your VMs are likely all powered off after hard stops, so there is not much you can do but get on with fixing the issue and trust your backups (you do have backups, right?).  This is where array-level snapshots come in handy.  In my experience, most if not all VMs recover after a hard stop like this, but don’t let that keep you from having a robust DR plan.

To correct the issue, you must not have any running VMs on the affected VMFS volumes.  Shut down the VMs or use Storage VMotion to move running VMs to alternate LUNs.

In the VI Client, select the affected ESX host in the Hosts & Clusters view.  Switch to the Configuration tab.  Click ‘Advanced Settings’ and then choose the LVM node.  Change LVM.DisallowSnapshotLun from the default setting of ‘1’ to ‘0’ and click OK.  Next, rescan your storage from the ‘Configuration | Storage Adapters’ pane.  Your missing VMFS volumes should re-appear.  You’re doing fine, but you’re not done yet.

Even if the other hosts that use the affected VMFS volume appear to be fine, they will most likely lose access to the volume once a rescan/reboot takes place.  You need to perform the LVM.DisallowSnapshotLun = 0 setting change on all ESX servers connected to the volume, followed by a re-scan of your storage.

Once all affected ESX servers see the VMFS volumes, change the LVM.DisallowSnapshotLun setting back to the default of 1.  Migrate the VMs back and/or power them up on the volume and see what the damage is.  If you are lucky, everything is good to go.  If not, it’s a great time to check out those backups.
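
If you prefer the service console to the VI Client, the same setting change and rescan can be sketched with the esxcfg-advcfg and esxcfg-rescan tools.  Treat this as an assumption-laden sketch rather than a tested procedure: the helper names are hypothetical, the adapter name vmhba1 is a placeholder, and the option path /LVM/DisallowSnapshotLun is assumed to mirror the name shown in the VI Client (confirm the exact path and its current value on your build with esxcfg-advcfg -g before setting it).

```shell
# Sketch only: assumes the ESX 3.x service console tools esxcfg-advcfg and
# esxcfg-rescan are on PATH. The option path below is an assumption that
# mirrors the VI Client name; verify it with "esxcfg-advcfg -g" first.

# Set LVM.DisallowSnapshotLun to 0 (allow snapshot LUNs) or 1 (default).
set_disallow_snapshot_lun() {
    case "$1" in
        0|1) esxcfg-advcfg -s "$1" /LVM/DisallowSnapshotLun ;;
        *)   echo "usage: set_disallow_snapshot_lun 0|1" >&2; return 1 ;;
    esac
}

# Rescan each storage adapter named on the command line (e.g. vmhba1).
rescan_adapters() {
    for hba in "$@"; do
        esxcfg-rescan "$hba" || return 1
    done
}
```

On each affected host you would run something like "set_disallow_snapshot_lun 0 && rescan_adapters vmhba1", and once every host sees the volumes again, "set_disallow_snapshot_lun 1" to restore the default.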

If you do not know what caused the storage change, check your ESX logs to try to determine if the server was rebooted or if storage was rescanned. This will give you an idea of when the change occurred – a starting point to work back from to find the root cause.  Use this command to get started: less /var/log/vmksummary

Here are some suggestions on how to avoid this problem:

1.) Minimize changes to LUNs once they are configured on an ESX server.

2.) Coordinate Storage Maintenance with VMware maintenance windows.

3.) Have stand-by storage so you can Storage VMotion running VMs off the affected LUNs.

4.) Consider NFS, as NFS volumes are not impacted by resignaturing.

For more information on this problem, or to better understand the advanced settings changes involved, check out the VMware SAN Configuration Guide at http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_3_server_config.pdf, page 114, or the VMware iSCSI SAN Configuration Guide at http://www.vmware.com/pdf/vi3_35/esx_3/r35u2/vi3_35_25_u2_iscsi_san_cfg.pdf, page 117.

Comments

  1. Great work! You’re a lifesaver. I was up all night trying to figure out what happened to lun 1.

  2. I have a similar but more troublesome problem. The changed ID is of the 2nd extent of a VMFS volume. So I can’t mount it as standalone.

    Why it is so difficult to just change the ID that VMware is expecting, I don’t know… If it’s expecting the ID to be “xxx”, then modify “xxx” with the new ID of the LUN. Or vice-versa. Can it be done?

  3. Thanks this saved me a lot of work.

  4. Thanks for this help!
