I recently ran into an issue with one of my vSphere clusters after upgrading from vSphere 4.0 to vSphere 4.1 (with ESXi 4.1 and vCenter 4.1). After the upgrade, I attempted to enable VMware High Availability (HA) on the upgraded cluster. Each of the ESXi hosts in the cluster appeared to have been properly configured for HA (as observed in the ‘Recent Tasks’ pane of the vSphere Client). Although the HA configuration tasks appeared to complete successfully, each host in the cluster was displaying an error on the Summary tab of the vSphere Client that read ‘Error <date> <time> HA agent on <host> in cluster <clustername> in <datacenter> has an error: Error while running health check script’.
I’ve dealt with HA errors in the past, so I quickly jumped into my standard troubleshooting and quick-fix procedure:
- Verify host connectivity.
- Right-click on each host and choose ‘Reconfigure for VMware HA’
- Disable & Re-enable HA on the cluster.
- Disable HA, place hosts into Maintenance Mode & Reboot (one at a time). Re-enable HA.
- Get frustrated that a quick fix is probably not in my future….
- Verify host name resolution for each host in the cluster from the service console/tech support mode of each host.
- Review log files on vCenter Server and each host for glaring issues. All Greek to me in this case….
- Call VMware Support.
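The name resolution step in the list above can also be checked in bulk from a management workstation rather than host by host. The following is only a rough sketch, not something VMware Support provided: it assumes Python 3 is available on the workstation, and the function name `check_resolution` and the host names in the usage note are hypothetical. It confirms that each host name resolves to an IP address and that the address resolves back to the same short name.

```python
import socket


def check_resolution(hostname,
                     resolver=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Check forward and reverse DNS lookups for one host.

    The resolver and reverse arguments default to the standard socket
    calls; they are parameters only so the check can be exercised
    without a live DNS server.
    """
    result = {"host": hostname, "ip": None, "reverse": None, "ok": False}

    # Forward lookup: host name -> IP address.
    try:
        result["ip"] = resolver(hostname)
    except OSError:
        return result

    # Reverse lookup: IP address -> primary host name.
    try:
        result["reverse"] = reverse(result["ip"])[0]
    except OSError:
        return result

    # Compare short names case-insensitively so that "esxi01" and
    # "ESXi01.example.local" count as the same host.
    short_given = hostname.split(".")[0].lower()
    short_found = result["reverse"].split(".")[0].lower()
    result["ok"] = short_given == short_found
    return result


def report(hostnames):
    """Print one line per host summarising the lookup results."""
    for name in hostnames:
        r = check_resolution(name)
        status = "OK" if r["ok"] else "CHECK"
        print(f"{status:5} {r['host']} -> {r['ip']} -> {r['reverse']}")
```

Calling `report(["esxi01.example.local", "esxi02.example.local"])` with your own host names prints one line per host. Anything marked CHECK deserves a closer look. This only tests resolution from the workstation's point of view, so it complements the per-host check from tech support mode and does not replace it.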
VMware Support reviewed the log files I had attached to my Service Request (SR) when I opened the case and had me try a few different things to fix the issue. First, we verified the steps I had taken and collected some fresh logs. Next, the support rep had me verify that Distributed Power Management (DPM) was not enabled on the cluster, as there is apparently a known issue (although a KB is not available at this time) with configuring HA when DPM is enabled under certain circumstances. I did not have DPM enabled on this cluster, so I didn’t spend time chasing down that bug.
Finally, the following procedure, run on each ESXi server in the cluster, resolved the issue (Note: this procedure is safe to do during normal operations as it does not affect running VMs):
- Verify SSH or console access to the host (on ESXi hosts this requires enabling Remote SSH/Tech Support Mode, either on the Configuration tab | Security Profile node of the vSphere Client, or by pressing F2 to log in to ESXi 4.1 | troubleshooting options | enable remote SSH).
- Disable HA on the affected cluster.
- Right-click | Disconnect each host in the cluster from the ‘Hosts & Clusters’ view of the vSphere Client.
- SSH to the host and run the following commands:
- In the vSphere Client, right-click on each host and Connect.
- Enable HA on the cluster.
This procedure cleanly removes the VMware vCenter agent and the VMware HA agent from the ESX or ESXi host. Reconnecting the host to vCenter pushes the vCenter management agent back to the host and installs it cleanly. Enabling HA on the cluster re-installs the HA agent. After completing these steps I had no further issues with HA on the cluster – case closed. I hope this is helpful for anyone else who might be experiencing HA errors after upgrading to vSphere 4.1.
For those wanting to learn HA best practices or go a bit deeper into the inner workings of VMware HA, I highly recommend Duncan Epping’s VMware HA Deep Dive article and/or VMware vSphere 5.1 HA and DRS Technical Deepdive (Volume 1) book.