My last post described a problem I experienced with VMware HA after upgrading to vSphere 4.1. Here is my experience with a similar issue after applying the ESXi410-201010401-SG patch to one of my test/dev ESXi clusters. The patch, released on November 15th and weighing in at a hefty 212MB, fixes a number of issues, from Likewise authentication on ESXi hosts to allowing configurable NOOP timeout and interval values for faster failover of certain iSCSI arrays (like the DS3300 or MD3000i).
The environment where this problem occurred has a single vCenter server managing both a production cluster and the test/dev cluster. After applying this particular update to the ESXi hosts in the test/dev cluster, the vCenter server began to crash every 5 minutes or so. The crash was logged on the vCenter server with Event ID 7031: The VMware VirtualCenter Server service terminated unexpectedly. My go-to troubleshooting question (“What changed?”) pointed at the ESXi patch, but a VMware KB search and a little Google action yielded no results directly related to ESXi410-201010401-SG and the vCenter Server service terminating unexpectedly. VMware KB article 1003926 provides some basic troubleshooting steps for vCenter Server, such as checking for port conflicts, checking vCenter DB health & availability, and locating the logs. The environment was healthy until the patch was applied to a subset of my ESXi hosts, so I could confidently eliminate credentials, port conflicts and the like as the cause of the problem. I jumped right to the log files for vCenter. The vpxd-*.log files are found in “C:\ProgramData\VMware\VMware VirtualCenter\Logs” on Windows 2008 vCenter servers and “%ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\vpxd.log” on Windows 2003 servers. I found a few lines of interest in the log file but decided I had better call VMware Support to further analyze the issue.
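If you want to grab the tail end of the most recent vpxd logs right after a crash (handy for pasting into a support request), something like the following rough Python sketch could do it. This is just my illustration, not a VMware tool: the helper names `newest_vpxd_logs` and `tail` are made up, and `LOG_DIR` assumes the Windows 2008 location mentioned above, so adjust it for your server.

```python
import glob
import os
from collections import deque

# Assumption: the Windows 2008 log location from the post; change as needed.
LOG_DIR = r"C:\ProgramData\VMware\VMware VirtualCenter\Logs"


def newest_vpxd_logs(log_dir, count=2):
    """Return up to `count` vpxd-*.log paths, most recently modified first."""
    paths = glob.glob(os.path.join(log_dir, "vpxd-*.log"))
    return sorted(paths, key=os.path.getmtime, reverse=True)[:count]


def tail(path, lines=50):
    """Return the last `lines` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the final `lines` lines in memory.
        return list(deque(handle, maxlen=lines))


if __name__ == "__main__":
    for path in newest_vpxd_logs(LOG_DIR):
        print("==== " + path)
        for line in tail(path):
            print(line.rstrip())
```

Sorting by modification time rather than by file name avoids guessing how the log files are numbered as they rotate.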
To make a long story short, what the logs revealed is a bug that is triggered whenever VMware Distributed Resource Scheduler (DRS) runs on the updated test/dev cluster. Disabling DRS stopped the symptom of the vCenter Server service terminating unexpectedly, but this was obviously not a long-term solution. A bit more digging by my VMware support rep identified VMware Distributed Power Management (DPM), which was enabled on the cluster, as the root cause of the issue. Disabling DPM but leaving DRS enabled on the cluster fixed the glitch. I can live without DPM, but DRS is pretty darn handy.
At this point, VMware engineering knows about the issue, and a fix is planned for vCenter 4.1 Update 1. It is interesting that DPM was fingered in this case, as well as in the case I wrote about last week where HA and DPM apparently do not always play well together. It seems like DPM is not fully baked, even though it is now officially supported. This is unfortunate, as DPM looks promising to me. I can imagine the technology behind DPM being used for intelligent load shedding during peak electrical cost hours, power outages, or cooling outages in datacenters, given some good integration between a DPM API and environmental management and monitoring systems like APC’s NetBotz. Anyone else using DPM without having problems? Any ideas for extending DPM or leveraging it for other purposes in the datacenter? I’d love to hear them in the comments.