Issues & Troubleshooting

I recently stumbled on two vSphere 5 ESXi networking bugs that I thought I would share. The issues are very similar from a cursory level, but have different symptoms, troubleshooting steps, and implications for your architecture, so I’m going to split the issues into two separate posts. Because troubleshooting these issues was a real pain, I’ll provide some details on how to identify these issues in your environments and wrap up with a third post on what I believe to be some best practices to avoid these same problems and achieve greater redundancy and resiliency in your vSphere environments.

The Problem

Today, we’ll look at an ESXi 5 networking issue that caused massive iSCSI latency, lost iSCSI sessions, and lost network connectivity. I’ve been able to reproduce this issue in several environments, on different hardware configurations. Here’s the background information on how all this started: I upgraded an ESXi 4.1 host to ESXi 5 using vSphere Update Manager (VUM). Note that I did use the host upgrade image that contained the ESXi500-201109001 iSCSI fixes – if you are upgrading to vSphere 5 and have iSCSI in your environment, use this image. Here’s a quick look at how the networking was configured on this host:

 

The iSCSI networking was configured in a very typical setup, and per best practices, as outline in VMware’s documentation, as well as from many vendors (see EMC’s Chad Sakac’s ‘A Multivendor Post on using iSCSI with VMware vSphere’), with two vmnic uplinks, two vmknics, with one active adapter on the correct layer-2/layer-3 network, and the other unused.

vSwitch iSCSI vmknic override failover order with unused NICvSwitch iSCSI vmknic override failover order with unused NIC

After the upgrade, the standard vSwitch with two vmnics for uplinks (Broadcom NetXtreme II BCM5709 1000Base-T) and two vmknics that serviced the software iSCSI adapter failed to pass traffic (vmkping to the iSCSI targets failed) and could not mount ANY iSCSI LUN’s. VM network, management, and vMotion ports were not affected.

If I let the host sit long enough, it *might* find a couple paths to the storage, but even then performance was deteriorated per the vmkernel.log:

WARNING: ScsiDeviceIO: 1218: Device naa.60026b90003dcebb000003ca4af95792 performance has deteriorated. I/O latency increased from average value of 5619 microseconds to 495292 microseconds.

Troubleshooting

I’m going to dump a whole bunch of my troubleshooting steps on you – hopefully they not only help folks dealing with this particular bug, but help with general network and configuration troubleshooting in VMware vSphere. During troubleshooting, I removed the vmk binding for these two on the iSCSI adapter, removed the software iSCSI Adapter itself, removed the vmknics on the vSwitch, and removed the vSwitch itself. I then recreated the vSwitch, set vSwitch MTU to 9000, recreated two vmk ports, set 9000MTU, assigned IP, and set failover order for multipath iSCSI. I then re-created the software iSCSI adapter and bound the two vmk ports. I was able to pass vmk traffic and mount iSCSI LUN’s. Great – problem solved!?!?! Not so much – I rebooted the host and the problem returned.

Here are my next troubleshooting steps:

  • I repeated the procedure above and re-gained connectivity, but the problem returns on subsequent reboots. I can verifiably recreate the problem.
  • I verified end-to-end connectivity for other hosts on the same Layer 1, Layer 2, and Layer 3 network as the iSCSI initiator and iSCSI targets.
  • I verified the ESXi host’s networking configuration using the vSphere client, double-checking the vSwitch, vmnic uplinks, and vmknic configurations. Everything looked good so I canceled out.
  • I then reinstalled ESXi from scratch (maybe something was left over from 4.1 that a clean install would weed out), built up the same configuration, and was again able to re-create the problem.
  • I poured over logs (vmkernel.log, syslog.log and storagerm.log primarily). I could see an intermittent loss of storage connectivity, failure to log into the storage targets (duh – there is no connectivity, no vmkping) and high storage latency on hosts where I had rebuilt the iSCSI stack and run a few VM’s.
  • I switched out the Broadcom NIC for an Intel NIC (the Broadcom had hardware iSCSI capabilities – I wanted to be sure the hardware iSCSI was not interfering).
  • I verified TOE was enabled.

The ‘Ah-Ha’ Moment

Next, I verified the ESXi host’s networking configuration using the vSphere client one more time – the properties of the vSwitch, the properties of the vmkernel (vmk) ports, the manual NIC teaming overrides, IP addressing, etc. Everything looked correct – I MADE NO CHANGES – but when I clicked OK (last time I canceled) to close the vSwitch properties and was greeted with this warning:

changing an iscsi initiator port group warning

Wait a second… I didn’t change anything, why am I being prompted with a you’re ‘Changing an iSCSI Initiator Port Group’ warning? I like to live dangerously, and wanted to see what would happen, so I said ‘Yes’.

Much to my surprise, after only viewing and closing the vSwitch and iSCSI vmk port group settings, I was able to complete a vmkping on the iSCSI-bound vmk’s. And moreover, I completed a Rescan of all storage adapters and my iSCSI LUN’s were found, mounted, and ready for use. Problem solved? Nope. The same ugly issue re-appeared after a reboot.

While the problem wasn’t solved, I now had something to work with. My go-to troubleshooting question “What Changed?” could maybe be answered. Even though I didn’t change anything in the vSwitch Properties GUI, something changed. To see what changed in the background, I compared the output of the following ESXi Shell (or vCLI, or PowerCLI) commands before and after making ‘the change’ happen (by viewing the properties of the vSwitch/vmk ports), but found no changes.

  • esxcfg-vswitch -l
  • esxcfg-vmknic -l
  • esxcfg-nics -l

Then, I made backup copy of esx.conf

 cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

Then I caused ‘the change’ and then compared checksums using md5sum, but found no differences:

 md5sum /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

I compared the running .conf and the backup .conf, but found no differences:

 diff /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

Call in Air Support
At this point, I was out of ideas so I called for help: “Hello, 1-866-4VMWARE, option 4, option 2 – help!”

After repeating many of the same troubleshooting steps, the support engineer decided that I had hit on a known, and not yet patched, bug. The details of the bug are included in KB 2008144: Incorrect NIC failback occurs when an unused uplink is present. That’s right – my iSCSI traffic, vmkpings, etc were being sent down the wrong NIC – the UNUSED NIC. Ouch. The bug caused the networking stack to behave in a very unpredictable way, making my troubleshooting steps next to useless, and any other advanced troubleshooting ideas I had (sniffing, logs, etc.)

Once I knew what the issue was, I could see a bit of evidence in the logs:

WARNING: VMW_SATP_LSI: satp_lsi_pathIsUsingPreferredController:714:Failed to get volume access control data for path "vmhba33:C0:T0:L4": No connection

NMP: nmp_DeviceUpdatePathStates:547: Activated path "NULL" for NMP device "naa.60026b90003dcebb0000c7454d5cc946".

WARNING: ScsiPath: 3576: Path vmhba33:C0:T0:L4 is being removed

Notice the NULL path – the path can’t be interpreted correctly when being sent down the wrong (unsued) vmnic that is on a different subnet and VLAN. The gotcha on this issue is that I had followed best practices where applicable, and accepted default settings on the vSwitch and vmknics.

The Quick Fix
VMware KB 2008144 offers two workaround for this bug. The quick fix for the problem is to simply change the Failback setting on either the vSwitch running the software iSCSI vmknic’s to “No” (default is yes), or to change the setting on the vmknic itself if you have other port groups on the vSwitch (such as a VM Network port group to give your guest VM’s access to the iSCSI network).

Change vSwitch or Portgroup Failback

Changing Failback = No on the iSCSI vmknics and then rescanning the storage adapters fix the glitch immediately.

Architecture Changes
The second workaround from VMware is “Do not have any unused NICs present in the team.”. This translates to a slightly different architecture than that described in many documents. To achieve this workaround, the configuration would have to change to two vSwitches, each with a single vmnic uplink and a single vmk port, bound to the iSCSI adapter. This change does not impact redundancy or availability when compared with the single-vSwitch:two-vmk configuration that I was running with as one of the vmnics was set to unused anyway. This workaround does add a bit more complexity, as there are a few more elements to configure, monitor, manage, and document.

 

This problem seems to only present itself on vSphere Standard Switches (vSwitch), although I could not get confirmation of this (please post a comment if you know!). Assuming this is true, a vDistributed Switch (vDS) could be used for Software iSCSI traffic. Mike Foley has a write-up on how to migrate iSCSI from a vSwitch to a vDS on his blog here: http://www.yelof.com/?p=72.

A Couple More Notes
My troubleshooting fix of viewing the vSwitch settings and clicking ok seemed to temporarily resolve the issues because it triggered an up/down event on the vmk of the unused uplink. This caused the network stack to re-evaluate paths and start using the correct, Active, uplink.

Note that this problem can occur outside of my iSCSI use case – any vSwitch, Port Group, or VMKNIC with an unused adapter set in the NIC Teaming Failover Order are susceptible to this bug, so watch for it on redundant vMotion networks (vMotion randomly fails), VM Network networks (sudden loss of guest connectivity), or even your management network (hosts fall out of manageability from vCenter, and can’t be contacted via SSH, vSphere client, etc.
Leave a comment if you’ve experienced this bug – your notes on the problem may help others find and fix the issue until VMware releases a fix. I understand that a fix for this particular bug is not due out until at least vSphere 5 Update 1.

I’ll have another (shorter) writeup on the 2nd networking bug I found in ESXi 5 later in the week – check back here for a link once it is published.

{ 0 comments }

My last post described a problem I experienced with VMware HA after upgrading to vSphere 4.1.  Here is my experience with a similar issue after applying the ESXi410-201010401-SG patch to one of my test/dev ESXi clusters.  The patch, released on November 15th and weighing in at a hefty 212MB, fixes a number of issues from Likewise authentication on ESXi hosts to allowing configurable NOOP timout and interval values for faster failover of certain iSCSI arrays (like the DS3300 or MD3000i).

The environment where this problem occured has a single vCenter server managing both a production cluster and the test/dev cluster.  After applying this particular update to the ESXi hosts in the cluster, the vCenter server began to crash every 5 minutes or so.  The crash was logged on the vCenter server with Event ID 7031: The VMware VirtualCenter Server service terminated unexpectedly.  My go-to troubleshooting question (“What changed?”) pointed at the ESXi patch, but a VMware KB search and a little Google action yielded no results directly related to ESXi410-201010401-SG and the vCenter Server service terminating unexpectedly.  VMware KB article 1003926 provides some basic troubleshooting steps for vCenter Server, such as checking for port conflicts, vCenter DB health & availability, and log locations.  The environment was healthy until the patch was applied to a sub-set of my ESXi hosts so I could confidently eliminate credentials, port conflicts and the like as the cause of the problem, so I jumped right to the log files for vCenter.  The vpxd-*.log is found in “C:\ProgramData\VMware\VMware VirtualCenter\Logs” on Windows 2008 vCenter servers and “%ALLUSERSPROFILE%\VMware\VMware VirtualCenter\Logs\vpxd.log” on Windows 2003 servers.  I found a few lines of interest in the log file but decided I had better call VMware Support to further analyze the issue.

To make a long story short, what the logs revealed is a bug that is triggered whenever VMware Distributed Resource Scheduler (DRS) ran on the updated test/dev cluster.  Disabling DRS stopped the symptom of the vCenter Server Service terminating unexpectedly, but this was obviously not a long-term solution.  A bit more digging by my VMware support rep led to VMware Distributed Power Management (DPM) being enabled on the cluster as the root cause of the issue.  Disabling DPM but leaving DRS enabled on the cluster fixed the glitch.  I can live without DPM, but DRS is pretty darn handy.

At this point, VMware engineering knows about the issue, and a fix is planned for vCenter 4.1 Update 1.  Interesting that DPM was fingered in this case, as well as in the case I wrote about last week where HA and DPM apparently do not always play well together.  It seems like DPM is not fully baked, even though it is now officially supported.  This is unfortunate as DPM is promising to me – I can imagine the technology behind DPM being used for intelligent load shedding during peak electrical cost hours, power outages, or cooling outages in datacenters with some good integration between a DPM API and environmental management and monitoring systems like APC’s NetBotz.  Anyone else using DPM without having problems?  Any ideas for extending DPM or leveraging it for other purposes in the datacenter – I’d love to hear ideas in the comments.

{ 1 comment }

I recently ran into an issue with one of my vSphere clusters after upgrading from vSphere 4.0 to vSphere 4.1 (with ESXi 4.1 and vCenter 4.1).  After the upgrade, I attempted to enable VMware High Availability (HA) on the upgraded cluster.  Each of the ESXi hosts in the cluster appeared to have been properly configured for HA (as observed in the ‘Recent Tasks’ pane of the vSphere Client).  Despite having appeared to configure HA correctly, I found that each host in the cluster was displaying an error on the Summary tab of the vSphere Client that read ‘Error <date> <time> HA agent on <host> in cluster <clustername> in <datacenter> has an error: Error while running health check script’.

I’ve dealt with HA errors in the past, so I quickly jumped into my standard troubleshooting and quick-fixes proceedure:

  1. Verify host connectivity.
  2. Right-click on each host and choose ‘Reconfigure for VMware HA’
  3. Disable & Re-enable HA on the cluster.
  4. Disable HA, place hosts into Maintenance Mode & Reboot (one at a time).  Re-enable HA.
  5. Get frustrated that a quick fix is not probably not in my future….
  6. Verify host name resolution for each host in the cluster from the service console/tech support mode of each host.
  7. Review log files on vCenter Server and each host for glaring issues.  All Greek to me in this case….
  8. Call VMware Support.

VMware Support reviewed the log files I had attached to my Service Request (SR) when I opened the case and had me try a few different things to fix the issue.  First, we verified the steps I had taken and collected some fresh logs.  Next, the support rep had me verify that Distributed Power Management (DPM) was not enabled on the cluster as there apparently is a known issue (although a KB is not available at this time) with configuring HA when DPM is enabled under certain circumstances.  I did not have DPM enabled on this particular cluster so I didn’t spend time chasing down this particular bug.

Finally, the following proceedure, run on each ESXi server in the cluster, resolved the issue (Note – this procedure is safe to do during normal operations as it does not affect running VM’s):

  1. Verify SSH or Console access to the host (this requires enabling Remote SSH/Tech Support Mode on ESXi hosts on the Configuration tab | Security Profile node of the vSphere Client, or by pressing F2 to login to ESXi 4.1 | troubleshooting options | enable remote SSH.
  2. Disable HA on the affected cluster.
  3. Right-click | Disconnect each host in the cluster from the ‘Hosts & Clusters’ view of the vSphere Client.
  4. SSH to the host and run the following commands:
  5. services.sh stop
    /opt/vmware/uninstallers/VMware-vpxa-uninstall.sh
    /opt/vmware/uninstallers/VMware-aam-ha-uninstall.sh
    services.sh start

  6. In the vSphere Client, right-click on each host and Connect.
  7. Enable HA on the cluster.

This procedure cleanly removes the VMware vCenter agent and the VMware HA agent from the ESX or ESXi host.  Reconnecting the host to vCenter pushes the vCenter management agent back to the host and installs it cleanly.  Enabling HA on the cluster re-installs the HA agent.  After completing these steps I had no further issues with HA on the cluster – case closed.  I hope this is helpful for anyone else who might be experiencing HA errors after upgrading to vSphere 4.1.

For those wanting to learn HA best practices or go a bit deeper into the inner workings of VMware HA, I highly recommend Duncan Epping’s VMware HA Deep Dive article and/or VMware vSphere 4.1 HA and DRS Technical Deepdive (Volume 1) book.

{ 2 comments }

I ran into an issue with a customer today where a VM was performing terribly.  From within the guest OS (a Windows 2003 application server running .NET in IIS which I will call BigBadServer) things appeared sluggish and CPU time was high.  The amount of time being spent on the kernel was notably high.  The VM in question had 4 vCPU’s and a good helping of memory.

high kernel time in perfmon

I don’t have access to the VMware client at this particular site – just some of the guests, so I was flying blind.  Gut feeling told me that I was dealing with a resource contention issue.  I had the VMstats provider running in the guest (http://vpivot.com/2009/09/17/using-perfmon-for-accurate-esx-performance-counters/) showed me that there was no ballooning or swapping going on, and that the vCPU’s were not limited and the CPU share value seemed to be at the default.

I strongly suspected that the physical server running VMware ESX was oversubscribed on physical CPU (pCPU) resources.  Essentially, the guest VM’s that are sharing the resources of the physical machine are demanding more resources than the machine can handle.  To verify this theory, I had the client check the ‘CPU Ready’ metric on BigBadServer and bingo!

CPU Ready is a measure of the amount of time that the guest VM is ready to run against the pCPU, but the VMware CPU Scheduler cannot find time to run the VM because other VM’s are competing for the same resources.

From the stats the customer provided on our phone call, the CPU Ready for any one of the 4 vCPU’s on the BigBadServer was on average 3723ms (min: 1269ms, max:8491ms).  (Update 8/25/2010 to clarify summation stat) The summation for the entire VM was around 12,000ms on average and peaked around 35,000.  The stats came from the real-time performance  graph/table in the vSphere client. The real-time stats in the vSphere Client update every 20 seconds, so the CPU Ready summation value should be divided by 20,000 to get a percentage of CPU ready for the 20 second time slice.  If I take the worst case scenario of 8491ms per vCPU, this VM spent nearly 43% (8491/20,000) of the 20 second time slice waiting for CPU resources.

The CPU Ready summation in milliseconds counter in the vCenter Client is not always the most accurate or easy to interpret stat – to better quantify the problem it might be best to go to the ESX command line and run ESXTOP.  CPU Ready over 5% could be a sign of trouble, over 10% and there is a problem.  Running ESXTOP in batch mode and then analyzing the output using Windows Perfmon or Excel might be a good way to go on this to get a view over several hours rather than the realtime stats we were looking at.  I wrote a post a while back with more info on ESXTOP batch mode: http://vmtoday.com/2009/09/esxtop-batch-mode-windows-perfmon/

To help quantify the problem a bit more, the BigBadServer is on an ESX 4.0 server with about 10 other servers.  The physical blade has two dual-core CPU’s (AMD Opteron 2218HE’s which are not hyperthreaded).  The other VM’s on the blade have different vCPU and vMemory configurations.  3 VM’s (including BigBadServer) have 4 vCPU’s.  A couple have 2 vCPU’s, and the remainder are configured with 1 vCPU.  In ESX 4.x, the VMware console OS actually runs as a hidden VM, pegged to pCPU #1.

I generally recommend a pCPU:vCPU ration of 1:4 for mid-sized VMware deployments of single vCPU VM’s.  The blade we are running on is a 1:5 with several multi-vCPU VM’s.  The multi-vCPU’s start to skew the ratio recommendation and require some advanced design decisions.  VMware’s scheduler requires that all the vCPU’s on a VM run concurrently (even if the Guest OS is trying to execute a single thread).  Also, the VMware CPU Scheduler prefers to have all the vCPU’s from a VM run on the same pCPU.  As workloads are bounced around between pCPU’s, the benefits of CPU cache are lost.  This is one of those ‘more-is-less’ situations that you run into on virtualized environments.

What this CPU Scheduler nonsense means in this case is that the 4 vCPU’s on BigBadServer have to wait until all logical pCPU’s on the box are idle (including the one that runs ESX itself) before it can run.  If ESX can’t accomplish that (we are experiencing resource contention) it starts prioritizing workloads according to what it can best run.  It is much easier to schedule the smaller VM’s, so it tends to run those on pCPU more frequently.  The larger VM’s tend to suffer a bit more than the smaller ones.  We are competing with 2 other VM’s with 4 vCPU’s that use up all of the logical pCPU’s when they need to run, as well as with the smaller VM’s.

I suggested a few ways to fix this issue for the BigBadServer web server:

  1. Using Shares and/or Reservations on the VM.  This probably won’t work in our situation as the physical server is too over-subscribed.  We might see a slight improvement in BigBadServer (or we might not see any change), but possibly at the extreme expense of the other VM’s sharing the blade.
  2. Reduce the number of vCPU’s on BigBadServer AND the other multi-vCPU VM’s on the same physical server.  This would reduce resource contention and open up a whole bunch of scheduling options for the VMware CPU Scheduler.  This is the quickest/cheapest fix, but will not work if the VM’s really do need 4 vCPU’s.  A little workload analysis should determine which can be made smaller (the vCenter server graphs/stats should be enough for this).  For what it’s worth, by our analysis BigBadServer seems to be happier with 4 vCPU assuming we can run with a low CPU Ready on those 4.
  3. Move the BigBadServer VM to a physical ESX server with fewer multi-vCPU VM’s so there is less contention.
  4. Move the BigBadServer VM to a physical ESX server with quad-core pCPU’s (ideally two quad-cores or bigger).  This would give a lot more flexibility to the VMware CPU Scheduler and allow it to run quad-vCPU VM’s on the same pCPU for greater efficiency.
  5. Split BigBadServer into 2 smaller VM’s – The server currently runs a couple sites.  We could split them onto two servers – one for Project1 and one for Proejct2.  This configuration would take some design, testing, and time but could scale out better, give more flexibility and availability in the long run.

I’m not sure which way the customer will go on this one yet, but I feel good having armed them with enough knowledge and options to make an informed decision.

To avoid problems like this in the future, I recommend these rules of thumb:

  • Design your hosts for your guests.  Taking your Guest VM sizes into account when designing your environment and choosing physical hardware is crucial if you need bigger VM’s.
  • Don’t make your VM’s bigger than you have to.  It is always easier to add resources than take them away.  Hot Add of CPU and Memory in vSphere make adding incredibly easy.
  • Monitor your environment for CPU Ready, Swapping, and other metrics that can indicate an inefficient design.
  • Call for help when you can’t figure out what is going on (I’m happy to help!).  VMware is super powerful, but some things can be downright backwards when it comes to resource allocation on a fixed set of hardware.

If you are looking for some resources to help explain CPU Scheduling a bit more, I recommend:

(Updated 8/25/2010 to include a few additional reference links and corrected summation divided by time slice to get accurate values)

{ 10 comments }