On Wednesday, I wrote about a VMware vSphere 5 networking bug that caused issues with iSCSI networking. That bug, described in VMware KB 2008144, caused vmkernel (vmk) traffic to be sent over the unused vmnic uplink in teams configured with an unused uplink and an explicit failover order. See the diagram below to better understand what was going on there.
The second vSphere 5 networking bug I experienced was similar to the first: traffic was sent out of an unexpected interface after upgrading to ESXi 5. This particular bug surfaced while I was troubleshooting the iSCSI bug (because why not have two unrelated bugs at the same time?). Many of the troubleshooting steps from the first networking bug were used here as well, so I won’t bore you with those details again. I will, however, give you a quick overview of the network setup in which this issue appeared.
Configuration
Here’s layer 1 connectivity for ESXi host vmnics to the switching stack.
Here’s the ESXi network config:
The specific portion of the configuration that was impacted by this bug was vSwitch0, which contained vmnic0 & vmnic1, my Management Network vmknic and vMotion vmk port group. The Management and vMotion port groups had a manually set failover order as pictured below:
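If you prefer to work from the command line, a roughly equivalent way to inspect and set that failover order is with esxcli. This is only a sketch: the port group names match my setup, and the active/standby split shown here is the typical alternating pattern rather than an exact copy of the screenshot, so adjust it for your environment.

    # Show the current teaming/failover policy for the Management port group
    esxcli network vswitch standard portgroup policy failover get --portgroup-name="Management Network"

    # Management Network: vmnic0 active, vmnic1 standby
    esxcli network vswitch standard portgroup policy failover set --portgroup-name="Management Network" --active-uplinks=vmnic0 --standby-uplinks=vmnic1

    # vMotion: the reverse order, vmnic1 active, vmnic0 standby
    esxcli network vswitch standard portgroup policy failover set --portgroup-name="vMotion" --active-uplinks=vmnic1 --standby-uplinks=vmnic0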
This is all pretty standard network configuration for a VMware ESXi host with 6 physical network adapters, and it follows best practice for management network redundancy with VMware HA (I highly recommend reading more on HA best practices in Duncan Epping and Frank Denneman’s VMware vSphere 5 Clustering Technical Deepdive book).
The Problem
The problems that manifested as a result of the bug were:
- ESXi hosts would intermittently become unmanageable via vCenter, the vSphere Client, and SSH (which was enabled from the console of the hosts). Management connectivity could be restored (most times) by restarting the ESXi Management Network from the console. I could usually ping the management network IP address even when the host was not manageable.
- ESXi syslogs stopped being sent to the vCenter Syslog collector.
- vMotion between hosts in the cluster intermittently worked. vMotion success was not always in sync with management network connectivity. vMotion capabilities could be restored by restarting the ESXi Management Network from the console.
- As an added bonus, VMware High Availability (HA) would sometimes detect host failures and restart VMs on the surviving HA nodes.
Notice my use of ‘intermittently,’ ‘usually,’ and ‘sometimes’: this made for tough troubleshooting. If you’re gonna fail, fail big. None of this wimpy on-again, off-again nonsense.
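When a host was in this half-alive state, a few quick checks from the local console (the only way in when SSH is down) helped narrow things down. This is a rough sketch; the gateway address is illustrative.

    # List vmkernel interfaces and confirm the management vmknic is enabled
    esxcli network ip interface list
    esxcfg-vmknic -l

    # Verify basic reachability from the vmkernel stack (gateway IP is illustrative)
    vmkping 10.0.0.1

    # Check whether the management agents themselves are still running
    /etc/init.d/hostd status
    /etc/init.d/vpxa status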
The Resolution
Luckily, I had VMware support on the phone as this problem appeared. The support tech seemed to know just what the problem was:
A known issue on ESXi 5 occurs when two or more vmkernel NICs (vmknics) are on the same standard vSwitch. Under this configuration, traffic may be sent out the incorrect vmknic.
As far as I am aware, there is no VMware Knowledgebase article for this issue yet (comment if you know of one), so details are based on my own conversation with the support engineers working the case. From what I was able to infer, this bug appears:
- More often when ESXi hosts are under stress (my iSCSI networking bug really stressed out the hosts, and me, when all paths were down)
- Seems to happen more often on Broadcom NICs than on Intel NICs
- Triggered and/or fixed by a network up/down event (such as restarting the management network on the host).
- Does NOT happen with a Distributed Virtual Switch.
- Is scheduled to be patched with or after vSphere 5 Update 1.
Whereas my iSCSI bug involved a vmnic team with an unused uplink and traffic being sent out the wrong vmnic, this second bug occurred with two vmnics (one active, one standby) and two vmk ports on a standard vSwitch, with traffic being sent out the wrong vmk port, which happened to have a different active vmnic than the correct vmk port. Here’s a diagram of the traffic flow gone wrong:
I still find it a bit odd that ICMP traffic continued to flow to the interface, but that the Management traffic took an alternate route out and landed on my non-routed vMotion VLAN (and different subnet).
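If you want to catch the host in the act, you can watch where the traffic is actually leaving from the ESXi shell. A rough sketch (vmk numbers are illustrative; tcpdump-uw ships with ESXi 5 and captures on vmkernel interfaces):

    # Map vmk ports to port groups and IP addresses
    esxcfg-vmknic -l
    esxcli network ip interface list

    # Show the vmkernel routing table
    esxcfg-route -l

    # Capture on the vMotion vmknic; management traffic (e.g. TCP 902/443)
    # showing up here means it is leaving the wrong vmk port
    tcpdump-uw -i vmk1 -n not port 8000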
Workaround
The workaround for this bug is simple: remove the second NIC and the second VMkernel port (vMotion, in my case) from the vSwitch and restart the ESXi Management Network. Once this was done, management traffic flowed normally.
I then created a new vSwitch, attached the second vmnic to it, and re-created the VMkernel port for vMotion.
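For reference, the whole workaround can be done from the ESXi shell with commands along these lines. This is only a sketch: the vSwitch name, port group names, vmk numbers, VLAN, and IP address are from my environment or purely illustrative, and the Management Network restart itself was done from the DCUI.

    # Pull the vMotion vmknic, port group, and second uplink off vSwitch0
    esxcli network ip interface remove --interface-name=vmk1
    esxcli network vswitch standard portgroup remove --portgroup-name=vMotion --vswitch-name=vSwitch0
    esxcli network vswitch standard uplink remove --uplink-name=vmnic1 --vswitch-name=vSwitch0
    # ...then restart the Management Network from the DCUI

    # Recreate vMotion on its own vSwitch
    esxcli network vswitch standard add --vswitch-name=vSwitch3
    esxcli network vswitch standard uplink add --uplink-name=vmnic1 --vswitch-name=vSwitch3
    esxcli network vswitch standard portgroup add --portgroup-name=vMotion --vswitch-name=vSwitch3
    esxcli network vswitch standard portgroup set --portgroup-name=vMotion --vlan-id=100
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=vMotion
    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=172.16.1.11 --netmask=255.255.255.0 --type=static

    # Re-enable vMotion on the new vmknic (vim-cmd syntax on ESXi 5.0)
    vim-cmd hostsvc/vmotion/vnic_set vmk1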
While the workaround was great for getting my hosts back into manageability, it was not so great for the redundant architecture I had originally implemented. After splitting the VMkernel ports onto two different vSwitches, I received warnings in vCenter that “Host currently has no management network redundancy.” KB 1004700 addresses this message if you are looking for more info on it. I could disable the warning, but that would be like slapping a fresh coat of paint on a jalopy.
Architecture Changes
The workaround for this bug kills redundancy. One way to restore it is to add another two physical NICs and, in my case, bind one to the Management vSwitch and one to the vMotion vSwitch. This change would require host downtime to install the new hardware if your host only has 6 NICs, as this environment did.
Alternatively, you could migrate your Management and vMotion networks to a vSphere Distributed Switch (vDS), as this bug does not appear to impact vDS, only standard vSwitches. Side note: Check Duncan Epping’s post on using a virtual vCenter server connected to a vDS if that’s holding you back from going to a vDS. Also read the new vDS Best Practices whitepaper from VMware.
Final Note
This bug could impact more configurations than the one I highlighted. For example, I could see it causing issues with Multiple-NIC vMotion in vSphere 5.
Drop a comment if you have experienced this bug, know of a KB article, or can think of any other ways it might be manifested.
Update (Nov 8 2012): I received an email from Nick Eggleston about this issue – Nick experienced this problem at a customer site and heard from VMware support that “There is a vswitch failback fix which may be relevant and has not yet shipped in ESXi 5.0 (it postdates 5.0 Update 1 and is currently targeted for fixing in Update 2). However this fix *has* already shipped in the ESXi 5.1 GA release.” Thanks, Nick!