On Wednesday, I wrote about a VMware vSphere 5 networking bug that caused issues with iSCSI networking. That bug, described in VMware KB 2008144, caused vmkernel traffic to be sent over the unused vmnic uplink in a team with an explicit failover policy. See the diagram below to better understand what was going on there….
The second vSphere 5 networking bug I experienced was similar to the first: traffic was sent out of an unexpected interface after upgrading to ESXi 5. This particular bug surfaced while I was troubleshooting my iSCSI bug (because why not have two unrelated bugs at the same time). Many of the troubleshooting steps I used on the first networking bug were employed on this one, so I won’t bore you again with the details. I will, however, give you a quick overview of the network setup in which this issue appeared.
Here’s layer 1 connectivity for ESXi host vmnics to the switching stack.
Here’s the ESXi network config:
The specific portion of the configuration impacted by this bug was vSwitch0, which contained vmnic0 and vmnic1, my Management Network vmknic, and my vMotion vmk port group. The Management and vMotion port groups had a manually set failover order, as pictured below:
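For reference, a failover order like this can be inspected and set from the ESXi shell with esxcli. This is a sketch only; the port group and vmnic names below come from my setup, so adjust for yours:

```shell
# Show the current teaming/failover policy for vSwitch0 (ESXi 5.x esxcli namespace)
esxcli network vswitch standard policy failover get -v vSwitch0

# Pin Management to vmnic0 (active) with vmnic1 as standby,
# and vMotion to vmnic1 (active) with vmnic0 as standby
esxcli network vswitch standard portgroup policy failover set \
    -p "Management Network" -a vmnic0 -s vmnic1
esxcli network vswitch standard portgroup policy failover set \
    -p "vMotion" -a vmnic1 -s vmnic0
```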
This is all pretty standard network configuration for a VMware ESXi host with 6 physical network adapters, and follows best practice for management network redundancy for VMware HA (I highly recommend reading more on HA best practices in Duncan Epping and Frank Denneman’s VMware vSphere 5 Clustering Technical Deepdive book).
The bug manifested as the following problems:
- ESXi hosts would intermittently fall out of manageability by vCenter, the vSphere Client, and SSH (which was enabled from the console of the hosts). Management connectivity could be restored (most times) by restarting the ESXi Management Network from the console. I could usually ping the management network IP address even though the host was not manageable.
- ESXi syslogs stopped being sent to the vCenter Syslog collector.
- vMotion between hosts in the cluster intermittently worked. vMotion success was not always in sync with management network connectivity. vMotion capabilities could be restored by restarting the ESXi Management Network from the console.
- As an added bonus, VMware High Availability (HA) would sometimes detect host failures and restart VMs on the surviving HA nodes.
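When a host fell out of manageability, restarting the Management Network from the console (DCUI) was the recovery step. If you still have shell access, the rough CLI equivalent is to bounce the management vmknic. A sketch, assuming vmk0 is your management interface:

```shell
# List vmkernel interfaces to confirm which vmk carries management traffic
esxcli network ip interface list

# Disable and re-enable the management vmknic (triggers the same kind of
# link up/down event as "Restart Management Network" in the DCUI)
esxcli network ip interface set -e false -i vmk0
esxcli network ip interface set -e true -i vmk0
```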
Notice my use of ‘intermittently’, ‘usually’, and ‘sometimes’ – this made for tough troubleshooting. If you’re gonna fail, fail big. None of this wimpy on-again, off-again nonsense.
Luckily, I had VMware support on the phone as this problem appeared. The support tech seemed to know just what the problem was:
A known issue on ESXi 5 occurs when two or more vmkernel NICs (vmknics) are on the same standard vSwitch. Under this configuration, traffic may be sent out the incorrect vmknic.
As far as I am aware, there is no VMware Knowledgebase article for this issue yet (comment if you know of one), so details are based on my own conversation with the support engineers working the case. From what I was able to infer, this bug appears:
- More often when ESXi hosts are under stress (my iSCSI networking bug really stressed out the hosts – and me – when all paths were down)
- Seems to happen more on Broadcom NICs than Intel
- Triggered and/or fixed by a network up/down event (such as restarting the management network on the host).
- Does NOT happen with a Distributed Virtual Switch.
- Is scheduled to be patched with or after vSphere 5 Update 1.
Whereas my iSCSI bug involved a vmnic team with an unused uplink and traffic being sent out the wrong vmnic, this second bug occurred with two vmnics (one active, one standby) and two vmk ports on a standard switch, with traffic being sent out the wrong vmk port – which happened to have a different active vmnic than the correct vmk port. Here’s a diagram of the traffic flow gone wrong:
I still find it a bit odd that ICMP traffic continued to flow to the interface, but the Management traffic took an alternate route out and landed on my non-routed vMotion VLAN (and a different subnet).
The workaround for this bug is simple – remove the second NIC and the second VMkernel port (vMotion, in my case) from the vSwitch and restart the ESXi Management Network. Once this was done, management traffic flowed normally.
I then created a new vSwitch, attached the second vmnic to it, and then re-created the VMkernel port for vMotion.
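If you prefer the command line, the same workaround steps look roughly like this. A sketch only – the vSwitch name, vmk numbers, and the vMotion IP below are assumptions from my environment:

```shell
# 1. Pull the vMotion vmknic and the second uplink off vSwitch0
esxcli network ip interface remove -i vmk1
esxcli network vswitch standard uplink remove -v vSwitch0 -u vmnic1

# 2. Bounce the management vmknic (or restart the Management Network via DCUI)
esxcli network ip interface set -e false -i vmk0
esxcli network ip interface set -e true -i vmk0

# 3. Recreate vMotion on its own vSwitch
esxcli network vswitch standard add -v vSwitch3
esxcli network vswitch standard uplink add -v vSwitch3 -u vmnic1
esxcli network vswitch standard portgroup add -v vSwitch3 -p vMotion
esxcli network ip interface add -i vmk1 -p vMotion
esxcli network ip interface ipv4 set -i vmk1 -t static -I 10.0.1.11 -N 255.255.255.0

# 4. Re-enable vMotion on the new vmknic
vim-cmd hostsvc/vmotion/vnic_set vmk1
```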
While the workaround was great for getting my hosts back into manageability, it was not so great for the redundant architecture I had originally implemented. After splitting the VMkernel ports onto two different vSwitches, I received warnings in vCenter that “Host currently has no management network redundancy.” KB 1004700 addresses this message if you are looking for more info on it. I could disable the warning, but that would be like slapping a fresh coat of paint on a jalopy.
The workaround for this bug kills redundancy. Redundancy could be restored by adding another two physical NICs and, in my case, binding one to the Management vSwitch and one to the vMotion vSwitch. This change would require host downtime to install new hardware if your host only has 6 NICs, as this environment did.
Alternatively, you could migrate your Management network and vMotion networks to a virtual Distributed Switch (vDS) as this bug does not appear to impact vDS – only standard vSwitches. Side note: Check Duncan Epping’s post on using a virtual vCenter server connected to a vDS if that’s holding you back from going to a vDS. Also read the new vDS Best Practices whitepaper from VMware.
This bug could impact more configurations than the one I highlighted. For example, I could see it causing issues with Multiple-NIC vMotion in vSphere 5.
Drop a comment if you have experienced this bug, know of a KB article, or can think of any other ways it might be manifested.
Update (Nov 8 2012): I received an email from Nick Eggleston about this issue – Nick experienced this problem at a customer site and heard from VMware support that “There is a vswitch failback fix which may be relevant and has not yet shipped in ESXi 5.0 (it postdates 5.0 Update 1 and is currently targeted for fixing in Update 2). However this fix *has* already shipped in the ESXi 5.1 GA release.” Thanks, Nick!
Avram Woroch says
Thank god, this means I’m *not* crazy (well, not for this reason anyway). I’m having similar issues in my home lab, with vSwitch0 carrying Management and vMotion, and vSwitch1 carrying iSCSI. Having all sorts of issues, regardless of what kind of software SAN I’m trying.
After reading this I could only think, what a horrible example to publish; the architecture in place is incorrect to begin with, so it doesn’t really properly demonstrate the bug. Your illustration shows vmnic0 and vmnic1 wired to disparate switches, an iSCSI storage switch stack and a prod switch??? which is very strange, but let’s not drill into this piece too much, although it should have made everyone start scratching their heads.
Then – you have bonded two vmnics into a team where one can fail over to the other, yet have configured the upstream ports to point to different VLANs… “Wrong VLAN Wrong switch WRONG WRONG WRONG” you state; of course, you configured it wrong. Did you notice the observable VLAN ranges are different for the vmnics that you have bonded?
Your illustration should show that when traffic fails over to a different vmnic it can still reach the desired subnets, at least in a valid configuration. You can’t put vmnics in a team and expect the upstream switch ports they wire to to be configured differently with different VLANs. You should be able to fail over to any vmnic in the team and have traffic flow upstream to trunked ports that have the same VLANs configured.
Josh Townsend says
Thanks for the input, hater. Hopefully your comments help readers better design their own environments so they don’t end up the way I found this environment when the customer called for help.
Alastair Cooke says
Are the three physical switches interconnected?
If not then your problem is that your standard vSwitch uplinks don’t have Layer 2 adjacency. Layer 2 adjacency on all uplinks is a requirement for standard vSwitch behavior.
I would have your three vSwitches match up to your three physical switches; then you get proper failover between NICs to provide redundancy.
I think this affects any two-NIC / two-vmkernel-port setup. I’ve been setting up for a new test run, and with the 10Gb links (actually UCS vNICs) I have, separating NFS vmkernel ports from management vmkernel ports wasn’t a priority. I’ve been seeing random host disconnects as well as NFS datastore disconnects. I moved the vmkernel ports to separate vSwitches and it appears the problem has been solved.
I have another 50 hosts to set up for my testing, so I should know tomorrow if everything is actually fixed. If it is, I owe you a few rounds should we ever meet up.
Given that I have wasted about 3 days on this your blog post is welcome.
Joshua Townsend says
Jason – glad it helped. If you can recreate the problem, call VMware. Maybe we can get a KB written on this before it impacts more people.
Wow. I’m having the exact same problem. Thanks for the update.
I’m opening this up to our TAM to see if we can get more pressure on them to get a fix (or at least a KB).
Jim O'Boyle says
Hi, Was just checking back to see if there was any more information about this and in doing further research, tripped across KB 2008144 which covers both the iSCSI and management network problem. That was updated on 3/17 to state vSphere 5.0 Update 1 fixes this and is now available, fyi…
I ran into the same failure that you described here (tx!) during network configuration.
After a few tests, the problem does not occur anymore if I leave the gateway for the newly configured vmkernel port (vMotion) blank (so it uses the GW from the management network). This is like the configuration in ESX(i) 4.
Brandon Neill says
There is only one gateway configured for the vmkernel. The gateway field that you see each time you configure a vmkernel interface is always the same gateway; if you change it in one screen, it changes in all other screens. The gateway should always be configured as an IP on your management network, not on the iSCSI, NFS, FT, or vMotion interfaces.
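Brandon’s point about the single vmkernel gateway can be confirmed from the ESXi 5.0 shell with esxcfg-route. A sketch, with the gateway address below being an example value:

```shell
# Show the vmkernel routing table and default gateway (one per vmkernel stack)
esxcfg-route -l

# Set the default gateway – this applies to all vmk interfaces,
# so it should be an address on the management network
esxcfg-route 192.168.1.1
```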
Same issue here using 2x vMotion.
It seems to be fixed by removing the two vMotion ports, restarting management, and recreating vMotion.
I couldn’t find any KB around this at vmware.com.
It is not clear from the Update 1 release notes whether it fixed the second problem you found, or ONLY the “unused adapter” problem.
Can anybody clarify this, as we have some strange problems with Active/Standby on the vmotion/ management network?
Dustin Lema says
The problem still exists in U1. I’ve just duplicated this behavior.
If anyone has a fix (as opposed to the dVS workaround) I think we’d all appreciate the answer.
I have seen similar behavior on v126.96.36.199581 also using iSCSI with port bindings.
I think right now the option is to change to vDS.
Can anyone confirm that this behavior affects also Active/Standby configurations?
We are experiencing this same problem and we are on Update 1. I’m also having problems with iSCSI losing access to its datastores; however, I’m not using vmnics – I have QLogic HBA cards, and they are not configured under the networking section, nor are they set up in an active/standby config. We are using EMC PowerPath/VE version 5.7 to connect to our EMC SAN, and we have intermittent drops of datastores, dead paths, etc.

We have exhausted all options: we engaged Cisco for the switch side, EMC for the SAN side, and Dell for the server/VMware OS support, and because EMC and Dell see drops to the network they are blaming Cisco. However, Cisco sees no problem on our 3750X switches, which we use for our isolated iSCSI connectivity.

We are experiencing the host disconnects as a result of using VMware’s best practices for management and vMotion networks in an active/standby config using 2 NICs. I’m going to remove them and set them up on their own vSwitches with dedicated vmnics for each, but then I lose redundancy. Again, we have Update 1 on 90% of our hosts, and an even higher level than that on our latest servers: ESXi 5.0, build 702118. Does anybody have any KBs or any updates from VMware regarding this? Every vendor is pointing the finger at the other vendor, and I’m not getting any headway on my iSCSI problem. My QLogic cards are QLE4062C, running the latest VMware/EMC certified drivers and the latest QLogic firmware. PLEASE HELP!
Joshua Townsend says
Austin – I haven’t heard an update from VMware on this. I’ll reach out and see if I can get some updated information. Stay tuned!
Did you find anything about the datastore disconnections? We have a similar bug, but using Native Multipathing with an EMC CLARiiON. Everyone blames the others, and my configuration seems to be good.
Joshua Townsend says
Datastore disconnects could be due to a number of other factors – for example, older firmware on QLogic and Emulex HBAs can cause storage drops. iSCSI switches with buffer overruns can cause similar situations. I recommend pulling up the vmkernel log file and analyzing it for clues.
Kyle Wallace says
If you get a KB from VMware on this, please let me know. Would really like to know what update it will be fixed in.
Joshua Townsend says
I asked @VMwareCares on Twitter:
Josh Townsend @joshuatownsend 2 Aug
@VMwareCares Any ideas on this problem I blogged about – I’m still hearing from people that it’s an issue: https://bit.ly/NOaZZV . Thanks!
VMware Cares @VMwareCares 3 Aug
@joshuatownsend We’re still looking into this. We’ll post a KB article / patch when we can.
We’re still in a holding pattern…… I’ll post more when there are updates. –Josh
Joshua. Have you configured a trunk carrying the VLANs for management and vMotion over both vmnic0 and vmnic1? What happens when vmnic0 (or the physical switch connected to vmnic0) dies? Is the management network available on vmnic1? I don’t know for sure, but those look like just access ports with a single VLAN configured on each uplink.
In any case, try trunking those uplinks at the physical switch(es) and pass 802.1q-tagged VLANs over vmnic0/vmnic1. I’ve always done this and have the same basic setup as you, with Management and vMotion on one vSwitch in an active/standby config.
I believe I am seeing this with v5.0 update 1 as well but with NFS data stores.
I have a Management Network vmkernel and a NFS access vmkernel setup on two active/active physical nics. These are the only nics in the host.
Cloning/storage vMotion of a 4GB machine takes up to an hour with both vmkernels set up. If I remove the Management Network vmkernel and have vMotion, management, and NFS traffic go through just the one vmkernel, the clone/vMotion takes like 2-4 minutes.
Janåke Rönnblom says
Any news about this?
Joshua Townsend says
No news yet – this post still gets a ton of hits and my colleagues and I at Clearpath are still running into the issue. I’m hoping to get some lab time on vSphere 5.1 to see if I can reproduce the issue there….
Michal Rasinski says
I have some strange problems with Active/Standby on the vMotion/management network, too. One host works OK, and the other does not. I’ve tried different things, and when I changed the number of ports on the vSwitch to 56, the connection to management was OK. I will test it and give feedback.
Michal Rasinski says
I’ve checked, and it helps only for a moment. Still, when I add a physical adapter to the vMotion/management vSwitch, a connection to management can’t be established.
Nick Eggleston says
This issue is supposed to be resolved in 5.0 update 2 (forthcoming) and 5.1 (released). Can anyone test and post results?
Rene Rodriguez says
I have found that moving the standby adapter to unused resolves the issue with multi-NIC vMotion. Every time we tried active/standby on one vSwitch with two vmkernels for vMotion and two physical NICs in active/standby, only one NIC would do the vMotion work. After I moved the standby adapter to unused on each vmk, vMotion began using both links.
I’m on ESXi 5.0 Update 1.
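[Josh’s note: if you want to script the change Rene describes, listing only an active uplink in a port group’s failover policy should leave the unlisted uplink as unused rather than standby. A sketch, with my assumed multi-NIC vMotion port group names vMotion-1/vMotion-2:]

```shell
# Uplinks not listed as active (-a) or standby (-s) become "unused" for that port group
esxcli network vswitch standard portgroup policy failover set -p vMotion-1 -a vmnic0
esxcli network vswitch standard portgroup policy failover set -p vMotion-2 -a vmnic1
```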
Nick Eggleston says
Can you try the same on ESX 5.1? The underlying bug is supposed to be fixed in that release.
Anders Kongsted says
As far as I can see, the problem is there for ESXi.
Can anyone confirm or reject that?
Joshua Townsend says
Correct – this is an ESXi issue.
Kiriki Delany says
Has anyone seen whether this is possibly related to duplicate MAC addresses in a cluster?
We are seeing issues that could cause this behaviour due to duplicate MACs.
Yes we have a similar problem in ESX 5.1.
We have vSwitch0 with vmnic0 and vmnic1 on the same network, with a vmkernel management port and a vmkernel vMotion port.
When we migrated a vm using vmotion it would flood the other ports and cause problems within the production environment.
The workaround was to split vSwitch0 and create a separate vSwitch just for vMotion, and this was fine. The only downside was no redundancy for either management or vMotion.
We are thinking of creating a trunk as forbsy says.
The symptom I have, which may or may not be related, is that two VMs on the same VLAN had the SAME IP address and did not complain – no IP conflict warnings. My application just updated two different databases: the new one and the old one.
Fred C says
I have had a similar issue with a Flexible NIC on a Windows 2003 server that lost its connectivity during a Storage vMotion. I found that the vmxnet3 NIC did not suffer from the same problems. VMware will not acknowledge nor support the Flexible NIC, unfortunately, since it is deprecated in vSphere 5.1 U1. So stay away from Flexible, since its implementation is broken.
Josh Townsend says
Thanks Fred! Great info and recommendation.
Josh, keep up the good contribution to the web community. Hater, I think your comment is disgraceful! Why did you even comment?
Josh Townsend says
Thanks JD. No worries – I’ve got thick skin. Ken (hater) has some good points – could just work on his delivery.
Like the kids say, haters gonna be haters.