Update: This issue was fixed as of 3/15/2012 in ESXi 5.0 Update 1, per the original knowledge base article, VMware KB 2008144. To download ESXi 5.0/vCenter Server 5.0 Update 1, see the VMware Download Center.

 

I recently stumbled on two vSphere 5 ESXi networking bugs that I thought I would share. The issues look very similar at a cursory glance, but have different symptoms, troubleshooting steps, and implications for your architecture, so I’m going to split them into two separate posts. Because troubleshooting these issues was a real pain, I’ll provide some details on how to identify them in your environments and wrap up with a third post on what I believe to be some best practices to avoid these same problems and achieve greater redundancy and resiliency in your vSphere environments.

The Problem

Today, we’ll look at an ESXi 5 networking issue that caused massive iSCSI latency, lost iSCSI sessions, and lost network connectivity. I’ve been able to reproduce this issue in several environments, on different hardware configurations. Here’s the background information on how all this started: I upgraded an ESXi 4.1 host to ESXi 5 using vSphere Update Manager (VUM). Note that I did use the host upgrade image that contained the ESXi500-201109001 iSCSI fixes – if you are upgrading to vSphere 5 and have iSCSI in your environment, use this image. Here’s a quick look at how the networking was configured on this host:

The iSCSI networking was configured in a very typical setup, per best practices as outlined in VMware’s documentation and by many vendors (see EMC’s Chad Sakac’s ‘A Multivendor Post on using iSCSI with VMware vSphere’): two vmnic uplinks and two vmknics, each vmknic with one active adapter on the correct layer-2/layer-3 network and the other adapter set to unused.

vSwitch iSCSI vmknic override failover order with unused NIC

After the upgrade, the standard vSwitch with two vmnics for uplinks (Broadcom NetXtreme II BCM5709 1000Base-T) and two vmknics that serviced the software iSCSI adapter failed to pass traffic (vmkping to the iSCSI targets failed), and the host could not mount ANY iSCSI LUNs. VM network, management, and vMotion ports were not affected.

If I let the host sit long enough, it *might* find a couple of paths to the storage, but even then performance had deteriorated, per vmkernel.log:

WARNING: ScsiDeviceIO: 1218: Device naa.60026b90003dcebb000003ca4af95792 performance has deteriorated. I/O latency increased from average value of 5619 microseconds to 495292 microseconds.
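
If you need to pull these warnings out of a busy vmkernel.log, a small filter along these lines can help. This is only a sketch: it assumes the warning is worded exactly like the line above, and the function name latency_jumps is my own invention, not a VMware tool.

```shell
# latency_jumps: read vmkernel.log lines on stdin and, for every
# "performance has deteriorated" warning, print:
#   <device> <old latency us> <new latency us> <integer multiple>x
# Assumes the message wording shown in the warning quoted above.
latency_jumps() {
  awk '/performance has deteriorated/ {
    dev = ""; old = ""; cur = ""
    for (i = 1; i <= NF; i++) {
      if ($i == "Device") dev = $(i + 1)
      if ($i == "of" && $(i + 2) == "microseconds") old = $(i + 1)
      if ($i == "to" && $(i + 2) ~ /^microseconds/) cur = $(i + 1)
    }
    if (dev != "" && old > 0)
      printf "%s %d %d %dx\n", dev, old, cur, int(cur / old)
  }'
}
```

On an ESXi 5 host this would typically be run as latency_jumps < /var/log/vmkernel.log. For the warning above it prints the device followed by 5619 495292 88x, which makes the roughly 88-fold jump in latency hard to miss.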

Troubleshooting

I’m going to dump a whole bunch of my troubleshooting steps on you – hopefully they not only help folks dealing with this particular bug, but help with general network and configuration troubleshooting in VMware vSphere. During troubleshooting, I removed the bindings for the two vmk ports on the iSCSI adapter, removed the software iSCSI adapter itself, removed the vmknics on the vSwitch, and removed the vSwitch itself. I then recreated the vSwitch, set the vSwitch MTU to 9000, recreated the two vmk ports, set their MTU to 9000, assigned IP addresses, and set the failover order for multipath iSCSI. I then re-created the software iSCSI adapter and bound the two vmk ports. I was able to pass vmk traffic and mount iSCSI LUNs. Great – problem solved!?!?! Not so much – I rebooted the host and the problem returned.

Here are my next troubleshooting steps:

  • I repeated the procedure above and re-gained connectivity, but the problem returned on every subsequent reboot. I could reliably recreate the problem.
  • I verified end-to-end connectivity for other hosts on the same Layer 1, Layer 2, and Layer 3 network as the iSCSI initiator and iSCSI targets.
  • I verified the ESXi host’s networking configuration using the vSphere client, double-checking the vSwitch, vmnic uplinks, and vmknic configurations. Everything looked good so I canceled out.
  • I then reinstalled ESXi from scratch (maybe something was left over from 4.1 that a clean install would weed out), built up the same configuration, and was again able to re-create the problem.
  • I pored over logs (vmkernel.log, syslog.log and storagerm.log primarily). I could see an intermittent loss of storage connectivity, failure to log into the storage targets (duh – there is no connectivity, no vmkping) and high storage latency on hosts where I had rebuilt the iSCSI stack and run a few VMs.
  • I switched out the Broadcom NIC for an Intel NIC (the Broadcom had hardware iSCSI capabilities – I wanted to be sure the hardware iSCSI was not interfering).
  • I verified TOE was enabled.

The ‘Ah-Ha’ Moment

Next, I verified the ESXi host’s networking configuration using the vSphere client one more time – the properties of the vSwitch, the properties of the vmkernel (vmk) ports, the manual NIC teaming overrides, IP addressing, etc. Everything looked correct – I MADE NO CHANGES – but when I clicked OK (last time I canceled) to close the vSwitch properties, I was greeted with this warning:

changing an iscsi initiator port group warning

Wait a second… I didn’t change anything, so why am I being prompted with a ‘Changing an iSCSI Initiator Port Group’ warning? I like to live dangerously, and wanted to see what would happen, so I said ‘Yes’.

Much to my surprise, after only viewing and closing the vSwitch and iSCSI vmk port group settings, I was able to complete a vmkping on the iSCSI-bound vmk’s. And moreover, I completed a Rescan of all storage adapters and my iSCSI LUN’s were found, mounted, and ready for use. Problem solved? Nope. The same ugly issue re-appeared after a reboot.

While the problem wasn’t solved, I now had something to work with. My go-to troubleshooting question “What Changed?” could maybe be answered. Even though I didn’t change anything in the vSwitch Properties GUI, something changed. To see what changed in the background, I compared the output of the following ESXi Shell (or vCLI, or PowerCLI) commands before and after making ‘the change’ happen (by viewing the properties of the vSwitch/vmk ports), but found no changes.

  • esxcfg-vswitch -l
  • esxcfg-vmknic -l
  • esxcfg-nics -l

Then, I made a backup copy of esx.conf:

 cp /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

Then I caused ‘the change’ and then compared checksums using md5sum, but found no differences:

 md5sum /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

I compared the running .conf and the backup .conf, but found no differences:

 diff /etc/vmware/esx.conf /etc/vmware/esx.conf.bak

Call in Air Support
At this point, I was out of ideas so I called for help: “Hello, 1-866-4VMWARE, option 4, option 2 – help!”

After repeating many of the same troubleshooting steps, the support engineer decided that I had hit on a known, and not yet patched, bug. The details of the bug are included in KB 2008144: Incorrect NIC failback occurs when an unused uplink is present. That’s right – my iSCSI traffic, vmkpings, etc. were being sent down the wrong NIC – the UNUSED NIC. Ouch. The bug caused the networking stack to behave in a very unpredictable way, making my troubleshooting steps, and any other advanced troubleshooting ideas I had (sniffing, logs, etc.), next to useless.

Once I knew what the issue was, I could see a bit of evidence in the logs:

WARNING: VMW_SATP_LSI: satp_lsi_pathIsUsingPreferredController:714:Failed to get volume access control data for path "vmhba33:C0:T0:L4": No connection

NMP: nmp_DeviceUpdatePathStates:547: Activated path "NULL" for NMP device "naa.60026b90003dcebb0000c7454d5cc946".

WARNING: ScsiPath: 3576: Path vmhba33:C0:T0:L4 is being removed

Notice the NULL path – the path can’t be interpreted correctly when traffic is being sent down the wrong (unused) vmnic that is on a different subnet and VLAN. The gotcha on this issue is that I had followed best practices where applicable, and accepted default settings on the vSwitch and vmknics.
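
If you want a quick way to check a log for the same three symptoms, something like this rough filter would do. The function name path_loss_hints is made up for illustration, the patterns are taken straight from the three lines quoted above, and the first message is specific to the VMW_SATP_LSI plugin, so other arrays will likely log something different. A hit is a hint to go read KB 2008144, not proof of the bug.

```shell
# path_loss_hints: count log lines on stdin that look like the three
# symptoms quoted above (failed SATP query, NULL path activation,
# path removal). Prints the count; exits non-zero when nothing matches.
path_loss_hints() {
  grep -c -E 'Failed to get volume access control data|Activated path "NULL"|Path vmhba[0-9]+:C[0-9]+:T[0-9]+:L[0-9]+ is being removed'
}
```

On the host, path_loss_hints < /var/log/vmkernel.log gives a single number to compare before and after a reboot.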

The Quick Fix
VMware KB 2008144 offers two workarounds for this bug. The quick fix is to change the Failback setting to “No” (the default is “Yes”), either on the vSwitch that carries the software iSCSI vmknics, or on the vmknic port groups themselves if you have other port groups on the vSwitch (such as a VM Network port group that gives your guest VMs access to the iSCSI network).

Change vSwitch or Portgroup Failback

Changing Failback = No on the iSCSI vmknics and then rescanning the storage adapters fixes the glitch immediately.
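
I made the change in the vSphere client, but the same setting should be reachable from the ESXi Shell with esxcli. The sketch below is hedged accordingly: the vSwitch and port group names you pass in (for example vSwitch1, iSCSI1, iSCSI2) are placeholders that do not come from my environment, the helper names are invented, and you should confirm the options with --help on your own build. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
# Dry-run wrapper: print the command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Option 1: turn Failback off for a whole standard vSwitch.
set_vswitch_failback_off() {
  run esxcli network vswitch standard policy failover set \
    --vswitch-name "$1" --failback false
}

# Option 2: turn Failback off only on the named port groups
# (use this when other port groups share the vSwitch).
set_portgroup_failback_off() {
  for pg in "$@"; do
    run esxcli network vswitch standard portgroup policy failover set \
      --portgroup-name "$pg" --failback false
  done
}

# Rescan every storage adapter so the iSCSI LUNs are rediscovered.
rescan_storage() {
  run esxcli storage core adapter rescan --all
}
```

A typical sequence would be set_portgroup_failback_off with your two iSCSI port group names, followed by rescan_storage. Reviewing the printed commands first is the whole point of the dry-run default.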

Architecture Changes
The second workaround from VMware is “Do not have any unused NICs present in the team.” This translates to a slightly different architecture than that described in many documents. To achieve this workaround, the configuration would have to change to two vSwitches, each with a single vmnic uplink and a single vmk port bound to the iSCSI adapter. This change does not impact redundancy or availability when compared with the single-vSwitch, two-vmk configuration that I was running, because one of the vmnics was set to unused for each vmk port anyway. This workaround does add a bit more complexity, as there are a few more elements to configure, monitor, manage, and document.

This problem seems to present itself only on vSphere Standard Switches (vSwitch), although I could not get confirmation of this (please post a comment if you know!). Assuming this is true, a vSphere Distributed Switch (vDS) could be used for Software iSCSI traffic. Mike Foley has a write-up on how to migrate iSCSI from a vSwitch to a vDS on his blog here: http://www.yelof.com/?p=72.

A Couple More Notes
My troubleshooting fix of viewing the vSwitch settings and clicking ok seemed to temporarily resolve the issues because it triggered an up/down event on the vmk of the unused uplink. This caused the network stack to re-evaluate paths and start using the correct, Active, uplink.

Note that this problem can occur outside of my iSCSI use case – any vSwitch, Port Group, or VMKNIC with an unused adapter set in the NIC Teaming Failover Order is susceptible to this bug, so watch for it on redundant vMotion networks (vMotion randomly fails), VM Network networks (sudden loss of guest connectivity), or even your management network (hosts fall out of manageability from vCenter, and can’t be contacted via SSH, vSphere client, etc.).
Leave a comment if you’ve experienced this bug – your notes on the problem may help others find and fix the issue until VMware releases a fix. I understand that a fix for this particular bug is not due out until at least vSphere 5 Update 1.
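
To check whether a given port group matches the susceptible pattern described above (an unused adapter in the failover order while Failback is still on), a check like the following could be fed the text printed by esxcli network vswitch standard portgroup policy failover get. Treat it as a sketch: the function name at_risk is invented, and the field labels it looks for ("Failback" and "Unused Adapters") are my assumption about how that output is laid out, so compare against what your host actually prints.

```shell
# at_risk: read a port group's failover policy text on stdin.
# Exit 0 (at risk) when Failback is true AND at least one unused
# adapter is listed; exit 1 otherwise.
# Assumed input layout: "   Failback: true", "   Unused Adapters: vmnic3"
at_risk() {
  awk -F': *' '
    {
      sub(/^[ \t]+/, "", $1)       # drop leading indentation from the label
      gsub(/[ \t\r]+$/, "", $2)    # drop trailing whitespace from the value
    }
    $1 == "Failback"        { fb = $2 }
    $1 == "Unused Adapters" { unused = $2 }
    END { exit !(fb == "true" && unused != "") }
  '
}
```

Piping the policy output for each iSCSI, vMotion, and management port group through at_risk and printing a warning on success gives a quick inventory of where the workaround is still needed.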

I’ll have another (shorter) writeup on the 2nd networking bug I found in ESXi 5 later in the week – check back here for a link once it is published.

{ 26 comments }

Time to make it official!  I have moved to Clearpath Solutions Group where I’ll take on the role of Virtualization Practice Manager, focusing on delivering VMware solutions and services to our VMware, Cisco, and EMC customers.  I’m looking forward to joining a high energy company where I can focus on the VMware technologies that I am passionate about.  I’m going to hit the ground running with VMware View implementations, Site Recovery Manager (SRM) work, and vSphere upgrades on EMC storage and Cisco UCS servers.

A few perks that I am looking forward to in this new job:

Leaving my role as IT Manager at Tiber Creek Consulting was not an easy decision, but it is time to move from being a constantly distracted, generalist IT Manager to something a bit more focused. Tiber Creek has been amazingly flexible and encouraging over the past three years as Stephanie Townsend and I have worked through her health issues and the stress they impose on our family (with Stephanie having improved over the past few months after a new procedure to patch her cerebrospinal fluid (CSF) leaks, I feel some freedom to go after great new things). Tiber Creek is a great group of people doing great work for our armed forces, and I’ll miss the great team that I have been a part of for the past three years. Thanks for everything, Tiber Creek!

Here’s to an awesome new role, at an awesome new company.  Thumbs up – let’s do this thing!

{ 0 comments }

DC VMUG – January 17, 2012 at Nationals Park

by Joshua Townsend on January 6, 2012 · 0 comments

in VMUG

This is a re-post from http://dcvmug.com/dc-vmug-january-17th-2011/.  Please follow the DC VMUG site for updates!

The Washington DC VMUG invites you to the first meeting of the DC VMUG in 2012. The event will be held on Tuesday, January 17th, 2012 at Nationals Park.

Our sponsors, Tintri and Veeam, will have some great give-aways for attendees. We’ll have a tour of the new Nationals Park at the end of the VMUG meeting.

Register Now!

Agenda

7:30am – 8:00am: Registration & Breakfast
8:00am – 8:45am: Tintri Presentation
8:45am – 9:00am: Break
9:00am – 9:45am: Apps in the Enterprise: Horizon App Manager vs. Citrix XenApp; Michael Letschin, vExpert, Convergence Technology Consulting
9:45am – 10:00am: Break
10:00am – 10:45am: Veeam Presentation
10:45am – 11:15am: Ask the Experts and Closing Remarks
11:30am: Nationals Park Tour

Location

Washington Nationals Park

1500 South Capitol St., SE
Washington, DC 20003

Free Parking in Lot C, or Metro to Navy Yard (Green Line) and walk 0.3 mi SW to Nationals Park

Proceed to Red Porch Restaurant

{ 0 comments }

I’ve been asked several times recently to recommend training resources for VMware, so I thought I might write my responses up in a blog post to help out folks in the community who are looking for the best resources to gain VMware knowledge, prepare for their VCP and other certifications, and continue on their journey to becoming a virtualization rockstar.

I’ve picked up a bunch of certifications over the past 10 years. For me, certification is not a means to an end, but the end of a long stretch of intensive studying, lab work, and hands-on doing. By the time I get to the test, passing should be a foregone conclusion. I’ll save details of my lab for a future post and focus on the books and other learning resources that I use. When getting into a new or updated technology, I start out my studying with a good overall survey of the technology I want to learn. I like a good book that hits all of the major components and provides background information to help explain why the technology, component, or module really matters and how it fits into the big picture. Then I get into technology-specific books – deep dives, command line references, and architecture books.

Books

Mastering VMware vSphere 5, by Scott Lowe

My go-to book for VMware vSphere has been Scott Lowe’s Mastering VMware vSphere 4.  Scott’s updated book, Mastering VMware vSphere 5 started shipping yesterday.  Scott covers everything from the basics of what a hypervisor is to VMware vSphere best practices.  This is a great book to accompany lab work as it includes licensing, planning and installation, setting up virtual networking, storage basics, security, resource allocation, HA, DRS, and even some automation with the CLI and PowerCLI (PowerShell).  The book is well written, taking you methodically through vSphere, while providing plenty of helpful hints along the way.  Do yourself a favor and click the picture to the left to order it from Amazon now (paperback or Kindle format).  This book is a great way to get started with studying for your VCP certification.

VMware vSphere 5 Clustering Technical Deepdive

Once I have the basics down, I get into the deep dive work. The first deep-dive book for VMware vSphere 5 is VMware vSphere 5 Clustering Technical Deepdive, by Duncan Epping and Frank Denneman.  This is Duncan and Frank’s second book that focuses on the clustering and high availability technologies available in VMware vSphere.  Readers of Duncan and Frank’s first book, VMware vSphere 4.1 HA and DRS Technical deepdive (Volume 1), got an incredibly deep look at how to configure VMware HA and DRS.  The new vSphere 5 Clustering Technical Deepdive includes Storage DRS as well.  I’ve talked to several readers of both these books and Duncan and Frank’s blogs who have remarked that 1.) I’ve been doing it wrong all along, 2.) I totally understand how HA and DRS work after reading this, and 3.) My environment really is resilient and reliable thanks to this book.

Pearson and VMware teamed up earlier this year to create VMware Press.  There are several books coming from VMware Press, as well as other authors/publishers that are now available for pre-order from Amazon.com.  These include:

There are not many vSphere 5 specific books out yet, but many of the vSphere 4 resources are still very useful.  My library includes these:

Video Training

If you are not a big reader or you are looking for additional topics, check out TrainSignal’s VMware Training Videos. TrainSignal offers a whole slew of courses (many taught by VMware vExperts), including:

  • vSphere 5 Training
  • VMware View Administration Training
  • vSphere Troubleshooting
  • vSphere Performance Monitoring
  • vSphere Security Design
  • vSphere PowerCLI.

I have a couple of TrainSignal DVDs and found them to be good quality with deep technical content.

Blogs and BrownBags

Once you are comfortable with the material, you can start to study for your VCP.  Several bloggers have published collections of materials to help you prepare for the VCP, VCAP, and even the VCDX.  I recommend Simon Long’s collection here: http://www.simonlong.co.uk/blog/vcp-vsphere-upgrade-study-notes/ and Cody Bunch’s VCP4 Resource Page and BrownBag sessions.

Instructor Led & Certification

Finally, once you are all read up, head to a VMware Education instructor-led class. You need to take a VMware Authorized Training course to qualify to sit for the VMware Certified Professional (VCP) certification exam. VMware also offers a nice catalog of eLearning courses. If you want to get a discount on eLearning, instructor-led training, and certification exams from VMware, check out the VMUG Advantage program.

Subscribe to VMUG Advantage

{ 1 comment }