One of my most popular posts to date has been ‘vSphere 5 Networking Bug Affects Software iSCSI’, with some 20,000 page views and a bunch of comments. The problem is a bug that causes vmkernel traffic (including iSCSI) to be sent out the unused/inactive/null NIC on a vSwitch with a manual failover order and both Active and Unused vmnics. The issue I covered still appears to be affecting customers: we just had a new Clearpath customer call for help with this very problem. VMware’s KB on the issue (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2008144) suggests that it was fixed in vSphere 5.0 Update 1, but I’ve had several reports of the problem on ESXi hosts that were updated to 5.0 Update 1.
The recommended workaround was to set the failback mode on the vSwitch/Port Groups for iSCSI to ‘No’, to put your iSCSI vmknics on separate standard vSwitches, or to migrate them to a vSphere Distributed Switch (VDS). Many people opted to simply change the failback mode to ‘No’ to avoid having to reconfigure networking and potentially take downtime. That may not be the best approach…
Chris Towles just posted a comment on my original article about a problem he encountered when setting failback to ‘No’ with Dell EqualLogic arrays; read his full post here: http://www.christowles.com/2012/09/vmware-esxi-50-update-1-sending-traffic.html. It seems that the quick fix may now actually be breaking things with vSphere 5.0 Update 1. No word yet on whether this bug is present in vSphere 5.1.
I’d now change my recommendation: rather than using the failback option to work around this problem, use separate vSwitches or migrate your iSCSI to a VDS. It’s a bit more work, but it may help to avoid further problems. Now that vSphere 5.1 is available with an enhanced VDS, putting iSCSI on the distributed switch is probably a good plan anyway. You’ll benefit from the end-to-end health checks to ensure correct configuration of VLANs and Jumbo Frames, as well as VDS Configuration Backup and Restore, and NetFlow version 10 (IPFIX) support for better troubleshooting/analysis.
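If you want a quick way to see which of your port groups match the layout that triggers the bug, here is a minimal Python sketch. It does not talk to vCenter or use any vSphere API: `PortGroup`, `matches_bug_layout`, `audit` and the `inventory` list are all names I made up for illustration, and the data is assumed to be typed in by hand from what you see in the vSphere Client. Treat it as a checklist in code form, not a supported tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PortGroup:
    """Hypothetical, hand-entered view of one port group's teaming policy."""
    name: str
    vswitch: str
    is_vmkernel: bool          # True for vmknic port groups such as iSCSI
    active: List[str]          # vmnics set to Active in the failover order
    unused: List[str] = field(default_factory=list)  # vmnics set to Unused
    failback: bool = True      # vSphere defaults to Failback = Yes


def matches_bug_layout(pg: PortGroup) -> bool:
    """A vmkernel port group that has both Active and Unused vmnics."""
    return pg.is_vmkernel and bool(pg.active) and bool(pg.unused)


def audit(port_groups: List[PortGroup]) -> List[Tuple[str, str]]:
    """Return (name, note) for every port group matching the bug layout."""
    findings = []
    for pg in port_groups:
        if not matches_bug_layout(pg):
            continue
        if pg.failback:
            note = "Active+Unused uplinks, no workaround applied"
        else:
            note = "Active+Unused uplinks, Failback=No workaround in use"
        findings.append((pg.name, note))
    return findings


# Example inventory (made-up names and vmnic numbers).
inventory = [
    PortGroup("iSCSI-A", "vSwitch1", True, ["vmnic2"], ["vmnic3"]),
    PortGroup("iSCSI-B", "vSwitch1", True, ["vmnic3"], ["vmnic2"], failback=False),
    PortGroup("iSCSI-C", "vSwitch2", True, ["vmnic4"]),
    PortGroup("VM Network", "vSwitch0", False, ["vmnic0"], ["vmnic1"]),
]

for name, note in audit(inventory):
    print(f"{name}: {note}")
```

With this sample data only iSCSI-A and iSCSI-B are flagged. iSCSI-C sits on its own vSwitch with a single active uplink, which is the layout I’m recommending, and the VM Network port group is skipped because it carries no vmkernel traffic.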
If you have up-to-date info on this bug or other suggestions for working around the problem, please post a comment! I haven’t been able to get any good info from VMware on the status of this one.