Posts Tagged ‘VMware’

Please join us for the upcoming Washington DC VMware® User Group meeting on Tuesday, March 10th, 10:30 a.m. – 3:00 p.m.

This is a great opportunity to meet with your peers to discuss virtualization trends, best practices and the latest technology.

Agenda
10:30 a.m. Check-In
10:50 a.m. Opening Remarks
11:00 a.m. Compellent Presentation
11:45 a.m. Lunch
12:30 p.m. VMware ThinApp Presentation by Jason Langone
01:30 p.m. Beverage Break
01:45 p.m. Veeam Presentation
02:30 p.m. Wrap-up session

Register today to join us for this free informative event. Space is limited, so respond as soon as possible to reserve your seat.

Location:
Westin City Center
National Ballroom A
1400 M Street NW
Washington, DC 20005

As one of the leaders of the Washington, DC VMware User Group, I invite you to reach out to me with questions or comments, to volunteer to present at an upcoming meeting, or just to introduce yourself.

You can stay in touch with the group by visiting http://dcvmug.com (which redirects to http://communities.vmware.com/community/vmug/forums/us_northeast/dc) or our LinkedIn Group.

I am finishing up an installation of an EMC Clariion CX4 SAN. One of the final steps of the installation is to configure PowerPath/VE on the ESXi hosts. PowerPath/VE is EMC’s multipathing extension module for VMware (and Hyper-V), designed to replace the Native Multipathing Plugin (NMP) for increased I/O performance and failover management.  To simplify and automate the installation of PowerPath/VE, I decided to use VMware Update Manager (VUM) to push the extension to the ESXi 4.x hosts in the environment.

The process of setting up an additional VUM patch repository to host PowerPath/VE (and other third-party extensions such as the Cisco Nexus 1000v) is pretty straightforward.  Third-party extensions are supported in VUM beginning with vSphere 4.0 Update 1.  Chad Sakac has posted a great video guide on YouTube that covers the setup:

I opted to use the Tomcat installation on the environment’s vCenter server to host the PowerPath/VE repository.  To accomplish this, I simply created a new directory in the Tomcat root directory.  The default path for the root directory on a vSphere vCenter Server is “C:\Program Files\VMware\Infrastructure\tomcat\webapps” (or C:\Program Files (x86)\VMware\Infrastructure\tomcat\webapps on a 64-bit installation).

I created a directory named ‘depot’ and within that directory created a PowerPathVE folder.  Into that folder I extracted the contents of the VUM folder from the PowerPath .zip file that I downloaded from http://powerlink.emc.com.  A screenshot of the directory is below:

PowerPath/VE Depot Directory Tree

After creating the directory for the patch repository, I simply added an Extension Repository to VMware Update Manager as Chad shows in his video.  I would like to call out one caveat: because vCenter may not listen on standard HTTP/HTTPS ports, I used https://vcenter.domain.local:8443/depot/PowerPathVE/index.xml as the path to the source.

VUM Patch Source
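If you host more than one extension this way, it can help to build the source URL the same way every time.  Below is a minimal Python sketch that assembles the path shown above.  The function name and its defaults are my own illustration (port 8443 and the depot/PowerPathVE folder simply mirror this environment), so adjust them to match your vCenter server.

```python
from urllib.parse import urlunsplit


def depot_index_url(host: str, port: int = 8443,
                    depot_path: str = "depot/PowerPathVE") -> str:
    """Build the URL of the index.xml file for an extension depot.

    The defaults mirror the setup described in this post; they are
    assumptions, not values that VUM requires.
    """
    # Normalize the folder path so stray slashes do not double up.
    path = "/" + depot_path.strip("/") + "/index.xml"
    return urlunsplit(("https", f"{host}:{port}", path, "", ""))


if __name__ == "__main__":
    print(depot_index_url("vcenter.domain.local"))
```

Pasting the printed URL into a browser on the vCenter server is a quick way to confirm that Tomcat is actually serving index.xml before you add the source to VUM.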

Once PowerPath was added to an Extension Baseline in VUM, I simply had to scan my hosts for updates and remediate.  Installation of PowerPath/VE requires the host to be in Maintenance Mode and concludes with a reboot.  Pretty simple.

Then all you have to do is fight through an overly complex licensing setup (seriously, a 112-page PDF on how to install licenses???), a bit of configuration, and you are multipathing with the best of them.  If you are interested in learning more about PowerPath/VE, start with this whitepaper: EMC PowerPath/VE for VMware vSphere Best Practices Planning.  For a bit of real-world insight into the performance increase you might see with PowerPath/VE, check out this blog post from Eric Sloof: Massive I/O power increase using EMC PowerPath/VE.

In parts I, II, and III of the Storage Basics series we looked at the basic building blocks of modern storage systems: hard disk drives.  Specifically, we looked at the performance characteristics of disks in terms of IOPS and the impact of combining disks into RAID sets to improve performance and resiliency.  Today we will have a quick look at another piece of the puzzle that impacts storage performance: the interface.  The interface, for lack of a better term, can describe several things in a storage conversation.  Let me break it down for you (remember, we’re keeping it simple here).

At the most basic level (assume a direct-attached setup), ‘interface’ can be used to describe the physical connections required to connect a hard drive to a system (motherboard/controller/array).  The ‘interface’ extends beyond the disk itself, and includes the controller, cabling, and disk electronics necessary to facilitate communication between the processing unit and the storage device.  Perhaps a better term for this would be ‘intra-connect’ as this is all relative to the storage bus.  Common interfaces include IDE, SATA, SCSI, SAS, and FC.  Before data reaches the disk platter (where it is bound by IOPS), it must pass through the interface.  The standards bodies that define these interfaces go beyond the simple physical form factor; they also define the speed and capabilities of the interface, and this is where we find another measure of storage performance: throughput.  The speed of the interface is the maximum sustained throughput (transfer speed) of the interface and is often measured in Gbps or MBps.

Here are the interface speeds for the most common storage interfaces:

  • IDE: 100MBps or 133MBps
  • SATA: 1.5Gbps or 3.0Gbps (6.0Gbps is coming)
  • SCSI: 160MBps (Ultra-160) and 320MBps (Ultra-320)
  • SAS: 1.5Gbps or 3.0Gbps (6.0Gbps is coming)
  • FC: 1Gb, 2Gb, 4Gb, or 8Gb (duplex throughput rates are 200MBps, 400MBps, 800MBps, and 1600MBps respectively)

If we take these speeds at face value, we see that a 320MBps SCSI interface and a 2Gbps FC interface are not too different.  If you dig a bit deeper you will soon find that simple speed ratings are not the end of the story.  For example, FC throughput can be impacted by the length and type of cable (Fiber Channel can use twisted pair copper in addition to fiber optic cables).  Also, topologies can limit speeds: serial connected topologies are more efficient than parallel on the SCSI side, and arbitrated loops can incur a penalty on the FC side.  The specifications of each interface type also define capabilities such as the protocol that can be used, the number of devices allowed on a bus, and the command set that can be used in communications on a storage system.  For example, SATA native command queuing (NCQ) can offer a performance increase over parallel ATA’s tagged command queuing with other factors held constant.  Because of this, you might also see some performance implications of connecting a SATA drive to a SAS backplane, as the SAS backplane translates SAS commands to SATA.
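Comparing interfaces rated in Gbps against interfaces rated in MBps is easier once everything is in the same unit.  The sketch below is a rough conversion, assuming the 8b/10b line encoding used by SATA, SAS, and 1Gb through 8Gb FC, where every data byte costs ten bits on the wire.  The function name is my own, and the results are nominal per-direction ceilings, not what you will measure on a real system.

```python
def line_rate_to_mbps(gbps: float, line_bits_per_byte: int = 10) -> float:
    """Convert a nominal serial line rate in Gbps to usable MBps.

    Assumes 8b/10b encoding (10 line bits per data byte), which is an
    assumption about the interface, not something measured.
    """
    return gbps * 1000 / line_bits_per_byte


if __name__ == "__main__":
    for name, gbps in [("SATA/SAS 1.5Gbps", 1.5),
                       ("SATA/SAS 3.0Gbps", 3.0),
                       ("FC 2Gb", 2.0)]:
        print(f"{name}: ~{line_rate_to_mbps(gbps):.0f} MBps per direction")
```

By this estimate a 2Gb FC link carries roughly 200MBps in each direction, which is where the 400MBps duplex figure in the list above comes from.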

If we move away from the direct-connect model, and into a shared storage environment that you might use in a VMware-virtualized environment, the ‘interface’ takes on an additional meaning.  You certainly still have the bus ‘interface’ that connects your disks to a backplane.  Modern arrays typically use SAS or FC backplanes.  If you have multiple disk enclosures, you also have an interface that connects each disk shelf to the controller/head/storage processor, or to an adjacent tray of disks.  For example, EMC Clariion arrays use a copper Fiber Channel cable in a switched fabric to connect disk enclosures to the back-end of the storage processors.

If we move to the front-end of the storage system, ‘interface’ describes the medium and protocol used by initiating systems (servers) when connecting to the target SAN.  Typical front-end interface mediums on a SAN are Fiber Channel (FC) and Ethernet.  Front-end FC interfaces come in the standard 2Gb, 4Gb, or 8Gb speeds, while Ethernet is 1Gbps or 10Gbps.  Many storage arrays support multiple front-end ports which can be aggregated for increased bandwidth, or targeted by connecting systems using multi-pathing software for increased concurrency and failover.

Various protocols can be sent over these mediums.  VMware currently supports Fiber Channel Protocol (FCP) on FC, and iSCSI and NFS on Ethernet.  FCP and iSCSI are block-based protocols that utilize encapsulated SCSI commands.  NFS is a file-based NAS protocol.  Fiber Channel over Ethernet (FCoE) is also available on several storage arrays, carrying Fiber Channel frames across Ethernet.

Determining which interface to use on both the front-end and back-end of your storage environment requires an understanding of your workload and your desired performance levels.  A post on workload characterization is coming in this series, so I won’t get too deep now.  I will, however, provide a few rules of thumb.  First, capture performance statistics: using Windows Perfmon, look at Physical Disk|Disk Read Bytes/sec and Disk Write Bytes/sec, or check out stats in your vSphere Client if you are already virtualized.

  • If you require low latency, use fiber channel.
  • If your throughput is regularly over 60MBps, you should consider fiber channel connected hosts.
  • iSCSI or NFS are often a good fit for general VMware deployments.
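The rules of thumb above can be written down as a few lines of logic.  This is only a sketch: the function name and the way the 60MBps figure is applied as a hard cutoff are my own simplification, and a real decision should weigh cost, existing infrastructure, and vendor guidance as well.

```python
def suggest_interconnect(sustained_mbps: float,
                         needs_low_latency: bool = False) -> str:
    """Apply the three rules of thumb to a measured workload.

    sustained_mbps is the throughput you regularly observe (for example
    from Perfmon's Disk Read/Write Bytes/sec counters, converted to MBps).
    """
    if needs_low_latency:
        return "Fiber Channel"
    if sustained_mbps > 60:
        return "Fiber Channel (worth considering)"
    return "iSCSI or NFS"


if __name__ == "__main__":
    print(suggest_interconnect(25))          # general-purpose VM host
    print(suggest_interconnect(90))          # throughput-heavy host
    print(suggest_interconnect(10, True))    # latency-sensitive workload
```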

There is a ton of guidance and performance numbers available when it comes to choosing the right interconnect for a VMware deployment, and a ton of variables that impact performance.  Start with this whitepaper from VMware: http://www.vmware.com/resources/techresources/10034.  For follow up reading, check out Duncan Epping’s post with a link to a NetApp comparison of FC, iSCSI, and NFS: http://www.yellow-bricks.com/2010/01/07/fc-vs-nfs-vs-iscsi/.  If you are going through a SAN purchase process, ask your vendor to assist you in collecting statistics for proper sizing of your environment.  Storage vendors (and their resellers) have a few cool tools for collecting and analyzing statistics, so don’t be afraid to ask questions on how they use those tools to recommend a configuration for you.

I’ve kept this series fairly simple.  Next up in this series is a look at cache, controllers and coalescing.  With the next post we’ll start to get a bit more complex and more specific to VMware and Tier 1 workloads, both virtual and physical.  Thanks for reading!

We all know that virtualization allows us to do more with less.  Fewer servers and space-saving storage (talk about an oxymoron) help us put some green in the datacenter and back in the budget.  But with tight budgets demanding greater efficiency, virtualization pushing per-U-space utilization higher, and increasingly rack-dense equipment, proper planning of your physical plant remains an essential part of IT.  I argue that right-sizing your power, cooling, and floor-space is more critical now than it has ever been, and knowing how to do it is a darn good skill for a virtualization engineer to possess.

So along those lines… I was just doing some site-prep work for a new Clariion installation and noticed that the EMC Power Calculator has been updated.  It is now a pretty slick little web app that can be found on the PowerLink site (login required) here: https://powerlink.emc.com/nsepn/webapps/powercalculator/Main.aspx.

While I am at it, here are some links to other power consumption calculators.  Let me know if you have others and I will update this post:

There’s some fun and timely chatter happening right now on Twitter around power consumption and sizing – join in by following me at http://twitter.com/joshuatownsend/!

In Part I of this series, I discussed the importance of storage performance in a virtual environment (really any environment, virtual or not, where you want acceptable performance), and introduced some of the basic measures of a storage environment.  In Part II, we will look more closely at what may be the most important storage design consideration in VMware server-consolidation environments, many SQL environments, and VDI environments, to name a few: IOPS.

If we stick with a single-disk-centric approach as we did in Part I, IOPS is quite simply a measure of how many read and write commands a disk can complete in a second.  IOPS is an important measure of performance in a shared storage environment (such as VMware) and in high-transaction-rate workloads like SQL.  Because hard drives are forced to abide by the laws of physics, the IOPS capabilities of a disk are consistent and predictable given a specific configuration.  The formula for calculating IOPS for a given disk, with latencies expressed in milliseconds, is pretty straightforward (please show your work):

IOPS = 1000/(Seek Latency + Rotational Latency)

Exact latencies vary by disk type, quality, number of platters, etc.  You can look up the tech specs for most drives on the market.  As an example, I have randomly chosen the technical specifications of the Seagate Cheetah 15k.7 SAS drive.  This particular drive has the following performance characteristics:

- Average (rotational) latency: 2.0msec

- Average read seek (latency): 3.4msec

- Average write seek (latency): 3.9msec

Using the read latency number, the math works out like this:

1000 / (2.0 + 3.4) = 185 maximum read IOPS

The maximum write IOPS will be a bit less (~169 IOPS) because of the higher write seek latency.  Writing is more ‘expensive’ than reading and therefore slower.
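If you would rather let a script show the work, the formula above fits in a couple of lines of Python.  The function name is my own, and the latency figures are the Cheetah 15k.7 numbers quoted above.

```python
def max_iops(seek_ms: float, rotational_ms: float) -> float:
    """Theoretical single-disk IOPS from average latencies in milliseconds."""
    # 1000 ms in a second, divided by the time one random I/O takes.
    return 1000.0 / (seek_ms + rotational_ms)


if __name__ == "__main__":
    read_iops = max_iops(seek_ms=3.4, rotational_ms=2.0)
    write_iops = max_iops(seek_ms=3.9, rotational_ms=2.0)
    print(f"read:  {int(read_iops)} IOPS")
    print(f"write: {int(write_iops)} IOPS")
```

Swap in the seek and rotational latencies from any drive's spec sheet to get a comparable estimate.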

Fortunately, there are some widely accepted ‘working’ numbers, so you do not have to use this formula for each and every disk you might consider using.  Because rotational latency is based on the rotational speed, we can use the published Rotations Per Minute (RPM) rating of the drive to guess-timate the IOPS capabilities.  Typical spindle speeds (measured in RPM) and their equivalent IOPS are in the table below.

RPM          IOPS
7,200        80
10,000       130
15,000       180
SSD          2,500 to 6,000

While not a traditional spinning disk, I have also included Solid State Disks (SSDs) for reference as SSDs are starting to see increased market adoption.  I have seen a wide range of sizing IOPS for SSDs depending on the technology and type (SLC, MLC, etc.).  Check out http://en.wikipedia.org/wiki/Solid-state_drive for an introduction, and ask your vendors for more in-depth technical information.
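The reason spindle speed is such a good predictor is that average rotational latency is simply the time the platter needs to make half a turn.  Here is a small sketch of that relationship; the function name is my own, and it covers only the rotational part of the IOPS formula, since seek times have to come from the drive's spec sheet.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


if __name__ == "__main__":
    for rpm in (7_200, 10_000, 15_000):
        print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```

For a 15,000 RPM drive this works out to the same 2.0 ms listed for the Cheetah above.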

If you are brand-new to this (and you are still reading, congrats!), you can see how many IOPS your Windows computer is asking for by opening Performance Monitor and looking at the ‘Disk Transfers/sec’ counter under Physical Disk.  This is a sum of the ‘Disk Reads/sec’ and ‘Disk Writes/sec’ counters as you can see in the screenshot below:

If you are after some stats for your VMware ESX environment, check out esxtop and look for CMDS/s in the output.  I published a couple articles on using esxtop here and here.  The numbers from PerfMon and esxtop get you pretty close but can be skewed by a few things we’ll discuss in later posts.

Now that was fun and all, but let’s get real: Single-disk configurations are uncommon in servers.  As such, we’ll part ways with our Simple Jack single disk approach to storage and begin to look at more real-world multi-disk enterprise-class storage configurations.  A discussion of IOPS in a multi-disk array is a great way to start.  From a very elementary perspective, you can combine multiple hard drives together to aggregate their performance capabilities.  For example, two 15k RPM disks working together to serve a workload could provide a theoretical 360 IOPS (180 + 180).  This also scales out, so ten 15k RPM disks could provide 1,800 IOPS, and 100 15k RPM disks could provide 18,000 IOPS.

Designing your environment so that your storage can deliver sufficient IOPS to the requesting workload is of utmost importance.  If you are working on a storage design, arm yourself with data from perfmon, top, iostat, esxtop, and vscsiStats.  I typically gather at least 24 hours of performance data from systems under normal conditions (a few days to a week may be good if you have varying business cycles) and take the 95th percentile as a starting point.  So from a very simple approach, if your data and calculations show an 1,800 IOPS demand at the 95th percentile, you ought to have at least ten 15k RPM disks (or twenty-three 7.2k RPM SATA disks) to achieve performance goals.  It’s amazing how some simple data and a pretty little Excel spreadsheet can help you understand and justify the right hardware for the job.
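If a spreadsheet is not your thing, the same sizing math can be sketched in Python.  The function names are my own, the percentile uses the simple nearest-rank method (Excel's PERCENTILE function interpolates, so its answer can differ slightly), and the per-disk figures are the working numbers from the table above.  Like the rest of this post, it ignores RAID and cache.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def disks_needed(demand_iops, iops_per_disk):
    """Smallest whole number of disks whose combined IOPS meet the demand."""
    return math.ceil(demand_iops / iops_per_disk)


if __name__ == "__main__":
    # Hypothetical samples: one IOPS reading per interval from perfmon or esxtop.
    samples = [400, 650, 900, 1200, 1500, 1700, 1750, 1800, 1800, 2600]
    demand = percentile_nearest_rank(samples, 95)
    print("95th percentile demand:", demand)
    print("15k RPM disks for 1800 IOPS:", disks_needed(1800, 180))
    print("7.2k RPM disks for 1800 IOPS:", disks_needed(1800, 80))
```

With so few samples the 95th percentile lands on the single highest reading, which is one more reason to collect at least a full day of data before sizing.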

Now before you go and start filling out that PO form for a nice new storage system based on these numbers there are a few more things we ought to discuss.  RAID, cache, and advanced storage technologies will skew these numbers and need to be understood.  Stay tuned to future articles in this series for more on those topics and more.

Finally, there has been a bunch of activity in the VMware ecosystem of vendors, bloggers, and twittering-type-folks around storage performance.  As this here post sat in my drafts folder, Duncan Epping posted this gem of an article that pretty much included all of the content of this article, as well as future ones in my series: http://www.yellow-bricks.com/2009/12/23/iops/.  Do yourself a favor and read his post and the comments from his readers – both are filled with a ton of great information, including some vendor-specific implementations.
I was led to Duncan’s article by a post by Chad Sakac on his blog: http://virtualgeek.typepad.com/virtual_geek/2009/12/whats-what-in-vmware-view-and-vdi-land.html.  This is also a great read that covers some of the same information with a focus on VMware View/VDI and is also worth a few minutes of your time.  Also check out http://vpivot.com/2009/09/18/storage-is-the-problem/ for a rubber-meets-the-road post from Scott Drummonds on the importance of storage performance vis-a-vis IOPS in a VMware-virtualized SQL environment.

I am increasingly finding that both my SMB and Enterprise customers are uneducated on the fundamentals of storage sizing and performance.  As a result, storage is often overlooked as a performance bottleneck despite it being a vital component to consider in a virtualization implementation.  Storage will only increase in importance as hosts are getting bigger, data volumes increase, and more workloads are virtualized.  For some reason, most people can grasp the importance of CPU and memory performance constraints but storage performance is often overlooked and can be hard to explain to business users or executives.

Case in point: I have recently been called into some environments that were not performing well.  These environments happened to be running Microsoft SQL, but could just as well have been running any application or collection of virtual machines.  Fingers were being pointed in all directions: at applications, at the virtualization layer, at a lack of memory, and DBAs were insisting that there were too few CPUs.  The situation was getting political and emotional when I walked into it.  A few minutes with Windows Perfmon was all I needed to identify storage performance as the root cause of the firestorm that had been ignited.  Using a bit of data, I was able to turn the discussion from an emotional fight to a simple problem of physics and mathematics (and a bit of simple math could have avoided the problem in the first place).

I have seen this play out a few too many times and so decided to write up this multi-part series on the basics of storage with a focus on storage performance.  That said, a little math and physics is where we will start as we look at the basic building block of a storage environment: a hard disk drive.  Wikipedia defines a hard disk drive as “a non-volatile storage device that stores digitally encoded data on rapidly rotating platters with magnetic surfaces.”  Your computer, server, or VMware cluster uses hard drives to read and write data.  Wikipedia also covers the history and atomic structure of a hard drive pretty well.  For our purposes, the takeaway is that hard drives are physical objects, and as such, follow the laws of physics (duh) in the following measurable ways:

1.) Capacity, which is measured in bits or bytes and multiples thereof (MB, GB, TB, PB).  This is how much data will fit on your disk, from simple text files to virtual disks, and everything in between.  For example, if you have a 500GB SQL database, you darn well better have a hard drive that has a capacity of at least 500GB.  This is a pretty simple concept, so I’ll leave it there for now.

2.) Performance, which is measured in a couple ways:

- at the disk itself in Input/Output Operations Per Second (IOPS), a measure of how many read and write commands a disk can complete in a second

- interface throughput, measured in MBps or Gbps – a measure of the peak rate that a volume of data can be read from or written to disk

- latency – the amount of time between when you ask a disk (or storage system if you want to read ahead) to do something and when it can actually do it, very closely related to IOPS as you’ll read in a forthcoming article in this series.

Each disk, array, and storage system has its own fixed set of measurements given a specific configuration.  Knowing the physical capabilities of your storage system as measured in the above ways, and your systems’ storage requirements, will go a long way towards a successful design and implementation of your storage environment.  The remaining parts of this series will take a look at these performance characteristics a bit more in-depth and explain what happens as you introduce factors like RAID, cache, data reduction techniques such as snapshots and deduplication, and varying workloads.
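The performance measures listed above are tied together by simple arithmetic: the throughput a workload generates is its IOPS multiplied by the size of each I/O.  The sketch below illustrates that relationship.  The function name, the 180 IOPS figure (a commonly quoted working number for a 15k RPM disk), and the 8KB I/O size are illustrative assumptions, not measurements.

```python
def throughput_mbps(iops: float, io_size_kb: float) -> float:
    """Throughput in MBps generated by a given IOPS rate and I/O size."""
    return iops * io_size_kb / 1024


if __name__ == "__main__":
    # A single disk doing small random I/O moves very little data per second,
    # so it runs out of IOPS long before it fills the interface.
    print(f"{throughput_mbps(180, 8):.2f} MBps at 180 IOPS x 8KB")
```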

Please keep in mind that while I have designed and implemented a variety of DAS, NAS, and SAN technologies from a host of vendors including Dell, EMC, IBM, and NetApp, I am by no means a storage expert.  The information I will provide is generalized, over-simplified, and does not consider varying approaches from different storage vendors.  Nonetheless, I hope you find this useful information if you are designing a solution, troubleshooting a performance issue or preparing to make a storage purchase.

Keep Reading:

Storage Basics – Part II: IOPS

I recently ran into an issue when installing my first Windows Server 2008 R2 virtual machine.  The VM would hang/freeze randomly when used through the VMware vCenter Client’s console.  It turns out this is a known issue (see this VMware KB Article) with the SVGA driver that is installed as part of the default installation of VMware Tools.  While the article does not explain why you should disable the SVGA driver, its advice is correct if you want to avoid problems in your guest VM.  To correct my problem, I removed the SVGA driver from the Windows Device Manager and rebooted.  If you are having problems removing the SVGA driver before the VM hangs, use Remote Desktop to access the guest machine to perform the driver uninstall.  I have not observed hanging/freezing in the VM since removing the SVGA driver from my Windows 2008 R2 guest.  Note that this same issue is present in Windows 7.

Today marks the one year anniversary of my first post on VMtoday.com, and an exciting year it has been in many ways.  First, some stats:

  • VMtoday.com has been visited more than 10,000 times in the past year.  While the number of site visits is far below some of my fellow virtualization bloggers, it is still exciting for me to see that I am making an impact on the community (despite my meager post count).  There are some of you who read my posts through Planet V12n and RSS, which is cool by me.
  • Yesterday was the busiest day for the site.  No coincidence that it comes after adding some new content….  I’ll try to be more faithful in publishing regular, relevant content!
  • My most popular post to date has been: IBM DS3300 iSCSI Write Performance Solved. I’m glad this has been useful for so many, but I hope that you don’t just apply the workarounds I wrote about. I would rather have you build a “bet the business” iSCSI environment by adding that second controller to your MD3000i or DS3300.
  • My least popular post to date has been: VMworld Here I Come. I will try not to be so boring in the future.
  • The first link back to my site was from Scott Lowe’s blog.  Thanks for the link, Scott.  Scott wrote a darn fine book: Mastering VMware vSphere 4.  Buy it.
  • The site theme, like my Twitter page, is very blue.  I am working on a new theme in all of the spare time that a busy professional and father of two boys can muster.

Along with this site, I have made a concerted effort to engage the virtualization community in several ways:

  • Twitter keeps me plugged in to the latest news and discussions, and has been a source of help to me (and I would like to think that I have helped some of you as well).  Follow me at http://twitter.com/joshuatownsend.
  • I have stepped up into a leadership role with the Washington DC VMware User Group (VMUG).  It has been awesome to meet with and learn from my local colleagues.  I will continue to work with the DC VMUG leadership team to deliver exciting and relevant content and activities (and also some more seating space).  I welcome your feedback and ideas for ways to improve the VMUG.
  • VMworld 2009 was a great way to meet many of you while learning some of the hottest new technology on the planet.  My wife came along and enjoyed spending time with some of the other vSpouses (or vWidows?).
  • I took a new job early in the year with a VMware Partner in the DC area.  This new job has been both challenging and rewarding, affording me opportunities to more effectively engage customers and spend more time working on virtualization-specific solutions.

I look forward to contributing to the virtualization community as a blogger, VMUG leader, and practitioner.  If you want to learn a bit more about me, check out the About page on this site.  I welcome your feedback and appreciate your reading my work!

- Josh

I recently posted an article on how specific actions during the upgrade of a VMware Virtual Machine’s hardware from v4 to v7 can cause problems with certain services, including DNS, DHCP, and WINS.  In that case, the problem was related to Microsoft Windows leaving non-present devices with networking configurations and the failure of the VMware Upgrade Helper service to copy WINS settings when updating the NIC.  As my fellow blogger and VMUG leader, Jason Boche, responded on Twitter: “Same gotchas, different version.”  And right he is: anyone with experience in P2V or V2V, or who has been working with VMware long enough to have done a 2.5 to 3.0 upgrade, has experienced the same gotchas.

There are other issues with VMware virtual hardware upgrades, however, that you may not have experienced.  One such issue that I have run into is highlighted in VMware Knowledge Base article 1013109: “Upgrading virtual hardware in ESX 4 may cause Windows 2008 disks to go offline”.  The problem affects only the Enterprise and Datacenter editions of Windows 2008, and the title of the article describes it pretty well.  Like the ghost NICs I described last week, this is more of a Microsoft issue, but it will rear its head when a VMware Administrator least desires it.  With this particular problem, the Windows Virtual Disk Service (part of the native Storage Management suite) is set to not auto-mount newly discovered disks that reside on a shared bus.  Microsoft has an MSDN article on the VDS SAN policy here.  Upgrading the virtual hardware version causes the disks to be re-discovered and not auto-mounted.  This can potentially impact all non-system disks on a VM.

You may also experience similar issues when you upgrade the vSCSI adapter in a VM from a standard LSI Logic Parallel SCSI adapter to a paravirtualized SCSI (pvSCSI) adapter (new in vSphere 4.0), move virtual disks to new vSCSI adapters to increase the number of concurrent disk I/O operations, or change the SCSI node ID of a virtual disk.  These may all trigger a re-discovery of the disks by the Windows Virtual Disk Service, leaving data disks offline on Windows 2008 Enterprise and Datacenter Edition guests.

In my opinion, these issues are not reasons to forgo upgrading your virtual hardware version.  However, when your upgrade/migration plans call for upgrading the virtual hardware version of your guests you should be prepared to resolve any issues caused by ‘ghost hardware’, offline disks, and the like.  Both the MSDN and VMware articles I cited above offer workarounds for the offline disk issue.  Here are the links again:

  • http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1013109
  • http://msdn.microsoft.com/en-us/library/bb525577%28VS.85%29.aspx
I recently completed a VMware VI 3.5 to vSphere upgrade in a small environment (5 hosts, 80 VMs).  Being a small environment, the upgrade was planned for one big overnight blitz.  Unfortunately, the size of the environment did not afford a test environment to uncover potential issues before the upgrade.  The upgrade to vSphere itself went swimmingly (the vCenter server had been upgraded a couple weeks earlier).  However, some things in the environment started to go wonky once the upgrade was complete.  Specifically, name resolution (DNS), DHCP, WINS, Group Policy, and really anything Microsoft Active Directory related just did not work.

Let me explain a bit about the environment so you can better understand what the problem was and how it was corrected.  The environment was an all Microsoft shop, except for VMware of course.  The company follows a virtualize-first policy and is about 90% virtualized, including the Active Directory Domain Controllers.  The DCs are Windows 2008 and serve up DHCP, DNS, and WINS in addition to their Directory Services roles.

The problems really began after I upgraded the virtual hardware version from v4 to v7 (check out page 97 of the vSphere Upgrade Guide for the upgrade procedure).  When a Windows server is upgraded from VMware Hardware Version 4 to 7, the VMware Upgrade Helper Service handles the reconfiguration of network adapters on the upgraded virtual machine.  The VMware Upgrade Helper Service is installed with VMware Tools and is one of the reasons, along with getting drivers installed for the new hardware, for upgrading VMware Tools before upgrading the hardware version.  If you review the Event Viewer Application log on an upgraded machine you will see several entries from VMUpgradeHelper (Source) with several different Event IDs (26, 280, 272, 108, & 105).  An examination of these events will show that the VMware Upgrade Helper service 1.) backed up the network configuration at OS shutdown, 2.) started automatically with the OS, 3.) checked the device ID for the network adapter, and 4.) restored the backed-up configuration and logged Event ID 269 if the device ID had changed (as a result of a hardware upgrade).
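To make that sequence easier to follow, here is a toy Python model of the behavior the event log suggests.  To be clear, this is not VMware's code: the class names, fields, and device ID strings are all hypothetical, and the model only mirrors the four steps observed above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NicConfig:
    ip_address: str
    dns_servers: list
    wins_servers: list = field(default_factory=list)


@dataclass
class UpgradeHelperModel:
    """Toy model of the backup-and-restore behavior seen in the event log."""
    saved_device_id: Optional[str] = None
    saved_config: Optional[NicConfig] = None

    def on_shutdown(self, device_id: str, config: NicConfig) -> None:
        # Step 1: back up the network configuration at OS shutdown.
        self.saved_device_id = device_id
        self.saved_config = config

    def on_startup(self, device_id: str, config: NicConfig) -> NicConfig:
        # Steps 2-4: on boot, compare device IDs and restore the backup
        # only if the adapter changed (for example after a hardware upgrade).
        if self.saved_config is not None and device_id != self.saved_device_id:
            return self.saved_config
        return config


if __name__ == "__main__":
    helper = UpgradeHelperModel()
    original = NicConfig("192.168.1.10", ["192.168.1.10"], ["192.168.1.10"])
    helper.on_shutdown("hypothetical-v4-nic", original)
    restored = helper.on_startup("hypothetical-v7-nic", NicConfig("", []))
    print(restored.ip_address)
```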

This behavior should be transparent for most configurations, with the exception of a slightly longer boot time following the upgrade.  However, I did notice a few problems with the NIC settings being restored under certain conditions.  First, on servers with a statically configured IPv4 stack, IP addresses and DNS server addresses were restored, but the WINS server addresses were not restored.  I suspect this is an oversight in the VMware Upgrade Helper service, but it is probably not a major issue for many servers/environments as WINS is infrequently used.  However, when a WINS server loses the setting that points it at itself, bad things happen.  There are several ways to correct this: scripts, DHCP Options, etc.  In the end, this wasn’t really a show stopper for me in this small environment.

The second, and bigger issue for me, was that after the virtual hardware was upgraded and the VMware Upgrade Helper Service did its job my Active Directory and related services were not available.  DNS was not functioning, DHCP was not handing out addresses, and I couldn’t connect to AD using ADUC, GPMC or LDAP.  It took me a few minutes to figure out what was going on.  This seems to be what happened: the virtual hardware upgrade caused a new virtual network adapter to be installed in the VM and all of the settings, including the MAC address, to be restored.  The HW v4 NIC was removed from the machine, but Windows held onto the device as a ‘ghost NIC’ in Device Manager.  The core AD services, including DNS and DHCP, were still attempting to bind to the ghost NIC.  This behavior persisted through service restarts and reboots of the guest.  It wasn’t until I examined the IP configuration on the new NIC and clicked Apply (instead of canceling out) that I was prompted with a message indicating that there was more than one network interface configured with the same IP address, cluing me in to the solution.

The error message should be familiar to anyone who has performed a Physical-to-Virtual migration (P2V) and is easily corrected by removing the old device through Windows Device Manager.  The device is hidden so you first have to expose it before deleting it.  Check http://support.microsoft.com/kb/315539 for details or simply follow my instructions below.  To expose the non-present NIC, open a command prompt and enter:

    set devmgr_show_nonpresent_devices=1

You can then open Device Manager (enter devmgmt.msc at the command prompt to save some time).  In Device Manager, click View | Show Hidden Devices.  Expand Network Adapters and find the grayed-out entry for the old NIC as pictured below.

GhostNIC

Select the ghost NIC and right-click | Uninstall to remove it.

The final gotcha for me on this is that the set devmgr_show_nonpresent_devices=1 command does not work on Windows 2008 (or Vista, Windows 7, or Windows 2008 R2).  To see and remove ghost NICs from Windows 2008, an environment variable must be defined.  To set the variable, open Server Manager from the Windows Start Menu.  Highlight ‘Server Manager (%SERVERNAME%)’ in the left-side tree-view pane.  Click ‘Change System Properties’ in the right-hand pane.  Switch to the Advanced tab and click ‘Environment Variables’.  Create a new System variable by clicking the New button.  The Variable name should be ‘devmgr_show_nonpresent_devices’ and the value should be ‘1’ as pictured below.

EnvVariable

Click OK to close out of any open windows.  A reboot is not necessary for the variable to take effect, although you may have to close out of all open Device Manager windows and then reopen devmgmt.msc.  Click View | Show Hidden Devices and remove the ghost NIC as described above.  After I removed the ghost NIC from the domain controllers and did a quick reboot, all Active Directory, DNS, DHCP, and WINS services immediately began operating normally.  This second issue is more of a Microsoft problem in my opinion, and has been around for some time.
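If you have a handful of servers to touch, the Server Manager clicks described above can be scripted.  The sketch below is a rough equivalent that builds a setx command with the /M switch, which writes a machine-level (System) environment variable.  It assumes you run it from an elevated prompt on the Windows guest with Python installed, and the helper name is my own.

```python
import platform
import subprocess

VARIABLE = "devmgr_show_nonpresent_devices"


def setx_command(name: str = VARIABLE, value: str = "1") -> list:
    """Build the setx command line; /M targets the System environment."""
    return ["setx", name, value, "/M"]


if __name__ == "__main__":
    command = setx_command()
    print(" ".join(command))
    # Only attempt the change on Windows; elsewhere this just prints the command.
    if platform.system() == "Windows":
        subprocess.run(command, check=True)
```

As with the GUI method, close and reopen Device Manager afterwards so it picks up the new variable.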

Before you start getting all upset and the FUD starts flying (“this is Microsoft/VMware’s latest attempt to break VMware/Microsoft?”), it wasn’t really vSphere that broke Active Directory; it was me.  A little better planning and not rushing through the last wee hours of the upgrade window could have saved some trouble.  If you are planning a similar upgrade, it would be best to upgrade your domain controllers/DNS servers one at a time and remediate the issues I have described before upgrading the next.  This will ensure continued availability of your Active Directory and other critical services during your upgrade.
