I recently ran into an issue when installing my first Windows Server 2008 R2 virtual machine. The VM would hang/freeze randomly when used through the VMware vCenter Client’s console. It turns out this is a known issue (see this VMware KB Article) with the SVGA driver that is installed as part of the default installation of VMware Tools. While the article does not explain why you should disable the SVGA driver, its advice is correct if you want to avoid problems in your guest VM. To correct my problem, I removed the SVGA driver from the Windows Device Manager and rebooted. If the VM hangs before you can remove the SVGA driver from the console, use Remote Desktop to access the guest machine and perform the driver uninstall. I have not observed hanging/freezing in the VM since removing the SVGA driver from my Windows 2008 R2 guest. Note that this same issue is present in Windows 7.
Upgrading Virtual Hardware in a VMware Virtual Machine May Cause Disks to go Offline
I recently posted an article on how specific actions during the upgrade of a VMware Virtual Machine’s hardware from v4 to v7 can cause problems with certain services, including DNS, DHCP, and WINS. In that case, the problem was related to Microsoft Windows leaving non-present devices with networking configurations and the failure of the VMware Upgrade Helper service to copy WINS settings when updating the NIC. As my fellow blogger and VMUG leader, Jason Boche, responded on Twitter: “Same gotchas, different version.” And right he is – anyone with experience in P2V or V2V, or who has been working with VMware long enough to have done a 2.5 to 3.0 upgrade, has experienced the same gotchas.
There are other issues with VMware virtual hardware upgrades, however, that you may not have experienced. One such issue that I have experienced is highlighted in VMware Knowledge Base article 1013109: “Upgrading virtual hardware in ESX 4 may cause Windows 2008 disks to go offline”. The problem is pretty well described by the title of the article, and it is unique to the Windows 2008 Enterprise and Datacenter editions. This case, like the ghost NIC’s I described last week, is more of a Microsoft issue, but it will rear its head when a VMware Administrator least desires it. With this particular problem, the Windows Virtual Disk Service (part of the native Storage Management suite) is set to not auto-mount newly discovered disks that do reside on a shared bus. Microsoft has an MSDN article on the VDS SAN policy here. Upgrading the virtual hardware version causes the disks to be re-discovered and not auto-mounted. This can potentially impact all non-system disks on a VM.
You may also experience similar issues when you upgrade the vSCSI adapter in a VM from a standard LSI Logic Parallel SCSI adapter to a (new in vSphere 4.0) paravirtualized SCSI (pvSCSI) adapter, move virtual disks to new vSCSI adapters to increase the number of concurrent disk IO operations, or change the SCSI node ID of a virtual disk. These may all trigger a re-discovery of the disks by the Windows Virtual Disk Service, leaving data disks offline on Windows 2008 Enterprise and Datacenter Edition guests.
In my opinion, these issues are not reasons to forgo upgrading your virtual hardware version. However, when your upgrade/migration plans call for upgrading the virtual hardware version of your guests, you should be prepared to resolve any issues caused by ‘ghost hardware’, offline disks, and the like. Both the MSDN and VMware articles I cited above offer workarounds for the offline disk issue.
vSphere Upgrade Breaks Active Directory
I recently completed a VMware VI 3.5 to vSphere upgrade in a small environment (5 hosts, 80 VM’s). Being a small environment, the upgrade was planned for one big overnight blitz. Unfortunately, the size of the environment did not afford a test environment to uncover potential issues before the upgrade. The upgrade to vSphere itself went swimmingly (the vCenter server had been upgraded a couple weeks earlier). However, some things in the environment started to go wonky once the upgrade was complete. Specifically, name resolution (DNS), DHCP, WINS, Group Policy, and really anything Microsoft Active Directory related just did not work.
Let me explain a bit about the environment so you can better understand what the problem was and how it was corrected. The environment was an all Microsoft shop, except for VMware of course. The company follows a virtualize-first policy and is about 90% virtualized, including the Active Directory Domain Controllers. The DC’s are Windows 2008 and serve up DHCP, DNS, and WINS in addition to their Directory Services roles.
The problems really began after I upgraded the virtual hardware version from v4 to v7 (check out page 97 of the vSphere Upgrade Guide for the upgrade procedure). When a Windows server is upgraded from VMware Hardware Version 4 to 7, the VMware Upgrade Helper Service handles the reconfiguration of network adapters on the upgraded virtual machine. The VMware Upgrade Helper Service is installed with VMware Tools and is one of the reasons, along with getting drivers installed for the new hardware, for upgrading VMware Tools before upgrading the hardware version. If you review the Event Viewer Application log on an upgraded machine you will see several entries from VMUpgradeHelper (Source) with several different Event ID’s (26, 280, 272, 108, & 105). An examination of these events will show that the VMware Upgrade Helper service 1.) backs up the network configuration at OS shutdown, 2.) starts automatically with the OS, 3.) checks the device ID for the network adapter, and 4.) if the device ID has changed (as a result of a hardware upgrade), restores the backed-up configuration and logs Event ID 269.
This behavior should be transparent for most configurations, with the exception of a slightly longer boot time following the upgrade. However, I did notice a few problems with the NIC settings being restored under certain conditions. First, on servers with a statically configured IPv4 stack, IP addresses and DNS server addresses were restored, but the WINS server addresses were not restored. I suspect this is an oversight in the VMware Upgrade Helper service, but is probably not a major issue for many servers/environments as WINS is infrequently used. However, for a WINS server itself to lose its configuration to use itself as a WINS server, bad things happen. There are several ways to correct this – scripts, DHCP Options, etc. In the end, this wasn’t really a show stopper for me in this small environment.
The second, and bigger issue for me, was that after the virtual hardware was upgraded and the VMware Upgrade Helper Service did its job, my Active Directory and related services were not available. DNS was not functioning, DHCP was not handing out addresses, and I couldn’t connect to AD using ADUC, GPMC or LDAP. It took me a few minutes to figure out what was going on. This seems to be what happened: the virtual hardware upgrade caused a new virtual network adapter to be installed in the VM and all of the settings, including the MAC address, to be restored. The HW v4 NIC was removed from the machine, but Windows held onto the device as a ‘ghost NIC’ in Device Manager. The core AD services, including DNS and DHCP, were still attempting to bind to the ghost NIC. This behavior persisted through service restarts and reboots of the guest. It wasn’t until I examined the IP configuration on the new NIC and clicked Apply (instead of canceling out) that I was prompted with a message indicating that there was more than one network interface configured with the same IP address, cluing me in to the solution.
The error message should be familiar to anyone who has performed a Physical-to-Virtual migration (P2V) and is easily corrected by removing the old device through Windows Device Manager. The device is hidden so you first have to expose it before deleting it. Check https://support.microsoft.com/kb/315539 for details or simply follow my instructions below. To expose the non-present NIC, open a command prompt and enter:
set devmgr_show_nonpresent_devices=1
You can then open Device Manager (enter devmgmt.msc at the command prompt to save some time). In Device Manager, click View | Show Hidden Devices. Expand Network Adapters and find the grayed-out entry for the old NIC as pictured below.
Select the ghost NIC and right-click | Uninstall to remove it.
The final gotcha for me on this is that the set devmgr_show_nonpresent_devices=1 command does not work on Windows 2008 (or Vista, Windows 7, or Windows 2008 R2). To see and remove ghost NICs from Windows 2008, an environment variable must be defined. To set the variable, open Server Manager from the Windows Start Menu. Highlight ‘Server Manager (%SERVERNAME%)’ in the left-side tree-view pane. Click ‘Change System Properties’ in the right-hand pane. Switch to the Advanced tab and click ‘Environment Variables’. Create a new System variable by clicking the New button. The Variable name should be ‘devmgr_show_nonpresent_devices’ and the value should be ‘1’ as pictured below.
Click OK to close out of any open windows. A reboot is not necessary for the variable to take effect, although you may have to close out of all open Device Manager windows and then reopen devmgmt.msc. Click View | Show Hidden Devices and remove the ghost NIC as described above. After I removed the ghost NIC from the domain controllers and did a quick reboot, all Active Directory, DNS, DHCP, and WINS services immediately began operating normally. This second issue is more of a Microsoft problem in my opinion, and has been around for some time.
Before you start getting all upset and the FUD starts flying (“this is Microsoft/VMware’s latest attempt to break VMware/Microsoft?”), it wasn’t really vSphere that broke Active Directory; it was me. A little better planning and not rushing through the last wee hours of the upgrade window could have saved some trouble. If you are planning a similar upgrade, it would be best to upgrade your domain controllers/DNS servers one at a time and remediate the issues I have described before upgrading the next. This will ensure continued availability of your Active Directory and other critical services during your upgrade.
The Skinny on ESXTOP
A reader named Mark contacted me today and asked if there was a way to reduce the size of the batch output from an ESXTOP run. And he asks for good reason: Depending on the number of VM’s on your host, the delay between ESXTOP samplings and the number of samples you collect, using the All Stats option (-a) can yield a massive file in a short period of time. If written to a partition on your ESX Service Console you run the risk of filling the partition, and forget about actually being able to analyze the data in PERFMON or Excel. For example, on an ESX host running ~15 VM’s I produced 100MB worth of CSV using the -a switch, sampling every 15 seconds, for just under 2 hours. ESXTOP uses 5-second intervals by default; I used -d 15 to change the sampling delay. Had I gone with the default, my output would have been even bigger.
To reduce the size of your output, you can change your sampling delay to something larger, say 30-seconds. I suppose you could also capture statistics when the host is not busy so you get fewer characters in the results, but that’s just being goofy. 😉
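To get a feel for how much the sampling delay matters, here is a rough Python sketch. It assumes the output grows linearly with the number of samples (ignoring the one-time header row) and calibrates from the run described above, rounded to 100MB over 2 hours at a 15-second delay. The function names are made up for illustration.

```python
def samples_collected(duration_s, delay_s):
    """Number of samples taken in duration_s with one sample every delay_s."""
    return int(duration_s // delay_s)


def estimate_size_mb(duration_s, delay_s, mb_per_sample):
    """Estimated CSV size, assuming every sample costs about the same."""
    return samples_collected(duration_s, delay_s) * mb_per_sample


# Calibration (assumption): ~100 MB written over ~2 hours with -d 15.
two_hours = 2 * 3600
mb_per_sample = 100 / samples_collected(two_hours, 15)

# Compare a few candidate delays for the same two-hour capture.
for delay in (15, 30, 60):
    size = estimate_size_mb(two_hours, delay, mb_per_sample)
    print(f"-d {delay}: about {size:.0f} MB")
```

Doubling the delay halves the estimate, which is why a longer delay is the quickest (if bluntest) way to shrink the file.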
A better way to reduce your ESXTOP output size is to selectively include only the statistics you are interested in, which is really what Mark was asking. After all, all statistics from ESXTOP can be too many statistics, and chances are you already know what stats you are interested in. Here’s how you can narrow down the collected stats for easier analysis and smaller output:
1. Enter ESXTOP in interactive mode on the Service Console by simply typing esxtop at the # prompt.
2. Switch to a component you are NOT interested in capturing statistics on by pressing the corresponding menu option (c: ESX cpu, m: ESX memory, d: ESX disk adapter, u: ESX disk device, v: ESX disk VM).
3. Press f when viewing the component you do not want to capture. A list of fields will be displayed. You can toggle the fields on and off by pressing the letter corresponding to each field. An * indicates that the field is on. You want to turn off all of the fields you don’t want to collect.
4. Repeat steps 2 & 3 for the remaining components, leaving only what you want to capture.
5. Switch to the component you want to capture in batch mode and repeat step 3, except you will now enable what you want to capture.
6. Press W (capital W – case sensitive) to write out the ESXTOP configuration file. You can accept the default or create new configuration files. You may want to create a CPU-only config file, memory-only, and so forth.
7. Press CTRL+C to stop ESXTOP.
8. Now, invoke ESXTOP in batch mode, calling the updated or new configuration file you created in step 6 using the -c switch. Here’s an example:
# esxtop -b -d 30 -n 480 -c .esxtopcpustats > /tmp/esxtop_cpu_stats.csv
where .esxtopcpustats is an ESXTOP config file with only CPU stats, -d sets your capture interval to 30 seconds, and -n sets the number of samples to 480 (or 4 hours with a delay of 30 seconds).
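If you keep several config files around (CPU-only, memory-only, and so forth), a small helper can build the batch command line and confirm how long a capture will run. This is only a sketch: the helper names are hypothetical, and it uses nothing beyond the -b, -d, -n, and -c switches shown in the example.

```python
import shlex


def capture_duration_hours(delay_s, samples):
    """Total capture time: one sample every delay_s seconds, samples times."""
    return delay_s * samples / 3600


def esxtop_batch_command(delay_s, samples, config_file, output_csv):
    """Build the batch-mode command line as a single shell string."""
    args = ["esxtop", "-b", "-d", str(delay_s), "-n", str(samples),
            "-c", config_file]
    # Quote each argument so unusual file names survive the shell.
    return " ".join(shlex.quote(a) for a in args) + " > " + shlex.quote(output_csv)


print(capture_duration_hours(30, 480))
print(esxtop_batch_command(30, 480, ".esxtopcpustats",
                           "/tmp/esxtop_cpu_stats.csv"))
```

With a 30-second delay and 480 samples the first line prints 4.0, matching the four hours worked out above.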
Once your capture is complete you can replay the sampling in ESXTOP using replay mode (-R), or you can copy the .csv to a Windows system and use PERFMON or Excel to analyze the stats. If using PERFMON or Excel you will notice that the system summary information displayed at the top of an interactive ESXTOP session is included in the output (console memory, console cpu, etc.). As far as I know, there is no way to disable this, nor would you want to as it includes the time stamp necessary to interpret your data.
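If you already have an oversized capture, you can also trim it after the fact before loading it into PERFMON or Excel. The Python sketch below keeps the first column (assumed here to be the time stamp) plus any column whose header contains a keyword you supply. The host name, counter names, and values in the sample data are made up for illustration, not real output.

```python
import csv
import io


def filter_columns(src, dst, keywords):
    """Copy CSV rows from src to dst, keeping column 0 (the time stamp) and
    every column whose header contains one of the keywords (case-insensitive).
    Returns the number of columns kept."""
    reader = csv.reader(src)
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    header = next(reader)
    wanted = [k.lower() for k in keywords]
    keep = [0] + [i for i, name in enumerate(header)
                  if i > 0 and any(k in name.lower() for k in wanted)]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        writer.writerow([row[i] for i in keep])
    return len(keep)


# Tiny illustrative capture: one time stamp column and two counters.
sample_lines = [
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Physical Cpu(0)\% Processor Time","\\esx01\Memory\Free MBytes"',
    r'"01/01/2010 00:00:00","12.5","2048"',
]
src = io.StringIO("\n".join(sample_lines) + "\n")
out = io.StringIO()
kept = filter_columns(src, out, ["Physical Cpu"])
print(kept)
print(out.getvalue())
```

For a real file you would pass open file handles instead of the in-memory StringIO objects used here.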
It is possible to use the vSphere CLI or the vSphere Management Assistant (vMA) to run RESXTOP, a version of ESXTOP designed for remote administration of ESXi or ESX. You may note, however, RESXTOP from the vSphere CLI only works from a Linux client. Using either of these tools will help you to automate ESXTOP statistics collection from multiple hosts using customized configuration files.
vCenter Database Stats Rollup Troubleshooting
VMware vCenter collects performance statistics, tasks and events for historical performance analysis and auditing. The collection level and retention of performance statistics can be controlled through the vCenter GUI (see Administration | vCenter Server Settings | Statistics). The level of statistics collection and retention periods can have a dramatic impact on your vCenter Server’s performance if not carefully planned and monitored. In particular, the vCenter database can grow quite large, and the database server needed to support the additional statistics must grow with it (increased disk IO capacity, CPU, and memory). Fortunately, VMware has provided a vCenter database sizing tool within the vCenter client (see picture). This is all well and good for initial sizing, and my experience shows that vCenter’s sizing estimates are fairly accurate assuming the environment remains healthy.
I recently migrated an environment from vCenter 2.5 to 4.0 and in the process switched from a Windows 2003 32-bit vCenter host and a SQL 2005 server (remote to vCenter) to a Windows 2008 64-bit vCenter server with a SQL 2008 server (again, a remote SQL server). I experienced a few issues during the migration and thought I had worked through them all (I’ll post on those at a later date). However, after a bit of time I found that performance statistics for objects in the vCenter were missing or not rendering at an acceptable pace. Upon further investigation, I discovered the following warnings in the vCenter Service Status node indicating that performance rollups within the vCenter database were not taking place:
- Performance statistics rollup from Past Day to Past Week is not occurring in the database
- Performance statistics rollup from Past Week to Past Month is not occurring in the database
- Performance statistics rollup from Past Month to Past Year is not occurring in the database
In a SQL-backed vCenter, statistics rollups are handled by the SQL Server Agent (note: if you are using SQL Server Express, statistics rollups are handled by vCenter itself as SQL Express does not offer SQL Server Agent jobs). KB 1003570 describes this process (it applies to vCenter 2.5, but the principles in it can be applied to 4.0). To troubleshoot and resolve the issue I opened SQL Server Management Studio and checked several items:
- Is the SQL Server Agent running?
- Are there statistics rollup jobs defined for SQL server agent?
- Are those jobs running?
In my case, the SQL Server Agent was running (you are prompted to configure this during the vCenter install). However, when I checked for the presence of rollup jobs, I discovered that only a Past Day job had migrated with the database to the new SQL server. Upon investigating the job history for that job I discovered that the job had not run since the migration (note to self: add these checks to your standard vCenter migration checklist).
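If you want to turn the "are the jobs there?" check into something repeatable, a few lines of Python will do. This sketch assumes you have a list of SQL Server Agent job names (for example, copied from the Jobs node in SQL Server Management Studio) and that the rollup jobs are all named like the ‘Past Day stats rollupVirtualCenter’ job, that is, the interval followed by "stats rollup" and the database name. The naming pattern for the Week and Month jobs is an assumption, and the function name is made up.

```python
# Rollup intervals expected to have a SQL Server Agent job (assumed naming).
EXPECTED_ROLLUPS = ("Past Day", "Past Week", "Past Month")


def missing_rollup_jobs(job_names):
    """Return the expected rollup intervals for which no job was found."""
    return [prefix for prefix in EXPECTED_ROLLUPS
            if not any(name.startswith(prefix + " stats rollup")
                       for name in job_names)]


# Example job list resembling the situation described above:
# only the Past Day job survived the migration.
found = ["Past Day stats rollupVirtualCenter", "syspolicy_purge_history"]
print(missing_rollup_jobs(found))
```

An empty list means all three jobs are defined; it says nothing about whether they are actually running, so the job history still needs a look.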
To remediate the problem I completed the following steps:
- Remove the bad ‘Past Day stats rollupVirtualCenter’ job from the list of SQL Server Agent Jobs.
- Recreate the three standard stats rollup jobs. To recreate the jobs, find SQL scripts on your vCenter server in C:\Program Files (x86)\VMware\Infrastructure\VirtualCenter Server. The .sql scripts you’ll need are stats_rollup1_proc_mssql.sql, stats_rollup2_proc_mssql.sql, and stats_rollup3_proc_mssql.sql. Run these scripts in SQL Query Analyzer against your VirtualCenter Database in order from 1 to 3. These scripts should create the rollup jobs and their associated stored procedures (this procedure is detailed at https://communities.vmware.com/thread/123715).
- After recreating the jobs I took a backup of the vCenter database. The Past Day job soon kicked off to begin a stats rollup (this runs every 30 minutes by default).
I checked the server several hours later and discovered that rather than completing successfully, the Past Day job was still running and the drive holding my vCenter database transaction log was full. Back to the drawing board.
- I disabled the Past Week and Past Month rollup jobs to avoid job conflicts.
- I backed up the vCenter database and then performed a shrink of the log file to get it back down to size.
- The vCenter was running as a VM, so I was able to quickly increase its disk size and use diskpart from within the guest to extend the partition. The space required to process weeks of performance statistics is not included in the vCenter Database Sizing tool as it is assumed that the rollup/purge jobs will run as designed.
I wanted to see how bad the problem was before kicking off another job so I ran:
select count(*) from vpx_hist_stat1
against the vCenter database in SQL Query Analyzer. The query ran for several hours (never a good sign) and eventually returned well over 20 million rows of performance statistics (thanks to https://communities.vmware.com/message/1318736 for pointing me in this direction). I investigated options to truncate the tables (see above link), and also looked at a script from VMware KB 1000125: Purging old data from the database used by vCenter Server. In the end, I decided to try to let the Past Day stats job run.
I stopped the vCenter Server Service to prevent new statistics from being written to the database. I also disabled the Past Week and Past Month SQL Agent jobs to prevent job conflicts and then manually started the Past Day job. I had to stop the job several times as it filled the 100GB transaction log volume. A backup & shrink operation gave me back the space on the log volume. I saw about 300GB of transaction logs written over the course of this process, but the Past Day job eventually completed.
Finally, I re-enabled the Past Week and Past Month jobs and manually ran both of them (Past Week first, then Past Month), followed by a backup and shrink of the vCenter database. I was impressed with the performance increase I saw in the vCenter client. Lists and performance graphs rendered much faster than when stats rollups were not taking place.
It would be a good idea to include checking stats rollup job status and a count of rows from the vpx_hist_stat tables in the vCenter database in your regular maintenance tasks. For other vCenter Database best practices, check out breakout session PO2061 from VMworld 2008. If you did not attend or subscribe to VMworld, Scott Lowe covered the session in this post. A VMworld 2009 “online only” session entitled VM3237 vCenter Databases: Setup, Management and Best Practices was also offered (subscription required). I have not viewed this session so I cannot comment on its content.
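As a starting point for that kind of regular check, here is a minimal Python sketch that runs the same select count(*) query against a list of tables and flags any that exceed a threshold. To keep it self-contained it uses an in-memory SQLite database as a stand-in; against a real vCenter database you would need a SQL Server driver, which is not shown. The sample_id column, the threshold values, and the function names are placeholders, and only vpx_hist_stat1 is taken from the text above.

```python
import sqlite3


def count_rows(conn, tables):
    """Run select count(*) against each table and return {table: rows}."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as parameters, so validate them first.
        if not table.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {table!r}")
        cur.execute(f"select count(*) from {table}")
        counts[table] = cur.fetchone()[0]
    return counts


def over_threshold(counts, limit):
    """Return only the tables whose row count exceeds limit."""
    return {t: n for t, n in counts.items() if n > limit}


# Stand-in database with a handful of rows (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("create table vpx_hist_stat1 (sample_id integer)")
conn.executemany("insert into vpx_hist_stat1 values (?)",
                 [(i,) for i in range(5)])

counts = count_rows(conn, ["vpx_hist_stat1"])
print(counts)
print(over_threshold(counts, 3))
```

What counts as "too many rows" depends on your statistics level and inventory size, so treat the threshold as something to tune rather than a fixed number.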