Posts Tagged ‘sql’

VMware vCenter collects performance statistics, tasks and events for historical performance analysis and auditing.  The collection level and retention of performance statistics can be controlled through the vCenter GUI (see Administration | vCenter Server Settings | Statistics).   vCenter Statistics SettingsThe level of statistics collection and retention periods can have a dramatic impact on your vCenter Server’s performance if not carefully planned and monitored.  In particular, the vCenter database can grow quite large and the database server required to support the increase in statistics increases in size and performance characteristics (increased disk IO capacity, CPU, and memory).  Fortunately, VMware has provided a vCenter database sizing tool within the vCenter client (see picture).  This is all well and good for initial sizing, and my experience shows that vCenter’s sizing estimates are fairly accurate assuming the environment remains healthy.

I recently migrated an environment from vCenter 2.5 to 4.0 and in the process switched from a Windows 2003 32-bit vCenter host and a SQL 2005 server (remote to vCenter) to a Windows 2008 64-bit vCenter server with a SQL 2008 server (again, a remote SQL server).  I experienced a few issues during the migration and thought I had worked through them all (I’ll post on those at a later date).  However, after a bit of time I found that performance statistics for objects in the vCenter were missing of not rendering at an acceptable pace.  Upon further investigation, I discovered warnings in the vCenter Service Status node indicating that performance rollups within the vCenter database were not taking place.

image

In a SQL-backed vCenter, statistics rollups are handled by the SQL Server Agent (note: if you are using SQL Server Express, statistics rollups are handled by vCenter itself as SQL Express does not offer SQL Server Agent jobs).  KB 1003570 describes this process (it applies to vCenter 2.5, but the principles in it can be applied to 4.0).  To troubleshoot and resolve the issue I opened SQL Server Management Studio and checked several items:

  1. Is the SQL Server Agent running?
  2. Are there statistics rollup jobs defined for SQL server agent?
  3. Are those jobs running?

In my case, the SQL Server Agent was running (you are prompted to configure this during the vCenter install).  However, when I checked for the presence of rollup jobs, I discovered that only a Past Day job had migrated with the database to the new SQL server.  Upon investigating the job history for that job I discovered that the job had not run since the migration (note to self: add these checks to your standard vCenter migration checklist).

To remediate the problem I completed the following steps:

  1. Remove the bad ‘Past Day stats rollupVirtualCenter’ job from the list of SQL Server Agent Jobs.
  2. Recreate the three standard stats rollup jobs.  To recreate the jobs, find SQL scripts on your vCenter server in C:\Program Files (x86)\VMware\Infrastructure\VirtualCenter Server.  The .sql scripts you’ll need are stats_rollup1_proc_mssql.sql, stats_rollup2_proc_mssql.sql, and stats_rollup3_proc_mssql.sql.  Run these scripts in SQL Query Analyzer against your VirtualCenter Database in order from 1 to 3.  These scripts should create the rollup jobs and their associated stored procedures (this procedure is detailed at http://communities.vmware.com/thread/123715).
  3. After recreating the jobs I took a backup of the vCenter database.  The Past Day job soon kicked off to begin a stats rollup (this runs every 30 minutes by default).

I checked the server several hours later and discovered that rather than completing successfully, the Past Day job was still running and the drive holding my vCenter database transaction log was full.  Back to the drawing board..

  1. I disabled the Past Week and Past Month rollup jobs to avoid job conflicts.
  2. I backed up the vCenter database and then performed a shrink of the log file to get it back down to size.
  3. The vCenter was running as a VM, so I was able to quickly increase its disk size and use diskpart from within the guest to extend the partition.  The space required to process weeks of performance statistics is not included in the vCenter Database Sizing tool as it is assumed that the rollup/purge jobs will run as designed.

I wanted to see how bad the problem was before kicking off another job so I ran:

select count(*) from vpx_hist_stat1

against the vCenter database in SQL Query Analyzer.  The query ran for several hours (never a good sign) and eventually returned well over 20 million rows of performance statistics (thanks to http://communities.vmware.com/message/1318736 for pointing me in this direction).  I investigated options to truncate the tables (see above link), and also looked at a script from VMware KB 1000125: Purging old data from the database used by vCenter Server.  In the end, I decided to try to let the Past Day stats job run.

I stopped the vCenter  Server Service to prevent new statistics from being written to the database.  I also disabled the Past Week and Past Month SQL Agent jobs to prevent job conflicts and then manually started the Past Day job.  I had to stop the job several times as it filled the 100GB transaction log volume.  A backup & shrink operation gave me back the space on the log volume.  I saw about 300GB of transaction logs written over the course of this process, but the Past Day job eventually completed.

Finally, I re-enabled the Past Week and Past Month jobs and manually ran both of them (Past Week first, then Past Month), followed by a backup and shrink of the vCenter database.  I was impressed with the performance increase I saw in the vCenter client.  Lists and performance graphs rendered much faster than when stats rollups were not taking place.

It would be a good idea to include checking stats rollup job status and a count of rows from the vpx_hist_stat tables in the vCenter database in your regular maintenance tasks.  For other vCenter Database best practices, check out breakout session PO2061 from VMworld 2008.  If you did not attend or subscribe to VMworld, Scott Lowe covered the session in this post.  A VMworld 2009 “online only” session entitled VM3237 vCenter Databases: Setup, Management and Best Practices was also offered (subscription required).  I have not viewed this session so I cannot comment on its content.

I have been meaning to write this up for a while; Scott Drummonds’ ‘Love Your Balloon Driver’ post today at his Virtual Performance blog gave me a nice reminder.  I actually caught a sneak peak at the graphs with an explanation from Scott at his instructor-led lab at VMworld 2009.  Scott calls out that the only workload they discovered suffers from balloon driver activity is Java.  The reason for Java’s problems with balloon driver activity is that Java itself runs in a VM and so the guest OS cannot properly determine which pages should be swapped out when the balloon driver calls for it.

My experiences causes me to agree with Scott and the whitepaper he cites – in a properly designed and equipped environment the balloon driver is not detrimental for most every workload to a point.   However, I recently discovered in a client site that the balloon driver can cause significant issues when the environment is poorly designed and under-sized.  Here the background:

I was called into an already established environment where the client was running on an older blade with VMware ESX 3.5.  The blade maxed out at 16GB RAM and had dual dual-core CPU’s with no hope for an upgrade.  On the blade was a single guest VM running Windows 2003 with SQL 2005, in it’s full 32-bit glory.  The VM was configured with 4 vCPU’s and 16GB of memory.  Some of you can probably already guess where this is going….

The x86 Windows guest had PAE configured, and SQL took advantage of AWE to use the additional memory beyond the 4GB limit of a 32-bit system.  Additionally, the Windows guest had the /3GB switch enabled in boot.ini.  Finally, as per SQL best practices, the ‘Lock Pages in Memory‘ permission was granted to the SQL Server service account.  What the guest was left with was 1GB of kernel mode memory and 15GB of User Mode/Extended addressable memory.

And here’s the problem.  The client was using ESX, not ESX 3.5, so the Service Console required memory.  In this case, the service console had approximately 512MB allocated to it.  Futhermore, VM’s require some overhead on ESX to run.  The memory overhead consumed by a Windows guest on ESX 3.5 with 4 vCPU and 16GB of memory is a bit more than 512MB.  On a properly sized ESX server with multiple similar guests/workloads, you could probably gain much of the overhead back through transparent page sharing; but in this case I had a 1:1 P2V ratio.  If you are any good at math you see that the environment is running about 1GB short of memory.  A quick check of the balloon driver stat in vCenter show that the balloon driver was constantly active and demanding about 1GB back from the guest… constantly.

Under normal circumstances this might not be an issue, but in this case the Windows guest was being absolutely punished.  The guest CPU’s were pegged at 100% with an excessive amount of kernel time, often indicating IO issues.  And indeed I did experience terrible disk and network performance on the guest.  At the root of the problem is this – the Lock Pages in Memory permission allows SQL to get a firm grasp on the user mode memory available to it (15GB) and lock it up.  This left the already starved (because of the 3GB switch in the boot.ini) guest kernel with it’s 1GB the only thing the balloon driver could really swap out.

The client suggested a reservation of 16GB on the VM, knowing that memory reservations prevent balloon driver activity.  I calmly asked them to back away from the keyboard as I explained how if a starved guest was bad, how much worse a starved Service Console would be.  In the end the fix was quiet easy – I convinced the customer that they should reduce the amount of memory allocated to the guest by about 1GB, enough to let the 512MB SC and the 512MB of overhead run without contention.  I was able to show them the difference between allocated and active memory in vCenter – the 1GB being surrendered was not really being actively used, SQL just had it locked up.  In fact, surrendering the 1GB of memory back to ESX breathed new life into the guest VM, bringing its performance back in line with expectations.

Ideally, I would have brought in a bigger ESX server that could serve additional VM’s, driving greater levels of efficiency across the environment.  It just wasn’t an option for the client in this case.  In the end, the problem was fixed and I was reminded just how fun it can be to explain some of these backwards sounding virtualization concepts to customers – fewer vCPU’s can lead to better performance of guests, less guest memory can fix performance issues, and increasing the quantity of similar guests on a host can drive better performance to a point because of transparent page sharing.

Stay tuned over the next few weeks as I digest and write on my VMworld experience – from VMUG activities to Paul Maritz’s press conference announcing the vCloud Express, and plenty of great sessions in between.  Like many of you, I returned from VMworld with quite a backlog of work but I’ll do my best to squeeze in some posts and tweets.

Here are some bookmarks for resources that I have recently referenced:

Follow Me!

    

Virtualization Jobs

Virtualization Resources