Posts Tagged ‘sql’
At the risk of beating a dead horse, it’s time to resurrect my Storage Basics series. I’ve recently had some great feedback on the series and figured I should round out a few of the concepts before I wrap it up. I want to cover a topic often discussed amongst virtualization professionals, but one that I find general practitioners and server admins often do not understand: storage alignment. Storage alignment, or the lack of it, is not a new issue and is not unique to VMware or virtualization in general. However, the effects of misaligned storage, reduced performance and added strain on the storage system, are felt more strongly in shared, oversubscribed or high I/O environments. Many others in the virtualization and storage communities have already covered partition alignment (see Duncan Epping, Vaughn Stewart, and most recently Chad Sakac), but I feel it is an important enough topic for me to re-hash as part of this series.
What is Storage Alignment?
Let’s start with a quick overview of what storage alignment means. Quite simply, storage alignment refers to the positioning (starting offset) of the various pieces of a system’s storage components – the physical disk sectors or array’s chunks, the VMware File System (VMFS) in a VMware environment, and the guest file system’s clusters within a partition – in relation to the layer directly under the element in question. A quick graphic often makes quick work of explaining this (I often whiteboard this concept for colleagues and clients):
As you can see, the starting offset of the VMFS partition does not correspond to the physical segmentation of the underlying disks (in this case, the chunks on a SAN – but could be conceptually replaced with the sectors of a single disk). Furthermore, the clusters (or blocks) of the guest VM are not aligned to the VMFS partition nor to the underlying storage. For traditional (physical) systems or VMware RDM’s, the VMFS layer could be abstracted, but the result would be the same – the clusters of a partition would be misaligned to the underlying disk.
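The alignment test itself is plain modular arithmetic. Here is a minimal sketch, assuming a 64KB array chunk, 512-byte sectors, and the classic 63-sector partition start (32,256 bytes) as the misaligned case. Those numbers are illustrative and not taken from the graphic. It also simplifies by treating the guest partition's position on the LUN as the VMFS starting offset plus the partition's offset inside the virtual disk, ignoring where the VMDK file sits within the VMFS.

```python
SECTOR = 512  # bytes per sector, assumed for this example


def is_aligned(offset_bytes: int, boundary_bytes: int) -> bool:
    """Return True when offset_bytes falls exactly on a boundary."""
    return offset_bytes % boundary_bytes == 0


def guest_is_aligned(vmfs_start: int, guest_start: int, chunk: int) -> bool:
    """Check the guest partition against the array chunks.

    Simplification: the guest partition's position on the LUN is taken as
    the VMFS starting offset plus the partition's offset inside the VMDK.
    """
    return is_aligned(vmfs_start + guest_start, chunk)


CHUNK = 64 * 1024           # 64KB array chunk (example value)
LEGACY_START = 63 * SECTOR  # 32,256 bytes, the classic 63-sector start

print(is_aligned(LEGACY_START, CHUNK))               # False
print(guest_is_aligned(CHUNK, LEGACY_START, CHUNK))  # False: guest is misaligned
print(guest_is_aligned(CHUNK, CHUNK, CHUNK))         # True: both layers line up
```

Note that an aligned VMFS does not rescue a misaligned guest partition; every layer has to start on a boundary of the layer beneath it.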
What Does it Mean?
Quite simply, misaligned storage (both VMFS partitions and Guest File Systems) can lead to poor performance under certain conditions. How badly performance is impacted depends on the degree of I/O strain your server and storage are under, the caching mechanisms in your environment, and the architecture of your SAN. Again, a visual can help explain how misaligned storage can hurt you. For simplicity let’s leave out the VMFS layer as we consider the following diagram (pardon my hasty Visio visualization):
What we see is that the target data in a tiny 16KB read request spans two 64KB chunks on our storage array. Any read of that piece of data will transfer twice as much data to the host’s storage stack as is minimally necessary. The net effect is an increase in the work the storage array must do – gobbling up IOPS that would otherwise be available for the real work of reading data, reducing throughput on the interface, and messing with cache algorithms and dedupe mechanisms on some arrays. In short, misaligned storage is an efficiency killer. Now add the VMFS layer back in and you’ll see how things get complicated.
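To put numbers on the diagram, here is a small sketch that counts how many chunks a read touches. The 16KB read and 64KB chunk sizes come from the example above; the 56KB starting offset is a value I picked so that the read straddles a chunk boundary.

```python
KB = 1024


def chunks_touched(offset: int, length: int, chunk: int) -> int:
    """Count the array chunks covered by a read of `length` bytes at `offset`."""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    return last - first + 1


CHUNK = 64 * KB
READ = 16 * KB

aligned = chunks_touched(0, READ, CHUNK)          # read sits inside one chunk
straddling = chunks_touched(56 * KB, READ, CHUNK) # read crosses a chunk boundary

print(aligned, aligned * CHUNK // KB)        # 1 64  (chunks, KB fetched)
print(straddling, straddling * CHUNK // KB)  # 2 128 (chunks, KB fetched)
```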
If (and we’re talking a big IF here) every bit of data you wanted to read spanned a chunk or sector boundary, you could experience half the expected performance due to misalignment. In reality, depending on your workload and storage technology, the performance increase from properly aligning your storage will probably be somewhere between 10% and 30%.
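Why the worst case is rare can be seen by counting how many I/Os actually cross a chunk boundary. The sketch below is only an illustration under assumed values: a partition starting at 32,256 bytes (63 sectors), 64KB chunks, and back-to-back I/Os of a fixed size. It counts boundary crossings and says nothing about caching or real array behavior, so it is not a performance prediction.

```python
def crossing_fraction(partition_start: int, io_size: int,
                      chunk: int, count: int) -> float:
    """Fraction of `count` sequential I/Os that span two chunks."""
    crossing = 0
    for i in range(count):
        start = partition_start + i * io_size
        end = start + io_size - 1
        if start // chunk != end // chunk:
            crossing += 1
    return crossing / count


CHUNK = 64 * 1024
MISALIGNED = 63 * 512   # 32,256 bytes (assumed example offset)
ALIGNED = 1024 * 1024   # 1MB (assumed example offset)

print(crossing_fraction(MISALIGNED, 4096, CHUNK, 1600))   # 0.0625
print(crossing_fraction(MISALIGNED, CHUNK, CHUNK, 1600))  # 1.0
print(crossing_fraction(ALIGNED, 4096, CHUNK, 1600))      # 0.0
```

With small 4KB I/Os only one in sixteen crosses a boundary, while chunk-sized I/Os cross every time, which is why the penalty depends so heavily on the workload.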
Want to dig deeper?
Some great resources on storage alignment have been published over the past few years. Major vendors have all begun pushing information on the problem – here are some of the best that I have found:
Microsoft has a Knowledge Base article (http://support.microsoft.com/kb/929491) that describes the problem and symptoms of misaligned partitions, how to determine if your partition is aligned, and the use of diskpart to create aligned partitions.
Microsoft also has an in-depth article on MSDN, including some performance numbers at http://msdn.microsoft.com/en-us/library/dd758814.aspx. Also check out Jimmy May’s series Partition (Sector) alignment for SQL Server here: http://blogs.msdn.com/b/jimmymay/archive/2008/10/14/disk-partition-alignment-for-sql-server-slide-deck.aspx. One of the best descriptions of the complexities of the problem can be found in Jimmy’s blog series.
VMware has an article here: http://www.vmware.com/pdf/esx3_partition_align.pdf. Be aware that this article is for Virtual Infrastructure 3, not vSphere 4.0. Some of the information is now a bit dated.
NetApp has a few documents to check out: http://media.netapp.com/documents/tr-3428.pdf (VI3), and http://media.netapp.com/documents/tr-3749.pdf (vSphere)
EMC covers alignment in their TechBooks for Clariion, Celerra, and Symmetrix.
Tools to Align Partitions:
Ok – so you’ve bought into this whole partition alignment thing as being a real issue. How do you fix it? Here are some tools:
- MSInfo32.exe, wmic, and dmdiag will show you misaligned partitions on Windows machines (check the Microsoft links above for usage info).
- Diskpart.exe (or diskpar.exe on versions of Windows previous to 2003) creates aligned partitions on Windows systems. Diskpart cannot be used to realign a previously created partition, only to create new correctly aligned partitions.
- MBRScan/MBRAlign from NetApp can report on and realign existing virtual disks on a VMware ESX server. NetApp also has a nifty PowerShell script to find out whether your partitions are aligned: http://communities.netapp.com/docs/DOC-6175
- vOptimizer from Vizioncore can report on and realign existing virtual disks.
- GParted can be used to create aligned partitions on both Windows and Linux machines, and to realign some existing partitions.
- VMware vCenter – VMFS datastores created using vCenter are aligned automatically. Note – Guest VMDK’s are not aligned automatically by vCenter – you must manually create aligned partitions on your VMDK’s or use a Guest OS that creates properly aligned partitions (Windows 2008 and later).
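Whatever tool you use, creating an aligned partition boils down to rounding the starting offset up to the next boundary. A minimal sketch of that calculation follows; the 64KB and 1MB boundaries and the 512-byte sector size are example values, so check your storage vendor's guidance for the right boundary in your environment.

```python
SECTOR = 512  # bytes per sector, assumed for this example


def next_aligned_offset(offset: int, boundary: int) -> int:
    """Round `offset` up to the nearest multiple of `boundary`."""
    return ((offset + boundary - 1) // boundary) * boundary


legacy_start = 63 * SECTOR  # 32,256 bytes

start_64k = next_aligned_offset(legacy_start, 64 * 1024)
start_1m = next_aligned_offset(legacy_start, 1024 * 1024)

print(start_64k, start_64k // SECTOR)  # 65536 128
print(start_1m, start_1m // SECTOR)    # 1048576 2048
```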
Best Practices:
Before I wrap this installment up, here are some best practices for storage alignment in your environment:
- Create aligned partitions in your VMware templates. Do it once, do it right – every machine you deploy from the template will be aligned.
- Use caution with tools like Symantec Ghost. Ghost can take images of aligned partitions and misalign them when laying down on a new system.
- Use caution when performing P2V’s using VMware vCenter Converter – it does not align guest disks on import. You might consider using Converter to perform a P2V of the system disk only, then create new VMDK’s on the converted guest. Use Diskpart, GParted, or another tool to create aligned partitions on the new VMDK’s and finally copy the data over to the newly virtualized server using a tool like Robocopy, RichCopy, or rsync.
- SSD’s are particularly sensitive to misalignment, leading to poor performance and excessive wear.
- Local VMFS volumes created by the ESX installer are not aligned. If you are using an installer-created local VMFS for anything where performance matters, you might consider re-creating it through vCenter.
- Watch out when attaching a data disk from an older VM to a new VM. For example, you are upgrading your SQL servers to Windows 2008 R2 from 2003. You decide to do a side-by-side upgrade, using the detach/attach method. You install (or better yet, deploy from template) a new Windows 2008 R2 VM, detach your databases from the old server, move your SQL data and log virtual disks from your 2003 VM to the new VM and attach the SQL DB’s on the new server. Those old VMDK’s may be misaligned! Consider using Robocopy, RichCopy or rsync to ensure an aligned disk.
- Check your storage vendor’s best practices for your particular environment (OS, workload, SAN, etc.).
- There is some debate on whether or not it is advised to align your OS partitions. There is no clear-cut answer on this as it depends so much on your environment and particular needs. For help in deciding if you should align your Guest OS drives, see the comments in the blogs by Duncan Epping, Vaughn Stewart, and Chad Sakac.
- While working the VMware User Group booth at the Washington, DC Virtualization Forum 2010 I had a user ask me if rules and procedures for alignment on 4k sector disks are different. I forgot to research it until just now, so I honestly don’t know (please comment if you do know!). Check with your storage vendor if this is an issue for you.
- Finally, you can’t realign partitions using tools like mbralign or vOptimizer in ESXi. Aaron Delp explains the problem here: http://blog.aarondelp.com/2010/06/my-1-issue-with-vmware-esxi-today.html.
I hope this is helpful for you in understanding the problem of storage alignment and how it can impact your environment. Comments or questions are welcomed!
VMware vCenter collects performance statistics, tasks and events for historical performance analysis and auditing. The collection level and retention of performance statistics can be controlled through the vCenter GUI (see Administration | vCenter Server Settings | Statistics).
The level of statistics collection and retention periods can have a dramatic impact on your vCenter Server’s performance if not carefully planned and monitored. In particular, the vCenter database can grow quite large, and the database server needed to support the additional statistics grows with it in both size and performance requirements (disk IO capacity, CPU, and memory). Fortunately, VMware has provided a vCenter database sizing tool within the vCenter client (see picture). This is all well and good for initial sizing, and my experience shows that vCenter’s sizing estimates are fairly accurate assuming the environment remains healthy.
I recently migrated an environment from vCenter 2.5 to 4.0 and in the process switched from a Windows 2003 32-bit vCenter host and a SQL 2005 server (remote to vCenter) to a Windows 2008 64-bit vCenter server with a SQL 2008 server (again, a remote SQL server). I experienced a few issues during the migration and thought I had worked through them all (I’ll post on those at a later date). However, after a bit of time I found that performance statistics for objects in the vCenter were missing or not rendering at an acceptable pace. Upon further investigation, I discovered warnings in the vCenter Service Status node indicating that performance rollups within the vCenter database were not taking place.
In a SQL-backed vCenter, statistics rollups are handled by the SQL Server Agent (note: if you are using SQL Server Express, statistics rollups are handled by vCenter itself as SQL Express does not offer SQL Server Agent jobs). KB 1003570 describes this process (it applies to vCenter 2.5, but the principles in it can be applied to 4.0). To troubleshoot and resolve the issue I opened SQL Server Management Studio and checked several items:
- Is the SQL Server Agent running?
- Are there statistics rollup jobs defined for SQL server agent?
- Are those jobs running?
In my case, the SQL Server Agent was running (you are prompted to configure this during the vCenter install). However, when I checked for the presence of rollup jobs, I discovered that only a Past Day job had migrated with the database to the new SQL server. Upon investigating the job history for that job I discovered that the job had not run since the migration (note to self: add these checks to your standard vCenter migration checklist).
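These checks are easy to script once you have the job names and last-run times in hand. The sketch below works on plain Python data rather than talking to SQL Server, so how you fetch the job list is left out. The Past Day job name comes from my environment; the Past Week and Past Month prefixes are assumed to follow the same pattern, and the one-day staleness limit is a hypothetical value.

```python
from datetime import datetime, timedelta

# Assumed name prefixes; the database name is appended to each job name.
EXPECTED_PREFIXES = (
    "Past Day stats rollup",
    "Past Week stats rollup",
    "Past Month stats rollup",
)


def missing_rollup_jobs(job_names):
    """Return the expected prefixes that no defined job name starts with."""
    return [p for p in EXPECTED_PREFIXES
            if not any(name.startswith(p) for name in job_names)]


def stale_jobs(last_runs, now, max_age):
    """Return jobs that have never run or last ran longer ago than max_age."""
    return sorted(name for name, ran in last_runs.items()
                  if ran is None or now - ran > max_age)


# Example data resembling what I found after the migration.
jobs = {"Past Day stats rollupVirtualCenter": None}  # None = never ran

print(missing_rollup_jobs(jobs))
# ['Past Week stats rollup', 'Past Month stats rollup']
print(stale_jobs(jobs, datetime(2010, 1, 1), timedelta(days=1)))
# ['Past Day stats rollupVirtualCenter']
```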
To remediate the problem I completed the following steps:
- Remove the bad ‘Past Day stats rollupVirtualCenter’ job from the list of SQL Server Agent Jobs.
- Recreate the three standard stats rollup jobs. To recreate the jobs, find SQL scripts on your vCenter server in C:\Program Files (x86)\VMware\Infrastructure\VirtualCenter Server. The .sql scripts you’ll need are stats_rollup1_proc_mssql.sql, stats_rollup2_proc_mssql.sql, and stats_rollup3_proc_mssql.sql. Run these scripts in SQL Query Analyzer against your VirtualCenter Database in order from 1 to 3. These scripts should create the rollup jobs and their associated stored procedures (this procedure is detailed at http://communities.vmware.com/thread/123715).
- After recreating the jobs I took a backup of the vCenter database. The Past Day job soon kicked off to begin a stats rollup (this runs every 30 minutes by default).
I checked the server several hours later and discovered that rather than completing successfully, the Past Day job was still running and the drive holding my vCenter database transaction log was full. Back to the drawing board.
- I disabled the Past Week and Past Month rollup jobs to avoid job conflicts.
- I backed up the vCenter database and then performed a shrink of the log file to get it back down to size.
- The vCenter was running as a VM, so I was able to quickly increase its disk size and use diskpart from within the guest to extend the partition. The space required to process weeks of performance statistics is not included in the vCenter Database Sizing tool as it is assumed that the rollup/purge jobs will run as designed.
I wanted to see how bad the problem was before kicking off another job so I ran:
select count(*) from vpx_hist_stat1
against the vCenter database in SQL Query Analyzer. The query ran for several hours (never a good sign) and eventually returned well over 20 million rows of performance statistics (thanks to http://communities.vmware.com/message/1318736 for pointing me in this direction). I investigated options to truncate the tables (see above link), and also looked at a script from VMware KB 1000125: Purging old data from the database used by vCenter Server. In the end, I decided to try to let the Past Day stats job run.
I stopped the vCenter Server Service to prevent new statistics from being written to the database. I also disabled the Past Week and Past Month SQL Agent jobs to prevent job conflicts and then manually started the Past Day job. I had to stop the job several times as it filled the 100GB transaction log volume. A backup & shrink operation gave me back the space on the log volume. I saw about 300GB of transaction logs written over the course of this process, but the Past Day job eventually completed.
Finally, I re-enabled the Past Week and Past Month jobs and manually ran both of them (Past Week first, then Past Month), followed by a backup and shrink of the vCenter database. I was impressed with the performance increase I saw in the vCenter client. Lists and performance graphs rendered much faster than when stats rollups were not taking place.
It would be a good idea to include checking stats rollup job status and a count of rows from the vpx_hist_stat tables in the vCenter database in your regular maintenance tasks. For other vCenter Database best practices, check out breakout session PO2061 from VMworld 2008. If you did not attend or subscribe to VMworld, Scott Lowe covered the session in this post. A VMworld 2009 “online only” session entitled VM3237 vCenter Databases: Setup, Management and Best Practices was also offered (subscription required). I have not viewed this session so I cannot comment on its content.
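As a rough illustration of the row-count check, here is a sketch that runs the same kind of count query against each statistics table and flags any table over a limit. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; it is not how you would connect to a real vCenter database. The single column, the tiny row counts, and the threshold are all made up for the demo.

```python
import sqlite3


def table_row_counts(conn, tables):
    """Run SELECT COUNT(*) against each table and return {table: rows}."""
    counts = {}
    for table in tables:
        # Table names come from a fixed list here, never from user input.
        counts[table] = conn.execute(
            f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts


def over_threshold(counts, limit):
    """Return the tables holding more than `limit` rows."""
    return sorted(t for t, n in counts.items() if n > limit)


# Stand-in database with two hypothetical, tiny statistics tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vpx_hist_stat1 (sample_id INTEGER)")
conn.execute("CREATE TABLE vpx_hist_stat2 (sample_id INTEGER)")
conn.executemany("INSERT INTO vpx_hist_stat1 VALUES (?)",
                 [(i,) for i in range(5)])
conn.executemany("INSERT INTO vpx_hist_stat2 VALUES (?)",
                 [(i,) for i in range(2)])

counts = table_row_counts(conn, ["vpx_hist_stat1", "vpx_hist_stat2"])
print(counts)                     # {'vpx_hist_stat1': 5, 'vpx_hist_stat2': 2}
print(over_threshold(counts, 3))  # ['vpx_hist_stat1']
```

In a real environment the limit would be chosen from your own baseline; a table that keeps growing between checks is the signal that rollups have stopped.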
I have been meaning to write this up for a while; Scott Drummonds’ ‘Love Your Balloon Driver’ post today at his Virtual Performance blog gave me a nice reminder. I actually caught a sneak peek at the graphs with an explanation from Scott at his instructor-led lab at VMworld 2009. Scott calls out that the only workload they found to suffer from balloon driver activity is Java. The reason for Java’s problems with balloon driver activity is that Java itself runs in a VM and so the guest OS cannot properly determine which pages should be swapped out when the balloon driver calls for it.
My experience leads me to agree with Scott and the whitepaper he cites – in a properly designed and equipped environment the balloon driver is, up to a point, not detrimental to almost any workload. However, I recently discovered at a client site that the balloon driver can cause significant issues when the environment is poorly designed and under-sized. Here is the background:
I was called into an already established environment where the client was running on an older blade with VMware ESX 3.5. The blade maxed out at 16GB RAM and had dual dual-core CPU’s with no hope for an upgrade. On the blade was a single guest VM running Windows 2003 with SQL 2005, in its full 32-bit glory. The VM was configured with 4 vCPU’s and 16GB of memory. Some of you can probably already guess where this is going….
The x86 Windows guest had PAE configured, and SQL took advantage of AWE to use the additional memory beyond the 4GB limit of a 32-bit system. Additionally, the Windows guest had the /3GB switch enabled in boot.ini. Finally, as per SQL best practices, the ‘Lock Pages in Memory‘ permission was granted to the SQL Server service account. What the guest was left with was 1GB of kernel mode memory and 15GB of User Mode/Extended addressable memory.
And here’s the problem. The client was using ESX, not ESXi, so the Service Console required memory. In this case, the Service Console had approximately 512MB allocated to it. Furthermore, VM’s require some overhead on ESX to run. The memory overhead consumed by a Windows guest on ESX 3.5 with 4 vCPU and 16GB of memory is a bit more than 512MB. On a properly sized ESX server with multiple similar guests/workloads, you could probably gain much of the overhead back through transparent page sharing; but in this case I had a 1:1 P2V ratio. If you are any good at math you see that the environment is running about 1GB short of memory. A quick check of the balloon driver stat in vCenter showed that the balloon driver was constantly active and demanding about 1GB back from the guest… constantly.
Under normal circumstances this might not be an issue, but in this case the Windows guest was being absolutely punished. The guest CPU’s were pegged at 100% with an excessive amount of kernel time, often indicating IO issues. And indeed I did experience terrible disk and network performance on the guest. At the root of the problem is this – the Lock Pages in Memory permission allows SQL to get a firm grasp on the user mode memory available to it (15GB) and lock it up. This left the already starved (because of the 3GB switch in the boot.ini) guest kernel with its 1GB the only thing the balloon driver could really swap out.
The client suggested a reservation of 16GB on the VM, knowing that memory reservations prevent balloon driver activity. I calmly asked them to back away from the keyboard and explained that if a starved guest was bad, a starved Service Console would be much worse. In the end the fix was quite easy – I convinced the customer that they should reduce the amount of memory allocated to the guest by about 1GB, enough to let the 512MB SC and the 512MB of overhead run without contention. I was able to show them the difference between allocated and active memory in vCenter – the 1GB being surrendered was not really being actively used, SQL just had it locked up. In fact, surrendering the 1GB of memory back to ESX breathed new life into the guest VM, bringing its performance back in line with expectations.
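The memory math behind the fix is simple enough to sketch. The figures are the round numbers from this story (16GB host, 16GB guest, 512MB Service Console, roughly 512MB of VM overhead). The model is deliberately simplified: it ignores the hypervisor's own memory and treats the overhead as fixed, whereas real overhead varies with the VM's configuration.

```python
def memory_shortfall_mb(host_mb: int, vm_mb: int,
                        console_mb: int, overhead_mb: int) -> int:
    """Memory demanded beyond what the host physically has (0 if it fits)."""
    return max(0, vm_mb + console_mb + overhead_mb - host_mb)


def max_vm_memory_mb(host_mb: int, console_mb: int, overhead_mb: int) -> int:
    """Largest guest allocation that leaves room for console and overhead."""
    return host_mb - console_mb - overhead_mb


HOST = 16 * 1024   # 16GB blade
GUEST = 16 * 1024  # 16GB configured on the VM
CONSOLE = 512      # Service Console allocation
OVERHEAD = 512     # approximate per-VM overhead (round figure)

print(memory_shortfall_mb(HOST, GUEST, CONSOLE, OVERHEAD))  # 1024
print(max_vm_memory_mb(HOST, CONSOLE, OVERHEAD))            # 15360
```

The 1,024MB shortfall matches what the balloon driver kept demanding, and 15,360MB (15GB) is the guest size that ended the contention.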
Ideally, I would have brought in a bigger ESX server that could serve additional VM’s, driving greater levels of efficiency across the environment. It just wasn’t an option for the client in this case. In the end, the problem was fixed and I was reminded just how fun it can be to explain some of these backwards sounding virtualization concepts to customers – fewer vCPU’s can lead to better performance of guests, less guest memory can fix performance issues, and increasing the quantity of similar guests on a host can drive better performance to a point because of transparent page sharing.
Stay tuned over the next few weeks as I digest and write on my VMworld experience – from VMUG activities to Paul Maritz’s press conference announcing the vCloud Express, and plenty of great sessions in between. Like many of you, I returned from VMworld with quite a backlog of work but I’ll do my best to squeeze in some posts and tweets.
Here are some bookmarks for resources that I have recently referenced:
- vCenter 4 and ESX 4 Now Use 10 Year Default SSL Certificate | VM /ETC – Rich Brambley has some guidance on installing a new SSL certificate on vCenter, with very useful links in his post to official VMware documentation and KB’s on the subject.
- VMware vSphere Client on Microsoft Windows 7! | Virtual Lifestyle – Heiko Verlande has found a way to run the VMware vSphere Client on Windows 7.
- Virtu-Al » PowerCLI: Daily Report V2 – Version two of a handy PowerShell based VMware Environment Daily Report from VMware vExpert and PowerShell guru Alan Renouf
- What’s new/Bug Fixes
* Active VMs count
* Inactive VMs count
* DRS Migrations count and list
* Correct NTP Server check for each host
* VMs stored on local datastores
* NTP Service check for each host
* vmkernel warning messages for each host
* VM CPU ready over x%
- VMware Self-Service - VMware Update Manager Plug-In fails to install – Troubleshooting steps for vCenter Plug-in install problems.
- Using VMware VDI and vmSight for Stronger and Sustainable HIPAA and PCI Compliance – Virtualization brings new options for protecting sensitive data by moving it from the desktop into the datacenter.
- Counter of the Week : Analyzing Storage Performance – The purpose of this article is to provide prescriptive guidance on how to troubleshoot logical and physical disk response times in regards to Windows performance analysis. Start with the following performance counters to analyze disk response…
- NetApp, Compellent, HP, Dell top the field in 12-product test – Network World – A terabyte isn’t what it used to be. Disks are slower than you think. And a Gigabit Ethernet is plenty of bandwidth for many storage applications.






