Posts Tagged ‘performance’
I ran into an issue with a customer today where a VM was performing terribly. From within the guest OS (a Windows 2003 application server running .NET in IIS which I will call BigBadServer) things appeared sluggish and CPU time was high. The amount of time being spent on the kernel was notably high. The VM in question had 4 vCPU’s and a good helping of memory.
I don’t have access to the VMware client at this particular site – just some of the guests, so I was flying blind. Gut feeling told me that I was dealing with a resource contention issue. The VMstats provider I had running in the guest (http://vpivot.com/2009/09/17/using-perfmon-for-accurate-esx-performance-counters/) showed me that there was no ballooning or swapping going on, that the vCPU’s were not limited, and that the CPU share value seemed to be at the default.
I strongly suspected that the physical server running VMware ESX was oversubscribed on physical CPU (pCPU) resources. Essentially, the guest VM’s that are sharing the resources of the physical machine are demanding more resources than the machine can handle. To verify this theory, I had the client check the ‘CPU Ready’ metric on BigBadServer and bingo!
CPU Ready is a measure of the amount of time that the guest VM is ready to run against the pCPU, but the VMware CPU Scheduler cannot find time to run the VM because other VM’s are competing for the same resources.
From the stats the customer provided on our phone call, the CPU Ready for any one of the 4 vCPU’s on the BigBadServer was on average 3723ms (min: 1269ms, max:8491ms). (Update 8/25/2010 to clarify summation stat) The summation for the entire VM was around 12,000ms on average and peaked around 35,000. The stats came from the real-time performance graph/table in the vSphere client. The real-time stats in the vSphere Client update every 20 seconds, so the CPU Ready summation value should be divided by 20,000 to get a percentage of CPU ready for the 20 second time slice. If I take the worst case scenario of 8491ms per vCPU, this VM spent nearly 43% (8491/20,000) of the 20 second time slice waiting for CPU resources.
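The arithmetic above can be sketched as a small helper. This is only a minimal sketch, assuming the 20-second real-time sampling interval described in this post; the function name is my own and is not part of any VMware tool.

```python
def cpu_ready_percent(ready_ms, interval_s=20):
    """Convert a CPU Ready summation (ms) into a percentage of the sampling interval."""
    return ready_ms / (interval_s * 1000) * 100

# Per-vCPU figures quoted in this post (real-time chart, 20-second samples)
for label, ready_ms in [("min", 1269), ("avg", 3723), ("max", 8491)]:
    print(f"{label}: {cpu_ready_percent(ready_ms):.1f}%")
```

If you pull the stat from a chart with a different sampling interval, pass that interval in seconds as `interval_s` instead of assuming 20.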
The CPU Ready summation in milliseconds counter in the vCenter Client is not always the most accurate or easy to interpret stat – to better quantify the problem it might be best to go to the ESX command line and run ESXTOP. CPU Ready over 5% could be a sign of trouble, over 10% and there is a problem. Running ESXTOP in batch mode and then analyzing the output using Windows Perfmon or Excel might be a good way to go on this to get a view over several hours rather than the realtime stats we were looking at. I wrote a post a while back with more info on ESXTOP batch mode: http://vmtoday.com/2009/09/esxtop-batch-mode-windows-perfmon/
To help quantify the problem a bit more, the BigBadServer is on an ESX 4.0 server with about 10 other servers. The physical blade has two dual-core CPU’s (AMD Opteron 2218HE’s which are not hyperthreaded). The other VM’s on the blade have different vCPU and vMemory configurations. 3 VM’s (including BigBadServer) have 4 vCPU’s. A couple have 2 vCPU’s, and the remainder are configured with 1 vCPU. In ESX 4.x, the VMware console OS actually runs as a hidden VM, pegged to pCPU #1.
I generally recommend a pCPU:vCPU ratio of 1:4 for mid-sized VMware deployments of single vCPU VM’s. The blade we are running on is a 1:5 with several multi-vCPU VM’s. The multi-vCPU’s start to skew the ratio recommendation and require some advanced design decisions. VMware’s scheduler requires that all the vCPU’s on a VM run concurrently (even if the Guest OS is trying to execute a single thread). Also, the VMware CPU Scheduler prefers to have all the vCPU’s from a VM run on the same pCPU. As workloads are bounced around between pCPU’s, the benefits of CPU cache are lost. This is one of those ‘more-is-less’ situations that you run into on virtualized environments.
What this CPU Scheduler nonsense means in this case is that the 4 vCPU’s on BigBadServer have to wait until all logical pCPU’s on the box are idle (including the one that runs ESX itself) before it can run. If ESX can’t accomplish that (we are experiencing resource contention) it starts prioritizing workloads according to what it can best run. It is much easier to schedule the smaller VM’s, so it tends to run those on pCPU more frequently. The larger VM’s tend to suffer a bit more than the smaller ones. We are competing with 2 other VM’s with 4 vCPU’s that use up all of the logical pCPU’s when they need to run, as well as with the smaller VM’s.
I suggested a few ways to fix this issue for the BigBadServer web server:
- Using Shares and/or Reservations on the VM. This probably won’t work in our situation as the physical server is too over-subscribed. We might see a slight improvement in BigBadServer (or we might not see any change), but possibly at the extreme expense of the other VM’s sharing the blade.
- Reduce the number of vCPU’s on BigBadServer AND the other multi-vCPU VM’s on the same physical server. This would reduce resource contention and open up a whole bunch of scheduling options for the VMware CPU Scheduler. This is the quickest/cheapest fix, but will not work if the VM’s really do need 4 vCPU’s. A little workload analysis should determine which can be made smaller (the vCenter server graphs/stats should be enough for this). For what it’s worth, by our analysis BigBadServer seems to be happier with 4 vCPU assuming we can run with a low CPU Ready on those 4.
- Move the BigBadServer VM to a physical ESX server with fewer multi-vCPU VM’s so there is less contention.
- Move the BigBadServer VM to a physical ESX server with quad-core pCPU’s (ideally two quad-cores or bigger). This would give a lot more flexibility to the VMware CPU Scheduler and allow it to run quad-vCPU VM’s on the same pCPU for greater efficiency.
- Split BigBadServer into 2 smaller VM’s – The server currently runs a couple of sites. We could split them onto two servers – one for Project1 and one for Project2. This configuration would take some design, testing, and time but could scale out better and give more flexibility and availability in the long run.
I’m not sure which way the customer will go on this one yet, but I feel good having armed them with enough knowledge and options to make an informed decision.
To avoid problems like this in the future, I recommend these rules of thumb:
- Design your hosts for your guests. Taking your Guest VM sizes into account when designing your environment and choosing physical hardware is crucial if you need bigger VM’s.
- Don’t make your VM’s bigger than you have to. It is always easier to add resources than take them away. Hot Add of CPU and Memory in vSphere makes adding incredibly easy.
- Monitor your environment for CPU Ready, Swapping, and other metrics that can indicate an inefficient design.
- Call for help when you can’t figure out what is going on (I’m happy to help!). VMware is super powerful, but some things can be downright backwards when it comes to resource allocation on a fixed set of hardware.
If you are looking for some resources to help explain CPU Scheduling a bit more, I recommend:
- VMware’s Official documentation of CPU Scheduler in vSphere 4.1 – http://www.vmware.com/files/pdf/techpaper/VMW_vSphere41_cpu_schedule_ESX.pdf.
- A nice summary of co-scheduling from VMware’s Performance Blog: http://blogs.vmware.com/performance/2008/06/esx-scheduler-s.html
- Description and stats on Ready Time metrics for VI3: http://www.vmware.com/pdf/esx3_ready_time.pdf
- Understanding Virtual Center Performance Statistics: http://communities.vmware.com/docs/DOC-5230.pdf
(Updated 8/25/2010 to include a few additional reference links and corrected summation divided by time slice to get accurate values)
At the risk of beating a dead horse, it’s time to resurrect my Storage Basics series. I’ve recently had some great feedback on the series and figured I should round out a few of the concepts before I wrap it up. I want to cover a topic often discussed amongst virtualization professionals, but one I often find general practitioners and server admins not understanding: storage alignment. Storage alignment, or the lack of alignment, is not a new issue and is not unique to VMware or virtualization in general. However, the effects of misaligned storage can be more greatly felt in terms of reduced performance and strain on a storage system in shared, oversubscribed or high I/O environments. Many others in the virtualization and storage communities have already covered partition alignment (see Duncan Epping, Vaughn Stewart, and most recently Chad Sakac), but I feel it is an important enough topic for me to re-hash as part of this series.
What is Storage Alignment?
Let’s start with a quick overview of what storage alignment means. Quite simply, storage alignment refers to the positioning (starting offset) of the various pieces of a system’s storage components – the physical disk sectors or array’s chunks, the VMware File System (VMFS) in a VMware environment, and the guest file system’s clusters within a partition – in relation to the layer directly under the element in question. A quick graphic often makes quick work of explaining this (I often whiteboard this concept for colleagues and clients):
As you can see, the starting offset of the VMFS partition does not correspond to the physical segmentation of the underlying disks (in this case, the chunks on a SAN – but could be conceptually replaced with the sectors of a single disk). Furthermore, the clusters (or blocks) of the guest VM are not aligned to the VMFS partition nor to the underlying storage. For traditional (physical) systems or VMware RDM’s, the VMFS layer could be abstracted, but the result would be the same – the clusters of a partition would be misaligned to the underlying disk.
What Does it Mean?
Quite simply, misaligned storage (both VMFS partitions and Guest File Systems) can lead to poor performance under certain conditions. How badly performance is impacted depends on the degree of I/O strain your server and storage are under, the caching mechanisms in your environment, and the architecture of your SAN. Again, a visual can help explain how misaligned storage can hurt you. For simplicity let’s leave out the VMFS layer as we consider the following diagram (pardon my hasty Visio visualization):
What we see is that the target data in a tiny 16KB read request spans two 64KB chunks on our storage array. Any read of that piece of data transfers twice as much data to the host’s storage stack as would be minimally necessary. The net effect is an increase in the work the storage array must do – gobbling up IOPS that would otherwise be available for the real work of reading data, reducing throughput on the interface, and messing with cache algorithms and dedupe mechanisms on some arrays. In short, misaligned storage is an efficiency killer. Now add the VMFS layer back in and you’ll see how things get complicated.
If (and we’re talking a big IF here) every bit of data you wanted to read spanned a chunk or sector boundary, you could experience half the expected performance due to misalignment. In reality, depending on your workload and storage technology your performance increase from properly aligning your storage will probably be somewhere between 10-30%.
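To make the boundary-crossing idea concrete, here is a minimal sketch that counts how many array chunks a single read touches. The 64KB chunk size and 16KB read match the example above. The 63-sector (32,256 byte) partition start is my own assumption, chosen because it is the classic legacy MBR default, and the function name is made up for illustration.

```python
def chunks_touched(offset_bytes, length_bytes, chunk_bytes=64 * 1024):
    """Return how many array chunks an I/O of the given offset and length touches."""
    first = offset_bytes // chunk_bytes
    last = (offset_bytes + length_bytes - 1) // chunk_bytes
    return last - first + 1

io_size = 16 * 1024          # the 16KB read from the example above
logical_offset = 32 * 1024   # where the guest thinks the data lives in its partition

# Partition starts on a chunk boundary: the read stays inside one chunk
aligned = chunks_touched(0 + logical_offset, io_size)

# Partition starts 63 sectors in (assumed legacy default): the same read now
# straddles two chunks
misaligned = chunks_touched(63 * 512 + logical_offset, io_size)

print("aligned partition:   ", aligned, "chunk(s)")
print("misaligned partition:", misaligned, "chunk(s)")
```

Not every read on a misaligned partition crosses a boundary, which is why the real-world penalty is usually well under the theoretical worst case.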
Want to dig deeper?
There have been some great resources published on storage alignment over the past few years. Major vendors have all begun pushing information on the problem – here are some of the best that I have found:
Microsoft has a Knowledge Base article (http://support.microsoft.com/kb/929491) that describes the problem and symptoms of misaligned partitions, how to determine if your partition is aligned, and the use of diskpart to create aligned partitions.
Microsoft also has an in-depth article on MSDN, including some performance numbers at http://msdn.microsoft.com/en-us/library/dd758814.aspx. Also check out Jimmy May’s series Partition (Sector) alignment for SQL Server here: http://blogs.msdn.com/b/jimmymay/archive/2008/10/14/disk-partition-alignment-for-sql-server-slide-deck.aspx. One of the best descriptions of the complexities of the problem can be found in Jimmy’s blog series.
VMware has an article here: http://www.vmware.com/pdf/esx3_partition_align.pdf. Be aware that this article is for Virtual Infrastructure 3, not vSphere 4.0. Some of the information is now a bit dated.
NetApp has a few documents to check out: http://media.netapp.com/documents/tr-3428.pdf (VI3), and http://media.netapp.com/documents/tr-3749.pdf (vSphere)
EMC covers alignment in their TechBooks for Clariion, Celerra, and Symmetrix.
Tools to Align Partitions:
Ok – so you’ve bought into this whole partition alignment thing as being a real issue. How do you fix it? Here are some tools:
- MSInfo32.exe, wmic, and dmdiag will show you misaligned partitions on Windows machines (check the Microsoft links above for usage info).
- Diskpart.exe (or diskpar.exe on versions of Windows previous to 2003) creates aligned partitions on Windows systems. Diskpart cannot be used to realign a previously created partition, only to create new correctly aligned partitions.
- MBRScan/MBRAlign from NetApp can report on and realign existing virtual disks on a VMware ESX server. Also a nifty PowerShell script from NetApp to find if your partitions are aligned: http://communities.netapp.com/docs/DOC-6175
- vOptimizer from Vizioncore can report on and realign existing virtual disks.
- GParted can be used to create aligned partitions on both Windows and Linux machines, and to realign some existing partitions.
- VMware vCenter – VMFS datastores created using vCenter are aligned automatically. Note – Guest VMDK’s are not aligned automatically by vCenter – you must manually create aligned partitions on your VMDK’s or use a Guest OS that creates properly aligned partitions (Windows 2008 and later).
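Whichever tool reports the partition’s starting offset, the check itself is just a remainder test. The sketch below is a rough illustration, not a replacement for the vendor tools. The 64KB boundary is an assumed value (check your array’s chunk or stripe size), and the two sample offsets are the well-known 63-sector legacy start and a 1MB start, not values taken from this post.

```python
def is_aligned(starting_offset_bytes, boundary_bytes=64 * 1024):
    """True if the partition's starting offset falls exactly on a boundary."""
    return starting_offset_bytes % boundary_bytes == 0

# Sample starting offsets in bytes (illustrative, not measured)
examples = {
    "63-sector legacy MBR start": 63 * 512,
    "1MB start": 1024 * 1024,
}

for name, offset in examples.items():
    state = "aligned" if is_aligned(offset) else "misaligned"
    print(f"{name}: offset {offset} bytes is {state}")
```

Pass a different `boundary_bytes` if your storage vendor recommends aligning to something other than 64KB.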
Best Practices:
Before I wrap this installment up, here are some best practices for storage alignment in your environment:
- Create aligned partitions in your VMware templates. Do it once, do it right – every machine you deploy from the template will be aligned.
- Use caution with tools like Symantec Ghost. Ghost can take images of aligned partitions and misalign them when laying down on a new system.
- Use caution when performing P2V’s using VMware vCenter Converter – it does not align guest disks on import. You might consider using Converter to perform a P2V of the system disk only, then create new VMDK’s on the converted guest. Use Diskpart, GParted, or another tool to create aligned partitions on the new VMDK’s and finally copy the data over to the newly virtualized server using a tool like Robocopy, RichCopy, or rsync.
- SSD’s are particularly sensitive to misalignment, leading to poor performance and excessive wear.
- Local VMFS volumes created by the ESX installer are not aligned. If you are using an installer-created local VMFS for anything where performance matters, you might consider re-creating it through vCenter.
- Watch out when attaching a data disk from an older VM to a new VM. For example, you are upgrading your SQL servers to Windows 2008 R2 from 2003. You decide to do a side-by-side upgrade, using the detach/attach method. You install (or better yet, deploy from template) a new Windows 2008 R2 VM, detach your databases from the old server, move your SQL data and log virtual disks from your 2003 VM to the new VM and attach the SQL DB’s on the new server. Those old VMDK’s may be misaligned! Consider using Robocopy, RichCopy or rsync to ensure an aligned disk.
- Check your storage vendor’s best practices for your particular environment (OS, workload, SAN, etc.).
- There is some debate on whether or not it is advised to align your OS partitions. There is no clear-cut answer on this as it depends so much on your environment and particular needs. For help in deciding if you should align your Guest OS drives, see the comments in the blogs by Duncan Epping, Vaughn Stewart, and Chad Sakac.
- While working the VMware User Group booth at the Washington, DC Virtualization Forum 2010 I had a user ask me if rules and procedures for alignment on 4k sector disks are different. I forgot to research it until just now, so I honestly don’t know (please comment if you do know!). Check with your storage vendor if this is an issue for you.
- Finally, you can’t realign partitions using tools like mbralign or vOptimizer in ESXi. Aaron Delp explains the problem here: http://blog.aarondelp.com/2010/06/my-1-issue-with-vmware-esxi-today.html.
I hope this is helpful for you in understanding the problem of storage alignment and how it can impact your environment. Comments or questions are welcomed!
Most of what I covered in Storage Basics Parts 1 through 5 was at a very elementary level. The math I used to do IOPS calculations, for example, is only true under very certain conditions. RAID controllers implement caching and other techniques that skew the simple math that I provided. I mentioned that the type of interface that you ought to use on your storage array should not be randomly chosen. In fact, choosing the right array with the appropriate components and characteristics can only be done when you enlighten your decision with a characterization of workloads it will be running.
The character of your storage workload can be broken down into several traits – random vs. sequential I/O, large vs. small I/O request size, read vs. write ratio, and degree of parallelism. The traits of your particular workload dictate how it interacts with the components of your storage system and ultimately determine the performance of your environment under a given configuration. There is an excellent whitepaper available from VMware entitled “Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server” that is authoritative on this subject. If you want to get down and dirty with the topic, it’s a good read. I’m aiming for something a bit less academic. With that said, let’s break down workload characterization a bit so as to better understand how it will impact your real-world systems.
Random vs. Sequential Access
In Part II of this series we looked at the formula for calculating IOPS capabilities for a single disk. That formula goes something like this:
IOPS = 1000/(Seek Latency + Rotational Latency)
You’ll recall that we divide into 1000 to remove milliseconds from the equation, leaving (Seek Latency + Rotational Latency) as the important part of the equation. Rotational latency is based on the spindle speed of the disk – 7.2k, 10k, or 15k RPM for standard server or SAN disks. If we consider the same Seagate Cheetah 15k drive from Part II, we see that rotational latency is 2.0ms. The only way to change rotational latency is to buy faster (or slower) disks. This essentially leaves seek latency as the only variable that we can “adjust”. You’ll also recall that seek latency was the larger of the latencies (3.4ms for read seeks, and 3.9ms for write seeks) and counts more against IOPS capability than does rotational latency. Seeking is the most expensive operation in terms of performance.
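Plugging the Cheetah 15k numbers quoted above into the formula gives a feel for how much the seek figure dominates. This is just a minimal sketch of the simple single-disk math, which (as noted at the top of this post) ignores caching and other controller tricks; the function name is my own.

```python
def disk_iops(seek_ms, rotational_ms):
    """Simple single-disk estimate: IOPS = 1000 / (seek latency + rotational latency)."""
    return 1000 / (seek_ms + rotational_ms)

# Latencies quoted above for the Seagate Cheetah 15k drive
read_iops = disk_iops(seek_ms=3.4, rotational_ms=2.0)
write_iops = disk_iops(seek_ms=3.9, rotational_ms=2.0)

print(f"read:  {read_iops:.0f} IOPS")
print(f"write: {write_iops:.0f} IOPS")
```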
It is next to impossible to adjust seek latency on a disk because it is determined by the speed of the servos that move the heads across the platter. We can, however, send workloads with different degrees of randomness to the platter. The more sequential a workload is, the less time that will be spent in seek operations. A high degree of sequentiality ultimately leads to faster disk response and higher throughput rates. Sequential workloads may be candidates for slower disks or RAID levels. Conversely, workloads that are highly randomized ought to be placed on fast spindles in fast RAID configurations.
You’ll notice that I said it was next to impossible to adjust seek latency on a disk. While not common, some storage administrators employ a method known as ‘short stroking’ when configuring storage. Short stroking uses less than the full capacity of the disk by placing data at the beginning of the disk where access is faster, and not placing data at the end of the disk where seek times are greater. This results in a smaller area on the disk platter for heads to travel over, effectively reducing seek time at the expense of capacity.
While not applicable to all workloads, storage arrays, or file systems, fragmentation can cause higher degrees of randomness leading to degraded performance. This is the prime reason some vendors recommend that you regularly defragment your file system. It should be noted that a VMware VMFS file system is resilient against the forces of fragmentation. Whereas a Windows NTFS partition may hold hundreds, thousands or tens of thousands of files of different sizes, accessed randomly throughout the system’s cycle of operations, a VMFS datastore typically holds no more than a couple hundred files. Additionally, most of the files on a VMFS datastore are created contiguously if you are using thick-provisioned virtual disks (VMDK). Thin-provisioned VMDK’s are slightly more susceptible to fragmentation, but do not typically suffer a high enough degree of fragmentation to register a performance impact. See this VMware whitepaper for more on VMFS fragmentation: Performance Study of VMware vStorage Thin Provisioning.
Examples of sequential workloads include backup-to-disk operations and the writing of SQL transaction log files. Random workloads may include collective reads from Exchange Information Stores or OLTP database access. Workloads are often a mix of random and sequential access, as is the case with most VMware vSphere implementations. The degree to which they are random or sequential dictates the type of tuning you should perform to obtain the best possible performance for your environment.
I/O Request Size
I/O request size is another important factor in workload characterization. Generally speaking, larger reads/writes are more efficient than smaller I/O to a certain point. The use of larger I/O requests (64KB instead of 2KB, for example) can result in faster throughput and reduced processor time. Most workloads do not allow you to adjust your I/O request size. However, knowing your I/O request size can help with appropriate configuration of certain parameters such as array stripe size and file system cluster size. Check with your storage vendor for more information as it pertains to your specific configuration.
If you are in a Windows shop, you can use perfmon counters such as Avg. Disk Bytes/Read to determine average I/O size. If you are running a VMware-virtualized workload, you can take advantage of a great tool – vscsiStats – to identify your I/O request size. More on vscsiStats later in this article.
Read vs. Write
Every workload will display a differing amount of read and write activity. Sometimes a specific workload, say Microsoft Exchange, can be broken down into sub-workloads for logging (write-heavy) and reading the database (read-heavy). Understanding the read-to-write ratio may help with designing the underlying storage system. For example, a write-heavy workload may perform better on a RAID10 LUN than a RAID5 array due to the write penalty associated with RAID5. The ratio of read:write may also dictate caching strategies. The read:write ratio, when combined with a degree of randomness measure, can be quite useful in architecting your storage strategy for a given application or workload.
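A rough way to see why the read:write ratio matters is to estimate the disk-level IOPS an array has to deliver for a given host workload. The sketch below assumes the commonly quoted write penalties (2 for RAID10, 4 for RAID5, 6 for RAID6); those figures, the function name, and the sample workload are my own illustration, and real arrays with write caching will behave differently.

```python
# Commonly quoted write penalties (an assumption here, not figures from this article)
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}

def backend_iops(frontend_iops, write_fraction, raid_level):
    """Estimate the disk-level IOPS an array must deliver for a host workload."""
    read_fraction = 1 - write_fraction
    penalty = WRITE_PENALTY[raid_level]
    return frontend_iops * (read_fraction + write_fraction * penalty)

# Hypothetical write-heavy workload: 1000 host IOPS, 70% writes
for level in ("RAID10", "RAID5"):
    print(level, round(backend_iops(1000, 0.7, level)), "back-end IOPS")
```

The same host workload needs noticeably more spindle capability on RAID5 than on RAID10, and the gap grows as the write fraction rises.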
Parallelism/Outstanding I/O’s
Some workloads are capable of performing multi-threaded I/O. These types of workloads can place a higher amount of stress on the storage system and should be understood when designing storage, both in terms of IOPS and throughput. Multipathing may help with multi-threaded I/O workloads. A typical VMware vSphere environment is a good example of a workload capable of queuing up outstanding I/O.
Measuring the Characteristics of Your Workload
So how do we actually characterize storage workloads? Start with the application vendor – many have published studies that can shed light on specific storage workloads in a standard implementation. If you are interested in measuring your own for planning/architecture reasons, or performance troubleshooting reasons, read on…. There are several tools to measure storage characteristics, depending on your operating system and storage environment. Standard OS performance counters, such as Windows Performance Monitor (perfmon) can reveal some of the characteristics. Array based tools such as NaviAnalyzer on EMC gear can also reveal statistics on the storage end of the equation.
One of the most exciting tools for storage workload characterization comes from VMware in the form of vscsiStats. vscsiStats is a tool that has been included in VMware ESX server since version 3.5. Because all I/O commands pass through the Virtual Machine Monitor (VMM), the hypervisor can inspect and report on the I/O characteristics of a particular workload, down to a unique VM running on an ESX host. There is a ton of great information on using vscsiStats, so I won’t re-hash it all here. I recommend starting with Using vscsiStats for Storage Performance Analysis as it contains an overview and usage instructions. If you want to dig a bit deeper into vscsiStats, read both Storage Workload Characterization and Consolidation in Virtualized Environments and vscsiStats: Fast and Easy Disk Workload Characterization on VMware ESX Server.
vscsiStats can generate an enormous amount of data which is best viewed as a histogram. If you’re a glutton for punishment, the data can be reviewed manually on the COS. To extract vscsiStats output data, use the -c option to export to a .csv file. From there you can analyze the data and create histograms using Excel. Paul Dunn has a nifty Excel macro for analyzing and reporting on vscsiStats output here. Gabrie van Zanten has more detailed instructions for using Paul’s macro here. Here are a couple of histogram examples that I just generated from a test VM.
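If you would rather not use Excel, the bucketing behind a histogram like this is easy to sketch. The example below does not parse real vscsiStats output (I am not assuming anything about its CSV layout); it simply buckets a list of I/O sizes into power-of-two bins. The bucket edges, function names, and sample sizes are all made up for illustration.

```python
from collections import Counter

def bucket_label(size_bytes, edges=(512, 1024, 2048, 4096, 8192, 16384, 32768, 65536)):
    """Return the smallest bucket edge that holds the I/O, or '>65536' beyond the last edge."""
    for edge in edges:
        if size_bytes <= edge:
            return str(edge)
    return f">{edges[-1]}"

def io_size_histogram(sizes):
    """Count how many I/Os fall into each size bucket."""
    return Counter(bucket_label(s) for s in sizes)

# Made-up I/O sizes in bytes, purely for illustration (not measured data)
sample = [4096, 4096, 8192, 65536, 4096, 512, 131072, 8192]
hist = io_size_histogram(sample)

for label, count in hist.items():
    print(f"{label:>7} | {'#' * count}")
```

The same approach works for latency or seek distance once you have the raw values in a list.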
vscsiStats is only included with ESX, not ESXi. However, Scott Drummond was kind enough to post a download of vscsiStats for ESXi on his Virtual Pivot blog: http://vpivot.com/2009/10/21/vscsistats-for-esxi/. Using vscsiStats on ESXi requires dropping into Tech Support Mode (unsupported) and enabling ESXi for scp to transfer the binary to the ESXi server.
VMware esxtop can display some information but is limited in scope and does not currently support NFS. A community-supported Python script called nfstop can parse vscsiStats data and display esxtop-like data per VM on screen.
Experiment
If you are interested in generating workloads with various characteristics, check out Iometer and Bonnie++. These tools will allow you to generate I/O that you can monitor with the tools I covered in this article.
Put it to Use
If you are provisioning a new workload or expanding an existing, invest some time in understanding your storage workload characteristics and convey those characteristics to your storage team. A request for storage that includes the workload characteristics I discussed here, as well as expected IOPS requirements, will go much further in ensuring performance for your applications – physical or virtual – than simply asking for a certain capacity of disk.
I posted an article in December on how the SVGA driver included with VMware Tools caused the guest VM to freeze. I referenced VMware’s KB Article 1011709, which directed you to not use the SVGA drivers included with VMware Tools. KB1011709 has since been updated (as of February 25, 2010) to indicate that the VMware Tools package included with ESX 4.0 Update 1 includes a new WDDM driver that is fully supported. If you have updated to Update 1, you should upgrade VMware Tools to take advantage of the new driver.
If you followed KB1011709’s original advice and did a custom install of VMware Tools that left out the SVGA driver, you may have to re-install VMware Tools before the new driver is available. Once you get VMware Tools upgraded, the new driver can be found in the guest VM at C:\Program Files\Common Files\VMware\Drivers\wddm_video. These drivers are not automatically installed, so you’ll have to update your guest’s video adapter driver in Device Manager.
It’s a bummer that the WDDM SVGA drivers are not automatically installed. You could probably copy these drivers to other VM’s and use Windows Device Manager to replace the standard driver with the newer WDDM driver without having to do the uninstall, reboot, reinstall of VMware tools on all of your VM’s.
Just as I was about to publish this, I saw a TweetDeck pop-up from @jasonboche saying that he had published pretty much the same update here: http://www.boche.net/blog/index.php/2010/03/28/windows-2008-r2-and-windows-7-on-vsphere/. Not only does he have pretty pictures to go with his post, but also points out that VMware Tools installs/upgrades executed with VMware Update Manager (VUM) will not install the upgraded SVGA driver. He also recommends updating templates to include the upgraded drivers. Great points, Jason.









