At the risk of beating a dead horse, it’s time to resurrect my Storage Basics series. I’ve recently had some great feedback on the series and figured I should round out a few of the concepts before I wrap it up. I want to cover a topic often discussed amongst virtualization professionals, but one I often find general practitioners and server admins not understanding: storage alignment. Storage alignment, or the lack of alignment, is not a new issue and is not unique to VMware or virtualization in general. However, the effects of misaligned storage can be more greatly felt in terms of reduced performance and strain on a storage system in shared, oversubscribed or high I/O environments. Many others in the virtualization and storage communities have already covered partition alignment (see Duncan Epping, Vaughn Stewart, and most recently Chad Sakac), but I feel it is an important enough topic for me to re-hash as part of this series.
What is Storage Alignment?
Let’s start with a quick overview of what storage alignment means. Quite simply, storage alignment refers to the positioning (starting offset) of the various pieces of a system’s storage components – the physical disk sectors or array’s chunks, the VMware File System (VMFS) in a VMware environment, and the guest file system’s clusters within a partition – in relation to the layer directly under the element in question. A quick graphic often makes short work of explaining this (I often whiteboard this concept for colleagues and clients):
As you can see, the starting offset of the VMFS partition does not correspond to the physical segmentation of the underlying disks (in this case, the chunks on a SAN – but could be conceptually replaced with the sectors of a single disk). Furthermore, the clusters (or blocks) of the guest VM are not aligned to the VMFS partition nor to the underlying storage. For traditional (physical) systems or VMware RDM’s, the VMFS layer could be abstracted, but the result would be the same – the clusters of a partition would be misaligned to the underlying disk.
What Does it Mean?
Quite simply, misaligned storage (both VMFS partitions and Guest File Systems) can lead to poor performance under certain conditions. How badly performance is impacted depends on the degree of I/O strain your server and storage are under, the caching mechanisms in your environment, and the architecture of your SAN. Again, a visual can help explain how misaligned storage can hurt you. For simplicity let’s leave out the VMFS layer as we consider the following diagram (pardon my hasty Visio visualization):
What we see is that the target data in a tiny 16KB read request spans two 64KB chunks on our storage array. Any read of that piece of data results in twice as much data being transferred to the host’s storage stack as would minimally be necessary. The net effect is an increase in the work the storage array must do – gobbling up IOPS that would otherwise be available for the real work of reading data, reducing throughput on the interface, and messing with cache algorithms and dedupe mechanisms on some arrays. In short, misaligned storage is an efficiency killer. Now add the VMFS layer back in and you’ll see how things get complicated.
If (and we’re talking a big IF here) every bit of data you wanted to read spanned a chunk or sector boundary, you could experience half the expected performance due to misalignment. In reality, depending on your workload and storage technology your performance increase from properly aligning your storage will probably be somewhere between 10-30%.
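To make the boundary-crossing problem concrete, here is a minimal Python sketch (my own illustration, not taken from any vendor tool) that counts how many array chunks a single read touches. The function name chunks_touched and the two example offsets are assumptions, chosen to match the 16KB read and 64KB chunk size used above.

```python
def chunks_touched(offset_bytes, length_bytes, chunk_bytes):
    """Count the array chunks touched by an I/O starting at offset_bytes."""
    if offset_bytes < 0 or length_bytes <= 0 or chunk_bytes <= 0:
        raise ValueError("offset must be >= 0; length and chunk size must be > 0")
    first_chunk = offset_bytes // chunk_bytes
    last_chunk = (offset_bytes + length_bytes - 1) // chunk_bytes
    return last_chunk - first_chunk + 1


KB = 1024

# A 16KB read that ends exactly on a 64KB chunk boundary stays in one chunk.
print(chunks_touched(48 * KB, 16 * KB, 64 * KB))  # 1

# The same read shifted by 8KB straddles the boundary and touches two chunks.
print(chunks_touched(56 * KB, 16 * KB, 64 * KB))  # 2
```

Only the reads that land near a boundary pay the penalty, which is why the real-world impact depends so heavily on workload.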
Want to dig deeper?
There have been some great resources published on storage alignment over the past few years. Major vendors have all begun pushing information on the problem – here are some of the best that I have found:
Microsoft has a Knowledge Base article (http://support.microsoft.com/kb/929491) that describes the problem and symptoms of misaligned partitions, how to determine if your partition is aligned, and the use of diskpart to create aligned partitions.
Microsoft also has an in-depth article on MSDN, including some performance numbers at http://msdn.microsoft.com/en-us/library/dd758814.aspx. Also check out Jimmy May’s series Partition (Sector) alignment for SQL Server here: http://blogs.msdn.com/b/jimmymay/archive/2008/10/14/disk-partition-alignment-for-sql-server-slide-deck.aspx. One of the best descriptions of the complexities of the problem can be found in Jimmy’s blog series.
VMware has an article here: http://www.vmware.com/pdf/esx3_partition_align.pdf. Be aware that this article is for Virtual Infrastructure 3, not vSphere 4.0. Some of the information is now a bit dated.
Netapp has a few documents to check out: http://media.netapp.com/documents/tr-3428.pdf (VI3), and http://media.netapp.com/documents/tr-3749.pdf (vSphere)
EMC covers alignment in their TechBooks for Clariion, Celerra, and Symmetrix.
Tools to Align Partitions:
Ok – so you’ve bought into this whole partition alignment thing as being a real issue. How do you fix it? Here are some tools:
- MSInfo32.exe, wmic, and dmdiag will show you misaligned partitions on Windows machines (check the Microsoft links above for usage info).
- Diskpart.exe (or diskpar.exe on versions of Windows previous to 2003) creates aligned partitions on Windows systems. Diskpart cannot be used to realign a previously created partition, only to create new correctly aligned partitions.
- MBRScan/MBRAlign from NetApp can report on and realign existing virtual disks on a VMware ESX server. NetApp also has a nifty PowerShell script to find out whether your partitions are aligned: http://communities.netapp.com/docs/DOC-6175
- vOptimizer from Vizioncore can report on and realign existing virtual disks.
- GParted can be used to create aligned partitions on both Windows and Linux machines, and to realign some existing partitions.
- VMware vCenter – VMFS datastores created using vCenter are aligned automatically. Note – Guest VMDK’s are not aligned automatically by vCenter – you must manually create aligned partitions on your VMDK’s or use a Guest OS that creates properly aligned partitions (Windows 2008 and later).
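Whatever tool reports the number, the check itself is simple arithmetic: a partition is aligned when its starting offset divides evenly by the boundary you care about. Here is a rough Python sketch of that check. The 64KB boundary is an assumption borrowed from the chunk size in the earlier example (your array may differ), and the two offsets are the commonly cited defaults of 63 sectors for older Windows versions and 1MB for Windows 2008 and later.

```python
SECTOR_BYTES = 512  # assumed traditional sector size


def is_aligned(start_offset_bytes, boundary_bytes):
    """Return True when the partition's starting offset sits on a boundary."""
    if boundary_bytes <= 0:
        raise ValueError("boundary must be greater than zero")
    return start_offset_bytes % boundary_bytes == 0


boundary = 64 * 1024                  # assumed array chunk size
legacy_offset = 63 * SECTOR_BYTES     # 32,256 bytes, the old 63-sector default
modern_offset = 1024 * 1024           # 1,048,576 bytes, the 1MB default

print(is_aligned(legacy_offset, boundary))  # False
print(is_aligned(modern_offset, boundary))  # True
```

Check your storage vendor's documentation for the boundary that actually applies to your array before trusting a result like this.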
Best Practices:
Before I wrap this installment up, here are some best practices for storage alignment in your environment:
- Create aligned partitions in your VMware templates. Do it once, do it right – every machine you deploy from the template will be aligned.
- Use caution with tools like Symantec Ghost. Ghost can take images of aligned partitions and misalign them when laying down on a new system.
- Use caution when performing P2V’s using VMware vCenter Converter – it does not align guest disks on import. You might consider using Converter to perform a P2V of the system disk only, then create new VMDK’s on the converted guest. Use Diskpart, gparted, or another tool to create aligned partitions on the new VMDK’s and finally copy the data over to the newly virtualized server using a tool like Robocopy, RichCopy, or rsync.
- SSD’s are particularly sensitive to misalignment, leading to poor performance and excessive wear.
- Local VMFS volumes created by the ESX installer are not aligned. If you are using an installer-created local VMFS for anything where performance matters, you might consider re-creating it through vCenter.
- Watch out when attaching a data disk from an older VM to a new VM. For example, you are upgrading your SQL servers to Windows 2008 R2 from 2003. You decide to do a side-by-side upgrade, using the detach/attach method. You install (or better yet, deploy from template) a new Windows 2008 R2 VM, detach your databases from the old server, move your SQL data and log virtual disks from your 2003 VM to the new VM and attach the SQL DB’s on the new server. Those old VMDK’s may be misaligned! Consider using Robocopy, RichCopy or rsync to ensure an aligned disk.
- Check your storage vendor’s best practices for your particular environment (OS, workload, SAN, etc.).
- There is some debate on whether or not it is advised to align your OS partitions. There is no clear-cut answer on this as it depends so much on your environment and particular needs. For help in deciding if you should align your Guest OS drives, see the comments in the blogs by Duncan Epping, Vaughn Stewart, and Chad Sakac.
- While working the VMware User Group booth at the Washington, DC Virtualization Forum 2010 I had a user ask me if rules and procedures for alignment on 4k sector disks are different. I forgot to research it until just now, so I honestly don’t know (please comment if you do know!). Check with your storage vendor if this is an issue for you.
- Finally, you can’t realign partitions using tools like mbralign or vOptimizer in ESXi. Aaron Delp explains the problem here: http://blog.aarondelp.com/2010/06/my-1-issue-with-vmware-esxi-today.html.
I hope this is helpful for you in understanding the problem of storage alignment and how it can impact your environment. Comments or questions are welcomed!
My Storage Basics series has been neglected for some time (sick kids, snow storms, VMware Upgrades, SAN implementations and some Cisco switch upgrades took all my free time), so let’s jump right in to Part V – Cache, Controllers, and Coalescing. Between the alliteration and fancy words, it might seem like I am about to tell a tale of international espionage. Unfortunately, my introductory treatment of these aspects of a storage system will probably not keep you on the edge of your seat – but I’ll try to keep it interesting.
Throughout this series, we’ve been working our way from the basic building block of any storage system – the disks – outwards towards the brains of the operation – the controller. You’ll recall that in Part II I introduced IOPS and the math that goes into calculating the IOPS capacity of a disk array. In Part III we considered a RAID implementation’s impact on performance and availability. And most recently in Part IV we looked at the common interface types when dealing with storage arrays. If we put the previous parts together we still don’t have a functional storage system. The missing piece is the controller. Simply put, the storage controller is the hardware adapter between the disks and the servers that connect to the storage. The controller has a specific ‘interface’ type, is responsible for RAID operations, and handles advanced storage functionality. A controller can be as simple as the Dell PERC or HP Smart Array add-in card on your server, or as complex as the Storage Processor in an enterprise class Storage Area Network (SAN) such as an EMC CLARiiON or NetApp FAS.
Controllers
As we look at controllers and the advanced features they provide we’ll see that some of the earlier performance equations start to break down. The simplest controllers take disk read/write commands from the operating system and send commands down to the disk(s) attached to be read or written. This gets data onto the disk, but often does not do so in an efficient or reliable manner. RAID-capable controllers take on the added responsibility of configuring disks in the desired RAID level, calculating & writing parity data, and writing the data in disk-spanning stripes or mirrors depending on the RAID level.
Cache
To increase performance and improve reliability, storage vendors implement a caching system on their controllers. Cache is memory that acts as a buffer for disk I/O, and is usually battery-backed to prevent data loss in the event of a power failure. Because RAM is so much faster than spinning magnetic disks, cache can improve performance by orders of magnitude. Cache can operate on both reads and writes to disk.
When dealing with writes, the controller cache is typically used in one of two ways: write-through or write-back. In write-through mode, data is written to volatile cache and then to disk, and only acknowledged as written once the data resides on the non-volatile disk. Write-back mode allows the controller to acknowledge the data as having been written as soon as it is held in cache. This allows the cache to buffer writes quickly and then write them to the slower disk when the disk has cycles to accept I/O. The greater your cache size, the more data that can be buffered, ultimately resulting in better performance as measured in both IOPS and throughput. This graph from my article on troubleshooting write performance on an IBM DS3300 iSCSI array shows how throughput increased and latency decreased when enabling write cache. The extent to which cache increases performance is highly dependent on the workload characteristics (I/O size, randomness, and ratio of reads:writes).
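The difference between the two write modes comes down to when the host gets its acknowledgement. Here is a toy Python model of that behavior. It is a sketch only: the ToyController class and both latency figures are made up for illustration and do not describe any real controller.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Toy model of a controller write cache; all latencies are made-up values."""
    mode: str                       # "write-through" or "write-back"
    cache_latency_us: int = 100     # assumed time to land a write in cache
    disk_latency_us: int = 5000     # assumed time to land a write on disk
    dirty_blocks: list = field(default_factory=list)

    def write(self, block):
        """Return the microseconds that pass before the host is acknowledged."""
        if self.mode == "write-through":
            # Acknowledge only once the data resides on non-volatile disk.
            return self.cache_latency_us + self.disk_latency_us
        if self.mode == "write-back":
            # Acknowledge as soon as the data is held in cache; de-stage later.
            self.dirty_blocks.append(block)
            return self.cache_latency_us
        raise ValueError(f"unknown cache mode: {self.mode}")

    def destage(self):
        """Flush buffered writes to disk and return how many were flushed."""
        flushed = len(self.dirty_blocks)
        self.dirty_blocks.clear()
        return flushed


write_through = ToyController("write-through")
write_back = ToyController("write-back")
print(write_through.write(42))  # 5100
print(write_back.write(42))     # 100
print(write_back.destage())     # 1
```

The write-back controller still has to pay the disk latency eventually; it just does so after the host has moved on, which is why battery backing matters.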
Read-cache acts as a buffer for reads in a couple ways. First, some controllers attempt to ‘read-ahead’, anticipating future read requests from the operating system and buffering what it expects to be the next blocks of desired data. Some entry-level controllers simply buffer the next physical chunk of data and fill cache memory with it, while more advanced controllers may attempt to predict the right block of data based on previous requests (you just asked for 3 blocks in a row, I’m guessing you’ll come asking for the 4th next so I’ll just buffer it in fast cache for you now). Secondly, read cache holds data that has been previously read, regardless of any pre-fetching the controller may have done. This allows for much faster subsequent access of the same data because it is held in the faster cache, eliminating the need for the controller to go to disk for the data again. Just like with write cache, the extent to which cache increases performance is highly dependent on the workload characteristics.
A given storage array controller only has so much cache to work with. A Dell PERC5/E, for example, has 256MB of cache that can be used for both read and write. While this may be enough for a direct-attached storage array, SAN’s serving multiple systems demand more cache. In contrast, an EMC CLARiiON CX4-960 has 32GB. Some storage vendors, such as NetApp, are getting creative with cache. NetApp’s Performance Acceleration Module (PAM) is an add-in card that provides up to a whopping 512GB of Layer 2 cache to the storage system.
Caching mechanisms can dramatically influence performance under the right conditions. With healthy cache in place, IOPS calculations become skewed. However, cache can be exhausted or may not hold the data you are interested in. If cache is insufficient to satisfy read requests, or has reached its high-water mark for writes, performance can drop off. When cache is exhausted, the backing disk must be able to satisfy the I/O workload or performance will be unacceptable. This is where the IOPS calculations kick in, and where having the right disk type and configuration really matters.
Queuing & Coalescing
Advanced storage systems introduce additional features to reduce I/O contention and improve cache utilization. I won’t go into all of the features here because they vary by storage vendor. However, I will point out two common techniques – queuing and coalescing.
Queuing refers to the ability of a storage system to queue storage commands for later processing. Queuing can take place at various points in your storage environment, from the HBA to the storage processor/controller. A little queuing may be OK depending on your workload, but too many outstanding I/Os can negatively impact performance (this is measured in latency). Queue depths can be adjusted on many components in your storage and VMware landscape, but check with your vendor’s support group before you make changes to these settings.
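One way to see why a deep queue shows up as latency is Little's Law, which ties the average number of outstanding I/Os to throughput and response time. The Python sketch below is my own back-of-the-napkin illustration; the function name and the sample numbers are assumptions, and it only describes steady-state averages.

```python
def average_latency_ms(outstanding_ios, iops):
    """Little's Law: average latency = outstanding I/Os / throughput."""
    if iops <= 0:
        raise ValueError("iops must be greater than zero")
    return outstanding_ios * 1000.0 / iops


# Hypothetical array sustaining 2000 IOPS.
print(average_latency_ms(4, 2000))   # 2.0
print(average_latency_ms(32, 2000))  # 16.0
```

If the array cannot go any faster, every extra outstanding I/O simply waits longer in line.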
Coalescing is performed by some storage systems to modify the character of the workload. To better understand coalescing, picture a bunch of random write activity. Without cache in place, the disk heads will be bouncing all over the platters trying to get the data on to disk. A little write cache will allow the storage array to acknowledge the write for the OS, but the array still needs to de-stage that data from cache to disk quickly to prevent cache exhaustion. The back-end disks will still be doing the chicken dance, bouncing around trying to write the random workload…. Now picture an intelligent system that re-orders the random writes that are held in cache and writes them to the disk in nice sequential stripes. The disk heads will be less prone to jumping around the platter and the behavior will start to look more like a nice waltz than the funky chicken dance. Coalescing is used for writes, not reads, so not all workloads benefit.
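Here is a rough Python sketch of the idea: take the random writes sitting in cache, sort them by block address, and merge neighbors into longer sequential runs. Real arrays are far more sophisticated and vendor-specific; the coalesce function and the (start_block, block_count) tuple layout are assumptions made purely for illustration.

```python
def coalesce(writes):
    """Merge cached writes, given as (start_block, block_count), into sorted runs."""
    merged = []
    for start, count in sorted(writes):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This write touches or overlaps the previous run, so extend it.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged


# Five random writes arrive in cache in no particular order.
random_writes = [(108, 4), (0, 8), (100, 8), (8, 8), (500, 8)]
print(coalesce(random_writes))  # [(0, 16), (100, 12), (500, 8)]
```

Five scattered writes become three sequential ones, which means fewer seeks for the back-end disks.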
Wrap-up
With this article on Controllers, Cache, and Coalescing we’ll end our look at the basic building blocks of a storage array. Before we end the Storage Basics series I am planning a few more articles on Storage Workload Characterization (which has been mentioned, but not directly addressed in this and previous articles), Identifying a Stressed Storage System, and Best Practices for Storage Performance in a VMware Environment.
If you are interested in more reading on Controllers, Cache, and Coalescing, I recommend the following:
Additional Reading:
- Impact of cache on the performance of the HP StorageWorks XP12000 Disk Array white paper
- Performance impact of controller cache: SQL Server read only workloads
- IOps? – Dig into the article’s comments for some great dialog between some people who really know their stuff!
- Storage Performance for SQL Server
- Storage Caching 101 – Chuck Hollis (EMC)
- Improving Performance with Interrupt Coalescing for Virtual Machine Disk IO in VMware ESX Server
This is the third in a multi-part series on storage basics. I’ve had some good feedback from folks in the SMB space saying that the first couple posts in this series have been beneficial, so we’ll be sticking with some basic concepts for another post or two before we dive into some nitty-gritty details and practical applications of these concepts in a VMware environment. In the second post of this series I introduced the concept of IOPS and explained how the physical characteristics of a hard disk drive determine the theoretical IOPS capability of a disk. I then noted that you can aggregate disks to achieve a greater number of IOPS for a particular storage environment. Today, we will look at just how you combine multiple disks and the performance impact of doing so. Remember that we are keeping this simple; the concepts I present here may not apply to that fancy new SAN you just purchased with your end-of-year money or the cheap little SATA controller on your desktop’s motherboard (not that there’s anything wrong with it) – we’re more in the middle ground of direct attached storage (DAS) as we firm up concepts.
Enterprise servers and storage systems have the ability to combine multiple disks into a group using Redundant Array of Independent Disks (RAID) technology. We’ll assume a hardware RAID controller is responsible for configuring and driving storage IO to the connected disks. RAID controllers typically have battery-backed cache (we’ll talk cache in a future post), an interconnect where the drives plug in, such as SCSI or SAS (we’ll talk about these too in a future post), and hold the configuration of the RAID set including stripe size and RAID level. The controller also does the basic work of reading and writing on the RAID set – mirroring, striping, and parity calculations. There are several different RAID levels – rather than rehash the details of them, read the Wikipedia entry on RAID and then come back here….
Ok, great. So you now know that RAID is implemented to increase performance through the aggregation of multiple disks, and to increase reliability through mirroring and parity. Now let’s consider the performance implications of some basic RAID levels. As with many things in the IT industry, there are trade-offs: security vs. usability, brains vs. brawn, and now performance vs. reliability. As we increase reliability in a RAID array through mirroring and parity, performance can be impacted. This is where the more disks = more IOPS bit starts to fall apart. The exact impact depends on the RAID type. Here are some examples of how RAID impacts the maximum theoretical IOPS using the most common RAID levels, where:
I = Total IOPS for Array (note that I show Read and Write separately)
i = IOPS per disk in array (based on spindle speed averages from Part II: IOPS)
n = Number of disks in array
r = Percentage of read IOPS, expressed as a decimal such as 0.6 (calculated from the Average Disk Reads/Sec divided by total Average Disk Transfers/Sec in your Windows Perfmon)
w = Percentage of write IOPS, expressed as a decimal such as 0.4 (calculated from the Average Disk Writes/Sec divided by total Average Disk Transfers/Sec in your Windows Perfmon)
RAID0 (striping, no redundancy)
This is basic aggregation with no redundancy. A single drive error/failure could render your data useless and as such it is not recommended for production use. It does allow for some simple math:
I = n*i
Because there is no mirroring or parity overhead, theoretical maximum Read and Write IOPS are the same.
RAID 1 & RAID10 (mirroring technologies):
Because data is mirrored to multiple disks, any copy can service a read, but every write must be committed to both sides of the mirror:
Read I = n*i
For example, if we have six 15k disks in a RAID10 config, we should expect the theoretical maximum read IOPS for our array to be 6*180 = 1080 IOPS
Write I = (n*i)/2
RAID5 (striping with a single parity disk)
Read I = (n-1)*i
Example: Five 15k disks in a RAID 5 (4 + 1) will yield a maximum IOPS of (5-1)*180 = 720 READ IOPS. We subtract 1 because one of the disks holds parity bits, not data.
Write I = (n*i)/4
Example: Five disks in a RAID 5 (4 + 1) will yield a maximum IOPS of (5*180)/4 = 225 WRITE IOPS
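If you would rather let a script do the arithmetic, the formulas above translate into a few lines of Python. This is just a sketch of the simple math in this post; the function names and the RAID level strings are my own labels, and the 180 IOPS per disk figure is the 15k value used in the examples.

```python
def max_read_iops(level, n, i):
    """Theoretical maximum read IOPS for n disks at i IOPS each."""
    if level in ("RAID0", "RAID1", "RAID10"):
        return n * i
    if level == "RAID5":
        return (n - 1) * i
    raise ValueError(f"unsupported RAID level: {level}")


def max_write_iops(level, n, i):
    """Theoretical maximum write IOPS for n disks at i IOPS each."""
    if level == "RAID0":
        return n * i
    if level in ("RAID1", "RAID10"):
        return n * i / 2
    if level == "RAID5":
        return n * i / 4
    raise ValueError(f"unsupported RAID level: {level}")


print(max_read_iops("RAID10", 6, 180))   # 1080
print(max_write_iops("RAID10", 6, 180))  # 540.0
print(max_read_iops("RAID5", 5, 180))    # 720
print(max_write_iops("RAID5", 5, 180))   # 225.0
```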
Again, these formulas are very basic and have little practical value. Furthermore, it is seldom that you will find a system that is doing only reads or only writes. More often, as is the case with typical VMware environments, reads and writes are mixed. An understanding of your workload is key to accurately sizing your storage environment for performance. One of the workload characteristics (we’ll explore some more in the future) that you should consider in your sizing is the percentage of read IOPS vs. the percentage of write IOPS. A formula like this gets you close if you want to do the math for a mixed read/write environment in a RAID5 set:
I = (n*i)/(r + 4*w)
Example: a 60% read/40% write workload with five 15k disks in a RAID5 would provide (5*180)/(.6+4*.4) = 409 IOPS.
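The mixed-workload formula is easy to play with in Python. In the sketch below I have turned the hard-coded 4 into a write_penalty parameter so you can try other values; that generalization is my own assumption and goes beyond the RAID5 formula above.

```python
def mixed_iops(n, i, read_fraction, write_fraction, write_penalty=4):
    """Theoretical IOPS for a mixed read/write workload on n disks at i IOPS each."""
    return (n * i) / (read_fraction + write_penalty * write_fraction)


# Five 15k disks in RAID5 with a 60% read / 40% write workload.
print(round(mixed_iops(5, 180, 0.6, 0.4)))  # 409
```

Try sliding the read/write ratio around and watch how quickly a write-heavy workload eats into the number.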
The previous examples have all been from the perspective of the storage system. If we take a look at this from the server/OS/application side, something interesting shows up. Let’s say you fired up Windows perfmon and collected Physical Disk Transfers/sec counters every 15 seconds for 24 hours and analyzed the data in Excel to find the 95th Percentile for total average IOPS (this is a pretty standard exercise if you are buying an enterprise storage array or SAN). Let’s say that you find that the server in question was asking for 1000 IOPS at the 95th Percentile (let’s stick with our 60% read/40% write workload). And finally, let’s say we put this workload on a RAID5 array. That’s saying a lot of stuff, but what does it all mean? Because RAID5 has a write penalty factor of 4 (again, Duncan Epping’s posted a great article here which I referenced in Part II that describes this in a slightly different way) we can tweak the previous formula to show the IO’s to the backend array given a specific workload.
I = Target workload IOPS
f = IO penalty
r = % Read
w = % Write
IO = (I * r) + (I * w) * f
Our example then looks like this (remember to work inside the parentheses first, and then My Dear Aunt Sally):
(1000 * .6) + ((1000 * .4) * 4) = 2200
Simply stated, this means that for every 1000 IOPS that our workload requests from our storage system, the backing array must perform 2200 IO’s, and it had better do it quickly or you will start to see latency and queuing (we call this performance degradation, boys and girls!). Again, this is a very simplistic approach neglecting factors like cache, randomness of the workload, stripe size, IO size, and partition alignment which can all impact requirements on the backend. I’ll cover some of those later.
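The whole exercise, from perfmon samples to backend IO's, can be sketched in Python as well. Two caveats: the percentile helper uses the simple nearest-rank method, which is an assumption on my part and can differ slightly from what Excel's PERCENTILE function returns, and the sample values are invented for the example.

```python
def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (whole number 1-100) using nearest rank."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need at least one sample and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling
    return ordered[rank - 1]


def backend_ios(target_iops, read_fraction, write_fraction, write_penalty):
    """IO = (I * r) + (I * w) * f"""
    return (target_iops * read_fraction) + (target_iops * write_fraction) * write_penalty


# Invented Disk Transfers/sec samples: 50, 100, ... 1000.
samples = list(range(50, 1050, 50))
print(percentile_nearest_rank(samples, 95))     # 950

# The worked example from the text: 1000 IOPS, 60/40 mix, RAID5 penalty of 4.
print(round(backend_ios(1000, 0.6, 0.4, 4)))    # 2200
```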
As you can hopefully see, the laws of physics combined with some simple math can provide some pretty useful numbers. A basic understanding of your array configuration against your workload requirements can go a long way in preventing storage bottlenecks. You may also find that as you consider the cost per disk against various spindle speeds, capacities and RAID levels that you are better off buying smaller, faster, fewer, more, slower…. disks depending on your requirements. The geekier amongst us could even take these formulas and some costs per disk and hit up Excel Goal Seek to find the optimal level, but that’s more than this little blog can do for you today.
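As a starting point for that kind of what-if analysis, here is a small Python sketch that turns a workload into a minimum spindle count. It is deliberately naive: it reuses the 180 IOPS per 15k disk figure and the write penalties implied by the formulas in this post (4 for RAID5, 2 for RAID10), and it ignores cache, capacity, and cost. The function names are my own.

```python
import math


def backend_ios(target_iops, read_fraction, write_fraction, write_penalty):
    """Backend IO's generated by a front-end workload."""
    return (target_iops * read_fraction) + (target_iops * write_fraction) * write_penalty


def disks_required(target_iops, read_fraction, write_fraction, write_penalty, iops_per_disk):
    """Minimum number of disks needed to absorb the backend IO load."""
    load = backend_ios(target_iops, read_fraction, write_fraction, write_penalty)
    return math.ceil(load / iops_per_disk)


# 1000 IOPS at 60% read / 40% write on 15k disks (180 IOPS each).
print(disks_required(1000, 0.6, 0.4, 4, 180))  # RAID5:  13
print(disks_required(1000, 0.6, 0.4, 2, 180))  # RAID10: 8
```

Multiply those counts by your cost per disk and you have the beginnings of the trade-off analysis described above.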
Before I wrap up this post, I want to leave you with a few more links that I have bookmarked around the topics of IOPS and RAID over the past several years:
- DB sizing for Microsoft Operations Manager, includes a nice chart with formulas similar to the ones I provided in this article: http://blogs.technet.com/jonathanalmquist/archive/2009/04/06/how-can-i-gauge-operations-manager-database-performance.aspx
- An Experts Exchange post with some good info in the last entry on the page (subscription required) http://www.experts-exchange.com/Storage/Storage_Technology/Q_22669077.html
- A Microsoft TechNet article with storage sizing for Exchange – a bit dated but still applicable: http://technet.microsoft.com/en-us/library/aa997052(EXCHG.65).aspx
- A simple whitepaper from Dell on their MD1000 DAS array – easy language to help the less technical along: http://support.dell.com/support/edocs/systems/md1120/multlang/whitepaper/SAS%20MD1xxx.pdf
- A great post that uses some math to show performance and cost trade-offs of RAID level, disk type, and spindle speed. http://www.yonahruss.com/architecture/raid-10-vs-raid-5-performance-cost-space-and-ha.html
- Another nifty post that looks at cost vs. performance vs. capacities of various disk speeds in an array http://blogs.zdnet.com/Ou/?p=322
A reader named Mark contacted me today and asked if there was a way to reduce the size of the batch output from an ESXTOP run. And he asks for good reason: depending on the number of VM’s on your host, the delay between ESXTOP samplings and the number of samples you collect, using the All Stats option (-a) can yield a massive file in a short period of time. If written to a partition on your ESX Service Console you run the risk of filling the partition, and forget about actually being able to analyze the data in PERFMON or Excel. For example, on an ESX host running ~15 VM’s I produced 100MB worth of CSV using the -a switch, sampling every 15 seconds, for just under 2 hours. ESXTOP uses 5-second intervals by default; I used -d 15 to change the sampling delay. Had I gone with the default my output would have been bigger.
To reduce the size of your output, you can change your sampling delay to something larger, say 30-seconds. I suppose you could also capture statistics when the host is not busy so you get fewer characters in the results, but that’s just being goofy.
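For a rough feel of how the delay affects file size, you can scale from a run you have already done. The Python sketch below uses my numbers from above (about 100MB over roughly 2 hours at 15-second samples) and assumes, as a simplification, that output grows linearly with the number of samples. The function names are my own.

```python
def sample_count(duration_seconds, delay_seconds):
    """Number of samples collected over a capture window."""
    return duration_seconds // delay_seconds


def estimate_csv_mb(duration_seconds, delay_seconds, mb_per_sample):
    """Estimated CSV size, assuming size grows linearly with sample count."""
    return sample_count(duration_seconds, delay_seconds) * mb_per_sample


TWO_HOURS = 2 * 3600

# Roughly 100MB over 480 samples works out to about 0.21MB per sample.
observed_mb_per_sample = 100 / sample_count(TWO_HOURS, 15)

# The same two hours sampled every 30 seconds instead of every 15.
print(round(estimate_csv_mb(TWO_HOURS, 30, observed_mb_per_sample)))  # 50
```

Your per-sample size will vary with the number of VM's and the statistics you collect, so measure a short run on your own host first.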
A better way to reduce your ESXTOP output size is to selectively include only the statistics you are interested in, which is really what Mark was asking. After all, all statistics from ESXTOP can be too many statistics, and chances are you already know what stats you are interested in. Here’s how you can narrow down the collected stats for easier analysis and smaller output:
1. Enter ESXTOP in interactive mode on the Service Console by simply typing esxtop at the # prompt.
2. Switch to a component you are NOT interested in capturing statistics on by pressing the corresponding menu option (c: ESX cpu, m: ESX memory, d: ESX disk adapter, u: ESX disk device, v: ESX disk VM).
3. Press f when viewing the component you do not want to capture. A list of fields will be displayed. You can toggle the fields on and off by pressing the letter corresponding to each field. An * indicates that the field is on. You want to turn off all of the fields you don’t want to collect.
4. Repeat steps 2 & 3 for the remaining components, leaving only what you want to capture.
5. Switch to the component you want to capture in batch mode and repeat step #3, except you will now enable what you want to capture.
6. Press W (capital W – case sensitive) to write out the ESXTOP configuration file. You can accept the default or create new configuration files. You may want to create a CPU-only config file, memory-only, and so forth.
7. Press CTRL+C to stop ESXTOP.
8. Now, invoke ESXTOP in batch mode, calling the updated or new configuration file you created in step #6 using the -c switch. Here’s an example:
# esxtop -b -d 30 -n 480 -c .esxtopcpustats > /tmp/esxtop_cpu_stats.csv
where .esxtopcpustats is an ESXTOP config file with only CPU stats. -d sets your capture interval to 30 seconds, and -n sets the number of samples to 480 (or 4 hours with a delay of 30 seconds).
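If you find yourself doing the samples-per-hour math by hand, a few lines of Python can work out the -n value and assemble the batch command for you. This sketch only builds the command string using the switches shown above (-b, -d, -n, -c); it does not run anything, and the function name is my own.

```python
import shlex


def esxtop_batch_command(config_file, output_csv, duration_seconds, delay_seconds):
    """Build an esxtop batch-mode command line for a given capture window."""
    samples = duration_seconds // delay_seconds
    args = ["esxtop", "-b", "-d", str(delay_seconds),
            "-n", str(samples), "-c", config_file]
    quoted = " ".join(shlex.quote(arg) for arg in args)
    return quoted + " > " + shlex.quote(output_csv)


# Four hours of CPU stats sampled every 30 seconds.
print(esxtop_batch_command(".esxtopcpustats", "/tmp/esxtop_cpu_stats.csv", 4 * 3600, 30))
# esxtop -b -d 30 -n 480 -c .esxtopcpustats > /tmp/esxtop_cpu_stats.csv
```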
Once your capture is complete you can replay the sampling in ESXTOP using replay mode (-R), or you can copy the .csv to a Windows system and use PERFMON or Excel to analyze the stats. If using PERFMON or Excel you will notice that the system summary information displayed at the top of an interactive ESXTOP session is included in the output (console memory, console cpu, etc.). As far as I know, there is no way to disable this, nor would you want to as it includes the time stamp necessary to interpret your data.
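If the CSV is still too wide for Excel, you can also trim it after the fact. Here is a minimal Python sketch that keeps the first column plus any column whose header contains a keyword. I am assuming the time stamp lives in the first column, and the sample data below is invented; real ESXTOP headers are much longer and include your host name.

```python
import csv
import io


def filter_columns(csv_text, keyword):
    """Keep the first column plus every column whose header contains keyword."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep = [0] + [
        index for index, name in enumerate(header)
        if index > 0 and keyword.lower() in name.lower()
    ]
    rows = [[header[index] for index in keep]]
    for row in reader:
        if len(row) == len(header):  # skip truncated rows
            rows.append([row[index] for index in keep])
    return rows


# Invented sample data standing in for a real capture.
sample = (
    "Timestamp,host1 Physical Cpu Util,host1 Memory Free MBytes\n"
    "04/01/2010 10:00:00,41.5,2048\n"
    "04/01/2010 10:00:30,63.2,2040\n"
)
for row in filter_columns(sample, "cpu"):
    print(row)
```

For a real capture you would read the file contents from disk and write the filtered rows back out with csv.writer.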
It is possible to use the vSphere CLI or the vSphere Management Assistant (vMA) to run RESXTOP, a version of ESXTOP designed for remote administration of ESXi or ESX. Note, however, that RESXTOP from the vSphere CLI only works from a Linux client. Using either of these tools will help you to automate ESXTOP statistics collection from multiple hosts using customized configuration files.






