
Storage Basics – Part VI: Storage Workload Characterization

April 8, 2010 by Josh Townsend 21 Comments

Most of what I covered in Storage Basics Parts 1 through 5 was at a very elementary level. The math I used to do IOPS calculations, for example, holds only under specific conditions. RAID controllers implement caching and other techniques that skew the simple math that I provided. I mentioned that the type of interface that you ought to use on your storage array should not be randomly chosen. In fact, choosing the right array with the appropriate components and characteristics can only be done when you inform your decision with a characterization of the workloads it will be running.

The character of your storage workload can be broken down into several traits – random vs. sequential I/O, large vs. small I/O request size, read vs. write ratio, and degree of parallelism. The traits of your particular workload dictate how it interacts with the components of your storage system and ultimately determine the performance of your environment under a given configuration. There is an excellent whitepaper available from VMware entitled “Easy and Efficient Disk I/O Workload Characterization in VMware ESX Server” that is authoritative on this subject. If you want to get down and dirty with the topic, it’s a good read. I’m aiming for something a bit less academic. With that said, let’s break down workload characterization a bit so as to better understand how it will impact your real-world systems.

Random vs. Sequential Access

In Part II of this series we looked at the formula for calculating IOPS capabilities for a single disk. That formula goes something like this:

IOPS = 1000/(Seek Latency + Rotational Latency)

You’ll recall that we divide into 1000 to remove milliseconds from the equation, leaving (Seek Latency + Rotational Latency) as the important part of the equation. Rotational latency is based on the spindle speed of the disk – 7.2k, 10k, or 15k RPM for standard server or SAN disks. If we consider the same Seagate Cheetah 15k drive from Part II, we see that rotational latency is 2.0ms. The only way to change rotational latency is to buy faster (or slower) disks. This essentially leaves seek latency as the only variable that we can “adjust”. You’ll also recall that seek latency was the larger of the latencies (3.4ms for read seeks, and 3.9ms for write seeks) and counts more against IOPS capability than does rotational latency. Seeking is the most expensive operation in terms of performance.
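To make the arithmetic concrete, here is a minimal Python sketch of the formula above using the Cheetah 15k figures quoted in this section. The function and variable names are made up for illustration, and the result is a theoretical single-spindle number that ignores cache, queuing, and controller behavior.

```python
# Single-disk IOPS estimate from the formula above.
# Latency figures are the Seagate Cheetah 15k numbers quoted in the text.

def disk_iops(seek_ms: float, rotational_ms: float) -> float:
    """Return theoretical IOPS for one spindle, ignoring cache and queuing."""
    return 1000.0 / (seek_ms + rotational_ms)

ROTATIONAL_MS = 2.0   # average rotational latency at 15k RPM
READ_SEEK_MS = 3.4    # average read seek
WRITE_SEEK_MS = 3.9   # average write seek

read_iops = disk_iops(READ_SEEK_MS, ROTATIONAL_MS)
write_iops = disk_iops(WRITE_SEEK_MS, ROTATIONAL_MS)

print(f"read:  {read_iops:.0f} IOPS")   # about 185
print(f"write: {write_iops:.0f} IOPS")  # about 169
```

Because seek latency is the larger term in the denominator, anything that reduces seeking (a more sequential workload, for example) moves these numbers more than a change in rotational latency of the same size would.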

It is next to impossible to adjust seek latency on a disk because it is determined by the speed of the servos that move the heads across the platter. We can, however, send workloads with different degrees of randomness to the platter. The more sequential a workload is, the less time that will be spent in seek operations. A high degree of sequentiality ultimately leads to faster disk response and higher throughput rates. Sequential workloads may be candidates for slower disks or RAID levels. Conversely, workloads that are highly randomized ought to be placed on fast spindles in fast RAID configurations.

You’ll notice that I said it was next to impossible to adjust seek latency on a disk. While not common, some storage administrators employ a method known as ‘short stroking’ when configuring storage. Short stroking uses less than the full capacity of the disk by placing data at the beginning of the disk where access is faster, and not placing data at the end of the disk where seek times are greater. This results in a smaller area on the disk platter for heads to travel over, effectively reducing seek time at the expense of capacity.

While not applicable to all workloads, storage arrays, or file systems, fragmentation can cause higher degrees of randomness leading to degraded performance. This is the prime reason some vendors recommend that you regularly defragment your file system. It should be noted that a VMware VMFS file system is resilient against the forces of fragmentation. Whereas a Windows NTFS partition may hold hundreds, thousands or tens of thousands of files of different sizes, accessed randomly throughout the system’s cycle of operations, a VMFS datastore typically holds no more than a couple hundred files. Additionally, most of the files on a VMFS datastore are created contiguously if you are using thick-provisioned virtual disks (VMDK). Thin-provisioned VMDK’s are slightly more susceptible to fragmentation, but do not typically suffer a high enough degree of fragmentation to register a performance impact. See this VMware whitepaper for more on VMFS fragmentation: Performance Study of VMware vStorage Thin Provisioning.

Examples of sequential workloads include backup-to-disk operations and the writing of SQL transaction log files. Random workloads may include collective reads from Exchange Information Stores or OLTP database access. Workloads are often a mix of random and sequential access, as is the case with most VMware vSphere implementations. The degree to which they are random or sequential dictates the type of tuning you should perform to obtain the best possible performance for your environment.

I/O Request Size

I/O request size is another important factor in workload characterization. Generally speaking, larger reads/writes are more efficient than smaller I/O to a certain point. The use of larger I/O requests (64KB instead of 2KB, for example) can result in faster throughput and reduced processor time. Most workloads do not allow you to adjust your I/O request size. However, knowing your I/O request size can help with appropriate configuration of certain parameters such as array stripe size and file system cluster size. Check with your storage vendor for more information as it pertains to your specific configuration.

If you are in a Windows shop, you can use perfmon counters such as Avg. Disk Bytes/Read to determine average I/O size. If you are running a VMware-virtualized workload, you can take advantage of a great tool – vscsiStats – to identify your I/O request size. More on vscsiStats later in this article.

Read vs. Write

Every workload will display a differing amount of read and write activity. Sometimes a specific workload, say Microsoft Exchange, can be broken down into sub-workloads for logging (write-heavy) and reading the database (read-heavy). Understanding the read-to-write ratio may help with designing the underlying storage system. For example, a write-heavy workload may perform better on a RAID10 LUN than a RAID5 array due to the write penalty associated with RAID5. The ratio of read:write may also dictate caching strategies. The read:write ratio, when combined with a degree of randomness measure, can be quite useful in architecting your storage strategy for a given application or workload.
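To see why the read:write ratio matters for RAID selection, here is a rough Python sketch. It assumes the commonly cited back-end write penalties of 2 for RAID10 and 4 for RAID5 (these values are not given in this article), and the spindle count and per-disk IOPS are hypothetical numbers chosen only for illustration.

```python
# Assumed write penalties: back-end disk I/Os generated per front-end write.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4}

def frontend_iops(raw_iops: float, read_fraction: float, raid_level: str) -> float:
    """Estimate host-visible IOPS from raw spindle IOPS and the read:write mix."""
    penalty = WRITE_PENALTY[raid_level]
    write_fraction = 1.0 - read_fraction
    return raw_iops / (read_fraction + write_fraction * penalty)

raw = 8 * 180  # hypothetical array: eight spindles at roughly 180 IOPS each

# A write-heavy mix (30% reads, 70% writes) on each RAID level.
for level in ("RAID10", "RAID5"):
    print(level, round(frontend_iops(raw, read_fraction=0.3, raid_level=level)))
```

With this write-heavy mix the same eight disks deliver noticeably fewer host-visible IOPS on RAID5 than on RAID10. Rerun it with `read_fraction=0.9` and the gap narrows considerably, which is the point of measuring the ratio before choosing a RAID level.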

Parallelism/Outstanding I/O’s

Some workloads are capable of performing multi-threaded I/O. These types of workloads can place a higher amount of stress on the storage system and should be understood when designing storage, both in terms of IOPS and throughput. Multipathing may help with multi-threaded I/O workloads. A typical VMware vSphere environment is a good example of a workload capable of queuing up outstanding I/O.

Measuring the Characteristics of Your Workload

So how do we actually characterize storage workloads? Start with the application vendor – many have published studies that can shed light on specific storage workloads in a standard implementation. If you are interested in measuring your own for planning/architecture reasons, or performance troubleshooting reasons, read on…. There are several tools to measure storage characteristics, depending on your operating system and storage environment. Standard OS performance counters, such as Windows Performance Monitor (perfmon) can reveal some of the characteristics. Array based tools such as NaviAnalyzer on EMC gear can also reveal statistics on the storage end of the equation.

One of the most exciting tools for storage workload characterization comes from VMware in the form of vscsiStats. vscsiStats is a tool that has been included in VMware ESX server since version 3.5. Because all I/O commands pass through the Virtual Machine Monitor (VMM), the hypervisor can inspect and report on the I/O characteristics of a particular workload, down to a unique VM running on an ESX host. There is a ton of great information on using vscsiStats, so I won’t re-hash it all here. I recommend starting with Using vscsiStats for Storage Performance Analysis as it contains an overview and usage instructions. If you want to dig a bit deeper into vscsiStats, read both Storage Workload Characterization and Consolidation in Virtualized Environments and vscsiStats: Fast and Easy Disk Workload Characterization on VMware ESX Server.

vscsiStats can generate an enormous amount of data which is best viewed as a histogram. If you’re a glutton for punishment, the data can be reviewed manually on the COS. To extract vscsiStats output data, use the -c option to export to a .csv file. From there you can analyze the data and create histograms using Excel. Paul Dunn has a nifty Excel macro for analyzing and reporting on vscsiStats output here. Gabrie van Zanten has more detailed instructions for using Paul’s macro here. Here are a couple histogram examples that I just generated from a test VM.

vscsiStats is only included with ESX, not ESXi. However, Scott Drummonds was kind enough to post a download of vscsiStats for ESXi on his Virtual Pivot blog: https://vpivot.com/2009/10/21/vscsistats-for-esxi/. Using vscsiStats on ESXi requires dropping into Tech Support Mode (unsupported) and enabling ESXi for scp to transfer the binary to the ESXi server.

VMware esxtop can display some information but is limited in scope and does not currently support NFS. A community-supported python script called nfstop can parse vscsiStats data and display esxtop-like data per VM on screen.

Experiment

If you are interested in generating workloads with various characteristics, check out Iometer and Bonnie++. These tools will allow you to generate I/O that you can monitor with the tools I covered in this article.

Put it to Use

If you are provisioning a new workload or expanding an existing one, invest some time in understanding your storage workload characteristics and convey those characteristics to your storage team. A request for storage that includes the workload characteristics I discussed here, as well as expected IOPS requirements, will go much further in ensuring performance for your applications – physical or virtual – than simply asking for a certain capacity of disk.


Storage Basics – Part V: Controllers, Cache and Coalescing

March 23, 2010 by Josh Townsend 14 Comments

My Storage Basics series has been neglected for some time (sick kids, snow storms, VMware Upgrades, SAN implementations and some Cisco switch upgrades took all my free time), so let’s jump right in to Part V – Cache, Controllers, and Coalescing. Between the alliteration and fancy words, it might seem like I am about to tell a tale of international espionage. Unfortunately, my introductory treatment of these aspects of a storage system will probably not keep you on the edge of your seat – but I’ll try to keep it interesting.

Throughout this series, we’ve been working our way from the basic building block of any storage system – the disks – outwards towards the brains of the operation – the controller. You’ll recall that in Part II I introduced IOPS and the math that goes into calculating the IOPS capacity of a disk array. In Part III we considered a RAID implementation’s impact on performance and availability. And most recently in Part IV we looked at the common interface types when dealing with storage arrays. If we put the previous parts together we still don’t have a functional storage system. The missing piece is the controller. Simply put, the storage controller is the hardware adapter between the disks and the servers that connect to the storage. The controller has a specific ‘interface’ type, is responsible for RAID operations, and handles advanced storage functionality. A controller can be as simple as the Dell PERC or HP Smart Array add-in card on your server, or as complex as the Storage Processor in an enterprise class Storage Area Network (SAN) such as an EMC CLARiiON or NetApp FAS.

Controllers

As we look at controllers and the advanced features they provide we’ll see that some of the earlier performance equations start to break down. The simplest controllers take disk read/write commands from the operating system and send commands down to the disk(s) attached to be read or written. This gets data onto the disk, but often does not do so in an efficient or reliable manner. RAID-capable controllers take on the added responsibility of configuring disks in the desired RAID level, calculating & writing parity data, and writing the data in disk-spanning stripes or mirrors depending on the RAID level.

Cache

To increase performance and improve reliability, storage vendors implement a caching system on their controllers. Cache is memory that acts as a buffer for disk I/O, and is usually battery-backed to prevent data loss in the event of a power failure. Because of the exponentially greater speed of RAM over spinning magnetic disks, cache can improve performance by orders of magnitude. Cache can operate on both reads and writes to disk.

When dealing with writes, the controller cache is typically used in one of two ways: write-through or write-back. In write-through mode, data is written to volatile cache and then to disk, and only acknowledged as written once the data resides on the non-volatile disk. Write-back mode allows the controller to acknowledge the data as having been written as soon as it is held in cache. This allows the cache to buffer writes quickly and then write them to the slower disk when the disk has cycles to accept I/O. The greater your cache size, the more data that can be buffered, ultimately resulting in better performance as measured in both IOPS and throughput. This graph from my article on troubleshooting write performance on an IBM DS3300 iSCSI array shows how throughput increased and latency decreased when enabling write cache. The extent to which cache increases performance is highly dependent on the workload characteristics (I/O size, randomness, and ratio of reads:writes).

Read-cache acts as a buffer for reads in a couple ways. First, some controllers attempt to ‘read-ahead’, anticipating future read requests from the operating system and buffering what they expect to be the next blocks of desired data. Some entry-level controllers simply buffer the next physical chunk of data and fill cache memory with it, while more advanced controllers may attempt to predict the right block of data based on previous requests (you just asked for 3 blocks in a row, I’m guessing you’ll come asking for the 4th next so I’ll just buffer it in fast cache for you now). Secondly, read cache holds data that has been previously read, regardless of any pre-fetching the controller may have done. This allows for much faster subsequent access of the same data because it is held in the faster cache, eliminating the need for the controller to go to disk for the data again. Just like with write cache, the extent to which cache increases performance is highly dependent on the workload characteristics.
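The two read-cache behaviors described above can be modeled with a toy Python class. This is only a sketch under simplifying assumptions, not how any particular controller works: the class name and the three-in-a-row trigger are hypothetical, the cache never evicts anything, and only one block is prefetched at a time.

```python
class ReadAheadCache:
    """Toy read cache: remembers blocks already read and prefetches the next
    block once it sees a run of consecutive block requests."""

    def __init__(self, run_length: int = 3):
        self.cache = set()          # blocks currently held in cache (no eviction)
        self.last_block = None      # most recently requested block
        self.run = 0                # length of the current sequential run
        self.run_length = run_length

    def read(self, block: int) -> bool:
        """Service one read; return True if it was a cache hit."""
        hit = block in self.cache
        self.cache.add(block)       # previously read data stays in cache
        if self.last_block is not None and block == self.last_block + 1:
            self.run += 1
        else:
            self.run = 1
        self.last_block = block
        if self.run >= self.run_length:
            self.cache.add(block + 1)   # read-ahead: buffer the expected next block
        return hit

cache = ReadAheadCache()
hits = [cache.read(b) for b in (10, 11, 12, 13, 14, 50, 10)]
print(hits)  # [False, False, False, True, True, False, True]
```

Blocks 13 and 14 hit because of read-ahead, while the second read of block 10 hits simply because it was read earlier. The random jump to block 50 gets no help from either mechanism, which is why the benefit depends so heavily on the workload.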

A given storage array controller only has so much cache to work with. A Dell PERC5/E, for example, has 256MB of cache that can be used for both read and write. While this may be enough for a direct-attached storage array, SAN’s serving multiple systems demand more cache. In contrast, an EMC CLARiiON CX4-960 has 32GB. Some storage vendors, such as NetApp, are getting creative with cache. NetApp’s Performance Acceleration Module (PAM) is an add-in card that provides up to a whopping 512GB of Layer 2 cache to the storage system.

Caching mechanisms can dramatically influence performance under the right conditions. With healthy cache in place, IOPS calculations become skewed. However, cache can be exhausted or may not hold the data you are interested in. If cache is insufficient to satisfy read requests, or has reached its high-water mark for writes, performance can drop off. When cache is exhausted, the backing disk must be able to satisfy the I/O workload or performance will be unacceptable. This is where the IOPS calculations kick in, and where having the right disk type and configuration really matters.

Queuing & Coalescing

Advanced storage systems introduce additional features to reduce I/O contention and improve cache utilization. I won’t go into all of the features here because they vary by storage vendor. However, I will point out two common techniques – queuing and coalescing.

Queuing refers to the ability of a storage system to queue storage commands for later processing. Queuing can take place at various points in your storage environment, from the HBA to the storage processor/controller. A little queuing may be OK depending on your workload, but too many outstanding I/Os can negatively impact performance (this is measured in latency). Queue depths can be adjusted on many components in your storage and VMware landscape, but check with your vendor’s support group before you make changes to these settings.
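The link between outstanding I/Os and latency can be approximated with Little's Law (outstanding I/Os = IOPS × latency). The Python sketch below assumes a back end that is already saturated at a fixed, hypothetical 1500 IOPS; real arrays are not this simple, so treat the numbers as illustrative only.

```python
def avg_latency_ms(outstanding_ios: float, iops: float) -> float:
    """Little's Law rearranged: average latency = outstanding I/Os / throughput."""
    return outstanding_ios / iops * 1000.0

SATURATED_IOPS = 1500  # hypothetical ceiling for the back-end disks

for depth in (4, 32, 64):
    print(f"{depth:>2} outstanding -> {avg_latency_ms(depth, SATURATED_IOPS):.1f} ms")
```

Once the disks cannot go any faster, every additional queued command shows up as added latency, which is why deeper queues are not automatically better.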

Coalescing is performed by some storage systems to modify the character of the workload. To better understand coalescing, picture a bunch of random write activity. Without cache in place, the disk heads will be bouncing all over the platters trying to get the data on to disk. A little write cache will allow the storage array to acknowledge the write for the OS, but the array still needs to de-stage that data from cache to disk quickly to prevent cache exhaustion. The back-end disks will still be doing the chicken dance, bouncing around trying to write the random workload…. Now picture an intelligent system that re-orders the random writes that are held in cache and writes them to the disk in nice sequential stripes. The disk heads will be less prone to jumping around the platter and the behavior will start to look more like a nice waltz than the funky chicken dance. Coalescing is used for writes, not reads, so not all workloads benefit.
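As a rough picture of what re-ordering writes in cache might look like, here is a small Python sketch. It is an assumption-laden toy, not any vendor's algorithm: pending writes are modeled as hypothetical (start_block, block_count) pairs, sorted by address, and merged whenever they touch or overlap.

```python
def coalesce(writes):
    """Sort pending (start_block, block_count) writes by address and merge
    requests that are adjacent or overlapping into larger sequential runs."""
    merged = []
    for start, count in sorted(writes):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This write touches or overlaps the previous run: extend that run.
            prev_start, prev_count = merged[-1]
            end = max(prev_start + prev_count, start + count)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, count))
    return merged

# Five writes arrive in random order (made-up block addresses).
pending = [(108, 4), (0, 8), (100, 8), (8, 8), (500, 4)]
print(coalesce(pending))  # [(0, 16), (100, 12), (500, 4)]
```

Five scattered requests become three larger, address-ordered ones, so the heads make fewer and shorter moves when the cache is de-staged.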

Wrap-up

With this article on Controllers, Cache, and Coalescing we’ll end our look at the basic building blocks of a storage array. Before we end the Storage Basics series I am planning a few more articles on Storage Workload Characterization (which has been mentioned, but not directly addressed in this and previous articles), Identifying a Stressed Storage System, and Best Practices for Storage Performance in a VMware Environment.

If you are interested in more reading on Controllers, Cache, and Coalescing, I recommend the following:

Additional Reading:

Impact of cache on the performance of the HP StorageWorks XP12000 Disk Array white paper
Performance impact of controller cache: SQL Server read only workloads
IOps? – Dig into the article’s comments for some great dialog between some people who really know their stuff!
Storage Performance for SQL Server
Storage Caching 101 – Chuck Hollis (EMC)
Improving Performance with Interrupt Coalescing for Virtual Machine Disk IO in VMware ESX Server


Installing PowerPath/VE using VMware Update Manager

February 5, 2010 by Josh Townsend 7 Comments

I am finishing up an installation of an EMC Clariion CX4 SAN. One of the final steps of the installation is to configure PowerPath/VE on the ESXi hosts. PowerPath/VE is EMC’s multipathing extension module for VMware (and Hyper-V), designed to replace the Native Multipathing Plugin (NMP) for increased I/O performance and failover management. To simplify and automate the installation of PowerPath/VE, I decided to use VMware Update Manager (VUM) to push the extension to the ESXi 4.x hosts in the environment.

The process of setting up an additional VUM patch repository to host PowerPath/VE (and other 3rd party extensions such as the Cisco Nexus 1000v) is pretty straightforward. 3rd party extensions are supported in VUM beginning with vSphere 4.0 Update 1. Chad Sakac has posted a great video guide on YouTube that covers the setup:

I opted to use the tomcat installation on the environment’s vCenter server to host the PowerPath/VE repository. To accomplish this, I simply created a new directory in the tomcat root directory. The default path for the root directory on a vSphere vCenter Server is “C:\Program Files\VMware\Infrastructure\tomcat\webapps” (or C:\Program Files (x86)\VMware\Infrastructure\tomcat\webapps on a 64-bit installation).

I created a directory named ‘depot’ and within that directory created a PowerPathVE folder. I extracted the contents of the VUM folder from the PowerPath .zip file that I downloaded from https://powerlink.emc.com. A screenshot of the directory is below:

After creating the directory for the patch repository, I simply added an Extension Repository to VMware Update Manager as Chad shows in his video. I would like to call out one caveat – Because vCenter may not listen on standard HTTP/HTTPS ports, I used as the path to the source.

Once PowerPath was added to an Extension Baseline in VUM, I simply had to scan my hosts for updates and remediate. Installation of PowerPath/VE requires the host to be in Maintenance Mode and concludes with a reboot. Pretty simple.

Then all you have to do is fight through an overly-complex licensing setup (seriously, a 112 page PDF on how to install licenses???), a bit of configuration, and you are multi-pathing with the best of them. If you are interested in learning more about PowerPath/VE, start with this whitepaper: EMC PowerPath/VE for VMware vSphere Best Practices Planning. For a bit of real-world insight into the performance increase you might see with PowerPath/VE, check out this blog post from Eric Sloof: Massive I/O power increase using EMC PowerPath/VE.

Update – 3/27/10: VMware published a Knowledge Base article on this procedure a few weeks after I wrote this post. You can find it in article 1018740.

Update – 4/15/11: You may have to set the NTFS permissions on the ‘depot’ folder to allow ‘anonymous’ read access when running on a 2008 or 2008 R2 server before you can validate and download from the new repository.

ESXTOP Batch Mode Analysis with Windows Perfmon

September 10, 2009 by Josh Townsend 12 Comments

I needed to grab some stats from my ESX hosts for off-line analysis so I fired up my trusty ESXTOP intent on using batch mode to capture a .csv formatted output. I started to manually select the counters I was interested in while working in ESXTOP interactive mode (you can save your selected counters to the esxtop configuration file with the ‘w’ command) and thought that there must be a better way. I found that better way in the VMware Performance Community: https://communities.vmware.com/docs/DOC-3930. There is now a -a switch that can be used to include ALL performance counters. I’m sold.

I wanted detailed information, so I decided on a 15 second capture interval to run for a 2 hour window. Here’s the command I used:

esxtop -a -b -d 15 -n 480 > /tmp/esxtopout.csv

where -a is for ALL, -b is for batch mode, -d is for delay, and -n is for the number of iterations ((60/15)*60*2). I wrote out the results to a .csv in /tmp. The resulting CSV weighed in at a whopping 100MB after 2 hours.
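If you want a different capture window or interval, the iteration count is easy to derive. Here is a small Python sketch of that arithmetic; the helper name is made up, and the command string it prints simply mirrors the one shown above.

```python
def esxtop_iterations(window_hours: float, delay_seconds: int) -> int:
    """Number of samples needed to cover the window at the given interval."""
    return int(window_hours * 3600 // delay_seconds)

delay = 15                      # seconds between samples (-d)
count = esxtop_iterations(2, delay)   # 2 hour window (-n)

print(f"esxtop -a -b -d {delay} -n {count} > /tmp/esxtopout.csv")
# esxtop -a -b -d 15 -n 480 > /tmp/esxtopout.csv
```

Keep in mind that halving the delay doubles the number of samples, and with -a the output file grows accordingly.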

The CSV can be analyzed in Excel (pivot tables work well for this) or in Windows Perfmon. I opened the log in Perfmon as I was after basic Min/Average/Max counters and Perfmon makes those easy to see. When adding the CSV log to Perfmon, you are prompted to select counters. I added all instances of Commands/sec, Reads/sec, and Writes/sec from Physical Disk (I was gathering some IOPS counts for a new storage proposal). I got a bit more than I bargained for: a mostly unresponsive Perfmon window and the ugliest darn graph I’ve ever seen.

Switching from a graph view to the report view allows you to easily view and remove specific counters that you are not interested in, or open the Properties of the data set, switch to the data tab and bulk select counters that you want to remove. I was not interested in vmhba1:x, specific VM’s or worlds, so I killed all of those, leaving just the base iSCSI device (vmhba32 in my case).

After some cleanup the graph looked a bit better and more importantly, I was able to easily read my Min/Average/Max stats:

Here are the takeaways –

ESXTOP is a powerful utility for performance monitoring
All stats (-a) can result in a huge file – use it wisely in batch mode; else use interactive mode to select your counters and write them to the user-defined configuration file. Invoke the config file with the -c option when running in batch mode.
Consider using vscsiStats for more granular reporting of storage performance and storage workload characterization.
ESXTOP physical disk stats do not include NFS volumes.

Do you use other tools or methods to collect basic disk IO counters for storage sizing purposes? If so, leave a comment describing your approach!

Balloon Driver Problems with SQL

September 9, 2009 by Josh Townsend 7 Comments

I have been meaning to write this up for a while; Scott Drummonds’ ‘Love Your Balloon Driver’ post today at his Virtual Performance blog gave me a nice reminder. I actually caught a sneak peek at the graphs with an explanation from Scott at his instructor-led lab at VMworld 2009. Scott calls out that the only workload they discovered that suffers from balloon driver activity is Java. The reason for Java’s problems with balloon driver activity is that Java itself runs in a VM and so the guest OS cannot properly determine which pages should be swapped out when the balloon driver calls for it.

My experience leads me to agree with Scott and the whitepaper he cites – in a properly designed and equipped environment the balloon driver is not detrimental for most every workload to a point. However, I recently discovered at a client site that the balloon driver can cause significant issues when the environment is poorly designed and under-sized. Here’s the background:

I was called into an already established environment where the client was running on an older blade with VMware ESX 3.5. The blade maxed out at 16GB RAM and had dual dual-core CPU’s with no hope for an upgrade. On the blade was a single guest VM running Windows 2003 with SQL 2005, in its full 32-bit glory. The VM was configured with 4 vCPU’s and 16GB of memory. Some of you can probably already guess where this is going….

The x86 Windows guest had PAE configured, and SQL took advantage of AWE to use the additional memory beyond the 4GB limit of a 32-bit system. Additionally, the Windows guest had the /3GB switch enabled in boot.ini. Finally, as per SQL best practices, the ‘Lock Pages in Memory’ permission was granted to the SQL Server service account. What the guest was left with was 1GB of kernel mode memory and 15GB of User Mode/Extended addressable memory.

And here’s the problem. The client was using ESX, not ESXi, so the Service Console required memory. In this case, the service console had approximately 512MB allocated to it. Furthermore, VM’s require some overhead on ESX to run. The memory overhead consumed by a Windows guest on ESX 3.5 with 4 vCPU and 16GB of memory is a bit more than 512MB. On a properly sized ESX server with multiple similar guests/workloads, you could probably gain much of the overhead back through transparent page sharing; but in this case I had a 1:1 P2V ratio. If you are any good at math you see that the environment is running about 1GB short of memory. A quick check of the balloon driver stat in vCenter showed that the balloon driver was constantly active and demanding about 1GB back from the guest… constantly.
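For anyone who wants the math spelled out, here is the memory budget as a short Python sketch. The figures come from the description above, with the VM overhead rounded down to 0.5GB (the text says "a bit more than 512MB"), so the real shortfall would be slightly larger. Variable names are just for illustration.

```python
host_ram_gb = 16.0          # physical RAM in the blade
service_console_gb = 0.5    # approx. 512MB allocated to the Service Console
vm_overhead_gb = 0.5        # per-VM overhead, rounded down from "a bit more than 512MB"
guest_configured_gb = 16.0  # memory configured on the single guest VM

# What is actually left for the guest after the host takes its share.
available_for_guest_gb = host_ram_gb - service_console_gb - vm_overhead_gb
shortfall_gb = guest_configured_gb - available_for_guest_gb

print(f"available to guest: {available_for_guest_gb:.1f} GB")  # 15.0 GB
print(f"shortfall:          {shortfall_gb:.1f} GB")            # 1.0 GB
```

That 1GB shortfall lines up with the roughly 1GB the balloon driver kept trying to reclaim from the guest.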

Under normal circumstances this might not be an issue, but in this case the Windows guest was being absolutely punished. The guest CPU’s were pegged at 100% with an excessive amount of kernel time, often indicating IO issues. And indeed I did experience terrible disk and network performance on the guest. At the root of the problem is this – the Lock Pages in Memory permission allows SQL to get a firm grasp on the user mode memory available to it (15GB) and lock it up. This left the already starved (because of the 3GB switch in the boot.ini) guest kernel with its 1GB the only thing the balloon driver could really swap out.

The client suggested a reservation of 16GB on the VM, knowing that memory reservations prevent balloon driver activity. I calmly asked them to back away from the keyboard and explained that if a starved guest was bad, a starved Service Console would be much worse. In the end the fix was quite easy – I convinced the customer that they should reduce the amount of memory allocated to the guest by about 1GB, enough to let the 512MB SC and the 512MB of overhead run without contention. I was able to show them the difference between allocated and active memory in vCenter – the 1GB being surrendered was not really being actively used, SQL just had it locked up. In fact, surrendering the 1GB of memory back to ESX breathed new life into the guest VM, bringing its performance back in line with expectations.

Ideally, I would have brought in a bigger ESX server that could serve additional VM’s, driving greater levels of efficiency across the environment. It just wasn’t an option for the client in this case. In the end, the problem was fixed and I was reminded just how fun it can be to explain some of these backwards sounding virtualization concepts to customers – fewer vCPU’s can lead to better performance of guests, less guest memory can fix performance issues, and increasing the quantity of similar guests on a host can drive better performance to a point because of transparent page sharing.

Stay tuned over the next few weeks as I digest and write on my VMworld experience – from VMUG activities to Paul Maritz’s press conference announcing the vCloud Express, and plenty of great sessions in between. Like many of you, I returned from VMworld with quite a backlog of work but I’ll do my best to squeeze in some posts and tweets.