Posts Tagged ‘performance’
In Part I of this series, I discussed the importance of storage performance in a virtual environment (really any environment, virtual or not, where you want acceptable performance), and introduced some of the basic measures of a storage environment. In Part II, we will look more closely at what may be the most important storage design consideration in VMware server-consolidation environments, many SQL environments, and VDI environments, to name a few: IOPS.
If we stick with a single-disk-centric approach as we did in Part I, IOPS is quite simply a measure of how many read and write commands a disk can complete in a second. IOPS is an important measure of performance in a shared storage environment (such as VMware) and in high-transaction-rate workloads like SQL. Because hard drives are forced to abide by the laws of physics, the IOPS capabilities of a disk are consistent and predictable given a specific configuration. The formula for calculating IOPS for a given disk, with both latencies expressed in milliseconds, is pretty straightforward (please show your work):
IOPS = 1000/(Seek Latency + Rotational Latency)
Exact latencies vary by disk type, quality, number of platters, etc. You can look up the tech specs for most drives on the market. As an example, I have randomly chosen the technical specifications of the Seagate Cheetah 15K.7 SAS drive. This particular drive has the following performance characteristics:
- Average (rotational) latency: 2.0msec
- Average read seek (latency): 3.4msec
- Average write seek (latency): 3.9msec
Using the read latency number, the math works out like this:
IOPS = 1000 / (2.0 + 3.4) = 185 maximum read IOPS
The maximum write IOPS will be a bit less (~169 IOPS, from 1000 / (2.0 + 3.9)) because of the higher write seek latency. Writing is more ‘expensive’ than reading and therefore slower.
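To make the arithmetic easy to repeat for other drives, here is a minimal Python sketch of the formula. The function and constant names are my own for illustration; the latency figures are the ones quoted from the spec sheet above.

```python
def max_iops(seek_ms: float, rotational_ms: float) -> float:
    """Theoretical maximum IOPS for one disk: 1000 / (seek + rotational latency in ms)."""
    return 1000.0 / (seek_ms + rotational_ms)


# Latencies quoted above for the example 15k SAS drive (milliseconds)
ROTATIONAL_MS = 2.0
READ_SEEK_MS = 3.4
WRITE_SEEK_MS = 3.9

read_iops = max_iops(READ_SEEK_MS, ROTATIONAL_MS)
write_iops = max_iops(WRITE_SEEK_MS, ROTATIONAL_MS)

print(f"read:  {read_iops:.0f} IOPS")   # read:  185 IOPS
print(f"write: {write_iops:.0f} IOPS")  # write: 169 IOPS
```

Swap in the seek and rotational latencies from any other spec sheet to get that drive's theoretical ceiling.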
Fortunately, there are some widely accepted ‘working’ numbers, so you do not have to use this formula for each and every disk you might consider using. Because rotational latency is based on the rotational speed, we can use the published Rotations Per Minute (RPM) rating of the drive to guess-timate the IOPS capabilities. Typical spindle speeds (measured in RPM) and their equivalent IOPS are in the table below.
Disk type (RPM)    Typical IOPS
7,200              80
10,000             130
15,000             180
SSD                2,500 – 6,000
While not a traditional spinning disk, I have also included Solid State Disks (SSDs) for reference, as SSDs are starting to see increased market adoption. I have seen a wide range of sizing IOPS for SSDs depending on the technology and type (SLC, MLC, etc.). Check out http://en.wikipedia.org/wiki/Solid-state_drive for an introduction, and ask your vendors for more in-depth technical information.
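The link between spindle speed and rotational latency can be sketched in a few lines of Python. On average the platter has to turn half a revolution before the requested sector passes under the head, so average rotational latency is half the time of one rotation. The function name is my own, and this covers only the rotational part; the rule-of-thumb IOPS in the table also fold in seek time, which varies by drive.

```python
def avg_rotational_latency_ms(rpm: int) -> float:
    """Average rotational latency: the time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm  # 60,000 ms in a minute
    return ms_per_revolution / 2.0


for rpm in (7_200, 10_000, 15_000):
    print(f"{rpm:>6} RPM -> {avg_rotational_latency_ms(rpm):.2f} ms")
```

For a 15,000 RPM drive this gives 2.0 ms, which matches the rotational latency on the spec sheet quoted earlier.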
If you are brand-new to this (and you are still reading, congrats!), you can see how many IOPS your Windows computer is asking for by opening Performance Monitor and looking at the ‘Disk Transfers/sec’ counter under Physical Disk. This is a sum of the ‘Disk Reads/sec’ and ‘Disk Writes/sec’ counters as you can see in the screenshot below:
If you are after some stats for your VMware ESX environment, check out esxtop and look for CMDS/s in the output. I published a couple of articles on using esxtop here and here. The numbers from PerfMon and esxtop get you pretty close but can be skewed by a few things we’ll discuss in later posts.
Now that was fun and all, but let’s get real: Single-disk configurations are uncommon in servers. As such, we’ll part ways with our Simple Jack single-disk approach to storage and begin to look at more real-world multi-disk enterprise-class storage configurations. A discussion of IOPS in a multi-disk array is a great way to start. From a very elementary perspective, you can combine multiple hard drives together to aggregate their performance capabilities. For example, two 15k RPM disks working together to serve a workload could provide a theoretical 360 IOPS (180 + 180). This also scales out, so ten 15k RPM disks could provide 1,800 IOPS, and 100 15k RPM disks could provide 18,000 IOPS.
Designing your environment so that your storage can deliver sufficient IOPS to the requesting workload is of utmost importance. If you are working on a storage design, arm yourself with data from perfmon, top, iostat, esxtop, and vscsiStats. I typically gather at least 24 hours of performance data from systems under normal conditions (a few days to a week may be good if you have varying business cycles) and take the 95th percentile as a starting point. So from a very simple approach, if your data and calculations show an 1,800 IOPS demand at the 95th percentile, you ought to have at least ten 15k RPM disks (or twenty-three 7.2k RPM SATA disks) to achieve performance goals. It’s amazing how some simple data and a pretty little Excel spreadsheet can help you understand and justify the right hardware for the job.
Now, before you go and start filling out that PO form for a nice new storage system based on these numbers, there are a few more things we ought to discuss. RAID, cache, and advanced storage technologies will skew these numbers and need to be understood. Stay tuned to future articles in this series for more on those topics.
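If a spreadsheet is not your thing, the same sizing exercise can be sketched in Python. Treat this as a rough illustration only: the sample values are made up, the helper names are my own, the percentile uses the simple nearest-rank method, and the per-disk figures are the rule-of-thumb numbers from the table above (no RAID or cache effects).

```python
import math

# Rule-of-thumb IOPS per disk, taken from the table earlier in this post
IOPS_PER_DISK = {"7.2k": 80, "10k": 130, "15k": 180}


def percentile(samples, pct):
    """Nearest-rank percentile (pct is a whole number, e.g. 95)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def disks_needed(demand_iops, disk_type):
    """Smallest disk count whose summed rule-of-thumb IOPS meets the demand."""
    return math.ceil(demand_iops / IOPS_PER_DISK[disk_type])


# Hypothetical 'Disk Transfers/sec' samples; real data would cover 24+ hours
samples = [300, 450, 500, 620, 700, 750, 800, 880, 900, 950,
           1000, 1100, 1150, 1200, 1300, 1400, 1500, 1650, 1800, 2900]

demand = percentile(samples, 95)
print(f"95th percentile demand: {demand} IOPS")
for disk_type in ("15k", "10k", "7.2k"):
    print(f"{disk_type:>5} RPM disks needed: {disks_needed(demand, disk_type)}")
```

With these made-up samples the 95th percentile lands on 1,800 IOPS, so the sketch reproduces the ten 15k and twenty-three 7.2k disks worked out above, while ignoring the one 2,900 IOPS spike.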
Finally, there has been a bunch of activity in the VMware ecosystem of vendors, bloggers, and twittering-type-folks around storage performance. As this here post sat in my drafts folder, Duncan Epping posted this gem of an article that pretty much included all of the content of this article, as well as future ones in my series: http://www.yellow-bricks.com/2009/12/23/iops/. Do yourself a favor and read his post and the comments from his readers – both are filled with a ton of great information, including some vendor-specific implementations.
I was led to Duncan’s article by a post by Chad Sakac on his blog: http://virtualgeek.typepad.com/virtual_geek/2009/12/whats-what-in-vmware-view-and-vdi-land.html. This is also a great read that covers some of the same information with a focus on VMware View/VDI and is also worth a few minutes of your time. Also check out http://vpivot.com/2009/09/18/storage-is-the-problem/ for a rubber-meets-the-road post from Scott Drummonds on the importance of storage performance vis-a-vis IOPS in a VMware-virtualized SQL environment.
I am increasingly finding that both my SMB and Enterprise customers are uneducated on the fundamentals of storage sizing and performance. As a result, storage is often overlooked as a performance bottleneck despite it being a vital component to consider in a virtualization implementation. Storage will only increase in importance as hosts get bigger, data volumes increase, and more workloads are virtualized. For some reason, most people can grasp the importance of CPU and memory performance constraints, but storage performance is often overlooked and can be hard to explain to business users or executives.
Case in point – I have recently been called into some environments that were not performing well – these environments happened to be running Microsoft SQL, but could just as well have been running any application or collection of virtual machines. Fingers were being pointed in all directions: at applications, at the virtualization layer, at a lack of memory, and DBAs were insisting that there were too few CPUs. The situation was getting political and emotional when I walked into it. A few minutes with Windows Perfmon was all I needed to identify storage performance as the root cause of the firestorm that had been ignited. Using a bit of data, I was able to turn the discussion from an emotional fight to a simple problem of physics and mathematics (and a bit of simple math could have avoided the problem in the first place).
I have seen this play out a few too many times and so decided to write up this multi-part series on the basics of storage with a focus on storage performance. That said, a little math and physics is where we will start as we look at the basic building block of a storage environment: a hard disk drive. Wikipedia defines a hard disk drive as “a non-volatile storage device that stores digitally encoded data on rapidly rotating platters with magnetic surfaces.” Your computer, server, or VMware cluster uses hard drives to read and write data. Wikipedia also covers the history and atomic structure of a hard drive pretty well. For our purposes, the takeaway is that hard drives are physical objects, and as such, follow the laws of physics (duh) in the following measurable ways:
1.) Capacity, which is measured in bits or bytes and multiples thereof (MB, GB, TB, PB). This is how much data will fit on your disk, from simple text files to virtual disks, and everything in between. For example, if you have a 500GB SQL database, you darn well better have a hard drive that has a capacity of at least 500GB. This is a pretty simple concept, so I’ll leave it there for now.
2.) Performance, which is measured in a couple ways:
- at the disk itself in Input/Output Operations Per Second (IOPS) – a measure of how many read and write commands a disk can complete in a second
- interface throughput, measured in MBps or Gbps – a measure of the peak rate that a volume of data can be read from or written to disk
- latency – the amount of time between when you ask a disk (or storage system if you want to read ahead) to do something and when it can actually do it, very closely related to IOPS as you’ll read in a forthcoming article in this series.
Each disk, array, and storage system has its own fixed set of measurements given a specific configuration. Knowing the physical capabilities of your storage system as measured in the above ways, along with your systems’ storage requirements, will go a long way towards a successful design and implementation of your storage environment. The remaining parts of this series will take a look at these performance characteristics a bit more in-depth and explain what happens as you introduce factors like RAID, cache, data reduction techniques such as snapshots and deduplication, and varying workloads.
Please keep in mind that while I have designed and implemented a variety of DAS, NAS, and SAN technologies from a host of vendors including Dell, EMC, IBM, and NetApp, I am by no means a storage expert. The information I will provide is generalized, over-simplified, and does not consider varying approaches from different storage vendors. Nonetheless, I hope you find this useful information if you are designing a solution, troubleshooting a performance issue or preparing to make a storage purchase.
VMware vCenter collects performance statistics, tasks and events for historical performance analysis and auditing. The collection level and retention of performance statistics can be controlled through the vCenter GUI (see Administration | vCenter Server Settings | Statistics).
The level of statistics collection and retention periods can have a dramatic impact on your vCenter Server’s performance if not carefully planned and monitored. In particular, the vCenter database can grow quite large, and the database server needed to support the additional statistics has to grow with it (more disk I/O capacity, CPU, and memory). Fortunately, VMware has provided a vCenter database sizing tool within the vCenter client (see picture). This is all well and good for initial sizing, and my experience shows that vCenter’s sizing estimates are fairly accurate assuming the environment remains healthy.
I recently migrated an environment from vCenter 2.5 to 4.0 and in the process switched from a Windows 2003 32-bit vCenter host and a SQL 2005 server (remote to vCenter) to a Windows 2008 64-bit vCenter server with a SQL 2008 server (again, a remote SQL server). I experienced a few issues during the migration and thought I had worked through them all (I’ll post on those at a later date). However, after a bit of time I found that performance statistics for objects in the vCenter were missing or not rendering at an acceptable pace. Upon further investigation, I discovered warnings in the vCenter Service Status node indicating that performance rollups within the vCenter database were not taking place.
In a SQL-backed vCenter, statistics rollups are handled by the SQL Server Agent (note: if you are using SQL Server Express, statistics rollups are handled by vCenter itself as SQL Express does not offer SQL Server Agent jobs). KB 1003570 describes this process (it applies to vCenter 2.5, but the principles in it can be applied to 4.0). To troubleshoot and resolve the issue I opened SQL Server Management Studio and checked several items:
- Is the SQL Server Agent running?
- Are there statistics rollup jobs defined for SQL server agent?
- Are those jobs running?
In my case, the SQL Server Agent was running (you are prompted to configure this during the vCenter install). However, when I checked for the presence of rollup jobs, I discovered that only a Past Day job had migrated with the database to the new SQL server. Upon investigating the job history for that job I discovered that the job had not run since the migration (note to self: add these checks to your standard vCenter migration checklist).
To remediate the problem I completed the following steps:
- Remove the bad ‘Past Day stats rollupVirtualCenter’ job from the list of SQL Server Agent Jobs.
- Recreate the three standard stats rollup jobs. To recreate the jobs, find SQL scripts on your vCenter server in C:\Program Files (x86)\VMware\Infrastructure\VirtualCenter Server. The .sql scripts you’ll need are stats_rollup1_proc_mssql.sql, stats_rollup2_proc_mssql.sql, and stats_rollup3_proc_mssql.sql. Run these scripts in SQL Query Analyzer against your VirtualCenter Database in order from 1 to 3. These scripts should create the rollup jobs and their associated stored procedures (this procedure is detailed at http://communities.vmware.com/thread/123715).
- After recreating the jobs I took a backup of the vCenter database. The Past Day job soon kicked off to begin a stats rollup (this runs every 30 minutes by default).
I checked the server several hours later and discovered that rather than completing successfully, the Past Day job was still running and the drive holding my vCenter database transaction log was full. Back to the drawing board...
- I disabled the Past Week and Past Month rollup jobs to avoid job conflicts.
- I backed up the vCenter database and then performed a shrink of the log file to get it back down to size.
- The vCenter was running as a VM, so I was able to quickly increase its disk size and use diskpart from within the guest to extend the partition. The space required to process weeks of performance statistics is not included in the vCenter Database Sizing tool as it is assumed that the rollup/purge jobs will run as designed.
I wanted to see how bad the problem was before kicking off another job so I ran:
select count(*) from vpx_hist_stat1
against the vCenter database in SQL Query Analyzer. The query ran for several hours (never a good sign) and eventually returned well over 20 million rows of performance statistics (thanks to http://communities.vmware.com/message/1318736 for pointing me in this direction). I investigated options to truncate the tables (see above link), and also looked at a script from VMware KB 1000125: Purging old data from the database used by vCenter Server. In the end, I decided to try to let the Past Day stats job run.
I stopped the vCenter Server Service to prevent new statistics from being written to the database. I also disabled the Past Week and Past Month SQL Agent jobs to prevent job conflicts and then manually started the Past Day job. I had to stop the job several times as it filled the 100GB transaction log volume. A backup & shrink operation gave me back the space on the log volume. I saw about 300GB of transaction logs written over the course of this process, but the Past Day job eventually completed.
Finally, I re-enabled the Past Week and Past Month jobs and manually ran both of them (Past Week first, then Past Month), followed by a backup and shrink of the vCenter database. I was impressed with the performance increase I saw in the vCenter client. Lists and performance graphs rendered much faster than when stats rollups were not taking place.
It would be a good idea to include checking stats rollup job status and a count of rows from the vpx_hist_stat tables in the vCenter database in your regular maintenance tasks. For other vCenter Database best practices, check out breakout session PO2061 from VMworld 2008. If you did not attend or subscribe to VMworld, Scott Lowe covered the session in this post. A VMworld 2009 “online only” session entitled VM3237 vCenter Databases: Setup, Management and Best Practices was also offered (subscription required). I have not viewed this session so I cannot comment on its content.
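As a rough idea of what a recurring row-count check might look like, here is a Python sketch that works with any DB-API 2.0 cursor. Only vpx_hist_stat1 is named in this post; the other three table names, the helper names, the sample_id column, and the threshold are assumptions for illustration. An in-memory SQLite database stands in for the vCenter database so the sketch runs anywhere; against a real vCenter database you would pass a cursor from your SQL Server driver instead.

```python
import sqlite3

# vpx_hist_stat1 comes from the post; the other three names are assumed
STAT_TABLES = ("vpx_hist_stat1", "vpx_hist_stat2",
               "vpx_hist_stat3", "vpx_hist_stat4")


def stat_row_counts(cursor, tables=STAT_TABLES):
    """Return {table: row count} using any DB-API 2.0 cursor."""
    counts = {}
    for table in tables:
        # Table names come from the fixed tuple above, never from user input
        cursor.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cursor.fetchone()[0]
    return counts


def oversized(counts, threshold):
    """Names of tables whose row count exceeds a threshold you choose."""
    return sorted(name for name, rows in counts.items() if rows > threshold)


# Stand-in database with a hypothetical one-column schema
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in STAT_TABLES:
    cur.execute(f"CREATE TABLE {table} (sample_id INTEGER)")
cur.executemany("INSERT INTO vpx_hist_stat1 VALUES (?)",
                [(i,) for i in range(5)])

counts = stat_row_counts(cur)
print(counts)
print(oversized(counts, threshold=3))
```

Keep in mind that a full count on a badly bloated table can itself run for a long time, as it did in my case, so schedule a check like this outside of busy hours.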
I had some folks from our .NET development team come to me with a problem today – their ASP.NET code was taking forever to recompile after updates to the code base. But these guys are cool – they came with a proposed solution (most people who grace my office door are simply dropping off problems). Their solution? A RAMDisk mounted in a VMware Windows guest. I give them credit for a novel approach, but I could see some issues:
- What would happen if the balloon driver kicked in and demanded the memory the RAMDisk was running on?
- A reservation would get around the balloon driver issue, but there is no way to specifically target the 512MB of RAMDisk; all memory in the VM must be reserved.
- I’m a pragmatic Windows systems administrator at heart, with a heap of systems and processes to manage and monitor. I don’t want the additional burden of making sure the RAMDisk loads at boot, keeps a consistent image across boots, can be easily updated by new code pushes, and remains compatible with new VM hardware and Tools versions.
- A RAMDisk would take from what are already memory-constrained VMs, possibly hurting performance more than helping.
- If the disk subsystem is slow enough to get you thinking down the path of a RAMDisk, maybe it’s time for a new SAN…
I did some Googling around and couldn’t find any decent info. I did find a few hits on people running VMware guests entirely inside a RAMDisk – a concept that piqued my interest almost enough to think about trying it just to say I did... Have any of you experimented with a RAMDisk inside a VMware guest? If so, what did you take away from the setup? Was there a performance gain? Were there gotchas? Leave a comment if you have experience, guesses, or advice on this idea.





