Posts Tagged ‘ESX’
A reader named Mark contacted me today and asked if there was a way to reduce the size of the batch output from an ESXTOP run. He asks for good reason: depending on the number of VM’s on your host, the delay between ESXTOP samplings, and the number of samples you collect, using the All Stats option (-a) can yield a massive file in a short period of time. If you write it to a partition on your ESX Service Console you run the risk of filling the partition, and you can forget about actually being able to analyze the data in PERFMON or Excel. For example, on an ESX host running ~15 VM’s I produced 100MB worth of CSV using the -a switch, sampling every 15 seconds, for just under 2 hours. ESXTOP uses 5-second intervals by default; I used -d 15 to change the sampling delay. Had I gone with the default, my output would have been bigger.
To reduce the size of your output, you can change your sampling delay to something larger, say 30 seconds. I suppose you could also capture statistics when the host is not busy so you get fewer characters in the results, but that’s just being goofy.
A better way to reduce your ESXTOP output size is to selectively include only the statistics you are interested in, which is really what Mark was asking. After all, all statistics from ESXTOP can be too many statistics, and chances are you already know which stats you are interested in. Here’s how you can narrow down the collected stats for easier analysis and smaller output:
1. Enter ESXTOP in interactive mode on the Service Console by typing esxtop at the # prompt.
2. Switch to a component you are NOT interested in capturing statistics on by pressing the corresponding menu option (c: ESX cpu, m: ESX memory, d: ESX disk adapter, u: ESX disk device, v: ESX disk VM).
3. Press f while viewing the component you do not want to capture. A list of fields will be displayed. You can toggle the fields on and off by pressing the letter corresponding to each field; an * indicates that the field is on. Turn off all of the fields you don’t want to collect.
4. Repeat steps 2 and 3 for the remaining components, leaving only what you want to capture.
5. Switch to the component you want to capture in batch mode and repeat step 3, except this time enable the fields you want to capture.
6. Press W (capital W; it is case sensitive) to write out the ESXTOP configuration file. You can accept the default or create new configuration files. You may want to create a CPU-only config file, a memory-only config file, and so forth.
7. Press CTRL+C to stop ESXTOP.
8. Now invoke ESXTOP in batch mode, calling the updated or new configuration file you created in step 6 with the -c switch. Here’s an example:
   # esxtop -b -d 30 -n 480 -c .esxtopcpustats > /tmp/esxtop_cpu_stats.csv
   where .esxtopcpustats is an ESXTOP config file with only CPU stats, -d sets your capture interval to 30 seconds, and -n sets the number of samples to 480 (or 4 hours with a delay of 30 seconds).
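The relationship between the capture window, the delay (-d), and the sample count (-n) in the last step is simple arithmetic. Here is a minimal Python sketch of it; the function names are my own for illustration, and it only builds the command string rather than running esxtop.

```python
def samples_for(duration_hours, delay_seconds):
    """Return the esxtop iteration count (-n) that covers the capture window."""
    total_seconds = round(duration_hours * 3600)
    return total_seconds // delay_seconds


def batch_command(config_file, delay_seconds, duration_hours, out_path):
    """Build a batch-mode command line like the one in the last step above."""
    n = samples_for(duration_hours, delay_seconds)
    return f"esxtop -b -d {delay_seconds} -n {n} -c {config_file} > {out_path}"


# 4 hours at a 30-second delay works out to 480 samples.
print(batch_command(".esxtopcpustats", 30, 4, "/tmp/esxtop_cpu_stats.csv"))
```

Working backwards from the window you care about is less error-prone than guessing at -n, especially when you change the delay between runs.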
Once your capture is complete you can replay the sampling in ESXTOP using replay mode (-R), or you can copy the .csv to a Windows system and use PERFMON or Excel to analyze the stats. If using PERFMON or Excel you will notice that the system summary information displayed at the top of an interactive ESXTOP session is included in the output (console memory, console cpu, etc.). As far as I know, there is no way to disable this, nor would you want to as it includes the time stamp necessary to interpret your data.
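If you already have an oversized CSV, you can also trim it after the fact. The sketch below is one way to do that in Python, not an ESXTOP feature: it keeps the first column (the time stamp) plus any column whose header contains a keyword you supply. It assumes the header row uses Perfmon-style counter paths such as \\host\Physical Disk(vmhba32)\Commands/sec; the host name and sample values are made up for the demo.

```python
import csv
import io


def filter_columns(src, dst, keywords):
    """Copy an esxtop batch CSV from src to dst, keeping column 0 (time stamp)
    plus every column whose header contains one of the keywords.
    Returns the number of columns kept."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    keep = [0] + [i for i, name in enumerate(header)
                  if i > 0 and any(k in name for k in keywords)]
    writer.writerow([header[i] for i in keep])
    for row in reader:
        # Pad short rows so a truncated final sample does not raise.
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return len(keep)


if __name__ == "__main__":
    # Hypothetical two-counter capture, for illustration only.
    sample = io.StringIO(
        '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx1\\Physical Disk(vmhba32)\\Commands/sec",'
        '"\\\\esx1\\Memory\\Free MBytes"\n'
        '"09/14/2009 10:00:00","120.5","2048"\n'
    )
    out = io.StringIO()
    filter_columns(sample, out, ["Physical Disk"])
    print(out.getvalue())
```

For real files, open the source with open(path, newline="") and stream it through the same function so the whole capture never has to fit in memory.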
It is possible to use the vSphere CLI or the vSphere Management Assistant (vMA) to run RESXTOP, a version of ESXTOP designed for remote administration of ESXi or ESX. Note, however, that RESXTOP from the vSphere CLI only works from a Linux client. Using either of these tools will help you automate ESXTOP statistics collection from multiple hosts using customized configuration files.
Microsoft published a document named “Getting to Know Hyper-V: A Walkthrough from Initial Setup to Common Scenarios” last week. According to Microsoft, “this guide provides detailed step-by-step walkthroughs for testing Hyper-V on a pre-production environment. You can use this guide to become familiar with Hyper-V and the process of creating and managing virtual machines. Also included in this guide are useful scenarios that you can test to better understand how Hyper-V can address the business goals of your organization.” The document serves as a sort of evaluator’s guide for Hyper-V, stepping the reader through everything from enabling VT in BIOS through virtual networking. It also includes some sections on using snapshots, base virtual machine templates, and managing Hyper-V based virtual machines remotely with Hyper-V Manager. If you want more in-depth documentation on Hyper-V you can go through http://technet.microsoft.com.
As a side note, Microsoft has published the Microsoft Manual of Style for Technical Publications to help standardize technical documentation. I have long been a fan of Microsoft’s technical documentation for its easy to read style, although it sometimes lacks the depth that I desire.
While we’re on the topic of virtualization documentation, I have also been quite pleased with VMware’s technical documentation over the years, and have found it to be continually increasing in quality, providing very specific technical guidance and references to additional resources. I have also been pleased to see that VMware has improved delivery options for documentation. VMware offers several formats for documentation delivery, including web-based and PDF’s. Start with the Documentation Roadmap for a quick introduction to the available documentation, and where to find what you need.
You can find web-based vSphere documentation here: http://pubs.vmware.com/vsp40/. The web-based documentation is great for running searches on. All vSphere documentation can be accessed through this page: http://www.vmware.com/support/pubs/vs_pages/vsp_pubs_esx40_vc40.html. If you want to do a full grab of all of VMware’s documentation for an in-house repository (e.g. SharePoint), check out xtravirt’s VMware Documentation Downloader script.
If you are looking for quick and easy evaluator guide-type documentation from VMware, check out these resources: ESXi Installable and vCenter Server Setup Guide and the Virtualization Kit (registration required) at http://www.vmware.com/resources/wp/virtualization101_register.html.
There is a ton of less formal VMware documentation in several places:
- Technical resources and case studies here: http://www.vmware.com/resources/techresources/
- Proven practices around Strategy, Applications, Security, Management, and Availability at VIOPS.
- Official VMware Blogs at http://www.vmware.com/vmtn/planet/vmware/.
- Community blogs aggregated by VMware at Planet v12n: http://www.vmware.com/vmtn/planet/v12n/
- VMworld Recorded Sessions & Labs (VMworld 2009 Sessions available as of today, September 14th) at http://vmworld.com.
- The VMware Community Forums: http://communities.vmware.com/
- And, 3rd party books like Scott Lowe’s Mastering VMware vSphere 4.
Do you have other sources of virtualization documentation or easy methods of searching documentation to find exactly what you need when you need it? If so, leave a comment!
I needed to grab some stats from my ESX hosts for off-line analysis so I fired up my trusty ESXTOP, intent on using batch mode to capture a .csv formatted output. I started to manually select the counters I was interested in while working in ESXTOP interactive mode (you can save your selected counters to the esxtop configuration file with the ‘W’ command) and thought that there must be a better way. I found that better way in the VMware Performance Community: http://communities.vmware.com/docs/DOC-3930. There is now a -a switch that can be used to include ALL performance counters. I’m sold.
I wanted detailed information, so I decided on a 15 second capture interval to run for a 2 hour window. Here’s the command I used:
esxtop -a -b -d 15 -n 480 > /tmp/esxtopout.csv
where -a is for ALL, -b is for batch mode, -d is for delay, and -n is for the number of iterations ((60/15)*60*2). I wrote out the results to a .csv in /tmp. The resulting CSV weighed in at a whopping 100MB after 2 hours.
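The numbers above also give a rough way to budget disk space before a capture. This Python sketch assumes file size scales linearly with the number of samples on a host with a similar number of VM’s, which is only an approximation; the function names are mine.

```python
def per_sample_mb(total_mb, duration_hours, delay_seconds):
    """Average size of one sample, derived from an observed capture."""
    samples = round(duration_hours * 3600) // delay_seconds
    return total_mb / samples


def projected_mb(mb_per_sample, duration_hours, delay_seconds):
    """Estimated file size for a planned capture, assuming linear growth."""
    samples = round(duration_hours * 3600) // delay_seconds
    return mb_per_sample * samples


# Observed: 100MB for 480 samples (2 hours at a 15-second delay).
observed = per_sample_mb(100, 2, 15)
print(f"{observed * 1024:.0f} KB per sample")
print(f"{projected_mb(observed, 2, 30):.0f} MB for 2 hours at a 30-second delay")
```

Compare the projection against the free space on the target partition before you start, since /tmp on the Service Console is not a place you want to fill.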
The CSV can be analyzed in Excel (pivot tables work well for this) or in Windows Perfmon. I opened the log in Perfmon as I was after basic Min/Average/Max counters and Perfmon makes those easy to see. When adding the CSV log to Perfmon, you are prompted to select counters. I added all instances of Commands/sec, Reads/sec, and Writes/sec from Physical Disk (I was gathering some IOPS counts for a new storage proposal). I got a bit more than I bargained for: a mostly unresponsive Perfmon window and the ugliest darn graph I’ve ever seen.
Switching from a graph view to the report view allows you to easily view and remove specific counters that you are not interested in, or open the Properties of the data set, switch to the data tab and bulk select counters that you want to remove. I was not interested in vmhba1:x, specific VM’s or worlds, so I killed all of those, leaving just the base iSCSI device (vmhba32 in my case).
After some cleanup the graph looked a bit better and, more importantly, I was able to easily read my Min/Average/Max stats.
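If Perfmon chokes on the file, the same Min/Average/Max summary can be pulled straight from the CSV. This is a minimal Python sketch rather than anything ESXTOP provides; it assumes Perfmon-style counter paths in the header row, and the host name and values in the demo are invented.

```python
import csv
import io
from statistics import mean


def summarize(src, keyword):
    """Return {column header: (min, average, max)} for every column whose
    header contains keyword. Non-numeric or missing cells are skipped."""
    reader = csv.reader(src)
    header = next(reader)
    cols = {i: name for i, name in enumerate(header)
            if i > 0 and keyword in name}
    values = {i: [] for i in cols}
    for row in reader:
        for i in cols:
            try:
                values[i].append(float(row[i]))
            except (ValueError, IndexError):
                continue
    return {cols[i]: (min(v), mean(v), max(v))
            for i, v in values.items() if v}


if __name__ == "__main__":
    # Hypothetical three-sample capture, for illustration only.
    sample = io.StringIO(
        '"(PDH-CSV 4.0) (UTC)(0)","\\\\esx1\\Physical Disk(vmhba32)\\Commands/sec"\n'
        '"09/14/2009 10:00:00","100"\n'
        '"09/14/2009 10:00:15","200"\n'
        '"09/14/2009 10:00:30","300"\n'
    )
    for name, stats in summarize(sample, "Commands/sec").items():
        print(name, stats)
```

Passing a narrow keyword such as the adapter name keeps the per-VM and per-world columns out of the summary, which is the same cleanup I did by hand in Perfmon.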
Here are the takeaways -
- ESXTOP is a powerful utility for performance monitoring
- All stats (-a) can result in a huge file – use it wisely in batch mode; else use interactive mode to select your counters and write them to the user-defined configuration file. Invoke the config file with the -c option when running in batch mode.
- Consider using vscsiStats for more granular reporting.
- ESXTOP physical disk stats do not include NFS volumes.
Do you use other tools or methods to collect basic disk IO counters for storage sizing purposes? If so, leave a comment describing your approach!
I have been meaning to write this up for a while; Scott Drummonds’ ‘Love Your Balloon Driver’ post today at his Virtual Performance blog gave me a nice reminder. I actually caught a sneak peek at the graphs with an explanation from Scott at his instructor-led lab at VMworld 2009. Scott calls out that the only workload they found to suffer from balloon driver activity is Java. The reason is that Java itself runs in a VM, so the guest OS cannot properly determine which pages should be swapped out when the balloon driver calls for it.
My experience leads me to agree with Scott and the whitepaper he cites: in a properly designed and equipped environment, the balloon driver is not detrimental to almost any workload, up to a point. However, I recently discovered at a client site that the balloon driver can cause significant issues when the environment is poorly designed and under-sized. Here’s the background:
I was called into an already established environment where the client was running on an older blade with VMware ESX 3.5. The blade maxed out at 16GB RAM and had dual dual-core CPU’s with no hope for an upgrade. On the blade was a single guest VM running Windows 2003 with SQL 2005, in its full 32-bit glory. The VM was configured with 4 vCPU’s and 16GB of memory. Some of you can probably already guess where this is going….
The x86 Windows guest had PAE configured, and SQL took advantage of AWE to use the additional memory beyond the 4GB limit of a 32-bit system. Additionally, the Windows guest had the /3GB switch enabled in boot.ini. Finally, as per SQL best practices, the ‘Lock Pages in Memory‘ permission was granted to the SQL Server service account. What the guest was left with was 1GB of kernel mode memory and 15GB of User Mode/Extended addressable memory.
And here’s the problem. The client was using ESX, not ESXi, so the Service Console required memory. In this case, the service console had approximately 512MB allocated to it. Furthermore, VM’s require some overhead on ESX to run. The memory overhead consumed by a Windows guest on ESX 3.5 with 4 vCPU and 16GB of memory is a bit more than 512MB. On a properly sized ESX server with multiple similar guests/workloads, you could probably gain much of the overhead back through transparent page sharing; but in this case I had a 1:1 P2V ratio. If you are any good at math you see that the environment is running about 1GB short of memory. A quick check of the balloon driver stat in vCenter showed that the balloon driver was constantly active and demanding about 1GB back from the guest… constantly.
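For anyone who wants the math spelled out, here is a small Python sketch of the memory budget. It uses the round figures from this post (512MB for the Service Console, roughly 512MB of VM overhead) and ignores whatever the VMkernel itself needs, so treat the result as a lower bound. The function name is mine.

```python
def memory_shortfall_mb(host_mb, vm_mb, console_mb, overhead_mb):
    """How far the configured demand exceeds physical RAM, in MB.
    Returns 0 when everything fits."""
    demand = vm_mb + console_mb + overhead_mb
    return max(0, demand - host_mb)


# 16GB blade, 16GB guest, 512MB Service Console, ~512MB VM overhead.
short = memory_shortfall_mb(16 * 1024, 16 * 1024, 512, 512)
print(f"Short by about {short} MB")
```

That shortfall lines up with the amount the balloon driver was trying to reclaim from the guest.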
Under normal circumstances this might not be an issue, but in this case the Windows guest was being absolutely punished. The guest CPU’s were pegged at 100% with an excessive amount of kernel time, which often indicates IO issues. And indeed I did experience terrible disk and network performance on the guest. At the root of the problem is this: the Lock Pages in Memory permission allows SQL to get a firm grasp on the user mode memory available to it (15GB) and lock it up. This left the already starved (because of the /3GB switch in boot.ini) guest kernel, with its 1GB, as the only thing the balloon driver could really swap out.
The client suggested a reservation of 16GB on the VM, knowing that memory reservations prevent balloon driver activity. I calmly asked them to back away from the keyboard as I explained that if a starved guest was bad, a starved Service Console would be much worse. In the end the fix was quite easy: I convinced the customer that they should reduce the amount of memory allocated to the guest by about 1GB, enough to let the 512MB SC and the 512MB of overhead run without contention. I was able to show them the difference between allocated and active memory in vCenter; the 1GB being surrendered was not really being actively used, SQL just had it locked up. In fact, surrendering the 1GB of memory back to ESX breathed new life into the guest VM, bringing its performance back in line with expectations.
Ideally, I would have brought in a bigger ESX server that could serve additional VM’s, driving greater levels of efficiency across the environment. It just wasn’t an option for the client in this case. In the end, the problem was fixed and I was reminded just how fun it can be to explain some of these backwards sounding virtualization concepts to customers – fewer vCPU’s can lead to better performance of guests, less guest memory can fix performance issues, and increasing the quantity of similar guests on a host can drive better performance to a point because of transparent page sharing.
Stay tuned over the next few weeks as I digest and write on my VMworld experience – from VMUG activities to Paul Maritz’s press conference announcing the vCloud Express, and plenty of great sessions in between. Like many of you, I returned from VMworld with quite a backlog of work but I’ll do my best to squeeze in some posts and tweets.
I have been pulling my hair out with a small VI3 implementation running against an IBM DS3300 iSCSI array. Performance, for lack of a better term, sucked. Granted, the DS3300 is not an enterprise level workhorse of a storage system, but it fit the budget. Read performance was decent from the array, but write performance was terrible, maxing out at 10Mbps throughput with insanely high latencies on long writes when the system was under load. This led to some long P2V operations, poor guest performance, and some questions from the project sponsors on why I couldn’t make the environment sing.
The system was configured with a single controller with dual GigE NIC’s. The controller had 512MB of battery backed cache (there is also a 1GB cache upgrade option available). I wrote off some of the poor performance to a single controller with a less-than-optimal amount of cache; blamed the SAS-controller-to-SATA-disk command translation overhead; cringed at the 6 disk RAID5 configuration; and engaged in some self doubting. I convinced the powers that be that we were IO constrained and got some funds to fill out the 3U chassis to a full 12 SATA disks, and reconfigured the array as a RAID10. Performance gains were almost unnoticeable with these changes. In addition, I did some basic troubleshooting of the network environment, verifying multiple paths to the storage, setting Flow Control on the switches to receive only, and double-checking my iSCSI initiator settings. Note: The DS3300 is only supported with the ESX software initiator.
I found documentation on the DS3300 to be lacking, but did discover that the Dell MD3000i is based on the same LSI Engenio array. Some Googling on the Dell solution led to the ‘SMcli’ command line interface for both arrays. The commands are slightly different for the Dell and IBM. The links to the IBM CLI documentation were broken, so I had to do a bit of trial and error to get the commands right. I used the Dell documentation as a starting point. (Rant: Seriously, IBM? Can you make your documentation any harder to get through? Is it a Redbook, is it an Engineering Whitepaper, is it a support document, is it a case study? And why can I only find these with complex Google searches, not on your own product pages, and why can’t you name your documents intelligently, not with some random string of characters?)
Moving on… I received an automated alert from the DS3300 about an incomplete battery learn cycle. Using the IBM Storage Manager GUI I generated a ‘Storage Subsystem Profile’ from the Support tab to check the battery status. In the profile I discovered that while write cache was enabled, it had a status of “Enabled (Suspended)”. Ah ha! Now I’ve got some decent Google material that led me to this: http://communities.vmware.com/thread/195838. Hot damn I love the VMware Community Forums!
It turns out that in a single-controller configuration the setting for cache mirroring remains enabled by default. Because there is no 2nd controller to mirror to, the array suspends write caching. This is probably a safety thing: loss of high availability on the controllers puts data in cache at risk should the only controller fail. I weighed my options and decided that the poor performance I was experiencing outweighed the HA concerns, so I disabled cache mirroring, which lets write caching resume on the array, using this command:
c:\program files\ibm_ds4000\client>smcli -n <ARRAYNAME> -c "set allLogicalDrives mirrorEnabled=false;"
And then followed with this for good measure:
c:\program files\ibm_ds4000\client>smcli -n <ARRAYNAME> -p <arraypassword> -c "set allLogicalDrives writeCacheEnabled=true;"
The results were immediately noticeable:
The screen shot is from Veeam Monitor Free Edition, taken during 4 concurrent V2V operations from Hyper-V to VMware. With the write cache fully functional, disk usage peaked at 54MBps, latency dropped to about 6ms, and my blood pressure dropped a few notches.
While poking around the CLI I also found that you can dump performance stats from the array (performance is otherwise hard to find on the thing) using this command:
C:\Program Files\IBM_DS4000\client>smcli -n <ARRAYNAME> -c "set session performanceMonitorInterval=5 performanceMonitorIterations=120;save storageSubsystem performanceStats file=\"c:\\ds3300perfstats.csv\";"
This will give you a 10 minute record of performance from the array (120 iterations at a 5 second interval), which you can analyze using Excel. The Dell Enterprise Center TechCenter Wiki has a great write-up on how to efficiently analyze the data from this command here: http://www.delltechcenter.com/page/MD3000i+Performance+Monitoring, complete with a YouTube video that walks you through the process.
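If you want a capture window other than 10 minutes, the iteration count is just the window divided by the interval. The Python sketch below builds the script string passed to -c using only the two statements from the command above. The function name is mine, and quoting or escaping the result for your shell is left to you.

```python
def perf_script(interval_seconds, duration_minutes, csv_path):
    """Build the SMcli script text for a performance capture of the given
    length. Uses only the statements shown in the command above."""
    iterations = duration_minutes * 60 // interval_seconds
    return (
        f"set session performanceMonitorInterval={interval_seconds} "
        f"performanceMonitorIterations={iterations};"
        f'save storageSubsystem performanceStats file="{csv_path}";'
    )


# 10 minutes at a 5-second interval gives 120 iterations.
print(perf_script(5, 10, r"c:\\ds3300perfstats.csv"))
```

I have only run the exact command shown above, so treat other interval and iteration values as untested.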
I am beginning to think that the DS3300 (and MD3000i) may actually be a viable starter solution for SMB’s starting out on a virtualization project. But I would recommend the cache upgrade, a 2nd controller, SAS disks instead of SATA to eliminate the SAS-to-SATA translation overhead, and a larger number of faster disks instead of fewer slower disks so you can drive throughput and IOPS to a higher level.
Have any of you deployed the DS3300 or MD3000i (or the generic LSI solution)? Do you have any performance tuning tips for these arrays? If so, share in the comments!
VMware vExpert and fellow Northern Virginian, Ken Cline, has posted an excellent article on his Ken’s Virtual Reality blog that aims to demystify VMware networking. The article, the first in a new series by Ken, provides an overview of networking in an ESX/ESXi environment and breaks down the intricacies of the vSwitch and VLANs. The article comes complete with some nifty diagrams to help make sense of the topic. The timing of this article is great for me as it helps to frame my thoughts as I delve into the design of my latest VMware project on an IBM BladeCenter with IP SAN storage.
Great article, Ken! I look forward to reading the rest of the series.
One more post to wrap up the nonsense with my DL380 G3 ESX servers….
Vincent Vlieghe noted that you must make a couple changes to your DL380 G3’s for ESX to work correctly. His post was written back in 2006 when we were still working with ESX 2.x, but the same appears to be true of ESX 3.5 RTM (Updates are not supported on this hardware per the HCL). The changes you must make to BIOS are:
For stable operation on these systems, ESX Server requires a BIOS MPS Table Mode setting of Full Table APIC. With the exception of the specific systems referenced below, the following BIOS settings must be applied in order if available:
- System Options > OS Selection: Select Windows 2000.
- Advanced Options > MPS Table Mode: Select Full Table APIC.
- When presented with multiple Windows options (Windows 2000, Windows Server 2003, Windows .NET, and so on) select Windows 2000. If both BIOS settings are available and can be modified, both must be set correctly. You should confirm these settings after any BIOS upgrade operation.
I have seen other references that say that you should also disable hyperthreading on this platform, but I was able to successfully run with Hyperthreading enabled with no performance degradation or stability issues. I hope this information is helpful to those of you still running these dinosaurs!
I wrote some time back about networking problems with a clean install of ESX 3.5 U3 on an HP DL380 G3 server in a lab environment. A simple downgrade to ESX 3.5 RTM corrected the issue and I didn’t think much about it. One of the servers in the lab died and I went about the business of rebuilding it. Having learned my lesson, I started with an ESX 3.5 RTM install and then patched to Update 3 plus other applicable updates. Much to my chagrin, the server began crapping out on me randomly. Some reboots, some networking issues, and other assorted not so good things. Now the DL380 G3 is not the spring chicken it used to be, so I assumed some faulty hardware was probably to blame. Some diagnostics and log reviews yielded no hardware issues.
On a whim, I decided to check the VMware HCL to see if the DL380 G3 was still on the list of compatible servers for ESX. Now, I had checked, or rather ‘remembered’ checking, the HCL before that first problematic install, but a recheck never hurts. When I arrived at the VMware HCL page I saw the same old trusty PDF link with a slightly newer revision date than my previous visit. I was pleasantly surprised when I clicked the PDF link to find that I was redirected to a searchable, filterable forms-based version of the HCL. Nice! Let’s do this thing….
I’m a little lazy, so I simply used a keyword search to look up ‘DL380 G3’. Presto-chango: I’ve got results, and I like what I see:
Search Results for DL380 G3 on the VMware HCL
My eyes jump right to ESX 3.5 – Supported, on my platform, no further questions your honor. Close the old browser window and move on with my life, my life being troubleshooting this darn server.
A few hours later I am still struggling with the server and turn to eBay for salvation. “If you can’t beat em, cheat em,” my grandfather used to say. I’ll find new hardware for my lab. I identified some other hunk of junk that just might work and decided to check the HCL for it. That’s when it jumped out at me: there are Update versions included in the HCL and I had been too quick to see it on my DL380 G3 search. Back to the HCL.
This time I just do a search for ‘DL380’, leaving off the Generational notation, and get the following:

Search Results for DL380 from the VMware HCL
The ProLiant DL380 G5 with Quad-core Intel Xeon processors lists ESX 3.5 U3, ESX 3.5 U2, and ESX 3.5 U1 as supported releases, along with the RTM ESX 3.5. The Update versions are not listed for the G3 or G4. After some self-deprecating curses and a reinstall of ESX 3.5 Update-nada, stability returned.
The lesson learned: double-check the HCL (or, if you are a little slow like me, a triple-check doesn’t hurt). The HCL is major version and Update-revision sensitive. And not all models are treated equally. You’ll notice in the picture to the left that the DL380 G5 has different supported releases depending on the CPU Model.
Also, keep in mind that you need to verify that all components of your VMware infrastructure are on the HCL from Servers and Systems to IO Devices, and Storage/SAN. The VMware HCL site offers some basic tips for searching here: http://www.vmware.com/resources/compatibility/help.php.
Here’s the real take-away: The VMware HCL is there for a reason. Sure, you might be able to get something that is not on the HCL to work, but you may experience instability along the way. In the event that you are running a non-HCL system you may also find that VMware Support may be limited in what they can do for you.
VMware released version 1.5 of the VI Toolkit for Windows – the PowerShell management and reporting tool of choice for many VMware administrators. The new version carries build number 142961. You can download v1.5 here: http://blogs.vmware.com/vipowershell/. The update includes some 32 new cmdlets, including ones for getting/setting NTP settings on ESX, getting/setting Advanced configuration options on ESX, getting/setting ESX Firewall settings, and the ability to modify DRS rules using PowerShell. Existing cmdlets have also been updated with new parameters, and several fixes have been introduced. Check out the release notes here: http://www.vmware.com/support/developer/windowstoolkit/wintk15/windowstoolkit15-200901-releasenotes.html.
There are plenty of examples on the Internet to get you started with the VI Toolkit for Windows. Check out these sites to get started:
Start at the VMware Community site for the Windows Toolkit for great examples and a little help from some friends: http://communities.vmware.com/community/developer/windows_toolkit/
There are also some good example scripts and resources floating around, such as:
http://vmetc.com/2008/08/27/powershell-scripting-examples-for-vmware-virtual-infrastructure/
http://www.peetersonline.nl/
http://www.ivobeerens.nl/?p=106
http://www.vmguru.com/
Not a hard-core scripter? Grab this handy tool for a little GUI on your PowerShell, and extend it with the VMware Infrastructure PowerPack 2.0
What tools or examples are you using to extend the power of PowerShell into your Virtual Infrastructure? Leave a comment to share!
Between budget cuts and New Year’s resolutions, improving your security posture is probably near the top of your to-do list. Much has been made of security concerns in a virtual environment, but it is always good to re-visit your configurations and make sure they are still on par with recommended best practices. I began re-reviewing VI security best practices after reading a post by Bob Plankers at The Lone SysAdmin (Bob has been on my reading list for years; he has a great style and always brings fresh insights) on why you would want a second super-user account on your ESX servers.
We certainly all have our own opinions and operations procedures when it comes to configuring and hardening our environments, but I decided to take a look at what the experts had to say on this particular subject and other basic build and hardening recommendations. Here is what I found:
VI3.5 Security Hardening Whitepaper
Defense Information Systems Agency (DISA) ESX Server Security Technical Implementation Guide
As a side note, DISA publishes many STIG’s at http://iase.disa.mil/stigs/. Your tax dollars paid for these, so you might as well check them out.
NSA VMware ESX Server 3 Configuration Guide
There are also numerous tips and scripts for locking down your virtual infrastructure in the VMware Community Forums (Start here: http://communities.vmware.com/message/941372).
So back to the question of second super user accounts: It seems that the best practice is to create a second user account with sufficient access to the console, grant that user SUDO privileges, and then disable the default root account.


