CPU Ready Revisited – Quick Reference Charts

January 28, 2013 by Josh Townsend

I’ve written in the past about how high CPU Ready values can cause performance problems in VMware vSphere environments. For those who don’t know, CPU Ready is a measure of the amount of time that a guest VM is ready to run but has to wait, because the VMware ESXi CPU scheduler on the host is busy doing work for other VMs and cannot immediately allocate cycles to it. CPU Ready values are exposed through ESXTOP and in the vSphere Client.

I’m often called into customer environments to do performance troubleshooting, and CPU Ready is one of the first performance measurements I check in my first few minutes in the environment (I also look at memory balloon driver metrics, disk latency, CPU utilization and memory utilization of both hosts and guest VMs). Unfortunately, I’m often called in after the excrement has made physical contact with a hydro-electric powered oscillating air current distribution device, and the customer is demanding a quick fix. Checking a few basic metrics in the vSphere Client is often enough to put me on the trail of the problem.

Note that the summation value is shown on hosts, guest VMs and guest vCPUs in the vSphere Client. The different counters have slightly different meanings. Host CPU Ready might be a bit higher than an individual guest VM’s CPU Ready counter, for example. Host CPU Ready is a good value to look at if all the VMs are suffering performance issues. If just one or a few VMs are suffering performance issues, look at the guest VM CPU Ready value. The guest VM CPU Ready value is a summation of the CPU Ready of each vCPU on the guest.

As a rule of thumb, a Real-Time CPU Ready value of 10% or greater on a vCPU indicates declining performance for server workloads. (I usually go with a slightly lower value for VMware View virtual desktops (VDI), as users are much more likely to perceive CPU Ready on desktops they are actively using than on a server they reach through a client-server setup.) Theoretically, on VMs with multiple vCPUs the guest VM counter is safe to go beyond 10% so long as the per-vCPU counter is under 10%: for a 2 vCPU VM the whole-VM CPU Ready value can hit 20%, for a 4 vCPU VM 40%, and so on, before we hit that 10% rule of thumb. In practice, because the ESX CPU scheduler has to co-schedule all vCPUs on a VM, bigger VMs are more prone to CPU Ready on hosts with CPU contention, which probably offsets the theoretical vCPU percentages.
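To make the per-vCPU arithmetic concrete, here is a minimal Python sketch of the rule of thumb. The function name `hits_rule_of_thumb` and its parameters are illustrative choices, not anything from vSphere. It simply averages the VM-level value across vCPUs, so it cannot see one hot vCPU hiding behind idle ones, and it ignores the co-scheduling effect mentioned above.

```python
def hits_rule_of_thumb(vm_ready_pct, num_vcpus, per_vcpu_limit_pct=10.0):
    """Return True if the VM-level CPU Ready %, averaged per vCPU, reaches the limit."""
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")
    return vm_ready_pct / num_vcpus >= per_vcpu_limit_pct


print(hits_rule_of_thumb(15.0, 1))  # True: 15% on a single vCPU
print(hits_rule_of_thumb(15.0, 2))  # False: 7.5% per vCPU
print(hits_rule_of_thumb(40.0, 4))  # True: 10% per vCPU
```

For VDI, a lower `per_vcpu_limit_pct` could be passed in; the right value depends on the environment.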

The problem, however, is that the vSphere Client shows CPU Ready as a summation of milliseconds of CPU Ready for the sampling period. A summation of milliseconds is not always an easy value to wrap your head around, as the impact of the number changes depending on the VM configuration and on the charting period (view) and its sampling interval. In some views a summation value of 2,000 can indicate problems, and in other views 1,000,000 may be OK.

In the vSphere Client, the charts are shown with an update interval, and the summation values are for the entire interval. For the ‘Realtime’ view, we’re really looking at 20-second time slices. On the Past Day view, the interval is 5 minutes (300 seconds). Past Week is 30 minutes (1,800 seconds), Past Month is 2 hours (7,200 seconds), and Past Year is 1 day (86,400 seconds).
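As a small sketch of why the same summation value means different things in different views, the intervals can be put in a lookup table. The names `CHART_INTERVALS_S` and `max_summation_ms` are made up for illustration; the second one assumes a single vCPU, which cannot be ready for longer than the interval itself.

```python
# Update intervals (in seconds) for each vSphere Client chart view.
CHART_INTERVALS_S = {
    "Realtime": 20,
    "Past Day": 300,
    "Past Week": 1800,
    "Past Month": 7200,
    "Past Year": 86400,
}


def max_summation_ms(view):
    """Largest ready summation (ms) a single vCPU could accrue in one sample of a view."""
    return CHART_INTERVALS_S[view] * 1000


for view, interval_s in CHART_INTERVALS_S.items():
    print(f"{view:<10} {interval_s:>6} s  {max_summation_ms(view):>11,} ms")
```

A summation of 20,000 ms would be a vCPU that never got to run in the Realtime view, but only a sliver of a Past Year sample.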

A little math is needed to convert the summation of milliseconds value to a percentage value – an easier number to understand and compare. I covered how to convert the summation value to a percent here: High CPU Ready, Poor Performance. VMware one-up’d me ( 😉 ) by publishing a KB article a couple years ago that presented the same formula for converting summation in the vSphere Client to a percentage. The formula goes like this:

$\frac {\text{CPU Ready Summation in milliseconds}}{(\text{Chart Default Update Interval in Seconds} \times 1000)} \times 100 = \text{CPU Ready \%}$
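The formula translates directly into a few lines of Python. This is only a sketch: the function name `ready_percent` and its parameter names are illustrative and do not come from vSphere or the KB article.

```python
def ready_percent(summation_ms, interval_s):
    """Convert a CPU Ready summation (ms) into a percentage of the update interval."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return summation_ms / (interval_s * 1000) * 100


# A summation of 2000 ms in the Realtime view (20-second interval)
print(f"{ready_percent(2000, 20):.3f}%")   # 10.000%
# The same summation in the Past Day view (300-second interval)
print(f"{ready_percent(2000, 300):.3f}%")  # 0.667%
```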

As somebody who struggles with numbers, I don’t want a formula, I want easy. To save me and my customers from my slow touch-point math, I made this quick set of reference tables to determine at a glance if the CPU Ready summation value I saw in the vSphere Client was something to worry about. I have tables for 1 vCPU, 2 vCPU, and 4 vCPU VMs.

Note – if you use ESXTOP you see CPU Ready (%RDY) as a percentage, in realtime – no conversion necessary. If you want to capture ESXTOP realtime CPU ready and then analyze it later, use ESXTOP Batch Mode then analyze in Excel or Windows Perfmon. ESXTOP counters are explained here: Interpreting ESXTOP Statistics, and here: ESXTOP Performance Counters.

1 vCPU
Chart View		Realtime	Past Day	Past Week	Past Month	Past Year
Summation Value	Update Interval (sec)	20	300	1800	7200	86400
100		0.500%	0.033%	0.006%	0.001%	0.000%
500		2.500%	0.167%	0.028%	0.007%	0.001%
1000		5.000%	0.333%	0.056%	0.014%	0.001%
1500		7.500%	0.500%	0.083%	0.021%	0.002%
2000		10.000%	0.667%	0.111%	0.028%	0.002%
2500		12.500%	0.833%	0.139%	0.035%	0.003%
3000		15.000%	1.000%	0.167%	0.042%	0.003%
3500		17.500%	1.167%	0.194%	0.049%	0.004%
4000		20.000%	1.333%	0.222%	0.056%	0.005%
4500		22.500%	1.500%	0.250%	0.063%	0.005%
5000		25.000%	1.667%	0.278%	0.069%	0.006%
5500		27.500%	1.833%	0.306%	0.076%	0.006%
6000		30.000%	2.000%	0.333%	0.083%	0.007%
6500		32.500%	2.167%	0.361%	0.090%	0.008%
7000		35.000%	2.333%	0.389%	0.097%	0.008%
10000		50.000%	3.333%	0.556%	0.139%	0.012%
15000		75.000%	5.000%	0.833%	0.208%	0.017%
20000		100.000%	6.667%	1.111%	0.278%	0.023%
50000		NA	16.667%	2.778%	0.694%	0.058%
75000		NA	25.000%	4.167%	1.042%	0.087%
100000		NA	33.333%	5.556%	1.389%	0.116%
250000		NA	83.333%	13.889%	3.472%	0.289%
500000		NA	NA	27.778%	6.944%	0.579%
1000000		NA	NA	55.556%	13.889%	1.157%
1500000		NA	NA	83.333%	20.833%	1.736%
2000000		NA	NA	NA	27.778%	2.315%
2500000		NA	NA	NA	34.722%	2.894%
3000000		NA	NA	NA	41.667%	3.472%
4000000		NA	NA	NA	55.556%	4.630%
5000000		NA	NA	NA	69.444%	5.787%
6000000		NA	NA	NA	83.333%	6.944%
7000000		NA	NA	NA	97.222%	8.102%
8000000		NA	NA	NA	NA	9.259%
9000000		NA	NA	NA	NA	10.417%
10000000		NA	NA	NA	NA	11.574%

2 vCPU
Chart View		Realtime	Past Day	Past Week	Past Month	Past Year
Summation Value	Update Interval (sec)	20	300	1800	7200	86400
100		0.500%	0.033%	0.006%	0.001%	0.000%
500		2.500%	0.167%	0.028%	0.007%	0.001%
1000		5.000%	0.333%	0.056%	0.014%	0.001%
1500		7.500%	0.500%	0.083%	0.021%	0.002%
2000		10.000%	0.667%	0.111%	0.028%	0.002%
2500		12.500%	0.833%	0.139%	0.035%	0.003%
3000		15.000%	1.000%	0.167%	0.042%	0.003%
3500		17.500%	1.167%	0.194%	0.049%	0.004%
4000		20.000%	1.333%	0.222%	0.056%	0.005%
4500		22.500%	1.500%	0.250%	0.063%	0.005%
5000		25.000%	1.667%	0.278%	0.069%	0.006%
5500		27.500%	1.833%	0.306%	0.076%	0.006%
6000		30.000%	2.000%	0.333%	0.083%	0.007%
6500		32.500%	2.167%	0.361%	0.090%	0.008%
7000		35.000%	2.333%	0.389%	0.097%	0.008%
10000		50.000%	3.333%	0.556%	0.139%	0.012%
15000		75.000%	5.000%	0.833%	0.208%	0.017%
20000		100.000%	6.667%	1.111%	0.278%	0.023%
50000		NA	16.667%	2.778%	0.694%	0.058%
75000		NA	25.000%	4.167%	1.042%	0.087%
100000		NA	33.333%	5.556%	1.389%	0.116%
250000		NA	83.333%	13.889%	3.472%	0.289%
500000		NA	NA	27.778%	6.944%	0.579%
1000000		NA	NA	55.556%	13.889%	1.157%
1500000		NA	NA	83.333%	20.833%	1.736%
2000000		NA	NA	NA	27.778%	2.315%
2500000		NA	NA	NA	34.722%	2.894%
3000000		NA	NA	NA	41.667%	3.472%
4000000		NA	NA	NA	55.556%	4.630%
5000000		NA	NA	NA	69.444%	5.787%
6000000		NA	NA	NA	83.333%	6.944%
7000000		NA	NA	NA	97.222%	8.102%
8000000		NA	NA	NA	NA	9.259%
9000000		NA	NA	NA	NA	10.417%
10000000		NA	NA	NA	NA	11.574%

4 vCPU
Chart View		Realtime	Past Day	Past Week	Past Month	Past Year
Summation Value	Update Interval (sec)	20	300	1800	7200	86400
100		0.500%	0.033%	0.006%	0.001%	0.000%
500		2.500%	0.167%	0.028%	0.007%	0.001%
1000		5.000%	0.333%	0.056%	0.014%	0.001%
1500		7.500%	0.500%	0.083%	0.021%	0.002%
2000		10.000%	0.667%	0.111%	0.028%	0.002%
2500		12.500%	0.833%	0.139%	0.035%	0.003%
3000		15.000%	1.000%	0.167%	0.042%	0.003%
3500		17.500%	1.167%	0.194%	0.049%	0.004%
4000		20.000%	1.333%	0.222%	0.056%	0.005%
4500		22.500%	1.500%	0.250%	0.063%	0.005%
5000		25.000%	1.667%	0.278%	0.069%	0.006%
5500		27.500%	1.833%	0.306%	0.076%	0.006%
6000		30.000%	2.000%	0.333%	0.083%	0.007%
6500		32.500%	2.167%	0.361%	0.090%	0.008%
7000		35.000%	2.333%	0.389%	0.097%	0.008%
10000		50.000%	3.333%	0.556%	0.139%	0.012%
15000		75.000%	5.000%	0.833%	0.208%	0.017%
20000		100.000%	6.667%	1.111%	0.278%	0.023%
50000		NA	16.667%	2.778%	0.694%	0.058%
75000		NA	25.000%	4.167%	1.042%	0.087%
100000		NA	33.333%	5.556%	1.389%	0.116%
250000		NA	83.333%	13.889%	3.472%	0.289%
500000		NA	NA	27.778%	6.944%	0.579%
1000000		NA	NA	55.556%	13.889%	1.157%
1500000		NA	NA	83.333%	20.833%	1.736%
2000000		NA	NA	NA	27.778%	2.315%
2500000		NA	NA	NA	34.722%	2.894%
3000000		NA	NA	NA	41.667%	3.472%
4000000		NA	NA	NA	55.556%	4.630%
5000000		NA	NA	NA	69.444%	5.787%
6000000		NA	NA	NA	83.333%	6.944%
7000000		NA	NA	NA	97.222%	8.102%
8000000		NA	NA	NA	NA	9.259%
9000000		NA	NA	NA	NA	10.417%
10000000		NA	NA	NA	NA	11.574%
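For summation values that fall between the rows above, a row can be regenerated with a short Python sketch. It assumes, based on the values in the tables, that a cell reads NA whenever the result would exceed 100% of the interval; the names `INTERVALS_S` and `table_row` are hypothetical.

```python
INTERVALS_S = [20, 300, 1800, 7200, 86400]  # Realtime, Past Day, Past Week, Past Month, Past Year


def table_row(summation_ms):
    """Return the five percentage cells for one summation value, in chart-view order."""
    cells = []
    for interval_s in INTERVALS_S:
        pct = summation_ms / (interval_s * 1000) * 100
        # More than 100% of the interval is not possible for one vCPU, so show NA.
        cells.append("NA" if pct > 100 else f"{pct:.3f}%")
    return cells


print(table_row(250000))
# ['NA', '83.333%', '13.889%', '3.472%', '0.289%']
```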

The color coding may seem a bit odd and arbitrary, so here are some real-world numbers to help clarify. If you hit 10 seconds of CPU Ready out of every 20 seconds, that’s a big deal. 40,000 seconds out of a day’s 86,400 seconds? You’ve got problems. These are easy examples – you’re waiting for CPU cycles ~50% of the time. But say you hit 90% CPU Ready for 10 seconds out of a whole month? No big deal – a blip on the radar (it wouldn’t even show up in the vCenter Statistics Roll-ups).

High CPU Ready for a short period may not be a huge problem, but the numbers can be deceiving. For example, let’s say you are looking at the Past Year view with a default sampling interval of 86400 seconds (1 day) for a single vCPU VM. Now let’s say that you see an average summation value of 2,000,000 in the ‘Past Year’ table in vCenter. That’s 2.315% by our formula. Sounds low, right? Not so quick. As we deal with longer periods of time, the percentages shift a bit from our 10% rule of thumb.

2.315% of our 86400 second time slice is 2,000 seconds, or 33 minutes per day. This is an average, so there were higher days and there were lower days, but on average, we waited more than half an hour per day for CPU scheduling for our poor little VM. Bump the scale up, we waited 12,167 minutes per year –> 203 hours per year –> 8.45 days per year for CPU scheduling. Let’s say you have an SLA to deliver 5 nines of reliable, high performance for this workload. 8 ½ days is about 97.7% uptime – a long way off your 99.999% SLA.
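The scaling above is easy to check with a few lines of Python. This is a sketch only; the function name `ready_time_per_year` and the dictionary keys are invented for illustration, and it assumes a 365-day year.

```python
def ready_time_per_year(summation_ms, interval_s=86_400, days=365):
    """Scale an average per-interval ready summation (ms) up to daily and yearly totals."""
    fraction = summation_ms / (interval_s * 1000)  # share of each interval spent ready
    seconds_per_day = fraction * 86_400
    return {
        "percent": fraction * 100,
        "minutes_per_day": seconds_per_day / 60,
        "hours_per_year": seconds_per_day * days / 3600,
        "days_per_year": seconds_per_day * days / 86_400,
        "not_waiting_percent": (1 - fraction) * 100,
    }


for name, value in ready_time_per_year(2_000_000).items():
    print(f"{name:<20} {value:8.2f}")
```

For the 2,000,000 ms example this works out to roughly 33 minutes per day, about 203 hours (8.45 days) per year, and about 97.7% of the time not waiting.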

I have reflected this in my charts by marking the increasingly longer intervals with diminishing warning values (yellow and red) – not because it indicates a problem at a specific point in time, but because it could indicate a systemic problem in the environment. This is where some critical analysis comes in. Drill into the smaller time intervals (Past Day): is that 33 minutes taking place all at once or spread over the day? Is it overnight, during a backup window when you are exempted from the business-hours SLA, or is it happening for longer periods at critical times during the workday? Hope you packed a lunch, ’cause you’ve got some troubleshooting ahead of you!

I hope the explanations of CPU Ready and the CPU Ready cheat sheet tables are helpful to you. Questions, corrections, or additions – leave a comment below!

Note: Something ate my CSS that color coded these charts (thanks for bringing it to my attention, Joel!). I’ve fixed it now, but in case it happens again here is a PDF copy of the charts: VMtoday VMware CPU Ready Quick Reference Charts

Comments

NiTRo says

January 31, 2013 at 2:58 am

Hi Joshua, what do you think about this statement from the vmware doc center https://pubs.vmware.com/vsphere-51/topic/com.vmware.vsphere.monitoring.doc/GUID-FC93B6FD-DCA7-4513-A45E-660ECAC54817.html

“However, if the CPU usage value for a virtual machine is above 90% and the CPU ready value is above 20%, performance is being impacted.”

I’ve seen high ready values with low usage values not harming vms too much, that’s why i ask 🙂

- Joshua Townsend says
  
  January 31, 2013 at 8:02 am
  
  I think VMware’s numbers are a bit high. I usually go with a 70% for CPU utilization, and 10% for CPU Ready, but of course it all depends on the environment, workload, and budget to determine what is acceptable in terms of a performance impact.
  
  I have also seen your scenario with high CPU Ready and low CPU usage where performance is acceptable at a given point. The danger with this is that it might not take much to tip the balance against a VM in such a position. Either the VM itself or a peer on the same host gets more active, a new/vMotion VM comes onto the host, or resource assignment changes (more vCPU added to guests) to move your scenario from ‘ok’ to ‘oh crap’.
  
  Good question, and thanks for the comment!
  
  - NiTRo says
    
    January 31, 2013 at 9:04 am
    
    thanks for the details Joshua, i strongly agree with the ‘ok’ to ‘oh crap’ by the way 🙂
    
Pete says

February 6, 2013 at 6:53 pm

Great post Josh! Quite timely for me, as I will be experimenting on the implications of some big VMs for source code compiling. https://vmpete.com/2012/12/18/vroom-scaling-up-virtual-machines-in-vsphere-to-meet-performance-requirements/

– Pete

Marc says

April 5, 2013 at 1:40 pm

When viewing roll up statistics of a few days but less than one week, which column applies (300 or 1,800)?

Kadu says

December 11, 2013 at 9:08 am

I believe the formula on your post is wrong. The correct formula, as per the KB article you mentioned, is:

(CPU summation value / ( * 1000)) * 100 = CPU ready %

- Kadu says
  
  December 11, 2013 at 9:10 am
  
  Errr, part got cut out due to being interpreted as a tag
  (CPU summation value / (chart default update interval in seconds * 1000)) * 100 = CPU ready %
  
  - Josh Townsend says
    
    December 11, 2013 at 7:29 pm
    
    Right you are. Thanks for the correction – I have updated the post to reflect the correct formula.
    
JM says

December 3, 2014 at 4:30 pm

thanks. this was helpful

Stephen says

January 20, 2015 at 8:59 pm

Loved your little graph here. I created a powershell script to query all the VM’s in a vCenter and return the values

# Load any registered PowerCLI snap-ins
Get-PSSnapin -Registered | Add-PSSnapin -PassThru -ErrorAction SilentlyContinue

# Initialize variables
$vCenterServer = "vcenter"   # Enter your vCenter server FQDN here (one at a time)
$intervalseconds = 7200      # 300 = Day, 1800 = Week, 7200 = Month
$PercentCheck = 7            # Warn above this ready % per vCPU
$result = @()

# Connect to vCenter Server
$VC = Connect-VIServer $vCenterServer

# Get stats for VMs and check for ready time more than $PercentCheck% per vCPU
$VMs = Get-VM

Foreach ($V in $VMs) {
    If ($V.PowerState -eq "PoweredOn") {

        # Get the most recent ready-time sample for this VM
        $Summation = Get-Stat -Entity $V -Stat cpu.ready.summation -MaxSamples 1 -IntervalSecs $intervalseconds

        # Convert the summation (ms) to a percentage of the interval
        $Percent = ($Summation.Value / ($intervalseconds * 1000)) * 100

        # Average per vCPU (kept as a decimal so small values are not rounded away)
        $Spread = $Percent / $V.NumCpu

        If ($Spread -gt $PercentCheck) {
            $check = "YES"
        } Else {
            $check = "no"
        }

        $row = "" | Select-Object VMname, Summation, Ready, PerCPU, CPUs, IsBad
        $row.VMname = $V.Name
        $row.Summation = $Summation.Value
        $row.Ready = "{0:N2}" -f $Percent
        $row.PerCPU = "{0:N2}" -f $Spread
        $row.CPUs = $V.NumCpu
        $row.IsBad = $check
        $result += $row
    }
}
$result | Format-Table -AutoSize

# Disconnect from VC
Disconnect-VIServer -Server $VC -Confirm:$False

- Stephen says
  
  January 21, 2015 at 7:00 pm
  
  Actually here is version 2 of it.. much better and can do the real time. Also my maths are better in this one.
  
  # Need PowerShell 2.0+ and PowerCLI 5.1+

  # Load any registered PowerCLI snap-ins
  Get-PSSnapin -Registered | Add-PSSnapin -PassThru -ErrorAction SilentlyContinue

  # Initialize variables
  $vCenterServer = "vcentersrvr.domain.com" # Enter your vCenter server FQDN here (one at a time)

  $intervalseconds = 5 # intervals to specify. 5 = realtime, 300 = day, 1800 = Week, 7200 = Month
  $PercentCheck = 10 # Percent per CPU of ready time to warn on.
  $result = @()

  # Connect to vCenter Server
  $VC = Connect-VIServer $vCenterServer

  # Get stats for VMs and check for ready time more than $PercentCheck % per vCPU
  $VMs = Get-VM # Get all the VMs

  Foreach ($V in $VMs) {
      If ($V.PowerState -eq "PoweredOn") { # Check for powered on

          # Get the ready time
          $Summation = Get-Stat -Entity $V -Stat cpu.ready.summation -MaxSamples 1 -IntervalSecs $intervalseconds

          Foreach ($stat in $Summation) { # Go through all the processors if it has them (only realtime interval)

              # Get the CPU number and set the threshold multiplier (vCPU count for the VM total)
              If ($stat.Instance) {
                  $CPUNum = $stat.Instance
                  $Percentmultipler = 1
              } Else {
                  $CPUNum = "Total"
                  $Percentmultipler = $V.NumCpu
              }

              $Percent = ($stat.Value / ($intervalseconds * 1000)) * 100

              # Check if more than $PercentCheck % average per vCPU
              If ($Percent -gt $Percentmultipler * $PercentCheck) {
                  $check = "YES"
              } Else {
                  $check = "no"
              }

              $PerSecondWaiting = ($stat.Value / $intervalseconds) # ms of ready time per second of the interval

              $Percent = "{0:P3}" -f ($Percent / 100) # format as a percentage
              $PerSecondWaiting = "{0:N2}" -f $PerSecondWaiting # format to 2 decimal places; remember 1000 ms per second

              Write-Host "Checking VM: $($V.Name) CPU#: $CPUNum"

              $row = "" | Select-Object VMname, Summation, Ready, CPU, msPerSec, IsBad
              $row.VMname = $V.Name
              $row.Summation = $stat.Value
              $row.Ready = $Percent
              $row.CPU = $CPUNum
              $row.msPerSec = $PerSecondWaiting
              $row.IsBad = $check

              $result += $row
          }
      }
  }
  $result | Format-Table -AutoSize

  # Disconnect from VC
  Disconnect-VIServer -Server $VC -Confirm:$False
  
  - EdwardJ says
    
    September 14, 2015 at 12:52 pm
    
    You said
    $intervalseconds = 5 # intervals to specify. 5 = realtime, 300 = day, 1800 = Week, 7200 = Month
    Wouldn’t 20 = realtime? where did 5 come from?
    
Steve says

December 7, 2016 at 2:52 pm

Great explanation. I’ve been working on getting my ready times down, but have never seen any mention that number of CPUs was a consideration. By the way, I use https://www.vmcalc.com/ to calculate wait times.
