CPU Ready Revisited – Quick Reference Charts

I’ve written in the past about how high CPU Ready values can cause performance problems in VMware vSphere environments.  For those who don’t know, CPU Ready is a measure of the amount of time that a guest VM is ready to run but has to wait, because the VMware ESXi CPU Scheduler on the host cannot immediately allocate cycles to it while it is busy doing work for other VMs.  CPU Ready values are exposed through ESXTOP and in the vSphere Client.

I’m often called into customer environments to do performance troubleshooting, and CPU Ready is one of the first performance measurements I check in my first few minutes in the environment (I also look at memory balloon driver metrics, disk latency, and the CPU and memory utilization of both hosts and guest VMs).  Unfortunately, I’m often called in after the excrement has made physical contact with a hydro-electric powered oscillating air current distribution device, and the customer is demanding a quick fix.  Checking a few basic metrics in the vSphere Client is often enough to put me on the trail of the problem.

Note that the summation value is shown on hosts, guest VMs and guest vCPUs in the vSphere Client.  The different counters have slightly different meanings.  Host CPU Ready might be a bit higher than an individual guest VM’s CPU Ready counter, for example.  Host CPU Ready is a good value to look at if all the VMs are suffering performance issues.  If just one or a few VMs are suffering performance issues, look at the guest VM CPU Ready value.  The guest VM CPU Ready value is a summation of the CPU Ready of each vCPU on the guest.

vSphere Client CPU Ready Counters

As a rule of thumb, a Real-Time CPU Ready value of 10% or greater on a vCPU indicates declining performance for server workloads.  I usually go with a slightly lower value for VMware View virtual desktops (VDI), as users are much more likely to perceive CPU Ready on desktops they are actively using than on a server they reach through a client-server setup.  Theoretically, on VMs with multiple vCPUs the guest VM counter is safe to go beyond 10% so long as the per-vCPU counter is under 10%.  For 2 vCPU VMs the whole VM CPU Ready value can hit 20%, for 4 vCPU VMs 40%, etc., before we hit that 10% rule of thumb.  In practice, because the ESX CPU Scheduler has to co-schedule all vCPUs on a VM, bigger VMs are more prone to CPU Ready on hosts with CPU contention, which probably offsets the theoretical vCPU percentages.
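The per-vCPU arithmetic in that rule of thumb can be sketched in a few lines of Python.  This is only an illustration of the reasoning above, not a VMware tool: the function names are hypothetical, and the 10% default is simply the rule-of-thumb figure from the text.

```python
def vm_ready_ceiling(num_vcpus, per_vcpu_limit_pct=10.0):
    """VM-level CPU Ready % at which the average vCPU reaches the per-vCPU limit."""
    return num_vcpus * per_vcpu_limit_pct


def per_vcpu_ready(vm_ready_pct, num_vcpus):
    """Average CPU Ready % per vCPU, given the VM-level (summed) percentage."""
    return vm_ready_pct / num_vcpus


# Theoretical VM-level ceilings for the VM sizes discussed in the text.
for vcpus in (1, 2, 4):
    print(f"{vcpus} vCPU: VM-level ceiling {vm_ready_ceiling(vcpus):.0f}%")
```

Because this works on an average, one vCPU could still be well over 10% while the VM as a whole looks fine, so the per-vCPU counters remain worth checking.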

The problem, however, is that the vSphere Client shows CPU Ready as a Summation of Milliseconds of CPU Ready for the Sampling Period.  Summation of milliseconds is not always an easy value to wrap your head around, as the impact of the number changes depending on the VM configuration and on the charting period (View) and its sampling interval.  In some views a summation value of 2,000 can indicate problems, and in others 1,000,000 may be ok.

In the vSphere Client, the charts/graphs are shown with an update interval.  The summation values are for the entire interval.  For the ‘Realtime’ interval, we’re really looking at 20 second time slices.  On the Past Day view, the interval is 5 minutes (300 seconds).  Past week is 30 minutes, past month is 2 hours, and past year is 1 day.
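To make the intervals concrete, here is a small Python sketch that lists each chart view with its update interval and the largest summation a single sample could hold in principle (every vCPU ready for the whole interval).  The dictionary and function names are hypothetical; the interval values are the ones listed above.

```python
# Update interval, in seconds, for each vSphere Client chart view (from the text above).
UPDATE_INTERVAL_SECONDS = {
    "Realtime": 20,
    "Past Day": 300,
    "Past Week": 1800,
    "Past Month": 7200,
    "Past Year": 86400,
}


def max_summation_ms(view, num_vcpus=1):
    """Upper bound on the CPU Ready summation (ms) for one sample of the given view."""
    return UPDATE_INTERVAL_SECONDS[view] * 1000 * num_vcpus


for view in UPDATE_INTERVAL_SECONDS:
    print(f"{view:<10} {max_summation_ms(view):>12,} ms")
```

Seen this way, a summation of 2,000 is a tenth of a 20 second Realtime sample but a tiny sliver of an 86,400 second Past Year sample.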

A little math is needed to convert the summation of milliseconds value to a percentage value – an easier number to understand and compare.  I covered how to convert the summation value to a percent here: High CPU Ready, Poor Performance.  VMware one-up’d me ( 😉 ) by publishing a KB article a couple years ago that presented the same formula for converting summation in the vSphere Client to a percentage.  The formula goes like this:

\frac {\text{CPU Ready Summation in milliseconds}}{(\text{Chart Default Update Interval in Seconds} \times 1000)} \times 100 = \text{CPU Ready \%}
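For anyone who would rather script it than do the math by hand, the formula translates directly into Python.  This is a minimal sketch; `cpu_ready_percent` is a hypothetical helper name, not part of any VMware tooling.

```python
def cpu_ready_percent(summation_ms, interval_seconds):
    """Convert a CPU Ready summation (ms) to a percentage of the update interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return summation_ms / (interval_seconds * 1000) * 100


# The same summation means very different things in different chart views.
print(f"Realtime (20 s):     {cpu_ready_percent(2000, 20):.3f}%")
print(f"Past Day (300 s):    {cpu_ready_percent(2000, 300):.3f}%")
print(f"Past Year (86400 s): {cpu_ready_percent(2000, 86400):.3f}%")
```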

As somebody who struggles with numbers, I don’t want a formula, I want easy.  To save me and my customers from my slow touch-point math, I made this quick set of reference tables to determine at a glance if the CPU Ready summation value I saw in the vSphere Client was something to worry about.  I have tables for 1 vCPU, 2 vCPU, and 4 vCPU VMs.

Note – if you use ESXTOP you see CPU Ready (%RDY) as a percentage, in realtime – no conversion necessary.  If you want to capture ESXTOP realtime CPU ready and then analyze it later, use ESXTOP Batch Mode then analyze in Excel or Windows Perfmon.  ESXTOP counters are explained here: Interpreting ESXTOP Statistics, and here: ESXTOP Performance Counters.
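If Excel or Perfmon is not handy, a batch capture (for example `esxtop -b -d 5 -n 60 > esxtop.csv`) can also be summarized with a short script.  The Python sketch below is an illustration only: it assumes the Perfmon-style column naming `\\host\Group Cpu(id:name)\% Ready`, and the sample data, host name and VM names are made up.  Check the header row of your own capture before relying on the pattern.

```python
import csv
import io
import re

# Tiny made-up sample shaped like esxtop batch output (Perfmon-style CSV).
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(1001:web01)\% Ready","\\esx01\Group Cpu(1001:web01)\% Used","\\esx01\Group Cpu(1002:db01)\% Ready"
"01/15/2013 10:00:00","4.20","35.10","12.75"
"01/15/2013 10:00:05","6.80","40.00","15.25"
'''

# Assumed column format: \\host\Group Cpu(<id>:<name>)\% Ready
READY_COLUMN = re.compile(r"\\Group Cpu\(\d+:(?P<name>[^)]+)\)\\% Ready$")


def average_ready(csv_text):
    """Return {group name: mean % Ready} across all samples in the CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    result = {}
    for index, column in enumerate(header):
        match = READY_COLUMN.search(column)
        if match and samples:
            values = [float(row[index]) for row in samples]
            result[match.group("name")] = sum(values) / len(values)
    return result


for name, ready in average_ready(SAMPLE).items():
    print(f"{name}: {ready:.2f}% ready on average")
```

To run it against a real capture, read the file's text and pass it to `average_ready` in place of `SAMPLE`.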

1 vCPU
Summation Value (ms)   Realtime (20 s)   Past Day (300 s)   Past Week (1,800 s)   Past Month (7,200 s)   Past Year (86,400 s)
100 0.500% 0.033% 0.006% 0.001% 0.000%
500 2.500% 0.167% 0.028% 0.007% 0.001%
1000 5.000% 0.333% 0.056% 0.014% 0.001%
1500 7.500% 0.500% 0.083% 0.021% 0.002%
2000 10.000% 0.667% 0.111% 0.028% 0.002%
2500 12.500% 0.833% 0.139% 0.035% 0.003%
3000 15.000% 1.000% 0.167% 0.042% 0.003%
3500 17.500% 1.167% 0.194% 0.049% 0.004%
4000 20.000% 1.333% 0.222% 0.056% 0.005%
4500 22.500% 1.500% 0.250% 0.063% 0.005%
5000 25.000% 1.667% 0.278% 0.069% 0.006%
5500 27.500% 1.833% 0.306% 0.076% 0.006%
6000 30.000% 2.000% 0.333% 0.083% 0.007%
6500 32.500% 2.167% 0.361% 0.090% 0.008%
7000 35.000% 2.333% 0.389% 0.097% 0.008%
10000 50.000% 3.333% 0.556% 0.139% 0.012%
15000 75.000% 5.000% 0.833% 0.208% 0.017%
20000 100.000% 6.667% 1.111% 0.278% 0.023%
50000 NA 16.667% 2.778% 0.694% 0.058%
75000 NA 25.000% 4.167% 1.042% 0.087%
100000 NA 33.333% 5.556% 1.389% 0.116%
250000 NA 83.333% 13.889% 3.472% 0.289%
500000 NA NA 27.778% 6.944% 0.579%
1000000 NA NA 55.556% 13.889% 1.157%
1500000 NA NA 83.333% 20.833% 1.736%
2000000 NA NA NA 27.778% 2.315%
2500000 NA NA NA 34.722% 2.894%
3000000 NA NA NA 41.667% 3.472%
4000000 NA NA NA 55.556% 4.630%
5000000 NA NA NA 69.444% 5.787%
6000000 NA NA NA 83.333% 6.944%
7000000 NA NA NA 97.222% 8.102%
8000000 NA NA NA NA 9.259%
9000000 NA NA NA NA 10.417%
10000000 NA NA NA NA 11.574%
2 vCPU
Summation Value (ms)   Realtime (20 s)   Past Day (300 s)   Past Week (1,800 s)   Past Month (7,200 s)   Past Year (86,400 s)
100 0.500% 0.033% 0.006% 0.001% 0.000%
500 2.500% 0.167% 0.028% 0.007% 0.001%
1000 5.000% 0.333% 0.056% 0.014% 0.001%
1500 7.500% 0.500% 0.083% 0.021% 0.002%
2000 10.000% 0.667% 0.111% 0.028% 0.002%
2500 12.500% 0.833% 0.139% 0.035% 0.003%
3000 15.000% 1.000% 0.167% 0.042% 0.003%
3500 17.500% 1.167% 0.194% 0.049% 0.004%
4000 20.000% 1.333% 0.222% 0.056% 0.005%
4500 22.500% 1.500% 0.250% 0.063% 0.005%
5000 25.000% 1.667% 0.278% 0.069% 0.006%
5500 27.500% 1.833% 0.306% 0.076% 0.006%
6000 30.000% 2.000% 0.333% 0.083% 0.007%
6500 32.500% 2.167% 0.361% 0.090% 0.008%
7000 35.000% 2.333% 0.389% 0.097% 0.008%
10000 50.000% 3.333% 0.556% 0.139% 0.012%
15000 75.000% 5.000% 0.833% 0.208% 0.017%
20000 100.000% 6.667% 1.111% 0.278% 0.023%
50000 NA 16.667% 2.778% 0.694% 0.058%
75000 NA 25.000% 4.167% 1.042% 0.087%
100000 NA 33.333% 5.556% 1.389% 0.116%
250000 NA 83.333% 13.889% 3.472% 0.289%
500000 NA NA 27.778% 6.944% 0.579%
1000000 NA NA 55.556% 13.889% 1.157%
1500000 NA NA 83.333% 20.833% 1.736%
2000000 NA NA NA 27.778% 2.315%
2500000 NA NA NA 34.722% 2.894%
3000000 NA NA NA 41.667% 3.472%
4000000 NA NA NA 55.556% 4.630%
5000000 NA NA NA 69.444% 5.787%
6000000 NA NA NA 83.333% 6.944%
7000000 NA NA NA 97.222% 8.102%
8000000 NA NA NA NA 9.259%
9000000 NA NA NA NA 10.417%
10000000 NA NA NA NA 11.574%
4 vCPU
Summation Value (ms)   Realtime (20 s)   Past Day (300 s)   Past Week (1,800 s)   Past Month (7,200 s)   Past Year (86,400 s)
100 0.500% 0.033% 0.006% 0.001% 0.000%
500 2.500% 0.167% 0.028% 0.007% 0.001%
1000 5.000% 0.333% 0.056% 0.014% 0.001%
1500 7.500% 0.500% 0.083% 0.021% 0.002%
2000 10.000% 0.667% 0.111% 0.028% 0.002%
2500 12.500% 0.833% 0.139% 0.035% 0.003%
3000 15.000% 1.000% 0.167% 0.042% 0.003%
3500 17.500% 1.167% 0.194% 0.049% 0.004%
4000 20.000% 1.333% 0.222% 0.056% 0.005%
4500 22.500% 1.500% 0.250% 0.063% 0.005%
5000 25.000% 1.667% 0.278% 0.069% 0.006%
5500 27.500% 1.833% 0.306% 0.076% 0.006%
6000 30.000% 2.000% 0.333% 0.083% 0.007%
6500 32.500% 2.167% 0.361% 0.090% 0.008%
7000 35.000% 2.333% 0.389% 0.097% 0.008%
10000 50.000% 3.333% 0.556% 0.139% 0.012%
15000 75.000% 5.000% 0.833% 0.208% 0.017%
20000 100.000% 6.667% 1.111% 0.278% 0.023%
50000 NA 16.667% 2.778% 0.694% 0.058%
75000 NA 25.000% 4.167% 1.042% 0.087%
100000 NA 33.333% 5.556% 1.389% 0.116%
250000 NA 83.333% 13.889% 3.472% 0.289%
500000 NA NA 27.778% 6.944% 0.579%
1000000 NA NA 55.556% 13.889% 1.157%
1500000 NA NA 83.333% 20.833% 1.736%
2000000 NA NA NA 27.778% 2.315%
2500000 NA NA NA 34.722% 2.894%
3000000 NA NA NA 41.667% 3.472%
4000000 NA NA NA 55.556% 4.630%
5000000 NA NA NA 69.444% 5.787%
6000000 NA NA NA 83.333% 6.944%
7000000 NA NA NA 97.222% 8.102%
8000000 NA NA NA NA 9.259%
9000000 NA NA NA NA 10.417%
10000000 NA NA NA NA 11.574%
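If you need a summation value that is not in the tables, the rows are easy to regenerate.  The Python sketch below applies the conversion formula to each update interval and prints NA where the result would exceed 100%, which appears to be the convention the tables follow.  The names `INTERVALS` and `table_row` are hypothetical.

```python
# Update intervals (seconds) for Realtime, Past Day, Past Week, Past Month, Past Year.
INTERVALS = [20, 300, 1800, 7200, 86400]


def table_row(summation_ms):
    """Build one table row: the summation value followed by CPU Ready % per view."""
    cells = []
    for interval in INTERVALS:
        pct = summation_ms / (interval * 1000) * 100
        # More than 100% of the interval is not possible for a single vCPU.
        cells.append(f"{pct:.3f}%" if pct <= 100 else "NA")
    return f"{summation_ms} " + " ".join(cells)


for value in (2000, 75000, 1000000):
    print(table_row(value))
```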

The color coding may seem a bit odd and arbitrary, so here are some real-world numbers to help clarify: If you hit 10 seconds of CPU Ready out of every 20 seconds?  That’s a big deal.  40,000 seconds out of a day’s 86,400 seconds?  You’ve got problems.  These are easy examples – you’re waiting for CPU cycles ~50% of the time.  But say you hit 90% CPU Ready for 10 seconds out of a whole month?  No big deal – a blip on the radar (it wouldn’t even show up in the vCenter Statistics Roll-ups).

High CPU Ready for a short period may not be a huge problem, but the numbers can be deceiving.  For example, let’s say you are looking at the Past Year view with a default sampling interval of 86,400 seconds (1 day) for a single vCPU VM.  Now let’s say that you see an average summation value of 2,000,000 in the ‘Past Year’ chart in vCenter.  That’s 2.315% by our formula.  Sounds low, right?  Not so fast.  As we deal with longer periods of time, the percentages shift a bit from our 10% rule of thumb.

2.315% of our 86,400 second time slice is 2,000 seconds, or about 33 minutes per day.  This is an average, so there were higher days and there were lower days, but on average, we waited more than half an hour per day for CPU scheduling for our poor little VM.  Bump the scale up, and we waited 12,167 minutes per year –> 203 hours per year –> 8.45 days per year for CPU scheduling.  Let’s say you have an SLA to deliver 5 nines of reliable, high performance for this workload.  Losing 8 ½ days a year to CPU Ready leaves you at about 97.69% – a long way off your 99.999% SLA.
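The scaling in that example is easy to check with a few lines of Python.  This is just the arithmetic from the paragraph above written out, assuming a 365 day year; the function name and dictionary keys are hypothetical.

```python
SECONDS_PER_DAY = 86400
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def yearly_ready_cost(daily_summation_ms):
    """Scale an average daily CPU Ready summation (ms) up to a full year."""
    seconds_per_day = daily_summation_ms / 1000
    ready_pct = seconds_per_day / SECONDS_PER_DAY * 100
    minutes_per_year = seconds_per_day / 60 * DAYS_PER_YEAR
    return {
        "ready_pct": ready_pct,
        "minutes_per_day": seconds_per_day / 60,
        "minutes_per_year": minutes_per_year,
        "hours_per_year": minutes_per_year / 60,
        "days_per_year": minutes_per_year / 60 / 24,
        "not_waiting_pct": 100 - ready_pct,
    }


for key, value in yearly_ready_cost(2_000_000).items():
    print(f"{key:>17}: {value:,.3f}")
```

The last figure, the share of the year the VM was not waiting on the scheduler, works out to roughly 97.69% for this example.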

I have reflected this in my charts by marking the increasingly longer intervals with diminishing warning values (yellow and red) – not because it indicates a problem at a specific point in time, but because it could indicate a systemic problem in the environment.  This is where some critical analysis comes in.  Drill into the smaller time intervals (Past Day): is that 33 minutes taking place all at once, or is it spread over the day?  Is it happening overnight during a backup window, when you are exempted from the business-hour SLA, or for longer periods at critical times during the workday?  Hope you packed a lunch, cause you’ve got some troubleshooting ahead of you!

I hope the explanations of CPU Ready and the CPU Ready cheat sheet tables are helpful to you.  Questions, corrections, or additions – leave a comment below!

Note: Something ate my CSS that color coded these charts (thanks for bringing it to my attention, Joel!). I’ve fixed it now, but in case it happens again here is a PDF copy of the charts: VMtoday VMware CPU Ready Quick Reference Charts

Comments

  1. Hi Joshua, what do you think about this statement from the vmware doc center http://pubs.vmware.com/vsphere-51/topic/com.vmware.vsphere.monitoring.doc/GUID-FC93B6FD-DCA7-4513-A45E-660ECAC54817.html

    “However, if the CPU usage value for a virtual machine is above 90% and the CPU ready value is above 20%, performance is being impacted.”

    I’ve seen high ready values with low usage values not harming vms too much, that’s why i ask 🙂

    • I think VMware’s numbers are a bit high. I usually go with a 70% for CPU utilization, and 10% for CPU Ready, but of course it all depends on the environment, workload, and budget to determine what is acceptable in terms of a performance impact.

      I have also seen your scenario with high CPU Ready and low CPU usage where performance is acceptable at a given point. The danger with this is that it might not take much to tip the balance against a VM in such a position. Either the VM itself or a peer on the same host gets more active, a new/vMotion VM comes onto the host, or resource assignment changes (more vCPU added to guests) to move your scenario from ‘ok’ to ‘oh crap’.

      Good question, and thanks for the comment!

  2. Great post Josh! Quite timely for me, as I will be experimenting on the implications of some big VMs for source code compiling. http://vmpete.com/2012/12/18/vroom-scaling-up-virtual-machines-in-vsphere-to-meet-performance-requirements/

    – Pete

  3. When viewing roll up statistics of a few days but less than one week, which column applies (300 or 1,800)?

  4. I believe the formula on your post is wrong. The correct formula, as per the KB article you mentioned, is:

    (CPU summation value / (<chart default update interval in seconds> * 1000)) * 100 = CPU ready %

  5. thanks. this was helpful

  6. Loved your little graph here. I created a powershell script to query all the VM’s in a vCenter and return the values

    get-pssnapin -registered | add-pssnapin -passthru -ErrorAction SilentlyContinue

    #Initialize variables

    $vCenterServer = “vcenter” #Enter your vCenter server FQDN here (one at a time)
    $intervalseconds = 7200 # 300 = day, 1800= Week, 7200 = Month
    $PercentCheck = 7
    $result = @()

    #Connect to vCenter Server
    $VC = Connect-VIServer $vCenterServer

    # Get Stats for VM’s and check for ready time more than $PercentCheck% per vCPU

    $VMs = Get-VM

    Foreach ($V in $VMs) {
    If ($V.PowerState -eq “PoweredOn”){

    $Summation = Get-VM -name $V | get-stat -Stat cpu.ready.summation -MaxSamples 1 -IntervalSecs $intervalseconds #Get the ready time
    $Percent = ($Summation.Value / ($intervalseconds * 1000)) * 100 #convert to percentage

    #Check if more than $PercentCheck% average per vCPU
    $Spread = $Percent / $V.NumCpu # keep the decimals; casting to [int] here rounded the value before the comparison

    If ( $Spread -gt $PercentCheck ) {
    $check = “YES”
    }Else{
    $check = “no”
    }

    $Spread = “{0:N2}” -f $Spread
    $Percent = “{0:N2}” -f $Percent

    $row = “” | select VMname,Summation,Ready,PerCPU, CPUs,IsBad; `
    $row.VMname = $V.name; `
    $row.Summation = $Summation.Value; `
    $row.Ready = $Percent; `
    $row.PerCPU = $Spread
    $row.CPUs = $V.NumCpu
    $row.IsBad = $check
    $result += $row

    }
    }
    $result | Format-Table –AutoSize

    # Disconnect from VC
    Disconnect-VIServer -Confirm:$False

    • Actually here is version 2 of it.. much better and can do the real time. Also my maths are better in this one.

      #Need powershell 2.0+ and powercli 5.1+

      get-pssnapin -registered | add-pssnapin -passthru -ErrorAction SilentlyContinue

      #Initialize variables

      $vCenterServer = “vcentersrvr.domain.com” #Enter your vCenter server FQDN here (one at a time)

      $intervalseconds = 5 # intervals to specify. 5 = realtime, 300 = day, 1800 = Week, 7200 = Month
      $PercentCheck = 10 # Percent per CPU of ready time to warn on.
      $result = @()

      #Connect to vCenter Server
      $VC = Connect-VIServer $vCenterServer

      # Get Stats for VM’s and check for ready time more than $PercentCheck % per vCPU

      $VMs = Get-VM # Get all the VM’s

      Foreach ($V in $VMs) {
      If ($V.PowerState -eq “PoweredOn”){ #Check for powered on

      $Summation = Get-VM -name $V | get-stat -Stat cpu.ready.summation -MaxSamples 1 -IntervalSecs $intervalseconds #Get the ready time

      Foreach ($stat in $Summation){ # Go through all the processors if it has them (only realtime interval)

      #Get CPU number and set Percentage to check (divide by total cpu if total)
      If ($stat.instance){
      $CPUNum = $stat.Instance
      $Percentmultipler = 1
      }else{
      $CPUNum = “Total”
      $Percentmultipler = $V.NumCpu
      }

      $Percent = ($stat.Value / ($intervalseconds * 1000)) * 100

      #Check if more than $PercentCheck % average per vCPU

      If ($Percent -gt $Percentmultipler * $PercentCheck ) {
      $check = “YES”
      }Else{
      $check = “no”
      }

      $PerSecondWaiting = ($stat.Value / $intervalseconds) # ms of CPU Ready per second of the interval

      $Percent = “{0:P3}” -f ($Percent / 100) #format to percentage
      $PerSecondWaiting = “{0:N2}” -f $PerSecondWaiting #formatting to 2nd decimal place .. remember 1000 ms per second

      Write-Host “Checking VM: $($v.Name) CPU#: $CPUnum”
      #Write-Host “CPU: $CPUNum”
      #Write-Host “Percent: $Percent”

      $row = “” | select VMname,Summation,Ready,CPU,msPerSec,IsBad; `
      $row.VMname = $V.name; `
      $row.Summation = $stat.Value; `
      $row.Ready = $Percent; `
      $row.CPU = $CPUNum; `
      $row.msPerSec = $PerSecondWaiting ; `
      $row.IsBad = $check;`

      $result += $row
      }
      }
      }
      $result | Format-Table –AutoSize

      # Disconnect from VC
      Disconnect-VIServer -Confirm:$False

      • You said
        $intervalseconds = 5 # intervals to specify. 5 = realtime, 300 = day, 1800 = Week, 7200 = Month
        Wouldn’t 20 = realtime? where did 5 come from?
