I did a bit of troubleshooting today for a customer who was experiencing very slow logon times to VMware View desktops running Windows XP. I suspect the problem is a fairly common one, so I thought I would share my troubleshooting methodology and the solution that got the logon time back to normal. Following a rigid methodology may be overkill for many troubleshooting situations. If you strongly suspect a root cause, check it first before digging into analytic troubleshooting. A little bit of Googling may eventually get you to an answer for a particular problem, but having a firm troubleshooting process will help in all situations.
I’m going to lay out my troubleshooting methodology for you, with some VMware View specific examples. If you’re not interested in the lesson, scroll to the bottom for the probable causes and solution to my particular issue. If you want to learn a bit about a tried and true methodology for problem solving, read on!
My troubleshooting approach is borrowed from the Kepner-Tregoe process for Analytic Trouble Shooting as written about in their book The New Rational Manager. The Kepner-Tregoe methodology dates back to the 1950s and has been used worldwide by corporate, government and other institutions to solve problems and make sound decisions. The Kepner-Tregoe Analytic Trouble Shooting method was used by NASA to help land Apollo 13, and has been identified by ITIL/ITSM as a recommended problem solving technique.
The first step in the method is to define the trouble statement. That is, what exactly is the problem we are trying to solve? The better your trouble statement, the quicker you can zero in on what or where the problem may be. It may seem simplistic or silly, but a trouble statement verbally stated or written makes sure everyone involved in troubleshooting is actually troubleshooting the same issue, not chasing down tangents, unrelated symptoms, etc. In this case, the opening trouble statement from the customer was pretty simple: “Domain account logons to VMware View desktops is slow and/or doesn’t complete.”
As this was a new customer to me, the opening trouble statement pretty much covered the extent of my knowledge of their particular environment. I have a decent bit of working knowledge on VMware View that can carry me through most troubleshooting, but a more specific understanding of the problem limits the depth of memory (and overworking of already tired neurons) I need to get to the solution. We get more specifics by asking the right questions. The specifying questions you ask can be generalized across almost any analytic troubleshooting effort (IT, mechanical, relationships, etc.). The specifying questions attempt to observe the problem (defect) from all dimensions to define a more exact trouble statement that you will use to begin to home in on a root cause. Specifying questions, in and of themselves, do not attempt to identify the root cause. The questions attempt to answer the IS and the IS NOT of the following dimensions:
- WHAT: What is/is not the object, person or unit with the defect? What is/is not the defect on the object?
- WHERE: Where is/is not the object with the defect observed? Where is/is not the defect on the object?
- WHEN: When is/is not the object with the defect first observed? When is/is not the defect observed in the cycle of the object? What is/is not the pattern of the when?
- EXTENT: How much of the object is/is not affected? How many objects have/do not have the defect? How many defects are on the object? What is the trend?
The IS NOT in these specifying questions deals always, in all four dimensions, with a closely related object or defect which could be affected, but is not related to the problem at hand.
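If it helps to see the specification as a worksheet, here is a minimal sketch in Python. The class name, method names and sample observations are my own illustration, not part of the Kepner-Tregoe material; the idea is simply that every dimension should end up with both an IS and an IS NOT answer.

```python
from dataclasses import dataclass, field

DIMENSIONS = ("WHAT", "WHERE", "WHEN", "EXTENT")


@dataclass
class Specification:
    """IS / IS NOT observations for one trouble statement (hypothetical helper)."""
    trouble_statement: str
    is_facts: dict = field(default_factory=lambda: {d: [] for d in DIMENSIONS})
    is_not_facts: dict = field(default_factory=lambda: {d: [] for d in DIMENSIONS})

    def record(self, dimension, observation, holds=True):
        """File an observation under IS (holds=True) or IS NOT (holds=False)."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        target = self.is_facts if holds else self.is_not_facts
        target[dimension].append(observation)

    def gaps(self):
        """Dimensions that still lack either an IS or an IS NOT answer."""
        return [d for d in DIMENSIONS
                if not self.is_facts[d] or not self.is_not_facts[d]]


# Sample observations, for illustration only.
spec = Specification("Domain account logons to View desktops are slow or do not complete")
spec.record("WHAT", "Logon on Windows XP View desktops")
spec.record("WHAT", "Logon on Windows 7 View desktops", holds=False)
spec.record("WHERE", "Desktops in the OU with the Persona GPO linked")

print(spec.gaps())  # prints ['WHERE', 'WHEN', 'EXTENT']
```

The `gaps()` list is a reminder of which specifying questions still need asking before you start hunting for causes.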
Some examples of specifying questions that could be used in troubleshooting the slow logon times for View desktops are (not all will apply to your particular situation, just some seeds to start you along):
Q. What is the object/process with the defect?
A. The logon. Put your jumping to conclusions mat away – we don’t know much beyond the stated problem of ‘the logon is slow or doesn’t complete’.
Q. What is the defect on the object?
A. Logons are a fairly complex operation, requiring interaction with external components (Active Directory, GPO, etc.). At this point, we’ve only defined the object with the defect as a generic ‘logon’. Is the defect on the desktop itself, or a logon script, or a group policy, or maybe something I haven’t even heard of? I don’t know… Attempt to define this as much as possible. Do Windows Event logs have an entry of interest that suggests something like ‘the group policy processing engine experienced a failure’ or ‘domain controller cannot be reached’? In these cases, the group policy processing engine or the connection to the domain controller is a better answer to the ‘what is the object with the defect’ question.
Q. What is not the object with the defect?
A. Can we eliminate a what? Maybe the Windows Event Log reports that group policy processing is successful. We can probably state the IS NOT as: the problem IS NOT a failure of the local group policy processing engine, and IS NOT a failure to communicate with a domain controller to obtain group policies.
Q. Is the defect on local logons as well as domain logons?
Q. Which View desktop(s) or desktop pools is the problem observed in? What View desktops or pools is the problem not seen in?
Q. Which OU are the desktops in?
Q. Does the problem exist on physical and virtual desktops, or just virtual?
Q. Which users are impacted (users in OU A but not in OU B)?
Q. Over PCoIP, or on the vSphere console?
Q. When did the problem start?
A. Yesterday; at 10:57pm; after desktops have been running for 12 hours.
Q. When does the problem not happen?
A. Saturday. (Is anyone working on Saturday to actually validate that it isn’t happening then too?)
Q. When in the cycle does the problem occur?
A. When I refresh the pool.
Q. When in the cycle does it not occur?
A. At logoff. After recompose with snapshot X.
Q. Are all desktops in the pool affected?
Q. What is the extent of the problem on the desktop?
A. Can’t log in, ever. Or, login completes, but desktop continues to be terribly slow.
Q. What is not the trend?
A. The problem is not spreading to newly provisioned desktops; it appears only on already provisioned desktops.
Working Tool Questions
So now we know specifics around what our problem is. Now we put our thinking man (or woman) tools to use with ‘working tool questions’. These questions assist us in interpreting our specifications as we seek a root cause.
Q. What is different, odd, unusual, peculiar or distinct about the IS compared to the IS NOT?
A. The problem occurs on Windows XP View desktops, but not on Windows 7 View desktops. The distinction is the OS version.
A. The problem occurs on VMs in this OU, but not that OU. What is different about the OUs? What are the differences in the GPOs linked to the OUs?
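Finding the distinction is really just a comparison of what you know about the IS against what you know about the IS NOT. A minimal sketch in Python, assuming you have written down a few attributes for an affected and an unaffected desktop (the attribute names and values here are hypothetical):

```python
def distinctions(is_object, is_not_object):
    """Return the attributes whose values differ between the IS and the IS NOT."""
    keys = sorted(set(is_object) | set(is_not_object))
    return {k: (is_object.get(k), is_not_object.get(k))
            for k in keys
            if is_object.get(k) != is_not_object.get(k)}


# Hypothetical attribute sheets for one affected and one unaffected desktop.
affected = {"os": "Windows XP", "protocol": "PCoIP", "persona_gpo_linked": True}
unaffected = {"os": "Windows 7", "protocol": "PCoIP", "persona_gpo_linked": True}

print(distinctions(affected, unaffected))  # prints {'os': ('Windows XP', 'Windows 7')}
```

Anything the two share drops out, and what remains is the short list of differences worth asking "what changed?" about.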
Q. What has changed in, about, or around this difference (deals with WHAT and WHEN)?
A. A new GPO enabling View Persona Management was linked to the OU for the pool with the issue.
A. A Windows update was applied by automatic updates. The update was removed after a refresh of the pool, until the next automatic Windows Update installation.
Q. How could this change possibly cause the trouble?
A. The GPO for View Persona Management changes the logon process, as the roamed profile needs to be accessed over the network instead of locally on the desktop. If the Persona profile repository were unavailable, logons could be slow until a temporary profile was created, or could fail.
A. If Windows Update for KB123456 was applied to the desktop, then the Kerberos encryption method would be updated, causing authentication problems with the domain.
Most Probable Cause:
Q. IF _________ is the cause, how does that explain the IS and the IS NOT facts?
A. If the Persona GPO was linked to the OU for View desktops in the affected pool, and NOT linked to the OU for unaffected pools, the defect would be seen only on the pool with the linked GPO and NOT on the unlinked pool.
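The test for a probable cause can be sketched the same way: a candidate survives only if it predicts the defect on every IS object and predicts no defect on every IS NOT object. This is a rough Python illustration; the pool names, attributes and candidate causes are made up for the example.

```python
def unexplained(cause, is_objects, is_not_objects):
    """Return the IS / IS NOT facts that a candidate cause fails to explain."""
    misses = [("IS", o["name"]) for o in is_objects if not cause(o)]
    misses += [("IS NOT", o["name"]) for o in is_not_objects if cause(o)]
    return misses


# Hypothetical pools: pool-a has the defect, pool-b does not.
is_objects = [{"name": "pool-a", "os": "Windows XP", "persona_gpo": True}]
is_not_objects = [{"name": "pool-b", "os": "Windows XP", "persona_gpo": False}]

# Each candidate cause is a rule that predicts whether a pool shows the defect.
candidates = {
    "Persona GPO linked": lambda o: o["persona_gpo"],
    "Windows XP guest": lambda o: o["os"] == "Windows XP",
}

for name, cause in candidates.items():
    misses = unexplained(cause, is_objects, is_not_objects)
    print(f"{name}: {misses}")
# prints:
# Persona GPO linked: []
# Windows XP guest: [('IS NOT', 'pool-b')]
```

A candidate with an empty list is the one to verify first. A candidate with leftovers needs an extra assumption to explain them, which makes it less probable.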
Q. Does the cause check out in real life? Verify it.
A. Unlink the GPO and do a gpupdate. Does the problem still exist?
- If no, then problem exists with GPO or functionality called by GPO. Congratulations, but don’t pop the top on a cold one just yet (ok, maybe just one – it’s been a long day)…. Even if the problem is resolved with your verification, you’re not done with your work. In our case, the GPO might bring Persona Management functionality that is required by the design goals of the project. Repeat the process to identify what/where/when/extent about the GPO is causing problems (or use your new found knowledge to craft a more specific Google query).
- If yes, does the verification step add to our trouble statement, provide additional answers to our specifying questions (do we have a new IS or IS NOT), or answer any of our working tool questions (did we find another change in, about, or around the IS or IS NOT)?
Think Beyond the Fix
Once you’ve identified and corrected what you believe to be the root cause of the problem in your trouble statement, you still have some work to do. Finding the root cause using sound analytic troubleshooting methodology makes for a good engineer. Thinking beyond the fix makes for a good leader and manager.
Extend the Fix:
Q. Are there similar units needing the same fix?
A. Yes, the same GPO is linked to another OU for a View pool that we have not yet tested.
Extend the Cause:
Q. Did the cause do other damage?
A. Yes. An incorrect setting in a GPO caused corruption in the View Persona profile. Rebuild profile to repair corruption.
Stairstep the Cause:
Q. What caused the cause?
A. Poor change management process. –> Use the Advanced Group Policy Management Console in the Microsoft Desktop Optimization Pack for GPO revision tracking and change approvals.
A. A bug in the View Persona Management components causes a deadlock. –> Update to latest version of View.
A. Ignorance. –> RTFM.
If you are still reading, congratulations! You’ve completed my Analytic Problem Solving 101 course. I hope the methodology will prove useful to you. Before you go, I thought I would share some specifics of this particular issue. In this particular case, I ran through the methodology partly in my own mind, and partly out loud with the customer in about 5 minutes. Testing the causes took a bit longer.
I quickly narrowed down the probable causes to be:
- Permissions on the Persona repository. Set it per VMware’s guidance in the View Administration Guide, which points to a Microsoft TechNet Article for Security Recommendations for Roaming Profile Shared Folders (last updated in 2003), which is incorrect per VMware KB 2008377. Experience has shown me that many deployments do not get permissions right because of the old-fashioned DO loop of documentation on this. If permissions are not configured correctly the profile may not be created correctly, causing slow/failed logons. While permissions were not to my liking, profiles were being created. An IS NOT.
- Persona Management Bug: VMware KB 2011823 describes a problem where logging in to the View desktop takes a long time when Persona Management is enabled and conflicts with an antivirus/endpoint protection product. The problem was resolved in View 5.1.2, which was in use in this deployment. Another IS NOT.
- Performance of CIFS server where Persona repository resides. I’ve seen in the past where slow disk performance on the DFS server, caused by a thin provisioned vmdk not growing fast enough as new profiles were created, led to hanging at logon. The file server seems ok here. Another IS NOT.
- Persona Configuration: Persona had been configured with all of the folder redirection options enabled. The IS for this was XP workstations, the IS NOT was Windows 7. Ah ha: there is a known issue with redirecting Desktop and Start Menu folders using Persona on Windows XP (see VMware KB 2019937, Desktop login takes a long time when using Persona Management in VMware View Manager 5.0.x / 5.1.x). This is the most probable cause. To verify, we removed redirection for Desktop and Start Menu and logons were fast. We applied the ‘fix’ to the GPOs for other desktop pools/OUs. Stair stepping revealed a concern that Desktop and Start Menu items would not be protected by backups on the file server. I explained that Desktop and Start Menu items would still be protected by Persona roaming profiles, just not redirected to an alternate folder location on the file server.
If you’re stuck on troubleshooting, drop a comment below or check out VMware’s KB that details the steps for Troubleshooting VMware View Persona Management here: http://kb.vmware.com/kb/2008457.