XenApp Slow Logons Troubleshooting Part 2
We call these metrics “The Five Ls”:
- Launches
- Logon Times
- Load Times
- Latency
- ChanneLs (Yes, including Channels is cheating, but mnemonics are worth it!)
In this blog post, we’ll discuss how the Five Ls impact the performance and delivery of your Citrix applications and what actions you should take to improve performance in those areas. Special thanks to Ken Avram who originally compiled this content!
Wire data analysis can provide you with real-time information about “the 5 Ls.”
- Launches
A Citrix Launch is simply a user starting a Citrix session. By analyzing data in flight, including ICA communications, the ExtraHop platform breaks down launches into the following categories:
- Normal launches
- Session Sharing (the reuse of an existing session when the second application lives on the same XenApp server as the first application launched via that session)
- Slow launches (those longer than 30 seconds, although this threshold can be changed)
Questions you can answer:
Which users are experiencing slow launches?
Which servers are producing slow launches?
Which applications are producing slow launches?
Is there a specific geo-location (subnet) that is experiencing slow launches?
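The launch categories and questions above can be sketched in a few lines of Python. This is only an illustration, not how the ExtraHop platform works internally: the `Launch` record, its field names, and the rule that a reused session is always counted as session sharing are assumptions made for the example. The 30-second threshold comes from the text.

```python
from collections import Counter
from dataclasses import dataclass

SLOW_LAUNCH_SECONDS = 30.0  # threshold named in the text; adjust as needed


@dataclass
class Launch:
    """One observed launch (hypothetical record layout)."""
    user: str
    server: str
    application: str
    subnet: str
    duration_s: float
    reused_session: bool = False


def classify(launch, slow_after=SLOW_LAUNCH_SECONDS):
    """Return 'session_sharing', 'slow', or 'normal' for a launch."""
    if launch.reused_session:
        return "session_sharing"
    if launch.duration_s > slow_after:
        return "slow"
    return "normal"


def slow_launches_by(launches, field):
    """Count slow launches grouped by one field: user, server, application, or subnet."""
    return Counter(
        getattr(launch, field) for launch in launches if classify(launch) == "slow"
    )
```

Calling `slow_launches_by(launches, "server")` and then `slow_launches_by(launches, "subnet")` on the same data answers the server and geo-location questions in turn.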
- Load Times
The Citrix Load Time metric shows the time from when a user clicks on (launches) an application until the ICA Server presents the application to the user for use. You can use this metric in conjunction with Logon Times (see below) to narrow down the culprit.
Possible causes for poor load time performance:
Roaming profiles – Profiles can be stored on a server that is slow or overloaded. The first things to check are CPU and memory. If these seem normal, disk latency could be the issue. If the profile is stored on a local disk, the best practice is to run perfmon (a built-in utility) and look at the disk queue counter. Any disk queue over two indicates a disk bottleneck. If the storage is remote, check the IOPS load on the remote storage. Also be aware that, by default, Windows only allows 50 UNC handles to be open at one time; if more than that try to access the UNC path, the requests queue up. This can show up during boot storms, such as when everyone logs on at the same time in the morning. There is a registry setting that can increase this limit. Some of this can be remediated by using a professional profile manager; Citrix includes its own User Profile Manager that works much better than the built-in Microsoft one.
Redirected folders – Check the event viewer on the XenApp/XenDesktop machine for any errors with redirected folders. Bad policies that deal with redirected folders can cause a lot of heartburn and timeouts. Redirected folders are used to speed up logon times by storing documents on a “share” rather than copying them over and back between logons and logoffs. If this is not configured correctly, it can really slow down logon times. As stated above, professional profile managers can assume this role, as can Citrix User Profile Manager.
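The disk queue rule of thumb from the roaming profiles paragraph can be applied to a series of counter readings with a short script. This is a minimal sketch that assumes the readings were already exported from perfmon as plain numbers; the function name and the shape of the report are made up for the example.

```python
from statistics import mean

DISK_QUEUE_LIMIT = 2.0  # "any disk queue over two" from the text


def disk_queue_report(samples, limit=DISK_QUEUE_LIMIT):
    """Summarize disk queue length samples and flag a likely bottleneck."""
    if not samples:
        raise ValueError("at least one sample is required")
    over_limit = [s for s in samples if s > limit]
    return {
        "average": mean(samples),
        "peak": max(samples),
        "share_over_limit": len(over_limit) / len(samples),
        # Following the text, a single reading above the limit is enough to flag it.
        "bottleneck": bool(over_limit),
    }
```

The `share_over_limit` value helps separate one brief spike from a queue that stays high throughout a boot storm.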
- Logon Times
The Citrix Logon Time metric refers specifically to the authentication portion of a launch. Slow logon times are likely due to Active Directory or DNS issues.
Active Directory – Several Active Directory issues can affect logon times. When a logon commences, the logon process queries Active Directory for a logon server defined in Active Directory Sites and Services. If this is configured incorrectly, you may get a logon server clear across the country (or the world, if you have remote sites that far away) that has terrible latency, so having this configured correctly is crucial. Also, check the event logs on the Active Directory domain controllers for replication errors. If some controllers show these errors and you happen to hit one of them as your logon server, the logon itself may fail. These errors need to be fixed immediately so that logons are consistent.
Active Directory Group Policy – This is one of the most misunderstood components of Active Directory and is usually fraught with issues (precedence processing, overrides, competing policies, policies unable to load, etc.). These issues can be difficult to find, which is why a completely separate test environment is recommended to ferret them out before going into production. Two tools for diagnosing this are the RSoP (Resultant Set of Policy) tool and the group policy wizard. These tools can be used to verify group policy issues but will only be useful to admins who understand how group policy works in the first place.
DNS – DNS can be the cause of many common Citrix logon issues since so many processes depend on DNS for resolution. If your DNS isn’t working at 100% you’ll see tons of red herrings.
DHCP – This subsystem is often ignored and undermanaged, and it may be to blame for some Citrix logon problems.
LDAP – Failed LDAP authentication can cause Citrix logon issues, too. This is a good place to check if you’re having Citrix logon troubles.
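Since so many logon steps depend on name resolution, a quick way to test the DNS item above is to time a few lookups from an affected machine. The sketch below uses only the Python standard library; the function name is hypothetical, and the host you test (for example a domain controller) is up to you.

```python
import socket
import time


def time_dns_lookup(hostname, attempts=3):
    """Resolve hostname several times; return each duration in ms, or None on failure."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            socket.getaddrinfo(hostname, None)
        except socket.gaierror:
            timings.append(None)  # resolution failed for this attempt
            continue
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

Keep in mind that the operating system may cache results, so the first attempt is usually the most telling one.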
- Latency
There are two types of latency you should track, and the ExtraHop platform monitors both:
- Network Latency: This is reported by observing a specific ICA packet from the client that contains latency information. This measure of latency is calculated by the Citrix solution.
- Client Latency: This is reported by observing a packet from the client on the End-User Experience Monitoring (EUEM) virtual channel reporting the result of a single ICA round-trip measurement. This is only reported if the EUEM beacon is turned on. In many environments, EUEM will not be enabled.
For practical purposes, you should focus on Network Latency as this will be the one reported to Citrix Director as the user experience. Latency is one of those issues whose cause is quite hard to narrow down because of all the interdependencies that are involved with network transport. Here, we attempt to list the dependencies that are known to plague Citrix environments most often. (Many of these issues require you to “prove the negative” and show that the root cause is not the Citrix environment.)
Possible causes for latency:
Bad network switch or bad switch configuration
Switch ports set to auto-negotiate instead of a fixed speed
Sites and Services incorrectly configured (see above under Active Directory)
Users doing large printing jobs (this is a Citrix Policy that can be configured to throttle print jobs on remote sites)
Users using bandwidth-intensive applications (YouTube, watching History Channel, etc)
Users copying large files, especially for remote users
Citrix Policies not tuned for remote users
Applications running slow. This is not a latency issue but can appear to be; the user gets the impression that the network is slow. Poor application performance is a whole different conversation but could be caused by a machine that is starved for CPU/Memory.
Mismatched network speeds, e.g. 100 Mb going to 1 Gb or vice versa.
IOPS-starved backend storage. This will be especially apparent with XenDesktop, which requires many more backend storage resources than XenApp. This is not a latency issue but can appear as such.
- Channels
“Channels” refers to Citrix Virtual Channels. By observing activities on the Citrix channel, you can derive a wealth of information about what users (such as Dr. Ken Pickles) are doing. Things to look for:
- Which channels take up the most network bandwidth?
- Screen updates are going to take a majority of bandwidth
- Printing is typically the second highest bandwidth consumer
- Audio usually comes up third on the overall bandwidth consumption scale
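Once you have per-channel byte counts, answering "which channels take up the most bandwidth?" is a simple aggregation. The sketch below assumes the measurements arrive as `(channel_name, bytes)` pairs; the channel labels used with it are placeholders, not official virtual channel names.

```python
from collections import defaultdict


def rank_channels(samples):
    """Total the bytes per channel and return (channel, bytes, share) sorted by volume."""
    totals = defaultdict(int)
    for channel, nbytes in samples:
        totals[channel] += nbytes
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (channel, nbytes, nbytes / grand_total if grand_total else 0.0)
        for channel, nbytes in ranked
    ]
```

If printing or audio ranks above screen updates for a site, that is a hint to look at the Citrix policies for that site.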
Other Areas to Watch
Besides the Five Ls, you will also want to keep an eye on the following areas when troubleshooting and tuning your Citrix environment:
- XML Broker – The Web Interface/StoreFront uses the XML broker to enumerate the applications that the user is allowed to access. Some installations have multiple XML brokers, but most Citrix installations only need two for redundancy. XML brokers depend on IIS, so if a patch interferes with IIS, the XML broker will not function correctly.
- Provisioning Servers – Provisioning servers are used to provision additional Citrix servers as needed in the environment. This relies on network resources more than anything else, because the servers and their states must be delivered over a very consistent network. Provisioning Servers are usually deployed in a cluster of two, but the cluster can grow to four for redundancy and resiliency.
- CGP – Common Gateway Protocol includes Session Reliability and is based on the SOCKS proxy protocol. This add-on to the ICA channel was made popular by wireless devices. As wireless devices “roam” from access point to access point, they can “hold” the connection rather than dropping it. With plain ICA, if a connection is momentarily dropped, you will see a “connection dropped” message and have to log on to the Citrix session again. CGP uses port 2598.
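A simple way to rule out a firewall problem with Session Reliability is to check whether the CGP port accepts TCP connections from the client side. The sketch below is a generic reachability test, not a Citrix tool. Port 2598 comes from the text; port 1494 is the default ICA port and is added here for comparison. The host name in the usage note is a placeholder.

```python
import socket

ICA_PORT = 1494  # default ICA port (not mentioned in the text above)
CGP_PORT = 2598  # CGP / Session Reliability port from the text


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `tcp_port_open("xenapp01.example.local", CGP_PORT)` returning False while the ICA port is reachable suggests that something between the client and the server blocks CGP.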
A few scenario-based issues and how to troubleshoot them
Scenario 1 – Application Outage
A major health system running MEDITECH via Citrix Virtual Apps recently performed an upgrade to MEDITECH. A misconfiguration was introduced during the upgrade, leaving 40,000 clinicians and staff unable to access the application.
Such a wide-scale fault in the delivery of MEDITECH put intense pressure on the IT department and Citrix administrators, who were now dealing with a major incident with serious implications for patient care.
Troubleshooting Application Outages Using Free Tools
Typically, the Citrix administrator will manually try to launch MEDITECH to see if an error message is displayed and whether that message comes from the application itself or from the Citrix environment. This tests many different Citrix infrastructure components such as the Delivery Controllers, StoreFront servers, Citrix license server, SQL, and more.
If no useful error message is displayed, you would check Citrix Director to see if there are any failed VDAs or failed connections being logged, and what error messages were recorded. Director will also show the basics of Delivery Controller service health, license server health, and any hypervisor alerts.
You would also involve the following teams to check out their parts of the infrastructure:
- The server infrastructure team would perform a health check of the Hypervisor blades. That includes checking if any nodes are in a failed state, if CPU/RAM consumption is saturated, if required virtual machines are powered on, and so on.
- The network team would look for faults in the path between end users and the Virtual Apps servers and the backend MEDITECH infrastructure. Given that there is a complete outage, the investigation would likely focus on the data center/core network.
- The application support or vendor would perform tests against the MEDITECH infrastructure, ensuring that all the different components are running and appearing healthy.
In this scenario, when trying to log on and launch MEDITECH manually, the failure point was application enumeration. This issue was detected during the manual attempted launch of MEDITECH by a Citrix administrator, but it could also have been reported by an end user during triaging with the help desk. The application enumeration failure was traced to permissions, which were later updated accordingly.
- Resolution time: Two hours
- Users impacted: 40,000
Troubleshooting Application Outages Using Paid-For Tools
The resolution in this scenario proactively corrected the issue before it had an impact on users. Goliath Application Availability Monitor (GAAM) was running in the environment, with Goliath Virtual Users running tests against MEDITECH on a predetermined schedule.
The Goliath Virtual User detected the outage immediately, and an alert was generated containing details of the outage, including screenshots from each stage of the test to help determine root cause. This allowed the appropriate personnel to review the screenshots for more information and provided a clearer understanding of the issue. Armed with these screenshots, details, and analytics, the Citrix administrator quickly determined that application enumeration and permission settings were the failure point, and the permissions were updated accordingly.
The difference with this troubleshooting scenario was that the entire Citrix infrastructure and supporting components were being tested automatically by GAAM. Alerts and screenshots were provided as a service to the Citrix administrators, and the issue was resolved before users were impacted.
- Resolution Time: 10 minutes
- Users Impacted: 0
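The idea of a scheduled synthetic launch that reports the failing stage can be illustrated with a small generic runner. This is not how GAAM is implemented; it is only a sketch of the concept. The stage names and the check callables are hypothetical, and a real check would talk to StoreFront, the Delivery Controller, and so on.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StageResult:
    name: str
    ok: bool
    detail: str = ""


def run_synthetic_launch(stages: List[Tuple[str, Callable[[], None]]]) -> List[StageResult]:
    """Run each stage in order and stop at the first one that raises an exception."""
    results = []
    for name, check in stages:
        try:
            check()
            results.append(StageResult(name, True))
        except Exception as exc:  # any failure ends the test run
            results.append(StageResult(name, False, str(exc)))
            break
    return results


def first_failure(results: List[StageResult]) -> Optional[StageResult]:
    """Return the failed stage, or None if every stage passed."""
    for result in results:
        if not result.ok:
            return result
    return None
```

Running such a test on a schedule and alerting on `first_failure` points the administrator at the broken stage (application enumeration in this scenario) before users call the help desk.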
Scenario 2 – Slowness During Logon
These incidents increased user frustration, decreased efficiency, and frustrated the help desk team, who were taking calls from angry users.
Troubleshooting Logon Slowness Using Free Tools
Logon slowness can be difficult to troubleshoot on your own because there could be many reasons logon times are slow. First, you should try to establish patterns and run through a process of elimination:
- How many users are reporting logon slowness?
- Are users all located in a particular office?
- Are users all using a particular application or desktop?
- Are users hitting a particular data center for their application or desktop?
- Are users logging on remotely or are they logging on from an office location?
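The process of elimination above amounts to asking where the slow logons concentrate. One way to do that, assuming you can collect a simple list of users with an office, application, data center, and a flag for whether they reported slowness, is to compute the affected share per attribute value. The record layout and field names below are assumptions for the example.

```python
from collections import Counter


def affected_rate_by(users, attribute):
    """Return the share of users reporting slow logons for each value of attribute."""
    totals = Counter(user[attribute] for user in users)
    affected = Counter(user[attribute] for user in users if user["slow_logon"])
    return {value: affected[value] / totals[value] for value in totals}
```

A value with a rate close to 1.0 while the others stay near 0.0 (one office, one data center) is a strong hint about where to look first.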
One of the first tools to use is Director and the logon duration metrics that it captures per user. Search for an affected user and see which areas of the logon are slow. Citrix Director records processing times for HDX session connection, GPOs, Logon Scripts, Profile Load, and more.
Then, engage with other teams such as networking to have them perform network tests and establish if any latency is occurring between each of the global offices and each data center.
The server infrastructure team will review Hypervisor and virtual machine resource consumption and capacity across both VDAs and the supporting infrastructure servers.
If you have identified a pattern, such as only a certain set of desktops being affected, concentrate your troubleshooting efforts there. Launch the desktop yourself and determine if you see the same logon slowness. Also review the Event Logs and CPU/RAM utilization, and check whether there are enough desktops in operation to serve connecting users.
- Resolution time: Five hours
Troubleshooting Logon Slowness Using Paid-For Tools
The resolution in this scenario was much simpler and faster. Goliath Performance Monitor was running in the environment and producing advanced logon-duration reports.
The Citrix administrator used the GPM web-based console to search for an affected user and to review the 33+ stages of the logon process to determine the delay point. The Goliath logon duration reports for each user session include metrics such as:
- GPO processing time at a granular per-GPO level
- Profile load time
- Brokering time
- The time it takes to map client drives, devices, and ports from the client device
- The time it takes for the Delivery Controller’s XML service to resolve the name of a published application or desktop to a VDA address
- The time it takes wfica32.exe on your client machine to establish a connection with a VDI
In this scenario, it was quickly detected that a specific GPO was causing 37 seconds of extra login time due to unnecessary registry data being set by the GPO.
- Resolution time: Five to 10 minutes
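Whichever tool produces the per-stage logon timings, finding the delay point comes down to ranking the stages by duration. The sketch below assumes the timings are available as a mapping from stage name to seconds; the stage names are made up, and only the 37-second GPO figure in the usage note comes from the scenario.

```python
def slowest_phases(phases, top=3):
    """Return the top slowest logon phases as (name, seconds, share_of_total)."""
    total = sum(phases.values())
    ranked = sorted(phases.items(), key=lambda item: item[1], reverse=True)[:top]
    return [
        (name, seconds, seconds / total if total else 0.0)
        for name, seconds in ranked
    ]
```

With a hypothetical breakdown such as `{"brokering": 2.0, "profile_load": 6.0, "gpo:RegistryCleanup": 37.0, "logon_scripts": 5.0}`, the GPO entry comes out first with roughly three quarters of the total logon time.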
Scenario 3 – Slowness During a Workflow
A top 10 US health system with more than 100 locations nationally encountered end users from multiple locations reporting significant performance impact when scanning to electronic health records. This increased clinician stress and frustration and had an impact on patient care due to the inability to process records and documents through the EHR system.
Troubleshooting Session Slowness Using Free Tools
When a specific user workflow is slow, you often need to determine what components make up that workflow before you can begin troubleshooting them. That’s the difficult part, and it makes troubleshooting time-consuming and complicated. Not having the correct monitoring solutions in place adds to the difficulty and will often leave you to your own devices to resolve the issue.
In this scenario, initial investigation with end users discovered that the scanning workflow was encountering significant slowness.
What do you do next? Fire in the dark. You may upgrade the scanner drivers on all devices, which has the potential to introduce further instability. You will open calls with the EHR and scanning vendors for troubleshooting advice and assistance, involve network teams to investigate if any slowness is occurring between the EHR environment paths and the scanner, collect and review diagnostic logs in an attempt to find something useful, and so on.
Ultimately all of these tasks will take many hours, and often many of the tasks end up being insufficient and a waste of time.
- Resolution time: Two weeks (to update drivers)
Troubleshooting Session Slowness Using Paid-For Tools
The resolution in this scenario was much quicker. Goliath Performance Monitor was running in the environment and collecting a large number of metrics about the Citrix HDX session.
The Citrix administrator used the GPM web-based console to search for an affected user and review the session bandwidth reports to identify the slow point. The Goliath session bandwidth reports show metrics such as:
- Bandwidth consumption per ICA virtual channel
- Input line speed
- Output line speed
- ICA latency
In this scenario, it was quickly detected that high ICA latency was being reported across multiple PCs on the day that the scanning slowness was reported. The Citrix administrator can easily review historic reports to learn a baseline for ICA latency. The focus then shifted from upgrading scanner drivers to investigating the network. Further investigation found that a large number of packets were being dropped, causing retransmits of data and slowing network traffic.
- Resolution time: One to two hours
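The baseline comparison used in this scenario can be approximated with basic statistics if you have historic ICA latency values and packet counters of your own. This is a minimal sketch, not the method any particular product uses: the three-standard-deviation rule and the function names are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def latency_baseline(history_ms):
    """Return (mean, sample standard deviation) of historic ICA latency in ms."""
    return mean(history_ms), stdev(history_ms)  # needs at least two data points


def is_anomalous(today_ms, history_ms, sigmas=3.0):
    """Flag a latency value well above the baseline (mean + sigmas * stdev)."""
    average, deviation = latency_baseline(history_ms)
    return today_ms > average + sigmas * deviation


def retransmit_rate(retransmitted_packets, total_packets):
    """Return the share of packets that had to be retransmitted."""
    return retransmitted_packets / total_packets if total_packets else 0.0
```

A latency value flagged by `is_anomalous` together with a rising `retransmit_rate` points toward the network rather than the scanner drivers.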