Although we have a team of engineers dedicated to troubleshooting general server performance problems, Microsoft Directory Services specialists are expected to be the “go to” people for Active Directory and domain controller performance issues. This is especially true when the Lsass.exe process is seen to be using more resources than expected.
This is in part due to the fact that the Lsass.exe process is seen as a big black box by many. In reality, it is a process like any other; it simply takes care of many core aspects of the operating system on any computer, plus the additional roles that a domain controller has. Some of the things which are done in that “black box” are peer domain controller location, authentication, authorization and Active Directory replication.
What about Server Performance Advisor (SPA), you ask? I’ve mentioned using the SPA tool in my blog post “A Day at the SPA” but the SPA AD Data Collector is not the tool for every performance related problem which exists.
For example, what if your domain controller(s) appear to grow less responsive to clients over a period of time and then eventually crash? The SPA AD data collector runs for 5 minutes by default, which is far too short an interval to report anything meaningful about an issue that develops gradually. Even worse, the SPA report might not show a problem even if you extend the data gathering to a much larger interval.
Let me reiterate the “…and then they eventually crash” part of the above example. Although a crash may not occur in every memory leak instance, it is something that a normal performance degradation issue will not produce, whereas a memory leak eventually would. Crashing is not something you will see with a simple load-based issue on a DC where it is busy for some reason. This blog post is intended to give you a top level overview of what to do when you see just this occurrence.
The reason for that is that SPA was not made to give effective diagnosis for memory leak or similar gradual performance degradation issues. It is a tool for establishing an understanding of the performance baseline of a server and for identifying immediate resource bottlenecks. This is a subtle but important detail.
What is a memory leak? Well, to answer that we have to understand a little aspect of programming. Programming languages require that memory be allocated, or set aside, for use in storing values that will be worked with, and then deallocated when the code is finished working with them. More detail on that is here.
Why would this be a concern to an Active Directory administrator? This is a concern because we don’t always have full control over all of the code which runs in our environment. It’s sad to say but AD people are rarely the Kings or Queens of All They Survey. In other words, in real life business needs introduce applications into the environment which may not be entirely bug free. A shocking revelation, I know. Sometimes these applications have the specific problem of not being able to deallocate their memory usage when running on or against a domain controller, resulting in a memory leak. There can be memory leaks in either kernel or user mode but application derived memory leaks are by nature user mode leaks.
In a memory leak situation, more memory is allocated to new code executing over time but never deallocated for reuse, so the amount of memory in use by a process is always increasing. Over time the amount of memory needed for further code execution exceeds the amount of memory available for further use, or allocation. This is a recipe for disaster on a server and is the central reason for the crash behavior that memory leaks often produce.
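To make the idea concrete, here is a minimal sketch in Python of a toy allocator that tracks outstanding memory. Everything in it (the `TrackedHeap` class, the `handle_request` function, the 1024-byte buffer size) is hypothetical and invented purely for illustration; it is not how Lsass.exe or Windows actually manages memory.

```python
class TrackedHeap:
    """Toy allocator that remembers every allocation not yet freed."""

    def __init__(self):
        self._live = {}      # handle -> size in bytes
        self._next_id = 0

    def allocate(self, size):
        handle = self._next_id
        self._next_id += 1
        self._live[handle] = size
        return handle

    def free(self, handle):
        del self._live[handle]

    @property
    def bytes_in_use(self):
        return sum(self._live.values())


def handle_request(heap, leaky):
    """Simulate one unit of work that needs a 1024-byte buffer."""
    buf = heap.allocate(1024)
    # ... the real work with the buffer would happen here ...
    if not leaky:
        heap.free(buf)   # well-behaved code gives the memory back


def run(requests, leaky):
    """Return the bytes still allocated after handling `requests` requests."""
    heap = TrackedHeap()
    for _ in range(requests):
        handle_request(heap, leaky)
    return heap.bytes_in_use


print(run(1000, leaky=False))  # memory is returned after every request
print(run(1000, leaky=True))   # memory in use grows with every request
```

The well-behaved version ends with nothing outstanding no matter how many requests it serves, while the leaky version's memory in use grows in step with the amount of work done.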
Now that we have some background on what the issue is let’s talk about identifying whether something is a memory leak.
There are multiple methods to track this down when you need to. Many of these tools require that you enable some tracking of resources in memory so that proper reporting of that memory usage can be done. Put very basically, once you know you have a problem with a particular process (which this article assumes you do…and it is Lsass.exe) you need to find out what resource is leaking and/or what function(s) are doing it. To find out which resource is not being deallocated properly you will need to tag it, so to speak, so that those tags can be counted. Adding a tag in itself utilizes memory and other resources, and as a result it’s not something that is enabled by default. Instead, the Gflags.exe tool allows you to enable these tags to track the resource usage on “objects”. A better explanation than I can give is available at the Gflags MSDN link above.
We’re going to go over some common tools and methods used followed by a new one that can give a nice readable report, in a similar fashion to what SPA does.
Two of the “traditional” methods are to use Performance Monitor (also known as perfmon) or the User Mode Dump Heap (UMDH) tool to identify the leak. Memory usage in these tools is referred to in bytes and typically tracked by seeing an increase in the number of private bytes used by a process. Remember, for the purposes of this troubleshooting discussion the process in question is Lsass.exe, which runs your Active Directory code (put simply). These two tools are discussed in good detail here. We won’t be going into step by step detail on how to use Perfmon or UMDH to troubleshoot a memory leak since the MSDN article does a good job of that.
Here’s a really good excerpt from the above MSDN article on this:
The Private Bytes counter indicates the total amount of memory that a process has allocated, not including memory shared with other processes. The Virtual Bytes counter indicates the current size of the virtual address space that the process is using.
Some memory leaks appear in the data file as an increase in private bytes allocated.
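As a rough illustration of what “an increase in private bytes” means in practice, the sketch below fits a straight line to a handful of Private Bytes samples and reports the growth rate. It assumes you have already exported the counter values from Perfmon yourself; the function names and the 1 MB per hour threshold are arbitrary choices made for this example, not values from the MSDN article.

```python
from statistics import mean


def growth_per_hour(samples):
    """Least-squares slope of (hours, private_bytes) samples, in bytes per hour."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [hours for hours, _ in samples]
    ys = [private_bytes for _, private_bytes in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("samples must be taken at different times")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def looks_like_leak(samples, min_bytes_per_hour=1_000_000):
    """Heuristic only: flag a sustained upward trend in private bytes."""
    return growth_per_hour(samples) >= min_bytes_per_hour


# Hypothetical hourly samples of Lsass.exe Private Bytes (made-up numbers).
samples = [(0, 100_000_000), (1, 110_000_000), (2, 120_000_000), (3, 130_000_000)]
print(growth_per_hour(samples))
print(looks_like_leak(samples))
```

A steady positive slope over many hours is the pattern to look for; a busy but healthy process tends to rise and fall rather than climb without end.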
Another method is to use Poolmon. Poolmon is useful in that it can display outputs of Gflags.exe-enabled tagged memory and is often used for finding memory leaks.
There are two Poolmon output samples below. Examine the Diff (allocations minus frees) and Bytes (number of bytes allocated minus number of bytes freed) values for each tag, and note any that continually increase.
=== Wed 06/11/2008 07:39:17 ComputerName=DC1 FreePTEs=9,202 ===
SystemUpTime(hours)=20.71; ProcessTotalHandleCount=100,644; SystemThreads=880; SystemProcesses=74
Memory: 3997176K Avail: 1458620K PageFlts:33958221 InRam Krnl: 2544K P:248068K
Commit:1993880K Limit:5938828K Peak:2032972K Pool N: 42,708K P:249,248K
Tag Type Allocs Frees Diff Bytes Per Alloc Mapped_Driver
Toke Paged 3926552 3907851 18701 183714352 9823 [nt!se – Token objects]
And now notice the change in the Toke (token) object paged number…
=== Wed 06/11/2008 07:40:18 ComputerName=DC1 FreePTEs=9,202 ===
SystemUpTime(hours)=20.73; ProcessTotalHandleCount=100,562; SystemThreads=885; SystemProcesses=74
Memory: 3997176K Avail: 1459012K PageFlts:33978385 InRam Krnl: 2544K P:248020K
Commit:1993684K Limit:5938828K Peak:2032972K Pool N: 42,708K P:249,208K
Tag Type Allocs Frees Diff Bytes Per Alloc Mapped_Driver
Toke Paged 3931182 3912503 18679 183890888 9844 [nt!se – Token objects]
Although the two samples above were taken only one minute apart, you can clearly see an increase in the bytes for token objects. This suggests that some code is not deallocating the memory used to store a token (that is, not making it available for reuse) once that code has finished with it. This is only a guideline, however, and you would want to watch the Diff and Bytes values over a longer period of time to truly ascertain whether there is a gradual and consistent leak. There are a variety of indicators to look for, with many ins and outs, but the numbers cannot lie to you when they continually increase.
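If you save Poolmon snapshots to text, the comparison can be scripted. The following is a minimal sketch that parses the two Toke lines shown above and reports how Diff and Bytes changed. The parser assumes whitespace-separated columns in the order Tag, Type, Allocs, Frees, Diff, Bytes, as in these samples; the `PoolTag` type and both function names are made up for this example.

```python
from typing import NamedTuple


class PoolTag(NamedTuple):
    tag: str
    pool_type: str
    allocs: int
    frees: int
    diff: int
    nbytes: int


def parse_tag_line(line):
    """Parse one tag row; columns after Bytes are ignored."""
    fields = line.split()
    tag, pool_type = fields[0], fields[1]
    allocs, frees, diff, nbytes = (int(f.replace(",", "")) for f in fields[2:6])
    return PoolTag(tag, pool_type, allocs, frees, diff, nbytes)


def compare(before, after):
    """Change in outstanding allocations and outstanding bytes between snapshots."""
    return {
        "diff_change": after.diff - before.diff,
        "bytes_change": after.nbytes - before.nbytes,
    }


first = parse_tag_line(
    "Toke Paged 3926552 3907851 18701 183714352 9823 [nt!se – Token objects]"
)
second = parse_tag_line(
    "Toke Paged 3931182 3912503 18679 183890888 9844 [nt!se – Token objects]"
)
print(compare(first, second))
```

For these two samples the Bytes value grows by 176,536 while Diff actually drops by 22, which is a good reminder of why a single one-minute comparison is only a hint and a longer series of snapshots is needed.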
Finally, I mentioned a cool tool that provides a nice report. That tool is called the Debug Diagnostic Tool, which can be downloaded from here. This tool is commonly referred to as simply DebugDiag. DebugDiag was created to troubleshoot IIS related concerns with custom code running in application pools and the like. It is a great tool overall, though, and is one that we can occasionally use for troubleshooting Directory Services issues. In this scenario it is useful to have the tool gather sequential memory dumps and then have DebugDiag generate a report from them which will tell you about any perceived leaks.
So, first you have to have it run and gather the dumps of Lsass.exe while the issue is occurring. When you do that, keep in mind that this is not something to do lightly: it is invasive and can itself degrade the performance of your system. So only do it when you must in order to track down a problem. Here are the steps.
1. Click Start, point to Programs, point to IIS Diagnostics (32 bit), point to Debug Diagnostics Tool, and then click Debug Diagnostics Tool.
2. Select Memory and Handle Leak Rule, and then click Next.
3. Select LSASS.EXE in the Select Target dialog and then click Next.
4. In Configure Leak Rule dialog you can specify a warm-up time. However, in most cases we should instead click the Configure button under “Userdump Generation”.
5. In the Configure Userdumps for Leak Rule dialog which appears, make sure that “Auto-create a crash rule to get userdump on unexpected process exit” is checked.
6. Click the radio button for “Generate a userdump when private bytes reach” to select it. The default is 800 MB. Let’s change that to 900 MB, and select to do additional dumps every 50 MB thereafter.
7. Click Save & Close.
8. Click “Auto-unload LeakTrack…” to add a check mark there.
9. Click Next, and then Next again.
10. Click Finish on the Select Dump Location And Rule Name window. The Userdump Location can be changed here. Note: the status is now Active. The Userdump count will increase every time that a dump file is created. The default dump file location is C:\Program Files\IIS Resources\DebugDiag\Logs.
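To get a feel for when those dumps will actually be written, here is a small sketch that estimates the wait for each threshold given the current Private Bytes value and an observed growth rate. It assumes roughly linear growth, which a real leak may not follow; the function name and the example figures (600 MB now, 25 MB per hour) are hypothetical, while the 900 MB and 50 MB defaults mirror the settings chosen in the steps above.

```python
def dump_schedule(current_mb, growth_mb_per_hour,
                  first_dump_mb=900, step_mb=50, dumps=3):
    """Estimate hours until each userdump threshold is crossed.

    Assumes private bytes grow linearly, which is only an approximation.
    """
    if growth_mb_per_hour <= 0:
        raise ValueError("growth rate must be positive to reach a threshold")
    schedule = []
    for i in range(dumps):
        threshold = first_dump_mb + i * step_mb
        # A threshold that is already exceeded is reached immediately.
        hours = max(0.0, (threshold - current_mb) / growth_mb_per_hour)
        schedule.append((threshold, hours))
    return schedule


# Hypothetical: Lsass.exe is at 600 MB and growing by about 25 MB per hour.
for threshold_mb, hours in dump_schedule(600, 25):
    print(f"{threshold_mb} MB dump in about {hours:.1f} hours")
```

An estimate like this helps you decide whether to leave the rule running overnight or to lower the first threshold so that you capture dumps sooner.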
Next you need to generate the report. To do that simply open DebugDiag, add the files you gathered above using the Add Files button, choose “Memory Pressure Analyzers” and click the Start Analysis button. Once that analysis script is complete you will have a handy, albeit very detailed, report outlining what code appears to be leaking.
This entire scenario came about because you are seeing your domain controllers’ performance gradually decrease over time, followed most likely by their crashing, rebooting and starting the cycle all over again. The goal here is to give you the tools and know-how to at least understand the issue and, at best, track it down and find out what the actual problem is.
And then you can fix it. Fixing it may mean updating some application, or perhaps reconfiguring or uninstalling the offending code. It may even mean that you will need to contact the company who makes that software and see what remedies they know of for that problem. In any case you’ll have a handle on what is happening, and a leg up on getting past the problem.