I tend to write blog posts about issues that are rare enough that even most experienced admins may not know about them, simply because they seldom come across their desks. That is because in my role I receive the least common unresolved issues our customers run into. Once I have seen a few cases of the same issue over the years, I feel there is some value in documenting it informally on the blog.
This one is a case in point. For years we’ve used the Volume Shadow Copy Service (VSS) as the foundation of the backups in our product. I will not claim to be an overall expert in VSS (if you want to be you can go here), but I do want to describe how VSS can affect your domain controllers’ services in some cases during a backup.
Consider a scenario where you have two domain controllers for a specific AD site. These two DCs provide services to an application server and do nothing else: no workstations are in that site, no other member servers, just that one application server and the single application running on it. As part of its normal work, that application sends a frequent, high volume of LDAP and authentication queries to the domain controllers. This keeps the DCs quite busy on a relatively constant basis, busy enough that the DCs’ disks show the beginnings of noticeable performance bottlenecks during normal usage.
Now add to the scenario a system state backup (Windows Backup or NTBackup) taken on one of the DCs. At the start of the backup you notice that some of the application queries and authentication requests fail for a short period. Not all of them, but enough show up in your application server’s event logs to raise a concern. You may not even have end-user complaints, but it is noticed as an issue. Interestingly, these failures only occur during the “preparation” phase of the DC system state backups and never last longer than 60 seconds.
Preparing to backup
So what is going on?
To understand what is happening we need to understand a few details about how VSS works. VSS basically takes a snapshot of the disk data at the time it runs. It is advertised as a seamless backup service, meaning no interruption, because the snapshot is taken quickly and all backup writing takes place against the snapshot, not the live data on disk. This keeps the backup process seamless to the user, since normal services are not interrupted throughout the backup. Taking the snapshot is what happens in the “preparation” phase of a system state backup with VSS-capable backup utilities like Windows Backup.
However, while the snapshot is being taken VSS imposes a temporary halt on disk writes while still allowing disk reads. To be more precise, there is an Active Directory implementation of the VSS snapshot API which works with VSS to do this. Other applications which use ESE databases, like Exchange, have their own implementation of the snapshot code as well. For AD that means there is a short interval where no database writes can take place. This period is typically so short that it goes unnoticed 99.999% of the time, but there are factors which can make it longer.
Those factors are:
· High disk utilization, indicated by long average disk queue lengths. This condition would likely be present at all times but would spike during the backup process.
· An application which has a low client-side timeout threshold for its requests and no retry or failover behavior in case the DC temporarily fails to respond.
· Low-memory conditions where more of the database is paged out to disk (page file), so more disk reads are required for the snapshot to proceed.
So how can you tell if the behavior you are seeing is related to a scenario like this? Look in Event Viewer for the ESE (database) Freeze and Thaw events logged in the Application event log during the preparation phase of your backups.
When the backup preparation begins you will see the ESE source event (event ID 2001) below. Remember that ESE is the database engine AD runs on:
When it ends you will see its companion event:
Note that in some cases event 2003 above has slightly different wording that includes the word “thaw”.
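To put numbers on those intervals, here is a minimal Python sketch that pairs each freeze event with the thaw event that follows it and reports how long the freeze lasted. It assumes you have already pulled the ESE events out of the Application log into (timestamp, event ID) pairs by whatever means you prefer; the `freeze_windows` name and the sample timestamps are made up for illustration.

```python
from datetime import datetime

FREEZE_ID = 2001  # ESE event logged when the snapshot freeze begins
THAW_ID = 2003    # companion ESE event logged when the freeze ends


def freeze_windows(events):
    """Pair each freeze event with the next thaw event.

    `events` is an iterable of (timestamp, event_id) tuples. Returns a list
    of (freeze_time, thaw_time, duration) tuples in chronological order.
    """
    windows = []
    freeze_start = None
    for when, event_id in sorted(events):
        if event_id == FREEZE_ID:
            freeze_start = when
        elif event_id == THAW_ID and freeze_start is not None:
            windows.append((freeze_start, when, when - freeze_start))
            freeze_start = None
    return windows


# Hypothetical sample data: two nightly backups on one DC.
sample = [
    (datetime(2009, 3, 2, 1, 0, 5), 2001),
    (datetime(2009, 3, 2, 1, 0, 47), 2003),
    (datetime(2009, 3, 3, 1, 0, 4), 2001),
    (datetime(2009, 3, 3, 1, 0, 9), 2003),
]

for start, end, duration in freeze_windows(sample):
    print(f"{start} -> {end}: {duration.total_seconds():.0f}s")
```

A thaw event with no preceding freeze is simply ignored, which keeps the pairing sane if your export starts in the middle of a backup.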
More information can be found here. The Freeze and Thaw intervals correspond to the preparation phase of the backup. The pertinent snippet from the above MSDN article is:
Shadow Copy Freeze and Thaw
The creation of every VSS shadow copy operation is bracketed by Freeze and Thaw events, which writers use to put their files in a stable state prior to shadow copy.
Having Freeze and Thaw events as part of the VSS model means:
· Handling the Freeze event means that those who are developing writers must have a clearly delineated point in the backup cycle where they ensure that all write operations to the disk are stopped and that files are in a well-defined state for backup.
· Handling the Thaw event provides the mechanism for writers to resume writes to the disk and clean up any temporary files or other temporary state information that were created in association with the shadow copy.
· The default window between the Freeze and Thaw events is short (typically 60 seconds); therefore, actual interruption of any service that a writer provides can be minimized.
· Handling of other events (such as PrepareForSnapshot) preceding and following the Freeze and Thaw events, respectively, provides the necessary flexibility to allow writers to complete complicated operations to support shadow copies.
How can you tell that this issue is affecting you? If you have application-side behavior that corresponds to events 2001 and 2003, then it’s time to do some performance logging on your domain controllers and look for performance bottlenecks. Server Performance Advisor or the Perfmon AD Data Collector in Server 2008, run during the backup, are also good tools for getting a handle on what is going on.
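If you want to check that correspondence without eyeballing two event logs side by side, something like the following Python sketch can help. It assumes you already have the application's failure timestamps and the freeze/thaw windows as Python datetimes; the function name, the five-second `slack` for clock differences between servers, and the sample values are my own assumptions, not anything the tools give you.

```python
from datetime import datetime, timedelta


def failures_in_windows(failures, windows, slack=timedelta(seconds=5)):
    """Split failure timestamps into those inside a freeze window and the rest.

    `windows` is a list of (freeze_time, thaw_time) pairs. `slack` widens
    each window a little to allow for clock differences between servers.
    """
    inside, outside = [], []
    for failure in failures:
        hit = any(start - slack <= failure <= end + slack
                  for start, end in windows)
        (inside if hit else outside).append(failure)
    return inside, outside


# Hypothetical sample data: one freeze window and three application errors.
windows = [(datetime(2009, 3, 2, 1, 0, 5), datetime(2009, 3, 2, 1, 0, 47))]
failures = [
    datetime(2009, 3, 2, 1, 0, 20),   # during the freeze
    datetime(2009, 3, 2, 1, 0, 50),   # just after the thaw, within the slack
    datetime(2009, 3, 2, 14, 30, 0),  # unrelated afternoon error
]

inside, outside = failures_in_windows(failures, windows)
print(f"{len(inside)} of {len(failures)} failures fall in a freeze window")
```

If nearly all of the failures land inside the windows, the backup preparation phase is a strong suspect. If most land outside, look elsewhere first.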
What can you do if you have verified that you are seeing this unusual issue? Here’s what I would recommend:
· Alter the application behavior to better accommodate an occasional delay in server responses from the DC.
· Consider moving to an x64 platform for the DCs, with more RAM, augmented by more robust drives and network devices. This should make the VSS freeze and thaw intervals even less perceptible.
· Only as a last resort, decrease the frequency of the backups for those domain controllers.
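For the first recommendation, here is a rough Python sketch of the kind of retry and failover behavior I have in mind. It is not tied to any particular LDAP library: `run_query` stands in for whatever call your application actually makes, and the server names, attempt count, and delay are all placeholders.

```python
import time


def query_with_failover(servers, run_query, attempts=3, delay=1.0):
    """Try `run_query(server)` on each server in turn, retrying the list.

    Returns the first successful result. If every attempt on every server
    fails, the last error is re-raised.
    """
    if not servers:
        raise ValueError("at least one server is required")
    last_error = None
    for attempt in range(attempts):
        for server in servers:
            try:
                return run_query(server)
            except (TimeoutError, ConnectionError) as exc:
                last_error = exc
        if attempt < attempts - 1:
            time.sleep(delay * (attempt + 1))  # wait a bit longer each round
    raise last_error


def fake_query(server):
    # Stand-in for a real directory call: dc1 is mid-freeze, dc2 answers.
    if server == "dc1.example.com":
        raise TimeoutError("no response from " + server)
    return "answered by " + server


print(query_with_failover(["dc1.example.com", "dc2.example.com"], fake_query))
# prints "answered by dc2.example.com"
```

The point is simply that a request which times out against the DC being backed up gets a second chance, either on the other DC in the site or on the same one a moment later, instead of surfacing as an error.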
Hopefully this helps with another less common scenario and gives a better understanding of how things work under the hood in AD.