Over the last 120 days, as more and more enterprises have learned about, evaluated, and bought Azure Active Directory Premium, we've received a lot of questions about:
- How the service architecture is designed
- How it scales out
- How we provide high availability
- Where customer data resides
At a high level, Azure AD is a highly available, geo-redundant, multi-tenanted, multi-tiered cloud service that has delivered 99.99% uptime for over a year now. We run it across 27 data centers around the world. Azure AD has stateless gateways, front end servers, application servers, and sync servers in all of those data centers. Azure AD also has a distributed data tier that is at the heart of our high availability strategy. Our data tier holds more than 500 million objects and runs across 13 data centers.
Given how many questions we've received, I thought it would be fun and useful to give you a deeper view into how the system works. So I've asked Anandhi Somasekaran, who is the Development Manager for our data tier, to write a guest blog post explaining our data tier architecture and how it helps us deliver mission critical availability levels for our customers all around the world.
And as always, we'd love to hear any feedback or suggestions you have.
Alex Simons (Twitter: Alex_A_Simons)
Director of PM
This is Anandhi Somasekaran. I'm the development manager for our data tier and I'm excited to have this opportunity to share with you some of the inner workings of Azure Active Directory.
Today Azure Active Directory manages identity data for over four million organizations and stores more than 500 million objects across data centers around the world (USA, EMEA, APAC and China), all the while maintaining service uptime above 99.9% (99.99% in May 2014 and 99.99% in June 2014). This post will explain how this has been achieved. The answer, in a word, is carefully.
AAD Architectural Overview: The typical way to build a provably scalable data-rich system is through independent building blocks or scale units – for the AAD data tier this unit is a partition. Our data tier has several front-end services which provide read & write capability. The diagram below shows the components of a single directory partition, located in geographically dispersed data centers. The components are:
- Active Primary – The Active Primary consists of a single clustered write replica per partition and all writes for this partition will be routed to this primary. When data is written, it is generally made multi data center redundant by replicating immediately to at least one other data center.
- Passive Primary – The Passive Primary has the same topology as the Active Primary and writes are replicated from the Active Primary. At any time the Passive Primary can assume the role of Active Primary.
- Secondary Replicas (many) – All directory reads are served from Secondary Replicas, which are physically located in different geographies. There are many of these replicas, and data is replicated to them asynchronously. Directory reads such as authentication requests are serviced from data centers that are close to our customers. The secondary replicas provide read scalability.
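To make the three roles concrete, here is a minimal sketch of a partition as a data structure. It is an illustration only, not our actual implementation: the class names, the `route_read` and `route_write` helpers, and the idea of matching replicas by data center name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    name: str
    data_center: str


@dataclass
class Partition:
    """One directory partition: one write primary, one standby, many read replicas."""
    active_primary: Replica
    passive_primary: Replica
    secondaries: List[Replica] = field(default_factory=list)

    def route_write(self) -> Replica:
        # All writes for this partition go to the Active Primary.
        return self.active_primary

    def route_read(self, client_data_center: str) -> Replica:
        # Assumption: "close to the customer" is approximated by an exact
        # data center match, falling back to the first secondary otherwise.
        if not self.secondaries:
            raise ValueError("partition has no secondary replicas")
        for replica in self.secondaries:
            if replica.data_center == client_data_center:
                return replica
        return self.secondaries[0]
```

In this sketch a read from a client near "singapore" lands on the Singapore secondary if one exists, while every write lands on the single Active Primary regardless of where the client is.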
Continuous availability: Azure Active Directory is highly available for authentication and directory lookups largely as a function of the architecture described above. The key is operating across multiple geographically dispersed data centers that have independent, de-correlated failure modes.
The partition design is simplified (which is critical for large scale systems) by adopting a single master system with a carefully orchestrated and deterministic failover process. For each partition of the directory there is a highly available master replica: the Active Primary. All writes are performed at the Active Primary cluster. It is highly available in the sense that the loss of a single server does not impact the availability of the Active Primary. If needed, we can fail over (< 5 mins per partition) to the Passive Primary cluster in a different data center. Failover can be accomplished on a per partition basis. In planned failover cases, the Passive Primary is brought up to date with the Active Primary before being promoted to active.
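The planned failover step can be sketched as follows. This is a simplified illustration under stated assumptions: `PrimaryCluster`, `PartitionPrimaries`, and the list-based write log are hypothetical names, and the sketch assumes the passive log is always a prefix of the active log.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrimaryCluster:
    data_center: str
    log: List[str] = field(default_factory=list)  # committed writes, in order


@dataclass
class PartitionPrimaries:
    active: PrimaryCluster
    passive: PrimaryCluster

    def planned_failover(self) -> None:
        # Step 1: bring the passive primary up to date with the active one.
        # Assumption: the passive log is a prefix of the active log.
        missing = self.active.log[len(self.passive.log):]
        self.passive.log.extend(missing)
        # Step 2: swap roles so the former passive now accepts writes.
        self.active, self.passive = self.passive, self.active
```

Because each partition carries its own pair of primaries, a structure like this can be failed over one partition at a time without touching the others.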
A write is durably committed to at least two data centers prior to being acknowledged. This is performed by first committing the write on the Active Primary and then immediately replicating the write to a replica in at least one other data center. This ensures that the loss of a data center does not result in loss of data. An exception to this rule is for writes that are replayable, such as those synchronized from another system, in which case durability is achieved in failover cases by replaying the writes.
We maintain zero Recovery Time Objective (RTO) for token issuance and directory reads, and an RTO on the order of minutes (~5 mins) for directory writes. We maintain zero Recovery Point Objective (RPO) and will not lose data on failovers.
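The "commit in two data centers before acknowledging" rule can be sketched like this. Everything here is a hypothetical stand-in: `DataCenterStore` is an in-memory substitute for durable storage, and `durable_write` is an illustrative name, not a real API.

```python
from typing import List


class DataCenterStore:
    """Hypothetical in-memory stand-in for durable storage in one data center."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available
        self.committed: List[str] = []

    def commit(self, record: str) -> bool:
        if not self.available:
            return False
        self.committed.append(record)
        return True


def durable_write(record: str, primary: DataCenterStore,
                  remotes: List[DataCenterStore]) -> bool:
    """Return True (acknowledge) only once two data centers hold the record."""
    if not primary.commit(record):
        return False
    for remote in remotes:
        if remote.commit(record):
            return True  # a second data center now holds the record
    return False  # no second copy exists, so the write is not acknowledged
```

Note that in this simplified sketch a record that reaches only the primary stays there unacknowledged; a real system would also need to decide whether to retry or roll back such a write.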
Today's AAD operates across data centers with the following characteristics:
- For reads, the directory has secondary replicas and corresponding front end services in an active-active configuration operating in multiple data centers. If an entire data center fails, it is taken out of rotation automatically. Our authentication and graph services sit behind a Gateway service. The Gateway load balances these services and fails over automatically when unhealthy servers are detected using transactional health probes. Based on these health probes, the Gateway dynamically routes traffic to healthier data centers.
- For writes, the directory can fail over primary (master) replicas across data centers via planned (new primary is synchronized to old primary) or emergency failover procedures. Data durability is achieved by durable commit to at least two data centers.
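The health-probe routing on the read path can be illustrated with a very small sketch. The function name, the preference-ordered list, and the boolean probe callback are assumptions for the example; the real Gateway's probes are transactional and its routing logic is more involved.

```python
from typing import Callable, List, Optional


def pick_data_center(preferred_order: List[str],
                     probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first data center, in preference order, whose probe passes."""
    for data_center in preferred_order:
        if probe(data_center):
            return data_center
    return None  # nothing healthy to route to


# Example: the nearest data center is unhealthy, so traffic moves to the next.
health = {"amsterdam": False, "dublin": True, "us-east": True}
target = pick_data_center(["amsterdam", "dublin", "us-east"],
                          lambda dc: health[dc])
```

Here `target` ends up as "dublin", which mirrors the behavior described above: an unhealthy data center drops out of rotation and requests flow to the next healthy one.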
Scalability: The directory is partitioned for write scalability. Each partition has many read-only replicas: Secondary replicas. Directory applications target Secondary replicas and their writes are transparently redirected to the Active Primary replica to provide read/write consistency. Secondary replicas significantly extend the scale out of a partition, as directories are mostly serving reads.
Reduced latency: Directory clients connect to their nearest replicas, which improves performance. Since a directory partition can have many secondary replicas, secondary replicas can be placed closer to the directory clients. Only internal directory service components that are write-intensive target the Active Primary replica directly.
Data Consistency: The directory model is one of eventual consistency. One typical problem with distributed asynchronously replicating systems is that the data returned from a particular replica may not be up to date. Read/write consistency for an application targeting a secondary replica is provided by routing its writes to the primary replica and synchronously pulling those writes back to the secondary replica.
Application writers using the Graph API of AAD do not need to maintain affinity to a directory replica themselves to get read-write consistency. The AAD Graph service maintains a logical session which has affinity to a secondary replica used for reads. That affinity is captured in a "replica token", which the graph service stores in a distributed cache and reuses for subsequent operations in the same logical session. Writes directed to primary replicas are immediately pulled back to the secondary replica to which the logical session's reads are issued.
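Here is a minimal sketch of how session affinity plus "pull back on write" gives read-your-writes behavior. All names are hypothetical: `Replica` and `GraphSession` are illustrative classes, a plain dict stands in for the distributed cache, and choosing the first secondary for a new session is an arbitrary assumption.

```python
from typing import Dict, Optional


class Replica:
    """Hypothetical in-memory replica holding key/value directory objects."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: Dict[str, str] = {}


class GraphSession:
    """Logical session with affinity to one secondary replica."""

    def __init__(self, session_id: str, primary: Replica,
                 secondaries: Dict[str, Replica],
                 token_cache: Dict[str, str]) -> None:
        self.primary = primary
        # The "replica token" records which secondary this session reads from.
        # A plain dict stands in for the distributed cache here.
        if session_id not in token_cache:
            token_cache[session_id] = next(iter(secondaries))
        self.replica = secondaries[token_cache[session_id]]

    def write(self, key: str, value: str) -> None:
        self.primary.objects[key] = value   # the write goes to the primary
        self.replica.objects[key] = value   # and is pulled back to our secondary

    def read(self, key: str) -> Optional[str]:
        return self.replica.objects.get(key)
```

In this sketch, other secondaries do not see the write until normal asynchronous replication reaches them (not modeled here), but any request in the same logical session reads its own writes because it keeps returning to the same secondary.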
Current data center footprint (Aug 2014): Write Replicas are today located in the USA and China. We are building out EU write replicas for use in the next few months. Secondary Replicas are available in China, Singapore, Amsterdam, and Dublin and at multiple places across the USA.
Protection from accidental deletions and corruptions: The directory today implements soft deletes rather than hard deletes for users and tenants, for ease of recovery. If an administrator of a tenant accidentally deletes users, they can undo the deletion and restore those users. We take daily backups of the data and can authoritatively restore it in case of any logical corruption. Our database tier employs error correcting codes, so that it can check for errors and automatically correct some types of disk errors.
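The soft delete idea can be sketched as follows. This is an illustration, not the directory's actual schema: `SoftDeleteDirectory` and its methods are hypothetical names, and the sketch ignores details such as how long deleted objects are retained.

```python
from typing import Dict, Optional, Set


class SoftDeleteDirectory:
    """Deleting a user only marks it as deleted, so it can be restored later."""

    def __init__(self) -> None:
        self._users: Dict[str, Dict[str, str]] = {}
        self._deleted: Set[str] = set()

    def add_user(self, user_id: str, attributes: Dict[str, str]) -> None:
        self._users[user_id] = attributes

    def delete_user(self, user_id: str) -> None:
        if user_id not in self._users:
            raise KeyError(user_id)
        self._deleted.add(user_id)  # the data itself is kept

    def restore_user(self, user_id: str) -> None:
        if user_id not in self._deleted:
            raise KeyError(user_id)
        self._deleted.remove(user_id)

    def get_user(self, user_id: str) -> Optional[Dict[str, str]]:
        # Soft-deleted users are hidden from normal lookups.
        if user_id in self._deleted:
            return None
        return self._users.get(user_id)
```

Because the user's attributes are never discarded on delete, restoring is just a matter of clearing the marker, which is what makes an accidental deletion cheap to undo.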
Measure and Respond: Running a high availability service requires world class measurement and monitoring capabilities. We continually measure, analyze and report on key service health metrics and success criteria for each of the AAD services. We develop and tune monitoring and metrics for each scenario, both within each AAD service and across services, so that we can be sure each scenario is working and, if it is not, take rapid action to recover. The most important metric we track is how quickly we can detect and mitigate a customer or live site issue. We invest heavily in monitoring and alerts to minimize time to detect (TTD target: <5 mins) and in operational readiness to minimize time to mitigate (TTM target: <30 mins).
Secure operations: Azure AD is compliant with ISO 27001 and FISMA standards. We employ operational controls such as two-factor authentication for any operation, and we audit all operations. In addition, we use a just-in-time elevation system to grant the necessary access for any operational task on demand and on a temporary basis.
Summary: This combination of a well-planned, geo-distributed architecture with extensive monitoring and automated rerouting, failover and recovery enables us to deliver enterprise level availability and performance to customers in >50 countries all around the world.
I hope this was useful and interesting! If you have questions or feedback, please let us know!