Atlassian ran a tabletop DR simulation that revealed it lived in dependency hell

Australian collaborationware company Atlassian has revealed it’s spent four years trying to reduce dangerous internal dependencies, and while it has rebuilt its PaaS, it still has issues – but thinks they’re now manageable.

As explained in a Tuesday post by Senior Engineering Manager Andrew Ross, “Atlassian runs a large service-based platform with thousands of different services, most deployed by our custom orchestration system, ‘Micros’.”

Micros handles over 2,000 services and 5,000-plus daily deploys, and works with over 40,000 DynamoDB tables and 80,000-plus Amazon Relational Database Service (RDS) tables. It also manages three million Lambda functions.

Another piece of Atlassian’s infrastructure is a private Docker registry called “Artifactory.”

In 2021, Atlassian deployed Artifactory using Micros, and the Micros platform depended on Artifactory at deployment and runtime. That circular dependency meant a failure in either of the tools would make it impossible to recover the other.
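To make the problem concrete, here is a minimal sketch of how a circular dependency like that can be spotted with a depth-first search over a dependency graph. The graphs, the lowercase service names, and the `find_cycle` helper are illustrative assumptions, not anything taken from Atlassian's tooling.

```python
from typing import Dict, List, Optional

# Marker states for the depth-first search.
UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if none exists."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # We reached a node that is still on the current path: a cycle.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical graphs: each key depends on the services in its list.
circular = {"micros": ["artifactory"], "artifactory": ["micros"]}
untangled = {"micros": ["artifactory"], "artifactory": ["kubernetes"], "kubernetes": []}

print(find_cycle(circular))
print(find_cycle(untangled))
```

In the first graph the search walks from one tool to the other and back again, which is exactly the situation where neither can be used to restore the other.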

And that’s trouble for Atlassian, given it’s a SaaS shop and, at the time it started to tackle dependencies, was about to shift customers from on-prem products to the cloud.

Atlassian's dependency analysis for a subset of its platform

The company created a Continuous PaaS Recovery (CPR) project to address as many dependencies as it could.

As that project progressed, Atlassian realized it could not remove all dependencies “due to their number and complexity.” It therefore prioritized unpicking dependency tangles that made it hard to recover services.

In 2023, the company staged a tabletop disaster recovery exercise that simulated 6.5 days of recovery efforts, to help staff understand and identify risks.

Ross’s post illustrates the result of that exercise with the images below, which show recovered services in green, and unrecovered services that have dependency tangles in red. In the “before” shot, at left, three services were alive. In the “after” shot, dozens of services remained down due to dependencies.

The results of Atlassian's tabletop DR exercise

Atlassian has now re-architected its platform into what Ross described as a “layer cake.”

“We decided to divide the cloud infrastructure into layers, with the lowest layers having the fewest dependencies and upper layers having many dependencies,” he wrote. This new cake is not free of dependencies because Atlassian doesn’t think it is possible or practical to eradicate them all. Instead, the company has learned to live with them using the following principles:

A component in layer (N) can only have hard dependencies on lower layers (N → N-1 = Good).
No hard dependencies on the same layer (N → N = Bad).
No hard dependencies on higher layers (N → N+1 = Bad).
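The three principles above lend themselves to an automated check. The following is a rough sketch only: the component names, layer numbers, and the `layering_violations` function are hypothetical, and it assumes "lower layers" means any layer below N rather than strictly N-1.

```python
from typing import Dict, List, Tuple


def layering_violations(
    layer_of: Dict[str, int],
    hard_deps: Dict[str, List[str]],
) -> List[Tuple[str, str, str]]:
    """List every hard dependency that does not point to a strictly lower layer."""
    violations: List[Tuple[str, str, str]] = []
    for component, deps in hard_deps.items():
        for dep in deps:
            if layer_of[dep] == layer_of[component]:
                violations.append((component, dep, "same layer"))
            elif layer_of[dep] > layer_of[component]:
                violations.append((component, dep, "higher layer"))
    return violations


# Hypothetical layer assignments: 0 is the bottom of the cake.
layer_of = {"provisioner": 0, "registry": 1, "orchestrator": 2, "metrics": 2, "app": 3}

# Hypothetical hard dependencies: each key depends on the components listed.
hard_deps = {
    "registry": ["provisioner"],                     # N -> lower: allowed
    "orchestrator": ["registry", "metrics", "app"],  # same layer and higher layer: not allowed
    "app": ["orchestrator"],                         # N -> lower: allowed
}

for component, dep, reason in layering_violations(layer_of, hard_deps):
    print(f"{component} -> {dep}: {reason}")
```

A check of this kind could run whenever a dependency is declared, so a new tangle is flagged before it ships instead of surfacing during a recovery.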

The company has also migrated Artifactory from Micros to Kubernetes, eliminating a critical circular dependency, and built a new low-dependency provisioning system called Atlassian Platform Deployer (APD) that uses AWS CloudFormation as its deployment orchestration engine.

APD helped the company to create and deploy its recently announced Government Cloud. After many further adventures, Atlassian migrated Micros itself to APD.

The company still has internal circular dependencies but has eliminated hundreds of them, and feels it now operates a more reliable platform that’s easier to recover.

It needs to, because Atlassian recently announced a plan to ditch its on-prem products and move all customers to its cloud. And those customers could rightly be wary of that move, given that circular dependencies were big factors in recent outages at Cloudflare and AWS. ®