Atlassian experienced a 14-day outage from April 5th to April 18th, 2022, after a faulty script deleted the data of 775 customers of their popular JIRA and Confluence platforms.
To their credit, Atlassian kept everyone well informed throughout the outage and have also published a detailed post-incident review online.
In this article we will summarise what happened, examine the root causes of the outage and recommend steps that can help prevent similar incidents.
Atlassian's post-incident review gives a detailed description of the events. In short, an errant deployment script unintentionally deleted the data of 775 Atlassian customers on April 5th, 2022.
The script was designed to delete data for a decommissioned app but was run with incorrect identifiers, causing the unintentional deletion of customer data.
The APIs called by the script also ran without any verification prompts, so the team running it in production received no warning that anything was wrong.
Atlassian's Data Management guide explains that secondary / disaster recovery (DR) databases are replicated synchronously. The deletion therefore occurred on both primary and secondary systems concurrently, so failing over to the secondary databases was not a recovery option.
Atlassian resorted to restoring full and incremental database backups to recover the data to a point within 5 minutes of the deletion. The absence of automated procedures for restoring large numbers of databases turned this into a manual and lengthy 14-day outage.
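Atlassian's recovery ran on its own cloud platform, so the following is only an illustrative sketch of what a point-in-time restore looks like on Microsoft SQL Server (the platform discussed later in this article); the database name, backup files and stop time are hypothetical:

```sql
-- Restore the most recent full backup, leaving the database able to accept further restores.
RESTORE DATABASE [CustomerDb]
    FROM DISK = N'\\backups\CustomerDb_Full.bak'
    WITH NORECOVERY, REPLACE;

-- Apply transaction log backups in sequence...
RESTORE LOG [CustomerDb]
    FROM DISK = N'\\backups\CustomerDb_Log_1.trn'
    WITH NORECOVERY;

-- ...stopping at a point just before the errant deletion (hypothetical timestamp).
RESTORE LOG [CustomerDb]
    FROM DISK = N'\\backups\CustomerDb_Log_2.trn'
    WITH STOPAT = N'2022-04-05 07:35:00', RECOVERY;
```

The same sequence has to be repeated, and ideally automated, for every affected database.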
Atlassian's post-incident report does not describe what testing was performed with the script other than stating that the script had been "peer reviewed" prior to live deployment.
If appropriate testing had been performed, the errant deletion of data should have been detected before it affected live systems.
This highlights the importance of pre-release testing for every deployment: a simple step that would very likely have averted this outage.
Atlassian did not describe any planned improvements to their testing procedures, which seems a notable omission.
Atlassian's Data Management guide describes its data recovery processes which include regular tests, documentation, review and leadership. These plans would be considered appropriate by most businesses.
However, automated mass recovery of customer databases was not included in these processes, a problem acknowledged by Atlassian in its post-incident report.
Software as a Service (SaaS) providers often store customer data in individual, isolated databases, so recovery processes should include automated restoration of both individual databases and large numbers of databases concurrently.
Atlassian has reported that its recovery processes are being improved to include automated, concurrent database restoration in future.
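As a simplified sketch of what such automation can look like on Microsoft SQL Server (not Atlassian's tooling), a script can generate point-in-time restore commands for every customer database on an instance; the Customer_ naming convention, backup share and stop time are assumptions, and a single log backup is shown for brevity:

```sql
-- Generate point-in-time restore commands for every customer database on the instance.
-- The generated commands would be executed in parallel, e.g. via SQL Agent jobs or an
-- orchestration tool, rather than restoring one database at a time.
DECLARE @StopAt nvarchar(25) = N'2022-04-05 07:35:00';   -- hypothetical point just before the deletion

SELECT
      N'RESTORE DATABASE ' + QUOTENAME(name)
    + N' FROM DISK = N''\\backups\' + name + N'_Full.bak'' WITH NORECOVERY, REPLACE; '
    + N'RESTORE LOG ' + QUOTENAME(name)
    + N' FROM DISK = N''\\backups\' + name + N'_Log.trn'' WITH STOPAT = N''' + @StopAt + N''', RECOVERY;'
      AS RestoreCommand
FROM sys.databases
WHERE name LIKE N'Customer[_]%';
```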
Use of synchronous database replication between primary and secondary (DR) systems meant that the errant data deletion occurred on both nodes concurrently, eliminating the usefulness of the secondary systems for recovery in this scenario.
Synchronous replication provides excellent high availability in equipment failure scenarios but not in errant data deletion scenarios.
Errant data deletions or updates are a far more common problem than equipment failure, often resulting from causes such as faulty deployment or maintenance scripts, application bugs, and human error (for example, an UPDATE or DELETE run without the intended WHERE clause).
Automated fail-over is often considered the gold standard of High Availability design, but it relies upon synchronous database replication such as Microsoft SQL Server's Always On Availability Groups.
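This coupling is visible in how an Always On Availability Group replica is configured: automatic fail-over requires synchronous commit. A minimal sketch, assuming a hypothetical availability group CustomerAG and secondary replica SQLNODE2:

```sql
-- Run on the primary replica. Availability group and replica names are placeholders.
-- Synchronous commit must be enabled before automatic fail-over can be configured.
ALTER AVAILABILITY GROUP [CustomerAG]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP [CustomerAG]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (FAILOVER_MODE = AUTOMATIC);
```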
Recovery from data faults is often omitted from recovery planning, resulting in lengthy outages whilst database support teams work out what to do, as was the case in Atlassian's outage.
We recommend considering alternative solutions such as customised Log Shipping, which can be configured to restore to secondaries on a delayed basis. This trades away automated fail-over but gains a window of opportunity to recover data from, or fail over to, the secondary databases in the far more common errant data deletion scenarios, such as the one Atlassian experienced or those listed above.
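A minimal sketch of the delayed-restore configuration on a SQL Server Log Shipping secondary; the database, server and delay values are hypothetical, and a full Log Shipping setup involves additional steps on the primary:

```sql
-- Configure the Log Shipping secondary to hold restores back by 4 hours (run on the secondary server).
-- Database, primary server and delay are hypothetical values.
EXEC sp_add_log_shipping_secondary_database
    @secondary_database = N'CustomerDb',
    @primary_server     = N'SQLNODE1',
    @primary_database   = N'CustomerDb',
    @restore_delay      = 240,   -- minutes to wait before applying each log backup
    @restore_mode       = 1,     -- 1 = STANDBY (read-only between restores), 0 = NORECOVERY
    @disconnect_users   = 1;
```

With a 4-hour delay, an errant deletion discovered within that window has not yet reached the secondary, which can then be rolled forward to a point just before the fault.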
Inadequate testing appears to be the root cause of Atlassian's outage. Every deployment should be tested in test environments that match live environments, not simply "peer reviewed".
Provisioning test environments that match live environments can be a challenge, especially on-premises, but such environments are much easier (though more expensive) to create on cloud infrastructure.
All businesses, and especially providers of SaaS solutions who are trusted by their clients, should settle for nothing less than thorough testing.
Automated fail-over, built on the foundation of synchronous replication, has its benefits and has long been considered the "gold standard" of High Availability, but it doesn't cover the most common disaster scenarios: errant data updates and deletions. It also degrades update performance on live systems, as updates must be committed synchronously on the secondaries.
If fail-over automation is still essential for you, despite its limited benefit, performance impact and the fragility that comes with its complexity, it is still important to cater for errant data deletion scenarios in other ways, such as automated restore-from-backup processes.
Another simple technique that can reduce the risk of data deletion by a faulty deployment script is to pause replication to secondaries during deployments and resume it only after post-deployment checks have verified that the deployment succeeded.
If an errant script deleted data unintentionally, it would only affect the primary database, leaving the secondary database available for fail-over.
This approach only covers faulty scripts during deployments, not the other errant data deletion scenarios described above, but it is simple to implement as a standardised deployment procedure.
Being able to fail over in this scenario will be significantly faster than restoring backups, even where the restore process is highly automated.
This is simple with Microsoft SQL Server Always On Availability Groups or Log Shipping synchronization mechanisms and is worth considering in your deployment processes.
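A minimal sketch for an Always On Availability Group, assuming a hypothetical database named CustomerDb (with Log Shipping, the equivalent is simply to disable the secondary's restore job for the duration of the deployment):

```sql
-- Before the deployment: suspend data movement for this database so the secondaries
-- stop receiving changes (run on the primary replica; CustomerDb is hypothetical).
ALTER DATABASE [CustomerDb] SET HADR SUSPEND;

-- ... run the deployment and the post-deployment checks here ...

-- After the checks pass: resume data movement so the secondaries catch up.
-- The primary's transaction log cannot be truncated while movement is suspended,
-- so keep the suspension window short.
ALTER DATABASE [CustomerDb] SET HADR RESUME;
```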