On 11-16-2020 between 13:25 UTC and 15:10 UTC, Atlassian customers using Trello may have experienced service interruptions.
This incident was caused by a deploy containing a change that inadvertently increased the number of queries against our database. As a result, CPU usage on our production database rose above a critical threshold, which triggered an alert to the incident response team.
The response team rolled back multiple versions before identifying the faulty deployment. Rolling back to the stable version prior to this faulty deployment addressed the root cause of the incident, but it was not enough to bring Trello back to a stable state. Because the incident happened during peak-traffic hours, we had to block all traffic to our servers to reduce the load to zero before allowing a full recovery. In total, the incident took 1 hour and 45 minutes to resolve.
Due to the long duration of this outage and its similarity to a past incident on 10-26-2020, we are taking aggressive steps to prevent future outages. We are building out sophisticated database load monitoring and alerting, and improving our release process. While we work on those long-term improvements, we have put the following short-term measures in place to improve reliability. They are:
We understand that outages negatively impact your productivity and we apologize for the inconvenience this has caused.