Trello is slow or unavailable

Incident Report for Trello

Postmortem

On November 16, 2020, between 13:25 UTC and 15:10 UTC, Atlassian customers using Trello may have experienced service interruptions.

This incident was caused by a deployment containing a change that inadvertently increased the number of queries against our database. As a result, CPU usage on our production database rose above a critical threshold, which alerted the incident response team.
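
As an illustration of threshold-based alerting in general, the sketch below fires only when CPU usage stays above a threshold for several consecutive samples, which avoids paging on a single spike. It is a minimal sketch, not a description of Trello's actual alerting: the function name `sustained_breach`, the sample values, and the numbers 90 and 3 are all hypothetical.

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return the index of the sample at which an alert would fire, or None.

    samples: CPU usage readings in percent, oldest first (hypothetical data).
    threshold: critical level; a reading counts only if it is strictly above it.
    min_consecutive: how many readings in a row must breach before alerting.
    """
    run = 0  # length of the current streak of breaching readings
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return index
    return None


# Hypothetical readings: one isolated spike, then a sustained climb.
readings = [40, 95, 60, 92, 96, 97]
alert_at = sustained_breach(readings, threshold=90, min_consecutive=3)
```

Requiring consecutive breaches trades a little detection latency for fewer false alarms. A real monitor would also need to handle missing samples.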

The response team rolled back multiple versions before identifying the faulty deployment. Rolling back to the stable version prior to this faulty deployment addressed the root cause of the incident, but it was not enough to bring Trello back to a stable state. Because this incident happened during peak-traffic hours, we had to block all traffic to our servers to reduce the load to zero before allowing full recovery. After all of these actions, the total resolution time for the incident was 1 hour and 45 minutes.

Because of the long duration of this outage and its similarity to a past incident on October 26, 2020, we are taking aggressive steps to prevent future outages. We are building out sophisticated database load monitoring and alerting, and improving our release process. While we work on those long-term improvements, we have put the following short-term measures in place to improve reliability:

  1. Increased resource provisioning across our production databases
  2. Increased dashboard monitoring during and after releases, across all clients
  3. Enhanced capabilities to control which clients can generate traffic and how much
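
One common way to control which clients can generate traffic, and how much, is a per-client token bucket. The sketch below is an illustration under stated assumptions, not Trello's implementation: the class names `TokenBucket` and `ClientLimiter`, the client names, and all rates are hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.tokens = capacity    # start with a full bucket
        self.updated = now        # time of the last refill

    def allow(self, now):
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ClientLimiter:
    """Keeps one bucket per client; a (0, 0) limit blocks that client entirely."""

    def __init__(self, limits, default):
        self.limits = limits    # client name -> (rate, capacity), hypothetical
        self.default = default  # (rate, capacity) for clients not listed
        self.buckets = {}

    def allow(self, client, now):
        if client not in self.buckets:
            rate, capacity = self.limits.get(client, self.default)
            self.buckets[client] = TokenBucket(rate, capacity, now)
        return self.buckets[client].allow(now)


# Hypothetical policy: generous for "web", fully blocked for "legacy-bot".
limiter = ClientLimiter(
    {"web": (10.0, 2.0), "legacy-bot": (0.0, 0.0)},
    default=(1.0, 1.0),
)
```

In a running service, `now` would typically come from a monotonic clock such as `time.monotonic()`. Setting every limit to `(0.0, 0.0)` corresponds to blocking all traffic, as was done during this incident.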

We understand that outages negatively impact your productivity and we apologize for the inconvenience this has caused.

Posted Dec 01, 2020 - 11:33 EST

Resolved

The incident has been resolved, and we're back up and running!
Posted Nov 16, 2020 - 12:53 EST

Monitoring

We have identified the issue and deployed a fix. Our engineering team is continuing to monitor the situation.
Posted Nov 16, 2020 - 11:33 EST

Update

Our engineering team is still actively investigating this incident and working to bring Trello back up as quickly as possible.
Posted Nov 16, 2020 - 09:59 EST

Update

We are continuing to investigate the issue and hope to have an update soon!
Posted Nov 16, 2020 - 09:04 EST

Investigating

We've noticed that Trello is responding slowly. This affects both the web and mobile apps.

Our engineering team is actively investigating this incident and working to bring Trello back up to speed as quickly as possible.

We'll keep you posted with further updates on this page.
Posted Nov 16, 2020 - 08:37 EST
This incident affected: Trello.com and API.