On Monday, September 20, 2021, from 13:04 to 14:47 UTC, and on Tuesday, September 21, 2021, from 15:04 to 16:44 UTC, Atlassian customers using Trello may have experienced service interruptions.
These service interruptions were caused by a series of events. One shard of our primary database had been slowly increasing its CPU usage due to normal application growth. On Monday and Tuesday mornings, a natural peak in our load patterns associated with weekday mornings and top-of-the-hour tasks caused delays in queries made to that shard. These slow queries used all of the existing connections in the shard's connection pools, causing new connections to be created. Normally, these new connections would be created quickly to handle the additional concurrent queries, and the system would return to normal once the burst of queries had been processed. However, a bug in the version of the MongoDB database software we were running made new database connections far more costly and slower to establish than usual. These new connections added load to the database and set off an unstable cycle of slow queries, connection pool exhaustion, and new connection attempts.

As designed, the nodes of the affected shard detected the failure and promoted a new node to act as the primary. Unfortunately, the new primary entered the same unstable cycle because of the high number of initial connections made to it. That number was much higher than usual due to a configuration change we were in the process of testing, which increased the number of connection pools per database routing server.
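For readers who want context on the mechanics above: database drivers expose the pool limits that govern when a query reuses an existing connection versus triggering a new one. The sketch below uses the MongoDB Node.js driver's standard pool options purely for illustration; the values shown are hypothetical and do not reflect our production configuration.

```typescript
import { MongoClient } from "mongodb";

// Hypothetical values for illustration only; not our production settings.
const client = new MongoClient("mongodb://router.example.internal:27017", {
  maxPoolSize: 100,          // hard cap on connections per server in this pool
  minPoolSize: 10,           // connections kept warm so bursts reuse existing sockets
  maxIdleTimeMS: 60_000,     // idle connections above the minimum are closed after this
  waitQueueTimeoutMS: 5_000, // how long a query waits for a free connection before failing
});

async function main(): Promise<void> {
  await client.connect();
  // When every pooled connection is busy with slow queries, additional queries
  // either wait in the queue or trigger creation of new connections up to
  // maxPoolSize -- the step that became unexpectedly expensive during these incidents.
  const boards = client.db("example").collection("boards");
  console.log(await boards.countDocuments());
  await client.close();
}

main().catch(console.error);
```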
In both cases, we resolved the service interruption by disabling a significant portion of traffic to Trello and then restoring it gradually over the following hour while we monitored the recovery. This gave the database time to establish additional connections that could be reused, without being overwhelmed by the cost of establishing too many new connections concurrently.
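As a rough illustration of this kind of mitigation (and not the specific tooling we used), a percentage-based gate at the application edge can shed a configurable share of incoming requests and be ramped back up as the database recovers. The Express-style middleware below is a hypothetical sketch under that assumption.

```typescript
import express from "express";

// Fraction of requests allowed through; operators raise this gradually
// (e.g., 0.25 -> 0.5 -> 0.75 -> 1.0) while watching database health.
// Hypothetical sketch -- not the actual mechanism used during these incidents.
let allowedFraction = 0.25;

export function setAllowedFraction(fraction: number): void {
  allowedFraction = Math.min(Math.max(fraction, 0), 1);
}

const app = express();

// Shed load before requests reach the database layer.
app.use((_req, res, next) => {
  if (Math.random() < allowedFraction) {
    next();
  } else {
    res.status(503).set("Retry-After", "30").send("Service temporarily unavailable");
  }
});

app.get("/health", (_req, res) => {
  res.json({ allowedFraction });
});

app.listen(3000);
```

Serving an explicit 503 with a Retry-After header keeps shed requests cheap and encourages clients to back off while the connection pools are rebuilt.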
Upon further investigation of Tuesday’s service interruption, we rolled back both our connection pool configuration change and the version of the MongoDB database software running on the routing servers, which reduced the number and cost of new connections. We will roll forward again once our fixes have been tested. While the total time to restore Trello to full service for all customers was approximately 1 hour and 44 minutes for each service interruption, many customers regained full access much sooner, as we allowed traffic back in incrementally.
We apologize if you were impacted during these service interruptions. We know that outages are disruptive to your productivity. We are prioritizing the following improvement actions to avoid repeating this type of service interruption and to improve Trello’s reliability: