On Monday, September 20, 2021, from 13:04 to 14:47 UTC, and on Tuesday, September 21, 2021, from 15:04 to 16:44 UTC, Atlassian customers using Trello may have experienced service interruptions.
These service interruptions were caused by a series of events. One shard of our primary database had been slowly increasing its CPU usage due to normal application growth. On Monday and Tuesday mornings, a natural peak in our load patterns associated with weekday mornings and top-of-the-hour tasks caused delays in queries made to that shard. These slow queries used all of the existing connections in the shard's connection pools, causing new connections to be created. Normally, these new connections would be created quickly to handle the additional concurrent queries, and the system would return to normal once the burst of queries had been processed. However, a bug in the version of the MongoDB database software we were running made new database connections far more costly and slower to establish than usual. These new connections added load to the database and set off an unstable cycle of slow queries, connection pool exhaustion, and new connection attempts.

As designed, the nodes of the affected shard detected the failure and promoted a new node to act as the primary. Unfortunately, the new primary entered the same unstable cycle because of the high number of initial connections made to it. That number was much higher than usual due to a configuration change we were in the process of testing, which increased the number of connection pools per database routing server.
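For readers who want context on the mechanics above: database drivers expose the pool limits that govern when a query reuses an existing connection versus triggering a new one. The sketch below uses the MongoDB Node.js driver's standard pool options purely for illustration; the values shown are hypothetical and do not reflect our production configuration.

```typescript
import { MongoClient } from "mongodb";

// Hypothetical values for illustration only; not our production settings.
const client = new MongoClient("mongodb://router.example.internal:27017", {
  maxPoolSize: 100,          // hard cap on connections per server in this pool
  minPoolSize: 10,           // connections kept warm so bursts reuse existing sockets
  maxIdleTimeMS: 60_000,     // idle connections above the minimum are closed after this
  waitQueueTimeoutMS: 5_000, // how long a query waits for a free connection before failing
});

async function main(): Promise<void> {
  await client.connect();
  // When every pooled connection is busy with slow queries, additional queries
  // either wait in the queue or trigger creation of new connections up to
  // maxPoolSize -- the step that became unexpectedly expensive during these incidents.
  const boards = client.db("example").collection("boards");
  console.log(await boards.countDocuments());
  await client.close();
}

main().catch(console.error);
```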
In both cases, we resolved the service interruption by disabling a significant portion of traffic to Trello and then restoring it gradually over the following hour while we monitored the recovery. This gave the database time to establish additional connections that could be reused, without being overwhelmed by the cost of establishing too many new connections concurrently.
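As a rough illustration of this kind of mitigation (and not the specific tooling we used), a percentage-based gate at the application edge can shed a configurable share of incoming requests and be ramped back up as the database recovers. The Express-style middleware below is a hypothetical sketch under that assumption.

```typescript
import express from "express";

// Fraction of requests allowed through; operators raise this gradually
// (e.g., 0.25 -> 0.5 -> 0.75 -> 1.0) while watching database health.
// Hypothetical sketch -- not the actual mechanism used during these incidents.
let allowedFraction = 0.25;

export function setAllowedFraction(fraction: number): void {
  allowedFraction = Math.min(Math.max(fraction, 0), 1);
}

const app = express();

// Shed load before requests reach the database layer.
app.use((_req, res, next) => {
  if (Math.random() < allowedFraction) {
    next();
  } else {
    res.status(503).set("Retry-After", "30").send("Service temporarily unavailable");
  }
});

app.get("/health", (_req, res) => {
  res.json({ allowedFraction });
});

app.listen(3000);
```

Serving an explicit 503 with a Retry-After header keeps shed requests cheap and encourages clients to back off while the connection pools are rebuilt.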
Upon further investigation of Tuesday’s service interruption, we rolled back both our connection pool configuration change and the version of the MongoDB database software running on the routing servers, which reduced the number and cost of new connections. We will roll forward again once our fixes have been tested. While the total time to restore Trello to full service for all customers was approximately 1 hour and 44 minutes for each service interruption, many customers regained full access much sooner, as we allowed traffic back in incrementally.
We apologize if you were impacted during these service interruptions. We know that outages are disruptive to your productivity. We are prioritizing the following improvement actions to avoid repeating this type of service interruption and to improve Trello’s reliability: