On August 11, 2022 at 10:00pm UTC, a portion of Trello users experienced slow or degraded performance in the product. The event was triggered by a sudden increase in load on Trello's MongoDB data store, which saturated the database's resources and caused it to become slow or unresponsive to queries. The incident was mitigated by disabling a feature flag that had allowed a recently deployed code path to execute. The time to resolution was 2 hours and 5 minutes.
Beginning on August 11, 2022 at 10:00pm UTC and extending to August 12, 2022 at 12:05am UTC (TTR of 2 hours and 5 minutes), Trello became slow or unresponsive for ~31% of users. During this time, all Trello functionality either loaded slowly or did not load at all. The incident was detected within 3 minutes by automated monitoring and was mitigated at 11:53pm UTC, when the incident response team disabled the offending feature flag, terminating the code path that was causing the increased load on the database. By 12:05am UTC on August 12, 2022, full functionality was restored for all users.
The issue was caused by a change to Trello's server codebase that introduced a new write pattern to Trello's MongoDB data store. The change, coupled with an unexpected interaction with MongoDB's balancer (a system that balances data across MongoDB nodes), caused a sudden and significant spike in writes to the database. This, in turn, quickly overloaded MongoDB resources, rendering the database unable to respond to requests within a reasonable amount of time. The root cause was the introduction of the code change without an incremental rollout.
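To illustrate what an incremental rollout behind a feature flag can look like, here is a minimal sketch in Python. It is not Trello's implementation: the function name `in_rollout`, the flag name, and the bucketing scheme are all hypothetical and chosen only for illustration. The idea is that each user is hashed into a stable bucket, and the new code path runs only for users whose bucket falls below the current rollout percentage.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user is inside the current rollout percentage.

    Hypothetical helper for illustration only. `percent` is a value from
    0 to 100; setting it to 0 acts as a kill switch for the flag.
    """
    if percent <= 0:
        return False
    if percent >= 100:
        return True
    # Hash flag and user together so each flag gets an independent,
    # stable assignment of users to buckets.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # A user enabled at a lower percentage stays enabled as it is raised.
    return bucket < percent * 100


# Usage: gate the new write path on the flag (names are hypothetical).
if in_rollout("new-write-pattern", "user-123", percent=1):
    pass  # run the new code path for roughly 1% of users
```

Raising the percentage in small steps while watching database load would expose a write spike while only a small share of traffic is affected, and dropping the percentage to 0 corresponds to the mitigation described above.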
We know that events such as this impact your productivity. While we have a number of testing and preventative processes in place, we were not able to simulate the unexpected interaction that caused this event during testing prior to deployment to the production environment.
We are prioritizing the following improvements to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this event, and we are taking immediate steps to improve Trello's performance and availability going forward.