On Feb. 13, 2024, between 8:00 AM and 11:34 AM UTC, Trello experienced severely degraded performance, appearing as a full or partial outage to Atlassian customers. The event was triggered by a buildup of long-running queries against our database, which slowed API response times and left Trello degraded or unavailable for users. The root cause was a compression change in our database, deployed approximately 11 hours earlier during a low-traffic period. As European customers came online and traffic increased, queries began to back up, triggering the incident. Our monitoring system detected the incident at 8:07 AM UTC, and it was mitigated by reverting the compression change and restarting components of our database system. The total time to resolution was 3 hours and 34 minutes.
The impact window was 8:00 AM to 11:34 AM UTC on Feb. 13, 2024. During this period, Trello was fully or partially unavailable for customers using or attempting to access the site.
The issue was caused by a compression change in our database, which resulted in a buildup of queries in the system. This buildup caused API response times to increase to critical levels. During the incident, many users received HTTP 429 errors as the system began rate-limiting in an attempt to recover. Users who did not receive errors experienced API response times 10-100x slower than our standard response times.
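For API consumers, the practical effect of this rate-limiting is that requests may intermittently return HTTP 429 until the service recovers. The sketch below shows one way a client could handle this defensively. It is a minimal illustration, not part of the incident remediation: the helper name, retry counts, backoff values, and the assumption that a Retry-After header may be present are all illustrative, and the key/token values are placeholders.

```python
import time
import requests

TRELLO_API = "https://api.trello.com/1"  # public Trello REST API base URL

def get_with_backoff(path, params, max_retries=5):
    """Fetch a Trello API resource, backing off when rate-limited (HTTP 429)."""
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.get(f"{TRELLO_API}{path}", params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # If the server supplies a Retry-After header, honor it;
        # otherwise fall back to exponential backoff (assumed behavior).
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay
        time.sleep(wait)
        delay = min(delay * 2, 60)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")

# Example usage (key and token are placeholders for your own credentials):
# boards = get_with_backoff("/members/me/boards", {"key": "KEY", "token": "TOKEN"})
```

Clients that already retry on 429 with backoff would have seen slower responses during the incident rather than hard failures.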
We know that outages impact your productivity. We are prioritizing the following actions to prevent a recurrence of this incident and to reduce time to resolution:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve Trello’s performance and availability.
Thanks,
Atlassian Customer Support