On Feb. 13, 2024, between 8:00 AM and 11:34 AM UTC, Trello experienced severely degraded performance, appearing as a full or partial outage to Atlassian customers. The event was triggered by a buildup of long-running queries against our database, which slowed API response times and left Trello degraded or unavailable for users. The root cause was a compression change in our database, deployed approximately 11 hours earlier during a low-traffic period. As European customers came online and traffic increased, queries began to back up, triggering the incident. Our monitoring system detected the incident at 8:07 AM UTC, and it was mitigated by reverting the compression change and restarting components of our database system. The total time to resolution was 3 hours and 34 minutes.
The impact window was 8:00 AM to 11:34 AM UTC on Feb. 13, 2024. During this period, Trello was fully or partially unavailable for customers using or attempting to access the site.
The issue was caused by a compression change in our database, which resulted in a buildup of queries in the system. This buildup caused API response times to increase to critical levels. During the incident, many users received HTTP 429 errors as the system began rate-limiting in an attempt to recover. Users who did not receive errors experienced API response times 10-100x slower than our standard response times.
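For API consumers, the practical effect of this rate-limiting is that requests may intermittently return HTTP 429 until the service recovers. The sketch below shows one way a client could handle this defensively. It is a minimal illustration, not part of the incident remediation: the helper name, retry counts, backoff values, and the assumption that a Retry-After header may be present are all illustrative, and the key/token values are placeholders.

```python
import time
import requests

TRELLO_API = "https://api.trello.com/1"  # public Trello REST API base URL

def get_with_backoff(path, params, max_retries=5):
    """Fetch a Trello API resource, backing off when rate-limited (HTTP 429)."""
    delay = 1.0
    for attempt in range(max_retries):
        resp = requests.get(f"{TRELLO_API}{path}", params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # If the server supplies a Retry-After header, honor it;
        # otherwise fall back to exponential backoff (assumed behavior).
        retry_after = resp.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay
        time.sleep(wait)
        delay = min(delay * 2, 60)
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")

# Example usage (key and token are placeholders for your own credentials):
# boards = get_with_backoff("/members/me/boards", {"key": "KEY", "token": "TOKEN"})
```

Clients that already retry on 429 with backoff would have seen slower responses during the incident rather than hard failures.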
We know that outages impact your productivity. We are prioritizing the following actions to prevent a recurrence of this incident and to reduce time to resolution:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve Trello’s performance and availability.
Thanks,
Atlassian Customer Support