Trello was temporarily inaccessible

Incident Report for Trello

Postmortem

SUMMARY

On May 15, 2025, between 13:55 and 14:18 UTC, Atlassian customers using the Trello product experienced errors or slow loading times when attempting to view their cards and boards. The event was triggered by a query plan expiring from the database plan cache and by the high resource usage of the query planning operations that followed. The affected database shard held data that is required for every card load. The incident was detected within two minutes by the automated monitoring system and mitigated by increasing the resources available to the affected database shard, which returned Atlassian systems to a known good state. The total time to resolution was about 23 minutes.

IMPACT

The incident affected the Trello product on May 15, 2025, between 13:55 and 14:18 UTC, and caused service disruption for all Trello customers.

ROOT CAUSE

The issue was caused by a query plan expiring from the database cache, which forced incoming queries through a replanning operation. These queries had multiple plans that could satisfy them, and depending on the size of the query, one plan might be significantly more efficient than another. As a result, the query planner performed far more replanning operations than usual, which briefly consumed all of the CPU on the server. Once the CPU was saturated, the planning operations themselves began taking too long, which triggered further replanning in an effort to find more efficient options. This self-reinforcing feedback loop could not be broken without intervention.
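The feedback loop described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a model of Trello's actual database: the function name, the per-tick accounting, and every number in it are hypothetical. It assumes that a plan is re-cached only when all planning work in a tick fits within the available CPU, and that otherwise every incoming query replans again on the next tick.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    t: int
    planning_demand: int  # CPU units requested by query planning in this tick
    capacity: int         # CPU units available in this tick
    saturated: bool       # True when planning demand exceeds capacity


def simulate(ticks, arrivals, plan_cost, capacity, expire_at,
             scale_up_at=None, scaled_capacity=None):
    """Toy model of plan-cache expiry followed by replanning under CPU pressure.

    All parameters are hypothetical illustration values.
    """
    cached = True
    history = []
    for t in range(ticks):
        if t == expire_at:
            cached = False  # the cached plan expires
        scaled = scale_up_at is not None and t >= scale_up_at
        cap = scaled_capacity if scaled else capacity
        # With no cached plan, every incoming query pays the planning cost.
        demand = 0 if cached else arrivals * plan_cost
        saturated = demand > cap
        if not cached and not saturated:
            # Planning finished in time, so the plan is cached again
            # and later ticks no longer need to replan.
            cached = True
        history.append(Tick(t, demand, cap, saturated))
    return history


# Without intervention the shard stays saturated after the expiry.
stuck = simulate(ticks=8, arrivals=50, plan_cost=3, capacity=100, expire_at=2)

# Adding capacity at tick 5 lets planning complete and the loop ends.
fixed = simulate(ticks=8, arrivals=50, plan_cost=3, capacity=100,
                 expire_at=2, scale_up_at=5, scaled_capacity=200)
```

In this sketch, `stuck` remains saturated from the expiry onward, while `fixed` recovers in the tick where capacity is raised, which mirrors the mitigation of increasing resources on the affected shard.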

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue wasn’t identified because it occurs only under very specific conditions, including the amount of load and the order of database queries.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

Review our capacity planning thresholds and ensure that all shards have sufficient overhead to handle unexpected load.
Improve query planner performance by:
- Implementing hinting for known problematic query shapes to circumvent the query planner.
- Investigating long-term generalized solutions to prevent query planner thrashing.
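
As a rough illustration of the capacity review in the first action item, a headroom check could look like the following. This is a minimal sketch: the shard names, the utilization figures, and the 30 percent threshold are all hypothetical and are not taken from Atlassian's capacity planning.

```python
def shards_below_headroom(peak_cpu_percent_by_shard, min_headroom_percent=30):
    """Return the names of shards whose spare CPU at peak is below the threshold.

    peak_cpu_percent_by_shard maps a shard name to its peak CPU utilization
    as an integer percentage (0 to 100). All values here are hypothetical.
    """
    flagged = []
    for name, peak in peak_cpu_percent_by_shard.items():
        headroom = 100 - peak  # spare CPU left at peak load
        if headroom < min_headroom_percent:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical peak utilization figures for three shards.
peaks = {"shard-a": 55, "shard-b": 85, "shard-c": 70}
flagged = shards_below_headroom(peaks)
```

With these example figures only `shard-b` is flagged, because it has 15 percent spare CPU at peak, while `shard-c` sits exactly at the 30 percent threshold and passes.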

Furthermore, we are prioritizing the following additional measures to reduce the impact of any future incidents:

Analyze and reduce single points of failure for loading Trello boards and cards.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support

Posted May 27, 2025 - 15:06 EDT

Resolved

A fix has been implemented, and the issue is now resolved.

Posted May 15, 2025 - 11:21 EDT

Identified

We are aware that Trello was temporarily inaccessible for all customers. The system has already been recovered, and we are monitoring the situation.

Posted May 15, 2025 - 10:27 EDT