Trello is slow or unavailable for some users

Incident Report for Trello

Postmortem

SUMMARY

On May 5, 2025, between 2:08 p.m. and 4:29 p.m. UTC, some Atlassian customers using Trello were unable to view their boards or cards. The event was triggered by an unexpected error encountered by our infrastructure management tools, which resulted in an incorrect DNS configuration being deployed to a portion of our database. The incident was detected within four minutes by automated monitoring systems and mitigated by identifying the faulty portion of the database and performing a failover, which put Atlassian systems into a known good state. The total time to resolution was about two hours and 21 minutes.

IMPACT

The incident affected the Trello product on May 5, 2025, between 2:08 p.m. and 4:29 p.m. UTC. It disrupted service for Trello customers whose accounts and boards contained or referenced data on the affected shard of our database. Some other Trello customers may also have experienced a service disruption because we used load-shedding tools during the incident to strategically block portions of our traffic and aid recovery.

ROOT CAUSE

The day before the incident, on May 4, our infrastructure management tooling encountered an unexpected error when attempting to fetch the networking metadata on a particular host. This caused the host, which was a member of our database cluster, to incorrectly apply the default operating system DNS configuration. That configuration could not resolve internal domains, which left the node in a partial failure state. The database continued to function normally and there was no immediate customer impact, but in the background the incorrect DNS configuration led to a slow buildup of database sessions. These sessions are usually short-lived and expire automatically when no longer needed, but the DNS misconfiguration prevented this automatic expiration. The session count eventually reached the default maximum on this particular shard. At that point, the shard was unable to create new sessions, which are required for all basic operations, and the Trello product began experiencing elevated error rates.
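The slow buildup described above is the kind of trend that can be projected before it reaches a limit. The following is a minimal sketch, not Atlassian's actual monitoring: the function names, the linear-growth assumption, and every number in the usage note are hypothetical and chosen only for illustration.

```python
from typing import Optional


def hours_until_exhaustion(
    current_sessions: int,
    max_sessions: int,
    growth_per_hour: float,
) -> Optional[float]:
    """Estimate hours until the session limit is hit, assuming linear growth.

    Returns 0.0 if the limit is already reached, and None if the session
    count is flat or shrinking (no exhaustion is projected).
    """
    if current_sessions >= max_sessions:
        return 0.0
    if growth_per_hour <= 0:
        return None
    return (max_sessions - current_sessions) / growth_per_hour


def should_alert(
    current_sessions: int,
    max_sessions: int,
    growth_per_hour: float,
    warn_hours: float = 24.0,  # hypothetical alerting horizon
) -> bool:
    """Alert when projected exhaustion falls within the warning horizon."""
    remaining = hours_until_exhaustion(current_sessions, max_sessions, growth_per_hour)
    return remaining is not None and remaining <= warn_hours
```

With illustrative values, a shard holding 500,000 sessions against a limit of 1,000,000 and growing by 50,000 sessions per hour would be projected to run out in 10 hours, so `should_alert(500_000, 1_000_000, 50_000)` would fire well before new sessions start failing.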

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not caught because of the isolated nature of the database session resource and because of gaps in our monitoring of both this resource and DNS resolution.

We are prioritizing the following improvement actions designed to avoid repeating this type of incident:

  • Update our infrastructure management tool to use a safe fall-back DNS configuration in the case of unexpected errors.
  • Expand existing DNS monitoring to include the resolution of internal domains.
  • Expand existing database session count monitoring to include all database node types.
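The first two actions above can be sketched roughly as follows. This is an illustration under stated assumptions rather than a description of Atlassian's tooling: the `dns_resolvers` metadata key, the fall-back resolver address, and both function names are hypothetical.

```python
import socket
from typing import Callable, Iterable, List, Tuple

# Hypothetical known-good resolver that can answer internal domains.
SAFE_FALLBACK_RESOLVERS: Tuple[str, ...] = ("10.0.0.2",)


def choose_resolvers(
    fetch_metadata: Callable[[], dict],
    fallback: Tuple[str, ...] = SAFE_FALLBACK_RESOLVERS,
) -> Tuple[str, ...]:
    """Pick DNS resolvers from host metadata, with a safe fall-back.

    Any error while fetching or parsing the metadata (or an empty
    resolver list) yields the fall-back instead of the OS default.
    """
    try:
        resolvers = tuple(fetch_metadata()["dns_resolvers"])
    except Exception:  # deliberately broad: unexpected errors must not leak through
        return tuple(fallback)
    return resolvers or tuple(fallback)


def unresolved_domains(
    domains: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[str]:
    """Return the domains that fail to resolve, for use in a monitoring check."""
    failed = []
    for domain in domains:
        try:
            resolve(domain)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(domain)
    return failed
```

A monitor could run `unresolved_domains` on each host against a short list of internal hostnames and raise an alert whenever the returned list is non-empty. The `resolve` parameter is injectable so the check can be exercised without real DNS.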

Furthermore, we are prioritizing the following additional measure to reduce the duration of any future incidents:

  • Evaluate our incident response process to identify actions that can be streamlined for quicker resolution.

We apologize to customers whose services were impacted during this incident; we are taking steps designed to improve the platform’s performance and availability.

Thanks,
Atlassian Customer Support

Posted May 16, 2025 - 09:45 EDT

Resolved

On May 5, 2025, we identified degraded service for Trello. Trello is now back online and no further impact has been observed.
Posted May 05, 2025 - 13:37 EDT

Monitoring

We have identified and mitigated the issue causing Trello to be slow or unavailable for some users. We expect API traffic to return to normal within the next 30 minutes. We are now monitoring closely.

We will update within the next 30 minutes.
Posted May 05, 2025 - 12:56 EDT

Investigating

We've noticed that Trello is slow or unavailable for some users. This affects both the web and mobile apps.

Our engineering team is actively investigating this incident and working to bring Trello back up to speed as quickly as possible.

We'll keep you posted with further updates on this page.
Posted May 05, 2025 - 11:08 EDT
This incident affected: Trello.com.