On May 5, 2025, between 2:08 p.m. and 4:29 p.m. UTC, some Atlassian customers using Trello were unable to view their boards or cards. The event was triggered by an unexpected error encountered by our infrastructure management tools, which resulted in an incorrect DNS configuration being deployed to a portion of our database. The incident was detected within four minutes by automated monitoring systems and mitigated by identifying the faulty portion of the database and performing a failover, which put Atlassian systems into a known good state. The total time to resolution was about two hours and 21 minutes.
The incident affected the Trello product on May 5, 2025, between 2:08 p.m. and 4:29 p.m. UTC. It disrupted service for Trello customers whose accounts and boards contained or referenced data on the affected shard of our database. In addition, some Trello customers experienced a service disruption because we used load-shedding tools during the incident to deliberately block portions of our traffic and aid recovery.
The day before the incident, on May 4, our infrastructure management tooling encountered an unexpected error while fetching the networking metadata for a particular host. As a result, the host, which was a member of our database cluster, incorrectly applied the operating system’s default DNS configuration. That configuration could not resolve internal domains, which left the node in a partial failure state. The database continued to function normally and there was no immediate customer impact, but in the background the incorrect DNS configuration caused database sessions to build up slowly. These sessions are usually short-lived and expire automatically when no longer needed, but the DNS misconfiguration prevented that automatic expiration. The number of sessions eventually reached the default maximum on this particular shard. At that point, the shard could no longer create new sessions, which are required for all basic operations, and the Trello product began experiencing elevated error rates.
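To illustrate the failure mode described above, here is a minimal sketch of how a slowly growing session count could be flagged before it reaches a shard's limit. It is illustrative only and does not represent Atlassian's tooling: the function name, the thresholds, and every number in the usage note are hypothetical, and the growth rate is estimated with a simple linear extrapolation between the oldest and newest samples.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class SessionAlert:
    utilization: float               # fraction of the session limit currently in use
    hours_to_limit: Optional[float]  # None when the session count is not growing
    should_alert: bool


def check_session_growth(
    samples: Sequence[Tuple[float, int]],
    max_sessions: int,
    warn_utilization: float = 0.5,   # hypothetical threshold
    warn_hours: float = 12.0,        # hypothetical threshold
) -> SessionAlert:
    """Estimate when open sessions would hit the limit.

    samples: (hours_since_start, open_session_count) pairs, oldest first.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be ordered by increasing time")

    # Linear growth rate in sessions per hour between first and last sample.
    rate = (n1 - n0) / (t1 - t0)
    utilization = n1 / max_sessions
    hours_to_limit = (max_sessions - n1) / rate if rate > 0 else None

    should_alert = utilization >= warn_utilization or (
        hours_to_limit is not None and hours_to_limit <= warn_hours
    )
    return SessionAlert(utilization, hours_to_limit, should_alert)
```

For example, with a hypothetical limit of 10,000 sessions, `check_session_growth([(0, 1000), (10, 6000)], max_sessions=10000)` reports 60% utilization and an estimated 8 hours until the limit, so it would raise an alert well before new sessions start failing.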
We know that outages impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not caught because the database session resource is isolated and because we had monitoring gaps around both that resource and DNS resolution.
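As an illustration of the kind of DNS resolution check that would surface a host unable to resolve internal domains, here is a minimal sketch. It is an assumption about how such a probe might look, not a description of Atlassian's monitoring: the function names and the hostnames in the usage note are hypothetical. The resolver is injectable so the check can be exercised without network access.

```python
import socket
from typing import Callable, Dict, Iterable


def resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves, False on a resolution error."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        # getaddrinfo raises gaierror when the name cannot be resolved.
        return False


def check_internal_dns(
    hostnames: Iterable[str], resolver: Callable = socket.getaddrinfo
) -> Dict[str, bool]:
    """Map each hostname to whether it currently resolves on this host."""
    return {name: resolves(name, resolver) for name in hostnames}
```

Run periodically on each database host against a short list of internal names (for example, a hypothetical `db1.internal.example`), any `False` result would indicate that the host has lost the ability to resolve internal domains, even while the database itself still appears healthy.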
We are prioritizing the following improvement actions designed to avoid repeating this type of incident:
Furthermore, we are prioritizing the following additional measures to reduce the duration of any future incidents:
We apologize to customers whose services were impacted during this incident; we are taking steps designed to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support