On April 24, 2023, between 10:50 p.m. and 11:18 p.m. UTC, most Trello users experienced errors when trying to view or edit their boards and cards. The incident occurred during routine database maintenance, which erroneously updated DNS records. It affected customers in all regions and on all devices, including web browsers, desktops, and mobile apps. Our automated monitoring systems detected the incident within three minutes, and we mitigated it by identifying and reverting the erroneous DNS changes. The total time to resolution was approximately 28 minutes.
Trello experienced a service disruption lasting approximately 28 minutes that affected a large set of the users who were active during the outage window. During this time, key actions such as loading boards and cards frequently failed. Some boards and cards may have loaded successfully, but for most users the application failed to load and was unusable.
In the process of performing database maintenance, DNS records for two database servers were erroneously updated to point to new servers that were not yet ready for service. This caused database queries to those hosts to fail.
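One way to picture a safeguard against this kind of change is a readiness check that runs before a DNS record is repointed. The following is a minimal sketch only, not a description of Trello's tooling: `ServerStatus`, its fields, `safe_to_repoint`, and `plan_dns_update` are hypothetical names invented for illustration, and the hostnames and addresses are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerStatus:
    # Hypothetical readiness signals; a real system would define its own.
    address: str
    in_replica_set: bool  # server has joined the replica set
    data_synced: bool     # server holds a current copy of the data


def safe_to_repoint(target: ServerStatus) -> bool:
    """Return True only when the target could already serve queries."""
    return target.in_replica_set and target.data_synced


def plan_dns_update(record: str, target: ServerStatus) -> str:
    """Describe the DNS change, or refuse it if the target is not ready."""
    if not safe_to_repoint(target):
        raise ValueError(
            f"refusing to point {record} at {target.address}: server not ready"
        )
    return f"{record} -> {target.address}"


# Placeholder data: one ready server and one that has not synced yet.
ready = ServerStatus("10.0.0.1", in_replica_set=True, data_synced=True)
not_ready = ServerStatus("10.0.0.9", in_replica_set=False, data_synced=False)

print(plan_dns_update("db-a.example.internal", ready))
```

With the `not_ready` server, `plan_dns_update` raises `ValueError` instead of returning a change, so the record would never be pointed at a server that cannot answer queries.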
The database is designed with redundancy and should quickly and automatically fail over to a healthy server. We test this behavior on a regular basis. In this instance, however, the replica set was operating normally among its participating nodes, so the normal failover process did not trigger.
The erroneous DNS update prevented services that query this replica set from reaching it; their queries instead went to newly added servers that did not have data. This partial failure state was previously untested and led to a longer diagnosis and recovery time than expected. It took approximately 3 minutes to detect the outage, 19 minutes to discover the root cause, 3 minutes to implement the fix, and 3 minutes for systems to recover.
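The partial failure described above, where the database nodes are healthy but clients are directed elsewhere, can be detected by checking DNS from the client's point of view. The sketch below is illustrative only and makes several assumptions: `find_misdirected_hosts` is a hypothetical helper, the resolver is passed in as a function so the example runs without network access, and all hostnames and addresses are placeholders.

```python
from typing import Callable, Dict, Iterable, List, Set


def find_misdirected_hosts(
    hostnames: Iterable[str],
    resolve: Callable[[str], Set[str]],
    member_addresses: Set[str],
) -> Dict[str, List[str]]:
    """Return hostnames whose DNS answers fall outside the replica set.

    `resolve` maps a hostname to the set of addresses a client would get;
    `member_addresses` is the set of addresses of actual replica set members.
    """
    misdirected: Dict[str, List[str]] = {}
    for host in hostnames:
        unexpected = sorted(resolve(host) - member_addresses)
        if unexpected:
            misdirected[host] = unexpected
    return misdirected


# Placeholder data: the replica set is healthy, but one record is wrong.
members = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}
dns = {
    "db-a.example.internal": {"10.0.0.1"},
    "db-b.example.internal": {"10.0.0.9"},  # points at a non-member server
    "db-c.example.internal": {"10.0.0.3"},
}

result = find_misdirected_hosts(dns, lambda host: dns[host], members)
print(result)  # {'db-b.example.internal': ['10.0.0.9']}
```

In a real check, `resolve` could wrap a standard lookup such as Python's `socket.getaddrinfo`. The point of the design is that the comparison uses what clients resolve rather than what the database nodes report about each other.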
We know that outages impact your productivity, and we strive to avoid incidents like this one.
We are prioritizing the following efforts to avoid repeating this type of incident:
We apologize to customers who were affected during this incident. We are taking these immediate steps to improve the platform’s availability.
The Trello team