Trello is slow
Incident Report for Trello
Postmortem

SUMMARY

On April 24, 2023, between 10:50 p.m. and 11:18 p.m. UTC, most Trello users experienced errors when trying to view or edit their boards and cards. The incident occurred during routine database maintenance, which erroneously updated DNS records. It affected customers in all regions and on all devices, including web browsers, desktop apps, and mobile apps. Our automated monitoring systems detected the incident within three minutes, and it was mitigated by identifying and reverting the erroneous DNS changes. The total time to resolution was approximately 28 minutes.

IMPACT

Trello experienced a service disruption lasting approximately 28 minutes that affected a large share of the users active during the outage window. During this time, key actions such as loading boards and cards frequently failed. Some boards and cards may have loaded successfully, but for most users the application failed to load and was unusable.

ROOT CAUSE

During database maintenance, DNS records for two database servers were erroneously updated to point to new servers that were not yet ready for service. This caused database queries to those hosts to fail.

The database is designed with redundancy and should quickly and automatically fail over to a healthy server. We test this behavior on a regular basis. However, in this particular instance, the replica set was operating normally among its participating nodes, so the normal failover process was never triggered.
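The gap between what the database nodes saw and what client services saw can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not a description of Trello's actual topology or database software: the `Member` type, the `election_needed` and `client_can_query` helpers, and all host names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """A hypothetical replica set member as seen by its peers."""
    name: str
    healthy: bool = True


def election_needed(members: list[Member], primary: str) -> bool:
    """Failover is triggered only when the members see the primary as down."""
    return not any(m.name == primary and m.healthy for m in members)


def client_can_query(dns: dict[str, str], hostname: str, serving: set[str]) -> bool:
    """A client succeeds only if DNS resolves the hostname to a server holding data."""
    return dns.get(hostname) in serving


# Hypothetical three-member set; every member is healthy from the inside.
members = [Member("db-a"), Member("db-b"), Member("db-c")]
serving = {m.name for m in members}
dns = {"db-a.internal": "db-a", "db-b.internal": "db-b", "db-c.internal": "db-c"}

# The erroneous change: two records now point at servers that are not ready.
dns["db-a.internal"] = "db-new-1"
dns["db-b.internal"] = "db-new-2"

failover_triggered = election_needed(members, primary="db-a")
reachable = [h for h in dns if client_can_query(dns, h, serving)]
```

In this model `failover_triggered` stays `False` because the members still consider the primary healthy, while clients can no longer reach the two redirected host names. Health as judged by the members and reachability as judged by clients are separate signals.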

The erroneous DNS update prevented services that query this replica set from reaching it; their queries instead went to the newly added servers, which did not have data. This partial failure state was previously untested and led to a longer diagnosis and recovery time than expected. It took approximately 3 minutes to detect the outage, 19 minutes to discover the root cause, 3 minutes to implement the fix, and 3 minutes for systems to recover.
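As a quick cross-check of the timeline, the four phases add up to the 28-minute window given in the summary. The sketch below uses only the figures stated in this report; the phase labels and variable names are illustrative.

```python
from datetime import datetime, timedelta

# Approximate phase durations in minutes, as stated in this report.
phases = {"detect": 3, "diagnose": 19, "fix": 3, "recover": 3}

start = datetime(2023, 4, 24, 22, 50)  # 10:50 p.m. UTC
total_minutes = sum(phases.values())
end = start + timedelta(minutes=total_minutes)

# Fraction of the outage spent finding the root cause.
diagnosis_share = phases["diagnose"] / total_minutes
```

Here `end` lands on 11:18 p.m. UTC, matching the summary, and diagnosis accounts for roughly two thirds of the outage.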

REMEDIAL ACTION PLAN & NEXT STEPS

We know that outages impact your productivity, and we strive to avoid incidents like these.

We are prioritizing the following efforts to avoid repeating this type of incident:

  • Create additional safety checks for DNS record changes in our infrastructure management systems. These checks have been developed, tested, and deployed.
  • Research and test methods for improving automatic database failover during this partial failure state.
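This report does not describe how the deployed DNS safety checks work. Purely as an illustration of the general idea, a pre-change check might refuse to repoint a record at a target that has not been confirmed ready. Everything in this sketch is an assumption: the `RecordChange` type, the `validate_change` function, the readiness probe, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RecordChange:
    """A hypothetical proposed DNS change for a single host name."""
    hostname: str
    old_target: str
    new_target: str


def validate_change(change: RecordChange, is_ready: Callable[[str], bool]) -> list[str]:
    """Return the reasons to block the change; an empty list means it may proceed."""
    problems = []
    if change.new_target == change.old_target:
        problems.append("no-op change")
    elif not is_ready(change.new_target):
        problems.append(f"{change.new_target} is not ready to serve {change.hostname}")
    return problems


# Hypothetical readiness source: the set of servers known to hold data.
ready_hosts = {"db-a", "db-b"}


def probe(target: str) -> bool:
    return target in ready_hosts


blocked = validate_change(RecordChange("db-a.internal", "db-a", "db-new-1"), probe)
allowed = validate_change(RecordChange("db-a.internal", "db-a", "db-b"), probe)
```

Passing the readiness probe in as a callable keeps the check independent of how readiness is actually measured, such as a health endpoint or a replication-lag query.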

We apologize to customers who were affected during this incident. We are taking these immediate steps to improve the platform’s availability.

Thanks,

The Trello team

Posted May 15, 2023 - 10:57 EDT

Resolved
This incident has been resolved. If you're still seeing issues, please reach out at https://trello.com/contact/.
Posted Apr 24, 2023 - 19:44 EDT
Monitoring
Trello is operational. We'll continue to investigate the root cause and monitor until the issue is resolved.
Posted Apr 24, 2023 - 19:34 EDT
Identified
We've noticed that Trello is responding slowly. This affects both the web and mobile apps.

Our engineering team is actively investigating this incident and working to bring Trello back up to speed as quickly as possible.

We'll keep you posted with further updates on this page.
Posted Apr 24, 2023 - 19:23 EDT
This incident affected: Trello.com.