Trello is slow or unavailable

Incident Report for Trello

Postmortem

SUMMARY

On Wednesday, February 2, 2022, from 13:09 to 13:40 UTC and again from 20:56 to 21:23 UTC, Atlassian customers using Trello may have experienced service interruptions.

These incidents were detected within 3 minutes by automated monitoring systems and mitigated by restarting database processes, which put Trello systems into a known good state. Total time to resolution was 31 minutes for the first incident and 27 minutes for the second incident.

ROOT CAUSE

These service interruptions were caused by a series of events. One of the shards of our primary database lost a replica node due to AWS hardware degradation. Normally, data replication allows our systems to continue operating uninterrupted when a single node is unavailable. In this case, automatically configured default settings from a MongoDB update caused built-in flow control to block writes to the primary node indefinitely after the secondary node outage. Because writes were blocked, incoming network connections accumulated and exhausted resources on the primary node, bringing it down as well.

In both cases, service interruptions were resolved by restarting MongoDB processes on the nodes.

The second interruption occurred while we were still actively investigating the root cause of the first. After the second interruption, we were able to identify that the root cause of the primary node failures was flow control blocking database writes. We updated MongoDB flow control settings to prevent future service interruptions.
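
As an illustration only, the kind of settings change described above could look like the sketch below. This report does not name the setting that was changed; the sketch assumes it corresponds to MongoDB's enableFlowControl server parameter, and the helper name flow_control_command is hypothetical.

```python
from typing import Any, Dict


def flow_control_command(enabled: bool) -> Dict[str, Any]:
    """Build a setParameter command document that toggles flow control.

    Assumption: the setting involved is MongoDB's enableFlowControl
    server parameter. The report itself does not name it.
    """
    # The command name must be the first key of the document;
    # Python dicts preserve insertion order.
    return {"setParameter": 1, "enableFlowControl": enabled}


# Command that would turn flow control off on a running mongod.
disable_cmd = flow_control_command(False)
```

With a driver such as PyMongo, a document like this would typically be sent to the admin database, for example MongoClient(uri).admin.command(disable_cmd). A runtime change of this kind does not survive a restart, so the same parameter would usually also be set in the mongod configuration file.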

REMEDIAL ACTIONS PLAN & NEXT STEPS

We apologize if you were impacted during these service interruptions. We know that outages are disruptive to your productivity. We are prioritizing the following actions to improve Trello’s reliability and to avoid repeating this type of service interruption in the future:

Disabling the MongoDB setting that caused the outage
Updating processes around DR (Disaster Recovery) testing
Additional monitoring and alerting for network connections to the database
Exploring more resilient MongoDB replica set configurations
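
A minimal sketch of what the connection monitoring item above might check, assuming the alert is driven by the connections section of MongoDB's serverStatus output (which reports current and available connection counts). The function names and the 0.8 threshold are hypothetical and not taken from Trello's systems.

```python
from typing import Any, Dict


def connection_utilization(server_status: Dict[str, Any]) -> float:
    """Return the fraction of the connection limit currently in use."""
    conns = server_status["connections"]
    current = conns["current"]
    available = conns["available"]
    total = current + available
    # Guard against a malformed document reporting zero capacity.
    return current / total if total else 0.0


def should_alert(server_status: Dict[str, Any], threshold: float = 0.8) -> bool:
    """Flag a node whose connection usage is at or above the threshold.

    The default threshold is an illustrative assumption.
    """
    return connection_utilization(server_status) >= threshold
```

In practice the input would come from running the serverStatus command against each node, so that a primary accumulating connections behind blocked writes is flagged before it runs out of resources.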

Thanks,

Atlassian Customer Support

Posted Feb 11, 2022 - 17:09 EST

Resolved

The root cause of the issue has been resolved. If you are still experiencing any issues, please reach out to us at https://trello.com/contact.

Posted Feb 02, 2022 - 21:10 EST

Monitoring

Monitoring: Trello service has been restored. We will continue to monitor the situation while we confirm the incident has been fully resolved.

Posted Feb 02, 2022 - 16:39 EST

Investigating

Trello is currently slow or unavailable.

Our engineering team is actively investigating this incident and working to bring Trello back up as quickly as possible.

Users affected by this incident may notice that Trello is slow or completely unavailable in both the web and mobile apps.

We will update this page as we have additional information.

Posted Feb 02, 2022 - 16:08 EST

This incident affected: Trello.com and API.