On Wednesday, February 2, 2022, from 13:09 to 13:40 UTC and again from 20:56 to 21:23 UTC, Atlassian customers using Trello may have experienced service interruptions.
Automated monitoring systems detected each incident within 3 minutes. We mitigated both by restarting database processes, which returned Trello systems to a known-good state. Total time to resolution was 31 minutes for the first incident and 27 minutes for the second.
These service interruptions were caused by a series of events. One of the shards of our primary database lost a replica node due to AWS hardware degradation. Normally, data replication allows our systems to continue operating uninterrupted when a single node is unavailable. In this case, default settings that were applied automatically by a MongoDB update caused the database's built-in flow control to block writes to the primary node indefinitely after the secondary node outage. Because writes were blocked, incoming network connections accumulated until they exhausted resources on the primary node, which brought the primary down as well.
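The failure chain described above can be illustrated with a toy model. The sketch below is not MongoDB code and does not reproduce how flow control is actually implemented; every number in it (connection limit, lag threshold, arrival rate, throughput) is a hypothetical value chosen only to show how blocked writes can turn a single lost secondary into connection exhaustion on the primary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Primary:
    max_connections: int = 100    # hypothetical connection limit
    target_lag_seconds: int = 10  # hypothetical lag threshold for the toy flow control
    open_connections: int = 0

    def write_tickets(self, replica_lag_seconds: float, flow_control: bool) -> int:
        """Number of writes the primary may complete in one tick (toy model)."""
        if flow_control and replica_lag_seconds > self.target_lag_seconds:
            return 0  # writes fully blocked, matching the incident description
        return 50     # hypothetical normal throughput per tick


def ticks_until_exhaustion(flow_control: bool,
                           arrivals_per_tick: int = 20,
                           max_ticks: int = 60) -> Optional[int]:
    """Return the tick at which connections run out, or None if the primary survives."""
    primary = Primary()
    lag = 0.0
    for tick in range(1, max_ticks + 1):
        lag += 1.0  # the lost secondary falls one second further behind each tick
        primary.open_connections += arrivals_per_tick  # each new connection carries one write
        completed = min(primary.open_connections,
                        primary.write_tickets(lag, flow_control))
        primary.open_connections -= completed
        if primary.open_connections > primary.max_connections:
            return tick
    return None


if __name__ == "__main__":
    print("flow control on: ", ticks_until_exhaustion(flow_control=True))
    print("flow control off:", ticks_until_exhaustion(flow_control=False))
```

With these made-up numbers, the model reports exhaustion at tick 16 when flow control is on and `None` when it is off: once the lag crosses the threshold, no write completes, so every new connection stays open until the limit is exceeded.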
In both cases, service interruptions were resolved by restarting MongoDB processes on the nodes.
The second interruption occurred while we were still investigating the root cause of the first. After the second interruption, we identified flow control blocking database writes as the root cause of the primary node failures. We then updated our MongoDB flow control settings to prevent future service interruptions.
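This report does not state which flow control settings were changed or to what values. As a purely illustrative sketch, MongoDB 4.2 and later expose a server parameter named `enableFlowControl` that can be set at runtime with the `setParameter` admin command. The helper name `flow_control_command` and the connection URI below are hypothetical, and choosing to disable flow control outright is an assumption made for the example, not a description of the change we applied.

```python
def flow_control_command(enabled: bool) -> dict:
    """Build a setParameter command document that toggles enableFlowControl."""
    # The command name must be the first key; Python dicts preserve insertion order.
    return {"setParameter": 1, "enableFlowControl": enabled}


# Hypothetical usage with PyMongo (the URI is a placeholder):
#   from pymongo import MongoClient
#   MongoClient("mongodb://localhost:27017").admin.command(flow_control_command(False))
```

Any such change should be weighed against the replication lag that flow control is designed to limit.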
We apologize if you were impacted during these service interruptions. We know that outages are disruptive to your productivity. We are prioritizing actions to improve Trello’s reliability and to avoid repeating this type of service interruption in the future.
Atlassian Customer Support