On Nov 30 2023, between 14:04 and 16:57 UTC, Atlassian customers using Trello experienced errors when accessing and interacting with the application. This incident impacted Trello users on the iOS and Android mobile apps as well as those using the Trello web app. The event was triggered by the release of a code change that eventually overloaded a critical part of the Trello database. The incident was detected immediately by our automated monitoring systems and was mitigated by disabling the relevant code change. The incident was prolonged by the failure of a secondary service whose recovery increased load on the same critical part of the Trello database, creating a self-reinforcing feedback loop. Recovering this secondary service involved reestablishing over a million connections, with each connection attempt adding load to that same part of the database. We aided the recovery by intentionally blocking some inbound Trello traffic to reduce load on the database and by increasing the capacity of the Trello database to better handle the high load. Over time all of the connections were successfully reestablished, returning Trello to a known good state. The total time to resolution was just under 3 hours.
The impact window was Nov 30 2023, 14:04 UTC to 16:57 UTC and was limited to the Trello product. The incident caused service disruption for all Trello customers. Our metrics show elevated API response times and increased error rates throughout the incident period, indicating that most users were unable to load Trello at all or could not interact with the application reliably. The overloaded database collection is one that the Trello service must query to make authorization decisions, so all requests were affected.
The issue was caused by a series of changes intended to standardize Trello’s approach to authorizing requests; these changes had the unintended side effect of turning a database query from a targeted operation into a broadcast operation. Broadcast operations are more resource-intensive because they must be sent to every database server to be satisfied. These broadcast operations eventually overloaded some of the Trello database servers as Trello approached its daily peak usage period on Nov 30 2023.
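For illustration, the sketch below contrasts a targeted query with a broadcast (scatter-gather) query on a sharded cluster. It assumes a MongoDB-style sharded collection with a hypothetical board_memberships collection and an idBoard shard key; the report does not name the database technology, collections, or fields involved.

```python
from pymongo import MongoClient

# Hypothetical names for illustration only; the report does not identify
# the database technology, collection, or fields involved.
client = MongoClient("mongodb://localhost:27017")
memberships = client["trello"]["board_memberships"]

board_id = "example-board"    # placeholder shard-key value
member_id = "example-member"  # placeholder member id

# Targeted operation: the filter includes the shard key ("idBoard" here),
# so the query router can send it to the one shard that owns that key range.
targeted = memberships.find_one({"idBoard": board_id, "idMember": member_id})

# Broadcast (scatter-gather) operation: the filter omits the shard key, so
# the router must send the query to every shard and merge the results.
# The same request now consumes resources on all database servers.
broadcast = memberships.find_one({"idMember": member_id})
```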
When these database nodes were overloaded, users' HTTP requests either received very slow responses or failed with HTTP 504 errors. As we activated our load shedding strategies, some users received HTTP 429 errors instead.
The incident’s length can be attributed to a secondary failure in which our websocket servers experienced a rapid increase in memory usage, causing processes to crash with OutOfMemoryErrors. As new servers came online and the websockets attempted to reconnect, they once again generated broadcast queries against the Trello database servers. These broadcast queries kept the database under load, which meant the Trello API continued to exhibit high latency, perpetuating the feedback loop. We are working to determine the root cause of the OutOfMemoryErrors.
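To make that loop concrete, here is a minimal sketch of a websocket connection handler that authorizes every new connection with a database lookup. The handler, the is_member_of_board helper, and the field names are illustrative assumptions rather than Trello’s actual code; the point is that a mass reconnection repeats the per-connection authorization query, so a broadcast query turns over a million reconnects into over a million scatter-gather operations.

```python
import asyncio
import json

import websockets  # assumed websocket server library, for illustration only
from pymongo import MongoClient

# Hypothetical collection and field names; not taken from the report.
memberships = MongoClient("mongodb://localhost:27017")["trello"]["board_memberships"]

def is_member_of_board(member_id: str, board_id: str) -> bool:
    # Authorization lookup performed for every connection. If this filter
    # is not targeted by the shard key, every call is a broadcast query.
    return memberships.find_one({"idBoard": board_id, "idMember": member_id}) is not None

async def handler(websocket):
    # Each (re)connection must authorize itself before receiving board updates.
    hello = json.loads(await websocket.recv())
    if not is_member_of_board(hello["idMember"], hello["idBoard"]):
        await websocket.close(code=4403, reason="not authorized")
        return
    # When a fleet of websocket servers restarts, clients reconnect en masse
    # and this authorization query runs once per connection, feeding load
    # back onto the already overloaded database.
    async for _message in websocket:
        pass  # a real server would relay board updates here

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8080):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```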
After the incident we also determined that, because the Trello application servers make the load shedding decision after performing the authorization step, the overloaded database servers were still being queried even for requests that were ultimately rejected. We are working to improve our load shedding strategies post incident.
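As a simplified illustration of that ordering problem, the sketch below compares a request pipeline that authorizes before shedding with one that sheds on purely local signals first, so rejected requests never reach the overloaded database. The function names, the in-flight counter, and the threshold are assumptions for the sketch, not Trello’s implementation.

```python
# Illustrative request pipeline; names, thresholds, and structure are
# assumptions, not Trello's actual implementation.

MAX_IN_FLIGHT = 100
in_flight = 0  # a real server would track concurrent requests here

def authorize(request: dict) -> bool:
    """Stand-in for the authorization check that queries the database."""
    return request.get("member") is not None

def handle_before_fix(request: dict) -> int:
    # Ordering described in the report: the database-backed authorization
    # step runs first, so even requests about to be shed still add a query
    # to the overloaded database servers.
    allowed = authorize(request)
    if in_flight > MAX_IN_FLIGHT:
        return 429  # shed, but only after paying the database cost
    return 200 if allowed else 403

def handle_after_fix(request: dict) -> int:
    # Improved ordering: shed using only local signals (here, a simple
    # in-flight counter) before doing any database work at all.
    if in_flight > MAX_IN_FLIGHT:
        return 429  # shed without touching the database
    return 200 if authorize(request) else 403

if __name__ == "__main__":
    print(handle_before_fix({"member": "m1"}), handle_after_fix({"member": "m1"}))
```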
We know that outages impact your productivity, and we are continually working to improve our testing and preventative processes to reduce the likelihood of similar outages in the future.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Increase the capacity of our database (completed during the incident).
Refactor the new authorization approach to avoid broadcast operations.
Add pre-deployment tests to avoid releasing unnecessary broadcast operations (a sketch of one possible check follows this list).
Determine the root cause of the secondary failure of the websocket service.
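One possible shape for such a pre-deployment test is sketched below: run the query’s explain plan against a sharded test cluster and fail if the plan is not targeted to a single shard. This assumes a MongoDB-style sharded cluster and hypothetical collection and field names; the report does not describe the actual test approach, and explain output can vary by server version.

```python
from pymongo import MongoClient

def assert_targeted(collection, query: dict) -> None:
    """Fail if the query would be broadcast (scatter-gather) to every shard."""
    plan = collection.find(query).explain()["queryPlanner"]["winningPlan"]
    # Against a sharded cluster, the query router reports SINGLE_SHARD for
    # targeted queries and SHARD_MERGE for broadcast ones.
    if plan.get("stage") != "SINGLE_SHARD":
        raise AssertionError(f"query would broadcast to all shards: {query}")

def test_membership_lookup_is_targeted():
    # Hypothetical test cluster URI, collection, and shard key ("idBoard").
    memberships = MongoClient("mongodb://test-cluster:27017")["trello"]["board_memberships"]
    assert_targeted(memberships, {"idBoard": "example-board", "idMember": "example-member"})
```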
Furthermore, we deploy our changes only after thorough review and automated testing, and we deploy them progressively using feature flags to avoid broad impact. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as those listed above.
We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support