On Nov 30 2023, between 14:04 and 16:57 UTC, Atlassian customers using Trello experienced errors when accessing and interacting with the application. This incident impacted Trello users on the iOS and Android mobile apps as well as those using the Trello web app. The event was triggered by the release of a code change that eventually overloaded a critical part of the Trello database. The incident was detected immediately by our automated monitoring systems and was mitigated by disabling the relevant code change. The incident was prolonged by the failure of a secondary service whose recovery increased load on the same critical part of the Trello database, creating a self-reinforcing feedback loop. Recovering this secondary service involved reestablishing over a million connections, with each connection attempt adding load to that same part of the database. We aided the recovery by intentionally blocking some inbound Trello traffic to reduce load on the database and by increasing the capacity of the Trello database to better handle the high load. Over time all of the connections were successfully reestablished, returning Trello to a known good state. The total time to resolution was just under 3 hours.
The impact window was Nov 30 2023, 14:04 UTC to 16:57 UTC and was limited to the Trello product. The incident caused service disruption for all Trello customers. Our metrics show elevated API response times and increased error rates throughout the incident period, indicating that most users were unable to load Trello at all or could not interact with the application reliably. The overloaded database collection is one that the Trello service must query to make authorization decisions, so all requests were affected.
The issue was caused by a series of changes intended to standardize Trello’s approach to authorizing requests; these changes had the unintended side effect of turning a database query from a targeted operation into a broadcast operation. Broadcast operations are more resource-intensive because they must be sent to every database server to be satisfied. These broadcast operations eventually overloaded some of the Trello database servers as Trello approached its daily peak usage period on Nov 30 2023.
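For illustration, the sketch below contrasts a targeted query with a broadcast (scatter-gather) query on a sharded cluster. It assumes a MongoDB-style sharded collection with a hypothetical board_memberships collection and an idBoard shard key; the report does not name the database technology, collections, or fields involved.

```python
from pymongo import MongoClient

# Hypothetical names for illustration only; the report does not identify
# the database technology, collection, or fields involved.
client = MongoClient("mongodb://localhost:27017")
memberships = client["trello"]["board_memberships"]

board_id = "example-board"    # placeholder shard-key value
member_id = "example-member"  # placeholder member id

# Targeted operation: the filter includes the shard key ("idBoard" here),
# so the query router can send it to the one shard that owns that key range.
targeted = memberships.find_one({"idBoard": board_id, "idMember": member_id})

# Broadcast (scatter-gather) operation: the filter omits the shard key, so
# the router must send the query to every shard and merge the results.
# The same request now consumes resources on all database servers.
broadcast = memberships.find_one({"idMember": member_id})
```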
When these database nodes were overloaded, users' HTTP requests either received very slow responses or failed with HTTP 504 errors. As we activated our load shedding strategies, some users received HTTP 429 errors instead.
The incident’s length can be attributed to a secondary failure in which our websocket servers experienced a rapid increase in memory usage, causing processes to crash with OutOfMemoryErrors. As new servers came online and the websockets attempted to reconnect, they once again generated broadcast queries against the Trello database servers. These broadcast queries kept the database under load, which meant the Trello API continued to exhibit high latency, perpetuating the feedback loop. We are working to determine the root cause of the OutOfMemoryErrors.
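To make that loop concrete, here is a minimal sketch of a websocket connection handler that authorizes every new connection with a database lookup. The handler, the is_member_of_board helper, and the field names are illustrative assumptions rather than Trello’s actual code; the point is that a mass reconnection repeats the per-connection authorization query, so a broadcast query turns over a million reconnects into over a million scatter-gather operations.

```python
import asyncio
import json

import websockets  # assumed websocket server library, for illustration only
from pymongo import MongoClient

# Hypothetical collection and field names; not taken from the report.
memberships = MongoClient("mongodb://localhost:27017")["trello"]["board_memberships"]

def is_member_of_board(member_id: str, board_id: str) -> bool:
    # Authorization lookup performed for every connection. If this filter
    # is not targeted by the shard key, every call is a broadcast query.
    return memberships.find_one({"idBoard": board_id, "idMember": member_id}) is not None

async def handler(websocket):
    # Each (re)connection must authorize itself before receiving board updates.
    hello = json.loads(await websocket.recv())
    if not is_member_of_board(hello["idMember"], hello["idBoard"]):
        await websocket.close(code=4403, reason="not authorized")
        return
    # When a fleet of websocket servers restarts, clients reconnect en masse
    # and this authorization query runs once per connection, feeding load
    # back onto the already overloaded database.
    async for _message in websocket:
        pass  # a real server would relay board updates here

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8080):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```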
After the incident we also determined that, because the Trello application servers make the load shedding decision after performing the authorization step, the overloaded database servers were still being queried even for requests that were ultimately rejected. We are working to improve our load shedding strategies post incident.
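As a simplified illustration of that ordering problem, the sketch below compares a request pipeline that authorizes before shedding with one that sheds on purely local signals first, so rejected requests never reach the overloaded database. The function names, the in-flight counter, and the threshold are assumptions for the sketch, not Trello’s implementation.

```python
# Illustrative request pipeline; names, thresholds, and structure are
# assumptions, not Trello's actual implementation.

MAX_IN_FLIGHT = 100
in_flight = 0  # a real server would track concurrent requests here

def authorize(request: dict) -> bool:
    """Stand-in for the authorization check that queries the database."""
    return request.get("member") is not None

def handle_before_fix(request: dict) -> int:
    # Ordering described in the report: the database-backed authorization
    # step runs first, so even requests about to be shed still add a query
    # to the overloaded database servers.
    allowed = authorize(request)
    if in_flight > MAX_IN_FLIGHT:
        return 429  # shed, but only after paying the database cost
    return 200 if allowed else 403

def handle_after_fix(request: dict) -> int:
    # Improved ordering: shed using only local signals (here, a simple
    # in-flight counter) before doing any database work at all.
    if in_flight > MAX_IN_FLIGHT:
        return 429  # shed without touching the database
    return 200 if authorize(request) else 403

if __name__ == "__main__":
    print(handle_before_fix({"member": "m1"}), handle_after_fix({"member": "m1"}))
```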
We know that outages impact your productivity, and we are continually working to improve our testing and preventative processes to reduce the likelihood of similar outages in the future.
We are prioritizing the following improvement actions to avoid repeating this type of incident:
Increase the capacity of our database (completed during the incident).
Refactor the new authorization approach to avoid broadcast operations.
Add pre-deployment tests to avoid releasing unnecessary broadcast operations (a sketch of one possible check follows this list).
Determine the root cause of the secondary failure of the websocket service.
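One possible shape for such a pre-deployment test is sketched below: run the query’s explain plan against a sharded test cluster and fail if the plan is not targeted to a single shard. This assumes a MongoDB-style sharded cluster and hypothetical collection and field names; the report does not describe the actual test approach, and explain output can vary by server version.

```python
from pymongo import MongoClient

def assert_targeted(collection, query: dict) -> None:
    """Fail if the query would be broadcast (scatter-gather) to every shard."""
    plan = collection.find(query).explain()["queryPlanner"]["winningPlan"]
    # Against a sharded cluster, the query router reports SINGLE_SHARD for
    # targeted queries and SHARD_MERGE for broadcast ones.
    if plan.get("stage") != "SINGLE_SHARD":
        raise AssertionError(f"query would broadcast to all shards: {query}")

def test_membership_lookup_is_targeted():
    # Hypothetical test cluster URI, collection, and shard key ("idBoard").
    memberships = MongoClient("mongodb://test-cluster:27017")["trello"]["board_memberships"]
    assert_targeted(memberships, {"idBoard": "example-board", "idMember": "example-member"})
```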
Furthermore, we deploy our changes only after thorough review and automated testing, and we deploy them progressively using feature flags to avoid broad impact. To minimize the impact of breaking changes to our environments, we will implement additional preventative measures such as those listed above.
We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.
Thanks,
Atlassian Customer Support