On August 17, 2022, between 4:39pm UTC and 5:03pm UTC, Trello's Redis cluster that supports the Trello API experienced several short bursts of degraded performance (one to two minutes each), followed by a seven-minute outage. During that time, Trello's API either took a long time to respond or failed to respond altogether. For web users, this made the Trello application unresponsive to user input, while mobile users experienced an offline mode. The Redis failure was caused by the deployment of an internal tool that used an inefficient command for querying data from Redis, which slowed the Redis cluster significantly and caused the degraded performance experienced on the site.
During the event, there were three periods of degraded performance followed by a total site outage. The timeline was as follows:
During the periods of degraded performance, Trello users experienced slowness when loading boards and cards, creating boards and cards, inviting users, etc. For the seven minutes when Trello was completely down, the site was unavailable.
The incident was detected by internal monitoring within 10 minutes of the first degradation occurrence. The site recovered when Redis automatically failed over to a backup instance and began responding to requests as normal. The total time to resolution was 24 minutes.
The issue was caused by a change to Trello's server codebase where a Redis function, used primarily for debugging purposes, was shipped to production for use by an internal tool. When this function was executed, it significantly slowed the Redis cluster, causing it to ultimately fail over. The root cause of the incident was the failure to prevent the use of this function in Trello's production environment.
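As an illustration of the kind of safeguard the root cause points to, the following is a minimal sketch of an application-level guard that rejects denylisted Redis commands when running in production. It is not Trello's actual fix. The names `BLOCKED_IN_PRODUCTION`, `BlockedCommandError`, and `check_command` are hypothetical, and since this report does not name the command involved, `KEYS` is used only as a well-known example of a Redis command whose cost grows with the size of the keyspace.

```python
# Hypothetical denylist: the incident report does not name the command.
# KEYS is used here only as a familiar example of an O(N) Redis command.
BLOCKED_IN_PRODUCTION = {"KEYS"}


class BlockedCommandError(RuntimeError):
    """Raised when a denylisted command is attempted in production."""


def check_command(command, environment):
    """Return the normalized command name, or raise if it is blocked.

    `command` is the raw command string (for example "KEYS card:*").
    `environment` is a label such as "production" or "staging".
    """
    parts = command.strip().split()
    if not parts:
        raise ValueError("empty command")
    name = parts[0].upper()
    if environment == "production" and name in BLOCKED_IN_PRODUCTION:
        raise BlockedCommandError(f"{name} is not allowed in production")
    return name
```

A client wrapper could call `check_command` before sending anything to Redis, so that a debugging command fails fast in production instead of reaching the cluster, while remaining usable in other environments.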
We know that issues like this impact your productivity. While we have a number of testing and preventative processes in place, this specific issue was not seen in any of Trello's non-production environments, where the volume of data in cache was smaller than in Trello's production Redis cache.
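To show why a smaller cache can hide this class of problem, here is a rough sketch that models a cache as an in-memory dictionary and counts how many keys a full scan has to examine. It assumes the inefficient command behaved like a scan over every key, which this report does not confirm. The helpers `make_cache` and `scan_all_keys` and the `card:` key format are hypothetical.

```python
import fnmatch


def make_cache(size):
    """Build a toy cache with `size` keys named card:0, card:1, ..."""
    return {f"card:{i}": i for i in range(size)}


def scan_all_keys(cache, pattern):
    """Return (matches, keys_examined) for a scan over every key."""
    examined = 0
    matches = []
    for key in cache:
        examined += 1  # every key is visited, whether or not it matches
        if fnmatch.fnmatchcase(key, pattern):
            matches.append(key)
    return matches, examined


# A small "non-production" cache versus a larger "production" cache.
for size in (1_000, 100_000):
    matches, examined = scan_all_keys(make_cache(size), "card:42")
    print(f"{size} keys: found {len(matches)} match, examined {examined}")
```

The scan finds the same single key in both cases, but the work done grows in proportion to the number of keys stored, so a query that looks harmless against a small test cache can become expensive against a production-sized one.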
We are prioritizing the following improvement to avoid repeating this type of incident:
We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.