Trello is slow or unavailable

Incident Report for Trello

Postmortem

SUMMARY

On August 17, 2022, between 4:39pm UTC and 5:03pm UTC, the Redis cluster that supports the Trello API experienced several short bursts of degraded performance (one to two minutes each), followed by a seven-minute outage. During that time, Trello's API either took a long time to respond or failed to respond altogether. For web users, this made the Trello application unresponsive to user input, while mobile users were dropped into offline mode. The Redis failure was caused by the deployment of an internal tool that used an inefficient command to query data from Redis, which slowed the Redis cluster significantly and caused the degraded performance experienced on the site.

IMPACT

During the event, there were three periods of degraded performance followed by a total site outage. The timeline was as follows:

04:39pm - 04:41pm UTC (2 minutes) - significant degradation
04:48pm - 04:50pm UTC (2 minutes) - significant degradation
04:54pm - 04:55pm UTC (1 minute) - minor degradation
04:56pm - 05:03pm UTC (7 minutes) - Trello completely down

During the periods of degraded performance, Trello users experienced slowness when loading boards and cards, creating boards and cards, inviting users, etc. For the seven minutes when Trello was completely down, the site was unavailable.

The incident was detected by internal monitoring within 10 minutes of the first degradation occurrence. The site recovered when Redis automatically failed over to a backup instance and began responding to requests as normal. The total time to resolution was 24 minutes.
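
The durations quoted in the timeline can be cross-checked with a few lines of Python. This is only an illustrative sketch: the period boundaries are copied from the timeline in this report, while the helper name `minutes_between` and the variable names are hypothetical.

```python
from datetime import datetime

# Time format used in the timeline, e.g. "04:39pm" (all times UTC).
FMT = "%I:%M%p"

# (start, end, description) copied from the timeline in this report.
periods = [
    ("04:39pm", "04:41pm", "significant degradation"),
    ("04:48pm", "04:50pm", "significant degradation"),
    ("04:54pm", "04:55pm", "minor degradation"),
    ("04:56pm", "05:03pm", "Trello completely down"),
]


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two same-day clock times."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Length of each individual period, in minutes.
durations = [minutes_between(start, end) for start, end, _ in periods]

# Time from the first degradation to full recovery.
total_span = minutes_between(periods[0][0], periods[-1][1])

for (start, end, label), minutes in zip(periods, durations):
    print(f"{start} - {end} UTC: {minutes} min ({label})")
print(f"Total time to resolution: {total_span} min")
```

The span from the first degradation to recovery includes the quiet gaps between the individual periods, which is why it is longer than the sum of the four listed durations.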

ROOT CAUSE

The issue was caused by a change to Trello's server codebase where a Redis function, used primarily for debugging purposes, was shipped to production for use by an internal tool. When this function was executed, it significantly slowed the Redis cluster, causing it to ultimately fail over. The root cause of the incident was the failure to prevent the use of this function in Trello's production environment.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that issues like this impact your productivity. While we have a number of tests and preventative processes in place, this specific issue was not seen in any of Trello's non-production environments, where the volume of cached data was much smaller than in Trello's production Redis cache.

We are prioritizing the following improvement to avoid repeating this type of incident:

Implement rules to block the use of the offending Redis command as well as other debugging commands in production.
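
One way such a rule could look at the application layer is a small guard that checks each command name against a denylist before it is sent to Redis. The following is a minimal sketch under stated assumptions, not a description of Trello's actual implementation: this report does not name the offending command, so the denylist entries (`KEYS`, `MONITOR`, `DEBUG`) are illustrative examples of Redis commands commonly treated as debugging-only, and `check_command`, `guarded_execute`, `BlockedCommandError`, and the `"production"` environment label are all hypothetical names.

```python
# Hypothetical denylist. The report does not name the offending command,
# so these entries are illustrative assumptions only.
BLOCKED_IN_PRODUCTION = frozenset({"KEYS", "MONITOR", "DEBUG"})


class BlockedCommandError(RuntimeError):
    """Raised when a denylisted command is attempted in production."""


def check_command(command: str, environment: str) -> None:
    """Raise if the command is denylisted and the environment is production."""
    stripped = command.strip()
    # The command name is the first whitespace-separated token, case-insensitive.
    name = stripped.split()[0].upper() if stripped else ""
    if environment == "production" and name in BLOCKED_IN_PRODUCTION:
        raise BlockedCommandError(f"{name} is not allowed in production")


def guarded_execute(execute, environment, *args):
    """Validate the command name (args[0]), then delegate to the real executor.

    `execute` stands in for whatever callable actually sends the command
    to Redis; it is a placeholder here, not a specific client API.
    """
    check_command(str(args[0]), environment)
    return execute(*args)
```

With this sketch, `check_command("KEYS *", "production")` raises `BlockedCommandError`, while the same call with `"staging"` passes, so the command remains available for debugging outside production. A client-side guard like this only covers code paths that go through it; restricting the commands on the Redis server itself would be a complementary, stricter control.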

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Trello

Posted Aug 24, 2022 - 17:58 EDT

Resolved

This incident has been resolved.

Posted Aug 17, 2022 - 14:05 EDT

Monitoring

Trello is operational. We'll continue to investigate the root cause and monitor until the issue is resolved.

Posted Aug 17, 2022 - 13:35 EDT

Investigating

We are currently investigating this issue.

Posted Aug 17, 2022 - 13:07 EDT

This incident affected: Trello.com and API.