On 09-28-2020 between 14:55 and 16:13 UTC, Atlassian customers using Trello may have experienced slowness or unavailability in both the web and mobile apps.
Adding new routes to our load balancing tier caused some of our load balancer CPU cores to become saturated at 100% utilization. As a result, nearly half of all requests to Trello returned errors. Because of a configuration in our monitoring, we were not alerted to early indicators of this problem. However, Amazon CloudWatch monitoring did alert us within 9 minutes that several of our load balancers were unhealthy. We mitigated the issue by reconfiguring our load balancers to use additional CPU cores. We have also substantially optimized the process by which new routes are added to the configuration, resulting in an overall drop in CPU usage.
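As an illustration only, here is a minimal sketch of how saturation on individual cores can go unnoticed when an alert looks at average CPU across all cores. This report does not describe the actual monitoring configuration, so the threshold, function names, and sample values below are hypothetical assumptions and not our production setup.

```python
from statistics import mean

# Hypothetical alert threshold (percent); not taken from the incident report.
SATURATION_THRESHOLD = 95.0


def saturated_cores(per_core_utilization, threshold=SATURATION_THRESHOLD):
    """Return the indices of cores at or above the threshold."""
    return [
        index
        for index, utilization in enumerate(per_core_utilization)
        if utilization >= threshold
    ]


def should_alert(per_core_utilization, threshold=SATURATION_THRESHOLD):
    """Alert if any single core is saturated, regardless of the average."""
    return bool(saturated_cores(per_core_utilization, threshold))


# Made-up sample: two cores pinned at 100%, the remaining six mostly idle.
sample = [100.0, 100.0, 12.0, 9.0, 11.0, 8.0, 10.0, 14.0]

average = mean(sample)          # stays far below the threshold
pinned = saturated_cores(sample)  # the cores an average-based alert would miss

print(f"average utilization: {average:.1f}%")
print(f"saturated cores: {pinned}")
print(f"alert on per-core check: {should_alert(sample)}")
```

In this made-up sample the average stays well under the threshold even though two cores are fully saturated, which is why a per-core check is the more sensitive early indicator.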
We know that outages impact your productivity. We deploy our changes progressively to avoid broad impact, but in this case our load balancers did not perform as expected. Moving forward, in addition to the fixes described above, we have implemented improved monitoring, alerting, and oversight for load balancer metrics to minimize the blast radius of breaking changes to our environments.
We apologize for any inconvenience this may have caused. Please let us know if there are additional details we can provide.