Error responses across multiple Cloud products
Incident Report for Trello
Postmortem

SUMMARY

On June 3rd, between 09:43pm and 10:58pm UTC, Atlassian customers using multiple products were unable to access their services. The event was triggered by a change to the infrastructure API gateway, which is responsible for routing traffic to the correct application backends.

The incident was detected by the automated monitoring system within five minutes and mitigated by correcting a faulty release feature flag, which put Atlassian systems into a known good state. The first communications were published on the Statuspage at 11:11pm UTC. The total time to resolution was about 75 minutes.

IMPACT

The overall impact window was between 09:43pm and 10:58pm UTC, with the system in a degraded state from 09:43pm to 10:17pm, followed by a total outage between 10:17pm and 10:58pm UTC.

The incident caused service disruption to customers in all regions and affected the following products:

  • Jira Software
  • Jira Service Management
  • Jira Work Management
  • Jira Product Discovery
  • Jira Align
  • Confluence
  • Trello
  • Bitbucket
  • Opsgenie
  • Compass

ROOT CAUSE

A policy used in the infrastructure API gateway was being updated in production via a feature flag. The combination of an erroneous value entered in the feature flag and a bug in the code resulted in the API gateway not processing any traffic.

This created a total outage, in which all users received 5XX errors for most Atlassian products.

Once the problem was identified and the feature flag was updated with the correct values, all services started to recover immediately.
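
To illustrate the failure mode in general terms only (the names and structure below are hypothetical and do not reflect Atlassian's actual gateway code): if the routing policy selected by a feature flag is resolved with no safe fallback, a single bad flag value can cause the gateway to reject every request rather than just the changed policy. A minimal sketch in Go:

    package main

    import (
        "fmt"
        "net/http"
    )

    // A routing policy decides which application backend receives a request.
    type policy func(r *http.Request) string

    // Known policies, keyed by the value of a (hypothetical) feature flag.
    var policies = map[string]policy{
        "default": func(r *http.Request) string { return "app-backend" },
        "canary":  func(r *http.Request) string { return "canary-backend" },
    }

    // The bug, in miniature: an unrecognized flag value yields a nil policy,
    // and the handler below treats "no policy" as "reject the request".
    func selectPolicy(flagValue string) policy {
        return policies[flagValue] // nil if flagValue is not a known key
    }

    func gateway(p policy) http.HandlerFunc {
        return func(w http.ResponseWriter, r *http.Request) {
            if p == nil {
                // Every request gets a 5XX response: a total outage.
                http.Error(w, "no routing policy", http.StatusServiceUnavailable)
                return
            }
            fmt.Fprintf(w, "routed to %s\n", p(r))
        }
    }

    func main() {
        // A mistyped flag value ("defalt") combined with the missing
        // fallback above takes the whole gateway down.
        http.Handle("/", gateway(selectPolicy("defalt")))
        http.ListenAndServe(":8080", nil)
    }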

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. While we have several testing and preventative processes in place, this specific issue wasn’t identified because the change did not go through our regular release process and instead was incorrectly applied through a feature flag.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Prevent high-risk feature flags from being used in production
  • Improve testing of policy changes
  • Enforce longer soak times for policy changes
  • Roll out all feature flags progressively to minimize broad impact
  • Review infrastructure feature flags to ensure they all have appropriate defaults (a minimal sketch of a safe-default flag read follows this list)
  • Improve our processes and internal tooling to provide faster communications to our customers
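
As a rough illustration of the "appropriate defaults" item above (names are hypothetical and not Atlassian's tooling), a flag read can validate the value and fall back to a known-good default, so a bad value degrades gracefully instead of disabling routing:

    package main

    import (
        "fmt"
        "log"
    )

    // knownPolicies is the set of values the gateway understands; anything
    // else is treated as invalid rather than passed through.
    var knownPolicies = map[string]bool{
        "default": true,
        "canary":  true,
    }

    // policyFromFlag validates a flag value at read time and falls back to
    // a known-good default when the value is unrecognized.
    func policyFromFlag(flagValue, fallback string) string {
        if knownPolicies[flagValue] {
            return flagValue
        }
        log.Printf("unknown policy %q in feature flag, falling back to %q", flagValue, fallback)
        return fallback
    }

    func main() {
        fmt.Println(policyFromFlag("canary", "default")) // canary
        fmt.Println(policyFromFlag("defalt", "default")) // default (typo caught and logged)
    }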

We apologize to customers whose services were affected by this incident and are taking immediate steps to address the above gaps.

Thanks,

Atlassian Customer Support

Posted Jun 10, 2024 - 20:57 EDT

Resolved
Between 22:18 UTC and 22:56 UTC, we experienced errors for multiple Cloud products. The issue has been resolved and services are operating normally.
Posted Jun 03, 2024 - 20:31 EDT
Identified
We are investigating an issue with error responses for some Cloud customers across multiple products. We have identified the root cause and expect recovery shortly.
Posted Jun 03, 2024 - 19:11 EDT