Service Disruptions Affecting Atlassian Products
Incident Report for Trello
Postmortem

SUMMARY

On February 14, 2024, between 20:05 UTC and 23:03 UTC, Atlassian customers on the following cloud products encountered a service disruption: Access, Atlas, Atlassian Analytics, Bitbucket, Compass, Confluence, Ecosystem apps, Jira Service Management, Jira Software, Jira Work Management, Jira Product Discovery, Opsgenie, StatusPage, and Trello.

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names used for internal service-to-service connections. Active domain names were incorrectly deleted during this operation. This impacted all cloud customers across all regions. The issue was identified and resolved by rolling back the faulty deployment, which restored the domain names and returned Atlassian systems to a stable state. The time to resolution was 2 hours and 58 minutes.

IMPACT

External customers started reporting issues with Atlassian cloud products at 20:52 UTC. The failed change led to performance degradation or, in some cases, complete service disruption. End users experienced unsuccessful page loads and/or failed interactions with our cloud products.

ROOT CAUSE

As part of a security and compliance uplift, we had scheduled the deletion of unused and legacy domain names that were being used for internal service-to-service connections. Active domain names were incorrectly deleted during this operation.
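
One way to guard against this failure mode is to check each deletion candidate against recent usage data before removing it. The sketch below is illustrative only and does not describe Atlassian's tooling: the function `plan_deletions`, the `last_seen` mapping (imagined here as coming from resolver query logs), and the 30-day quiet period are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical quiet period: a name must have had no lookups for this long.
QUIET_PERIOD = timedelta(days=30)


def plan_deletions(candidates, last_seen, now, quiet_period=QUIET_PERIOD):
    """Split candidate domain names into (safe_to_delete, held_back).

    candidates: iterable of domain names proposed for deletion.
    last_seen:  dict mapping a domain name to the datetime of its most
                recent observed lookup (hypothetical data source).
    now:        the current time as a timezone-aware datetime.

    A name is considered safe only if usage data exists for it and its
    last lookup is at least `quiet_period` old. Names with no usage data
    are held back, because missing data is not evidence of disuse.
    """
    safe, held = [], []
    for name in candidates:
        seen = last_seen.get(name)
        if seen is not None and now - seen >= quiet_period:
            safe.append(name)
        else:
            held.append(name)  # recently used, or no data: do not delete
    return safe, held
```

Holding back names with no usage data is a deliberately conservative choice: it trades a slower cleanup for a lower chance of deleting a name that is still in use.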

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. Detection was delayed because existing testing and monitoring focused on the health of individual services rather than the availability of the system as a whole.

To prevent a recurrence of this type of incident, we are implementing the following improvement measures:

  • Canary checks to monitor the availability of the entire system.
  • Faster rollback procedures for this type of service impact.
  • Stricter change control procedures for infrastructure modifications.
  • Migration of all DNS records to centralized management, with stricter access controls on modifications to DNS records.
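
As a rough illustration of the first measure, a canary check can verify that every name on a request path still resolves, instead of relying on each service's own health signal. The sketch below is a minimal example under stated assumptions, not a description of Atlassian's monitoring: `run_canary` and `default_resolver` are hypothetical names, and a production canary would likely also issue real requests rather than only resolving names.

```python
import socket


def default_resolver(hostname):
    """Resolve a hostname with the system resolver; raises OSError on failure."""
    return socket.getaddrinfo(hostname, None)


def run_canary(hostnames, resolver=default_resolver):
    """Return the hostnames that fail to resolve.

    An empty list means every name in the checked path resolved.
    The resolver is injectable so the check can be tested offline.
    """
    failures = []
    for host in hostnames:
        try:
            resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failures.append(host)
    return failures
```

Passing the resolver as a parameter keeps the check testable without network access, and a non-empty result can be wired to an alert that covers the whole path.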

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Feb 27, 2024 - 00:45 EST

Resolved
We experienced increased errors on Confluence, Jira Work Management, Jira Service Management, Jira Software, Opsgenie, Trello, Atlassian Bitbucket, Atlassian Access, Jira Align, Jira Product Discovery, Atlas, Compass, and Atlassian Analytics. The issue has been resolved and the services are operating normally.
Posted Feb 14, 2024 - 18:32 EST
Monitoring
We have identified the root cause of the service disruptions affecting all Atlassian products and have mitigated the problem. We are now monitoring this closely.
Posted Feb 14, 2024 - 17:55 EST
Identified
We have identified the root cause of the increased errors and have mitigated the problem. We continue to work on resolving the issue and are monitoring this closely.
Posted Feb 14, 2024 - 17:31 EST
Investigating
We are investigating reports of intermittent errors for all cloud customers across all Atlassian products. We will provide more details once we identify the root cause.
Posted Feb 14, 2024 - 16:57 EST
This incident affected: Trello.com, API, Atlassian Support - Support Portal, Atlassian Support Ticketing, and Atlassian Support Knowledge Base.