Atlassian Account login issues
Incident Report for Trello
Postmortem

SUMMARY

On Sep 13, 2023, between 12:00 PM UTC and 03:30 PM UTC, some Atlassian users were unable to sign in to their accounts and use multiple Atlassian cloud products. The event was triggered by a misconfiguration of rate limits in an internal service, which caused a cascading failure in sign-in and signup-related APIs. The incident was quickly detected by multiple automated monitoring systems. It was mitigated on Sep 13, 2023, at 03:30 PM UTC by rolling back a feature and scaling up services, which returned Atlassian systems to a known good state. The total time to resolution was about 3 hours and 30 minutes.

IMPACT

The impact lasted from 12:00 PM UTC to 03:30 PM UTC on Sep 13, 2023, and affected multiple products. The incident caused intermittent service disruption across all regions. Some users were unable to sign in to new sessions. New user signups, profile retrieval, and password reset also failed temporarily. At the peak of the incident, 90% of requests were failing across the authentication, user profile retrieval, and password reset use cases.

ROOT CAUSE

The issue was caused by a misconfigured rate limit in an internal core service. As a result, sign-in requests that exceeded the limit received HTTP 429 errors. Retry behavior for these requests then multiplied the load, which made the service degradation worse. Because many internal services depend on each other, the complexity of the call graph lengthened the time needed to identify the faulty service.
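
To illustrate how retries can multiply load, here is a minimal sketch of a worst-case model. It assumes every layer in a call chain retries a failed call a fixed number of times; the retry count and call depth are hypothetical and are not figures from this incident.

```python
def worst_case_attempts(retries_per_layer: int, depth: int) -> int:
    """Attempts that reach the deepest service for one user request,
    assuming every call fails and each layer retries independently.

    Both parameters are illustrative assumptions, not incident data.
    """
    # Each layer makes 1 original call plus its retries, and each of those
    # calls triggers the full set of attempts in the layer below it.
    return (1 + retries_per_layer) ** depth


if __name__ == "__main__":
    for depth in (1, 2, 3):
        print(depth, worst_case_attempts(retries_per_layer=3, depth=depth))
    # prints:
    # 1 4
    # 2 16
    # 3 64
```

Under these assumptions, three retries at each of three layers turn a single rejected sign-in into 64 attempts against the rate-limited service, which is why a modest limit breach can escalate into a much wider failure.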

REMEDIAL ACTIONS PLAN & NEXT STEPS

We are continuously improving our system's resiliency.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Audit and improve service rate limits and client retry and backoff behavior.
  • Improve scale and load test automation for complex service interactions.
  • Audit cross-service dependencies related to sign-in flows and minimize them where possible.
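
As an illustration of the retry and backoff behavior mentioned in the first action above, here is a minimal sketch of a client that caps its attempts and waits with exponential backoff and full jitter after an HTTP 429. The function name, parameters, and default values are hypothetical; this is not Atlassian's implementation.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 4,      # assumed cap on total attempts
    base_delay: float = 0.2,    # assumed first backoff ceiling, in seconds
    max_delay: float = 5.0,     # assumed upper bound on any single wait
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> int:
    """Call send() and retry only on HTTP 429, up to max_attempts in total.

    send is a hypothetical callable that performs one request and returns
    its HTTP status code. The last status seen is returned to the caller.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            return status
        # The ceiling doubles with each retry and is capped at max_delay.
        ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
        # Full jitter: wait a random time in [0, ceiling) so that many
        # clients do not retry at the same instant.
        sleep(rng() * ceiling)
        status = send()
    return status
```

Capping the attempts bounds how far one client can multiply load, and the jitter spreads retries out over time instead of sending them in synchronized waves.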

Due to the unavailability of sign-in, some customers were unable to create support tickets. We are making additional process improvements to:

  • Enable our unauthenticated support contact form and notify users that it should be used when standard channels are not available.
  • Create status page notifications more quickly and ensure that for severe incidents, notifications to all subscribers are enabled.

We apologize to users who were impacted during this incident; we are taking immediate steps to improve the platform’s reliability and availability.

Thanks,

Atlassian Customer Support

Posted Sep 21, 2023 - 20:48 EDT

Resolved
Between 12:45 UTC and 15:30 UTC, we experienced login and signup issues for Atlassian Accounts. The issue has been resolved and the service is operating normally.

We will publish a post-incident review with the details of the incident and the actions we are taking to prevent similar problems in the future.
Posted Sep 13, 2023 - 15:32 EDT
Update
We are no longer seeing occurrences of the Atlassian Accounts login errors, and all clients should now be able to log in successfully. We will continue to monitor.
Posted Sep 13, 2023 - 13:36 EDT
Update
We can see a reduction in the Atlassian Accounts login issues after the mitigation actions were taken. We are still monitoring closely and will continue to provide updates.
Posted Sep 13, 2023 - 12:30 EDT
Monitoring
We have identified the root cause of the Atlassian Accounts login issues impacting Cloud customers and have mitigated the problem. We are now monitoring this closely.
Posted Sep 13, 2023 - 11:21 EDT
Investigating
We are investigating an issue with Atlassian Accounts login that is impacting some Cloud customers. We will provide more details within the next hour.
Posted Sep 13, 2023 - 10:08 EDT
This incident affected: Trello.com.