Login issues with single sign on account

Incident Report for Trello

Postmortem

SUMMARY

From Feb 10th, 2021, at 3:15 AM UTC to Feb 11th, 2021, at 12:23 AM UTC, a subset of Atlassian customers using Trello, Jira, Opsgenie, Access, and Confluence were unable to log in. The event was caused by a faulty change in Atlassian Access that was deployed to production. The change caused Atlassian Access to verify domains and claim accounts associated with organizations, even though those organizations had not initiated the domain verifications or account claims. This did not have any impact on customer privacy. Customers in all regions were impacted. The incident was triggered when a scheduled job executed and activated the faulty change. It was detected after 118 minutes by customer support and mitigated by rolling back the faulty change and progressively restoring affected domains and accounts to a good state. The total time to resolution was about 21 hours and 8 minutes.
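
For readers who want to check the timeline, the figures above can be reproduced with a short calculation. This is only a sketch: the start time, end time, and 118-minute detection delay come from the summary, while the detection timestamp is derived from them and is not stated in this report.

```python
from datetime import datetime, timedelta, timezone

# Timestamps and detection delay as given in the summary (all UTC).
incident_start = datetime(2021, 2, 10, 3, 15, tzinfo=timezone.utc)
incident_end = datetime(2021, 2, 11, 0, 23, tzinfo=timezone.utc)
time_to_detect = timedelta(minutes=118)

# Derived value: not stated in the report, only implied by the figures above.
detected_at = incident_start + time_to_detect
total_duration = incident_end - incident_start


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


print("Detected at:", detected_at.strftime("%b %d %H:%M UTC"))
print("Total duration: %d h %d min" % hours_minutes(total_duration))
```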

The impact on the products affected is listed below.

IMPACT

The product-specific impact occurred between Feb 10th, 2021, 3:15 AM UTC and Feb 11th, 2021, 12:23 AM UTC.

Atlassian Access

Some domains were verified, and user accounts on those domains were claimed, without admin consent. The accounts associated with these domains became managed accounts, but this did not have any impact on customer privacy.
The users of these accounts received an email stating that their account was now being managed by their organization.

Confluence, Trello, Jira, Opsgenie

A subset of users were unable to log in to the products during this time.

ROOT CAUSE

The issue was caused by a faulty background job in Atlassian Access, which runs periodically to verify domain ownership, verify domains, and claim accounts for those domains. This resulted in some end-user accounts being locked out. As a result, the products listed above did not allow those end users to log in, and the users received login failure messages.

The faulty change was in one of the key services of our system, so it affected downstream systems, including the products mentioned above. Determining a good state took longer than anticipated.
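
To illustrate the kind of precondition such a job depends on, here is a minimal sketch of a filter that only passes along domains whose organization initiated verification. The implementation of the actual job is not described in this report, so every name here (DomainRecord, verification_requested, ownership_proven, domains_to_verify) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainRecord:
    """Hypothetical record; field names are assumptions for illustration."""
    name: str
    verification_requested: bool  # an organization admin initiated verification
    ownership_proven: bool        # the ownership check for the domain succeeded


def domains_to_verify(records):
    """Return names of domains that were both requested and proven.

    A domain nobody asked to verify is skipped even if an ownership
    check would pass, so its accounts are never claimed by the job.
    """
    return [
        r.name
        for r in records
        if r.verification_requested and r.ownership_proven
    ]


records = [
    DomainRecord("example.com", verification_requested=True, ownership_proven=True),
    DomainRecord("example.org", verification_requested=False, ownership_proven=True),
    DomainRecord("example.net", verification_requested=True, ownership_proven=False),
]
print(domains_to_verify(records))
```

In this sketch only example.com is selected; the other two records fail one of the two conditions.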

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case our detection of the unintended domain verifications and account claims did not work as expected. Moving forward, to minimize the impact of breaking changes to our environments, we will implement preventative measures such as the ones listed below.

Prevention and Detection

While the affected service already has very good test coverage, we are identifying additional use cases and adding tests for them. These additional tests will help us verify changes at various stages of deployment.
We are improving the deployment process for the affected service to increase confidence in our deployments, with steps such as:
- Progressive rollouts to production.
- Increased level of scrutiny on changes to be deployed to sensitive services.
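
As a rough illustration of the first step, a progressive rollout can be paired with a check that halts deployment as soon as a region shows a problem. This is a sketch under stated assumptions, not a description of our deployment tooling: the function progressive_rollout, the region names, and the unexpected_claims counts are all hypothetical.

```python
def progressive_rollout(regions, deploy, is_healthy):
    """Deploy region by region and stop at the first unhealthy region.

    Returns (healthy_regions, failed_region); failed_region is None
    when every region passes its check.
    """
    healthy = []
    for region in regions:
        deploy(region)
        if not is_healthy(region):
            # Halt here so later regions never receive the change.
            return healthy, region
        healthy.append(region)
    return healthy, None


# Hypothetical signal: account claims that no organization initiated.
unexpected_claims = {"region-a": 0, "region-b": 7, "region-c": 0}
deployed = []

result = progressive_rollout(
    ["region-a", "region-b", "region-c"],
    deploy=deployed.append,
    is_healthy=lambda region: unexpected_claims[region] == 0,
)
print(result)
print(deployed)
```

Here the rollout stops at region-b, and region-c never receives the change.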

Restoration Time

We are improving our end-to-end processes for recovering from such incidents, and reducing outage and degradation time, by:
- Introducing runbooks for identifying impact quickly and restoring the data to a good state.
- Investigating the architecture between our Access and Identity systems to identify quick recovery opportunities.
We will be conducting a review of our architecture to identify any opportunities for faster recovery under such circumstances.

We have identified multiple improvement actions across the affected products to improve resiliency to failures. At the time of writing, we are in the process of implementing some of these.

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the reliability of the domain verification and account claim services.

Thanks,

Atlassian Customer Support

Posted Feb 25, 2021 - 18:41 EST

Resolved

Between 10/Feb/21 03:15 UTC and 10/Feb/21 23:20 UTC, we experienced login issues for a subset of managed accounts for Confluence, Jira Core, Jira Service Management, Jira Software, and Trello. The issue has been resolved and the impacted organizations and related products are operating normally.

Posted Feb 11, 2021 - 10:33 EST

Identified

We continue to work on resolving the managed account access problems identified for Confluence, Jira Core, Jira Service Management, Jira Software, Opsgenie, Trello, and Atlassian Bitbucket. We have identified the root cause and have performed corrective actions on a handful of customer accounts. We expect full recovery of these accounts shortly, followed by a wider rollout to the other impacted customer accounts.

Additional updates will be posted when available.

Posted Feb 10, 2021 - 16:08 EST

Investigating

We have identified an incident that has resulted in a small number of customers being locked out of their accounts. Atlassian has determined that this outage is not a security-related issue: there is no data loss, and no accounts have been compromised.

We are actively working to resolve this outage. We will post more information here in the next 2 hours.

Posted Feb 10, 2021 - 14:25 EST