Login issues with single sign on account

Incident Report for Trello

Postmortem

SUMMARY

From Feb 10th, 2021, at 3:15 AM UTC to Feb 11th, 2021, at 12:23 AM UTC, a subset of Atlassian customers using Trello, Jira, Opsgenie, Access, and Confluence were unable to log in. The event was caused by a faulty change in Atlassian Access that was deployed to production. The change caused Atlassian Access to verify domains and claim accounts associated with organizations, even though those organizations had not initiated the domain verifications or account claims. However, this did not have any impact on customer privacy. Customers in all regions were impacted. The faulty change was activated when a scheduled job executed, which triggered the incident. The incident was detected after 118 minutes by customer support and mitigated by rolling back the faulty change and by progressively restoring affected domains and accounts to a good state. The total time to resolution was about 21 hours and 8 minutes.

The impact on the products affected is listed below.

IMPACT

The product-specific impact listed below applies to the period between Feb 10th, 2021, 3:15 AM UTC and Feb 11th, 2021, 12:23 AM UTC.

Atlassian Access

Some domains were verified, and user accounts on those domains were claimed, without admin consent. The accounts associated with these domains became managed accounts, but this did not have any impact on customer privacy.
The users of such accounts received an email stating that their account was now being managed by their organization.

Confluence, Trello, Jira, Opsgenie

A subset of users were unable to log in to the products during this time.

ROOT CAUSE

The issue was caused by a faulty background job in Atlassian Access, which was executed periodically to verify domain ownership and claim accounts for the domain. This resulted in some end-user accounts being locked out. As a result, the products listed above did not allow those end users to log in, and the users received login failure messages.

The faulty change was in one of the key services of our system, so it affected downstream systems, including the products mentioned above. Determining a good state took longer than anticipated.

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages impact your productivity. We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case our detection for domain verification and account claims did not work as expected. Moving forward, to minimize the impact of breaking changes to our environments, we will implement preventative measures such as the ones listed below.

Prevention and Detection

While we have very good test coverage of the affected service, additional use cases are being identified and tests are being added. These additional tests will help us verify changes at various stages of deployment.
We are improving the deployment process for the affected service to increase confidence in our deployments, with steps such as:
- Progressive rollouts to production.
- Increased level of scrutiny on changes to be deployed to sensitive services.

Restoration Time

We are improving our end-to-end processes for recovering from such incidents and reducing outage/degradation time by:
- Introducing runbooks for identifying impact quickly and restoring the data to a good state.
- Investigating the architecture between our Access and Identity systems to identify quick recovery opportunities.
We will be conducting a review of our architecture to identify any opportunities for faster recovery under such circumstances.

We have identified multiple improvement actions across the affected products to improve resiliency to failures. At the time of writing, we are in the process of implementing some of these.

We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the reliability of the domain verification and account claim services.

Thanks,

Atlassian Customer Support

Posted Feb 25, 2021 - 18:41 EST

Resolved

Between 9 Feb 2021 18:25 UTC and 10 Feb 2021 13:00 UTC, a subset of customers experienced issues with Single Sign On and received a false Managed Account email notification regarding their account access to Confluence, Jira Core, Jira Service Management, Jira Software, Opsgenie, Trello, and Atlassian Bitbucket. The issue has been resolved and the service is operating normally.

Posted Feb 10, 2021 - 08:37 EST

Identified

We have identified an incident that triggered false email notifications to a small number of customers, stating that an update had been initiated by the customer admin. A subset of these customers have also been locked out of their accounts as a result of the incident. Atlassian has determined that the outage and the emails were caused by this incident and are not a security-related issue.

We are actively working to resolve this outage and the related emails. We will post more information here within the next 2 hours.

Posted Feb 10, 2021 - 07:23 EST