From Feb 10th, 2021, at 3:15 AM UTC to Feb 11th, 2021, at 12:23 AM UTC, a subset of Atlassian customers using Trello, Jira, Opsgenie, Access, and Confluence were unable to log in. The event was caused by a faulty change in Atlassian Access that was deployed to production. The change caused Atlassian Access to verify domains and claim accounts associated with organizations, even though those organizations had not initiated the domain verifications or account claims. However, this did not have any impact on customer privacy. Customers in all regions were impacted. The faulty change was activated, and the incident triggered, when a scheduled job executed. The incident was detected after 118 minutes by customer support and mitigated by rolling back the faulty change and by progressively restoring affected domains and accounts to a good state. The total time to resolution was about 21 hours and 8 minutes.
The impact on the products affected is listed below.
The product-specific impact occurred between Feb 10th, 2021, 3:15 AM UTC and Feb 11th, 2021, 12:23 AM UTC.
The issue was caused by a faulty background job in Atlassian Access, which was executed periodically to verify domain ownership and claim accounts for those domains. This resulted in some end-user accounts being locked out. As a result, the products listed above did not allow those end users to log in, and the users received login failure messages.
The faulty change was in one of the key services of our system, so it affected downstream systems, including the products mentioned above. Determining a good state took longer than anticipated.
We know that outages impact your productivity. We deploy our changes progressively (by cloud region) to avoid broad impact. However, in this case, our detection for domain verifications and account claims did not work as expected. Moving forward, to minimize the impact of breaking changes to our environments, we will implement preventative measures such as the ones listed below.
Prevention and Detection
We are improving the deployment process for the affected service to increase confidence in our deployments, with steps such as:
Restoration Time
We are improving our end-to-end processes for recovering from such incidents and reducing the outage/degradation time by:
We will be conducting a review of our architecture to identify any opportunities for faster recovery under such circumstances.
We have identified multiple improvement actions across the affected products to improve their resiliency to failures. At the time of writing, we are in the process of implementing some of these.
We apologize to customers who were impacted during this incident; we are taking immediate steps to improve the reliability of the domain verification and account claim services.
Thanks,
Atlassian Customer Support