Intermittent errors during login and Workspace not showing up for some customers
Incident Report for Trello
Postmortem

SUMMARY

On Aug 4, 2023, Trello users encountered issues accessing their workspaces. This was caused by a processing error during user deletion events involving two users who shared a workspace. The error resulted in unintended workspaces being marked as deleted.

The issue was identified, the deletion process halted, and data restoration initiated. The solution involved marking workspaces as undeleted and implementing a code fix to prevent similar issues in the future.

IMPACT

The impact occurred on August 4, 2023, from the afternoon through the early evening (UTC).

All Trello workspaces created before July 2021 were inaccessible during the incident, which amounted to 39% of active workspaces.

ROOT CAUSE

The incident was triggered by a race condition in the handling of user deletion events. When the last user in a workspace is deleted, the system automatically marks that workspace as deleted. In this case, two users sharing a workspace were deleted simultaneously. The resulting race condition triggered a code path that generated a query which was not scoped to an individual workspace and instead systematically marked all workspaces (including unrelated ones) as deleted in our database.
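
For illustration only, the failure mode described above can be guarded against by refusing to run a soft delete that is not scoped to exactly one workspace. The sketch below is not Trello's code: it uses Python's built-in sqlite3 module as a stand-in database, and the table name `workspaces`, the column `deleted`, and the names `soft_delete_workspace`, `UnscopedUpdateError`, and `make_demo_db` are all hypothetical.

```python
import sqlite3


class UnscopedUpdateError(Exception):
    """Raised when a soft delete would not be limited to one workspace."""


def soft_delete_workspace(conn: sqlite3.Connection, workspace_id) -> int:
    """Mark one workspace as deleted and return the number of rows changed."""
    # Guard 1: never build the query without a concrete target id.
    if workspace_id is None:
        raise UnscopedUpdateError("refusing to soft-delete without a workspace id")
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "UPDATE workspaces SET deleted = 1 WHERE id = ? AND deleted = 0",
            (workspace_id,),
        )
        # Guard 2: a single-workspace delete must touch at most one row.
        if cur.rowcount > 1:
            raise UnscopedUpdateError(f"update touched {cur.rowcount} rows")
    return cur.rowcount


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with three live workspaces."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workspaces ("
        "id INTEGER PRIMARY KEY, deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany("INSERT INTO workspaces (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.commit()
    return conn
```

Because the update also filters on `deleted = 0`, a second handler that reaches the same workspace finds nothing left to change, so two near-simultaneous deletion events for the same workspace leave the other workspaces untouched.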

REMEDIAL ACTIONS PLAN & NEXT STEPS

We know that outages affect your productivity, and we are committed to preventing incidents like this from occurring. We have already implemented code changes to prevent the specific condition that caused the incident.

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Implement a monitoring system for the following metrics in order to improve anomaly detection: CPU usage, inbound and outbound network traffic, memory usage, and disk usage.
  • Add anomaly detection to monitor the number of soft deletes and set up alerting for it.
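
As a rough illustration of the soft-delete alerting item above, a sliding-window counter can flag an unusual burst of soft deletes. This is a minimal sketch under stated assumptions, not the monitoring system described in this report: the class name `SoftDeleteMonitor`, the window length, and the threshold are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SoftDeleteMonitor:
    """Counts soft deletes in a sliding time window and flags large bursts."""

    def __init__(self, window_seconds: float, max_deletes: int):
        self.window_seconds = window_seconds
        self.max_deletes = max_deletes
        self._events = deque()  # (timestamp, count) pairs, oldest first
        self._total = 0         # soft deletes currently inside the window

    def record(self, timestamp: float, count: int = 1) -> bool:
        """Record `count` soft deletes; return True if an alert should fire."""
        self._events.append((timestamp, count))
        self._total += count
        # Drop events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_count = self._events.popleft()
            self._total -= old_count
        return self._total > self.max_deletes


# Hypothetical usage: alert on more than 100 soft deletes within 60 seconds.
monitor = SoftDeleteMonitor(window_seconds=60, max_deletes=100)
if monitor.record(timestamp=0.0, count=500):
    print("ALERT: unusual number of soft deletes")
```

A fixed threshold is the simplest choice here; a production system would more likely compare the current rate against a historical baseline.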

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Aug 09, 2023 - 18:04 EDT

Resolved
Our team has identified the root cause and restored access to previously unavailable workspaces. We will share more in our post-incident review, which will be published as part of this incident report.
Posted Aug 04, 2023 - 16:40 EDT
Monitoring
Our team has identified the underlying cause of the login difficulties and of Workspaces not being displayed in Trello, and has restored access to all affected Workspaces. We will continue to monitor the situation closely.
Posted Aug 04, 2023 - 14:54 EDT
Identified
We have discovered the cause of workspaces being unavailable, and are working to restore access as soon as possible. We will continue to update our Statuspage with the latest information as it becomes available.
Posted Aug 04, 2023 - 13:54 EDT
Update
We are currently investigating issues with Trello workspaces being unavailable and are working to restore service as quickly as possible. We will continue to update our Statuspage with the latest information as it becomes available.
Posted Aug 04, 2023 - 12:47 EDT
Investigating
We are investigating reports of intermittent errors during login and of Workspaces not appearing for some customers using Trello. We will provide more details once we identify the root cause.
Posted Aug 04, 2023 - 12:03 EDT
This incident affected: Trello.com and API.