Some features are experiencing degraded performance
Incident Report for Trello
Postmortem

SUMMARY

From June 13, 2023, 6:49 PM UTC to June 14, 2023, 02:20 AM UTC, Atlassian customers using Jira Software, Jira Service Management, Jira Work Management, Confluence, and Trello with services hosted in the AWS us-east-1 region were impacted by Automation rule degradation. This event was triggered by increased error rates and latencies for AWS Lambda function invocations in the us-east-1 region. Some other AWS services also experienced increased error rates and latencies as a result of the degraded Lambda function invocations. The incident was automatically detected by multiple monitoring systems within 6 minutes, which paged on-call teams. Recovery of the affected AWS Lambda service began after 116 minutes, at 8:45 PM UTC on June 13. Full recovery of all AWS services occurred at 10:37 PM UTC on June 13, after the backlog of asynchronous Lambda events had been processed. Some Jira tenants with large event backlogs experienced delays in running schedule-based rule reruns. Full recovery of all Atlassian Cloud services was announced at June 14, 2023, 02:20 AM UTC.

IMPACT

The overall impact occurred between June 13, 2023, 06:49 PM UTC and June 14, 2023, 02:20 AM UTC. Product-specific impacts are listed below.

  • Jira Software, Jira Service Management, Jira Work Management - Automation rules were not executed for about 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Jira automation events generated during this period could not be rerun. After AWS Lambda recovered, some larger tenants still experienced delays in schedule-based and event-based rules due to a large backlog of events. Full recovery was at June 14, 2023, 02:20 AM UTC.
  • Confluence - Automation rules were not executed for about 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Once AWS service was restored, Confluence automation recovered, and Confluence automation events generated during this period were rerun and processed. Full recovery was at June 14, 2023, 12:41 AM UTC.
  • Jira Product Discovery - Automation rules were not executed for about 2 hours, between Jun 13, 06:49 PM UTC and Jun 13, 08:45 PM UTC. Jira automation events generated during this period could not be rerun. Sending feedback or filing a support ticket from the application did not work.
  • Trello - Email-to-board delays, card cover image upload failures, attachment preview generation failures, board background upload failures, custom sticker image upload failures, and custom emoji upload failures. Trello automation was unaffected. Full recovery was at June 13, 2023, 10:08 PM UTC.

The service disruption lasted 7 hours and 31 minutes, from June 13, 2023, 06:49 PM UTC to June 14, 2023, 02:20 AM UTC, and affected customers with services hosted in the US-EAST-1 region.
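The durations in this report follow directly from the reported timestamps. As a minimal sketch, the arithmetic can be checked as below; the helper names (`utc`, `minutes_between`) are illustrative only and are not part of any Atlassian or AWS tooling.

```python
from datetime import datetime, timezone

FMT = "%Y-%m-%d %H:%M"

def utc(ts: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp (hypothetical helper)."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)

def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two UTC timestamps."""
    return int((utc(end) - utc(start)).total_seconds() // 60)

# Timestamps taken from the report, written in 24-hour UTC.
incident_start = "2023-06-13 18:49"   # impact begins
lambda_recovery = "2023-06-13 20:45"  # AWS Lambda recovery begins
full_recovery = "2023-06-14 02:20"    # full Atlassian Cloud recovery

lambda_outage_min = minutes_between(incident_start, lambda_recovery)
total_min = minutes_between(incident_start, full_recovery)

print(lambda_outage_min)      # 116
print(divmod(total_min, 60))  # (7, 31)
```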

ROOT CAUSE

Atlassian uses Amazon Web Services (AWS) as a cloud service provider. The root cause was an issue with a subsystem responsible for capacity management for AWS Lambda in the US-EAST-1 Region, which also impacted 104 AWS services. Automation rules were impacted because the Automation service is hosted exclusively in this region.

No relevant Atlassian-driven events in the lead-up have been identified as causing or contributing to this incident.

REMEDIAL ACTION PLAN & NEXT STEPS

We are prioritizing the following improvement actions to avoid repeating this type of incident:

  • Increase the reliability and recoverability of message delivery from Jira to the Automation platform to improve recovery times.
  • Create a plan for multi-region impact mitigation for Automation.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

Posted Jun 26, 2023 - 23:08 EDT

Resolved
From 6:49 PM UTC to 8:45 PM UTC, Trello experienced issues with image uploads that affected card covers, attachment previews, board backgrounds, and custom sticker images. Additionally, we observed degradation in our email-to-board and search features. The services are now operating as expected, and we consider this incident resolved.
Posted Jun 13, 2023 - 20:57 EDT
Monitoring
AWS services have recovered from an outage. This issue affected attachment previews, email-to-board features, search, and card covers; all Trello services are now back online. We will continue monitoring from our end.
Posted Jun 13, 2023 - 18:33 EDT
Update
We are continuing to work with our third-party partner on a fix. Attachment previews, email-to-board features, search, and card covers are intermittently affected.
Posted Jun 13, 2023 - 17:07 EDT
Update
We have confirmed an issue with AWS services is causing the current issues. We are working with AWS to get the service restored back to normal as soon as possible.
Posted Jun 13, 2023 - 17:05 EDT
Update
It has come to our attention that the Email-to-Board feature, along with other previously mentioned features, has been affected by a technical problem. Our team is currently collaborating with our third-party partner to identify and implement an effective solution for this matter.
Posted Jun 13, 2023 - 16:19 EDT
Identified
We have identified an issue that may cause problems for users while adding new card covers, viewing attachment previews, or using the search function. The issue has been identified with one of our third-party partners, and we are currently working on resolving it.
Posted Jun 13, 2023 - 15:45 EDT
This incident affected: Trello.com and API.