Trello is slow or unavailable

Incident Report for Trello

Postmortem

February 28th Downtime Postmortem

All times are in ET (UTC -5).

Summary

From approximately 12:45pm to 4:49pm, Amazon Web Services suffered a major interruption to its S3 service in the US-EAST-1 region, which Trello depends on. This outage made Trello inaccessible to our users. At approximately 3:57pm the Trello engineering team restored partial service to the site, though some images and resources such as card attachments, board backgrounds, and avatars failed to load. As AWS began restoring the S3 service, more resources began to load successfully, and at 5:11pm the AWS status page indicated that the S3 service was fully restored.

Timeline

12:45pm - One of our engineers observes an increase in our pending active requests and receives an error when trying to load the Trello web app.
12:50 - Our automated systems begin lighting up and alert all engineering teams that trello.com is not loading.
12:52 - Chatter on Twitter indicates this may be a widespread AWS outage.
12:56 - We attempt to restart the Trello web servers to clear out the growing pending active requests. Due to startup checks that depend on S3, the first server fails to start back up. Restarting is not a viable option.
12:59 - We post on trellostatus.com letting users know that Trello is down.
1:01 - Trello isn’t accessible, so we officially disable the site and display a maintenance page.
1:06 - The AWS status page is updated with a banner indicating that they are investigating increased error rates for S3 requests in US-EAST-1.
1:20 - We are working to identify all of our dependencies on S3 and considering ways to bring Trello back up without S3.
1:45 - We have an outline of a plan to serve Trello’s core JavaScript and CSS assets from another location, and begin working on the necessary changes to get this working and tested. The proposed solution will allow access to Trello, but user-uploaded assets such as card attachments, board backgrounds, and avatars will still fail to load, and users will be unable to upload any new assets.
3:19 - We succeed in making the changes to our staging environment and begin additional user testing.
3:25 - We observe that a small percentage of GET requests to S3 are succeeding again. We discuss our options as we anxiously observe the slowly improving health of S3.
3:45 - S3’s status is improving, but we conclude that we don’t have enough information to assume that it will be fully recovered soon. We opt to deploy the changes to get Trello back up without S3.
3:56 - Trello is restored. Users are able to access their boards and cards at trello.com again, but some images stored in S3 will fail to load.
4:00 - We continue to monitor the health of Trello as S3 recovers.
5:11 - The AWS status page indicates that S3 is functioning normally again. Our internal monitoring shows the same.
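
Two points in the timeline lend themselves to a small illustration: the startup checks that blocked a restart at 12:56, and the plan at 1:45 to serve core assets from another location. The sketch below is not Trello's actual code. It is a minimal example, assuming a server that marks each startup check as required or optional and chooses an asset host based on the result. Every name in it (Check, run_startup_checks, asset_base_url, the probe functions, and the example.com-style hosts) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    name: str
    probe: Callable[[], None]  # raises an exception on failure
    required: bool             # False: a failure degrades the server instead of aborting


def run_startup_checks(checks: List[Check]) -> Dict[str, bool]:
    """Run every check and abort startup only if a required one fails."""
    health: Dict[str, bool] = {}
    for check in checks:
        try:
            check.probe()
            health[check.name] = True
        except Exception as exc:
            if check.required:
                raise RuntimeError(f"required check {check.name!r} failed") from exc
            health[check.name] = False  # start anyway, in degraded mode
    return health


def asset_base_url(health: Dict[str, bool], primary: str, fallback: str) -> str:
    """Choose where core JS/CSS is served from, based on storage health."""
    return primary if health.get("object_storage", False) else fallback


def failing_storage_probe() -> None:
    # Hypothetical stand-in for a storage request made during an outage.
    raise ConnectionError("object storage unreachable")


def passing_probe() -> None:
    # Hypothetical stand-in for a dependency that is healthy.
    return None


health = run_startup_checks([
    Check("database", passing_probe, required=True),
    Check("object_storage", failing_storage_probe, required=False),
])
base = asset_base_url(
    health,
    primary="https://assets-primary.example",
    fallback="https://assets-fallback.example",
)
```

With the storage check marked optional, the server in this sketch still starts during a storage outage and points clients at the fallback host. User-uploaded files that live only in the unavailable store would still fail to load, which matches the partial service described above.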

Posted Mar 01, 2017 - 17:19 EST

Resolved

AWS S3 is back online, and all Trello functions should be back to normal now.

Posted Feb 28, 2017 - 17:34 EST

Update

Trello should now be back for all users. However, certain elements such as avatars, board backgrounds and card attachments may not load immediately while AWS S3 is coming back online. Uploading new attachments will not work at this time.

We'll post another update when full service is restored.

Posted Feb 28, 2017 - 16:01 EST

Monitoring

S3 services appear to be slowly coming back up now. We're waiting until things are stable before bringing Trello back.

Posted Feb 28, 2017 - 15:51 EST

Update

No updates yet, but you can follow along on status.aws.amazon.com for more information as the AWS team continues to work to resolve this issue.

Our engineers are looking into this as well, and we'll post an update as soon as we have one.

Posted Feb 28, 2017 - 15:35 EST

Update

AWS is still working on resolving this issue.

In the meantime, users with the Android or iOS apps installed may still be able to access a cached version of their Trello boards in those apps. Please note that any changes made in the app won't be synced until full service is restored.

Posted Feb 28, 2017 - 15:17 EST

Update

No further updates right now, but we're still following the AWS issue, and we'll keep you posted here.

Posted Feb 28, 2017 - 15:03 EST

Identified

According to status.aws.amazon.com, the AWS team believes they have identified the root cause of the outage, and are working on implementing what they believe will remediate the issue.

Once this issue is resolved, Trello will be able to return to normal operation.

Posted Feb 28, 2017 - 14:45 EST

Update

We are still investigating this issue with AWS. No resolution yet, but we'll continue to post updates here.

Posted Feb 28, 2017 - 14:26 EST

Update

Our engineers are continuing to investigate the issue with AWS that's causing downtime for Trello. We'll continue to post updates here.

Posted Feb 28, 2017 - 14:10 EST

Update

We're still investigating the AWS S3 issue that is impacting Trello right now. We'll post information here as it becomes available.

Posted Feb 28, 2017 - 13:54 EST

Investigating

Our engineers are continuing to investigate the AWS S3 issue as it affects Trello. We will continue to post updates here as we have them.

Posted Feb 28, 2017 - 13:34 EST

Identified

We've identified this as part of a larger issue with AWS S3 that's currently being investigated.

Posted Feb 28, 2017 - 13:19 EST

Investigating

Our engineering team is actively investigating this incident and working to bring Trello back up as quickly as possible.

Users affected by this incident may notice that Trello is slow or completely unavailable in both the web and mobile apps.

We will update this page as we have additional information.

Posted Feb 28, 2017 - 12:59 EST