All times are in ET (UTC -5).
From approximately 12:45pm to 4:49pm, Amazon Web Services suffered a major interruption to its S3 service in the US-EAST-1 region, which Trello depends on. This outage made Trello inaccessible to our users. At approximately 3:57pm the Trello engineering team was able to restore partial service to the site, though some images and resources such as card attachments, board backgrounds, and avatars still failed to load. As AWS began restoring the S3 service, more resources began to load successfully, and at 5:11pm the AWS status page indicated that the S3 service was fully restored.
12:45pm - One of our engineers observes an increase in our pending active requests and receives an error when trying to load the Trello web app.
12:50 - Our automated systems begin lighting up and alert all engineering teams that trello.com is not loading.
12:52 - Chatter on Twitter indicates this may be a widespread AWS outage.
12:56 - We attempt to restart the Trello web servers to clear out the growing pending active requests. Due to startup checks that depend on S3, the first server fails to start back up. Restarting is not a viable option.
12:59 - We post on trellostatus.com letting users know that Trello is down.
1:01 - Trello isn’t accessible, so we officially disable the site and display a maintenance page.
1:06 - The AWS status page is updated with a banner indicating that they are investigating increased error rates for S3 requests in US-EAST-1.
1:20 - We are working to identify all of our dependencies on S3 and considering ways to bring Trello back up without S3.
1:45 - We have an outline of a plan to serve Trello’s core JavaScript and CSS assets from another location, and begin working on the necessary changes to get this working and tested. The proposed solution will allow access to Trello, but user-uploaded assets such as card attachments, board backgrounds, and avatars will still fail to load, and users will be unable to upload any new assets.
3:19 - We succeed in making the changes to our staging environment and begin additional user testing.
3:25 - We observe that a small percentage of GET requests to S3 are succeeding again. We discuss our options as we anxiously observe the slowly improving health of S3.
3:45 - S3’s status is improving, but we conclude that we don’t have enough information to assume that it will be fully recovered soon. We opt to deploy the changes to get Trello back up without S3.
3:56 - Trello is restored. Users are able to access their boards and cards at trello.com again, but some images stored in S3 will fail to load.
4:00 - We continue to monitor the health of Trello as S3 recovers.
5:11 - The AWS status page indicates that S3 is functioning normally again. Our internal monitoring shows the same.
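
The fallback described at 1:45 can be illustrated with a minimal sketch. This is not Trello's actual implementation: the class, the health probe, and the URLs below are all hypothetical, and the only behavior taken from the timeline is that core JavaScript and CSS come from a second location while user-uploaded assets remain unavailable until the primary store recovers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AssetResolver:
    """Picks an origin for an asset based on the health of the primary store."""

    primary_base: str                    # hypothetical S3-backed origin
    fallback_base: str                   # hypothetical second origin, core assets only
    primary_healthy: Callable[[], bool]  # hypothetical health probe

    def url_for(self, path: str, core: bool) -> Optional[str]:
        if self.primary_healthy():
            return f"{self.primary_base}/{path}"
        if core:
            # Core JavaScript/CSS is mirrored, so the app itself can still load.
            return f"{self.fallback_base}/{path}"
        # User uploads live only in the primary store; there is nothing to serve.
        return None


# Example: the primary store is reported as down.
resolver = AssetResolver(
    primary_base="https://primary.example.invalid/assets",
    fallback_base="https://fallback.example.invalid/assets",
    primary_healthy=lambda: False,
)
print(resolver.url_for("app.js", core=True))
print(resolver.url_for("avatars/user.png", core=False))
```

Returning `None` for user uploads, rather than a URL that is known to fail, lets the caller show a placeholder instead of leaving a request pending against the degraded store.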