AWS explains outage and says it will make future outages easier to track
Amazon Web Services CEO Adam Selipsky delivers a speech at the AWS re:Invent conference in Las Vegas on November 30, 2021.
Noé Berger | Getty Images
Amazon Web Services on Friday released an explanation for an outage lasting several hours earlier this week that disrupted its retail business and third-party online services. The company also said it plans to revamp its status page.
Problems in Amazon’s large US-East-1 data center region in Virginia began at 10:30 a.m. ET on Tuesday, the company said.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network,” the company wrote in a post on its website. As a result, the devices connecting an internal Amazon network to the main AWS network became overloaded.
Several AWS tools were affected, including the widely used EC2 service, which provides virtual server capacity. AWS engineers worked to resolve the issues and restore services over the next several hours. The EventBridge service, which can help software developers build apps that take action in response to certain activity, didn’t fully recover until 9:40 p.m. ET.
Downtime can undermine the perception that cloud infrastructure is reliable and ready to handle application migrations from physical data centers. It can also have major implications for businesses. AWS has millions of customers and is the largest vendor in the market.
AWS apologized for the impact of the outage on its customers.
Popular websites and heavily used services were taken offline, including Disney+, Netflix and Ticketmaster. Roomba vacuums, Amazon’s Ring security cameras and other internet-connected devices such as smart cat litter boxes and app-connected ceiling fans were also knocked out by the outage.
Amazon’s own retail operations were crippled in some pockets of the U.S. Internal apps used by Amazon’s warehouse and delivery staff rely on AWS, so on Tuesday most employees were unable to scan packages or access delivery routes. Third-party sellers were also unable to access a site used to manage customer orders.
During the outage, AWS tried to keep customers up to date with what was going on, but the cloud provider struggled to update its status page, known as the Service Health Dashboard.
“As the impact on services during this event was from a single root cause, we chose to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue,” AWS said.
Additionally, customers were unable to create support requests for seven hours during the disruption.
AWS said it is now taking steps to address both of these issues.
“We plan to release a new version of our Service Health Dashboard early next year that will make it easier to understand the impact of services and a new support system architecture that is actively running in multiple AWS regions to ensure that we don’t have delays in communicating with customers,” AWS said.
This isn’t the first time that AWS has changed the way it reports issues.
In 2017, an outage that hit the popular AWS S3 storage service prevented engineers from displaying the correct color to indicate availability on the Service Health Dashboard. Amazon posted banners and took to Twitter to share new information.
“We’ve changed the SHD Admin Console to run in multiple AWS Regions,” Amazon said in a post about this episode.