AWS Frankfurt Experiences Major Outage That Staff Could Not Fix For Hours Due To “Environmental Conditions” On Data Center Floor
Update A single Availability Zone in Amazon Web Services’ EU-Central region (EUC_AZ-1) has experienced a major outage.
The internet giant’s status page says the outage started at 1324 PDT (2030 UTC) on June 10 and initially caused “connectivity issues for some EC2 instances.”
Half an hour later, AWS reported “increased API error rates and latencies for EC2 APIs and connectivity issues for instances… caused by an increase in ambient temperature in a subsection of the affected Availability Zone.”
At 1436 PDT, AWS said temperatures were dropping but network connectivity was still down. But an hour later, the Colossus of Clouds delivered rather disturbing news: an update at 1612 PDT reported that staff were still unable to enter the site for safety reasons.
At 1633 PDT, network connectivity was restored, an event AWS said should result in rapid recovery of EC2 instances. A 1719 update stated that “environmental conditions in the affected Availability Zone are now back to normal” and informed users that “the vast majority of affected EC2 instances are now fully recovered, but we continue to work on some EBS volumes that are experiencing degraded performance.”
Kinesis Data Streams, Kinesis Firehose, Amazon Relational Database Service, and AWS CloudFormation also faltered.
The latest AWS status update concluded, “We will provide more details on the root cause in a subsequent update, but we can confirm that there was no fire at the facility.”
Which leaves the question: what made the data center too dangerous to enter?
Although we lack evidence on which to base a claim, The Register has previously reported on UPS failures and tiny puffs of smoke triggering the release of hypoxic gas in data centers.
The goal of releasing hypoxic gas in data centers is to deprive fires of oxygen. And since humans need oxygen, it may take some time before engineers can return to a data center.
The Register mentions this because it fits the facts of this incident, and AWS’s language about “environmental conditions” preventing entry.
We will update this story if new information about this incident reaches us. ®
Updated to add at 0245 UTC on June 11
AWS updated its incident report (and, most importantly, proved our analysis correct) by revealing that the incident was caused by “the failure of a control system, which disabled multiple air handlers in the affected Availability Zone.”
Air handlers cool the data center, so after they stopped working, “the ambient temperatures started to rise” to dangerous levels, at which point AWS’s servers and networking kit shut down.
“Unfortunately, because this issue impacted multiple redundant network switches, more EC2 instances in this single Availability Zone lost network connectivity,” the update added.
And now for the truly remarkable part:
“While our operators would normally have been able to restore pre-impact cooling, a fire suppression system activated within a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire.”
AWS staff had to wait for local firefighters to arrive and certify that the building was safe. Once that approval was obtained, AWS said, “the building must be re-oxygenated before engineers can safely enter the facility and restore the network equipment and affected servers.”
Safe working conditions have been restored, as have most equipment and services. But it looks like some kit has been damaged, as AWS said: “A very small number of remaining instances and volumes that were affected by rising ambient temperatures and loss of power remain unresolved.”
The cloud giant also let customers know that the fire suppression system that activated remains disabled.