The AWS Outage and the Need for a Known Fragility Database
Fragility and Complexity Are Vulnerabilities. AWS Proved It.
The early hours of Monday, October 20th must have been filled with bleary-eyed and panicked AWS technicians scrambling to get services back online after a massive global outage. Major platforms from Facebook to Coinbase suffered downtime as users were unable to access their accounts. Now, about four days after the outage, damage assessments are underway. Some experts estimate that the damage from this outage alone will run into the hundreds of billions of dollars.
The cost of the AWS outage probably will not surprise many experts. What may surprise people is that those billions of dollars were lost in mere hours and, as of this writing, not through the actions of malicious actors. The system simply failed under its own weight.
This outage is a signal that our view of security needs to change. For years, the value proposition of the big cloud providers was that the security they could achieve at scale was far above what could be achieved at small scales or by on-premise servers. What that pitch did not account for were the vulnerabilities created by the complexity and fragility of these systems.
The AWS outage is an example of a new kind of vulnerability: fragility. Fragility is not an unknown term in systems thinking or engineering, but it may be foreign to the average cloud or cloud application user. Fragility refers to the potential for a system to fail because of its inherent construction, not necessarily because of an outside attack. Had the AWS outage been caused by a massive cyberattack, users might, to some extent, have been more comfortable with it. Instead, users have been given technical jargon like “API errors in the US-EAST-1 region” and disruptions to “EC2 instances.” These terms accurately describe the issue but are of little comfort to those on the losing end of the estimated hundreds of billions of dollars.
Major zero-day attacks are feared by cybersecurity professionals because of their reach and their potential to cause billions of dollars in economic damage. The outcome of a fragility event, like the AWS outage, is the same, so why do we not view the two the same way? Our cybersecurity is missing a big piece: a database of known fragility points similar to the Known Exploited Vulnerabilities (KEV) Catalog maintained by CISA. Without a view of BOTH our vulnerabilities from malicious actors AND our vulnerabilities from complexity and fragility, our security picture is incomplete. Large cloud infrastructures have grown in complexity, and the fragilities hiding in that complexity will cause another event like this sooner or later. We need to build a fragility database to protect ourselves from the next big cloud failure.
Security at Scale, Fragility Everywhere
The global economy has settled into a comfortable position regarding cloud computing. For the most part, enterprises large and small, as well as governments, have embraced cloud computing as the optimal way to store information and host websites. The cloud is no magic floating entity in the sky. It is literally square miles' worth of servers housed in huge warehouses we call data centers. Those data centers require large volumes of electricity to power the servers and keep all those webpages up. Some of the webpages hosted there are social media sites, but others provide essential services such as banking (which counts as critical infrastructure). For many years, huge portions of the internet have existed this way, concentrated among three major providers:
Google Cloud
Amazon Web Services
Microsoft Azure
We’ve all accepted the case the major providers sold us: namely, that they can enact cybersecurity measures faster and better than an individual organization running its own on-premise server. This security-at-scale argument won over many buyers. In addition, the ability to access information from anywhere while minimizing on-premise hardware was a huge selling point. The same is true for individual users. Many people keep everything from pictures to tax information in the cloud and store very little locally on their laptops or tablets. This is seen as standard practice and is taken for granted by many.
There’s always been a lurking vulnerability in the concentration of cloud services within three major providers. An outage, for any reason, would by the nature of the architecture cascade across multiple industries, individuals, and potentially governments (federal, state, and local). The providers told us this risk was mitigated by their scaled security, and they have by and large been correct. It is true that AWS, with its dedicated staff of cybersecurity engineers around the globe, is far better at keeping its architecture secure than I would be if I were running a server in my closet. But with this scale and concentration came complexity, and complexity is a recipe for October 20th.
AWS is the leading provider of cloud services by a significant margin, with Google and Microsoft as its main competitors and IBM and Alibaba holding smaller shares of the market. AWS’s commanding lead reflects the amount of investment in the hardware architecture that enables its service for millions of users and companies. Naturally, AWS grew its cloud offering as demand grew. The cost savings and security assurance were a huge relief for many companies, and AWS got bigger. But it wasn’t just about hosting websites and keeping copies of grandma’s photos. Soon, AWS was offering space for developers and, increasingly, compute for training and building AI. All these offerings swelled not only the size of AWS’s physical infrastructure but also the number of interdependent parts in it: developer tools, APIs, and other services layered on top of its physical storage space. Another word for this is complexity.
Complexity, as the term is used in mathematics and systems science, describes a system that is composed of many parts and that exhibits emergent behaviors. Emergent behaviors are outputs of a system that cannot be predicted from the components of the system alone (an idea often associated with Chaos Theory). AWS would not be an example of actual Chaos Theory, but the analogy is useful for explaining fragility. AWS created a system so large and so complex that small failure points were lurking in the architecture where neither its technicians nor its users could see them. No amount of AWS certifications would have helped stop what happened on October 20th. As AWS and providers like it grew, incentivized by significant profits and market demand, they built more complex systems. Hiding in that complex system was one error that caused an estimated $100B of damage in a few hours. That’s fragility.
Known Vulnerabilities, Unknown Failures
The Cybersecurity and Infrastructure Security Agency (CISA) within the Department of Homeland Security maintains the Known Exploited Vulnerabilities (KEV) Catalog. It is a major element of cybersecurity practice, allowing professionals to design firewalls and other cybersecurity products to defend against known exploits that may be used by malicious actors. This catalog, of course, only holds those vulnerabilities we know about and not the far scarier zero-day exploits. Because zero days are not known to us, they have a significant potential to cripple cyber systems and become familiar names like Stuxnet, NotPetya, and WannaCry. Each of these attacks (and many more like them) caused massive disruptions, economic damage, and even physical damage. Since they had never been seen in the wild before, there was no defense.
Cyber professionals have improved significantly in their quest to defend against intentional malicious attacks, but the AWS failure was something else.
It was a new type of zero day, a “zero-day fragility.”
While the Known Exploited Vulnerabilities Catalog is far from perfect, it does keep a log of tools and exploits that a professional may encounter. There is no such database for fragilities in our internet architecture. Because we are unable to map the fragile points in the internet infrastructure on which we depend, we are unable to build resilience against these types of failures and unable to respond effectively when they occur.
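To make the idea a little more concrete, here is a minimal sketch of what a single entry in a known-fragility database might hold, and how a user might ask which known fragile points touch a service they depend on. Everything in it is an assumption for illustration: the field names, the `FragilityRecord` and `FragilityDatabase` classes, and the `KF-EXAMPLE-1` identifier are hypothetical, no such database or standard exists today, and the sample entry only restates the outage language quoted earlier rather than any confirmed root cause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragilityRecord:
    """One hypothetical entry in a known-fragility database."""
    record_id: str                     # made-up ID scheme, not a real standard
    provider: str                      # who operates the fragile component
    component: str                     # the part whose failure cascades
    failure_mode: str                  # how it fails without any attacker
    dependents: tuple[str, ...] = ()   # services known to break along with it


class FragilityDatabase:
    """A toy in-memory collection of fragility records."""

    def __init__(self) -> None:
        self._records: list[FragilityRecord] = []

    def add(self, record: FragilityRecord) -> None:
        self._records.append(record)

    def affecting(self, service: str) -> list[FragilityRecord]:
        """Return every record that lists `service` as a dependent."""
        return [r for r in self._records if service in r.dependents]


# Illustrative entry only; the details are placeholders, not findings.
db = FragilityDatabase()
db.add(FragilityRecord(
    record_id="KF-EXAMPLE-1",
    provider="AWS",
    component="US-EAST-1 regional APIs",
    failure_mode="API errors that disrupt dependent services",
    dependents=("EC2",),
))

print([r.record_id for r in db.affecting("EC2")])
```

The point of the `dependents` field is the one this article argues for: a fragility entry is useful only if it lets people downstream see that they are exposed before the failure happens.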
If maliciously deployed zero days AND unknown fragilities can each cause hundreds of billions of dollars in damage in a few hours, the backbone of much of our economic and financial systems is in more jeopardy than we realize.
The worst outcome of the AWS outage would be for it to simply fade from the news cycle.
If this outage is treated as just a temporary pain in the neck when some people in one region of the US could not access their Facebook accounts, we have not learned the right lesson. Instead, we should be thinking about complexity and fragility in the same way that we think about traditional vulnerabilities because they have the same outcome. Until we can see the full picture of the vulnerabilities we face, we will be fragile. Creating a system that is anti-fragile will require this view. It will also require our perception of the concept of vulnerability to include fragility.
AWS proved we need to expand our view, and it should not cost another $100B before we change.


