Microsoft Joins the Party with its Own Major Internet Outage
The two companies that had major cloud infrastructure outages in the last two weeks own 51% of the global cloud market.
We know that AWS and Microsoft are competitors, but this is too much.
Last week, a massive AWS outage made huge portions of the internet, including popular social media sites, unavailable for multiple hours. Not to be outdone, just over a week later, Microsoft Azure reported huge outages taking down
Azure
O365
Xbox
*gasp* Starbucks
*double gasp* Minecraft
and other popular platforms for millions of users. As of this writing, no estimate for when services will be fully restored has been released. So far, nothing indicates malicious activity; instead, the outage has been attributed to “a DNS issue” and “Azure Front Door issues.” Microsoft released a statement late yesterday saying that it is restoring Azure to the “last known good configuration.” Comforting.
Two major outages in as many weeks should send a loud signal to all of us. We allowed the AWS outage to slip from the news cycle before truly discussing what it meant. Now, it’s happened again. Some might roll their eyes about the outage of Minecraft, but what we need to be talking about is the inherent fragility of the systems our online lives depend on. These systems are buckling under their own weight, a weight rooted in increasing levels of complexity. We need to talk seriously about a Known Fragility Database.
Concentrated Complexity
Last week’s post on the AWS outage suggested that we need a Known Fragility Database in the same spirit as the Known Exploited Vulnerabilities Catalog. Administered by CISA, the catalog is a collection of known (not zero-day) vulnerabilities that have been observed being exploited in the wild. It allows cybersecurity professionals to prioritize patches, firewall rules, and other security measures that guard against these exploits. Of course, zero-day exploits are not reflected, so the catalog alone is not the answer. It is, however, an essential part of understanding our exposure to, and protection from, vulnerabilities that may be exploited by malicious actors.
So powerful is the industrial machine that secures clouds, applications, and data from malicious attack that we’ve been caught looking in the wrong direction.
Malicious attacks are the stuff of movies, summoning images of inadvertent nuclear launches, civil unrest, or high-tech future wars. But what we’ve learned, TWICE now, is that the applications and broader infrastructure on which we depend need no assistance failing. They can fail all by themselves due to concentrated complexity.
Complexity refers to systems that are composed of many parts but whose output cannot be predicted by the parts. That’s called emergent behavior and falls into the category of complex adaptive systems or chaos theory. Without a mathematics lesson, chaos theory is the study of how small disturbances in large complex systems can cause behaviors that we can’t predict. Sound familiar?
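For readers who do want a small taste of the mathematics, here is a minimal sketch of that sensitivity. It uses the logistic map, a standard textbook example of chaotic behavior, not a model of any cloud provider. The function name, starting value, and step count are all illustrative assumptions.

```python
def logistic_map(x0, r=4.0, steps=60):
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs of the same system; the second starts one billionth away.
baseline = logistic_map(0.2)
nudged = logistic_map(0.2 + 1e-9)

# Distance between the two runs at every step.
gaps = [abs(a - b) for a, b in zip(baseline, nudged)]

print(f"gap after 1 step: {gaps[1]:.2e}")
print(f"largest gap within 60 steps: {max(gaps):.2f}")
```

The two runs are indistinguishable at first and then drift far apart, even though nothing about the rule changed. That is the sense in which a tiny disturbance can produce an outcome nobody predicted.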
In the cases of both AWS and Microsoft, the concentration of highly complex systems caused the broader infrastructure to buckle under its own weight. Small, otherwise entirely ignorable components of a massive architecture caused broad outages. To some, they were minor inconveniences like the inability to play Xbox. To companies, these effects have real costs, potentially in the billions of dollars total. Yesterday, it was gaming platforms. Tomorrow, it could be healthcare systems or critical infrastructure. The sad truth is that these outages, as bad as they were, only scratched the surface. Outages like this have the potential to be FAR worse.
The culprit is the concentration of this complexity in a small number of HUGE providers. The worldwide cloud services market reached $94 billion in the first quarter of 2025, with AWS the clear leader at 29% of global market share. Microsoft sits in second place with 22%. Let’s stop there for a moment.
The two companies that had major cloud infrastructure outages in the last two weeks own 51% of the global cloud market.
Neither was attacked (as of this writing). Both displayed clear fragility impacting multiple sectors, industries, and governments. If we are only looking at vulnerabilities and malicious actors, we are missing half of the picture.
A Second Warning
When people write after-action reports on significant disaster events, they often find there were warning signs. The fragility of our internet backbone and public infrastructure has warned us twice in two weeks. Had this been a major malicious attack by state operators or cyber criminals, our response would be very different. Instead, our response is something like “just fix it.” I understand that frustration, but the lesson is bigger than that. We are in a moment where our traditional view of what it means to maintain confidentiality, integrity, and availability (the CIA triad) is shifting. It’s no longer (not that it ever was) the purview of cyber professionals in a darkened Security Operations Center to fend off cyber armies to keep our systems up. Now, we need to do the less sexy, less screenplay-worthy tasks of examining complexity and looking at how to mitigate fragility. Easier said than done.
To build a Known Fragility Database, we would need to look at concentrated complexity. Where are the biggest and most complex segments of the architecture that hold the most critical services? The information is likely not concentrated, but the large cloud providers certainly know who their clients are and where their servers are. We might call it the cloud, but it’s really just another computer with some souped-up capabilities.
After identifying concentrations of complexity, we would need to look at components of that architecture from its security capabilities to its processing power. From there, we need to ask some critical questions:
What redundancies are in place (not simply power, but compute capacity)?
How easily can services be shifted to other servers in the event of an outage?
What reserve capacity exists (i.e. servers that aren’t currently in use that can be brought online in an emergency)?
What inter- and intra-dependencies exist between this segment of the cloud architecture and others?
To what extent have customers been informed of these fragility points?
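To make those questions concrete, here is a rough sketch of what a single entry in such a database might look like, along with a small helper that answers the dependency question. No such database exists today, so every name here (FragilityEntry, affected_by, and the sample segments) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class FragilityEntry:
    """One hypothetical record in a Known Fragility Database."""
    segment: str                                            # architecture segment being described
    redundancies: List[str] = field(default_factory=list)   # e.g. power, compute capacity
    failover_minutes: Optional[float] = None                # time to shift services; None = unknown
    reserve_capacity_pct: float = 0.0                       # idle capacity available in an emergency
    depends_on: List[str] = field(default_factory=list)     # other segments this one needs
    customers_informed: bool = False                        # have customers been told?


def affected_by(failed: str, entries: Dict[str, FragilityEntry]) -> Set[str]:
    """Return every segment that directly or indirectly depends on `failed`."""
    affected: Set[str] = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for name, entry in entries.items():
            if current in entry.depends_on and name != failed and name not in affected:
                affected.add(name)
                frontier.append(name)
    return affected


# Made-up sample data: three segments that all lean on one small component.
entries = {
    "dns": FragilityEntry("dns", redundancies=["power"]),
    "edge-routing": FragilityEntry("edge-routing", depends_on=["dns"]),
    "identity": FragilityEntry("identity", depends_on=["dns"]),
    "storefront": FragilityEntry("storefront", depends_on=["edge-routing", "identity"]),
}

print(sorted(affected_by("dns", entries)))
```

In this toy data, a failure in the one small segment reaches every other segment, while a failure in the storefront reaches nothing else. A real catalog would need far richer data, but even a structure this simple makes it possible to ask which small component sits underneath the most services.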
The hardest part will be the system dependencies, but some of that information likely already exists. The rest might be discoverable through the use of a specialized AI application. In any case, that information needs to be concentrated and communicated to consumers and cyber professionals. The people and organizations that pay the $94 billion for cloud services should understand the full scope of what they are buying. The concentration of cloud computing services in a small handful of companies is creating serious resilience problems for an infrastructure that holds our digital lives. Without a Known Fragility Database, we don’t have a real chance of avoiding outages like these. Yesterday, it took down Xbox. Next time it could be something more vital.
Maybe we need to amend the CIA triad to CIAR.
Confidentiality, Integrity, Availability, Resilience



