Working Conversations Episode 240:
How Big is Too Big? What the AWS Outage Can Teach Us
What happens when the invisible systems that keep our world running… stop working?
That’s exactly what happened when Amazon Web Services (AWS), the cloud backbone powering everything from food delivery to online banking, went down.
For hours, apps froze, websites crashed, and even government services ground to a halt. Businesses couldn’t process orders. Smart devices stopped responding. Employees across industries suddenly found themselves unable to do their jobs.
It was a powerful reminder that we’ve built our world on systems we rarely see, yet depend on every single day.
In this episode, I unpack what that AWS outage revealed about the fragile infrastructure beneath our digital lives. Drawing on Normal Accident Theory, we’ll explore how tight coupling and complex interconnections make today’s biggest systems both incredibly efficient and dangerously delicate.
I also share real-world examples of how one point of failure can ripple outward, halting productivity and communication and even compromising safety. But this isn’t just a story about technology. It’s a story about trust: trust in systems we don’t control, and in leaders who must prepare for what they can’t predict.
As we navigate the future of work and life in an increasingly digital landscape, we have to ask ourselves: When does efficiency cross into dependency? How do we balance scale with resilience? And what’s our backup plan when the system fails?
Tune in for a thought-provoking look at the hidden infrastructure shaping our world and why resilience, not just growth, should be at the center of how we build what’s next.
Listen and catch the full episode here or wherever you listen to podcasts. You can also watch it and replay it on my YouTube channel, JanelAndersonPhD.
EPISODE TRANSCRIPT
The other day when AWS went down, so did a staggering number of systems, systems that we rely on every single day. Food delivery apps stopped taking orders. Smart home devices like Alexa went silent. Internal tools at major companies just stopped mid-task. And even some government systems and hospital networks were affected. It's weird. One hiccup in one company's infrastructure and suddenly the whole modern world paused. Makes you wonder: how big is too big? How much of our lives are built on systems that we don't see, can't control, and barely understand?
And what does it say about the kind of digital world that we've built for ourselves? We're going to dive into all those things and more today. Now, like you, I had some evidence that some things were wrong on that particular Monday morning. The first for me was that my Wordle stats were missing. Now, I play the Wordle hosted on the New York Times website every single day. Rarely does a day go by when I miss playing Wordle, and if I do, I'm usually on vacation or something. I'm always very proud of the streak that I have going. And I couldn't get my stats that particular morning. Now, it was still six something in the morning.
I'm sipping my first cup of coffee and of course I don't realize that it's a broader outage. I just think that something's happening with the New York Times website or with Wordle. A little bit later that morning I discovered that my Alexa device was offline. Once I got to the office, I found that Michelle, who does my social media reels and edits and so forth, was not able to download the reel that she was editing for me on Canva, the graphic design system that we use. I also noticed that some of my free downloads, the ones where you're listening to the podcast and I say go to such and such website and download this free tool, were affected. I also have various course material for classes that I create, and so on. Some of those things could not be downloaded. Then a Zoom meeting I was going to join just kept stopping.
It kept spinning and stopping, spinning and stopping. Later I found out that my kids' school app, the one they turn their homework in to, was down. It's called Canvas, different from Canva, the graphic design software; Canvas is an educational learning platform. So instructional activities at school were hampered as well. And then later that day I couldn't register my son for a winter basketball camp that he was wanting to take. It was super stressful for me because these camps always fill up, and I had the registration page all filled out and I get all the way to, you know, click here to lock in your registration and pay. And it just spins and spins and spins, and it says something about an API. And I can see in the URL bar that it's going out to an AWS server. Now, by that time, the outage was well known and was for the most part corrected.
But not every system was completely reset. I did get them registered the next morning, if you were curious about that. But dozens of other systems that I don't even use on a daily basis were also impacted: the Ring doorbell, Xbox, Snapchat, the McDonald's app, Pokemon Go, Life360, Duolingo, Apple Music, and a host of others. And then of course there are systems that I do use, I was just not using them on that particular Monday morning, like Hulu, HBO Max, Instacart, and Venmo. It was a big outage. More than 1 million reports of outages came within a two-hour period in the US alone. So this was AWS.
I'll get to what that is exactly in a moment, but AWS accounts for 37% of the global cloud market share. In 2024, according to a market research report by the firm Gartner, they generated $107.6 billion, that's billion with a B, in revenue. It runs on 6 million kilometers of fiber optic cabling and is available in 38 geographic regions. Now AWS and its counterparts, like Microsoft Azure, Google Cloud, and so on, host most of the modern Internet. Their scale offers tremendous efficiency for companies like mine that use them, and reliability most of the time. But it also creates a single point of failure when something happens. So when AWS goes down, as it did not that long ago, we don't just lose one website, we lose portions of global e-commerce, healthcare systems, transportation, logistics, and even critical infrastructure.
And of course my kids' school assignments as well. Now, we're paying AWS to host our data. But we're not just outsourcing data, we're outsourcing resilience, because we're putting it all in the hands of one company or a small few. But this isn't really about AWS per se, it's about enormous systems and what happens when they fail. We can think of AWS as part of the digital supply chain. When you think about the regular supply chain, the physical supply chain, when a shipping port, let's say, goes down in the physical world, the whole physical supply chain gets compromised and stuff gets bunged up at whatever port has been impacted. This, on the other hand, was the digital supply chain. So when something gets gummed up on the Internet, well, it slows everything down and kind of brings certain systems to their knees.
So my question that I want to explore today is what happens when the systems that we depend on become too big to fail, but then still fail? Now, for those of you who aren't familiar with AWS and maybe just heard in the news that there was an outage, let me give you a super low tech explanation for what happened. So AWS stands for Amazon Web Services. It's a product owned by Amazon and it's basically cloud computing at scale. It's the behind the scenes infrastructure that powers a lot of the Internet. Now, instead of every company running its own giant room full of servers, hence cloud computing, they rent space and computing power from Amazon Web Services. Now, those servers store data, they host websites, and they run so many apps that we use every day. Again, from Netflix to Hulu to banking apps to Pokemon Go and other social media services and workplace tools like Slack. So when AWS hiccups, it's not just Amazon that's affected, it's a big old chunk of the Internet.
So exactly what happened when AWS went down goes something like this. A whole bunch of data is stored in a database service called DynamoDB, the DB as in database, and that's part of AWS. Now, customers couldn't access the specific data that was in DynamoDB because of a problem with the Domain Name System, or DNS. You might have heard of DNS. You can think of DNS as an Internet traffic director, pointing requests for information on the Internet to the places where that information is stored. So the information was there, no problem with that, no worries about that whatsoever. But the traffic director was having a very bad day at work. The traffic director was basically out of commission.
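If you like to see ideas like this in code, here is a minimal Python sketch of that traffic-director step, using only the standard library. The hostname is just an illustrative placeholder, not a real endpoint; the point is that the data on a server can be perfectly healthy while the name lookup, the DNS step, fails.

```python
import socket

def where_is(hostname: str) -> str:
    """Ask DNS, the 'traffic director,' where a server lives.
    Every app does some version of this before it can request data."""
    try:
        ip_address = socket.gethostbyname(hostname)
    except socket.gaierror:
        # The data on the server may be fine; we just can't get directions to it.
        return f"DNS lookup failed for {hostname}: the traffic director is out of commission."
    return f"{hostname} lives at {ip_address}; the request can proceed."

# Hypothetical hostname, for illustration only.
print(where_is("dynamodb.us-east-1.example.com"))
```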
So why did it happen? Well, that, my friends, is where things get very tricky. Now, to understand exactly what happened, I want to bring in a theoretical framework. This framework comes from Charles Perrow, who wrote a book called Normal Accidents: Living with High-Risk Technologies. His book was published in 1984, and it examined what happened in the Three Mile Island nuclear accident. A nuclear plant is a very complex system with lots of interdependent moving parts, and so, very much like Amazon Web Services, there are a lot of interdependent moving parts in both of those systems. Okay, so Charles Perrow's argument is deceptively simple. He says that in systems that are both complexly interactive and tightly coupled, accidents are not merely possible, they're inevitable.
They are normal. Now, not in the sense of the word normal, meaning acceptable, but in the sense of being just like built in or baked into the system's design. Again, not normal acceptable, but normal as in inevitable. Even when everyone behaves correctly, the system can still fail. Or even when the various parts of the system are operating as intended, the system can still fail. Let me give you a super high level, non technical example of this idea of normal accident theory. My two youngest kids go to the largest high school in Minnesota. There are nearly 4,000 students in that high school.
Now, just for a moment, I want you to think about passing time between classes. The bell rings and nearly 4,000 students all pour out into the hallways at once. Everyone's just trying to get to class. But with that many moving parts and that little amount of space to maneuver, whether that's through the hallways or up and down the stairwells, well, it's inevitable that someone's going to get bumped or jostled. Now, nobody did anything wrong. It's just what happens when a system is crowded and fast moving. So let's run through the high school hallway example with passing classes, and let's run that through Charles Perrow's normal accident theory.
Okay, so the first part is complex interactions. There are many independent agents, and this would be students moving simultaneously through those hallways and stairwells. Each one is making local decisions, adjusting their path, chatting with their friends, opening lockers, reaching into their backpacks, maybe checking their cell phones, without full visibility into what's happening. Certainly not 30 feet away, sometimes not even three feet away. Now, those individual decisions add up to unpredictable, emergent patterns. Something as simple as somebody swinging their backpack over a shoulder, or someone stopping to tie their shoe, or two groups merging at an intersection can really gunk everything up. No single person intends chaos. It just emerges from the ordinary interaction in that system.
Okay, so that is the idea of complex interactions. Like things can go wrong even when there are no nefarious agents and nobody is trying to make problems. Okay, so that's the first part, complex interactions. The second part is tight coupling. Okay, so again, go back to that hallway in the school. Timing is synchronized. You have to get from classroom A to classroom B. Sometimes that's across the whole building.
Sometimes it's on a different floor. But when the bell rings, everyone moves at once. There is very little slack. The hallways are the size that they are, even if the swell of people would demand a bigger hallway; the building doesn't change. There are fixed schedules. That bell rings at a certain time, and then the next bell is going to ring at a certain time.
And you've got to get from point A to point B in the short passing period time. That's it. A stumble, a spill, a slowdown in one spot can cause cascading delays across the whole building. Just think of a traffic jam or a pile up on the stairs. And I'm not even talking about if somebody drops their backpack and it spills everywhere. Although that could happen, too. There's just not enough space or time to isolate small disturbances before they turn into traffic jams or spread to be bigger problems. Okay, so that's the tight coupling that we're talking about in that high school hallway.
So even with everyone behaving normally and no nefarious agents, nobody bullying or pushing or anything, unintended incidents are still inevitable. No one needs to do anything wrong. The density and the timing simply guarantee that someone will bump into somebody else. And that is precisely Perrow's point. In certain systems, accidents are not caused by negligence. They're just inevitable, because the system is complex and the system is moving so fast. Accidents and incidents are simply baked into the structure. All right, now let's move into a more technical realm, and we won't go all the way to AWS just yet.
Although I promise you, we're going to get there. Okay, let's say that we're in the world of software development. And a software developer adds a new line of code or a new section of code to a large mature software system. Think a Microsoft product or Google Docs or something like that, and they're adding a new feature. And so this software developer writes this new code. Now, the code compiles fine. There are no syntax errors, there's no obvious typos. And it looks like in an independent test, that piece of code does exactly what it's supposed to do.
But once it gets deployed, once it gets added into that larger code set along with all the other features in that product, well, something could break in unexpected ways. A performance issue, data corruption, a hidden race condition, or some other unintended outcome. Now, I just said a hidden race condition. Let me define what I mean by that. A race condition in software happens when two operations, maybe coming from two different users, try to do the same thing to the same data at almost the same time. It's probably best explained with an example.
Okay, let's say you have two different people trying to withdraw $100 from the same bank account. I share some bank accounts with my children and my husband and so forth. So let's say we both go into the system and each try to make a transaction withdrawing $100 from an account that has $500 in it. So the balance is $500. I make the withdrawal and the balance should drop to $400. But at the same time, one of my kids makes a withdrawal for a hundred dollars, too, and those two things happen almost simultaneously. So I do mine and it drops to $400, but right at the same time they do theirs, and it drops to $400.
And so for a split second it's going to say $400, when really the balance should be $300, because we each took out a hundred dollars at almost the same time. Okay, now that's a race condition. The operations raced to update the balance at the same time, and the timing gave one or both of us the wrong outcome. We each took out a hundred dollars at almost exactly the same time, but I only saw the withdrawal that I made and they only saw the withdrawal that they made. Anywhere from a few microseconds to a few actual seconds later, the balance will update properly. So again, that's a race condition. Now, some race conditions are easy to detect in testing, and in fact software developers test for this sort of thing, especially in complex systems.
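If you want to see that bank account race in miniature, here is a small, hedged Python sketch. The account, the dollar amounts, and the deliberately unsafe read-then-write are all made up for illustration; real banking code would protect the balance with a lock or a transaction.

```python
import threading
import time

balance = 500  # shared account balance, in dollars

def withdraw(amount: int) -> None:
    """Deliberately unsafe read-then-write, to expose the race."""
    global balance
    current = balance           # both withdrawals read $500...
    time.sleep(0.01)            # a tiny processing gap, where the race hides
    balance = current - amount  # ...and each writes back $400

# Two people withdraw $100 at almost the same instant.
first = threading.Thread(target=withdraw, args=(100,))
second = threading.Thread(target=withdraw, args=(100,))
first.start(); second.start()
first.join(); second.join()

print(balance)  # should be 300, but prints 400: one withdrawal was silently lost
```

The usual fix is to wrap the read-and-write in a lock (threading.Lock in Python) or a database transaction, so only one withdrawal can touch the balance at a time.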
When a race condition is relatively simple and happens every time, it shows up in testing and software developers can fix it. But hidden race conditions only show up under certain timing or certain load conditions in a system, so maybe once in a thousand runs, or only under really high traffic, and you can't reliably reproduce or predict them. They're emergent behaviors of complex, concurrent activity in a system. Now, hopefully you're following along so far and I haven't lost you yet. A lot of redundancy is also built into these systems. In the case of AWS, data isn't just stored in one place. That would be too much risk.
So that database that I was talking about before, it's got backups and backups of its backups spread across all different servers. Now there are multiple ways to access those backups. So just like when, let's say you're physically in a car and you're driving to the bank, but a street is impassable because a tree has fallen over the road, well then you have to turn around and go around the block and come at the bank from the other direction. So let's say you do that, but everybody else needs to get through that same street. Maybe they're not even going to the bank, but they're going to go past that. They also have to turn around, and before you know it, you've got a traffic jam that's developed around the whole block. Okay, so that's what I'm talking about here, too, in terms of concurrent things happening at the same time.
And we've, you know, sometimes got multiple different calls to the same data, or nearly the same data, happening at the same time. Complex systems like AWS are programmed to keep trying if they can't get what they need. So when a system calls out, let's say, to a website, and you've probably had this happen where you type in a website URL and it sits there and spins and takes longer than usual, well, that system is trying. Your browser is trying multiple times to get that website to come in. Sometimes it does, but every once in a while you get an error, maybe a 404, some message that you can't get to that website right now. But your browser tried a bunch of times.
So again, these systems have that redundancy built in to keep trying when they don't get what they need. And all of those extra tries add up to extra traffic, which gums things up as well. Okay, now back to our software developer. The problem isn't necessarily the new code that the software developer wrote in isolation. It's the interplay between that code and all of the other parts of the system that breaks something. Okay, deep breath. This is probably more of a technical education than you thought you might get today.
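Here is a small, hedged Python sketch of that keep-trying behavior. The get_page function and the URL are hypothetical stand-ins, not a real API. The thing to notice is that every failed attempt puts one more request on the network, which is exactly the extra traffic we just talked about, and it's why well-behaved clients add a growing delay (backoff) between tries.

```python
import random
import time

def fetch_with_retries(get_page, url: str, max_attempts: int = 5):
    """Call a hypothetical get_page(url), retrying when it fails.

    Each failure sends one more request over the wire. The exponential
    backoff plus a little random jitter keeps millions of retrying
    clients from all hammering the same server at the same instant."""
    for attempt in range(1, max_attempts + 1):
        try:
            return get_page(url)  # one more request hits the network
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: this is the error you eventually see
            wait_seconds = (2 ** attempt) + random.random()  # 2s, 4s, 8s... plus jitter
            time.sleep(wait_seconds)
```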
Now, let's actually look at AWS, shall we? Let's look closely at complex interactions and tight coupling in the AWS outage. Now, again, without casting blame, because there are a handful of other companies that are almost as large doing very much the same thing. It's not like this is an Amazon problem or an AWS problem. It could just as easily have happened to one of the other large cloud computing companies out there. So we're not blaming AWS; instead, let's explore what the incident reveals about how massive, interdependent systems behave and what it means for all of us who rely on them. Okay, so let's look at complex interactions first. The map is not the territory: in a cloud ecosystem like AWS, complexity isn't just about size.
It's about the interdependencies that we can't fully see. So what does this look like in practice? Well, we've got thousands of services: storage, computing, networking, identity, remember that DNS, traffic routing, monitoring, and so on. All of these things are interacting continuously across multiple regions, across multiple pieces of physical equipment, and through, again, hundreds of thousands of kilometers of fiber optic cable. And each system and service has its own health checks built in. There's also scaling logic and internal failover procedures. All of that's built in.
All of that's automated. It's designed for resiliency, to a certain extent. A small, routine update in any one of those systems, let's say a configuration tweak or a database patch, can trigger behaviors elsewhere in the system that no engineer could ever have predicted. Just like that kid dropping his backpack is going to cause problems for the 30 people behind him in ways that no one ever predicted in that high school hallway. Okay, so these are the emergent properties that we talked about earlier. Everything is working as designed, but something unexpected arises from the interactions of this really, really complex system. So why does this matter? Well, even the AWS engineers, who are among the smartest and best engineers in the world, can't model every possible interaction in real time to come up with all of the ways to make the system more redundant and more resilient.
It's not incompetence. That's just the nature of complex systems. The dependencies are so intricate that cause and effect end up blurring together, and we can't necessarily unravel them and point to a specific point of failure. So when the system stumbles, the story isn't simply that a bug got missed; it's more that the system behaved in ways that completely exceeded human comprehension. That's the complex interaction that Perrow is talking about. Now let's turn to look at tight coupling. This is when speed becomes fragility.
So cloud infrastructure is designed for efficiency and absolutely immediate response. When I go to enter my credit card information into that system where I want to pay for the winter basketball camp my son wants to take, I'm expecting that to be relatively immediate. I don't want to sit there and wait and wait and wait. We have come to expect that certain sense of immediacy. And systems like AWS and all of those other big systems have automatic load balancing, that is, routing traffic within milliseconds; if one server is too busy, it's going to send you to another one, and you, as the user, don't even know it happens. It's that fast. Auto scaling kicks in the instant that demand spikes. So think of the system as a straw.
You know, a straw is only as big as it is, but as soon as the demand gets bigger, that straw has to get wider. The metaphor isn't perfect, it's not literally about the pipes carrying all of the Internet's load, but the point is that a system is built to scale when the demand gets higher. And that's going to take more energy, more computing power, more of a lot of things. These systems are built to scale up when the demand is higher and then scale down when the demand is lower. So when that spike happens, it automatically scales up. There are monitoring tools to trigger corrective actions automatically; it's not like an engineer has to be sitting there watching the system. A human doesn't even need to know that something's off. There's a little sketch of what that kind of scaling rule can look like just below.
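For the curious, here is a tiny, hedged Python sketch of the kind of threshold rule an auto-scaler might follow. The function name and the numbers are invented for illustration; this is not AWS's actual scaling logic.

```python
def desired_server_count(current_servers: int, avg_load_percent: float) -> int:
    """Toy auto-scaling rule: widen the straw when demand spikes,
    and narrow it again when things quiet down. Numbers are illustrative."""
    if avg_load_percent > 80:
        # Demand spike: add roughly half again as many servers.
        return current_servers + max(1, current_servers // 2)
    if avg_load_percent < 30 and current_servers > 2:
        # Quiet period: give one server back to save money and energy.
        return current_servers - 1
    return current_servers  # otherwise, leave things alone

# Example: 10 servers running at 92% average load -> scale up to 15.
print(desired_server_count(10, 92.0))
```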
The system is designed to correct itself, to fix itself. So this tight coupling makes a system incredibly responsive, which is great, but it also removes the slack, the breathing room the system might need to absorb surprises. The slack has simply been engineered out of the system. So there's a bit of a paradox here. When something small goes wrong, automated systems respond immediately, and sometimes in ways that actually amplify the problem at hand. It's crazy. Think about a flood of retry requests. Remember, I was talking about this redundancy.
And so if it doesn't go through the first time, it's going to retry and retry and retry and retry. If there's a glitch over here, and it can't actually do the computation or reach the information in the database and pull it back to you, it's going to continue to retry. That flood of retry requests from various apps can really overwhelm various parts of a network. And regional failovers that normally enhance resilience can spread that instability across large geographic regions when the underlying control system is impaired. So tight coupling keeps things humming under normal conditions, which is great; we don't even notice it. But when tight coupling gets in the way because there is some sort of point of failure, it is much easier for a local disturbance to cascade globally, which is what happened in the case of the AWS snafu the other day. So that's technically what happened, if you need to know what happened.
But now let's back up and look at the big picture, because this outage is not a story about AWS specifically. It's a story about the trade-offs of scale in modern infrastructure. At a certain size, complexity and coupling become inseparable companions. To keep global systems performing, you automate and connect more deeply, and that is tighter coupling. To offer more services, you layer on more interdependency, and that is greater complexity. And the move toward efficiency quietly reduces transparency and resilience in these systems. Then the system crosses an invisible line, from merely large to almost too big to be fully understood. So that's what's going on here, my friends.
Now, as I land this plane and bring this episode to a close, I don't have any tidy answers. I don't have a three-step process for you to follow. Maybe there aren't any answers, at least not right now. But moments like this one, the AWS outage that rippled across a third of the Internet on a Monday morning, bringing business services, entertainment, airplanes, and banks to their knees, well, moments like this one remind us that we're living inside of systems that are astonishingly powerful and deeply fragile at the same time. The engineers who keep these systems running are some of the smartest people on the planet, and yet even they can't see every interaction or anticipate every cascade or every place where something could go wrong. The complexity is just too great and the coupling is just too tight. And that makes me wonder: how big is too big? At what point does scale stop serving us and start working against us?
And maybe the real question isn't even about cloud computing at all. Maybe it's about us: how dependent we've become on systems that we barely understand, and how comfortable we are outsourcing not just our data but our trust. So as we log back in and sync up our files and go about our day, it's worth pausing to ask: what kind of world are we building, when even a brief hiccup in a server farm can slow down the heartbeat of the planet? I don't have the answers, but I think these are questions worth asking. Until next week, my friends. Be well, and I hope your online interactions go swimmingly.