Five Whys and the Myth of the Single Root Cause

I've sat through dozens of Five Whys sessions over the years. The technique is simple: something goes wrong, you ask "why" five times, and you arrive at the root cause. It's taught in onboarding decks and engineering retrospectives like it's a fundamental law of incident response. For straightforward problems, it works fine.

But I've been through enough outages to see where it falls apart. Two recent ones illustrate the problems well.

Both started the same way. A container in a Kubernetes deployment exceeded its memory limit and was killed by the kernel. Kubernetes restarted the container, the new one hit the same limit, and it crashed again. Clients were frustrated, internal stakeholders were annoyed, and SLAs were broken. In both cases, the Five Whys chain landed on the same answer: the container ran out of memory.

That's technically correct, but it's not useful.

The first outage was caused by a single user somehow sending thousands of requests per second. The request payload was growing with each one - like someone holding down a key. But the system had a debounce on keyup specifically to handle that scenario. We dug through access logs, checked client-side code, and never found a satisfying explanation for how it happened. The Five Whys session hit a wall. "Why did the container run out of memory?" Because of thousands of requests per second. "Why were there thousands of requests per second?" We don't know. There's no sixth why that gets past "we don't know."

The second outage was different. The Five Whys session actually went further - we identified several contributing factors. Hundreds of concurrent requests were each generating dozens of database queries and creating large audit records. But no single one of those factors was enough to take down the system on its own. It was the combination of all of them happening at the same time that caused the failure. Five Whys wants a chain - A caused B caused C caused D. This was more like a pile of things that are individually fine until they all happen at once.
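To make "individually fine, jointly fatal" concrete, here is a rough sketch with made-up numbers. Nothing in it comes from our actual incident: the memory limit, the baseline, the per-query cost, and the normal and elevated values are all hypothetical, picked only to show the shape of the problem. It tries every combination of elevated factors and reports which ones would push a container past its limit.

```python
from itertools import combinations

# All numbers below are hypothetical, chosen only for illustration.
MEMORY_LIMIT_MB = 512
BASELINE_MB = 200          # memory the process uses when idle
RESULT_KB_PER_QUERY = 20   # assumed memory held per database query result

NORMAL = {"concurrent_requests": 20, "queries_per_request": 5, "audit_record_kb": 50}
ELEVATED = {"concurrent_requests": 300, "queries_per_request": 40, "audit_record_kb": 500}


def estimated_memory_mb(concurrent_requests, queries_per_request, audit_record_kb):
    """Crude model: every in-flight request holds its query results and audit record."""
    per_request_kb = queries_per_request * RESULT_KB_PER_QUERY + audit_record_kb
    return BASELINE_MB + concurrent_requests * per_request_kb / 1024


def breaching_combinations():
    """Return each set of elevated factors that pushes memory over the limit."""
    breaches = []
    names = list(NORMAL)
    for size in range(len(names) + 1):
        for elevated in combinations(names, size):
            # Start from normal load and raise only the chosen factors.
            load = {**NORMAL, **{name: ELEVATED[name] for name in elevated}}
            if estimated_memory_mb(**load) > MEMORY_LIMIT_MB:
                breaches.append(set(elevated))
    return breaches


if __name__ == "__main__":
    for combo in breaching_combinations():
        print(sorted(combo))
```

With these particular numbers, no single factor and no pair of factors crosses the limit. Only all three together do. There's no chain to follow here, just a set of conditions that all have to hold at once.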

That's the core problem with Five Whys, in my experience. It assumes failures have a single, linear causal path. Ask why enough times and you'll drill down to the one thing that went wrong. But real systems - especially distributed ones - don't fail that neatly. They fail because multiple things that are individually fine combine in ways nobody anticipated.

The technique has a few other issues I've noticed.

The number five is arbitrary. Sometimes you need two whys. Sometimes you need eight. Sometimes - like our first outage - you run out of whys entirely because you can't answer the next one.

It's easy to steer. Different people asking "why" will follow different causal chains and arrive at different root causes for the same incident. The path depends on who's in the room and what they think is important, which makes it feel more like a consensus exercise than an investigation.

It gravitates toward a single root cause. This is probably the biggest issue. When you're looking for "the" root cause, you stop when you find one that sounds plausible. There's also organizational pressure to get there quickly - management wants an answer, and Five Whys gives them one, even when the real picture is more complicated. But incidents in complex systems rarely have a single root cause. They have contributing factors - no single one is enough on its own, but combined they are. John Allspaw's "Each Necessary, But Only Jointly Sufficient" digs into this well.

What actually helped us after these outages wasn't asking more whys. It was looking at the problem from different angles. Access logs told us which endpoints were being hit before the crash and - just as importantly - which requests were hitting the pods as they tried to come back online, causing them to crash again. That's information the Five Whys framework doesn't naturally surface, because it's not asking "what was happening" - it's asking "why did this happen."
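The "what was happening" question is easy to ask of access logs once they're parsed. The sketch below is a minimal version of that idea, not the tooling we actually used. It assumes the log has already been parsed into (timestamp, path) pairs; the function name, the five-minute window, and the sample paths are all made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def endpoints_before(entries, event_time, window=timedelta(minutes=5)):
    """Count requests per endpoint in the window leading up to event_time.

    entries is an iterable of (timestamp, path) pairs. That shape is an
    assumption, so adapt the parsing to whatever your access logs look like.
    """
    start = event_time - window
    counts = Counter(path for ts, path in entries if start <= ts < event_time)
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical sample data.
    crash = datetime(2024, 1, 1, 12, 5)
    log = [
        (datetime(2024, 1, 1, 11, 50), "/reports"),
        (datetime(2024, 1, 1, 12, 0), "/reports"),
        (datetime(2024, 1, 1, 12, 3), "/search"),
        (datetime(2024, 1, 1, 12, 4), "/search"),
    ]
    print(endpoints_before(log, crash))
```

Running the same function against each restart time, rather than only the first crash, is what shows which requests were waiting for the pods as they came back.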

We invested in better observability and alerting so that the next time a container's memory starts climbing, we'll know about it before the container gets killed - not after. And we made architectural changes to address the contributing factors we'd identified - not because any one of them was the root cause, but because reducing the number of things that can go wrong simultaneously reduces the chance they'll all go wrong at once.
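As a rough illustration of "know before it gets killed," here is a sketch of the kind of check an alert could run. It isn't our real alerting rule. It assumes you already have memory samples as (minutes, megabytes) pairs from whatever metrics system you use, and the 80% threshold and ten-minute horizon are arbitrary placeholders.

```python
def minutes_until_limit(samples, limit_mb):
    """Project time until the limit from the first and last samples.

    samples is a non-empty list of (minutes, memory_mb) pairs in time order.
    Returns None when memory is flat or falling, or when there is no time span.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (m1 - m0) / (t1 - t0)  # MB per minute, a simple linear trend
    if rate <= 0:
        return None
    return max(0.0, (limit_mb - m1) / rate)


def should_alert(samples, limit_mb, usage_fraction=0.8, horizon_minutes=10):
    """Alert when usage is already high or is on track to hit the limit soon."""
    current_mb = samples[-1][1]
    if current_mb >= usage_fraction * limit_mb:
        return True
    eta = minutes_until_limit(samples, limit_mb)
    return eta is not None and eta <= horizon_minutes


if __name__ == "__main__":
    # Hypothetical samples: memory climbing 30 MB per minute toward a 512 MB limit.
    print(should_alert([(0, 200), (5, 350)], limit_mb=512))
```

A straight-line projection is crude, and real memory growth is rarely that tidy. The point is only that the alert fires on the trend, while there is still time to act, rather than on the kill.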

I'm not saying Five Whys is useless. For simple, linear problems - a deployment broke because a config value was wrong because someone copy-pasted from the wrong environment - it works fine. But for complex system failures, I'd treat it as one tool in a larger kit rather than the default. Look at what was happening, not just why. Map contributing factors instead of chasing a single chain. And if you hit a why you can't answer, that's worth paying attention to - it usually means the technique doesn't fit the problem.