If there’s one thing the last few years of global cloud outages have taught us, it’s that no matter how robust your architecture, how refined your processes, or how “resilient” your cloud provider claims to be, incidents will happen.
That has been all too evident in the last month, with three serious global outages hitting the mainstream news. Based on recent events and my own observations across Service Management and cloud operations, here are five realities organisations need to recognise.
1. CMDBs and Service Maps Matter — But They Are Not a Silver Bullet
CMDBs and service maps are invaluable. Understanding downstream dependencies helps us make faster, more informed decisions during incidents.
But in today’s world of multi-region, highly distributed, cloud-native architectures, expecting 100% accuracy in dependency mapping is unrealistic.
Take the recent Azure outage, for example. Even Microsoft’s own Post Incident Review (PIR) didn’t list all the Azure services that were affected. If the provider itself cannot map the full blast radius with perfect precision, it’s unreasonable to expect customer organisations to do so.
Why? Because nuance is everywhere:
- You may rely on an impacted service — but built-in failover mechanisms might mean you see no customer-facing issues.
- Conversely, you might have no visible dependency, yet still experience disruption due to a hidden chain of redirection, middleware, or third-party integrations.
The takeaway: aim for directionally accurate, not perfect. Perfect is impossible.
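To make the point concrete, here is a minimal sketch of the kind of blast-radius estimate a service map supports. The services and the graph are invented for illustration; the traversal can only ever see the dependencies someone has actually recorded.

```python
from collections import deque

# Hypothetical service map: each service lists what sits directly downstream of it.
# In reality this lives in a CMDB or discovery tool and is never fully complete.
downstream = {
    "cloud-identity": ["api-gateway", "sso-portal"],
    "api-gateway": ["orders-api", "payments-api"],
    "orders-api": ["customer-web"],
    "payments-api": ["customer-web"],
    "sso-portal": [],
    "customer-web": [],
}

def blast_radius(failed_service):
    """Breadth-first walk of the *recorded* downstream dependencies."""
    impacted, queue = set(), deque([failed_service])
    while queue:
        for dependant in downstream.get(queue.popleft(), []):
            if dependant not in impacted:
                impacted.add(dependant)
                queue.append(dependant)
    return impacted

# Everything reachable from the failed service is potentially impacted:
print(sorted(blast_radius("cloud-identity")))
# ['api-gateway', 'customer-web', 'orders-api', 'payments-api', 'sso-portal']
```

Note what that output is: a candidate list, not a verdict. Failover may absorb some of those hops, and hidden edges (the redirection, middleware and third-party integrations mentioned above) never appear in the map at all. That is exactly why directionally accurate is the realistic target.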
2. Be Intentional With the Workarounds You Communicate
Sometimes in our eagerness to help, we do more harm than good.
During a recent global outage, I saw the main status headline encouraging users to “try again” as services appeared to stabilise. Later, the PIR explicitly highlighted that these repeated retries triggered a cascade effect — ultimately prolonging the outage.
A well-intentioned workaround became a multiplier of pain.
Before pushing comms, ask:
- Does this actually help restoration?
- Or does it just increase load or noise?
- Is the workaround meaningful, or simply hopeful?
Not every action is useful. Sometimes, patience is the best support.
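For contrast, here is a minimal sketch (my own illustration, not anything from a provider’s PIR) of what a considered retry looks like: exponential backoff, jitter, and a point at which you stop and wait. Blanket “try again” advice, followed by thousands of clients at once, is the opposite of this.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call politely: back off exponentially, add jitter, then stop.

    The alternative (immediate, unbounded retries from every client at once)
    is how a recovering service gets pushed straight back over the edge.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # stop adding load; surface the failure instead
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # jitter spreads out the retry herd

# Hypothetical usage: call_with_backoff(lambda: client.get("/orders"))
```

Even then, the questions above still apply: does retrying actually help restoration, or does it just add load?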
3. Prepare for Your Response — Because Outages Will Happen
This is the one certainty in technology: incidents are inevitable.
Your preparedness is the only variable you control.
That means having:
- Clear mobilisation procedures
- Calm, structured comms
- Practised crisis simulations
- Decision-making frameworks for ambiguous situations
- Leaders who understand their role in the room
You cannot choose if an outage happens. You can only choose how ready you are when it does.
4. Avoid the Knee-Jerk Reaction: “We Need to Leave This Provider”
It’s common after a major outage to hear:
“We should migrate away from [insert cloud provider].”
Yes, resilience and recoverability deserve scrutiny. Yes, architectural weaknesses should be addressed. But assuming the answer is simply to “put half our services on Cloud B” can be dangerously simplistic.
Multi-cloud resilience:
- Is expensive to build and maintain
- Often introduces more complexity, not less
- Can actually reduce resilience through additional dependencies
- Limits your ability to use the best cloud-native capabilities
- Rarely protects you from large-scale, internet-wide disruptions
Sometimes diversification helps. Sometimes it harms.
5. Get Crystal Clear on Business Impact
In any outage, the technical story is only half the story.
The real questions are:
- Which services are impacted?
- Which customers?
- Which regulatory commitments?
- Which deadlines or SLAs?
- Which reputational risks?
And ultimately, what does this mean for our business?
Without absolute clarity on impact, everything else — prioritisation, comms, mobilisation, recovery — is guesswork. If you don’t know the true business effect, you’re not managing the incident… you’re just reacting to noise.
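One practical way to force that clarity (a sketch of my own, not a standard template) is to capture impact as a structured record rather than scattered chat messages, so that prioritisation and comms draw on the same answers to the questions above.

```python
from dataclasses import dataclass, field

@dataclass
class ImpactAssessment:
    """Hypothetical record answering the business-impact questions for one affected service."""
    service: str
    customer_segments: list = field(default_factory=list)
    regulatory_commitments: list = field(default_factory=list)
    sla_deadlines: list = field(default_factory=list)
    reputational_risks: list = field(default_factory=list)

    def is_material(self):
        # Crude but explicit: any regulatory or SLA exposure makes this a
        # business problem, not just a technical one.
        return bool(self.regulatory_commitments or self.sla_deadlines)

payments = ImpactAssessment(
    service="payments-api",
    customer_segments=["retail checkout"],
    regulatory_commitments=["payment-availability reporting"],
    sla_deadlines=["99.9% monthly availability"],
)
print(payments.is_material())  # True
```

The fields are placeholders; the point is that every answer is written down once, in one place, and owned.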
Final Thoughts
This isn’t an exhaustive list. These are simply the themes most top-of-mind for me based on what I’ve observed across recent global outages.