If there’s one thing the last few years of global cloud outages have taught us, it’s that no matter how robust your architecture, how refined your processes, or how “resilient” your cloud provider claims to be, incidents will happen.
That has been all too evident in the last month, with three serious global outages hitting the mainstream news. Based on recent events and my own observations across Service Management and cloud operations, here are five realities organisations need to recognise.
1. CMDBs and Service Maps Matter — But They Are Not a Silver Bullet
CMDBs and service maps are invaluable. Understanding downstream dependencies helps us make faster, more informed decisions during incidents.
But in today’s world of multi-region, highly distributed, cloud-native architectures, expecting 100% accuracy in dependency mapping is unrealistic.
Take the recent Azure outage, for example. Even Microsoft’s own Post Incident Review (PIR) didn’t list all the Azure services that were affected. If the provider itself cannot map the full blast radius with perfect precision, it’s unreasonable to expect customer organisations to do so.
Why? Because nuance is everywhere:
- You may rely on an impacted service — but built-in failover mechanisms might mean you see no customer-facing issues.
- Conversely, you might have no visible dependency, yet still experience disruption due to a hidden chain of redirection, middleware, or third-party integrations.
The takeaway: aim for directionally accurate, not perfect. Perfect is impossible.
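To make the point concrete, here is a minimal sketch of the kind of blast-radius estimate a service map supports. The services and the graph are invented for illustration; the traversal can only ever see the dependencies someone has actually recorded.

```python
from collections import deque

# Hypothetical service map: each service lists what sits directly downstream of it.
# In reality this lives in a CMDB or discovery tool and is never fully complete.
downstream = {
    "cloud-identity": ["api-gateway", "sso-portal"],
    "api-gateway": ["orders-api", "payments-api"],
    "orders-api": ["customer-web"],
    "payments-api": ["customer-web"],
    "sso-portal": [],
    "customer-web": [],
}

def blast_radius(failed_service):
    """Breadth-first walk of the *recorded* downstream dependencies."""
    impacted, queue = set(), deque([failed_service])
    while queue:
        for dependant in downstream.get(queue.popleft(), []):
            if dependant not in impacted:
                impacted.add(dependant)
                queue.append(dependant)
    return impacted

# Everything reachable from the failed service is potentially impacted:
print(sorted(blast_radius("cloud-identity")))
# ['api-gateway', 'customer-web', 'orders-api', 'payments-api', 'sso-portal']
```

Note what that output is: a candidate list, not a verdict. Failover may absorb some of those hops, and hidden edges (the redirection, middleware and third-party integrations mentioned above) never appear in the map at all. That is exactly why directionally accurate is the realistic target.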
2. Be Intentional With the Workarounds You Communicate
Sometimes in our eagerness to help, we do more harm than good.
During a recent global outage, I saw the main status headline encouraging users to “try again” as services appeared to stabilise. Later, the PIR explicitly highlighted that these repeated retries triggered a cascade effect — ultimately prolonging the outage.
A well-intentioned workaround became a multiplier of pain.
Before pushing comms, ask:
- Does this actually help restoration?
- Or does it just increase load or noise?
- Is the workaround meaningful, or simply hopeful?
Not every action is useful. Sometimes, patience is the best support.
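For contrast, here is a minimal sketch (my own illustration, not anything from a provider’s PIR) of what a considered retry looks like: exponential backoff, jitter, and a point at which you stop and wait. Blanket “try again” advice, followed by thousands of clients at once, is the opposite of this.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a failing call politely: back off exponentially, add jitter, then stop.

    The alternative (immediate, unbounded retries from every client at once)
    is how a recovering service gets pushed straight back over the edge.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # stop adding load; surface the failure instead
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))  # jitter spreads out the retry herd

# Hypothetical usage: call_with_backoff(lambda: client.get("/orders"))
```

Even then, the questions above still apply: does retrying actually help restoration, or does it just add load?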
3. Prepare for Your Response — Because Outages Will Happen
This is the one certainty in technology: incidents are inevitable.
Your preparedness is the only variable you control.
That means having:
- Clear mobilisation procedures
- Calm, structured comms
- Practised crisis simulations
- Decision-making frameworks for ambiguous situations
- Leaders who understand their role in the room
You cannot choose if an outage happens. You can only choose how ready you are when it does.
4. Avoid the Knee-Jerk Reaction: “We Need to Leave This Provider”
It’s common after a major outage to hear:
“We should migrate away from [insert cloud provider].”
Yes, resilience and recoverability deserve scrutiny. Yes, architectural weaknesses should be addressed. But assuming the answer is simply to “put half our services on Cloud B” can be dangerously simplistic.
Multi-cloud resilience:
- Is expensive to build and maintain
- Often introduces more complexity, not less
- Can actually reduce resilience through additional dependencies
- Limits your ability to use the best cloud-native capabilities
- Rarely protects you from large-scale, internet-wide disruptions
Sometimes diversification helps. Sometimes it harms.
5. Get Crystal Clear on Business Impact
In any outage, the technical story is only half the story.
The real questions are:
- Which services are impacted?
- Which customers?
- Which regulatory commitments?
- Which deadlines or SLAs?
- Which reputational risks?
And ultimately, what does this mean for our business?
Without absolute clarity on impact, everything else — prioritisation, comms, mobilisation, recovery — is guesswork. If you don’t know the true business effect, you’re not managing the incident… you’re just reacting to noise.
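One practical way to force that clarity (a sketch of my own, not a standard template) is to capture impact as a structured record rather than scattered chat messages, so that prioritisation and comms draw on the same answers to the questions above.

```python
from dataclasses import dataclass, field

@dataclass
class ImpactAssessment:
    """Hypothetical record answering the business-impact questions for one affected service."""
    service: str
    customer_segments: list = field(default_factory=list)
    regulatory_commitments: list = field(default_factory=list)
    sla_deadlines: list = field(default_factory=list)
    reputational_risks: list = field(default_factory=list)

    def is_material(self):
        # Crude but explicit: any regulatory or SLA exposure makes this a
        # business problem, not just a technical one.
        return bool(self.regulatory_commitments or self.sla_deadlines)

payments = ImpactAssessment(
    service="payments-api",
    customer_segments=["retail checkout"],
    regulatory_commitments=["payment-availability reporting"],
    sla_deadlines=["99.9% monthly availability"],
)
print(payments.is_material())  # True
```

The fields are placeholders; the point is that every answer is written down once, in one place, and owned.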
Final Thoughts
This isn’t an exhaustive list. These are simply the themes most top-of-mind for me based on what I’ve observed across recent global outages.