Global Akamai Outage

Randy Bush randy at psg.com
Sun Jul 25 17:28:05 UTC 2021


> Very often the corrective and preventive actions appear to be
> different versions and wordings of 'don't make mistakes', in this case:
> 
> - Reviewing and improving input safety checks for mapping components
> - Validate and strengthen the safety checks for the configuration
> deployment zoning process
> 
> 'Do better' doesn't seem like a tenable solution, since I'm sure
> whoever wrote those checks did their best in the first place. So we
> must assume there are fundamental limits to what 'do better' can
> achieve, and that a similar level of outage potential exists in all
> the work we've produced and continue to produce, over which we exert
> very little control.
> 
> I think the mean-time-to-repair actions described are more actionable
> than the 'do better' ones.  However, Akamai already resolved this
> very quickly, and it may not be reasonable to expect big improvements
> on roughly one hour from start of fault to resolution for a large
> organisation with a complex product.
> 
> One thing that comes to mind: what if Akamai assumes they cannot
> reasonably make it fail less often and cannot fix it faster? Is this
> particular product/solution such that entirely independent A and B
> sides, to which clients fail over, are simply not available? Since it
> was a DNS problem, it seems it might have been possible to let side A
> fail entirely, with clients automatically falling back to B, perhaps
> adding some latency but also allowing the system to detect
> automatically that A and B are performing at an unacceptable delta.

formal verification

