Global Akamai Outage

Mark Tinka mark at tinka.africa
Sun Jul 25 14:37:37 UTC 2021


On 7/25/21 08:18, Saku Ytti wrote:

> Hey,
>
> Not a critique against Akamai specifically, it applies just the same
> to me. Everything seems so complex and fragile.
>
> Very often the corrective and preventive actions appear to be
> different versions and wordings of 'don't make mistakes', in this case:
>
> - Reviewing and improving input safety checks for mapping components
> - Validate and strengthen the safety checks for the configuration
> deployment zoning process
>
> It doesn't seem like a tenable solution, when the solution is 'do
> better', since I'm sure whoever did those checks did their best in the
> first place. So we must assume we have some fundamental limits what
> 'do better' can achieve, we have to assume we have similar level of
> outage potential in all work we've produced and continue to produce
> for which we exert very little control over.
>
> I think the mean-time-to-repair actions described are more actionable
> than the 'do better'.  However Akamai already solved this very fast
> and may not be very reasonable to expect big improvements to a 1h
> start of fault to solution for a big organisation with a complex
> product.
>
> One thing that comes to mind is, what if Akamai assumes they cannot
> reasonably make it fail less often and they can't fix it faster. Is
> this particular product/solution such that the possibility of having
> entirely independent A+B sides, for which clients fail over is not
> available? If it was a DNS problem, it seems like it might have been
> possible to have entirely failed A, and clients automatically
> reverting to B, perhaps adding some latencies but also allowing the
> system to automatically detect that A and B are performing at an
> unacceptable delta.
>
> Did some of their affected customers recover faster than Akamai due to
> their own actions automated or manual?

Can we learn something from how the airline industry has incrementally 
improved safety through decades of incidents?

"Doing better" is the lowest hanging fruit any network operator can 
strive for. Unlike airlines, the Internet community - despite being 
built on standards - is quite diverse in how we choose to operate our 
own islands. So "doing better", while a universal goal, means different 
things to different operators. This is why we would likely see different 
RFOs and remedial recommendations from different operators for the 
"same kind of" outage.

In most cases, continuing to "do better" may be the most appealing
prospect, because anything beyond that will require significantly more
funding, in an industry where most operators are generally threading the
P&L needle.

Mark.

