Global Akamai Outage
jared at puck.nether.net
Sun Jul 25 15:12:43 UTC 2021
Work hat is not on, but context is included from prior workplaces etc.
> On Jul 25, 2021, at 2:22 AM, Saku Ytti <saku at ytti.fi> wrote:
> It doesn't seem like a tenable solution, when the solution is 'do
> better', since I'm sure whoever did those checks did their best in the
> first place. So we must assume we have some fundamental limits what
> 'do better' can achieve, we have to assume we have similar level of
> outage potential in all work we've produced and continue to produce
> for which we exert very little control over.
I have seen a very strong culture around risk and risk avoidance whenever possible at Akamai. Even some minor changes are taken very seriously.
I appreciate that on a daily basis, and when mistakes are made (I am human, after all), reviews of those mistakes and corrective steps are planned and followed up on. I'm sure this time will not be different.
I also get how easy it is to be cynical about these issues. There's always someone with power who can break things, but those same people can often fix them just as fast.
Focus on how you can make a transactional routing change and roll it back, how you can test it, etc.
This is why, for years, I told one vendor whose system used a line-by-line parser that it was too unsafe for operation.
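To illustrate the distinction (a minimal sketch with hypothetical names, not any vendor's actual implementation): a transactional system stages the whole change, validates it as a unit, and commits it atomically, so a bad change leaves the running configuration untouched. A line-by-line parser, by contrast, applies each statement as it is read, so a failure partway through leaves the device in a half-changed state.

```python
# Sketch of a transactional config change with validate-then-commit.
# All class/field names here are illustrative assumptions.

class ConfigStore:
    def __init__(self, running):
        self.running = dict(running)   # active configuration
        self.candidate = None          # staged change, not yet applied

    def stage(self, changes):
        """Build a candidate config without touching the running one."""
        self.candidate = {**self.running, **changes}

    def validate(self):
        """Whole-config check. A real device would verify syntax,
        references, and policy; this hypothetical rule just requires
        every route to name a next-hop."""
        return all(v.get("next_hop") for v in self.candidate.values())

    def commit(self):
        """Apply atomically: either the entire candidate becomes the
        running config, or nothing changes at all."""
        if self.candidate is None or not self.validate():
            self.candidate = None      # discard; nothing was applied
            return False
        self.running, self.candidate = self.candidate, None
        return True


store = ConfigStore({"10.0.0.0/24": {"next_hop": "192.0.2.1"}})
store.stage({"198.51.100.0/24": {"next_hop": None}})  # a broken change
assert not store.commit()                      # rejected as a unit
assert "198.51.100.0/24" not in store.running  # running config untouched
```

Real platforms take this further with confirmed commits that roll back automatically unless the operator re-confirms within a timeout, which also protects against changes that cut off your own management access.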
There are also other questions, like:
How can we improve response times when things are routed poorly? Time to mitigate hijacks has improved with the majority of providers doing RPKI Origin Validation, but inter-provider response times are on much longer scales. I also think about the two big CTL long-haul and routing issues last year. How can you mitigate these externalities?