Global Akamai Outage

Saku Ytti saku at ytti.fi
Sun Jul 25 06:18:24 UTC 2021


Hey,

Not a critique of Akamai specifically; it applies just the same to
me. Everything seems so complex and fragile.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

It doesn't seem like a tenable solution when the solution is 'do
better', since I'm sure whoever did those checks did their best in
the first place. So we must assume there are fundamental limits to
what 'do better' can achieve, and we have to assume a similar level
of outage potential in all the work we've produced and continue to
produce, over which we exert very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better' ones. However, Akamai already resolved this very
fast, and it may not be reasonable to expect big improvements on
roughly one hour from start of fault to resolution for a big
organisation with a complex product.

One thing that comes to mind: what if Akamai assumes they cannot
reasonably make it fail less often and cannot fix it faster? Is this
particular product/solution such that having entirely independent A
and B sides, which clients fail over between, is not an option? Since
it was a DNS problem, it seems like it might have been possible to
have A fail entirely and clients automatically revert to B, perhaps
adding some latency but also allowing the system to automatically
detect that A and B are performing at an unacceptable delta.
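
To make the A+B idea concrete, here is a minimal sketch of what I
mean, not Akamai's actual design; resolve_via() stands in for a real
lookup against one side's authorities, and the threshold is made up:

import time

UNACCEPTABLE_DELTA = 0.200  # seconds; hypothetical threshold

def resolve_via(side: str, name: str) -> str:
    """Placeholder for a real DNS lookup against side 'A' or 'B'.
    Raises on SERVFAIL or timeout."""
    raise NotImplementedError

def timed(side: str, name: str):
    """Run one lookup, return (answer_or_None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        return resolve_via(side, name), time.monotonic() - start
    except Exception:
        return None, time.monotonic() - start

def resolve(name: str):
    answer, t_a = timed("A", name)
    if answer is not None:
        return answer, "A"
    # Side A failed entirely; the client reverts to the independent B.
    answer, t_b = timed("B", name)
    if answer is not None and abs(t_b - t_a) > UNACCEPTABLE_DELTA:
        # A and B are performing at an unacceptable delta: surface it
        # so the failure of A is detected automatically.
        print("ALERT: A/B delta %.3fs for %s" % (abs(t_b - t_a), name))
    return answer, "B"

The B lookup costs some latency, but both the failover and the
detection happen without anyone paging anyone.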

Did some of their affected customers recover faster than Akamai did,
through their own actions, automated or manual?
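
For the automated case I imagine something roughly like this on the
customer side; check_https() uses only the standard library, while
set_cname() is a stand-in for whatever API the customer's DNS
provider offers, and the hostnames are purely illustrative:

import socket
import ssl
import time

SITE = "www.example.com"       # customer hostname, CNAMEd into the CDN
BACKUP = "origin.example.com"  # direct-to-origin fallback target
FAILURES_BEFORE_SWITCH = 3

def check_https(host: str, timeout: float = 3.0) -> bool:
    """True if name resolution and a TLS handshake to host:443 work."""
    try:
        with socket.create_connection((host, 443), timeout=timeout) as sock:
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host):
                return True
    except OSError:
        return False

def set_cname(name: str, target: str) -> None:
    """Placeholder for the DNS provider's API call."""
    print("would point %s at %s" % (name, target))

def monitor() -> None:
    failures = 0
    while failures < FAILURES_BEFORE_SWITCH:
        failures = 0 if check_https(SITE) else failures + 1
        time.sleep(10)  # probe interval
    set_cname(SITE, BACKUP)

Whether that actually beats a 1h MTTR depends on how aggressive the
thresholds are and on the false-positive cost of sending traffic
straight to the origin.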

On Sat, 24 Jul 2021 at 21:46, Hank Nussbacher <hank at interall.co.il> wrote:
>
> On 23/07/2021 09:24, Hank Nussbacher wrote:
>
>  From Akamai.  How companies and vendors should report outages:
>
> [07:35 UTC on July 24, 2021] Update:
>
> Root Cause:
>
> This configuration directive was sent as part of preparation for
> independent load balancing control of a forthcoming product. Updates to
> the configuration directive for this load balancing component have
> routinely been made on approximately a weekly basis. (Further changes to
> this configuration channel have been blocked until additional safety
> measures have been implemented, as noted in Corrective and Preventive
> Actions.)
>
> The load balancing configuration directive included a formatting error.
> As a safety measure, the load balancing component disregarded the
> improper configuration and fell back to a minimal configuration. In this
> minimal state, based on a VIP-only configuration, it did not support
> load balancing for Enhanced TLS slots greater than 6145.
>
> The missing load balancing data meant that the Akamai authoritative DNS
> system for the akamaiedge.net zone would not receive any directive for
> how to respond to DNS queries for many Enhanced TLS slots. The
> authoritative DNS system will respond with a SERVFAIL when there is no
> directive, as during localized failures resolvers will retry an
> alternate authority.
>
> The zoning process used for deploying configuration changes to the
> network includes an alert check for potential issues caused by the
> configuration changes. The zoning process did result in alerts during
> the deployment. However, due to how the particular safety check was
> configured, the alerts for this load balancing component did not prevent
> the configuration from continuing to propagate, and did not result in
> escalation to engineering SMEs. The input safety check on the load
> balancing component also did not automatically roll back the change upon
> detecting the error.
>
> Contributing Factors:
>
>      The internal alerting which was specific to the load balancing
> component did not result in blocking the configuration from propagating
> to the network, and did not result in an escalation to the SMEs for the
> component.
>      The alert and associated procedure indicating widespread SERVFAILs
> potentially due to issues with mapping systems did not lead to an
> appropriately urgent and timely response.
>      The internal alerting which fired and was escalated to SMEs was for
> a separate component which uses the load balancing data. This internal
> alerting initially fired for the Edge DNS system rather than the mapping
> system, which delayed troubleshooting potential issues with the mapping
> system and the load balancing component which had the configuration
> change. Subsequent internal alerts more clearly indicated an issue with
> the mapping system.
>      The impact to the Enhanced TLS service affected Akamai staff access
> to internal tools and websites, which delayed escalation of alerts,
> troubleshooting, and especially initiation of the incident process.
>
> Short Term
>
> Completed:
>
>      Akamai completed rolling back the configuration change at 16:44 UTC
> on July 22, 2021.
>      Blocked any further changes to the involved configuration channel.
>      Other related channels are being reviewed and may be subject to a
> similar block as reviews take place. Channels will be unblocked after
> additional safety measures are assessed and implemented where needed.
>
> In Progress:
>
>      Validate and strengthen the safety checks for the configuration
> deployment zoning process
>      Increase the sensitivity and priority of alerting for high rates of
> SERVFAILs.
>
> Long Term
>
> In Progress:
>
>      Reviewing and improving input safety checks for mapping components.
>      Auditing critical systems to identify gaps in monitoring and
> alerting, then closing unacceptable gaps.
>
>
>
> > On 22/07/2021 19:34, Mark Tinka wrote:
> >> https://edgedns.status.akamai.com/
> >>
> >> Mark.
> >
> >
> > [18:30 UTC on July 22, 2021] Update:
> >
> > Akamai experienced a disruption with our DNS service on July 22, 2021.
> > The disruption began at 15:45 UTC and lasted for approximately one
> > hour. Affected customer sites were significantly impacted for
> > connections that were not established before the incident began.
> >
> > Our teams identified that a change made in a mapping component was
> > causing the issue, and in order to mitigate it we rolled the change
> > back at approximately 16:44 UTC. We can confirm this was not a
> > cyberattack against Akamai's platform. Immediately following the
> > rollback, the platform stabilized and DNS services resumed normal
> > operations. At this time the incident is resolved, and we are
> > monitoring to ensure that traffic remains stable.
>
>


-- 
  ++ytti

