Global Akamai Outage

Hank Nussbacher hank at interall.co.il
Sat Jul 24 18:42:28 UTC 2021


On 23/07/2021 09:24, Hank Nussbacher wrote:

From Akamai. This is how companies and vendors should report outages:

[07:35 UTC on July 24, 2021] Update:

Root Cause:

This configuration directive was sent as part of preparation for 
independent load balancing control of a forthcoming product. Updates to 
the configuration directive for this load balancing component have 
routinely been made on approximately a weekly basis. (Further changes to 
this configuration channel have been blocked until additional safety 
measures have been implemented, as noted in Corrective and Preventive 
Actions.)

The load balancing configuration directive included a formatting error. 
As a safety measure, the load balancing component disregarded the 
improper configuration and fell back to a minimal configuration. In this 
minimal state, based on a VIP-only configuration, it did not support 
load balancing for Enhanced TLS slots greater than 6145.
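
As an illustration only (this is not Akamai's code), the failure mode 
described above amounts to a parser that treats any malformed directive 
as absent and silently substitutes a bare, VIP-only default, so the 
error surfaces as missing load balancing data rather than as a rejected 
deployment. A minimal Python sketch, assuming a hypothetical JSON 
directive format:

    import json

    # Hypothetical fallback: any parse error yields a minimal, VIP-only
    # configuration instead of rejecting the update outright.
    MINIMAL_VIP_ONLY_CONFIG = {"mode": "vip-only", "slots": {}}

    def load_lb_directive(raw: str) -> dict:
        try:
            return json.loads(raw)  # well-formed directive
        except json.JSONDecodeError:
            # Silent fallback: Enhanced TLS slots above 6145 end up with
            # no load balancing data at all.
            return MINIMAL_VIP_ONLY_CONFIG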

The missing load balancing data meant that the Akamai authoritative DNS 
system for the akamaiedge.net zone would not receive any directive for 
how to respond to DNS queries for many Enhanced TLS slots. The 
authoritative DNS system responds with SERVFAIL when it has no 
directive, so that during localized failures resolvers will retry an 
alternate authority.
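
For readers less familiar with DNS rcodes: SERVFAIL is the one answer a 
resolver treats as retryable, which is the behaviour the design above 
relies on. A small sketch using the dnspython library (the server 
addresses are TEST-NET placeholders, not real Akamai authorities):

    import dns.exception
    import dns.message
    import dns.query
    import dns.rcode

    # Placeholder authoritative addresses, for illustration only.
    AUTHORITIES = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]

    def query_with_retry(qname: str, rdtype: str = "A"):
        """Treat SERVFAIL from one authority as non-final and move on."""
        query = dns.message.make_query(qname, rdtype)
        for server in AUTHORITIES:
            try:
                response = dns.query.udp(query, server, timeout=2.0)
            except dns.exception.Timeout:
                continue  # unreachable authority: try the next one
            if response.rcode() == dns.rcode.SERVFAIL:
                continue  # localized failure: retry an alternate authority
            return response  # NOERROR, NXDOMAIN, etc. are final answers
        raise RuntimeError("all authorities timed out or returned SERVFAIL")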

The zoning process used for deploying configuration changes to the 
network includes an alert check for potential issues caused by the 
configuration changes. The zoning process did result in alerts during 
the deployment. However, due to how the particular safety check was 
configured, the alerts for this load balancing component did not prevent 
the configuration from continuing to propagate, and did not result in 
escalation to engineering subject-matter experts (SMEs). The input 
safety check on the load balancing component also did not automatically 
roll back the change upon detecting the error.
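
A minimal sketch, assuming a hypothetical validator and escalation hook 
(none of these names are Akamai's), of the kind of input safety check 
the report says was missing: a malformed or incomplete directive should 
block propagation, escalate to SMEs, and roll back automatically rather 
than fall back silently:

    def deploy_lb_config(new_config: dict, previous_config: dict) -> dict:
        errors = validate_directive(new_config)
        if errors:
            escalate_to_smes(errors)  # page engineering, don't just log
            return previous_config    # automatic rollback; change blocked
        return new_config

    def validate_directive(config: dict) -> list:
        """Reject directives that would leave Enhanced TLS slots unmapped."""
        errors = []
        slots = config.get("slots") or {}
        if not slots:
            errors.append("no slot data in directive")
        elif max(map(int, slots)) <= 6145:
            errors.append("no load balancing data for slots above 6145")
        return errors

    def escalate_to_smes(errors: list) -> None:
        # Placeholder for the alerting/escalation path described above.
        print("escalating to SMEs:", errors)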

Contributing Factors:

     The internal alerting which was specific to the load balancing 
component did not result in blocking the configuration from propagating 
to the network, and did not result in an escalation to the SMEs for the 
component.
     The alert and associated procedure indicating widespread SERVFAILs 
potentially due to issues with mapping systems did not lead to an 
appropriately urgent and timely response.
     The internal alerting which fired and was escalated to SMEs was for 
a separate component which uses the load balancing data. This internal 
alerting initially fired for the Edge DNS system rather than the mapping 
system, which delayed troubleshooting potential issues with the mapping 
system and the load balancing component which had the configuration 
change. Subsequent internal alerts more clearly indicated an issue with 
the mapping system.
     The impact to the Enhanced TLS service affected Akamai staff access 
to internal tools and websites, which delayed escalation of alerts, 
troubleshooting, and especially initiation of the incident process.

Short Term

Completed:

     Akamai completed rolling back the configuration change at 16:44 UTC 
on July 22, 2021.
     Blocked any further changes to the involved configuration channel.
     Other related channels are being reviewed and may be subject to a 
similar block as reviews take place. Channels will be unblocked after 
additional safety measures are assessed and implemented where needed.

In Progress:

     Validate and strengthen the safety checks for the configuration 
deployment zoning process.
     Increase the sensitivity and priority of alerting for high rates of 
SERVFAILs (a rough sketch of such a check follows this list).
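
A rough, hypothetical sketch of what such a SERVFAIL-rate check might 
look like; the window size and threshold are invented for illustration, 
not Akamai's values:

    from collections import deque

    WINDOW = 10_000       # most recent responses considered
    ALERT_RATIO = 0.05    # hypothetical threshold: 5% SERVFAILs

    class ServfailRateMonitor:
        def __init__(self):
            self.recent = deque(maxlen=WINDOW)

        def record(self, rcode: str) -> None:
            self.recent.append(rcode == "SERVFAIL")
            if len(self.recent) == WINDOW and self.ratio() > ALERT_RATIO:
                self.page()

        def ratio(self) -> float:
            return sum(self.recent) / len(self.recent)

        def page(self) -> None:
            # Placeholder; a real system would open a high-priority incident.
            print(f"ALERT: SERVFAIL ratio {self.ratio():.1%} over threshold")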

Long Term

In Progress:

     Reviewing and improving input safety checks for mapping components.
     Auditing critical systems to identify gaps in monitoring and 
alerting, then closing unacceptable gaps.



> On 22/07/2021 19:34, Mark Tinka wrote:
>> https://edgedns.status.akamai.com/
>>
>> Mark.
>
>
> [18:30 UTC on July 22, 2021] Update:
>
> Akamai experienced a disruption with our DNS service on July 22, 2021. 
> The disruption began at 15:45 UTC and lasted for approximately one 
> hour. Affected customer sites were significantly impacted for 
> connections that were not established before the incident began.
>
> Our teams identified that a change made in a mapping component was 
> causing the issue, and in order to mitigate it we rolled the change 
> back at approximately 16:44 UTC. We can confirm this was not a 
> cyberattack against Akamai's platform. Immediately following the 
> rollback, the platform stabilized and DNS services resumed normal 
> operations. At this time the incident is resolved, and we are 
> monitoring to ensure that traffic remains stable.



