Better description of what happened

Tue Oct 5 22:18:54 UTC 2021

On 10/5/21 3:09 PM, Andy Brezinsky wrote:
>
> It's a few years old, but Facebook has talked a little bit about their 
> DNS infrastructure before.  Here's a little clip that talks about 
> Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073
>
> From their outage report, it sounds like their authoritative DNS 
> servers withdraw their anycast announcements when they're unhealthy.  
> The health check from those servers must have relied on something 
> upstream.  Maybe they couldn't talk to Cartographer for a few minutes 
> so they thought they might be isolated from the rest of the network 
> and they decided to withdraw their routes instead of serving stale 
> data.  Makes sense when a single node does it, not so much when the 
> entire fleet thinks that they're out on their own.
>
> A performance issue in Cartographer (or whatever manages this fleet 
> these days) could have been the ticking time bomb that set the whole 
> thing in motion.
>
Rereading it, they say that their internal (?) backbone went down, so 
pulling the routes was arguably the right thing to do, or at least not 
flat-out wrong. Taking out their nameserver subnets was clearly a 
problem, though a fix is probably tricky, since you still want to be 
able to take down errant nameservers too.
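
To make the tradeoff concrete, here is a rough sketch of one possible 
withdrawal rule. It is purely illustrative and not anything Facebook has 
described: HealthSample, should_announce and the last_resort flag are 
all made-up names. The assumption is that a small subset of nodes is 
flagged to keep announcing when only the upstream check fails, while any 
node with a locally broken nameserver still withdraws.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    """Hypothetical health-check result for one anycast DNS node."""
    local_dns_ok: bool  # the node's own nameserver answers test queries
    upstream_ok: bool   # the node can reach the backbone / control plane


def should_announce(sample: HealthSample, last_resort: bool) -> bool:
    """Decide whether this node keeps announcing its anycast prefix."""
    if not sample.local_dns_ok:
        # An errant nameserver is always taken out, flagged or not.
        return False
    if sample.upstream_ok:
        return True
    # Upstream is unreachable. Ordinary nodes withdraw; nodes flagged as
    # last_resort keep serving (possibly stale) data so that the whole
    # fleet never withdraws at the same moment.
    return last_resort


# Backbone down everywhere: every node sees upstream_ok=False.
fleet = [
    ("node-a", HealthSample(local_dns_ok=True, upstream_ok=False), False),
    ("node-b", HealthSample(local_dns_ok=True, upstream_ok=False), True),
    ("node-c", HealthSample(local_dns_ok=False, upstream_ok=False), True),
]
still_announcing = [name for name, s, lr in fleet if should_announce(s, lr)]
print(still_announcing)
```

With the backbone down for all three nodes, only node-b keeps 
announcing: node-a is an ordinary node and withdraws, and node-c is 
flagged but has a broken local nameserver. Whether serving stale 
answers from a few nodes is better than serving none is exactly the 
judgment call.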


Mike

