Better description of what happened

Andy Brezinsky andy at mbrez.com
Tue Oct 5 22:09:43 UTC 2021


It's a few years old, but Facebook has talked a little bit about their 
DNS infrastructure before.  Here's a little clip that talks about 
Cartographer: https://youtu.be/bxhYNfFeVF4?t=2073

From their outage report, it sounds like their authoritative DNS 
servers withdraw their anycast announcements when they're unhealthy.  
The health check from those servers must have relied on something 
upstream.  Maybe they couldn't talk to Cartographer for a few minutes so 
they thought they might be isolated from the rest of the network and 
they decided to withdraw their routes instead of serving stale data.  
Makes sense when a single node does it, not so much when the entire 
fleet thinks that they're out on their own.
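
To make that failure mode concrete, here is a minimal sketch in Python. 
It is only an illustration of the reasoning above, not Facebook's actual 
implementation: the DnsNode class, the upstream_ok flag, and both 
functions are names I made up. Each node applies the same local rule 
(withdraw the anycast announcement if upstream looks unreachable), and 
the sketch shows what that rule does with one bad node versus a shared 
upstream failure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnsNode:
    # All names here are hypothetical, for illustration only.
    name: str
    upstream_ok: bool        # result of the node's own upstream health check
    announcing: bool = True  # whether the node announces its anycast prefix


def apply_health_check(node: DnsNode) -> None:
    """Per-node rule: withdraw the anycast route when upstream looks unreachable."""
    node.announcing = node.upstream_ok


def announcing_nodes(fleet: List[DnsNode]) -> List[str]:
    """Apply the per-node rule across the fleet and return who still announces."""
    for node in fleet:
        apply_health_check(node)
    return [node.name for node in fleet if node.announcing]


# One isolated node: it withdraws, and anycast shifts traffic to the others.
one_down = [DnsNode("a", True), DnsNode("b", False), DnsNode("c", True)]
print(announcing_nodes(one_down))  # ['a', 'c']

# The shared upstream dependency fails: every node withdraws at once.
all_down = [DnsNode(n, False) for n in ("a", "b", "c")]
print(announcing_nodes(all_down))  # []
```

Nothing in the per-node rule knows how many other nodes are still 
announcing, so a correlated health check failure empties the whole 
anycast set.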

A performance issue in Cartographer (or whatever manages this fleet 
these days) could have been the ticking time bomb that set the whole 
thing in motion.

On 10/5/21 3:39 PM, Michael Thomas wrote:
>
> This bit posted by Randy might get lost in the other thread, but it 
> appears that their DNS servers withdraw BGP routes for prefixes that 
> they can't reach or that are flaky. Apparently that goes for the 
> prefixes that the name servers are on too. This caused internal 
> outages as well, since it seems they use their front-facing DNS just 
> like everybody else.
>
> Sounds like they might consider having at least one split-horizon 
> server internally. Lots of fodder here.
>
> Mike
>
> On 10/5/21 11:11 AM, Randy Monroe wrote:
>> Updated: 
>> https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
>>
>> On Tue, Oct 5, 2021 at 1:26 PM Michael Thomas <mike at mtcc.com> wrote:
>>
>>     On 10/5/21 12:17 AM, Carsten Bormann wrote:
>>     > On 5. Oct 2021, at 07:42, William Herrin <bill at herrin.us> wrote:
>>     >> On Mon, Oct 4, 2021 at 6:15 PM Michael Thomas <mike at mtcc.com> wrote:
>>     >>> They have a monkey patch subsystem. Lol.
>>     >> Yes, actually, they do. They use Chef extensively to configure
>>     >> operating systems. Chef is written in Ruby. Ruby has something
>>     >> called Monkey Patches.
>>     > While Ruby indeed has a chain-saw (read: powerful, dangerous,
>>     > still the tool of choice in certain cases) in its toolkit that is
>>     > generally called “monkey-patching”, I think Michael was actually
>>     > thinking about the “chaos monkey”,
>>     > https://en.wikipedia.org/wiki/Chaos_engineering#Chaos_Monkey
>>     > https://netflix.github.io/chaosmonkey/
>>
>>     No, chaos monkey is a purposeful thing to induce corner case
>>     errors so they can be fixed. The earlier outage involved a config
>>     sanitizer that screwed up and then pushed it out. I can't get my
>>     head around why anybody thought that was a good idea vs rejecting
>>     it and making somebody fix the config.
>>
>>     Mike
>>
>> -- 
>>
>> Randy Monroe
>>
>> Network Engineering
>>
>> Uber <https://uber.com/>
>>