Better description of what happened

Hugo Slabbert hugo at slabnet.com
Tue Oct 5 22:28:26 UTC 2021


Had some chats with other folks:
Arguably you could change the nameserver isolation check failure action to
be "depref your exports" rather than "yank it all".  Basically, set up
tiers: the boxes that pass those additional health checks, and so should
have correct entries, are your primary destination, while failing nodes
are depref'd in your internal routing and normally receive no query
traffic.  But if all nodes fail that check simultaneously, the nodes
failing the isolation check would attract traffic again, as no better
paths remain.  Better to serve stale data than none at all; CAP theorem
trade-offs at work?
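
A rough sketch of that tiering, to make the idea concrete.  Everything in
it is hypothetical: the Node fields, the preference values, and the
best_paths() helper are made up for illustration and are not taken from
Facebook's setup or any real routing daemon.  It only models the decision
"withdraw vs. depref vs. announce normally" and which nodes would then
attract traffic.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical preference values; higher wins, as with BGP local-pref.
PREF_HEALTHY = 200
PREF_DEPREF = 50


@dataclass
class Node:
    name: str
    serving: bool        # local DNS daemon is answering queries
    isolation_ok: bool   # node can reach its backend / control plane


def export_pref(node: Node) -> Optional[int]:
    """Return the preference to announce with, or None to withdraw."""
    if not node.serving:
        return None            # genuinely broken: yank the route
    if not node.isolation_ok:
        return PREF_DEPREF     # possibly stale: keep as a last resort
    return PREF_HEALTHY


def best_paths(nodes: List[Node]) -> List[str]:
    """Names of the nodes that would attract traffic (highest pref wins)."""
    announced = {}
    for n in nodes:
        pref = export_pref(n)
        if pref is not None:
            announced[n.name] = pref
    if not announced:
        return []
    top = max(announced.values())
    return sorted(name for name, pref in announced.items() if pref == top)
```

With one node failing the isolation check, only the healthy node is
selected.  With every node failing it, all of them are selected again at
the lower preference, rather than the prefix disappearing entirely.  A
node whose DNS daemon is actually down is still withdrawn outright.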

-- 
Hugo Slabbert


On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas <mike at mtcc.com> wrote:

>
> On 10/5/21 3:09 PM, Andy Brezinsky wrote:
>
> It's a few years old, but Facebook has talked a little bit about their DNS
> infrastructure before.  Here's a little clip that talks about Cartographer:
> https://youtu.be/bxhYNfFeVF4?t=2073
>
> From their outage report, it sounds like their authoritative DNS servers
> withdraw their anycast announcements when they're unhealthy.  The health
> check from those servers must have relied on something upstream.  Maybe
> they couldn't talk to Cartographer for a few minutes so they thought they
> might be isolated from the rest of the network and they decided to withdraw
> their routes instead of serving stale data.  Makes sense when a single node
> does it, not so much when the entire fleet thinks that they're out on their
> own.
>
> A performance issue in Cartographer (or whatever manages this fleet these
> days) could have been the ticking time bomb that set the whole thing in
> motion.
>
> Rereading it, it is said that their internal (?) backbone went down, so
> pulling the routes was arguably the right thing to do. Or at least not
> flat out wrong. Taking out their nameserver subnets was clearly a problem,
> though a fix is probably tricky, since you clearly want to take down
> errant nameservers too.
>
>
> Mike
>
>
>
>
>