Facebook post-mortems...

Bjørn Mork bjorn at mork.no
Wed Oct 6 10:04:24 UTC 2021


Masataka Ohta <mohta at necom830.hpcl.titech.ac.jp> writes:
> Bjørn Mork wrote:
>
>> Removing all DNS servers at the same time is never a good idea, even in
>> the situation where you believe they are all failing.
>
> As I wrote:
>
> : That Facebook uses a very short expiration period for zone
> : data is a separate issue.
>
> that is a separate issue.

Sorry, I don't understand what you're getting at.  The TTL is not an
issue. An infinite TTL won't save you if all authoritative servers are
unreachable.  It will just make things worse in almost every other error
scenario.
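
To make the TTL point concrete, here is a minimal sketch of a caching
resolver.  Everything in it (ToyResolver, the "www.example." name, the
192.0.2.x addresses, passing None to mean "all authoritative servers
unreachable") is an assumption made up for illustration, not a model of
any real resolver or of Facebook's setup.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    address: str
    expires_at: float  # absolute time, in seconds


class ToyResolver:
    """Hypothetical caching resolver; not a real DNS implementation."""

    def __init__(self):
        self.cache = {}

    def resolve(self, name, now, authoritative):
        """Return an address, or None to represent a resolution failure.

        `authoritative` is a dict of name -> (address, ttl), or None when
        every authoritative server is unreachable (an assumption of this
        sketch).
        """
        entry = self.cache.get(name)
        if entry is not None and now < entry.expires_at:
            return entry.address          # answered from cache
        if authoritative is None:
            return None                   # nothing cached, nobody to ask
        address, ttl = authoritative[name]
        self.cache[name] = CacheEntry(address, now + ttl)
        return address


NAME = "www.example."

# A resolver that never cached the name fails regardless of the TTL.
cold = ToyResolver()
cold_result = cold.resolve(NAME, now=0.0, authoritative=None)

# An infinite TTL keeps a warm cache answering, but it also pins the old
# address after the zone has changed.
sticky = ToyResolver()
sticky.resolve(NAME, 0.0, {NAME: ("192.0.2.1", math.inf)})
stale_result = sticky.resolve(NAME, 1e9, {NAME: ("192.0.2.2", 300.0)})
```

In this toy model cold_result is None and stale_result is still the old
address, which is the "worse in almost every other error scenario" case.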

The only solution to the problem of unreachable authoritative DNS
servers is:  Don't do that.

>> This is a very hard problem to solve.
>
> If that is their policy, it is just a policy to enforce and not
> a problem to solve.

The policy is there to solve a real problem.

Serving stale data from a single disconnected anycast site is a problem.
A disconnected site is unmanaged and must make autonomous decisions.
That pre-programmed decision is "just policy".  Should you withdraw the
DNS routes or not?  Serve stale or risk meltdown?

I still don't think there is an easy and obviously correct answer.  But
they do of course need to add a safety net or two if they continue with
the "meltdown" policy.
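
As a rough sketch of what such a pre-programmed decision could look
like: the code below is purely hypothetical.  The names (Action,
site_decision, is_last_resort) and the particular safety net, a few
designated sites that never withdraw, are my own assumptions for
illustration and not a description of what Facebook does or plans.

```python
from enum import Enum


class Action(Enum):
    SERVE = "announce routes, serve current data"
    WITHDRAW = "withdraw the anycast DNS routes"
    SERVE_STALE = "keep announcing, serve last known data"


def site_decision(backbone_reachable, is_last_resort):
    """Autonomous decision made by one anycast site (hypothetical policy).

    backbone_reachable: result of the site's own health check.
    is_last_resort: assumed safety net; such a site never withdraws.
    """
    if backbone_reachable:
        return Action.SERVE
    if is_last_resort:
        return Action.SERVE_STALE   # safety net: stale beats unreachable
    return Action.WITHDRAW          # the "meltdown" policy


def still_announcing(sites):
    """Count sites that keep announcing; sites is a list of
    (backbone_reachable, is_last_resort) tuples."""
    return sum(
        site_decision(ok, last) is not Action.WITHDRAW for ok, last in sites
    )


# Every site loses the backbone at once.
without_net = still_announcing([(False, False)] * 4)
with_net = still_announcing([(False, False)] * 3 + [(False, True)])
```

Without the safety net every site withdraws at the same time; with it,
one site keeps answering, at the price of possibly serving stale data.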


Bjørn

