DNS pulling BGP routes?

Matthew Petach mpetach at netflight.com
Mon Oct 11 19:49:53 UTC 2021


On Mon, Oct 11, 2021 at 8:07 AM Christopher Morrow <morrowc.lists at gmail.com>
wrote:

> On Sat, Oct 9, 2021 at 11:16 AM Masataka Ohta <
> mohta at necom830.hpcl.titech.ac.jp> wrote:
>
>> Bill Woodcock wrote:
>>
>
[...]

>
> it seems that the problem FB ran into was really that there wasn't either:
>    "secondary path to communicate: "You are the last one standing, do not
> die"  (to an edge node)
>  or:
>   "maintain a very long/less-preferred path to a core location(s) to
> maintain service in case the CDN disappears"
>
> There are almost certainly more complexities which FB is not discussion in
> their design/deployment which
> affected their services last week, but it doesn't look like they were very
> far off on their deployment, if they
> need to maintain back-end connectivity to serve customers from the CDN
> locales.
>
> -chris
>

Having worked on health-checking in large production complexes
in the past, I can definitely say it is an exceedingly difficult
problem for a single site to determine whether it is "safe" to fail
itself out, or whether doing so will take the entire service offline,
short of having a central controller which tracks every edge site's
health and can say "no, we're below $magic_threshold number of
sites, you can't fail yourself out no matter how unhealthy you
think you are."   Which, of course, you can't really have without
undoing one of the key reasons for distributing your serving sites
to geographically distant places in different buildings on different
providers--namely, to eliminate single points of failure in your
serving infrastructure.
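
To make that tradeoff concrete, here's a rough Python sketch of what
such a central "drain arbiter" might look like.  The site names, the
$magic_threshold value, and the SiteHealth structure are all made up for
illustration, not anything FB actually runs--and the arbiter itself is
exactly the single point of failure described above:

# Illustrative only: a toy "central drain arbiter". Site names, the
# threshold, and the SiteHealth structure are hypothetical; a real
# system has to survive losing this arbiter too.
from dataclasses import dataclass

MIN_HEALTHY_SITES = 3  # assumed $magic_threshold

@dataclass
class SiteHealth:
    name: str
    healthy: bool
    wants_to_drain: bool

def may_drain(site: SiteHealth, fleet: list[SiteHealth]) -> bool:
    """Allow a site to fail itself out only if enough healthy peers remain."""
    healthy_peers = sum(1 for s in fleet if s.healthy and s.name != site.name)
    return healthy_peers >= MIN_HEALTHY_SITES

fleet = [
    SiteHealth("pop-sea", True, False),
    SiteHealth("pop-ams", True, False),
    SiteHealth("pop-sin", False, True),
    SiteHealth("pop-gru", True, False),
]
print(may_drain(fleet[2], fleet))  # True: three healthy peers remain

The whole point of the paragraph above, of course, is that you probably
can't afford to build this thing in the first place.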

Doing the equivalent of "no router bgp" on your core backbone
is going to make things suck, no matter how you slice it, and
I don't think any amount of tweaking the anycast setup or
DNS values would have made a whit of difference to the
underlying outage.

I think the only question we can armchair quarterback
at this point is whether there were prudent steps that
could go into a design to shorten the recovery interval.

So far, we seem to have collected a few key points:

1) make sure your disaster recovery plan doesn't depend
    on your production DNS servers being usable; have
    key nodes in /etc/hosts files that are periodically updated
    via $automation_tool, but ONLY for non-production,
    out-of-band recovery nodes; don't statically define any
    of your production-facing entries.
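
As a rough illustration, something along these lines (the inventory
source and hostnames are hypothetical) can be run from $automation_tool
to keep a marker-delimited block of out-of-band entries fresh without
ever touching production names:

# A minimal sketch, assuming a hypothetical OOB inventory dict; the
# marker-delimited block keeps recovery entries refreshable without
# touching anything else in /etc/hosts.  Do NOT put production-facing
# names here -- only out-of-band recovery nodes.
OOB_HOSTS = {                      # hypothetical inventory source
    "oob-gw.pop-sea.example.net": "198.51.100.10",
    "oob-gw.pop-ams.example.net": "198.51.100.20",
}
BEGIN, END = "# BEGIN OOB RECOVERY", "# END OOB RECOVERY"

def render_hosts(existing: str) -> str:
    """Replace (or append) the managed OOB block in an /etc/hosts body."""
    keep, skipping = [], False
    for line in existing.splitlines():
        if line == BEGIN:
            skipping = True
        elif line == END:
            skipping = False
        elif not skipping:
            keep.append(line)
    block = [BEGIN] + [f"{ip}\t{name}" for name, ip in OOB_HOSTS.items()] + [END]
    return "\n".join(keep + block) + "\n"

# Run periodically, e.g.:
#   new = render_hosts(open("/etc/hosts").read())
#   open("/etc/hosts", "w").write(new)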

2) Have a working out-of-band network that is entirely independent
    of your production network.  Dial, frame relay, SMDS, LTE
    modems, Starlink dishes on the roof; pick your poison, but
    budget it in for every production site.  Test it monthly to ensure
    connectivity to all sites works.  Audit regularly to ensure no
    dependencies on the production infrastructure have crept in.
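
The monthly test can be as simple as the sketch below (the console
hostnames and the SSH-port check are assumptions); the important part is
that it runs from a host with no production connectivity at all,
otherwise the audit proves nothing:

# A rough sketch: try to reach each site's out-of-band console server
# over the OOB path itself.  Hostnames are invented for illustration.
import socket

OOB_CONSOLES = [                    # hypothetical per-site OOB console servers
    "con.pop-sea.oob.example.net",
    "con.pop-ams.oob.example.net",
]

def reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

failures = [h for h in OOB_CONSOLES if not reachable(h)]
print("OOB check:", "all sites reachable" if not failures else f"FAILED: {failures}")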

3) Ensure you have a good "oh sh**" physical access plan for
    key personnel.  Some of you at a recent virtual happy hour
    heard me talk about the time I isolated the credit card payment
    center for a $dayjob, which also cut off the card readers needed
    to get into it to restore the network.   Use of a fire axe was granted
    to on-site personnel during that.  Take the time to think through
    how physical access is controlled for every key site in your network,
    think about failure scenarios, and have an "in case of emergency,
    break glass to get the key" plan in place to shorten recovery times.

4) Have a dependency map/graph of your production network.
 a) if everything dies and you have to restart, what has to come up first?
 b) what dependencies are there that have to be brought up in the right order?
 c) what services are independent and can be brought up in parallel to
    speed up recovery?
 d) does every team supporting services on the critical, dependent pathway
    have 24x7 on-call coverage, and do they know where in the recovery graph
    they're needed?  It doesn't help to have teams that can't start back up
    until step 9 crowding around asking "are you ready for us yet?" when you
    still can't raise the team needed for step 1 on the dependency graph.  ^_^;
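
Once the dependencies are actually written down, 4a-4c mostly fall out
of a topological sort.  A toy sketch, with an entirely made-up service
graph:

# Model recovery dependencies as a DAG and compute which services can
# restart in parallel at each step.  The graph below is invented; the
# point is that the ordering falls out once dependencies are recorded.
deps = {                      # service -> things it needs up first (hypothetical)
    "oob-access":   [],
    "core-routing": ["oob-access"],
    "internal-dns": ["core-routing"],
    "auth":         ["internal-dns"],
    "tooling":      ["auth"],
    "edge-serving": ["core-routing", "internal-dns"],
}

def recovery_stages(deps: dict[str, list[str]]) -> list[list[str]]:
    """Kahn-style layering: stage N can start once stages < N are up."""
    remaining = {svc: set(d) for svc, d in deps.items()}
    stages = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle -- recovery order undefined")
        stages.append(ready)
        for s in ready:
            del remaining[s]
        for d in remaining.values():
            d.difference_update(ready)
    return stages

for i, stage in enumerate(recovery_stages(deps), 1):
    print(f"step {i}: {', '.join(stage)}")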

5) do you know how close the nearest personnel are to each POP/CDN node,
    in case you have to do emergency "drive over with a laptop, hop on the
    console, and issue the following commands" rousting in the middle of the
    night?  If someone lives 3 miles from the CDN node, it's good to know
    that, so you don't call the person who is on-call but is 2 hours away
    without first checking whether the person 3 miles away can do it faster.
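
Even a back-of-the-envelope distance check helps here.  A small sketch
with invented names and coordinates (real data would come from your
on-call/HR system, and "closest" still has to respect who is actually
qualified to touch the gear):

# Who is physically closest to a given POP?  Haversine is plenty
# accurate for "who do we wake up first".  All data here is made up.
from math import radians, sin, cos, asin, sqrt

def km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

pop = (47.61, -122.33)                         # hypothetical CDN node
staff = {"alice": (47.62, -122.35), "bob": (48.75, -122.48)}
print(min(staff, key=lambda n: km(staff[n], pop)))  # -> alice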

I'm sure others have even better experiences than mine, and can contribute
and add to the list.  If nothing else, perhaps collectively we can help
other companies prepare a bit better, so that when the next big "ooops"
happens, the recovery time can be a little bit shorter.   :)

Thanks!

Matt