Facebook post-mortems...

Mark Tinka mark at tinka.africa
Wed Oct 6 05:08:09 UTC 2021

On 10/6/21 06:51, Hank Nussbacher wrote:

> - "During one of these routine maintenance jobs, a command was issued 
> with the intention to assess the availability of global backbone 
> capacity, which unintentionally took down all the connections in our 
> backbone network"
> Can anyone guess as to what command FB issued that would cause them to 
> withdraw all those prefixes?

Hard to say, as it seems the command was innocent enough, perhaps
running a batch of sub-commands to check port status, bandwidth
utilization, MPLS-TE values, etc. However, it sounds like either an
unforeseen bug in the command ran other things, or the way the
sub-commands were cascaded caused unforeseen problems.

We shall be guessing at this one forever, as I doubt Facebook will go
into that much detail.

What I can tell you is that all the major content providers spend a lot
of time, money and effort on automating both capacity planning and
capacity auditing. It's a bit more complex for them, because their
variables aren't just links and utilization, but also locations, fibre
availability, fibre pricing, capacity lease pricing, the presence of
carrier-neutral data centres, the presence of exchange points, current
vendor equipment models and pricing, projections of future fibre and
capacity pricing, etc.

It's a totally different world from normal ISP-land.

> - "it was not possible to access our data centers through our normal 
> means because their networks were down, and second, the total loss of 
> DNS broke many of the internal tools we’d normally use to investigate 
> and resolve outages like this.  Our primary and out-of-band network 
> access was down..."
> Does this mean that FB acknowledges that the loss of DNS broke their 
> OOB access?

I need to put my thinking cap on, but I am not sure whether running DNS
in the IGP would have been better in this instance.

We run our Anycast DNS network in our IGP, mainly to always guarantee 
latency-based routing, but also to ensure that the failure of a 
higher-level protocol like BGP does not disconnect internal access that 
is needed for troubleshooting and repair. Given that the IGP is a much
lower-level routing protocol, it's likely (though not guaranteed) that
it would not go down with BGP.

In the past, we have indeed had BGP issues during which we maintained
DNS access internally, because the IGP was unaffected.
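
To make the IGP-versus-BGP point concrete, here is a toy Python sketch.
It is an illustration only, not anyone's production setup: the Route
type, the best_route() helper, the node names and the documentation
prefixes are all hypothetical. It assumes the anycast DNS address is
carried in the IGP by two nodes with different costs, while an external
prefix is carried in BGP, and then checks what survives a BGP failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str    # destination prefix
    protocol: str  # "igp" or "bgp" (which protocol carries the route)
    next_hop: str  # hypothetical node name
    metric: int    # cost; lower means closer


def best_route(rib, prefix, failed_protocols=frozenset()):
    """Return the lowest-metric surviving route for prefix, or None."""
    candidates = [
        r for r in rib
        if r.prefix == prefix and r.protocol not in failed_protocols
    ]
    return min(candidates, key=lambda r: r.metric, default=None)


# Assumed addresses from the documentation ranges, not real services.
ANYCAST_DNS = "192.0.2.53/32"
EXTERNAL = "198.51.100.0/24"

RIB = [
    Route(ANYCAST_DNS, "igp", "dns-node-a", 10),  # nearest anycast node
    Route(ANYCAST_DNS, "igp", "dns-node-b", 40),  # farther anycast node
    Route(EXTERNAL, "bgp", "edge-1", 0),
]

# Normal operation: the closest anycast node wins on IGP cost.
normal = best_route(RIB, ANYCAST_DNS)

# BGP fails: the DNS route is still there, the BGP-learned prefix is not.
dns_during_bgp_outage = best_route(RIB, ANYCAST_DNS, failed_protocols={"bgp"})
external_during_bgp_outage = best_route(RIB, EXTERNAL, failed_protocols={"bgp"})

# The nearest DNS node fails: traffic falls over to the next-closest one.
rib_without_a = [r for r in RIB if r.next_hop != "dns-node-a"]
dns_after_node_loss = best_route(rib_without_a, ANYCAST_DNS)
```

The sketch only models route selection. It says nothing about whether
the IGP itself stays up, which is the "not guaranteed" part above.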

The final statement from that report is interesting:

     "From here on out, our job is to strengthen our testing,
     drills, and overall resilience to make sure events like this
     happen as rarely as possible."

... which, in my rudimentary translation, means that:

     "There are no guarantees that our automation software will not
     poop cows again, but we hope that when that does happen, we
     shall be able to send our guys out to site much more quickly."

... which, to be fair, is totally understandable. These automation
tools, especially in large networks such as BigContent, become
significantly more fragile the more complex they get, and the more batch
tasks they need to perform on various parts of a network of this size
and scope. It's a pity these automation tools are all homegrown, and
can't be bought "pre-packaged and pre-approved to never fail" from the
IT Software Store down the road. But it's the only way for networks of
this capacity to operate, and it's the risk they will always carry for
being that large.

