massive facebook outage presently

Sabri Berisha sabri at cluecentral.net
Mon Oct 4 17:46:23 UTC 2021


----- On Oct 4, 2021, at 10:07 AM, Anne P. Mitchell, Esq. amitchell at isipp.com wrote:

Hi Anne,

> On a related note, what do you think the scene is like in FB HQ right now?
> (shaking head)

Very quiet, as their offices are still closed for all but essentials :)

But from experience I can tell you how that works. I assume Facebook works in a
manner similar to some of my previous employers. This assumption comes from the
fact that quite a number of my previous colleagues now work at Facebook in similar
roles.

First there is the question of detecting the outage. Obviously, Facebook will have
a monitoring/SRE team that continuously monitors 1000s of metrics. They observe
a number of metrics go down, and start to investigate. Most likely they will have
some sort of overall technical lead (let's call this the Technical Duty Officer)
who is responsible for the whole thing. Once the SRE team has figured out where the
problem lies, they will alert the TDO. The TDO will then hit that big red button and
send out alerts to the appropriate teams to jump on a bridge (let's call that the
Technical Crisis Bridge) to fix the issue.
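
To make that flow a bit more concrete, here is a rough Python sketch of the
detect-then-page sequence. Everything in it is made up for illustration: the metric
names, the owning teams, the 50% drop threshold, and the helper names are my own
assumptions, not anything I know about Facebook's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical threshold: a metric is "down" below 50% of its baseline.
DROP_THRESHOLD = 0.5


@dataclass
class Metric:
    name: str
    owner_team: str  # hypothetical team that owns this metric
    baseline: float
    current: float

    def is_down(self) -> bool:
        # Flag the metric when it falls below a fraction of its baseline.
        return self.current < self.baseline * DROP_THRESHOLD


@dataclass
class CrisisBridge:
    paged_teams: List[str] = field(default_factory=list)


def detect_outage(metrics: List[Metric]) -> List[Metric]:
    """SRE step: return the metrics that have gone down."""
    return [m for m in metrics if m.is_down()]


def open_bridge(down: List[Metric]) -> CrisisBridge:
    """TDO step: page each owning team once and put them on one bridge."""
    bridge = CrisisBridge()
    for m in down:
        if m.owner_team not in bridge.paged_teams:
            bridge.paged_teams.append(m.owner_team)
    return bridge


# Invented sample data, purely for illustration.
metrics = [
    Metric("login_success_rate", "auth", baseline=0.99, current=0.02),
    Metric("api_requests_per_sec", "api", baseline=1000.0, current=15.0),
    Metric("cache_hit_ratio", "cdn", baseline=0.95, current=0.94),
]

down = detect_outage(metrics)
bridge = open_bridge(down)
if down:
    print("Paging:", ", ".join(bridge.paged_teams))
```

In real life the "big red button" involves a human making a judgment call, so treat
the automatic paging here as a simplification.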

If done right, whoever is on call for that team will take the lead and interface
with adjoining teams, and other team members who are available to help out. Looking
at how long this outage is lasting, there must be either something very broken, or
they're having trouble rolling back a change that was not expected to have any impact.

Once the issue is fixed, the TDO will write a report and submit it to the Problem
Management group. This group will then contact the teams deemed responsible for the
outage. Those teams will now have an opportunity to explain themselves during a
post-mortem. Depending on the scale of the outage, the post-mortem can be a 10-minute
call on a bridge with a Problem Management manager, or a 60-minute meeting in the
hot seat with a bunch of execs.

I've been in that hot seat a few times. Not the most pleasurable experience. Perhaps
it's time for a new career :)

Thanks,

Sabri




