<div dir="ltr"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Do we actually know this wrt the tools referred to in "the total loss of<br>DNS broke many of the tools we’d normally use to investigate and resolve<br>outages like this."?  Those tools aren't necessarily located in any of<br>the remote data centers, and some of them might even refer to resources<br>outside the facebook network.</blockquote><div> </div><div>Yeah, that's kinda the thinking here.  Specifics are scarce, but there were notes that the OOB network, for instance, was also unusable.  The questions are how much of that was due to the OOB network depending on the production side, and how much DNS being notionally available might have helped get things back off the ground (if it would just provide management addresses for key devices, or if perhaps there was a AAA dependency that also rode on DNS).  This isn't to say there aren't other design considerations in play to make that fly (e.g. if DNS lives in edge POPs, and such an edge POP gets isolated from the FB network but still has public Internet peering, how do we ensure that edge POP does not continue exporting the DNS prefix into the DFZ and serving stale records?), but those are perhaps also still solvable.</div><div><br></div><div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I'm sure they'll learn from this and in the future have some better things in place to account for such a scenario.</blockquote><div><br></div><div>100%</div><div><br></div><div>I think we can say with some level of confidence that there is going to be a <i>lot</i> of discussion and re-evaluation of inter-service dependencies. 
</div><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><font face="monospace"><br></font></div><div dir="ltr"><span style="font-family:arial,sans-serif">-- </span></div><div dir="ltr"><span style="font-family:arial,sans-serif">Hugo Slabbert</span></div></div></div></div></div></div></div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 6, 2021 at 9:48 AM Tom Beecher <<a href="mailto:beecher@beecher.cc">beecher@beecher.cc</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I mean, at the end of the day they likely designed these systems to be able to handle one or more datacenters being disconnected from the world, and considered a scenario of ALL their datacenters being disconnected from the world so unlikely they chose not to solve for it. Works great, until it doesn't.</div><div><br></div><div>I'm sure they'll learn from this and in the future have some better things in place to account for such a scenario. </div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 6, 2021 at 12:21 PM Bjørn Mork <<a href="mailto:bjorn@mork.no" target="_blank">bjorn@mork.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Tom Beecher <<a href="mailto:beecher@beecher.cc" target="_blank">beecher@beecher.cc</a>> writes:<br>

<br>

>  Even if the external<br>

> announcements were not withdrawn, and the edge DNS servers could provide<br>

> stale answers, the IPs those answers provided wouldn't have actually been<br>

> reachable<br>

<br>

Do we actually know this wrt the tools referred to in "the total loss of<br>

DNS broke many of the tools we’d normally use to investigate and resolve<br>

outages like this."?  Those tools aren't necessarily located in any of<br>

the remote data centers, and some of them might even refer to resources<br>

outside the facebook network.<br>

<br>

Not to mention that keeping the DNS service up would have prevented<br>

resolver overload in the rest of the world.<br>

<br>

Besides, the disconnected frontend servers are probably configured to<br>

display a "we have a slight technical issue. will be right back" notice<br>

in such situations.  This is a much better user experience than the<br>

"facebook?  never heard of it" message we got on Monday.<br>

<br>

Yes, it makes sense to keep your domains alive even if your network<br>

isn't.  That's why the best practice is name servers in more than one<br>

AS.<br>

<br>

<br>

<br>

<br>

Bjørn<br>

</blockquote></div></div>

</blockquote></div>