<div dir="ltr"><div></div><div>From what they have said publicly, the initial trigger was that all of their datacenters were disconnected from their internal backbone, and thus unreachable. </div><div><br></div><div>Once that occurs, nothing else really matters. Even if the external announcements had not been withdrawn, and the edge DNS servers could have provided stale answers, the IPs in those answers wouldn't actually have been reachable, and there wouldn't be 3 days of red herring conversations about DNS design. </div><div><br></div><div>No DNS design exists that can help people reach resources that are not reachable over the network. /shrug</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 5, 2021 at 6:30 PM Hugo Slabbert &lt;<a href="mailto:hugo@slabnet.com">hugo@slabnet.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Had some chats with other folks:<br>Arguably you could change the nameserver isolation check failure action to be "depref your exports" rather than "yank it all".  Basically, set up tiers so that the boxes passing those additional health checks, which should have correct entries, are your primary destination, while failing nodes don't receive query traffic since they're depref'd in your internal routing.  But in case all nodes fail that check simultaneously, the nodes failing the isolation check would attract traffic again, as no better paths remain.  
Better to serve stale data than none at all; CAP theorem trade-offs at work?<br clear="all"><div><div dir="ltr"><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><div dir="ltr"><font face="monospace"><br></font></div><div dir="ltr"><span style="font-family:arial,sans-serif">-- </span></div><div dir="ltr"><span style="font-family:arial,sans-serif">Hugo Slabbert</span></div></div></div></div></div></div></div></div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 5, 2021 at 3:22 PM Michael Thomas &lt;<a href="mailto:mike@mtcc.com" target="_blank">mike@mtcc.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p><br>
    </p>
    <div>On 10/5/21 3:09 PM, Andy Brezinsky
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <p>It's a few years old, but Facebook has talked a little bit
        about their DNS infrastructure before.  Here's a little clip
        that talks about Cartographer: <a href="https://youtu.be/bxhYNfFeVF4?t=2073" target="_blank">https://youtu.be/bxhYNfFeVF4?t=2073</a>
        <br>
      </p>
      <p>From their outage report, it sounds like their authoritative
        DNS servers withdraw their anycast announcements when they're
        unhealthy.  The health check from those servers must have relied
        on something upstream.  Maybe they couldn't talk to Cartographer
        for a few minutes so they thought they might be isolated from
        the rest of the network and they decided to withdraw their
        routes instead of serving stale data.  Makes sense when a single
        node does it, not so much when the entire fleet thinks that
        they're out on their own.<br>
      </p>
      <p>A performance issue in Cartographer (or whatever manages this
        fleet these days) could have been the ticking time bomb that set
        the whole thing in motion.<br>
      </p>
    </blockquote>
    <p>Rereading it, they say that their internal (?) backbone went down,
      so pulling the routes was arguably the right thing to do, or at
      least not flat-out wrong. Taking out their nameserver subnets was
      clearly a problem, though a fix is probably tricky since
      you also want to take down errant nameservers. <br>
    </p>
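    <p>A rough sketch of the behaviour discussed above, where each nameserver withdraws its own anycast route when its isolation check fails. This is a hypothetical Python model and not Facebook's implementation; the <code>Nameserver</code> class, its fields and <code>apply_withdraw_policy</code> are names invented purely for illustration.</p>
<pre>
```python
# Hypothetical model: every name here is illustrative, not Facebook's code.
from dataclasses import dataclass


@dataclass
class Nameserver:
    name: str
    backbone_reachable: bool  # result of the node's own isolation check
    announcing: bool = True   # whether the anycast route is being exported


def apply_withdraw_policy(fleet):
    """Each node decides alone: if its check fails, it withdraws its route."""
    for node in fleet:
        node.announcing = node.backbone_reachable
    return [node.name for node in fleet if node.announcing]


fleet = [Nameserver("a", True), Nameserver("b", False), Nameserver("c", True)]
print(apply_withdraw_policy(fleet))  # one bad node: the others keep serving

for node in fleet:
    node.backbone_reachable = False  # backbone loss hits every node at once
print(apply_withdraw_policy(fleet))  # no node is left announcing
```
</pre>
    <p>With a single failing node the rest keep announcing; when a backbone loss fails every check at once, the same rule makes the whole fleet withdraw itself.</p>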
    <p><br>
    </p>
    <p>Mike<br>
    </p>
    <br>
  </div>

</blockquote></div>
</blockquote></div>
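<div>For what it's worth, the "depref your exports" idea quoted above could be sketched roughly like this. It is a minimal Python model under stated assumptions, not anyone's production logic: the preference values, <code>export_preference</code> and <code>nodes_receiving_queries</code> are hypothetical, and real route selection involves far more than one number per node.</div>
<pre>
```python
# Hypothetical sketch of "depref your exports"; names and values are made up.
HEALTHY_PREF = 200  # assumed preference for nodes passing the isolation check
DEPREF = 50         # assumed lower preference for nodes failing it


def export_preference(passes_isolation_check):
    """Failing nodes stay announced, just at a worse preference."""
    return HEALTHY_PREF if passes_isolation_check else DEPREF


def nodes_receiving_queries(check_results):
    """Return the nodes holding the best preference, as routing would pick."""
    prefs = {name: export_preference(ok) for name, ok in check_results.items()}
    if not prefs:
        return []
    best = max(prefs.values())
    return sorted(name for name, pref in prefs.items() if pref == best)


# One node fails: it is depref'd and attracts no query traffic.
print(nodes_receiving_queries({"a": True, "b": False, "c": True}))

# Every node fails: no better path remains, so all of them attract traffic.
print(nodes_receiving_queries({"a": False, "b": False, "c": False}))
```
</pre>
<div>This only keeps the nameservers answering, possibly with stale data. As noted at the top of this message, the addresses in those answers still have to be reachable for it to help.</div>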