Assumptions about network designs...

sronan at ronan-online.com
Tue Jul 12 00:18:38 UTC 2022


I’m “guessing”, based on all the services that were impacted, that the outage was likely caused by a change that triggered a routing change in their multi-service network, which overloaded many network devices; by isolating the source of the routes or traffic, the rest of the network was able to recover.

But just a guess.

Shane
> On Jul 11, 2022, at 4:22 PM, Matthew Petach <mpetach at netflight.com> wrote:
> 
> 
> 
>> On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ankost at podolsk.ru> wrote:
>> It's hard to believe that a simultaneous maintenance affecting so many
>> devices in the core network could be approved. Core networks are built
>> with redundancy, so that failures can't completely destroy the whole
>> network.
> 
> I think you might need to re-evaluate your assumption 
> about how core networks are built.
> 
> A well-designed core network will have layers of redundancy 
> built in, with easy isolation of fault layers, yes.
> 
> I've seen (and sometimes worked on) too many networks 
> that didn't have enough budget for redundancy, and were 
> built as a string of pearls, one router to the next; if any router 
> in the string of pearls broke, the entire string of pearls would 
> come crashing down, to abuse a metaphor just a bit too much.
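> 
> To make the "string of pearls" point concrete: one quick way to spot
> that kind of single point of failure is to find the articulation
> points of the topology graph, i.e. the nodes whose loss partitions
> the network. A minimal Python sketch (the four-router chain below is
> hypothetical, standing in for a real inventory):
> 
>     # Flag routers whose loss would partition the network by finding
>     # articulation points (Tarjan's DFS low-link method).
>     from collections import defaultdict
> 
>     def articulation_points(adj):
>         disc, low, cuts = {}, {}, set()
>         timer = [0]
> 
>         def dfs(node, parent):
>             disc[node] = low[node] = timer[0]
>             timer[0] += 1
>             children = 0
>             for nxt in adj[node]:
>                 if nxt == parent:
>                     continue
>                 if nxt in disc:
>                     low[node] = min(low[node], disc[nxt])
>                 else:
>                     children += 1
>                     dfs(nxt, node)
>                     low[node] = min(low[node], low[nxt])
>                     if parent is not None and low[nxt] >= disc[node]:
>                         cuts.add(node)
>             if parent is None and children > 1:
>                 cuts.add(node)
> 
>         for n in list(adj):
>             if n not in disc:
>                 dfs(n, None)
>         return cuts
> 
>     # A linear "string of pearls": r1 - r2 - r3 - r4
>     topo = defaultdict(set)
>     for a, b in [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]:
>         topo[a].add(b)
>         topo[b].add(a)
> 
>     print(articulation_points(topo))   # {'r2', 'r3'}: every middle pearl
> 
> Any non-empty result means a single device failure can split the
> network, no matter how much redundancy the individual boxes have.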
> 
> Really well-thought-out redundancy takes a design team that 
> has enough experience and enough focused hours in the day 
> to think through different failure modes and lay out the design 
> ahead of time, before purchases get made. Many real-world 
> networks share the same engineers between design, deployment, 
> and operation of the network--and in that model, operation and 
> deployment always win over design when it comes time to allocate 
> engineering hours. Likewise, if you didn't have the luxury of being 
> able to lay out the design ahead of time, before purchasing hardware 
> and leasing facilities, you're likely doing the best you can with locations 
> that were contracted before you came into the picture, using hardware 
> that was decided on before you had an opportunity to suggest better 
> alternatives. 
> 
> Taking it a step further, and thinking about the large Facebook outage, 
> even if you did well in the design phase, and chose two different vendors, 
> with hardware redundancy and site redundancy in your entire core 
> network, did you also think about redundancy and diversity for the 
> O&M side of the house? Does each redundant data plane have a 
> diverse control plane and management plane, or would an errant 
> redistribution of BGP into IGP wipe out both data planes, and both 
> hardware vendors at the same time?  Likewise, if a bad configuration 
> push isolates your core network nodes from the "God box" that 
> controls the device configurations, do you have redundancy in 
> connectivity to that "God box" so that you can restore known-good 
> configurations to your core network sites, or are you stuck dispatching 
> engineers with laptops and USB sticks with configs on them to get 
> back to a working condition again?
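> 
> One pattern that limits the blast radius of exactly that scenario is
> a device-side rollback timer, the behaviour Junos exposes as "commit
> confirmed": the router activates the candidate config, arms a timer,
> and reverts on its own unless the operator confirms in time, so a
> push that severs the management plane heals itself instead of
> waiting for the laptops and USB sticks. A rough Python sketch of the
> idea (the apply/revert hooks and grace period are hypothetical, not
> any particular vendor's API):
> 
>     import threading
> 
>     class CommitConfirmed:
>         """Apply a candidate config; auto-revert unless confirmed."""
> 
>         def __init__(self, apply_config, revert_config, grace_seconds=300):
>             self._apply = apply_config       # activates the candidate config
>             self._revert = revert_config     # restores the last known-good config
>             self._grace = grace_seconds
>             self._timer = None
> 
>         def commit(self, candidate):
>             self._apply(candidate)
>             # Arm the local rollback timer; it fires even if the push
>             # just cut this device off from the management plane.
>             self._timer = threading.Timer(self._grace, self._revert)
>             self._timer.start()
> 
>         def confirm(self):
>             # Only a management plane that can still reach the device
>             # can confirm, so an isolating change never becomes permanent.
>             if self._timer is not None:
>                 self._timer.cancel()
>                 self._timer = None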
> 
> As you follow the control of core networks back up the chain, 
> you ultimately realize that no network is truly redundant and 
> diverse.  Every network ultimately comes back to a single point 
> of failure, and the only distinction you can make is how far up the 
> ladder you climb before you discover that single point of failure.
> 
> Thanks!
> 
> Matt
> 