Assumptions about network designs...
sronan at ronan-online.com
Tue Jul 12 00:18:38 UTC 2022
I’m “guessing,” based on all the services that were impacted, that the outage was likely caused by a change that triggered a routing change in their multi-service network, which overloaded many network devices, and that by isolating the source of the routes or traffic the rest of the network was able to recover.
But just a guess.
Shane
> On Jul 11, 2022, at 4:22 PM, Matthew Petach <mpetach at netflight.com> wrote:
>
>
>
>> On Mon, Jul 11, 2022 at 9:01 AM Andrey Kostin <ankost at podolsk.ru> wrote:
>> It's hard to believe that simultaneous maintenance affecting so many
>> devices in the core network could be approved. Core networks are built
>> with redundancy, so that failures can't completely destroy the whole
>> network.
>
> I think you might need to re-evaluate your assumption
> about how core networks are built.
>
> A well-designed core network will have layers of redundancy
> built in, with easy isolation of fault layers, yes.
>
> I've seen (and sometimes worked on) too many networks
> that didn't have enough budget for redundancy, and were
> built as a string of pearls, one router to the next; if any router
> in the string of pearls broke, the entire string of pearls would
> come crashing down, to abuse a metaphor just a bit too much.
>
> Really well-thought-out redundancy takes a design team that
> has enough experience and enough focused hours in the day
> to think through different failure modes and lay out the design
> ahead of time, before purchases get made. Many real-world
> networks share the same engineers between design, deployment,
> and operation of the network--and in that model, operation and
> deployment always win over design when it comes time to allocate
> engineering hours. Likewise, if you didn't have the luxury of being
> able to lay out the design ahead of time, before purchasing hardware
> and leasing facilities, you're likely doing the best you can with locations
> that were contracted before you came into the picture, using hardware
> that was decided on before you had an opportunity to suggest better
> alternatives.
>
> Taking it a step further, and thinking about the large Facebook outage,
> even if you did well in the design phase, and chose two different vendors,
> with hardware redundancy and site redundancy in your entire core
> network, did you also think about redundancy and diversity for the
> O&M side of the house? Does each redundant data plane have a
> diverse control plane and management plane, or would an errant
> redistribution of BGP into IGP wipe out both data planes, and both
> hardware vendors at the same time? Likewise, if a bad configuration
> push isolates your core network nodes from the "God box" that
> controls the device configurations, do you have redundancy in
> connectivity to that "God box" so that you can restore known-good
> configurations to your core network sites, or are you stuck dispatching
> engineers with laptops and USB sticks with configs on them to get
> back to a working condition again?
>
> As you follow the control of core networks back up the chain,
> you ultimately realize that no network is truly redundant and
> diverse. Every network ultimately comes back to a single point
> of failure, and the only distinction you can make is how far up the
> ladder you climb before you discover that single point of failure.
>
> Thanks!
>
> Matt
>
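As an illustration of the "string of pearls" point in the quoted message, here is a minimal sketch of how one might check a topology for routers whose loss partitions the rest of the network. It is not anything from the original thread: the router names, the link lists, and the function names are all hypothetical, and it models only physical adjacency, not the control or management plane.

```python
from collections import deque


def reachable(adj, start, removed):
    """Return the set of routers reachable from `start` while `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in adj[node]:
            if peer != removed and peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


def single_points_of_failure(links):
    """List routers whose failure disconnects the surviving routers from each other."""
    # Build an undirected adjacency map from the (router, router) link pairs.
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    spofs = []
    for router in sorted(adj):
        survivors = [r for r in adj if r != router]
        if not survivors:
            continue
        # If some survivor cannot reach all the others, this router is a SPOF.
        if len(reachable(adj, survivors[0], router)) < len(survivors):
            spofs.append(router)
    return spofs


# Hypothetical topologies: a four-router chain, and the same chain closed into a ring.
string_of_pearls = [("r1", "r2"), ("r2", "r3"), ("r3", "r4")]
ring = string_of_pearls + [("r4", "r1")]

print(single_points_of_failure(string_of_pearls))  # ['r2', 'r3']
print(single_points_of_failure(ring))              # []
```

In this simplified model the end routers of the chain only take themselves down, while every interior router splits the network; adding the one extra link that closes the ring removes every single-router point of failure. It says nothing about shared control-plane or configuration-management dependencies, which is the harder problem the quoted message describes.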