CenturyLink RCA?

William Herrin bill at herrin.us
Mon Dec 31 21:32:06 UTC 2018


On Mon, Dec 31, 2018 at 7:24 AM Naslund, Steve <SNaslund at medline.com> wrote:
> Bad design if that’s the case, that would be a huge subnet.

According to the notes at the URL Saku shared, they suffered a cascade
failure from which they needed the equipment vendor's help to recover.
That indicates at least two grave design errors:

1. Vendor monoculture is a single point of failure. Same equipment
running the same software triggers the same bug. It all kabooms at
once. Different vendors running different implementations have
compatibility issues but when one has a bug it's much less likely to
take down all the rest.

2. Failure to implement system boundaries. When you automate systems
it's important to restrict the reach of that automation. Whether it's
a regional boundary or independent backbones, a critical system like
this one should be structurally segmented so that malfunctioning
automation can bring down only one piece of it.

Regards,
Bill Herrin



  However even if that was the case, you would not need to replace
hardware in multiple places.  You might have to reset it but not
replace it.  Also being an ILEC it seems hard to believe how long
their dispatches to their own central office took.  It might have
taken awhile to locate the original problem but they should have been
able to send a corrective procedure to CO personnel who are a lot
closer to the equipment.  In my region (Northern Illinois) we can
typically get access to a CO in under 30 minutes 24/7.  They are
essentially smart hands technicians that can reseat or replace line
cards.
>
> > 2.  Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching?  Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid >frames".  Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.
>
> >L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
> >However I can be argued that optical network should fail up in absence of control-plane, IP network has to fail down.
>
> Most of the optical muxes I have worked with will run without any management card or control plane at all.  Usually the line cards keep forwarding according to the existing configuration even in the absence of all management functions.  It would help if we knew what gear this was.  True optical muxes do not require much care and feeding once they have a configuration loaded.  If they are truly dependent on that control plane, then it needs to be redundant enough with watch dogs to reset them if they become non responsive and they need policers and rate limiter on their interfaces.  Seems they would be vulnerable to a DoS if a bad
> BPDU can wipe them out.
>
> > 3.  In the cited document it was stated that the offending packet did not have source or destination information.  If so, how did it get propagated throughout the network?
>
> >BPDU
>
> Maybe, it would be strange that it was invalid but valid enough to continue forwarding.  In any case loss of the management network should not interrupt forwarding.  I also would not be happy with an optical network that relies on spanning tree to remain operational.
>
> > My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.
>
> >Lot of possible reasons, I choose to believe what they've communicated is what the writer of the communication thought that happened, but as they likely are not SME it's broken radio communication. BCAST storm on L2 DCN >would plausibly fit the very ambiguous reason offered and is something people actually are doing.
>
> My biggest problem with their explanation is the replacement of line cards in multiple cities.  The only way that happens is when bad code gets pushed to them.  If it took them that long to fix an L2 broadcast storm, something is seriously wrong with their engineering.  Resetting the management interfaces should be sufficient once the offending line card is removed.  That is why I think this was a software update failure or a configuration push.  Either way, they should be jumping up and down on their vendor as to why this caused such large scale effects.



--
William Herrin ................ herrin at dirtside.com  bill at herrin.us
Dirtside Systems ......... Web: <http://www.dirtside.com/>



More information about the NANOG mailing list