CenturyLink RCA?

Naslund, Steve SNaslund at medline.com
Mon Dec 31 15:23:52 UTC 2018


See my comments in line.

Steve

>Hey Steve,

>I will continue to speculate, as that's all we have.

> 1.  Are you telling me that several line cards failed in multiple cities in the same way at the same time?  Don't think so unless the same software fault was propagated to all of them.  If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?

>L2 DCN/OOB, whole network shares single broadcast domain. 

Bad design if that’s the case; that would be a huge subnet.  However, even if that were the case, you would not need to replace hardware in multiple places.  You might have to reset it, but not replace it.  Also, being an ILEC, it is hard to believe how long their dispatches to their own central offices took.  It might have taken a while to locate the original problem, but they should have been able to send a corrective procedure to CO personnel, who are a lot closer to the equipment.  In my region (Northern Illinois) we can typically get access to a CO in under 30 minutes, 24/7.  They are essentially smart-hands technicians who can reseat or replace line cards.

> 2.  Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching?  Very doubtful, which means that the systems were actually broken due to trying to PROCESS the "invalid frames".  Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.

>L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
>However it can be argued that an optical network should fail up in the absence of control plane, while an IP network has to fail down.

Most of the optical muxes I have worked with will run without any management card or control plane at all.  Usually the line cards keep forwarding according to the existing configuration even in the absence of all management functions.  It would help if we knew what gear this was.  True optical muxes do not require much care and feeding once they have a configuration loaded.  If they are truly dependent on that control plane, then it needs to be redundant enough, with watchdogs to reset the cards if they become non-responsive, and they need policers and rate limiters on their interfaces.  They would seem to be vulnerable to a DoS if a bad BPDU can wipe them out.
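
For illustration only (this is a generic sketch, not anything we actually know about their gear), a management-plane policer is conceptually just a token bucket sitting in front of the control CPU; the class name and the numbers below are invented:

    import time

    class TokenBucketPolicer:
        # Admit control-plane frames only while tokens remain, so a storm
        # on the management Ethernet cannot starve the control CPU.
        def __init__(self, rate_pps, burst):
            self.rate = rate_pps              # frames per second refilled
            self.burst = burst                # maximum bucket depth
            self.tokens = float(burst)
            self.last = time.monotonic()

        def admit(self):
            now = time.monotonic()
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True                   # punt frame to the control plane
            return False                      # drop it; forwarding is untouched

    # e.g. police management traffic to 500 frames/s with a burst of 100
    policer = TokenBucketPolicer(rate_pps=500, burst=100)

With something like that in front of the management Ethernet, a broadcast storm gets dropped at the interface instead of pegging the CPU that the shelf apparently depends on.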

> 3.  In the cited document it was stated that the offending packet did not have source or destination information.  If so, how did it get propagated throughout the network?

>BPDU

Maybe, but it would be strange for the frame to be invalid yet valid enough to keep being forwarded.  In any case, loss of the management network should not interrupt forwarding.  I also would not be happy with an optical network that relies on spanning tree to remain operational.
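
For what it's worth, here is roughly what a classic 802.1D BPDU looks like if you build one in Scapy (purely illustrative; the MACs and bridge IDs are invented, and we obviously do not know what the actual offending frame contained):

    # pip install scapy
    from scapy.all import Dot3, LLC, STP

    bpdu = (
        Dot3(dst="01:80:c2:00:00:00",        # reserved STP multicast MAC
             src="00:11:22:33:44:55")        # hypothetical source MAC
        / LLC(dsap=0x42, ssap=0x42, ctrl=3)  # 802.2 LLC header used by STP
        / STP(rootid=32768, rootmac="00:11:22:33:44:55",
              bridgeid=32768, bridgemac="00:11:22:33:44:55")
    )

    bpdu.show()    # dumps the layers: note there is no IP header at all

There is no IP layer at all, and the destination is the reserved 01:80:c2:00:00:00 multicast address, so a frame like this could be described as having "no source or destination" in the IP sense while still getting flooded by any bridge that does not properly trap or filter the reserved address.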

> My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.

>Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication thought happened, but as they are likely not an SME, it's broken-radio communication. A BCAST storm on the L2 DCN would plausibly fit the very ambiguous reason offered and is something people actually are doing.

My biggest problem with their explanation is the replacement of line cards in multiple cities.  The only way that happens is when bad code gets pushed to them.  If it took them that long to fix an L2 broadcast storm, something is seriously wrong with their engineering.  Resetting the management interfaces should be sufficient once the offending line card is removed.  That is why I think this was a software update failure or a configuration push.  Either way, they should be jumping up and down on their vendor as to why this caused such large-scale effects.
