Famous operational issues

Sabri Berisha sabri at cluecentral.net
Fri Feb 19 20:15:53 UTC 2021


----- On Feb 19, 2021, at 3:07 AM, Daniel Karrenberg dfk at ripe.net wrote:

Hi,

> Lessons: HW/SW mono-cultures are dangerous. Input testing is good
> practice at all levels of software. Operational co-ordination is key in
> times of crisis.

Well... Here is a very similar, fairly recent one, although in this case the
opposite is true: running one software train would have prevented an outage.
Some members on this list (hi, Brian!) will recognize the story.

Group XX within $company decided to deploy EVPN. The entire backbone was
running a single $vendor, but on different software trains. It turns out that
between an early draft, implemented in version X, and the RFC, implemented in
version Y, the NLRI format changed in a way that was not backwards compatible.
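
A minimal sketch of the kind of pre-deployment check that could flag a
mixed-version backbone before a new address family is turned up. The
inventory, hostnames, and role names are hypothetical; only the version
labels X and Y come from the story above.

```python
from collections import defaultdict

# Hypothetical inventory: (hostname, role, software_version).
# Hostnames and roles are made up for illustration.
INVENTORY = [
    ("dc-egress-1", "dc-egress", "X"),
    ("dc-egress-2", "dc-egress", "X"),
    ("rr-1", "route-reflector", "Y"),
    ("rr-2", "route-reflector", "Y"),
]


def versions_by_role(inventory):
    """Group the software versions in use by device role."""
    result = defaultdict(set)
    for _host, role, version in inventory:
        result[role].add(version)
    return dict(result)


def mixed_versions(inventory):
    """Return the sorted distinct versions if more than one is in use,
    otherwise an empty list."""
    versions = sorted({version for _host, _role, version in inventory})
    return versions if len(versions) > 1 else []


if __name__ == "__main__":
    mixed = mixed_versions(INVENTORY)
    if mixed:
        print("Mixed software versions in the same BGP domain:", mixed)
        for role, versions in versions_by_role(INVENTORY).items():
            print(f"  {role}: {sorted(versions)}")
```

A check like this does not prove two versions are incompatible; it only
prompts someone to compare release notes before the first NLRI is advertised.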

Version X was in use on virtually all DC egress boxes; version Y was in use
on the route reflectors. The moment the first EVPN NLRI was advertised, the
entire backbone melted down. Dept-wide alert issued (at night), people trying
to log on to the VPN. Oh wait, the VPN requires yubikey, which requires the
corp network to access the interwebs, which is not accessible due to said
issue.
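
The access problem is a dependency chain that ends in the thing that broke.
A rough sketch, assuming a hypothetical dependency map (the service names and
the corp-network-on-backbone edge are illustrative assumptions), of how to
list everything that becomes unreachable when one component fails:

```python
# Hypothetical dependency map: each service lists what it needs to function.
# The names are illustrative; the chain mirrors the one described above.
DEPENDS_ON = {
    "vpn": ["yubikey-auth"],
    "yubikey-auth": ["corp-network"],
    "corp-network": ["backbone"],  # assumption: corp network rides the backbone
    "backbone": [],
}


def unreachable_when_down(depends_on, failed):
    """Return the sorted list of services that transitively depend on `failed`."""

    def needs(service, seen):
        if service == failed:
            return True
        if service in seen:
            # Already visited on this walk; avoids looping on cyclic maps.
            return False
        seen.add(service)
        return any(needs(dep, seen) for dep in depends_on.get(service, []))

    return sorted(s for s in depends_on if s != failed and needs(s, set()))


if __name__ == "__main__":
    print(unreachable_when_down(DEPENDS_ON, "backbone"))
```

If the remote-access path shows up in that list for a backbone failure, the
tooling needed to fix the outage goes down with the outage.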

And, despite my complaining since the day I was hired, there was no
out-of-band network.

I didn't stay much longer after that.

Thanks,

Sabri 


