Data Center testing

Jack Bates jbates at brightok.net
Wed Aug 26 14:22:12 UTC 2009


James Hess wrote:
> Config checking can't say much about silent hardware failures.
> Unanticipated problems are likely to arise in failover systems,
> especially complicated ones.  A failover system that has not been
> periodically verified may not work as designed.
> 

I've seen 3-4 failover failures in the last year alone on the SONET
transport gear. In almost every case, the backup card was already dead
when the primary either died or induced errors that caused the telco to
switch to the backup card. I have no doubt that they haven't been
testing. While it didn't affect most of my network, I have a few
customers that aren't multihomed, and it wiped them out in the middle of
the day for up to 3 hours.

> There can be other types of errors:
> Possibly there is a damaged patch cable, dying port, failing power
> supply, or other hardware on the warm spare that has silently degraded
> and its poor condition won't be detected (until it actually tries
> to take a heavy workload, blows a fuse, eats a transceiver, and
> everything just falls apart).
> 

Lots of weird things to test for. I remember once rebooting a c5500 that
had been cruising along for 3 years, and the bootup diag detected half a
linecard as bad, even though it had been running decently up until the
reload. Over the years, I think I've seen or detected everything you
mentioned, either during routine testing or in production "oh crap"
events.

Jack





