Data Center testing
Jack Bates
jbates at brightok.net
Wed Aug 26 14:22:12 UTC 2009
James Hess wrote:
> Config checking can't say much about silent hardware failures.
> Unanticipated problems are likely to arise in failover systems,
> especially complicated ones. A failover system that has not been
> periodically verified may not work as designed.
>
I've seen 3-4 failover failures in the last year alone on the SONET
transport gear. In almost every case, the backup cards were dead when the
primary either died or induced errors, causing the telco to switch to the
backup card. I have no doubt that they haven't been testing. While it
didn't affect most of my network, I have a few customers that aren't
multihomed, and it wiped them out in the middle of the day for up to 3 hours.
> There can be other types of errors:
> Possibly there is a damaged patch cable, dying port, failing power
> supply, or other hardware on the warm spare that has silently degraded
> and its poor condition won't be detected (until it actually tries
> to take a heavy workload, blows a fuse, eats a transceiver, and
> everything just falls apart).
>
Lots of weird things to test for. I remember once rebooting a c5500 that
had been cruising along for 3 years, and the bootup diag flagged half a
linecard as bad, even though it had been running decently up until the reload.
Over the years, I think I've seen or detected everything you mentioned,
either during routine testing or in production "oh crap" events.
Jack