Data Center testing
deepak at ai.net
Mon Aug 24 15:03:51 CDT 2009
Thanks for the kind words Ken.
Power failure testing and network testing are very different disciplines.
We operate from the point of view that if a failure is going to occur, it is far better for it to occur during scheduled testing, when we have the resources on-site to address it (as opposed to an unplanned event during a hurricane). Not everyone shares this philosophy.
This is one of the reasons we do full live load transfer tests on power, monthly or bimonthly, at every facility we own and control during the morning hours (~10:00am local time on a weekday, run on gensets for up to two hours). Of course there is sufficient staff and contingency planning on-site to handle almost anything that comes up. The goal is to have a measurable "good" outcome at our highest reasonable load levels [temperature, data load, etc.].
We don't hesitate to show our customers and auditors our testing and maintenance logs, go over our procedures, etc. They can even watch events if they want (we provide the ear protection). I don't think any facility of any significant size can operate differently and do it well.
This is NOT advisable for folks who do not do proper preventative maintenance on their transfer busways, PDUs, switches, batteries, transformers and of course generators. The goal is to identify questionable relays, switches, breakers and other items that may fail in an actual emergency.
On the network side, during scheduled maintenance we do live failovers -- sometimes as dramatic as pulling the cable without preemptively removing traffic. Part of *our* procedures is to make sure traffic reroutes and the network heals the way it is supposed to before the work actually starts. Often network and topology changes happen over time and no one has had a chance to actually test that all the "glue" works right. Regular planned maintenance (if you have a fast reroute capability in your network) is a very good way to handle it.
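A minimal sketch of how the "did it reroute and heal" check could be scored, assuming you run a steady stream of probes across the path while the cable is pulled. The Probe record, the 100 ms probe interval and the 1000 ms budget are all hypothetical and only for illustration; this is not any particular facility's tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Probe:
    t_ms: int   # time the probe was sent, in ms since the test began
    ok: bool    # True if a reply came back


def longest_outage_ms(probes: List[Probe]) -> Optional[int]:
    """Longest gap from the first lost probe of a run to the next reply.

    Returns 0 if nothing was lost, or None if the path was still down
    when the observations ended (it never healed).
    """
    longest = 0
    loss_start = None
    for p in sorted(probes, key=lambda p: p.t_ms):
        if not p.ok and loss_start is None:
            loss_start = p.t_ms          # a new run of losses begins
        elif p.ok and loss_start is not None:
            longest = max(longest, p.t_ms - loss_start)
            loss_start = None            # path has healed
    return None if loss_start is not None else longest


def healed_within(probes: List[Probe], budget_ms: int) -> bool:
    """True only if the path healed and every outage fit the budget."""
    outage = longest_outage_ms(probes)
    return outage is not None and outage <= budget_ms


# Hypothetical probe log: cable pulled around 200 ms, traffic back by 400 ms.
probes = [Probe(0, True), Probe(100, True), Probe(200, False),
          Probe(300, False), Probe(400, True), Probe(500, True)]
print(longest_outage_ms(probes))      # 200
print(healed_within(probes, 1000))    # True
```

Treating "never healed" as a distinct result (None) rather than a large number keeps a failed reroute from being mistaken for a slow one.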
For sensitive trunk links and non-invasive maintenance, it is nice to softly remove traffic via local pref or whatever in advance of the maintenance to minimize jitter during a major event.
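To illustrate why lowering local pref moves traffic off a link ahead of time, here is a rough sketch of a heavily simplified BGP decision (highest local preference first, then shortest AS path; the real decision process has many more steps). The Route record, the link names and the preference values 100 and 50 are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Route:
    via: str          # hypothetical name of the link the route was learned over
    local_pref: int   # higher is preferred
    as_path_len: int  # shorter is preferred, used only as a tie-breaker here


def best_path(routes: List[Route]) -> Route:
    # Simplified selection: highest local pref wins, then shortest AS path.
    return min(routes, key=lambda r: (-r.local_pref, r.as_path_len))


def drain(routes: List[Route], link: str, lowered_pref: int = 50) -> List[Route]:
    # Lower local pref on everything learned via `link` so other paths win.
    return [Route(r.via, lowered_pref, r.as_path_len) if r.via == link else r
            for r in routes]


routes = [Route("trunk-a", 100, 2), Route("trunk-b", 100, 3)]
before = best_path(routes).via                     # "trunk-a"
after = best_path(drain(routes, "trunk-a")).via    # "trunk-b"
print(before, after)
```

Once the best path has moved to the other trunk, the link being worked on carries little or no traffic when it is actually taken down.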
As part of your plan, be prepared for things like connectors (or cables) breaking and have a plan for what you do if that occurs. Have a plan or a rain-date if a connector takes a long time to get out or the blade it sits in gets damaged. This stuff looks pretty while it's running and you don't want something that has been friction-frozen to ruin your window.
All of this works swimmingly until you find a vendor (X) bug. :) Not for the faint-of-heart.
Anyone who has more specific questions, I'll be glad to answer off-line.
> I know Peer1 in Vancouver regularly send out notifications of
> "non-impacting" generator load testing, like monthly. Also InterXion
> in Dublin, Ireland have occasionally sent me notification that there
> was a power outage of less than a minute however their backup
> successfully took the load.
> I only remember one complete outage in Peer1 a few years ago... Never
> seen any outage in InterXion Dublin.
> Also I don't ever remember any power failure at AiNet (Deepak will
> probably elaborate)
> 2009/8/24 Dan Snyder <sliplever at gmail.com>:
> > Does anyone know of any data centers that do failure testing of
> > networking equipment regularly? I mean to verify that everything
> > fails over properly after changes have been made over time. Are
> > there any best practice guides for doing this?
> > Thanks,
> > Dan