Data Center testing

Wed Aug 26 03:45:07 UTC 2009

There's more to data integrity in a data center (well, anything powered,
that is) than network configurations.  There's the loading of individual
power outlets, UPS loading, UPS battery replacement cycles, loading of
circuits, backup lighting, etc.  And the only way to know whether something
really works as designed is to test it.  That's why we have financial
auditors, military exercises, fire drills, etc.

So while your analogy emphasizes the importance of having good processes in
place to catch the problems up front, it doesn't eliminate the need to
actually throw the switch.

Frank

-----Original Message-----
From: Jeff Aitken [mailto:jaitken at aitken.com] 
Sent: Tuesday, August 25, 2009 7:53 AM
To: Dan Snyder
Cc: NANOG list
Subject: Re: Data Center testing

On Mon, Aug 24, 2009 at 09:38:38AM -0400, Dan Snyder wrote:
> We have done power tests before and had no problem.  I guess I am looking
> for someone who does testing of the network equipment outside of just power
> tests.  We had an outage due to a configuration mistake that became apparent
> when a switch failed.  It didn't cause a problem however when we did a power
> test for the whole data center.

Dan,

With all due respect, if there are config changes being made to your 
devices that aren't authorized or in accordance with your standards (you
*do* have config standards, right?) then you don't have a testing problem,
you have a data integrity problem.  Periodically inducing failures to catch
them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is
rancid [1], which logs in to your routers and collects copies of configs
and other info, all of which gets stored in a central repository.  By
default, you will be notified via email of any changes.  An even better
approach than scanning the hourly config diff emails is to develop scripts
that compare the *actual* state of the network with the *desired* state and
alert you if the two are not in sync.  Obviously this is more work because
you have to have some way of describing the desired state of the network in
machine-parsable format, but the benefit is that you know in pseudo-realtime
when something is wrong, as opposed to finding out the next time a device
fails.  Rancid diffs + tacacs logs will tell you who made the changes, and
with that info you can get at the root of the problem.
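
As a rough illustration of the actual-vs-desired comparison described above,
here is a minimal Python sketch.  It assumes IOS-style configs in which
indented lines belong to the preceding unindented line, and it assumes you
already have both texts on hand (for example, a template you maintain and a
copy collected from the device).  The function names and sample configs are
made up for the example and are not part of rancid.

```python
def parse_config(text):
    """Turn a config into a set of (parent, statement) pairs.

    Assumption: indented lines are children of the last unindented line,
    and lines starting with '!' are comments or separators.
    """
    statements = set()
    parent = ""
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("!"):
            continue
        if raw[0].isspace():
            statements.add((parent, raw.strip()))
        else:
            parent = raw.strip()
            statements.add(("", parent))
    return statements


def audit(desired_text, actual_text):
    """Report statements missing from, or unexpected on, the device."""
    desired = parse_config(desired_text)
    actual = parse_config(actual_text)
    return {
        "missing": sorted(desired - actual),
        "unexpected": sorted(actual - desired),
    }


# Hypothetical sample data, for illustration only.
DESIRED = """
interface GigabitEthernet0/1
 description uplink-a
 spanning-tree portfast
!
interface GigabitEthernet0/2
 description uplink-b
"""

ACTUAL = """
interface GigabitEthernet0/1
 description uplink-a
!
interface GigabitEthernet0/2
 description uplink-b
 shutdown
"""

report = audit(DESIRED, ACTUAL)
for kind, items in report.items():
    for parent, statement in items:
        print(kind, "|", parent or "(global)", "|", statement)
```

A real version would need to cope with deeper nesting and with lines that
legitimately differ per device (hostnames, addresses), but even a crude
set difference like this flags drift without anyone reading diff emails.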

Having said that, every planned maintenance activity is an opportunity to
run through at least some failure cases.  If one of your providers is going
to take down a longhaul circuit, you can observe how traffic re-routes and
verify that your metrics and/or TE are doing what you expect.  Any time you
need to load new code on a device you can test that things fail over
appropriately.  Of course, you have to be willing to just shut the device
down without draining it first, but that's between you and your customers.
Link and/or device failures will generate routing events that could be used
to test convergence times across your network, etc.
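
One way to turn those routing events into a convergence number, sketched in
Python under some assumptions: you know when the failure was induced, you
have timestamps for the routing events that followed (from syslog or
similar), and you treat the network as converged once events stop arriving
for a quiet period.  The function name and the 5-second quiet period are
arbitrary choices for the example.

```python
from datetime import datetime, timedelta


def convergence_time(failure_at, event_times, quiet=timedelta(seconds=5)):
    """Time from the failure to the last event of the burst that follows it.

    The burst ends at the first gap between events longer than `quiet`.
    Returns None if no events were seen at or after the failure.
    """
    later = sorted(t for t in event_times if t >= failure_at)
    if not later:
        return None
    last = later[0]
    for t in later[1:]:
        if t - last > quiet:
            break
        last = t
    return last - failure_at


# Hypothetical timestamps, for illustration only.
failure = datetime(2009, 8, 25, 12, 0, 0)
events = [
    datetime(2009, 8, 25, 12, 0, 0, 200000),
    datetime(2009, 8, 25, 12, 0, 0, 900000),
    datetime(2009, 8, 25, 12, 0, 1, 400000),
    datetime(2009, 8, 25, 12, 0, 30),  # unrelated later event
]
print(convergence_time(failure, events))
```

This only measures control-plane churn as seen in the logs; data-plane
probes are still needed to see when traffic actually recovered.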

The key is to be prepared.  The more instrumentation you have in place
prior to the test, the better you will be able to analyze the impact of the
failure.  An experienced operator can often tell right away when looking at
a bunch of MRTG graphs that "something doesn't look right", but that doesn't
tell you *what* is wrong.  There are tools (free and commercial) that can
help here, too.  Have a central syslog server and some kind of log reduction
tool in place.  Have beacons/probes deployed, in both the control and data
planes.  If you want to record, analyze, and even replay routing system
events, you might want to take a look at the Route Explorer product from
Packet Design [2].
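
To give an idea of what "log reduction" means in practice, here is a small
Python sketch that collapses syslog messages into templates by masking IP
addresses and numbers, then counts how often each template occurs.  The
masking rules and sample messages are assumptions for illustration; real
tools are considerably smarter about this.

```python
import re
from collections import Counter

_IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_NUM = re.compile(r"\d+")


def template(message):
    """Mask the variable parts of a message so similar lines compare equal."""
    message = _IP.sub("<ip>", message)
    return _NUM.sub("<n>", message)


def reduce_log(messages):
    """Return (template, count) pairs, most frequent first."""
    return Counter(template(m) for m in messages).most_common()


# Hypothetical sample messages, for illustration only.
SAMPLE = [
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet0/2, changed state to down",
    "%BGP-5-ADJCHANGE: neighbor 192.0.2.1 Down",
    "%LINK-3-UPDOWN: Interface GigabitEthernet1/7, changed state to down",
    "%BGP-5-ADJCHANGE: neighbor 192.0.2.9 Down",
]

for text, count in reduce_log(SAMPLE):
    print(count, text)
```

During a failure test, a summary like this makes it much easier to spot
the one message type you did not expect among thousands you did.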

You said "switch failure" above, so I'm guessing that this doesn't apply
to you, but there are also good network simulation packages out there.
Cariden [3] and WANDL [4] can build models of your network based on actual
router configs and let you simulate the impact of various scenarios,
including device/link failures.  However, these tools are more appropriate
for design and planning than for catching configuration mistakes, so
they may not be what you're looking for in this case.

--Jeff

[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php