Monitoring highly redundant operations

Greg A. Woods woods at weird.com
Thu Jan 25 00:49:01 UTC 2001


[ On January 24, 2001 at 14:31:20 (-0800), Sean Donelan wrote: ]
> Subject: Monitoring highly redundant operations
>
> But he does raise an interesting problem.  How do you know if your
> highly redundant, diverse, etc. system has a problem?  With an
> ordinary system it's easy.  It stops working.  In a highly redundant
> system you can start losing critical components, but not be able to
> tell if your operation is in fact seriously compromised, because it
> continues to "work."

The real problem is that the most critical part of the puzzle has _not_
been made "highly redundant" in this case.

If not even one of your registered authoritative DNS servers is
responding from the point of view of some user on the Internet, then
your hosts (MX records, etc.) don't exist for that user: their e-mail
to you may well bounce and they will not be able to view your web
pages.

The only way to make your DNS highly redundant and keep it working is
to ensure the maximum possible dispersion of _registered_ authoritative
servers throughout the network geography, just as the root and TLD
servers are widely distributed.
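As a rough illustration, here is a minimal sketch of checking that each
of a zone's nameservers answers from your local vantage point.  It
assumes the third-party "dnspython" library (version 2.x), and the zone
name is a placeholder; for simplicity it queries the zone's own NS set
rather than walking up to the parent delegation, which is where the
_registered_ servers actually live.

    #!/usr/bin/env python3
    # Sketch: check that each nameserver listed for a zone answers from
    # this vantage point.  Assumes the third-party "dnspython" library;
    # the zone name below is a placeholder.
    import dns.resolver

    ZONE = "example.com"   # hypothetical zone to check

    def nameserver_addresses(zone):
        """Return the addresses of the zone's NS hosts."""
        addresses = {}
        for ns in dns.resolver.resolve(zone, "NS"):
            host = str(ns.target)
            addresses[host] = [str(a) for a in dns.resolver.resolve(host, "A")]
        return addresses

    def probe(server_addr, zone):
        """Ask one server directly for the zone's SOA; True if it answers."""
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server_addr]
        r.lifetime = 5  # seconds before we call the server unreachable
        try:
            r.resolve(zone, "SOA")
            return True
        except Exception:
            return False

    if __name__ == "__main__":
        for host, addrs in nameserver_addresses(ZONE).items():
            for addr in addrs:
                status = "ok" if probe(addr, ZONE) else "NOT RESPONDING"
                print(f"{host} ({addr}): {status}")

Of course a single vantage point only tells you about your own corner
of the network, which is exactly the problem being discussed below.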

Note that this is just as important (if not more so!) for any delegated
sub-domains in your zone, and equally important for any related zones
(e.g. passport.com in this case).

The only really effective way to measure the effectiveness of your
nameserver dispersion is to make it terribly easy for anyone anywhere
to report any problems they perceive to you, via as many alternate
channels as possible -- you can't be everywhere at once, but if you
make it easy for people to send you information out-of-band then you'll
get lots of early warning when various chunks of the Internet can't see
your nameservers and/or your other hosts.

Now, if the majority of DNS cache server operators don't get too
paranoid, you could try to set up a mesh of equally widely dispersed
monitoring systems that cross-check the availability of test records
from your zone by querying any number of regional and remote cache
servers.  You'd make the TTL of these test records the minimum
recommended by major nameserver software vendors (300 seconds?) and
then query the whole group every TTL+N seconds.  You're probably going
to have to report your results out-of-band, and/or have independent
people at each monitoring site who are responsible for investigating
problems immediately and doing what they can locally to resolve them.
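A minimal sketch of one node in such a mesh might look like the
following.  It again assumes the "dnspython" library; the test record
name, the cache server addresses, and the TTL/N values are all
placeholders, and a real node would report over an out-of-band channel
rather than printing to a terminal.

    #!/usr/bin/env python3
    # Sketch of one monitoring node: every TTL+N seconds, ask a list of
    # regional/remote caching servers for a short-TTL test record and
    # flag any that can't produce an answer.  The "dnspython" library,
    # the cache-server addresses, and the record name are assumptions.
    import time
    import dns.resolver

    TEST_RECORD = "dns-probe.example.com"  # hypothetical record in your zone
    TTL = 300                              # matches the TTL on the test record
    N = 30                                 # slack so caches have expired it
    CACHE_SERVERS = ["192.0.2.53", "198.51.100.53", "203.0.113.53"]  # placeholders

    def check(cache_addr):
        """Query one caching server for the test record; True on success."""
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [cache_addr]
        r.lifetime = 10
        try:
            r.resolve(TEST_RECORD, "A")
            return True
        except Exception:
            return False

    def report(cache_addr, ok):
        # Stand-in for an out-of-band report (pager, phone, separate network).
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {cache_addr}: "
              f"{'answered' if ok else 'NO ANSWER'}")

    if __name__ == "__main__":
        while True:
            for cache in CACHE_SERVERS:
                report(cache, check(cache))
            time.sleep(TTL + N)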

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods at acm.org>      <robohack!woods>
Planix, Inc. <woods at planix.com>; Secrets of the Weird <woods at weird.com>
