Monitoring highly redundant operations
Howard C. Berkowitz
hcb at clark.net
Thu Jan 25 04:23:11 UTC 2001
>Sean Donelan <sean at donelan.com> observed,
>But he does raise an interesting problem. How do you know if your
>highly redundant, diverse, etc. system has a problem? With an ordinary
>system it's easy. It stops working. In a highly redundant system you
>can start losing critical components, but not be able to tell if
>your operation is in fact seriously compromised, because it continues
>to "work."
I suspect answers here aren't going to be found in traditional
engineering, but more in a discipline that deals with extremely
complex systems where a full failure may be irretrievable. I'm
thinking of clinical medicine.
The initial problem there indeed may be subtle. I have a substantial
amount of medical experience, but it was easily 2-3 hours before I
recognized, in myself, early symptoms of a cardiac problem. It seemed
so much like indigestion, and then a pulled muscle. I remember
relaxing, and then recognizing a chain of minor
events...sweating...mild but persistent left arm pain radiating into
the chest...shortness of breath...and then a big OH SH*T.
My first point is having what physicians call a "high index of
suspicion" when seeing a combination of minor symptoms. I suspect
that we need to be looking for patterns of network symptoms that are
sensitive (i.e., have a high chance of being positive when there is a
problem) but not necessarily selective (i.e., they need not have a low
probability of false positives).
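To make the two terms concrete, here is a minimal sketch of how they could be computed for a network check whose past alerts have been compared with confirmed problems. "Selective" here corresponds to what clinicians usually call "specific." The function names and the counts are hypothetical, chosen only for illustration.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real problems that the check flagged."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of healthy periods that the check correctly left alone."""
    return true_neg / (true_neg + false_pos)


# Hypothetical history: 20 real problems (19 flagged, 1 missed) and
# 80 healthy periods (60 quiet, 20 false alarms).
first_pass_sensitivity = sensitivity(true_pos=19, false_neg=1)
first_pass_specificity = specificity(true_neg=60, false_pos=20)
```

A first-pass "index of suspicion" check would favor the first number and tolerate a mediocre second one, since its only job is to trigger a closer look.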
Once the index of suspicion is triggered, the next thing to look for
is not necessarily direct indication of a problem, but a more
selective surrogate marker: objective criteria that, especially when
analyzed as trends, point in the direction of an impending failure.
In emergency medicine, the EKG often isn't as informative as TV drama
would suggest. A constantly improving area, however, has been
measurement, especially successive measurements, of blood chemicals
that indicate cardiac tissue is being damaged or destroyed.
Early in the use of cardiac-related enzymes, it was a matter of
considering several nonspecific factors in combination. SGOT, CPK
and LDH are all enzymes that will elevate with tissue damage. The
problem is that any one can be elevated by problems in different
areas: liver and heart, heart and skeletal muscle, etc. You need to
look for elevations in a couple of areas that are associated with the
heart, AND look for normal values for other tests that rule out liver
disease, etc. The biochemical techniques have constantly improved,
but you still need to look at several factors.
The second-phase analogy for networking could be more frequent
polling and trending, or relatively benign tests such as traceroutes,
etc.
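As a rough sketch of the trending idea, successive polled values of one metric (error counts, latency, and so on) could be reduced to a least-squares slope, much as successive lab values are read as a trend rather than one at a time. The function names, the sample values, and the threshold below are assumptions for illustration, not taken from any particular monitoring tool.

```python
from statistics import mean


def trend_slope(samples):
    """Least-squares slope of equally spaced samples, in units per poll."""
    n = len(samples)
    if n < 2:
        return 0.0  # a single reading has no trend
    xs = range(n)
    x_bar = mean(xs)
    y_bar = mean(samples)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def is_worsening(samples, slope_threshold):
    """True when the metric is rising faster than the chosen threshold."""
    return trend_slope(samples) > slope_threshold


# Hypothetical CRC error counts from four successive polls of one interface.
crc_errors = [1, 2, 3, 4]
rising = is_worsening(crc_errors, slope_threshold=0.5)
```

No single reading here looks alarming; it is the steady climb across polls that would justify polling more often.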
Only after there is a clear clinical problem, or several pieces of
laboratory evidence, does a physician jump to more invasive tests, or
begin aggressive treatment on suspicion. In like manner, you
wouldn't do a processor-intensive trace on a router, or do a possibly
disruptive switch to backup links, unless you had reasonable
confidence that there was a problem.
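The escalation rule could be sketched roughly as follows: stay with routine monitoring when nothing is abnormal, move to benign tests on a single symptom, and reserve invasive steps for a clear failure or several agreeing indicators. The indicator names, the three action labels, and the default of two agreeing indicators are all hypothetical.

```python
from typing import Mapping


def next_action(clear_failure: bool,
                indicators: Mapping[str, bool],
                min_agreeing: int = 2) -> str:
    """Pick the least disruptive step that the evidence justifies."""
    abnormal = sum(1 for flagged in indicators.values() if flagged)
    if clear_failure or abnormal >= min_agreeing:
        # e.g. a processor-intensive trace, or a switch to backup links
        return "invasive"
    if abnormal >= 1:
        # e.g. more frequent polling, traceroutes
        return "benign"
    return "routine"


# Hypothetical indicator set: one symptom alone only earns benign tests.
action = next_action(False, {"latency_trend": True, "crc_trend": False})
```

The threshold is a judgment call; raising `min_agreeing` trades slower escalation for fewer needless disruptions.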