Monitoring highly redundant operations

Howard C. Berkowitz hcb at clark.net
Thu Jan 25 04:23:11 UTC 2001


>Sean Donelan <sean at donelan.com> observed,



>But he does raise an interesting problem.  How do you know if your
>highly redundant, diverse, etc. system has a problem?  With an ordinary
>system it's easy.  It stops working.  In a highly redundant system you
>can start losing critical components, but not be able to tell if
>your operation is in fact seriously compromised, because it continues
>to "work."

I suspect answers here aren't going to be found in traditional 
engineering, but more in a discipline that deals with extremely 
complex systems where a full failure may be irretrievable.  I'm 
thinking of clinical medicine.

The initial problem there may indeed be subtle. I have a substantial 
amount of medical experience, but it was easily 2-3 hours before I 
recognized, in myself, early symptoms of a cardiac problem. It seemed 
so much like indigestion, and then a pulled muscle. I remember 
relaxing, and then recognizing a chain of minor 
events...sweating...mild but persistent left arm pain radiating into 
the chest...shortness of breath...and then a big OH SH*T.

My first point is the need for what physicians call a "high index of 
suspicion" when seeing a combination of minor symptoms.  I suspect 
that we need to be looking for patterns of network symptoms that are 
sensitive (i.e., high chance of being positive when there is a 
problem) but not necessarily selective (i.e., low probability of 
false positives).
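
A rough sketch of such a check in Python, under stated assumptions: the 
symptom names and the threshold of two co-occurring symptoms are 
hypothetical, picked only to illustrate a test tuned for sensitivity 
rather than selectivity.

```python
def index_of_suspicion(symptoms, min_cooccurring=2):
    """Flag a pattern of minor symptoms.

    symptoms: dict mapping a symptom name to True if currently observed.
    Returns (suspicious, active_names). The default threshold of 2 is an
    assumption, deliberately low so that real problems are rarely missed,
    at the cost of false positives.
    """
    active = sorted(name for name, seen in symptoms.items() if seen)
    return len(active) >= min_cooccurring, active


# Hypothetical symptom names; none of them is alarming on its own.
observed = {
    "one_of_two_uplinks_down": True,
    "slow_rise_in_retransmits": True,
    "routing_flap_last_hour": False,
}
suspicious, active = index_of_suspicion(observed)
```

A positive result here only raises suspicion; it is not a diagnosis.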

Once the index of suspicion is triggered, the next thing to look for 
is not necessarily a direct indication of a problem, but a more 
selective surrogate marker: objective criteria that, especially when 
analyzed as trends, point in the direction of an impending failure. 
In emergency medicine, the EKG often isn't as informative as TV drama 
would suggest.  A constantly improving area, however, has been 
measurement, especially successive measurements, of blood chemicals 
that indicate cardiac tissue is being damaged or destroyed.

Early in the use of cardiac-related enzymes, it was a matter of 
considering several nonspecific factors in combination.  SGOT, CPK 
and LDH are all enzymes that will elevate with tissue damage.  The 
problem is that any one can be elevated by problems in different 
areas:  liver and heart, heart and skeletal muscle, etc.  You need to 
look for elevations in a couple of areas that are associated with the 
heart, AND look for normal values for other tests that rule out liver 
disease, etc.  The biochemical techniques have constantly improved, 
but you still need to look at several factors.
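
The same reasoning, elevated markers associated with one area AND 
normal values for the markers that would implicate another, might be 
sketched as below.  The marker names are hypothetical network 
stand-ins for the enzymes, not measurements I am claiming any 
particular platform exposes.

```python
def points_to(elevated, associated, rule_out):
    """Return True if the evidence points at one suspected area.

    elevated:   set of marker names currently above their baseline
    associated: markers that should ALL be elevated for this area
    rule_out:   markers that must be normal, since they would implicate
                a different area instead
    """
    return associated <= elevated and elevated.isdisjoint(rule_out)


# Hypothetical markers: CRC errors and interface resets together suggest
# a failing link, provided CPU load (a different cause) is normal.
link_markers = {"crc_errors", "interface_resets"}
other_cause = {"cpu_load"}

link_suspected = points_to({"crc_errors", "interface_resets"},
                           link_markers, other_cause)
ambiguous = points_to({"crc_errors", "interface_resets", "cpu_load"},
                      link_markers, other_cause)
```

Here link_suspected is True and ambiguous is False: once the rule-out 
marker is also elevated, the pattern no longer singles out the link.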

The second-phase analogy for networking could be more frequent 
polling and trending, or relatively benign tests such as traceroutes, 
etc.
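
As a minimal sketch of polling and trending, assuming equally spaced 
samples of some counter: fit a least-squares slope to successive 
measurements, and poll more often once suspicion has been raised.  The 
300-second and 30-second intervals are illustrative assumptions only.

```python
def trend_slope(samples):
    """Least-squares slope of equally spaced samples (units per poll)."""
    n = len(samples)
    if n < 2:
        return 0.0  # not enough points to define a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def next_poll_interval(suspicious, normal_s=300, suspect_s=30):
    """Poll more frequently while the index of suspicion is raised."""
    return suspect_s if suspicious else normal_s


# Hypothetical error counts from four successive polls.
slope = trend_slope([1, 2, 3, 4])
interval = next_poll_interval(suspicious=slope > 0)
```

A steadily positive slope on a counter that is normally flat is the 
kind of successive-measurement evidence the enzyme analogy suggests.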

Only after there is a clear clinical problem, or several pieces of 
laboratory evidence, does a physician jump to more invasive tests, or 
begin aggressive treatment on suspicion.  In like manner, you 
wouldn't do a processor-intensive trace on a router, or do a possibly 
disruptive switch to backup links, unless you had reasonable 
confidence that there was a problem.
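
That escalation rule could be written down roughly as follows.  The 
tier names and the requirement of two independent findings are 
assumptions made for illustration, not established practice.

```python
def choose_action(suspicion, surrogate_findings,
                  clear_failure=False, min_findings=2):
    """Pick the least invasive action the evidence justifies.

    suspicion:          True if a pattern of minor symptoms was seen
    surrogate_findings: count of independent objective findings (trends,
                        benign tests) pointing at the same problem
    clear_failure:      True if the problem is already obvious
    """
    if clear_failure or surrogate_findings >= min_findings:
        return "invasive"   # e.g. intensive trace, switch to backup links
    if suspicion:
        return "benign"     # e.g. more frequent polling, traceroutes
    return "routine"        # normal monitoring only


action = choose_action(suspicion=True, surrogate_findings=1)
```

With suspicion raised but only one finding, the result is "benign": 
keep gathering evidence before doing anything disruptive.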




