outages, quality monitoring, trouble tickets, etc.

ulmo at Q.Net
Thu Dec 7 01:55:18 UTC 1995


> Since error reporting sucks in most network applications, it becomes
> the fault of whatever help desk happens to take the customer's phone call.

Bingo!

For ten years now, I've wanted *all* my programs to automatically
detect larger-than-usual delays (more than 5ms?) and start giving
*exact* status reports, such as "your host is doing a DNS query",
getting more detailed as the delay gets worse, pulling the status and
error information dynamically from the intermediaries: "host x.y.z is
querying the nameservers of the BAZ.COM domain on behalf of your host
flappy.c.e.baz.com, and out of seven root servers 4 have failed,
currently trying ... a.root-servers.net ... ICMP unreachable from
router X-Y-Z based on data obtained from ...", and on and on until
either utter failure or success.  The greater the delay beyond what's
reasonable, the more status information gets queried and automatically
offered; if any point freezes, you'd know who is responsible.
Programs should use a common library of routines that would pop up a
window naming the organization responsible for the error at hand and
the proper email address to send complaints to; an automated,
standard, computer-interpretable complaint should be registered
automatically, similar to syslog but internetted, with a button the
user can press to add comments, opinions, etc., or even send further
emails (attaching CC's, making local copies, and so on).
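
Concretely, and purely as a sketch (the thresholds, messages, and
hooks below are all invented, not any existing interface or standard),
a lookup routine that escalates its status reporting as the delay
grows might look like this in Python:

    #!/usr/bin/env python
    # Hypothetical sketch: escalate diagnostic detail as an operation's
    # delay grows past what's reasonable.  The thresholds and messages
    # are invented for illustration; a real version would pull them
    # dynamically from the intermediaries.

    import socket
    import threading
    import time

    # Each (threshold_seconds, message) pair adds detail as the wait grows.
    ESCALATION = [
        (0.005, "your host is doing a DNS query"),
        (1.0,   "still waiting on the nameserver; querying intermediaries"),
        (5.0,   "query looks stuck; looking up the responsible party"),
    ]

    def resolve_with_status(hostname, status=print):
        """Resolve hostname, emitting progressively more detailed
        status reports the longer the lookup takes."""
        result = {}

        def worker():
            try:
                result["addr"] = socket.gethostbyname(hostname)
            except socket.error as e:
                result["error"] = e

        t = threading.Thread(target=worker)
        t.start()
        start = time.time()
        reported = 0
        while t.is_alive():
            elapsed = time.time() - start
            # Emit each escalation message once, as its threshold passes.
            while reported < len(ESCALATION) and elapsed >= ESCALATION[reported][0]:
                status("[%6.3fs] %s" % (elapsed, ESCALATION[reported][1]))
                reported += 1
            t.join(timeout=0.05)

        if "error" in result:
            # Here the common library would pop up the responsible
            # organization's window and register the automated complaint.
            status("utter failure: %s" % result["error"])
        return result.get("addr")

    if __name__ == "__main__":
        print(resolve_with_status("www.example.com"))

The same loop would wrap any blocking operation (connect(), a page
fetch, whatever), with the canned strings replaced by status pulled
live from the intermediaries.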

I'm amazed it even works 50% of the time, too.  I understand why it
doesn't work more often: a zillion pieces.  We could use the tools at
hand (I mean computers) better to inform us of these problems.

Yes, I want it to be like my Volvo -- a light comes on whenever a
bulb stops working, but better yet, the computer can tell me which
light is out and can automatically order the spare part from the
right factory.  When I'm in Mosaic or Netscape, there's no reason it
shouldn't tell me that a diode in a CSU/DSU just blew out in Wyoming,
owned by Joe Bizzlededof; that his staff has an average response
time of six hours to fix such a problem; that his functionaries have
been notified; and whether or not I should expect workarounds and
what actions I would have to take in order to use them (mostly none:
wait n minutes for the routers to find another route and it will
work again?).  And the same thing for a route missing from a router
table.

Is this crazy?

I don't think so!  Everything is getting so complex that we need:

1) to know when users can't use the network; hence the automatic
   feedback mechanisms from end to end
2) to know what to fix to minimize #1 (we're not all trillionaires
   who get to buy redundant everything; we have to know which items
   manufactured by whom, and which programs maintained by whom, are
   bound to be more reliable than others, so we can choose good items
   or know which ones need redundancy or other protective measures)
3) users to know exactly who to blame, so that help desks get
   *appropriate* calls rather than *inappropriate* ones (the
   get-calls metaphor is starting to get old; get-email will be more
   and more appropriate), and so that users know which organizations
   not to spend money on.

In many cases, users are help desks fixing other people's problems.
It's all hierarchical, everyone's the top and everyone's the bottom.

I believe programmers should experiment with these things, and
standards should be drafted, specifically for
feedback-to-user-of-actual-problem, for end-to-end automatic error
reporting in *both* directions (so that each side of the connection
and each end of the stack of layers knows what to fix), and for
responsible-party lookup automation (who to *really* bother when
there's just no other way -- right now, 95% of the time).
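
For flavor, a guess at what one of those standard,
computer-interpretable complaint records might carry (Python again;
every field name here is invented, since no such standard exists):

    # Hypothetical sketch of a machine-interpretable error report that
    # could travel end to end in both directions.  All field names are
    # invented for illustration.

    import json
    import time

    def make_error_report(layer, component, responsible_org,
                          contact_email, detail, user_comment=None):
        """Build one end-to-end error report.  Either side of a
        connection could emit these, and each layer of the stack
        could append its own."""
        return {
            "timestamp": time.time(),        # when the failure was seen
            "layer": layer,                  # e.g. "dns", "ip", "physical"
            "component": component,          # the thing that failed
            "responsible_org": responsible_org,
            "contact_email": contact_email,  # who to *really* bother
            "detail": detail,                # machine-readable specifics
            "user_comment": user_comment,    # the button the user presses
        }

    # Example: the Wyoming CSU/DSU scenario, reported up the stack.
    report = make_error_report(
        layer="physical",
        component="CSU/DSU serial-0, somewhere in Wyoming",
        responsible_org="Joe Bizzlededof's network",
        contact_email="noc@example.net",
        detail={"symptom": "carrier lost", "expected_fix_hours": 6},
    )
    print(json.dumps(report, indent=2))

Registering one of these automatically, syslog-style but internetted,
is the easy half; agreeing on the fields is the standards work.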

Bradley Allen
<Ulmo at Q.Net>


