outages, quality monitoring, trouble tickets, etc

Sun Nov 26 08:43:05 UTC 1995

On Sat, 25 Nov 1995, Matt Zimmerman wrote:

> connectivity issues are MORE likely to be caused by interaction with 
> other NSP's.  Dissemination of problem information between providers 
> helps everyone diagnose difficulties and keep their customers better 
> informed with respect to current status and predictions for the near 
> future (solutions).

Agreed, but it has to be done in an "easy" manner.  I'm sure that several
of the NSPs have concerns as to what this information will be used 
for. Everyone likes to portray the image of having 99.98% uptime
whenever possible, even though most folks realize that it just 
plain isn't possible, at least today.  This sort of leads into the 
question of how the various NOCs would integrate with whatever central
repository of information we are shooting to provide.  When provider X
opens a ticket, will it automatically be reflected in the 'central'
database?  I doubt folks will go for that based on security alone.  Or
how about having provider X's NOC staff fire off an Email to
incident-report at outages.com?  How will they be trained or reimbursed
for their time spent on this service?

[..facts about how useless mailing lists are removed..]
> A more interactive shared system (ticket-based?) makes more sense, but 
> may prove far more difficult to design.  Problem classification, impact, 
> severity, and location are all issues here, as well as the problem of 
> associating such a record of a problem with its effects.  That is, when 
> a provider "discovers" a problem, how are they to know if it has already 
> been "registered", and if so, how to reference the information associated 
> with it?

Such an idea is already being discussed in several smoke-filled rooms. :)  
Remedy/ARS has the ability to accept input for incident reports and 
queries to its database via an Email form.  One could write a Web page 
containing the necessary parameters in a form, and then transpose that to an
Email sent to the AR system.  Implementing such a system is really based
around cost issues, as the coding is relatively trivial. (CGIs come to mind)
(I used the above example because it's something we've done in the past 
and I know it works; there are probably others.)
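
As a rough sketch of the Web-form-to-Email idea above: the snippet below
takes the parameters a CGI script would have collected from a form and
turns them into an Email message.  The field names, the "name: value"
body layout, and the example.net/example.org addresses are assumptions
made up for illustration; a real AR system mail template defines its own
fields and format, so treat this as a starting point only.

```python
from email.message import EmailMessage

# Hypothetical field names; a real AR system mail template defines its own.
REQUIRED_FIELDS = ("provider", "severity", "location", "summary")


def clean(value):
    """Collapse whitespace so a value is safe to place on a single line."""
    return " ".join(str(value).split())


def form_to_email(form, sender, recipient):
    """Turn submitted form parameters into an incident-report Email."""
    values = {name: clean(form.get(name, "")) for name in REQUIRED_FIELDS}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError("missing form fields: " + ", ".join(missing))

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Incident report: " + values["summary"]
    # One "name: value" line per field (an assumed layout, not Remedy's own).
    body = "\n".join(f"{name}: {values[name]}" for name in REQUIRED_FIELDS)
    msg.set_content(body + "\n")
    return msg


if __name__ == "__main__":
    report = form_to_email(
        {
            "provider": "Provider X",
            "severity": "high",
            "location": "exchange point (placeholder)",
            "summary": "Packet loss toward peer",
        },
        sender="noc@provider-x.example.net",
        recipient="incident-report@outages.example.org",
    )
    print(report)
```

Actually delivering the message would be one more step, for instance
handing it to smtplib.SMTP(...).send_message(), which is left out here
so the sketch runs without a mail server.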

On the issue of connectivity -- agreed; some lonely site should not 
be allowed to be the only host.  However -- if connectivity between 
certain NSPs also falls apart, you're equally screwed.  Some sort of
distribution of the "centralized" source of information would be needed.

I foresee the most difficult part of the process being (a) convincing
all of the associated Operations groups to share their outage
information.  (b) Providing a simple mechanism for either the customer
service or operations staff to disseminate outage information to the
"server" would be equally challenging.  If step (a) were to be overcome,
I would assume that writing a procedure to fit (b) would follow.

-jh-