Monitoring service that has a human component?

Thu Dec 6 23:05:00 UTC 2018

Hi David - Just a bit of insight from my own experience:

Common reasons monitoring (and the associated escalation process) doesn't
work, leading to issues like the ones you described:
- Inconsistent HTTP response codes across services and service layers
(nginx vs the backend tomcat), which means you can't rely on them.
- Monitoring on arbitrary metrics (90% of something) as opposed to metrics
linked to an actual outcome (response times for example).
- No runbook in place (e.g. telling the engineer which setting to change to
switch maintenance mode on/off).
- No central view of what engineer is doing what to which systems.

A fairly simple example of where I've seen things work pretty well:
- Organisation uses HTTP code monitoring, alerting on 5xx but not 503.
- Services configured (and tested!) to return other, specific 5xx errors, but
keep 503 as a 'known and expected maintenance' mode.
- Runbook in place to let other engineers know what's happening (slack
message for example) and then put up a maintenance page on the reverse proxy.
- Monitor and report on the common 90% metrics (disk space, memory) but don't
alert on them.
- Don't fill up the disk with logs, only to delete them and let it fill up
again.. :)
- Remove all non-actionable alerts.
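
As a rough sketch of the "alert on 5xx but not 503" rule (Python; the
function names are made up for illustration and don't come from any
particular monitoring tool, and treating "no response at all" as an alert
and 4xx as OK are my own assumptions):

```python
from typing import Optional

# 503 is reserved for 'known and expected maintenance' and never pages.
MAINTENANCE_STATUS = 503


def classify_status(status: Optional[int]) -> str:
    """Classify one HTTP check result as 'ok', 'maintenance' or 'alert'."""
    if status is None:
        # Assumption: no response at all (timeout, refused) is a real alert.
        return "alert"
    if status == MAINTENANCE_STATUS:
        return "maintenance"
    if 500 <= status <= 599:
        # Any other 5xx is a specific, unexpected server-side error.
        return "alert"
    # Assumption: everything below 500 is not the monitor's problem here.
    return "ok"


def should_page(status: Optional[int]) -> bool:
    """Only page a human when the result is actionable."""
    return classify_status(status) == "alert"


if __name__ == "__main__":
    for code in (200, 502, 503, None):
        print(code, classify_status(code), should_page(code))
```

This only works if the services really do keep 503 for maintenance, which
is why the "and tested!" part matters.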

Of course a good solution could be to implement a rolling-upgrade / HA
maintenance strategy, but in reality (depending on how ancient the app is)
this can be quite hard.

ps. This is a really good read:
https://landing.google.com/sre/sre-book/toc/index.html

Cheers
Heath

On Thu, Dec 6, 2018 at 9:03 AM David H <ispcolohost at gmail.com> wrote:

> Hey all, was curious if anyone knows of a website monitoring service that
> has the option to incorporate a human component into the decision and
> escalation tree?  I’m trying to help a customer find a way around false
> positives bogging down their NOC staff, by having a human determine the
> difference between a real error, desired (but different) content, or
> something in between like “Hey it’s 3am and we’ve taken our website offline
> for maintenance, we’ll be back up by 6am.”  Automated systems tend to only
> know if test A, or steps A through C, are failing, then this is ‘down’ and
> do my preconfigured thing, but that ends up needlessly taking NOC time if
> the customer themselves is performing work on their own site, or just
> changed it and whatever content was being watched, is now gone.  So, the
> goal would be to have the end user be the first point of contact if it
> looks like more of a customer-side issue.  If they can’t be reached to
> confirm, THEN contact NOC, and unlike email alerts, keep contacting until a
> human acknowledges receipt of the alert.
>
>
>
> Thanks
>