What NMS do you use and why?

Wed Aug 15 16:19:02 UTC 2018

On Wed, Aug 15, 2018 at 08:49:12AM -0500, Colton Conor wrote:
> We are looking for a new network monitoring system. Since there are so many
> operators on this list, I would like to know which NMS do you use and why?
> Is there one that you really like, and others that you hate?
> 
> For free options (open source), LibreNMS and NetXMS come highly recommended
> by many wireless ISPs on low budgets. However, I am not sure the commercial
> options available nor their price points.

For monitoring network device/interface data plane reachability with
ping, we are still using an ancient piece of open source software
called Autostatus.  I find it invaluable for notifying us about
reachability issues, thanks to its simple-to-understand parent/child
relationships and graph-based fping methodology.  It isn't perfect: it
doesn't scale very well, it doesn't have HA/clustering, it has no
fancy dependencies (just basic parent-child), and it has no event
correlation, contact scheduling, API, etc.  But it is very easy to
understand why you are or are not getting an alert, and to boil that
down to a single point of failure.  As such, it provides reliable,
trustworthy information about data plane reachability from one vantage
point on the network.
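
To illustrate the parent/child idea (this is only a rough sketch, not
Autostatus's actual code; the function name and the topology are made
up for the example): given each node's parent and the set of nodes that
failed their ping, report only the nodes whose parent is still
reachable, since everything behind them is down as a consequence.

```python
from typing import Dict, List, Optional, Set


def root_failures(parent: Dict[str, Optional[str]], down: Set[str]) -> List[str]:
    """Return the down nodes whose parent is up (or that have no parent).

    Nodes that are down only because something upstream is down are
    left out, so the result points at the likely points of failure.
    """
    roots = []
    for node in sorted(down):
        p = parent.get(node)  # unknown nodes are treated as having no parent
        if p is None or p not in down:
            roots.append(node)
    return roots


# Hypothetical topology, as seen from the monitoring host's vantage point.
PARENT: Dict[str, Optional[str]] = {
    "core1": None,
    "dist1": "core1",
    "dist2": "core1",
    "access1": "dist1",
    "access2": "dist1",
}

if __name__ == "__main__":
    # dist1 fails, so access1 and access2 also stop answering pings.
    print(root_failures(PARENT, {"dist1", "access1", "access2"}))
```

With the topology above, only dist1 is reported; access1 and access2
are suppressed because their parent is already known to be down.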

For monitoring server & network service availability,
device/environmental health, etc. we are currently using Nagios.  My
problems with it are that it has complex rules for how/when to perform
a specific health check and send or suppress a notification (and
perhaps bugs in our old version, which never seems to send any Host
notifications except when it does).  Also, the whole idea of "suppress
the Host check unless all Service checks for all services on the host
are down" doesn't fit well with monitoring device/interface
reachability on routers & switches that make up a complex graph of
dependencies.  Trying to shoehorn Nagios into alerting on just the one
IP address/device/interface that is causing all the others behind it
to be unreachable doesn't work very well.
You can't use Host Dependencies because Host checks are suppressed by
default, and Host Dependencies don't affect Service
checks/notifications.  Forcing Host checks to always run causes
performance problems.  Creating a "Ping" service for every host
requires manually creating Service Dependencies between the "Ping"
services of every Host and its parent.  Then you end up with a complex
configuration that is very hard to understand.  But for things like
telling you when a power supply or fan has died, or if the web service
crashed, it works well.
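
To give a feel for the "Ping" service workaround, here is a rough
sketch that generates one Nagios servicedependency object per
parent/child link.  The host names, the "Ping" service description and
the chosen failure criteria are assumptions for the example, not our
actual configuration.

```python
from typing import Dict, Optional

# One servicedependency per link: the child's "Ping" service depends on
# the parent's "Ping" service.  Checks still run (execution criteria
# "n"), but notifications for the child are suppressed while the
# parent's service is CRITICAL or UNKNOWN.
TEMPLATE = """define servicedependency {{
    host_name                       {parent}
    service_description             Ping
    dependent_host_name             {child}
    dependent_service_description   Ping
    execution_failure_criteria      n
    notification_failure_criteria   c,u
}}
"""


def ping_dependencies(parent: Dict[str, Optional[str]]) -> str:
    """Render a servicedependency block for every host that has a parent."""
    blocks = []
    for child in sorted(parent):
        p = parent[child]
        if p is not None:
            blocks.append(TEMPLATE.format(parent=p, child=child))
    return "\n".join(blocks)


# Hypothetical topology.
TOPOLOGY: Dict[str, Optional[str]] = {
    "core1": None,
    "dist1": "core1",
    "access1": "dist1",
    "access2": "dist1",
}

if __name__ == "__main__":
    print(ping_dependencies(TOPOLOGY))
```

Even this four-host example needs three dependency objects, and the
number grows with every link in the network, which is where the
configuration starts to become hard to follow.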

We did a survey of a bunch of open source tools to replace Nagios and
have settled on Icinga for its APIs, dynamic rules with pattern
matching and boolean logic, and compatibility with Nagios plugins.
But Icinga still doesn't change the basic architectural choices of the
Nagios core engine and hence isn't a good fit for network
device/interface reachability monitoring IMO.