Monitoring system recommendation

Mikael Falkvidd mikael.falkvidd at op5.com
Tue Jun 7 07:42:12 UTC 2016


>
> On Monday, June 6, 2016, Manuel Marín <mmg at transtelco.net> wrote:
>
> > Dear Nanog community
> >
> > We are currently planning to upgrade our monitoring system (Opsview) due to
> > scalability issues and I was wondering what do you recommend for monitoring
> > 5000 hosts and 35000 services. We would like to use a monitoring system
> > that is compatible with the nagios plugin format, however we are not sure
> > if systems like Icinga/Shinken/Op5 are the way to go.
> >
> > Is someone using systems like Op5 or Icinga2 for monitoring > 5000 hosts?
> > Would you recommend commercial systems like Sevone, Zabbix, etc instead of
> > open source ones?
>

We (op5) have customers running > 50,000 hosts and > 300,000 services. So
5,000 hosts is generally not a problem.

As mentioned by Jeff, the forking model *can* become a problem, but small
binaries that don't load a lot of libraries fork pretty fast. A test we made
some time ago showed a 15-minute load peak of 3.89 (on 24 cores/hyperthreads)
when checking 100,000 services every 5 minutes. Check latencies were 0.8
seconds max and 0.002 seconds avg. Average CPU load was 15%.

Specs for the machine used:
Dell PowerEdge R620
2x Intel Xeon E5-2620
24 GB RAM
Dell PERC H710 hardware RAID card
RAID10 on 4x300GB 15kRPM SAS drives

So a single (now almost vintage) server can handle more than 300 plugin
executions per second (100,000 checks every 300 seconds) without breaking a
sweat. Scaling up is definitely a possibility, but scaling out (using
Mod-Gearman, mk or Merlin, all open source) is available as well.
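
To put the numbers above next to the original question, here is a small
sketch of the arithmetic. The function name is made up for illustration, and
the 5 minute interval for the 35,000 planned services is an assumption, since
the original post does not state a check interval.

```python
def checks_per_second(services: int, interval_seconds: float) -> float:
    """Average plugin executions per second when every service is
    checked once per interval."""
    return services / interval_seconds


# Benchmark from this mail: 100,000 services every 5 minutes.
benchmark_rate = checks_per_second(100_000, 5 * 60)

# ASSUMPTION: the planned 35,000 services also use a 5 minute interval.
planned_rate = checks_per_second(35_000, 5 * 60)

print(f"benchmark: {benchmark_rate:.0f}/s, planned: {planned_rate:.0f}/s")
# prints "benchmark: 333/s, planned: 117/s"
```

Under that assumption the planned load is roughly a third of what the test
machine handled, but only if the plugins are comparably lightweight.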

Complex plugins can get you in trouble though, for example check_vmware_api,
which loads the large VMware Perl SDK. I suggest you run a test with the
plugin mix you are planning to use.
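
One rough way to run such a test is to time repeated executions of each
plugin command. The sketch below is not an op5 or Naemon tool; the function
name and the two stand-in commands are hypothetical and only imitate a
"light" and a "library-heavy" plugin, so replace them with your real plugin
command lines. It runs the commands one at a time, so it measures the cost
per execution, not what a scheduler achieves in parallel.

```python
import subprocess
import sys
import time


def time_plugin(argv: list[str], runs: int = 5) -> float:
    """Return the mean wall-clock seconds per execution of one command."""
    start = time.perf_counter()
    for _ in range(runs):
        # Plugins report state through their exit code, so a non-zero
        # exit is not treated as an error here (check=False).
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
    return (time.perf_counter() - start) / runs


# HYPOTHETICAL stand-ins for real plugins: a process that does nothing,
# and one that loads a few extra libraries before exiting.
plugin_mix = {
    "light": [sys.executable, "-c", "pass"],
    "heavy": [sys.executable, "-c", "import json, decimal, xml.dom.minidom"],
}

for name, argv in plugin_mix.items():
    mean = time_plugin(argv)
    print(f"{name}: {mean * 1000:.1f} ms per run, "
          f"about {1 / mean:.0f} runs/s on one core")
```

Comparing the per-run cost of each plugin against the check rate you need
gives a first idea of whether startup overhead will be an issue.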

If scaling out is not an option and you want to stay in the Nagios/Naemon
world, a custom worker can be developed to get rid of the overhead of loading
the plugin and its libraries for every check. Documentation is available at
http://www.naemon.org/documentation/developer/workers.html

Full disclosure: I work as development team lead at op5.

best regards
Mikael Falkvidd


