OS, Hardware, Network - Logging, Monitoring, and Alerting

Adam Armstrong lists at memetic.org
Fri Jun 27 16:42:53 UTC 2008

Rev. Jeffrey Paul wrote:
> Hi.  I've a (theoretically) simple problem and I'm wondering how others
> solve it.
> I've recently deployed ~40 Linux instances on ~20 different Dell blades
> and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
> assorted switchable PDUs and whatnot.  
> We need to monitor standard things like cpu, memory, disk usage on all
> OSes.  This is straightforward with net-snmp.  It would also be cool if
> I could monitor more esoteric things, like ntp synchronization status,
> i/o statistics, etc.
> Other stuff we really need to keep an eye on is hardware - redundant 
> PSU status in our 7204s and Dells, temperatures and voltages (one of
> our colos in New York peaked at over 40C a few weeks ago, for 
> instance), and disk array status (I'd like to know of a failed disk 
> in a hardware RAID5 before I get calls about performance issues).  Our
> blade chassis have DRACs in them and I think they export this data via
> SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
> other PowerEdges have the DRACs in them so some of this information may
> need to be pulled via IPMI from within the host OS.  Presumably the
> Cisco gear makes the temperature available via SNMP.
> Finally, service checks - standard stuff (dns, http, https, ssh, smtp).
> Now, to the questions.
> 1) Is SNMP the best way to do this?  Obviously some of the data (service
> checks) will need to be collected other ways.
> 2) Is there any good solution that does both logging/trending of this
> data and also notification/monitoring/alerting?  I've used both Nagios
> and Cacti in the past, and, due to the number of individual things being
> monitored (3-5 items per OS instance, 5-10 items per physical server,
> 10-50 things per network device), setting them both up independently
> seems like a huge pain.  Also, I've never really liked Nagios that much.
> I recently entertained the idea of writing a CGI that output all of this
> information in a standard format (csv?), distributing and installing it, then
> collecting it periodically at a central location and doing all the
> rrd/notification myself, but then realized that this problem must've
> been solved a million times already.
> There's got to be a better way.  What do you guys use?
I wrote an NMS to do something along these lines. It's focused more 
on graphing than alerting. It knows where to find Dell/Cisco 
temperature sensors via SNMP and keeps track of hardware and OS 
types/versions. It's probably not ready for general consumption yet, 
but if you think it would be useful to you, give me a shout and I'll 
see if I can help you get it working.
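
To illustrate the collect-centrally idea from the quoted message, here is 
a minimal Python sketch. It is not the NMS described above: the 
"host,metric,value" line format, the metric names, and the threshold 
values are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical alert thresholds; metric names and limits are illustrative only.
THRESHOLDS = {"temp_c": 40.0, "disk_used_pct": 90.0}


def parse_readings(text):
    """Parse 'host,metric,value' CSV lines into (host, metric, float) tuples."""
    readings = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        host, metric, value = (field.strip() for field in row)
        try:
            readings.append((host, metric, float(value)))
        except ValueError:
            continue  # skip rows whose value is not numeric
    return readings


def find_alerts(readings, thresholds=THRESHOLDS):
    """Return the readings whose value exceeds the threshold for their metric."""
    return [r for r in readings if r[1] in thresholds and r[2] > thresholds[r[1]]]


# Made-up sample data standing in for what a central collector might fetch.
sample = """blade01,temp_c,41.5
blade01,disk_used_pct,72
sw-core,temp_c,33
"""

for host, metric, value in find_alerts(parse_readings(sample)):
    print(f"ALERT {host} {metric}={value}")
```

In practice the same readings would also be written to an RRD or similar 
store for trending; the sketch only covers the parse-and-threshold step.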


I wrote it mostly due to my own absolute hatred of Nagios and 
disappointment at the other NMSes around (where are the aesthetics?)! :)

