OS, Hardware, Network - Logging, Monitoring, and Alerting

Thu Jun 26 09:30:43 UTC 2008

On Jun 26, 2008, at 5:22 AM, Rev. Jeffrey Paul wrote:

> Hi.  I've a (theoretically) simple problem and I'm wondering how
> others solve it.
>
> I've recently deployed ~40 Linux instances on ~20 different Dell
> blades and PowerEdges (we're big on virtualization), a few 7204s and
> 3560s, and assorted switchable PDUs and whatnot.
>
> We need to monitor standard things like cpu, memory, disk usage on all
> OSes.  This is straightforward with net-snmp.  It would also be cool
> if I could monitor more esoteric things, like ntp synchronization
> status, i/o statistics, etc.
>
> Other stuff we really need to keep an eye on is hardware - redundant
> PSU status in our 7204s and Dells, temperatures and voltages (one of
> our colos in New York peaked at over 40C a few weeks ago, for
> instance), and disk array status (I'd like to know of a failed disk
> in a hardware RAID5 before I get calls about performance issues).  Our
> blade chassis have DRACs in them and I think they export this data via
> SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
> other PowerEdges have the DRACs in them so some of this information
> may need to be pulled via IPMI from within the host OS.  Presumably
> the Cisco gear makes the temperature available via SNMP.
>
> Finally, service checks - standard stuff (dns, http, https, ssh,
> smtp).
>
> Now, to the questions.
>
> 1) Is SNMP the best way to do this?  Obviously some of the data
> (service checks) will need to be collected other ways.
>
> 2) Is there any good solution that does both logging/trending of this
> data and also notification/monitoring/alerting?  I've used both Nagios
> and Cacti in the past, and, due to the number of individual things
> being monitored (3-5 items per OS instance, 5-10 items per physical
> server, 10-50 things per network device), setting them both up
> independently seems like a huge pain.  Also, I've never really liked
> Nagios that much.
>
> I recently entertained the idea of writing a CGI that output all of
> this information in a standard format (csv?), distributing and
> installing it, then collecting it periodically at a central location
> and doing all the rrd/notification myself, but then realized that this
> problem must've been solved a million times already.
>
> There's got to be a better way.  What do you guys use?
>
> (I'm not opposed to non-free solutions, provided they work better.)

You may want to have a look at Zenoss, http://www.zenoss.com/
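
For comparison, the roll-your-own CSV idea quoted above might look
roughly like the sketch below.  This is only an illustration under
stated assumptions, not how Zenoss or any other tool does it: the metric
names (load1, disk_used_pct), the function names, and the thresholds are
all made up here, and os.getloadavg() is only available on Unix-like
systems.  The host side prints "name,value" lines; the collector side
parses them and flags anything over a limit.

```python
import csv
import io
import os
import shutil


def collect_metrics(path="/"):
    """Gather a couple of local readings (hypothetical metric names)."""
    usage = shutil.disk_usage(path)
    load1, _, _ = os.getloadavg()  # 1-minute load average, Unix only
    return {
        "load1": round(load1, 2),
        "disk_used_pct": round(100.0 * usage.used / usage.total, 1),
    }


def to_csv(metrics):
    """Serialize metrics as 'name,value' lines, sorted by name."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name, value in sorted(metrics.items()):
        writer.writerow([name, value])
    return buf.getvalue()


def parse_csv(text):
    """Central side: turn the CSV text back into a name -> float dict."""
    return {row[0]: float(row[1]) for row in csv.reader(io.StringIO(text)) if row}


def check_thresholds(metrics, limits):
    """Return one alert string per metric that exceeds its limit."""
    return [
        f"{name}={metrics[name]} exceeds {limit}"
        for name, limit in sorted(limits.items())
        if name in metrics and metrics[name] > limit
    ]


if __name__ == "__main__":
    # On a monitored host this text would be served by the CGI; here it
    # is produced and consumed in one process for demonstration.
    text = to_csv(collect_metrics())
    for alert in check_thresholds(parse_csv(text), {"load1": 4.0, "disk_used_pct": 90.0}):
        print("ALERT:", alert)
```

Even this small sketch leaves out scheduling, history (rrd), and
notification delivery, which is most of the work an existing tool
would already handle.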

Cheers,
Andrew