OS, Hardware, Network - Logging, Monitoring, and Alerting
Andrew Girling
agirling at denetron.com
Thu Jun 26 09:30:43 UTC 2008
On Jun 26, 2008, at 5:22 AM, Rev. Jeffrey Paul wrote:
> Hi. I've a (theoretically) simple problem and I'm wondering how others
> solve it.
>
> I've recently deployed ~40 Linux instances on ~20 different Dell blades
> and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
> assorted switchable PDUs and whatnot.
>
> We need to monitor standard things like cpu, memory, disk usage on all
> OSes. This is straightforward with net-snmp. It would also be cool if
> I could monitor more esoteric things, like ntp synchronization status,
> i/o statistics, etc.
>
> Other stuff we really need to keep an eye on is hardware - redundant
> PSU status in our 7204s and Dells, temperatures and voltages (one of
> our colos in New York peaked at over 40C a few weeks ago, for
> instance), and disk array status (I'd like to know of a failed disk
> in a hardware RAID5 before I get calls about performance issues). Our
> blade chassis have DRACs in them and I think they export this data via
> SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
> other PowerEdges have the DRACs in them so some of this information may
> need to be pulled via IPMI from within the host OS. Presumably the
> Cisco gear makes the temperature available via SNMP.
>
> Finally, service checks - standard stuff (dns, http, https, ssh, smtp).
>
> Now, to the questions.
>
> 1) Is SNMP the best way to do this? Obviously some of the data (service
> checks) will need to be collected other ways.
>
> 2) Is there any good solution that does both logging/trending of this
> data and also notification/monitoring/alerting? I've used both Nagios
> and Cacti in the past, and, due to the number of individual things being
> monitored (3-5 items per OS instance, 5-10 items per physical server,
> 10-50 things per network device), setting them both up independently
> seems like a huge pain. Also, I've never really liked Nagios that much.
>
> I recently entertained the idea of writing a CGI that output all of this
> information in a standard format (csv?), distributing and installing it,
> then collecting it periodically at a central location and doing all the
> rrd/notification myself, but then realized that this problem must've
> been solved a million times already.
>
> There's got to be a better way. What do you guys use?
>
> (I'm not opposed to non-free solutions, provided they work better.)
You may want to have a look at Zenoss, http://www.zenoss.com/
Cheers,
Andrew
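
For anyone weighing the roll-your-own route described in the quoted
message (each host emits CSV, a central collector does the checking),
here is a minimal sketch of the collector side. The three-column layout
(host, metric, value), the metric names, and the thresholds are all
assumptions made for illustration; none of them come from the original
post, and the trending/RRD half is left out entirely.

```python
import csv
import io

# Hypothetical thresholds; metric names and limits are illustrative only.
THRESHOLDS = {"temp_c": 40.0, "disk_used_pct": 90.0, "load1": 8.0}


def parse_metrics(csv_text):
    """Parse 'host,metric,value' rows into a list of (host, metric, float)."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if len(record) != 3:
            continue  # skip blank or malformed lines
        host, metric, value = (field.strip() for field in record)
        try:
            rows.append((host, metric, float(value)))
        except ValueError:
            continue  # skip a header row or any non-numeric value
    return rows


def find_alerts(rows, thresholds=THRESHOLDS):
    """Return the rows whose value exceeds the threshold for that metric."""
    return [
        (host, metric, value)
        for host, metric, value in rows
        if metric in thresholds and value > thresholds[metric]
    ]


if __name__ == "__main__":
    # Stand-in for text fetched from each host's (assumed) CSV endpoint.
    sample = (
        "host,metric,value\n"
        "web01,temp_c,41.5\n"
        "web01,disk_used_pct,72\n"
        "db01,load1,9.2\n"
    )
    for host, metric, value in find_alerts(parse_metrics(sample)):
        print(f"ALERT {host} {metric}={value}")
```

Even at this size the sketch shows where the effort goes: the parsing
and threshold check are trivial, while per-host configuration, history,
and notification are the parts an integrated tool already handles.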