OS, Hardware, Network - Logging, Monitoring, and Alerting
alex at blastro.com
Thu Jun 26 11:14:37 CDT 2008
Andrew Girling wrote:
> On Jun 26, 2008, at 5:22 AM, Rev. Jeffrey Paul wrote:
>> Hi. I've a (theoretically) simple problem and I'm wondering how others
>> solve it.
>> I've recently deployed ~40 Linux instances on ~20 different Dell blades
>> and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
>> assorted switchable PDUs and whatnot.
>> We need to monitor standard things like cpu, memory, disk usage on all
>> OSes. This is straightforward with net-snmp. It would also be cool if
>> I could monitor more esoteric things, like ntp synchronization status,
>> i/o statistics, etc.
>> Other stuff we really need to keep an eye on is hardware - redundant
>> PSU status in our 7204s and Dells, temperatures and voltages (one of
>> our colos in New York peaked at over 40C a few weeks ago, for
>> instance), and disk array status (I'd like to know of a failed disk
>> in a hardware RAID5 before I get calls about performance issues). Our
>> blade chassis have DRACs in them and I think they export this data via
>> SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
>> other PowerEdges have the DRACs in them so some of this information may
>> need to be pulled via IPMI from within the host OS. Presumably the
>> Cisco gear makes the temperature available via SNMP.
>> Finally, service checks - standard stuff (dns, http, https, ssh, smtp).
>> Now, to the questions.
>> 1) Is SNMP the best way to do this? Obviously some of the data (service
>> checks) will need to be collected other ways.
>> 2) Is there any good solution that does both logging/trending of this
>> data and also notification/monitoring/alerting? I've used both Nagios
>> and Cacti in the past, and, due to the number of individual things being
>> monitored (3-5 items per OS instance, 5-10 items per physical server,
>> 10-50 things per network device), setting them both up independently
>> seems like a huge pain. Also, I've never really liked Nagios that much.
>> I recently entertained the idea of writing a CGI that output all of this
>> information in a standard format (csv?), distributing and installing it,
>> then collecting it periodically at a central location and doing all the
>> rrd/notification myself, but then realized that this problem must've
>> been solved a million times already.
>> There's got to be a better way. What do you guys use?
>> (I'm not opposed to non-free solutions, provided they work better.)
> You may want to have a look at Zenoss, http://www.zenoss.com/
I have to second the Zenoss recommendation. Setup is fairly automatic for
most things, the categorization is great, and it will incorporate Nagios
plugins or any script that produces output in the Nagios plugin format.
It's free, but you can also buy support or installation services from them.
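
For anyone who hasn't written one, "that format" is roughly a single status
line on stdout (optionally followed by "|" and performance data) plus an
exit code of 0, 1, 2, or 3 for OK, WARNING, CRITICAL, or UNKNOWN. Here is a
minimal sketch of a disk usage check in Python. The function names, the
"used_pct" label, and the 80/90 thresholds are my own assumptions for
illustration, not anything Zenoss or Nagios ships.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn, crit):
    """Map a usage percentage to (exit_code, status_line).

    Hypothetical helper: thresholds are "at or above" percentages.
    """
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    # Text before "|" is for humans; text after it is performance data
    # in the form label=value[unit];warn;crit, used for trending.
    line = (
        f"DISK {LABELS[code]} - {used_pct:.1f}% used"
        f" | used_pct={used_pct:.1f}%;{warn};{crit}"
    )
    return code, line


def main(path="/", warn=80, crit=90):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is reported as UNKNOWN, not OK.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    code, line = evaluate(100.0 * usage.used / usage.total, warn, crit)
    print(line)
    return code
```

To use it as an actual plugin you would finish the script with
sys.exit(main()) so the monitoring system sees the exit code. The same
output can feed both alerting (the exit code) and graphing (the
performance data), which is the combination the original question was
after.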