OS, Hardware, Network - Logging, Monitoring, and Alerting
lists at memetic.org
Fri Jun 27 11:42:53 CDT 2008
Rev. Jeffrey Paul wrote:
> Hi. I've a (theoretically) simple problem and I'm wondering how others
> solve it.
> I've recently deployed ~40 Linux instances on ~20 different Dell blades
> and PowerEdges (we're big on virtualization), a few 7204s and 3560s, and
> assorted switchable PDUs and whatnot.
> We need to monitor standard things like cpu, memory, disk usage on all
> OSes. This is straightforward with net-snmp. It would also be cool if
> I could monitor more esoteric things, like ntp synchronization status,
> i/o statistics, etc.
> Other stuff we really need to keep an eye on is hardware - redundant
> PSU status in our 7204s and Dells, temperatures and voltages (one of
> our colos in New York peaked at over 40C a few weeks ago, for
> instance), and disk array status (I'd like to know of a failed disk
> in a hardware RAID5 before I get calls about performance issues). Our
> blade chassis have DRACs in them and I think they export this data via
> SNMP (I'm trying to avoid the use of SNMP traps), but not all of our
> other PowerEdges have the DRACs in them so some of this information may
> need to be pulled via IPMI from within the host OS. Presumably the
> Cisco gear makes the temperature available via SNMP.
> Finally, service checks - standard stuff (dns, http, https, ssh, smtp).
> Now, to the questions.
> 1) Is SNMP the best way to do this? Obviously some of the data (service
> checks) will need to be collected other ways.
> 2) Is there any good solution that does both logging/trending of this
> data and also notification/monitoring/alerting? I've used both Nagios
> and Cacti in the past, and, due to the number of individual things being
> monitored (3-5 items per OS instance, 5-10 items per physical server,
> 10-50 things per network device), setting them both up independently
> seems like a huge pain. Also, I've never really liked Nagios that much.
> I recently entertained the idea of writing a CGI that output all of this
> information in a standard format (csv?), distributing and installing it, then
> collecting it periodically at a central location and doing all the
> rrd/notification myself, but then realized that this problem must've
> been solved a million times already.
> There's got to be a better way. What do you guys use?
I wrote an NMS to do something along these lines. It's focussed more
towards graphing than alerting. It knows where to find Dell/Cisco
temperature monitors via SNMP and will keep track of hardware and OS
types/versions. It's probably still not really ready for general
consumption, but if you think it would be useful to you, give me a shout
and I'll see if I can help you make it work properly for you.
I wrote it mostly due to my own absolute hatred of Nagios and
disappointment at the other NMSes around (where are the aesthetics?)! :)
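
For what it's worth, the idea from the quoted message (each host emits its readings as CSV, and a central box collects and checks them) might look roughly like the sketch below. It is only an illustration: the three-column layout (host, metric, value), the metric names, and the threshold numbers are all assumptions made up for the example, and it has nothing to do with how the NMS mentioned above works.

```python
import csv
import io

# Hypothetical upper limits per metric; the names and numbers are
# illustrative assumptions, not values taken from the thread.
THRESHOLDS = {
    "temp_c": 35.0,
    "disk_used_pct": 90.0,
    "cpu_load_1m": 8.0,
}


def parse_report(text):
    """Parse 'host,metric,value' CSV lines into (host, metric, float) tuples."""
    readings = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        host, metric, raw = (field.strip() for field in row)
        try:
            readings.append((host, metric, float(raw)))
        except ValueError:
            continue  # skip rows whose value is not numeric (e.g. a header)
    return readings


def find_alerts(readings, thresholds=THRESHOLDS):
    """Return the readings whose value exceeds the limit for that metric."""
    return [
        (host, metric, value)
        for host, metric, value in readings
        if metric in thresholds and value > thresholds[metric]
    ]


if __name__ == "__main__":
    # Made-up sample report from a single hypothetical host.
    sample = "nyc-blade01,temp_c,41.5\nnyc-blade01,disk_used_pct,62\n"
    for host, metric, value in find_alerts(parse_report(sample)):
        print(f"ALERT {host} {metric}={value}")
```

The same parsed readings could be handed to whatever does the trending (rrd or otherwise), so the collection format only has to be defined once; metrics without an entry in the threshold table are simply recorded and never alerted on.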