What to expect after a cooling failure
cite+nanog at incertum.net
Wed Jul 10 06:46:51 UTC 2013
* Erik Levinson <erik.levinson at uberflip.com>:
> For those who have gone through such events in the past, what can
> one expect in terms of long-term impact...should we expect some
> premature component failures? Does anyone have any stats to share?
We had a similar event (temperatures were a bit higher at 49°C,
duration was a bit shorter, 10am to 3pm) this January. In the two days
after the event, two of our HP servers had drives that went from "OK" to
"Predictive Failure", which is the Smart Array controller's way of
reporting high error rates. Two weeks after the event, a single DIMM
had an uncorrectable ECC error, which caused a server reboot. Three weeks
after the event, a single PSU failed.
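If you want to keep an eye on drive status in the days after such an event, something along these lines might help. This is only a rough sketch: it assumes you already have the controller's per-drive status report as text, and the line format in SAMPLE is an assumption for illustration (real output varies by tool and firmware version). The function name drives_needing_attention is made up for this example.

```python
import re

# Assumed line shape: "physicaldrive <id> (<details>): <status>".
# This is a guess at a typical controller status report, not a documented format.
STATUS_LINE = re.compile(
    r"^\s*physicaldrive\s+(?P<drive>\S+)\s.*:\s*(?P<status>[^:]+?)\s*$"
)


def drives_needing_attention(report):
    """Return {drive_id: status} for every drive whose status is not "OK"."""
    flagged = {}
    for line in report.splitlines():
        match = STATUS_LINE.match(line)
        if match and match.group("status") != "OK":
            flagged[match.group("drive")] = match.group("status")
    return flagged


# Hypothetical sample report, made up for this sketch.
SAMPLE = """\
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 146 GB): OK
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 146 GB): Predictive Failure
"""

if __name__ == "__main__":
    for drive, status in drives_needing_attention(SAMPLE).items():
        print(drive, status)
```

Run from cron against each server's report, this would let you compare the flagged set day by day and spot drives that change state after the event.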
In our opinion, the disk problems were caused by the cooling failure,
while the ECC error and the faulted PSU were probably not related.
I believe that your hardware will be fine, but it probably wouldn't be
a bad idea to check if you have current maintenance contracts/warranty
for your servers, or any other way of obtaining replacement drives in
a reasonably short time.