What to expect after a cooling failure

Wed Jul 10 04:17:04 UTC 2013

On 7/9/2013 10:28 PM, Erik Levinson wrote:
> As some may know, yesterday 151 Front St suffered a cooling failure
> after Enwave's facilities were flooded.
>
> One of the suites that we're in recovered quickly but the other took
> much longer and some of our gear shut down automatically due to
> overheating. We shut down remotely many redundant and non-essential
> systems in the hotter suite, and transferred remotely some others to
> the cooler suite, to ensure that we had a minimum of all core systems
> running in the hotter suite. We waited until the temperatures
> returned to normal, and brought everything back online. The entire
> event lasted from approx 18:45 until 01:15. Apparently ambient
> temperature was above 43 degrees Celsius at one point on the cool
> side of cabinets in the hotter suite.
>
> For those who have gone through such events in the past, what can one
> expect in terms of long-term impact...should we expect some premature
> component failures? Does anyone have any stats to share?

No stats, but way back in the day of very large computers (1 each) in 
very large facilities, it seems like the thing we worried most about at 
restart was too-rapid cooling and the resulting condensation if the 
conditions were right.
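
To put a rough number on "if the conditions were right": moisture forms
on any surface that is colder than the dew point of the air around it.
Below is a minimal sketch in Python, assuming the Magnus approximation
with one commonly cited pair of coefficients (17.62 and 243.12 deg C).
The function names and the sample readings are made up for
illustration; they are not measurements from this event.

```python
import math

# Magnus approximation coefficients (an assumption: one commonly cited
# pair; other sources use slightly different values).
MAGNUS_B = 17.62
MAGNUS_C = 243.12  # degrees Celsius


def dew_point_c(air_temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate dew point in deg C from air temperature and RH (%)."""
    if not 0.0 < rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be in (0, 100]")
    gamma = (math.log(rel_humidity_pct / 100.0)
             + MAGNUS_B * air_temp_c / (MAGNUS_C + air_temp_c))
    return MAGNUS_C * gamma / (MAGNUS_B - gamma)


def condensation_risk(surface_temp_c: float, air_temp_c: float,
                      rel_humidity_pct: float) -> bool:
    """True if a surface is at or below the dew point of the room air."""
    return surface_temp_c <= dew_point_c(air_temp_c, rel_humidity_pct)


if __name__ == "__main__":
    # Hypothetical readings: warm, humid room air meeting metal that the
    # restarted cooling has already chilled.
    room_c, rh_pct, chassis_c = 35.0, 60.0, 18.0
    print("dew point (deg C):", round(dew_point_c(room_c, rh_pct), 1))
    print("condensation risk:", condensation_risk(chassis_c, room_c, rh_pct))
```

The practical point is the same one as above: bring the temperature
down slowly enough that no surface drops below the dew point while the
room air is still warm and damp.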

After power-up, the next thing was disk crashes that had occurred on the
way down (this was a long time ago; discs and drums are different now).

Lastly were overheat failures, which were relatively few and always in
components with a reputation for weakness.

-- 
Requiescas in pace o email           Two identifying characteristics
                                         of System Administrators:
Ex turpi causa non oritur actio      Infallibility, and the ability to
                                         learn from their mistakes.
                                           (Adapted from Stephen Pinker)