"Hypothetical" Datacenter Overheating

Saku Ytti saku at ytti.fi
Wed Jan 17 13:13:27 UTC 2024


On Wed, 17 Jan 2024 at 03:18, <bzs at theworld.com> wrote:

> Others have pointed to references, I found some others, it's all
> pretty boring but perhaps one should embrace the general point that
> some equipment may not like abrupt temperature changes.

Can you share them? The only one I've found is:
https://www.ashrae.org/file%20library/technical%20resources/bookstore/supplemental%20files/referencecard_2021thermalguidelines.pdf

That quotes 20C/h, which is a much higher rate than almost anyone has
the ability to produce in their DC ambient. But it offers no
explanation of where that figure comes from.
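
As a rough sanity check (every number below is my assumption for
illustration, not a measurement), a back-of-envelope estimate of how
fast you can actually move ambient once you account for the thermal
mass of the racks:

# Back-of-envelope: how fast can you actually move DC ambient?
# All inputs are assumed example values, not measurements.

air_volume_m3 = 300.0                   # assumed room, 10m x 10m x 3m
air_heat_cap_j_per_m3k = 1.2 * 1005.0   # air density (kg/m3) * c_p (J/kg.K)

rack_count = 20                         # assumed
metal_per_rack_kg = 500.0               # assumed steel/copper per rack
metal_heat_cap_j_per_kgk = 500.0        # rough figure for steel

excess_cooling_w = 10_000.0             # assumed capacity beyond IT load

air_j_per_k = air_volume_m3 * air_heat_cap_j_per_m3k
metal_j_per_k = rack_count * metal_per_rack_kg * metal_heat_cap_j_per_kgk
total_j_per_k = air_j_per_k + metal_j_per_k

rate_c_per_h = excess_cooling_w * 3600.0 / total_j_per_k
print(f"air: {air_j_per_k / 1e3:.0f} kJ/K, metal: {metal_j_per_k / 1e6:.1f} MJ/K")
print(f"achievable ambient ramp: ~{rate_c_per_h:.1f} C/h")

With these assumptions you get under 7C/h, a third of the ASHRAE
figure, and that still ignores the building fabric. The rack metal,
not the air, dominates the thermal mass.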

I believe in reality there is immense complexity here:
     - Gradient depends on the processes and materials used in
manufacturing (pre- and post-RoHS devices will certainly have
different tolerable gradients)
     - Gradient has directionality, which the ASHRAE figure ignores:
devices are engineered to go from 20C to 90C in a very short moment
when turned on, but there has been less engineering pressure for
similar cooling rates
     - Gradient has positionality: a 20C change does not carry equal
risk between every pair of start and end temperatures

And likely no one knows this well, because no one has had to know it
well: the risk has not been expensive enough to be worth derisking.

But what we do know well:
    - ASHRAE quotes a rate you are unlikely to be able to hit
    - Devices that travel with you regularly see 50C instant ambient
gradients, in both directions, multiple times a day
    - Devices see large, fast gradients when turned on, but slower
ones when turned off
    - Compute people quote ASHRAE; networking people appear not to.
Perhaps, as you say, spindles are ultimately the reason the limits
exist

I think we generally have a bias in that we like to identify risks
and then add them as organisational knowledge, but ultimately all
these new rules and exceptions we introduce increase cost and
complexity and reduce efficiency and productivity. So we should be
very critical of them. It is fine to realise risks, and to use
realised risks as data for analysing whether avoiding them makes
sense. It's very easy to build poorly defined rules on top of poorly
defined rules and arrive at a high-cost, low-efficiency operation.
This 'few centigrades per hour' is an exceedingly palatable rule of
thumb: it sounds good, until you stop to think about it.

I would not recommend spending any time or money derisking gradients.
I would hope that the rules that derisk condensation are enough to
cover gradients as well, and I would re-evaluate after sufficient
realised risks.
-- 
  ++ytti

