Most energy efficient (home) setup

Mon Apr 16 01:54:14 UTC 2012

On Sun, Apr 15, 2012 at 10:52:51AM -0500, Jimmy Hess wrote:
> Consider that the probability 16GB of SDRAM experiences at least one
> single bit error at sea level,
> in a given 6 hour period exceeds  66%  = 1 - (1 - 1.3e-12 * 6)^(16 *
> 2^30 * 8).    In any given 24 hour period, the probability of at least
> one single bit error  exceeds 98%.    Assuming the memory is good and
> functioning correctly;
> 
> It's expected to see on average approximately   3 to 4   1-bit errors
> per day.  More are frequently seen.
> 
> Now if most of this 16GB of memory is unused, you will never notice
> that over 30 days,  120 or so bits have been flipped  from their
> proper value..

I think that is an overestimate, at least if single-bit (corrected)
ECC errors are as common as flipped bits on non-ECC RAM.
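
For reference, the arithmetic behind the quoted figures can be reproduced
with a short Python sketch. It takes the 1.3e-12 errors per bit per hour
rate and the 16GB size straight from the quoted message, and it assumes
every bit fails independently at that constant rate. The function names
are made up for illustration.

```python
import math

# Per-bit error rate from the quoted message (errors per bit per hour).
RATE_PER_BIT_HOUR = 1.3e-12
# 16GB expressed in bits, as in the quoted formula: 16 * 2^30 * 8.
BITS = 16 * 2**30 * 8


def expected_errors(hours):
    """Expected number of single-bit errors over the given number of hours."""
    return RATE_PER_BIT_HOUR * hours * BITS


def p_at_least_one(hours):
    """1 - (1 - p)^n, computed via log1p/expm1 to avoid rounding loss."""
    p = RATE_PER_BIT_HOUR * hours
    return -math.expm1(BITS * math.log1p(-p))


if __name__ == "__main__":
    print("P(>=1 error in 6h):  %.4f" % p_at_least_one(6))
    print("P(>=1 error in 24h): %.4f" % p_at_least_one(24))
    print("expected per day:    %.2f" % expected_errors(24))
    print("expected per 30 days: %.1f" % expected_errors(24 * 30))
```

Under those assumptions this comes out at about 0.66 for six hours, just
under 0.99 for a day, and a bit over four expected errors per day, so the
quoted numbers are internally consistent; the open question is the
per-bit rate itself.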

Now, first, count me in the "ECC is a must, full stop." crowd.   I
insist on ECC for even my customers' dedicated servers, even though most
of the customers don't care that much.   "It's not for you, it's for me."
With ECC, if you have EDAC/bluesmoke set up correctly on a supported
motherboard, you get console spew whenever you have a single-bit error.

This means I can do a very simple grep on the box's conserver logs
and find all the failing RAM modules I am responsible for.
Without ECC, I have no real way of telling the difference between broken
software and broken RAM.
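
A rough sketch of that kind of log scan, in Python rather than grep so
it can tally hits per machine. Everything specific here is an
assumption: that the console logs sit in one directory with one file per
host, named after the host, and that a correctable error shows up as a
line containing "EDAC MC<n>:" followed by a "CE" token. The exact EDAC
message wording differs between kernel versions and drivers, so the
pattern will likely need adjusting.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed pattern for a correctable-error (CE) line; adjust to match
# what your kernel's EDAC driver actually prints.
CE_PATTERN = re.compile(r"EDAC\s+MC\d+:.*\bCE\b")


def count_correctable_errors(log_dir):
    """Return a Counter mapping log file name (assumed host name) to CE lines."""
    counts = Counter()
    for log_file in sorted(Path(log_dir).iterdir()):
        if not log_file.is_file():
            continue
        # Console logs can contain stray bytes, so decode leniently.
        with log_file.open(errors="replace") as fh:
            counts[log_file.name] = sum(
                1 for line in fh if CE_PATTERN.search(line)
            )
    return counts


def suspects(counts, threshold=1):
    """Hosts with at least `threshold` matching lines, sorted by name."""
    return sorted(host for host, n in counts.items() if n >= threshold)
```

Calling suspects(count_correctable_errors(some_dir)) on the log
directory would list the machines worth a closer look; raising the
threshold filters out hosts with only an occasional hit.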

That said,  I still think the 120 bits a month estimate is too large;  I
believe that ECC RAM should report correctable errors (assuming a
correctly configured EDAC/bluesmoke module and supported chipset)
about as often as non-ECC RAM would get a bit flip.

In a past role, I did spend the time grepping through such a properly 
configured cluster, with tens of thousands of nodes, looking for failing
hardware.   I should have done a proper paper with statistics, but
I did not.   The vast majority of servers had zero correctable ECC errors,
while a few had a lot, which is consistent with the theory that ECC errors
are more often caused by bad RAM than by random bit flips.

(Of course, all these servers were in proper cases in a proper data center,
which probably gives you a fair bit of shielding.)

On my current fleet (well under 100 servers)  single-bit errors are so rare
that if I get one, I schedule that machine for removal from production.