Solar Flux (was: Re: China prefix hijack)
scott at doc.net.au
Sun Apr 11 14:58:44 CDT 2010
On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <rs at seastrom.com> wrote:
> We've seen great increases in CPU and memory speeds as well as disk
> densities since the last maximum (March 2000). Speccing ECC memory is
> a reasonable start, but this sort of thing has been a problem in the
> past (anyone remember the Sun UltraSPARC CPUs that had problems last
> time around?) and will no doubt bite us again.
Sun's problem had an easy solution - and it's exactly the one you've
mentioned - ECC.
The issue with the UltraSPARC IIs was that they had enough redundancy to
detect a problem (parity), but not enough to correct it (ECC). They
also (initially) handled such errors very abruptly - they would
basically panic and restart.
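To make the distinction concrete - and this is just a toy illustration, not
Sun's actual cache circuitry - here's a single parity bit next to a
Hamming(7,4) code in Python. The parity bit can tell you *that* a bit
flipped, but not which one; the Hamming syndrome points at the flipped bit
so it can be fixed in place:

```python
def parity(bits):
    """Even parity over a list of 0/1 bits: detects a single flip,
    but gives no information about where it happened."""
    return sum(bits) % 2

def hamming74_encode(d):
    """Hamming(7,4): 4 data bits -> 7-bit codeword, single-error correcting.
    Codeword layout (1-based positions): p1 p2 d1 p3 d2 d3 d4."""
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_correct(c):
    """Recompute the three parity checks; the syndrome is the 1-based
    position of a single flipped bit (0 means no error). Fix and return."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c = c[:]
        c[syndrome - 1] ^= 1
    return c
```

Real ECC memory uses a wider SECDED variant of the same idea (correct one
bit, detect two) across each 64-bit word, but the mechanism is the same.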
From the UltraSPARC III onwards they fixed this problem by sticking with
parity in the L1 cache (which is write-through, so on a parity error you can
just dump the line and re-read it from memory or a higher-level cache), but
using ECC on the L2 and higher (write-back) caches. The memory and all
datapaths were already protected by ECC in everything but the low-end systems.
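The reason write-through makes parity "good enough" for L1 is worth spelling
out: every store also goes to the next level, so the L1 line is never the
only copy of the data. A rough sketch (hypothetical class names, not any
real cache model) of that recovery path:

```python
def parity_bit(value):
    """Even parity over the bits of an integer value."""
    return bin(value).count("1") % 2

class WriteThroughL1:
    """Toy write-through L1 cache in front of an (assumed ECC-protected)
    next level, here just a dict standing in for L2/memory."""

    def __init__(self, next_level):
        self.next_level = next_level
        self.lines = {}  # addr -> (value, stored parity)

    def store(self, addr, value):
        self.lines[addr] = (value, parity_bit(value))
        self.next_level[addr] = value  # write-through: L1 never holds dirty data

    def flip_bit(self, addr, bit):
        """Simulate a particle-induced upset in the L1 array only."""
        value, p = self.lines[addr]
        self.lines[addr] = (value ^ (1 << bit), p)

    def load(self, addr):
        value, p = self.lines.get(addr, (None, None))
        if value is None or parity_bit(value) != p:
            # Parity mismatch: the line is clean by construction, so just
            # discard it and re-read from the next level - no panic needed.
            value = self.next_level[addr]
            self.lines[addr] = (value, parity_bit(value))
        return value
```

With a write-back cache the corrupted line can be the *only* copy, which is
why those levels need full ECC rather than parity.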
It does raise a very interesting question though - how many systems are you
running that don't use ECC _everywhere_? (CPU, memory and datapath)
Unlike many years ago, parity memory is basically non-existent today, which
means that if you're not using ECC then you're probably suffering relatively
regular single-bit errors without knowing it. In network devices that's
less of an issue as you can normally rely on higher-level protocols to
detect/correct the errors, but if you're not using ECC in your servers then
you're asking for (silent) trouble...
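As one example of the higher-level protection that papers over this in the
network case: the RFC 1071 ones'-complement checksum carried by IP/TCP/UDP
catches any single-bit flip in the covered data, so a packet corrupted in a
non-ECC buffer gets dropped and retransmitted rather than silently accepted.
A minimal version:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit words, as used by
    IP, TCP and UDP. Any single-bit error changes the result."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold carry back in
    return ~total & 0xFFFF
```

Data sitting in a server's RAM - page cache, database buffers, application
state - usually has no equivalent end-to-end check, which is exactly why the
error is silent there.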