[125269] in North American Network Operators' Group
Re: Solar Flux (was: Re: China prefix hijack)
daemon@ATHENA.MIT.EDU (Scott Howard)
Sun Apr 11 15:58:58 2010
In-Reply-To: <867hoet758.fsf_-_@seastrom.com>
Date: Sun, 11 Apr 2010 12:58:44 -0700
From: Scott Howard <scott@doc.net.au>
To: "Robert E. Seastrom" <rs@seastrom.com>
Cc: nanog@merit.edu
On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <rs@seastrom.com> wrote:
> We've seen great increases in CPU and memory speeds as well as disk
> densities since the last maximum (March 2000).  Speccing ECC memory is
> a reasonable start, but this sort of thing has been a problem in the
> past (anyone remember the Sun UltraSPARC CPUs that had problems last
> time around?) and will no doubt bite us again.
>
Sun's problem had an easy solution - and it's exactly the one you've
mentioned - ECC.
The issue with the UltraSPARC IIs was that they had enough redundancy to
detect a problem (parity), but not enough to correct it (ECC). They also
(initially) handled such errors very abruptly - they would basically
panic and restart.
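To make the detect-versus-correct distinction concrete, here is a minimal
sketch in Python. It is not how any Sun hardware implemented it: the
function names are made up for illustration, and it uses a toy Hamming(7,4)
code, whereas real ECC memory typically protects 64 data bits with 8 check
bits. A single parity bit can only say "something flipped"; the Hamming
syndrome also says which bit, so it can be flipped back.

```python
def parity_bit(bits):
    """Even parity: the extra bit that makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits, p):
    """True if the stored parity bit still matches the data bits."""
    return parity_bit(bits) == p


def hamming74_encode(d):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(c):
    """Return (data bits, syndrome). A non-zero syndrome is the 1-based
    position of the single flipped bit, which is corrected before the
    data bits are extracted."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    data = [1, 0, 1, 1]

    # Parity: the flip is noticed, but there is no way to tell which bit.
    p = parity_bit(data)
    damaged = [1, 0, 0, 1]
    print("parity detects error:", not parity_ok(damaged, p))

    # Hamming: the flip is located and repaired.
    word = hamming74_encode(data)
    word[4] ^= 1   # flip position 5
    print("hamming recovers:", hamming74_decode(word))
```

Note that plain Hamming(7,4) will mis-correct a double-bit error; practical
ECC adds one more overall parity bit so that double-bit errors are at least
detected.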
From the UltraSPARC III onwards they fixed this problem by sticking with
parity in the L1 cache (write-through, so if you get a parity error you can
just dump the cache and re-read from memory or a higher cache), but using
ECC on the L2 and higher (write-back) caches.  The memory and all datapaths
were already protected with ECC in everything but the low-end systems.
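The reason parity is enough for a write-through cache is that the cache
never holds the only copy of the data. A rough sketch of that recovery
path in Python follows. The class and method names are hypothetical, a
dict stands in for the next level of the hierarchy, and for simplicity it
invalidates only the affected line rather than dumping the whole cache.

```python
class WriteThroughCache:
    """Toy parity-protected write-through cache (illustrative only)."""

    def __init__(self, backing):
        self.backing = backing   # addr -> int; stands in for L2/memory
        self.lines = {}          # addr -> (value, parity bit)
        self.recovered = 0       # number of parity errors recovered from

    @staticmethod
    def _parity(value):
        return bin(value).count("1") % 2

    def write(self, addr, value):
        # Write-through: the next level is updated on every write,
        # so the cached copy is never the only good copy.
        self.backing[addr] = value
        self.lines[addr] = (value, self._parity(value))

    def read(self, addr):
        if addr in self.lines:
            value, p = self.lines[addr]
            if self._parity(value) == p:
                return value
            # Parity error: discard the line and fall through to a refill.
            del self.lines[addr]
            self.recovered += 1
        value = self.backing[addr]
        self.lines[addr] = (value, self._parity(value))
        return value

    def flip_bit(self, addr, bit):
        """Simulate a single-bit upset in a cached line."""
        value, p = self.lines[addr]
        self.lines[addr] = (value ^ (1 << bit), p)


if __name__ == "__main__":
    cache = WriteThroughCache({})
    cache.write(0x10, 0b1011)
    cache.flip_bit(0x10, 2)
    print(bin(cache.read(0x10)), "recovered:", cache.recovered)
```

A write-back cache can't do this, since a dirty line may be the only
up-to-date copy, which is why it needs ECC rather than parity.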
It does raise a very interesting question though - how many systems are you
running that don't use ECC _everywhere_? (CPU, memory and datapath)
Unlike many years ago, today parity memory is basically non-existent, which
means if you're not using ECC then you're probably suffering relatively
regular single-bit errors without knowing it.  In network devices that's
less of an issue, as you can normally rely on higher-level protocols to
detect/correct the errors, but if you're not using ECC in your servers then
you're asking for (silent) trouble...
  Scott.