[125269] in North American Network Operators' Group


Re: Solar Flux (was: Re: China prefix hijack)

daemon@ATHENA.MIT.EDU (Scott Howard)
Sun Apr 11 15:58:58 2010

In-Reply-To: <867hoet758.fsf_-_@seastrom.com>
Date: Sun, 11 Apr 2010 12:58:44 -0700
From: Scott Howard <scott@doc.net.au>
To: "Robert E. Seastrom" <rs@seastrom.com>
Cc: nanog@merit.edu
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On Sun, Apr 11, 2010 at 7:07 AM, Robert E. Seastrom <rs@seastrom.com> wrote:

> We've seen great increases in CPU and memory speeds as well as disk
> densities since the last maximum (March 2000).  Speccing ECC memory is
> a reasonable start, but this sort of thing has been a problem in the
> past (anyone remember the Sun UltraSPARC CPUs that had problems last
> time around?) and will no doubt bite us again.
>

Sun's problem had an easy solution - and it's exactly the one you've
mentioned - ECC.

The issue with the UltraSPARC IIs was that they had enough redundancy to
detect a problem (parity), but not enough to correct it (ECC). They also
(initially) handled such errors very abruptly - they would basically panic
and restart.
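
To make the detect-versus-correct distinction concrete, here's a minimal
sketch in Python. It is a toy, not what any Sun hardware did: a single even
parity bit stands in for "Parity", and a Hamming(7,4) code stands in for
"ECC" (real ECC memory typically uses a wider SECDED code over 64 data
bits). All function names are made up for illustration.

```python
def parity_bit(bits):
    """Even parity: the extra bit makes the total number of 1s even."""
    return sum(bits) % 2


def parity_ok(bits, p):
    """True if the stored word plus its parity bit still has even weight."""
    return (sum(bits) + p) % 2 == 0


def hamming74_encode(d):
    """Encode 4 data bits as 7 bits; parity bits sit at positions 1, 2, 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(c):
    """Return (codeword with any single-bit error fixed, error position or 0)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]   # checks positions 1, 3, 5, 7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]   # checks positions 2, 3, 6, 7
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]   # checks positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s4       # syndrome = 1-based position of the flip
    if pos:
        c[pos - 1] ^= 1
    return c, pos


data = [1, 0, 1, 1]

# Parity: a flipped bit is noticed, but there is no way to tell which one.
p = parity_bit(data)
flipped = [1, 0, 0, 1]
parity_detected = not parity_ok(flipped, p)

# Hamming(7,4): the syndrome points at the flipped bit, so it can be repaired.
good = hamming74_encode(data)
bad = list(good)
bad[4] ^= 1                          # flip position 5
fixed, where = hamming74_correct(bad)
```

With parity all you learn is "something in this word is wrong", which is why
the only safe responses are to refetch the data or stop. With the Hamming
code the syndrome names the bad bit, so the word can be repaired in place.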

Starting with the UltraSPARC III they fixed this problem by sticking with
parity in the L1 cache (write-through, so if you get a parity error you can
just dump the cache and re-read from memory or a higher-level cache), but
using ECC on the L2 and higher (write-back) caches.  The memory and all
datapaths were already protected with ECC in everything but the low-end
systems.
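
A rough sketch of why parity is enough for a write-through cache, again in
Python and again purely illustrative (the class and its fields are
hypothetical, and a dict stands in for main memory): because every write
also goes to the backing store, a line that fails its parity check can
simply be thrown away and re-read.

```python
class WriteThroughCache:
    """Toy write-through cache with one parity bit per cached value."""

    def __init__(self, memory):
        self.memory = memory          # backing store always has a good copy
        self.lines = {}               # addr -> (value, parity)

    @staticmethod
    def _parity(value):
        return bin(value).count("1") % 2

    def write(self, addr, value):
        self.memory[addr] = value     # write-through: memory updated every time
        self.lines[addr] = (value, self._parity(value))

    def read(self, addr):
        if addr in self.lines:
            value, parity = self.lines[addr]
            if self._parity(value) == parity:
                return value
            del self.lines[addr]      # parity error: dump the bad line
        value = self.memory[addr]     # refill from the backing store
        self.lines[addr] = (value, self._parity(value))
        return value


memory = {}
cache = WriteThroughCache(memory)
cache.write(0x10, 0b1011)

# Simulate a single-bit upset in the cached copy only.
value, parity = cache.lines[0x10]
cache.lines[0x10] = (value ^ 0b0100, parity)

recovered = cache.read(0x10)          # parity mismatch, so re-read from memory
```

A write-back cache can't play this trick, since a dirty line may be the only
up-to-date copy of the data; that is where you need enough redundancy to
correct the error rather than just notice it.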

It does raise a very interesting question though - how many systems are you
running that don't use ECC _everywhere_? (CPU, memory and datapath)

Unlike many years ago, today Parity memory is basically non-existent, which
means if you're not using ECC then you're probably suffering relatively
regular single-bit errors without knowing it.  In network devices that's
less of an issue as you can normally rely on higher-level protocols to
detect/correct the errors, but if you're not using ECC in your servers then
you're asking for (silent) trouble...
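
As a small illustration of the "higher-level protocols" point, here's a
sketch using Python's standard zlib.crc32 as a stand-in for a frame check
(the frame/check helpers are made up for this example, not any real
protocol's API). A flipped bit in transit fails the check, so the receiver
can drop the frame and let retransmission deal with it.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, trailer = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


sent = frame(b"route update")
corrupted = bytes([sent[0] ^ 0x01]) + sent[1:]   # single-bit error in transit

sent_ok = check(sent)
corrupted_ok = check(corrupted)
```

Data sitting in non-ECC server memory has no equivalent check wrapped around
it, which is why the same flip there goes unnoticed.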

  Scott.
