[152162] in North American Network Operators' Group
Re: Most energy efficient (home) setup
daemon@ATHENA.MIT.EDU (Leo Bicknell)
Mon Apr 16 08:40:56 2012
Date: Mon, 16 Apr 2012 05:39:34 -0700
From: Leo Bicknell <bicknell@ufp.org>
To: NANOG list <nanog@nanog.org>
Mail-Followup-To: NANOG list <nanog@nanog.org>
In-Reply-To: <20120416015413.GE24826@luke.xen.prgmr.com>
In a message written on Sun, Apr 15, 2012 at 09:54:14PM -0400, Luke S. Crawford wrote:
> On my current fleet (well under 100 servers) single bit errors are so rare
> that if I get one, I schedule that machine for removal from production.
In a previous life, in a previous time, I worked at a place that
had a bunch of Cisco routers with parity RAM. For the time, these boxes
had a lot of RAM, as they had distributed line cards each with their
own processor memory.
Cisco was rather famous for these parity errors, mostly because of
their stock answer: sunspots. The answer was in fact largely
correct, but it's just not a great response from a vendor. They
had a bunch of statistics though, collected from many of these
deployed boxes.
We ran the statistics, and given hundreds of routers, each with
many line cards, the math told us we should have approximately 1
router every 9-10 months get one parity error from sunspots and
other random activity (i.e. not a failing RAM module with hundreds
of repeatable errors). This was, in fact, close to what we observed.
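A rough sketch of that kind of fleet arithmetic, in Python. The vendor's actual statistics are not reproduced in this message, so the router count, cards per router, and per-card error interval below are all made-up placeholders, chosen only so that the result lands in the 9-10 month range. The model assumes every line card fails independently at the same constant rate.

```python
def months_between_errors(routers, cards_per_router, card_years_per_error):
    """Expected months between random parity errors across a whole fleet.

    Assumes independent line cards with a constant error rate, so the
    fleet-wide rate is simply the per-card rate times the number of cards.
    """
    cards = routers * cards_per_router
    # Fleet sees `cards / card_years_per_error` errors per year.
    return 12.0 * card_years_per_error / cards


# Hypothetical inputs, NOT the real figures from the vendor:
ROUTERS = 300                  # "hundreds of routers"
CARDS_PER_ROUTER = 8           # "many line cards"
CARD_YEARS_PER_ERROR = 1900.0  # one random error per card every ~1900 years

if __name__ == "__main__":
    months = months_between_errors(ROUTERS, CARDS_PER_ROUTER, CARD_YEARS_PER_ERROR)
    print(f"expected months between fleet-wide errors: {months:.1f}")
```

The point of the sketch is only the scaling: an event that a single card would see once in many lifetimes becomes a roughly yearly event once there are a couple of thousand cards.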
This experience gave me two takeaways. First, single bit flips are
rare, but when you have enough boxes rare shows up often. It's
very similar to anyone with petabytes of storage: disks fail every
couple of days because you have so many of them. At the same time
a home user might not see a failure in their lifetime (of disk or
memory).
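To put numbers on "rare shows up often", here is a minimal sketch using a Poisson model. The 2% annual failure rate, the 10,000-disk fleet, and the 5-year home-use window are assumptions picked for illustration, not measurements from this thread.

```python
import math


def p_at_least_one(units, annual_rate, years):
    """Probability of seeing at least one failure, assuming failures are
    independent and arrive at a constant rate (Poisson model)."""
    return 1.0 - math.exp(-units * annual_rate * years)


def days_between_failures(units, annual_rate):
    """Expected days between failures across `units` devices."""
    return 365.0 / (units * annual_rate)


# Hypothetical figures, for illustration only:
ASSUMED_ANNUAL_RATE = 0.02  # 2% of devices fail per year
FLEET_SIZE = 10_000         # a large storage fleet
HOME_YEARS = 5              # how long a home user keeps one disk

if __name__ == "__main__":
    home = p_at_least_one(1, ASSUMED_ANNUAL_RATE, HOME_YEARS)
    fleet = days_between_failures(FLEET_SIZE, ASSUMED_ANNUAL_RATE)
    print(f"home user, one disk, {HOME_YEARS} years: {home:.1%} chance of a failure")
    print(f"fleet of {FLEET_SIZE} disks: one failure every {fleet:.1f} days")
```

Under these assumed inputs the same per-device rate gives the home user a small chance of ever seeing a failure, while the fleet operator sees one every couple of days.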
Second though, if you're running a business, ECC is a must because
the message is so bad. "This was caused by sunspots" is not a
customer-inspiring response, no matter how correct. "We could have
prevented this by spending an extra $50 on proper RAM for your $1M
box" is even worse.
A quick look at Newegg: 4GB DDR3 1333 ECC DIMM, $33.99. 4GB
DDR3 1333 Non-ECC DIMM, $21.99. Savings, $12. (Yes, I realize the
motherboard also needs some extra circuitry; I expect it's less than $1
in quantity though.)
Pretty much everyone I know values their data at more than $12 if it
is lost.
-- 
Leo Bicknell - bicknell@ufp.org - CCIE 3440
PGP keys at http://www.ufp.org/~bicknell/