[119033] in North American Network Operators' Group
Re: HE.net, Fremont-2 outage?
daemon@ATHENA.MIT.EDU (Valdis.Kletnieks@vt.edu)
Wed Nov 4 21:58:49 2009
To: Joe Greco <jgreco@ns.sol.net>
In-Reply-To: Your message of "Wed, 04 Nov 2009 12:26:15 CST."
<200911041826.nA4IQGeT020376@aurora.sol.net>
From: Valdis.Kletnieks@vt.edu
Date: Wed, 04 Nov 2009 21:57:33 -0500
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
--==_Exmh_1257389853_4430P
Content-Type: text/plain; charset=us-ascii
On Wed, 04 Nov 2009 12:26:15 CST, Joe Greco said:
> With power:
>
> N+1 is usually better than N
> Best to assume full load when doing math
> Things will go wrong, predict common failures
And uncommon ones. :)
So as part of a major compute-cluster install, we upgraded our UPS and diesel
generator one weekend, and breathed a collective sigh of relief that we were
now safe from power outages and had mostly dodged a bullet. We *did* have some
scary moments when we discovered that (a) of the 400 or so disks on our Sun
E10K, about 10 didn't spin up again and (b) several of the boot disks on said
box weren't mirrored. Fortunately, none of the 10 failures were on a
non-mirrored disk. By Tuesday, all the previously non-mirrored boot disks were
in fact mirrored.
That Friday, a bozo contractor relocating a doorway managed to set off the
Halon. Only lost two disks on the E10K. Guess which two? ;)
And a month later, we discovered that the nice shiny new automatic cutover
switch was wired in backwards, necessitating another power outage to re-wire it
correctly.
So much for safe from power outages... :)
--==_Exmh_1257389853_4430P--