[164359] in North American Network Operators' Group
Re: What to expect after a cooling failure
daemon@ATHENA.MIT.EDU (Jay Ashworth)
Wed Jul 10 00:04:49 2013
Date: Wed, 10 Jul 2013 00:04:23 -0400 (EDT)
From: Jay Ashworth <jra@baylink.com>
To: NANOG <nanog@nanog.org>
In-Reply-To: <1373426894.69598008@apps.rackspace.com>
----- Original Message -----
> From: "Erik Levinson" <erik.levinson@uberflip.com>
> For those who have gone through such events in the past, what can one
> expect in terms of long-term impact...should we expect some premature
> component failures? Does anyone have any stats to share?
If the HDDs were spinning while above their rated maximum ambient intake
temperature, *especially* if they're not *right out front in the intake
path* (is anything not built that way anymore? Yeah; the back side of
45-drive Supermicro chassis, among other things), you should probably plan
on doing a preemptive replacement cycle, or at the very least, pay *very*
close attention to what smartd (from smartmontools) reports, and have a
good stock of pre-trayed replacements.
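As a rough illustration of what "paying close attention" to SMART data could look like, here is a minimal sketch that parses the attribute table printed by `smartctl -A` and flags a drive. The threshold, the list of watched counters, the function names, and the sample table are all assumptions for illustration; attribute names vary by vendor (some drives report `Airflow_Temperature_Cel` instead of `Temperature_Celsius`), so treat this as a starting point only.

```python
# Hypothetical threshold; check the drive vendor's datasheet for the real limit.
MAX_TEMP_C = 55
# Counters where any non-zero raw value deserves a look (an assumed shortlist).
WATCHED_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# Made-up sample in the layout of the ATA attribute table from `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   039   032   000    Old_age   Always       -       61 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # only the first token of the raw value is used.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def drive_warnings(attrs):
    """Return a list of human-readable warnings for one drive."""
    warnings = []
    temp = attrs.get("Temperature_Celsius")
    if temp is not None and temp > MAX_TEMP_C:
        warnings.append(f"temperature {temp}C exceeds {MAX_TEMP_C}C")
    for name in WATCHED_COUNTERS:
        if attrs.get(name, 0) > 0:
            warnings.append(f"{name} = {attrs[name]}")
    return warnings


if __name__ == "__main__":
    for warning in drive_warnings(parse_smart_attributes(SAMPLE)):
        print(warning)
```

In practice the text would come from running `smartctl -A` against each device on a schedule, and the more useful signal after a heat event is probably whether the watched counters *grow* between runs, not their absolute value.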
Remember that you may fall into the RAID Hole if you wait for failures,
and hence lose data which isn't backed up anyway -- if more drives in a
RAID group fail *during rebuilds*, you're essentially screwed.
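To put a rough number on the rebuild risk, here is a back-of-the-envelope sketch. It assumes each surviving drive fails independently at a constant rate derived from an annualized failure rate (AFR); the function name and every figure in the usage loop are hypothetical, not measurements. Heat-stressed drives from the same event are likely to fail in a correlated way, so this model probably understates the real risk.

```python
def p_additional_failure(surviving_drives, afr, rebuild_hours):
    """Probability that at least one surviving drive fails during a rebuild.

    Assumes independent drives with a constant failure rate, where `afr` is
    the probability that a single drive fails within one year.
    """
    if not 0.0 <= afr < 1.0:
        raise ValueError("afr must be in [0, 1)")
    window_years = rebuild_hours / 8760.0
    # One drive survives a year with probability (1 - afr), so all drives
    # survive the window with probability (1 - afr) ** (n * window_years).
    p_all_survive = (1.0 - afr) ** (surviving_drives * window_years)
    return 1.0 - p_all_survive


if __name__ == "__main__":
    # Hypothetical: 11 surviving drives in a 12-drive group, 24-hour rebuild.
    for afr in (0.03, 0.10, 0.25):
        risk = p_additional_failure(11, afr, 24)
        print(f"AFR {afr:.0%}: chance of another failure mid-rebuild = {risk:.3%}")
```

The point of the comparison is the ratio, not the absolute values: if the heat event pushed the effective AFR up several-fold, the chance of a second failure during each rebuild goes up by roughly the same factor, and it compounds over every rebuild you end up doing.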
If your RAID groups were properly dispersed across drive build dates, then
this will probably be *slightly* less dangerous, but still.
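For what "dispersed across drive build dates" might mean in practice, here is one simple way to lay out groups, sketched under assumptions: drives are described by a hypothetical `(serial, build_batch)` pair, and the helper names are invented for illustration. Sorting by batch and dealing round-robin puts drives from the same batch into consecutive groups, so a batch of k drives contributes at most ceil(k / group_count) drives to any one group.

```python
from collections import Counter


def disperse_by_batch(drives, group_count):
    """Deal drives into groups so each build batch is spread as evenly as possible.

    `drives` is a list of (serial, build_batch) tuples.
    """
    groups = [[] for _ in range(group_count)]
    # Sorting keeps each batch contiguous; round-robin then spreads it out.
    ordered = sorted(drives, key=lambda drive: drive[1])
    for index, drive in enumerate(ordered):
        groups[index % group_count].append(drive)
    return groups


def worst_batch_share(group):
    """Largest number of drives in one group that share a build batch."""
    return max(Counter(batch for _, batch in group).values())


if __name__ == "__main__":
    # Hypothetical inventory: three build batches, four drives each.
    batches = ["2012-03"] * 4 + ["2012-09"] * 4 + ["2013-01"] * 4
    inventory = [(f"SN{i:03d}", batch) for i, batch in enumerate(batches)]
    for number, group in enumerate(disperse_by_batch(inventory, 4)):
        print(number, group, "worst share:", worst_batch_share(group))
```

This only helps with batch-correlated wear; it does nothing for drives that all sat in the same hot spot of the chassis, which is a separate axis to spread across if you have the choice.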
Also watch bearing-type fans.
Cheers,
-- jra
--
Jay R. Ashworth Baylink jra@baylink.com
Designer The Things I Think RFC 2100
Ashworth & Associates http://baylink.pitas.com 2000 Land Rover DII
St Petersburg FL USA #natog +1 727 647 1274