[145568] in North American Network Operators' Group
Re: [outages] News item: Blackberry services down worldwide,
daemon@ATHENA.MIT.EDU (Chris Campbell)
Wed Oct 12 12:00:13 2011
From: Chris Campbell <chris@ctcampbell.com>
In-Reply-To: <6707.1318434596@turing-police.cc.vt.edu>
Date: Wed, 12 Oct 2011 16:58:21 +0100
To: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
I think it raises serious questions about RIM's DR strategy if a DB
corruption or switch failure or whatever can cause this much outage.
'Surely' RIM have a second site that is independent of the primary
(within reason) that they could have flipped to when they realised the DB
was borked. If not, then any business that relies on them needs to be
shouting from the rooftops to get RIM to fix it.
Chris.
On 12 Oct 2011, at 16:49, Valdis.Kletnieks@vt.edu wrote:
> On Wed, 12 Oct 2011 09:52:02 CDT, -Hammer- said:
>> What kills me is what they have told the public. They lost a "core
>> switch". I don't know if they actually mean network switch or not but
>> I'm pretty sure any of us that work in an enterprise environment know
>> how to factor N+1 just for these types of days. And then the backup
>> solution failed? I'm not buying it either.
>
> Yeah, and that extra comma in the one config file that didn't make a difference
> when you tested the failover in the lab *never* makes a difference when it hits
> in the production network, right? Or they changed the config of the primary and
> it didn't get propagated just right to the backup, or they had mismatched firmware
> levels on the blades in the primary and backup switches, so traffic that
> didn't tickle a bug on the primary blades caused the blade to crash on the backup,
> or...
>
> Anybody on this list who's been around long enough probably has enough "We
> should have had N+2 because the N+1'th device failed too" stories to drain
> *several* pitchers of beer at a good pub... I've even had one case where my
> butt got *saved* from an ohnosecond-class whoops because the N+1'th device *was*
> crashed (stomped a config file, it replicated, was able to salvage a copy from
> a device that didn't replicate because it was down at the time).
>