[116874] in North American Network Operators' Group

RE: Data Center testing

daemon@ATHENA.MIT.EDU (Deepak Jain)
Mon Aug 24 16:04:37 2009

From: Deepak Jain <deepak@ai.net>
To: Ken Gilmour <ken.gilmour@gmail.com>, Dan Snyder <sliplever@gmail.com>
Date: Mon, 24 Aug 2009 16:03:51 -0400
In-Reply-To: <5b6f80200908240631q7f711db5m26a4f195c647aaeb@mail.gmail.com>
Cc: NANOG list <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


Thanks for the kind words Ken.

Power failure testing and network testing are very different disciplines.

We operate from the point of view that if a failure occurs because we
have scheduled testing, it is far better since we have the resources
on-site to address it (as opposed to an unplanned event during a
hurricane). Not everyone has this philosophy.

This is one of the reasons we do monthly or bimonthly, full live load
transfer tests on power at every facility we own and control during the
morning hours (~10:00am local time on a weekday, run on gensets for up
to two hours). Of course there is sufficient staff and contingency
planning on-site to handle almost anything that comes up. The goal is
to have a measurable "good" outcome at our highest reasonable load
levels [temperature, data load, etc.].

We don't hesitate to show our customers and auditors our testing and
maintenance logs, go over our procedures, etc. They can even watch
events if they want (we provide the ear protection). I don't think any
facility of any significant size can operate differently and do it well.

This is NOT advisable for folks who do not do proper preventative
maintenance on their transfer busways, PDUs, switches, batteries,
transformers and of course generators. The goal is to identify
questionable relays, switches, breakers and other items that may fail
in an actual emergency.

On the network side, during scheduled maintenance we do live failovers
-- sometimes as dramatic as pulling the cable without preemptively
removing traffic. Part of *our* procedures is to make sure it reroutes
and heals the way it is supposed to before the work actually starts.
Often network and topology changes happen over time and no one has had
a chance to actually test that all the "glue" works right. Regular
planned maintenance (if you have a fast reroute capability in your
network) is a very good way to handle it.
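
One way to make "reroutes and heals the way it is supposed to"
measurable is to run timestamped reachability probes across the link
while the cable is pulled and look at the longest gap. The following is
only a rough sketch under assumptions that are not from the post: the
Probe record, the function names and the 2-second budget are all
hypothetical, and the probe data would come from whatever tool you
already use.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    t: float   # seconds since the start of the test (hypothetical field)
    ok: bool   # True if this probe got a reply


def longest_outage(probes: List[Probe]) -> float:
    """Longest span in seconds from the last good probe before a run of
    losses to the first good probe after it. Returns math.inf if the
    path never recovers before the probes end."""
    worst = 0.0
    last_ok = None      # time of the most recent successful probe
    loss_start = None   # reference time for the current run of losses
    for p in sorted(probes, key=lambda p: p.t):
        if p.ok:
            if loss_start is not None:
                worst = max(worst, p.t - loss_start)
                loss_start = None
            last_ok = p.t
        elif loss_start is None:
            # Count from the last known-good probe, or from the first
            # lost probe if nothing has succeeded yet.
            loss_start = last_ok if last_ok is not None else p.t
    return math.inf if loss_start is not None else worst


def healed_within(probes: List[Probe], budget_s: float) -> bool:
    """True if the worst outage seen fits inside the reroute budget."""
    return longest_outage(probes) <= budget_s


# Example: probes every 0.5 s, two lost while the network reroutes.
sample = [Probe(0.0, True), Probe(0.5, True), Probe(1.0, False),
          Probe(1.5, False), Probe(2.0, True), Probe(2.5, True)]
print(longest_outage(sample), healed_within(sample, 2.0))
```

The gap is measured from the last good probe rather than the first lost
one, so the result is an upper bound on the outage at the probe
interval you chose.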

For sensitive trunk links and non-invasive maintenance, it is nice to
softly remove traffic via local pref or whatever in advance of the
maintenance to minimize jitter during a major event.
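
To illustrate why lowering local pref drains a link: in BGP best-path
selection a higher LOCAL_PREF wins before AS path length is even
compared, so dropping it on one trunk moves traffic to the alternate
path without taking the session down. The toy model below is only a
sketch; the link names, the pref values and the drain() helper are
hypothetical, and real routers apply more tie-breakers than the two
shown here.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Path:
    link: str          # hypothetical label for the trunk carrying this path
    local_pref: int    # higher is preferred
    as_path_len: int   # shorter is preferred, compared after local_pref


def best_path(paths: List[Path]) -> Path:
    """Pick the path with the highest local_pref, breaking ties on the
    shortest AS path. This is a simplified two-step comparison."""
    return max(paths, key=lambda p: (p.local_pref, -p.as_path_len))


def drain(paths: List[Path], link: str, low_pref: int = 50) -> List[Path]:
    """Return a copy of paths with local_pref lowered on one link, as a
    policy change ahead of maintenance would do."""
    return [replace(p, local_pref=low_pref) if p.link == link else p
            for p in paths]


paths = [Path("trunk-a", 100, 2), Path("trunk-b", 100, 3)]
print(best_path(paths).link)                    # before the drain
print(best_path(drain(paths, "trunk-a")).link)  # after the drain
```

Restoring the original local pref after the window puts the traffic
back the same way.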

As part of your plan, be prepared for things like connectors (or
cables) breaking and have a plan for what you do if that occurs. Have a
plan or a rain-date if a connector takes a long time to get out or the
blade it sits in gets damaged. This stuff looks pretty while it's
running and you don't want something that has been friction-frozen to
ruin your window.

All of this works swimmingly until you find a vendor (X) bug. :) Not
for the faint-of-heart.

Anyone who has more specific questions, I'll be glad to answer off-line.

Deepak Jain
AiNET

> I know Peer1 in Vancouver regularly send out notifications of
> "non-impacting" generator load testing, like monthly. Also InterXion
> in Dublin, Ireland have occasionally sent me notification that there
> was a power outage of less than a minute however their backup
> successfully took the load.
>
> I only remember one complete outage in Peer1 a few years ago... Never
> seen any outage in InterXion Dublin.
>
> Also I don't ever remember any power failure at AiNet (Deepak will
> probably elaborate)
> 2009/8/24 Dan Snyder <sliplever@gmail.com>:
> > Does anyone know of any data centers that do failure testing of
> > their networking equipment regularly? I mean to verify that
> > everything fails over properly after changes have been made over
> > time. Are there any best practice guides for doing this?
> >
> > Thanks,
> > Dan
> >


