[116907] in North American Network Operators' Group


RE: Data Center testing

daemon@ATHENA.MIT.EDU (Deepak Jain)
Wed Aug 26 14:24:02 2009

From: Deepak Jain <deepak@ai.net>
To: Dylan Ebner <dylan.ebner@crlmed.com>, Dan Snyder <sliplever@gmail.com>,
	Ken Gilmour <ken.gilmour@gmail.com>
Date: Wed, 26 Aug 2009 14:22:49 -0400
In-Reply-To: <017265BF3B9640499754DD48777C3D206612CC1D7E@MBX9.EXCHPROD.USA.NET>
Cc: NANOG list <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org


The idea of regular testing is to essentially detect failures on your time
schedule rather than entropy's (or Murphy's). There can be flaws in your
testing methodology too. This is why generic load bank tests and network
load simulators rarely tell the whole story.

Customers are rightfully displeased with any testing that affects their
normal peace of mind, and doubly so when it affects actual operational
effectiveness. However, since no system can operate indefinitely without
maintenance, failover and other items, the question of taking a window is
not negotiable. The only thing that is negotiable (somewhat) is when, and
only in one direction (ahead of the item failing on its own).

So, take this concept to networks. Whether a link or a device will fail is
not negotiable; the only question is how long you are going to forward bits
along the dead path before rerouting, and how long that rerouting will take.
SONET says about 50 ms, standard BGP about 30-300 seconds. BFD and other
things may improve these dramatically in your setup. You build your network
around your business case and vice versa.
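
To put rough numbers on "how long you forward bits along the dead path",
here is a minimal arithmetic sketch. Only the roughly 50 ms SONET figure
comes from the message above; the function names, the 90 s and 180 s BGP
hold timers, the 300 ms x 3 BFD setting and the 1 Gb/s rate are all
assumptions chosen for illustration, and real detection times depend on
your platform and configuration.

```python
# Rough arithmetic sketch. All names and timer values here are assumptions
# for illustration, except the ~50 ms SONET figure quoted in the message.

def bgp_hold_timer_detection_s(hold_time_s: float) -> float:
    """Worst case for a silently dead peer with no faster trigger:
    the session is torn down only when the hold timer expires."""
    return float(hold_time_s)


def bfd_detection_s(tx_interval_ms: float, detect_multiplier: int) -> float:
    """BFD declares the path down after roughly interval x multiplier."""
    return tx_interval_ms * detect_multiplier / 1000.0


def gigabits_blackholed(rate_gbps: float, detection_s: float,
                        reroute_s: float = 0.0) -> float:
    """Traffic offered to the dead path before forwarding moves elsewhere."""
    return rate_gbps * (detection_s + reroute_s)


def scenarios() -> dict:
    """Detection time in seconds for a few example (assumed) setups."""
    return {
        "SONET protection (~50 ms, from the message)": 0.050,
        "BGP, 90 s hold timer (assumed)": bgp_hold_timer_detection_s(90),
        "BGP, 180 s hold timer (assumed)": bgp_hold_timer_detection_s(180),
        "BFD, 300 ms x 3 (assumed)": bfd_detection_s(300, 3),
    }


if __name__ == "__main__":
    for name, seconds in scenarios().items():
        lost = gigabits_blackholed(rate_gbps=1.0, detection_s=seconds)
        print(f"{name:45s} {seconds:8.3f} s  ~{lost:8.3f} Gbit at 1 Gb/s")
```

The sketch ignores reconvergence time after detection unless you pass
reroute_s, so treat its output as a lower bound on traffic sent to the dead
path, not a prediction for any particular network.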

Clearly, most of the known universe has decided that BGP time is "good
enough" for the Internet as a whole right now. Most are aware of the costs
in terms of overall jitter, CPU and stability if we reduce those times too
far.

It's intellectually dishonest to talk about never losing a packet or never
forwarding along a dead path for even a nanosecond when the state of the
art says something very different indeed.

Deepak Jain
AiNET

> -----Original Message-----
> From: Dylan Ebner [mailto:dylan.ebner@crlmed.com]
> Sent: Wednesday, August 26, 2009 11:33 AM
> To: Dan Snyder; Ken Gilmour
> Cc: NANOG list
> Subject: RE: Data Center testing
>
> I would hope that the data center engineers built and ran a suite of
> tests to find failure points before the network infrastructure was put
> into production. That said, changes are made constantly to the
> infrastructure and it can become very difficult very quickly to know if
> the failovers are still going to work. This is one place where the
> power and network in a datacenter diverge. The power systems may take
> on additional load over the course of the life of the facility, but the
> transfer switches and generators do not get many changes made to them.
> Also, network infrastructure tests are not going to be zero impact if
> there is a config problem. Generator tests are much easier. You can
> start up the generator and do a load test. You can also load test the
> UPS systems. Then you can initiate your failover. Network tests
> are not going to be zero impact even if there isn't a problem. Let's
> say you wanted to power fail an edge router participating in BGP; it can
> take 30 seconds for that router's route to get withdrawn from the BGP
> tables of the world. The other problem is that network failures always
> seem to come from "unexpected" issues. I always love it when I get an
> outage report from my ISPs or datacenter and they say an "unexpected
> issue" or "unforeseen issue" caused the problem.
>
>
> Dylan
> -----Original Message-----
> From: Dan Snyder [mailto:sliplever@gmail.com]
> Sent: Monday, August 24, 2009 8:39 AM
> To: Ken Gilmour
> Cc: NANOG list
> Subject: Re: Data Center testing
>
> We have done power tests before and had no problem.  I guess I am
> looking for someone who does testing of the network equipment outside
> of just power tests.  We had an outage due to a configuration mistake
> that became apparent when a switch failed.  It didn't cause a problem
> however when we did a power test for the whole data center.
>
> -Dan
>
>
> On Mon, Aug 24, 2009 at 9:31 AM, Ken Gilmour <ken.gilmour@gmail.com>
> wrote:
>
> > I know Peer1 in Vancouver regularly send out notifications of
> > "non-impacting" generator load testing, like monthly. Also InterXion
> > in Dublin, Ireland have occasionally sent me notification that there
> > was a power outage of less than a minute however their backup
> > successfully took the load.
> >
> > I only remember one complete outage in Peer1 a few years ago... Never
> > seen any outage in InterXion Dublin.
> >
> > Also I don't ever remember any power failure at AiNet (Deepak will
> > probably elaborate)
> >
> > 2009/8/24 Dan Snyder <sliplever@gmail.com>:
> > > Does anyone know of any data centers that do failure testing of
> > > their networking equipment regularly? I mean to verify that
> > > everything fails over properly after changes have been made over
> > > time.  Are there any best-practice guides for doing this?
> > >
> > > Thanks,
> > > Dan
> > >
> >
>


