[77620] in North American Network Operators' Group


Re: Resilience: faults, causes, statistics, open issues

daemon@ATHENA.MIT.EDU (David Andersen)
Thu Jan 27 11:38:28 2005

In-Reply-To: <F005CD411D18D3119C8F00508B0874801245C7FA@ehubunt100.eth.ericsson.se>
Cc: nanog@merit.edu
From: David Andersen <dga@lcs.mit.edu>
Date: Thu, 27 Jan 2005 11:37:33 -0500
To: András Császár (IJ/ETH) <Andras.Csaszar@ericsson.com>
Errors-To: owner-nanog-outgoing@merit.edu



On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:

>
> Hi people!
>
> I've begun research on (carrier-grade, aka telecom-grade) resiliency
> in IP transport networks. The first step would be to collect possible
> failure events, their causes and consequences, statistics about
> downtimes (mean time to repair) and mean times between failures, and I
> would like to identify which of the problems are most typical (HW bug,
> SW bug, cable cut through, plugged out (link going down), severe
> misconfiguration).
>
> I think this is the perfect forum to get some feedback from real
> network-operational experience.
>
> Is anyone out there who has some statistics/documents that would help
> me in any way?

This is self-serving, but see the intro and related work sections of my
thesis (we'll have a conference paper version of it done soon for NSDI,
but we're still revising it.  Apologies for not having a shorter
reference to give you):

   http://nms.lcs.mit.edu/papers/index.php?detail=113

It doesn't focus specifically on carrier failures, but it has a batch
of references that might get you started on what the academic side
knows.  I've also got some refs in there to some of the earlier telco
studies, which I recommend taking a peek at.  Again, the relation to
year 2005 ISP failures isn't totally clear, but it's a starting point.

Unfortunately, the reality is that we don't actually know all that much
about what's _really_ happening!  Nick Feamster and I took a look
at some of the BGP routing failures (but didn't trace them back to root
causes):

http://nms.lcs.mit.edu/papers/index.php?detail=23

Nick's also done some work on configuration management and building a
better routing protocol that's somewhat related to your question.

Ratul Mahajan examined BGP configuration errors - but it's not clear
exactly what fraction of failures or downtime are really due to those
errors:

http://www.cs.washington.edu/homes/ratul/bgp/index.html

David Oppenheimer studied failures at a few edge companies (app.
service providers, hosting providers, etc.).  He has a nice breakdown of
failure causes and durations, but it's not clear if those numbers
directly translate to the carrier realm:

http://roc.cs.berkeley.edu/papers/usits03.pdf

Finally, google back for some of Sean Donelan's NANOG posts.  You'll
get some good individual cases from those, though the last time I
looked, I didn't find a big overall analysis.


> Also, do you have any suggestions on open research issues to be solved
> in the area?

   Most of it. :)  I (and probably others on this list) would be
interested in what you find.

   -Dave

