[77620] in North American Network Operators' Group
Re: Resilience: faults, causes, statistics, open issues
daemon@ATHENA.MIT.EDU (David Andersen)
Thu Jan 27 11:38:28 2005
In-Reply-To: <F005CD411D18D3119C8F00508B0874801245C7FA@ehubunt100.eth.ericsson.se>
Cc: nanog@merit.edu
From: David Andersen <dga@lcs.mit.edu>
Date: Thu, 27 Jan 2005 11:37:33 -0500
To: András Császár (IJ/ETH) <Andras.Csaszar@ericsson.com>
Errors-To: owner-nanog-outgoing@merit.edu
On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:
>
> Hi people!
>
> I've begun research on (carrier-grade, aka telecom-grade) resiliency
> in IP transport networks. The first step would be to collect possible
> failure events, their causes and consequences, statistics about
> downtimes (mean time to repair) and mean times between failures, and I
> would like to identify which of the problems are most typical (HW bug,
> SW bug, cable cut through, plugged out (link going down), severe
> misconfiguration).
>
> I think this is the perfect forum to get some feedback from real
> network-operational experience.
>
> Is anyone out there who has some statistics/documents that would help
> me in any way?
This is self-serving, but see the intro and related work sections of my
thesis (we'll have a conference paper version of it done soon for NSDI,
but we're still revising it. Apologies for not having a shorter
reference to give you):
http://nms.lcs.mit.edu/papers/index.php?detail=113
It doesn't focus specifically on carrier failures, but it has a batch
of references that might get you started on what the academic side
knows. I've also got some refs in there to some of the earlier telco
studies, which I recommend taking a peek at. Again, the relation to year
2005 ISP failures isn't totally clear, but it's a starting point.
Unfortunately, the reality is that we don't actually know all that much
as far as what's _really_ happening! Nick Feamster and I took a look
at some of the BGP routing failures (but didn't get back to root
causes):
http://nms.lcs.mit.edu/papers/index.php?detail=23
Nick's also done some work on configuration management and building a
better routing protocol that's somewhat related to your question.
Ratul Mahajan examined BGP configuration errors - but it's not clear
exactly what fraction of failures or downtime are really due to those
errors:
http://www.cs.washington.edu/homes/ratul/bgp/index.html
David Oppenheimer studied failures at a few edge companies (app.
service providers, hosting providers, etc.). He has a nice breakdown of
failure causes and durations, but it's not clear if those numbers
directly translate to the carrier realm:
http://roc.cs.berkeley.edu/papers/usits03.pdf
Finally, google back for some of Sean Donelan's NANOG posts. You'll
get some good individual cases from those, though the last time I
looked, I didn't find a big overall analysis.
> Also, do you have any suggestions on open research issues to be solved
> in the area?
Most of it. :) I (and probably others on this list) would be
interested in what you find.
-Dave