[77629] in North American Network Operators' Group


RE: Resilience: faults, causes, statistics, open issues

daemon@ATHENA.MIT.EDU (András Császár)
Fri Jan 28 05:32:02 2005

From: András Császár (IJ/ETH) <Andras.Csaszar@ericsson.com>
To: David Andersen <dga@lcs.mit.edu>
Cc: nanog@merit.edu
Date: Fri, 28 Jan 2005 11:30:06 +0100
Errors-To: owner-nanog-outgoing@merit.edu


Hi David, this is going to be very useful, I really appreciate it,
thank you very much.

Just a few comments about the root causes of BGP-related problems; maybe
you will find something useful in them from the research perspective,
although this is probably not news to you.

I found a few author groups with closely related and useful papers:

- Tim Griffin and co.
- Nick Feamster and co.
- Jennifer Rexford and co.
- Lixin Gao and co.

These people often publish jointly but sometimes separately as well.
Also, Craig Labovitz and co. have some very useful papers in the area
of routing convergence time.

The IRTF also has some interesting, futuristic and somewhat visionary
drafts about "Future Domain Routing".

As I see things now, in the case of BGP, routing divergence,
configuration and policies are very strongly correlated.

A high-level conclusion (probably what you can expect from half a year
of paper- and presentation-reading research) is that the first root
cause of BGP problems is the absence of a >>widely deployed and
practical<< formal language for policies. Since there is no formal
language, there is no compiler, and so you end up with unwanted
anomalies resulting from your config.

My conclusion was that BGP has an analogy to software development:

SW: Specification => High-level formal language (e.g. C++) => Low-level
formal language (assembly, binary, etc.)

Both steps can be called implementation or compilation. The good thing
here is that you have automated compilers for the second step, which is
the harder one.

BGP: Business relation => Policies => Router configuration

First you implement your business relations by thinking out policies,
but in the end you have to implement/compile your policies as router
configuration. The problem is that there is no automated compiler for
the second step, since there is no formal policy language, and so
verification is also very hard.

As a result, you may have configuration bugs, your config may not do
what you originally wanted it to do, you may have inconsistency among
your routers, etc.
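Just to make the analogy concrete, here is a toy sketch of what such a
policy "compiler" could look like. Everything in it (the AS numbers, the
addresses, the pseudo-Cisco output format) is invented purely for
illustration; it is not a real tool:

```python
# Toy illustration of the Business relation => Policies => Router config
# "compilation" chain. All AS numbers and addresses are from the
# documentation ranges and the output is only Cisco-*style*, not real
# vendor syntax.

# Step 1: business relations (the "specification").
RELATIONS = {
    64501: "customer",  # we carry their traffic anywhere
    64502: "peer",      # we exchange customer routes only
    64503: "provider",  # they carry our traffic anywhere
}

def derive_policies(relations):
    """Step 2: derive policies (the "high-level language") from the
    relations, using the classic rules: announce customer routes to
    everyone, announce peer/provider routes only to customers."""
    policies = {}
    for asn, rel in relations.items():
        if rel == "customer":
            policies[asn] = {"accept": "ANY", "announce": "ANY"}
        else:  # peer or provider
            policies[asn] = {"accept": "ANY", "announce": "CUSTOMER-ROUTES"}
    return policies

def compile_config(local_as, policies):
    """Step 3: "compile" the policies into router configuration (the
    "low-level language"). In reality this step is done by hand, which
    is exactly the problem described above."""
    lines = ["router bgp %d" % local_as]
    for asn in sorted(policies):
        pol = policies[asn]
        lines.append(" neighbor 192.0.2.%d remote-as %d" % (asn - 64500, asn))
        lines.append("  ! accept %s, announce %s"
                     % (pol["accept"], pol["announce"]))
    return "\n".join(lines)

print(compile_config(64500, derive_policies(RELATIONS)))
```

The point of the sketch is only that each step loses information that a
verifier would need: once you are down at the config level, the original
business intent is gone.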

Of course, it is clear why such a formal language and compiler are not
used in practice (different router vendors, different features,
different capabilities, no standard interface, etc.), although there
is, e.g., RPSL and the tools built upon RPSL. Lately, Griffin and co.
have begun thinking about a completely new policy language.
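(For the record, an RPSL policy is roughly this kind of thing; the
fragment below is simplified from what RFC 2622 specifies, with AS
numbers from the documentation range:

```
aut-num:    AS64500
as-name:    EXAMPLE-AS
import:     from AS64501 accept AS64501
export:     to AS64501 announce AS64500
```

so the formal language exists, it is just not widely and consistently
deployed.)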


The second root cause, which I think can be somewhat separated from the
first, is that there is no practically used central database of
policies. You do not necessarily know what your neighbour operators are
doing (their configs and policies). As a result you may have external
inconsistency (which may lead to divergence, "wedgies", etc.).

Of course, here too it is clear why, e.g., IRRs are not used or not
updated frequently (the information-hiding principle, which is actually
the basis of the hierarchical domain structure of the Internet).


So, in the end, although we can possibly identify the root causes
behind BGP problems, I'm not sure they can ever be fully eliminated.
OK, I can imagine a formal language and config compiler, and one can
find verification tools as well, but I can hardly imagine, e.g., the
sharing of policies (although some papers describe methods to infer the
necessary knowledge from measurements).

Thanks again for your help,
András

p.s. Sorry for the long mail :) :)


----Original Message----
From: David Andersen [mailto:dga@lcs.mit.edu]
Sent: 27 January 2005 17:38
To: András Császár (IJ/ETH)
Cc: nanog@merit.edu
Subject: Re: Resilience: faults, causes, statistics, open issues

> On Jan 27, 2005, at 6:39 AM, András Császár (IJ/ETH) wrote:
>
>>
>> Hi people!
>>
>> I've begun research on (carrier-grade, aka telecom-grade) resiliency
>> in IP transport networks. The first step would be to collect possible
>> failure events, their causes and consequences, statistics about
>> downtimes (mean time to repair) and mean times between failures, and
>> I would like to identify which of the problems are most typical (HW
>> bug, SW bug, cable cut through, plugged out (link going down),
>> severe misconfiguration).
>>
>> I think this is the perfect forum to get some feedback from real
>> network-operational experience.
>>
>> Is anyone out there who has some statistics/documents that would help
>> me in any way?
>
> This is self-serving, but see the intro and related work sections of
> my thesis (we'll have a conference paper version of it done soon for
> NSDI, but we're still revising it.  Apologies for not having a shorter
> reference to give you):
>
>    http://nms.lcs.mit.edu/papers/index.php?detail=113
>
> It doesn't focus specifically on carrier failures, but it has a batch
> of references that might get you started on what the academic side
> knows.  I've also got some refs in there to some of the earlier telco
> studies, which I recommend taking a peek at.  Again, relation to year
> 2005 ISP failures isn't totally clear, but it's a starting point.
>
> Unfortunately, the reality is that we don't actually know all that
> much as far as what's _really_ happening!  Nick Feamster and I took a
> look at some of the BGP routing failures (but didn't get back to root
> causes):
>
> http://nms.lcs.mit.edu/papers/index.php?detail=23
>
> Nick's also done some work on configuration management and building a
> better routing protocol that's somewhat related to your question.
>
> Ratul Mahajan examined BGP configuration errors - but it's not clear
> exactly what fraction of failures or downtime are really due to those
> errors:
>
> http://www.cs.washington.edu/homes/ratul/bgp/index.html
>
> David Oppenheimer studied failures at a few edge companies (app.
> service providers, hosting providers, etc.).  Has a nice breakdown of
> failure causes and durations, but it's not clear if those numbers
> directly translate to the carrier realm:
>
> http://roc.cs.berkeley.edu/papers/usits03.pdf
>
> Finally, google back for some of Sean Donelan's NANOG posts.  You'll
> get some good individual cases from those, though the last time I
> looked, I didn't find a big overall analysis.
>
>
>> Also, do you have any suggestions on open research issues to be
>> solved in the area?
>
>    Most of it. :)  I (and probably others on this list) would be
> interested in what you find.
>
>    -Dave
