[193322] in North American Network Operators' Group


Re: Soliciting your opinions on Internet routing: A survey on BGP

daemon@ATHENA.MIT.EDU (Mike Jones)
Tue Jan 10 16:34:08 2017

X-Original-To: nanog@nanog.org
In-Reply-To: <20170110195802.GD2066@hanna.meerval.net>
From: Mike Jones <mike@mikejones.in>
Date: Tue, 10 Jan 2017 21:31:20 +0000
To: Job Snijders <job@instituut.net>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On 10 January 2017 at 19:58, Job Snijders <job@instituut.net> wrote:
> On Tue, Jan 10, 2017 at 03:51:04AM +0100, Baldur Norddahl wrote:
>> If a transit link goes, for example because we had to reboot a router,
>> traffic is supposed to reroute to the remaining transit links.
>> Internally our network handles this fairly fast for egress traffic.
>>
>> However the problem is the ingress traffic - it can be 5 to 15 minutes
>> before everything has settled down. This is the time before everyone
>> else on the internet has processed that they will have to switch to
>> your alternate transit.
>>
>> The only solution I know of is to have redundant links to all transits.
>
> Alternatively, if you reboot a router, perhaps you could first shut down
> the eBGP sessions, then wait 5 to 10 minutes for the traffic to drain
> away (should be visible in your NMS stats), and then proceed with the
> maintenance?
>
> Of course this only works for planned reboots, not surprise reboots.
>
> Kind regards,
>
> Job

If I tear down my eBGP sessions, the upstream router withdraws the
route and the traffic just stops. Are your upstreams propagating
withdrawals without actually updating their own routing tables?

I believe the simplest way to see the problem is to fire up an inbound
mtr from a distant network and then withdraw the route from the path
it is taking. It should show either destination unreachable or a
routing loop which "retreats" (under the right circumstances I have
observed it distinctly move 1 hop at a time) until it finds an
alternate path.
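
To put a number on that gap, here is a minimal sketch, assuming you
have logged timestamped probe results from the distant network (say
one ping per second) while withdrawing the route. The function name
and the (timestamp, reachable) sample format are my own invention for
illustration; mtr does not produce this directly.

```python
from typing import List, Optional, Tuple

# Hypothetical sample format: (unix_timestamp_seconds, probe_succeeded)
Sample = Tuple[float, bool]


def convergence_window(samples: List[Sample]) -> Optional[float]:
    """Seconds from the first failed probe to the next successful one.

    Returns None if no probe failed, or if reachability never came
    back within the samples provided.
    """
    first_fail: Optional[float] = None
    for ts, ok in samples:
        if not ok and first_fail is None:
            # First lost probe: the withdraw has started to bite.
            first_fail = ts
        elif ok and first_fail is not None:
            # First probe that gets through again: an alternate path
            # is now in use.
            return ts - first_fail
    return None


# One probe per second; probes at t=2..8 are lost, t=9 succeeds again.
samples = [(float(t), not (2 <= t <= 8)) for t in range(12)]
print(convergence_window(samples))  # 7.0
```

The result is only as precise as the probe interval, and it measures
the view from one vantage point, so it is worth repeating from a few
different networks.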

However, my observed convergence times for a single withdraw are in
the sub-10-second range: that is how long it takes for all the networks
on the original path to point at a new one. My view is that if you are
failing over frequently enough for a customer to notice and report it,
you have bigger problems than convergence times.

- Mike Jones
