[193426] in North American Network Operators' Group


Re: External BGP Controller for L3 Switch BGP routing

daemon@ATHENA.MIT.EDU (joel jaeggli)
Mon Jan 16 20:45:11 2017

X-Original-To: nanog@nanog.org
To: Tore Anderson <tore@fud.no>, Saku Ytti <saku@ytti.fi>
From: joel jaeggli <joelja@bogus.com>
Date: Mon, 16 Jan 2017 17:45:00 -0800
In-Reply-To: <20170116155328.13a10b42@envy.e1.y.home>
Cc: nanog list <nanog@nanog.org>
Errors-To: nanog-bounces@nanog.org

On 1/16/17 6:53 AM, Tore Anderson wrote:
> * Saku Ytti
>
>> On 16 January 2017 at 14:36, Tore Anderson <tore@fud.no> wrote:
>>
>>> Put it another way, my «Internet facing» interfaces are typically
>>> 10GEs with a few (kilo)metres of dark fibre that x-connects into my
>>> IP-transit providers' routers sitting in nearby rooms or racks
>>> (worst case somewhere else in the same metro area). Is there any
>>> reason why I should need deep buffers on those interfaces?
>>
>> Imagine content network having 40Gbps connection, and client having
>> 10Gbps connection, and network between them is lossless and has RTT of
>> 200ms. To achieve 10Gbps rate receiver needs 10Gbps*200ms = 250MB
>> window, in worst case 125MB window could grow into 250MB window,  and
>> sender could send the 125MB at 40Gbps burst.
>> This means the port receiver is attached to, needs to store the 125MB,
>> as it's only serialising it at 10Gbps. If it cannot store it, window
>> will shrink and receiver cannot get 10Gbps.
>>
>> This is quite pathological example, but you can try with much less
>> pathological numbers, remembering TridentII has 12MB of buffers.
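
(Working through the numbers in the example above; a quick sketch in
python, using only the figures quoted there:)

    # window and burst arithmetic from the 40Gbps -> 10Gbps example
    receiver_rate_bps = 10e9          # client's 10Gbps port
    sender_rate_bps   = 40e9          # content network's 40Gbps port
    rtt_s             = 0.200         # 200ms lossless path

    # window needed to sustain 10Gbps at 200ms RTT
    window_bytes = receiver_rate_bps * rtt_s / 8            # 250 MB
    # worst case: a 125MB window doubles, and the extra 125MB arrives as one burst
    burst_bytes = window_bytes / 2                           # 125 MB
    burst_arrives_in = burst_bytes * 8 / sender_rate_bps     # ~25 ms at 40Gbps
    burst_drains_in  = burst_bytes * 8 / receiver_rate_bps   # ~100 ms at 10Gbps
    # the difference has to sit in the buffer of the receiver-facing port
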
>
> I totally get why the receiver need bigger buffers if he's going to
> shuffle that data out another interface with a slower speed.
>
> But when you're a data centre operator you're (usually anyway) mostly
> transmitting data. And you can easily ensure the interface speed facing
> the servers can be the same as the interface speed facing the ISP.

unlikely, given that the interfaces facing the servers are 1/10/25/50G
and the ones facing the isp are n x 10G or n x 100G.

> So if you consider this typical spine/leaf data centre network topology
> (essentially the same one I posted earlier this morning):
>
> (Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE-->
> (T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)
>=20
> If I understand you correctly you're saying this is a "suspect" topology
> that cannot achieve 10G transmission rate from server to client (or
> from client to server for that matter) because of small buffers on my
> "T2 leaf Y" switch (i.e., the one which has the Internet-facing
> interface)?

you can externalize the cost of the buffer, at the expense of latency,
from the t2, e.g. by enabling flow control facing the host or other
high-capacity device, or by engaging in packet pacing on the server if
the network is fairly shallow.
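
For the pacing case, a minimal per-socket sketch (linux-only; assumes the
fq qdisc, or a kernel with TCP internal pacing, on the egress interface,
and the SO_MAX_PACING_RATE value of 47 is an assumption about the local
headers):

    # pace one socket's transmissions instead of leaning on switch buffers
    import socket

    SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # option value is bytes per second; cap this socket at ~10Gb/s
    s.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, int(10e9 / 8))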

If the question is how to ensure high link utilization rather than
maximum throughput for this one flow, the buffer requirement may be
substantially lower.

e.g. if you are sizing based on

buffer = (round-trip delay * desired bandwidth) / sqrt(nr of flows)

http://conferences.sigcomm.org/sigcomm/2004/papers/p277-appenzeller1.pdf

rather than buffer = (round-trip delay * bandwidth)
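
To put numbers on the difference (illustrative only; assuming a 10Gb/s
port, 200ms RTT and ~100 concurrent long-lived flows):

    # buffer sizing: classic rule-of-thumb vs the sqrt(n) result above
    from math import sqrt

    bandwidth_bps = 10e9
    rtt_s = 0.200
    flows = 100                                   # assumed number of flows

    bdp_bytes = bandwidth_bps * rtt_s / 8
    classic_buffer = bdp_bytes                    # ~250 MB
    appenzeller_buffer = bdp_bytes / sqrt(flows)  # ~25 MB
    print(classic_buffer / 1e6, appenzeller_buffer / 1e6)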

> If so would it solve the problem just replacing "T2 leaf Y" with, say,
> a Juniper MX or something else with deeper buffers?

broadcom jericho / ptx / qfx, whatever; sure, it's plausible to have a
large buffer without using a feature-rich, extremely-large-fib asic.

> Or would it help to use (4x)10GE instead of 40GE for the links between
> the leaf and spine layers too, so there was no change in interface
> speeds along the path through the data centre towards the handoff to
> the IPT provider?

it can reduce the demand on the buffer; you can however multiplex two
or more flows that might otherwise run at 10Gb/s onto the same lag member.
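
(A toy illustration of the lag hashing point; the hash function and tuple
layout here are made up, real switches use their own:)

    # two elephant flows can hash onto the same member of a 4x10GE lag
    import zlib

    MEMBERS = 4

    def member_for(flow):
        # flow = (src_ip, dst_ip, proto, src_port, dst_port)
        return zlib.crc32(repr(flow).encode()) % MEMBERS

    a = ("192.0.2.10", "198.51.100.7", 6, 49152, 443)
    b = ("192.0.2.11", "198.51.100.7", 6, 49153, 443)
    print(member_for(a), member_for(b))  # a collision leaves both sharing one 10GE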

> Tore
>



