[193428] in North American Network Operators' Group


Re: External BGP Controller for L3 Switch BGP routing

daemon@ATHENA.MIT.EDU (joel jaeggli)
Tue Jan 17 00:22:29 2017

To: Yucong Sun <sunyucong@gmail.com>, Tore Anderson <tore@fud.no>,
 Saku Ytti <saku@ytti.fi>
From: joel jaeggli <joelja@bogus.com>
Date: Mon, 16 Jan 2017 21:22:16 -0800
Cc: nanog list <nanog@nanog.org>
Message-ID: <4ff16eef-8756-2e28-99c2-542e02079cb7@bogus.com>
In-Reply-To: <CAJygYd15RV2JLbQGR9dzHTe7=79r1UYL229+FnsKX0MTfn=Lcg@mail.gmail.com>
References: <432759437.1584530.1484371476169.JavaMail.zimbra@snappytelecom.net>
 <CABNB40UOKQucVUdu2zVH_QZVmMYkny2FQAXfg7oG9X8iLFftUw@mail.gmail.com>
 <CAAeewD_ZE=GoEtnpvm50ciUjpZACQ89rUAHkpsxuczO9stVi_g@mail.gmail.com>
 <20170116074047.4bb46a13@echo.ms.redpill-linpro.com>
 <CAJygYd15RV2JLbQGR9dzHTe7=79r1UYL229+FnsKX0MTfn=Lcg@mail.gmail.com>

On 1/15/17 11:00 PM, Yucong Sun wrote:
> In my setup, I use a BIRD instance to combine multiple Internet full
> tables, and I use some filters to generate override routes to send to
> my L3 switch to do routing. The L3 switch is configured with a default
> route to the main transit provider, so if BIRD is down, routing would
> be unoptimized, but everything else remains operable until I fix that
> BIRD instance.
>
> I've asked around about why there isn't an L3 switch capable of
> handling full tables; I really don't understand the difference/logic
> behind it.
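
[For concreteness: a minimal BIRD 1.x-style sketch of the kind of setup
described above. The ASNs, neighbor addresses, and override prefix are
invented for illustration; the poster's actual filters are not shown in
the thread.]

    # Two upstreams feed full tables into BIRD; a filter exports only
    # selected more-specific "override" routes down to the L3 switch.
    protocol bgp transit_a {
            local as 64512;                  # hypothetical ASNs/addresses
            neighbor 192.0.2.1 as 64496;
            import all;
            export none;
    }

    protocol bgp transit_b {
            local as 64512;
            neighbor 192.0.2.2 as 64497;
            import all;
            export none;
    }

    filter overrides {
            # pass only prefixes we want steered away from the switch's
            # static default toward a better exit
            if net ~ [ 198.51.100.0/24+ ] then accept;
            reject;
    }

    protocol bgp l3_switch {
            local as 64512;
            neighbor 203.0.113.10 as 64512;  # iBGP down to the switch
            import none;
            export filter overrides;
    }

If the BIRD instance dies, the iBGP session to the switch drops and the
override routes are withdrawn, so the switch falls back to its static
default route, matching the failure mode described.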

In practice there are several merchant silicon implementations that
support the addition of external TCAMs. Building them in accordingly
increases the COGS and imposes various performance and packaging
limitations.

The Arista 7280R and Cisco NCS 5500 are Broadcom Jericho-based devices
that are packaged accordingly.

Ethernet merchant silicon is heavily biased towards doing most if not
all of the IO on the same ASIC, with limitations driven by gate size,
die size, heat dissipation, pin count, and so on.

There was a recent packet pushers episode with Pradeep Sindhu that
touched on some of these issues:

http://packetpushers.net/podcast/podcasts/show-315-future-networking-pradeep-sindhu/


> On Sun, Jan 15, 2017 at 10:43 PM Tore Anderson <tore@fud.no> wrote:
>
>> Hi Saku,
>>
>>>>
>> https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.html
>>>
>>> ---
>>> As described in a previous post, we’re testing a HPE Altoline 6920 in
>>> our lab. The Altoline 6920 is, like other switches based on the
>>> Broadcom Trident II chipset, able to handle up to 720 Gbps of
>>> throughput, packing 48x10GbE + 6x40GbE ports in a compact 1RU chassis.
>>> Its price is in all likelihood a single-digit percentage of the price
>>> of a traditional Internet router with a comparable throughput rating.
>>> ---
>>>
>>> This makes it sound like a small-FIB router costs a single-digit
>>> percentage of a full-FIB one.
>>
>> Do you know of any traditional «Internet scale» router that can do
>> ~720 Gbps of throughput for less than 10x the price of a Trident II
>> box? Or even <100kUSD? (Disregarding any volume discounts.)
>>
>>> Also having Trident on an Internet-facing interface may be suspect,
>>> especially if you need to go from a fast interface to a slow or busy
>>> interface, due to its very small packet buffers. This obviously won't
>>> be much of a problem for intra-DC traffic.
>>
>> Quite the opposite, changing between different interface speeds happens
>> very commonly inside the data centre (and most of the time it's done by
>> shallow-buffered switches using Trident II or similar chips).
>>
>> One ubiquitous configuration has the servers and any external uplinks
>> attached with 10GE to leaf switches, which in turn connect to a 40GE
>> spine layer. In this config server<->server and server<->Internet
>> packets will need to change speed twice:
>>
>> [server]-10GE-(leafX)-40GE-(spine)-40GE-(leafY)-10GE-[server/internet]
>>
>> I suppose you could for example use a couple of MX240s or something as
>> a special-purpose leaf layer for external connectivity.
>> MPC5E-40G10G-IRB or something towards the 40GE spines and any regular
>> 10GE MPC towards the exits. That way you'd only have one
>> shallow-buffered speed conversion remaining. But I'm very sceptical
>> that something like this makes sense after taking the cost/benefit
>> ratio into account.
>>
>> Tore
>>
>



