[142376] in North American Network Operators' Group


Re: BGP Design question.

daemon@ATHENA.MIT.EDU (Owen DeLong)
Thu Jun 23 15:15:50 2011

From: Owen DeLong <owen@delong.com>
In-Reply-To: <4E033531.5050604@gmail.com>
Date: Thu, 23 Jun 2011 12:15:03 -0700
To: -Hammer- <bhmccie@gmail.com>
Cc: nanog@nanog.org
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

Except in those (becoming less rare than hardware failure) instances
where the software controlling the failover process is the actual cause
of the outage.

Owen

On Jun 23, 2011, at 5:44 AM, -Hammer- wrote:

> Agreed. At an enterprise level, there is no need to risk extended
> downtime to save a buck or two. Redundant hardware is always a good way
> to keep Murphy out of the equation. And as far as hardware failures go,
> it's not that common. Nowadays it's the bugs in overly complicated code
> on your gear that get you first. I miss IOS 11.3.....
>
> -Hammer-
>
> On 06/23/2011 01:07 AM, Bret Palsson wrote:
>> That's fine if you are running a website. When it comes to
>> telecommunications, a 15 minute outage is pretty huge. Especially with
>> certain types of customers: emergency services for example.
>>
>> -Bret
>>
>> On Jun 23, 2011, at 12:02 AM, Hank Nussbacher wrote:
>>
>>> At 20:42 22/06/2011 -0700, Jason Roysdon wrote:
>>>
>>> Let me be a bit of a heretic here.  How often does your router fail?
>>> Or your firewall?  In the 25 years I have gone into customers I have
>>> found when they did a cross setup as proposed below by Bret and Jason,
>>> only one person truly knew the complete setup and if something broke
>>> only he was able to fix it.  There is never complete printed
>>> documentation: routing design, IPs on all interfaces, subnetting
>>> schematic, etc.  And if there was at one point, after 2 years it was
>>> outdated and never updated and only the *1* guy knew the changes in his
>>> head.
>>>
>>> In that kind of situation, when something stopped working they
>>> always had to call in the "guru" to fix it.  On the other hand, a simple
>>> design of only *one* path (pick either left or right side of each of the
>>> ASCII arts), made it possible that even junior network engineers as well
>>> as technicians called in on emergency with 4 hours notice, were able to
>>> fix the situation much more quickly than the "cross" design.  And the
>>> MTBF on a single path solution, IMHO, is around 3-4 years.  And if you
>>> need redundancy, keep a spare box on a shelf, completely loaded with the
>>> latest config so that it can be hot-swapped in within 15 minutes of
>>> failure.
>>>
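As a rough worked example of the figures in the paragraph above, the sketch below turns an MTBF of 3-4 years and a 15-minute swap into steady-state availability and expected downtime per year. It assumes the 15 minutes is the entire repair time (detection and travel are ignored) and uses the textbook formula MTBF / (MTBF + MTTR). The function names are illustrative and do not come from the thread.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_years, mttr_minutes):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    mtbf_minutes = mtbf_years * MINUTES_PER_YEAR
    return mtbf_minutes / (mtbf_minutes + mttr_minutes)


def expected_downtime_minutes_per_year(mtbf_years, mttr_minutes):
    """Average minutes of outage per year implied by that availability."""
    return (1 - availability(mtbf_years, mttr_minutes)) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Assumption: the 15-minute hot swap is the whole repair time.
    for years in (3, 4):
        a = availability(years, 15)
        d = expected_downtime_minutes_per_year(years, 15)
        print(f"MTBF {years}y: availability {a:.6f}, ~{d:.2f} min/year down")
```

Under these assumptions the average works out to roughly 5 minutes per year for a 3-year MTBF and about 3.75 minutes for a 4-year MTBF. The average hides the fact that the outage arrives as a single 15-minute event, which is the concern raised elsewhere in the thread for telecom and emergency-service customers.
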
>>> This 1-path design is not for everyone.  The vendors always
>>> recommend the "cross" design since they sell 2x the amount of boxes but
>>> I have found that life works fine with just a 1-path design as well.
>>>
>>> -Hank
>>>
>>>> I second the static routes, especially from a simplicity standpoint.  Add
>>>> in a pair of layer two switches to simplify further:
>>>>
>>>>     +--------+    +--------+
>>>>     | Peer A |    | Peer A |<-Many carriers. Using 1 carrier
>>>>     +---+----+    +----+---+    for this scenario.
>>>>         |eBGP          | eBGP
>>>>         |              |
>>>>     +---+----+iBGP+----+---+
>>>>     | Router +    + Router |<- Routers. Not directly connected
>>>>     +-+------+    +------+-+
>>>>       |                  |
>>>>     +-+------+    +------+-+
>>>>     |L2Switch|----|L2Switch|<- Layer 2 switches, can be stacked
>>>>     +--------+    +--------+
>>>>       |                  |
>>>>     +-+------+    +------+-+
>>>>     |Act. FW |----|Pas. FW |<-Firewalls Active/Passive.
>>>>     +--------+    +--------+
>>>>
>>>> You can lose all of the left leg, or all of the right leg, and still be
>>>> up.  If you want to complicate things, you can add crossing links
>>>> between it all, but again, beyond BGP and VRRP, this is a very simple
>>>> design you can easily troubleshoot at 3AM.  It's also much easier to
>>>> document the troubleshooting steps (so you can go on vacation and
>>>> someone else can solve without calling you) and test upgrades.
>>>>
>>>> You can nearly evenly split the traffic by having a VRRP VIP on each
>>>> edge router, with the other router backing up the first.  The firewalls
>>>> can have two static routes, one to each VIP, and this will roughly
>>>> load-balance the traffic out on a packet basis.  As you peer with the
>>>> same ISP, this will work just fine.  If they have an outage, your edge
>>>> routers will learn, and even if the circuit drops it'll know, and
>>>> basically the VIP will just redirect traffic to the other router.
>>>>
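The two-VIP arrangement described above can be modelled in a few lines. This is only a sketch of the idea, not router configuration: the VIP addresses (taken from the 192.0.2.0/24 documentation range), the router names "left" and "right", and the priority values are all assumptions for illustration. Each VIP is owned by whichever live router has the higher priority for that group, and the firewalls simply hold one static route per VIP.

```python
from dataclasses import dataclass


@dataclass
class VrrpGroup:
    vip: str
    priorities: dict  # router name -> priority for this group (higher wins)

    def master(self, alive):
        """Return the live router with the highest priority, or None."""
        live = {r: p for r, p in self.priorities.items() if r in alive}
        return max(live, key=live.get) if live else None


def firewall_next_hops(groups, alive):
    """Map each static-route target (VIP) to the router currently owning it."""
    return {g.vip: g.master(alive) for g in groups}


# Hypothetical values: each router is preferred for one VIP and backs up the other.
GROUPS = [
    VrrpGroup("192.0.2.1", {"left": 110, "right": 100}),
    VrrpGroup("192.0.2.2", {"left": 100, "right": 110}),
]

if __name__ == "__main__":
    print(firewall_next_hops(GROUPS, {"left", "right"}))  # traffic split across both
    print(firewall_next_hops(GROUPS, {"right"}))          # left router lost
```

With both routers up, the two static routes resolve to different routers, which gives the rough split described; when one router disappears, both VIPs land on the survivor and the firewalls need no routing change.
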
>>>> Now all your firewalls have to do is maintain stateful session
>>>> information, not OSPF.
>>>>
>>>> If you had two different ISPs (especially if they are not roughly evenly
>>>> connected), then not having intelligence of the BGP paths in your
>>>> firewalls can cause an extra hop when it hits the router with the longer
>>>> path, which will redirect it to the router with the shorter path.
>>>>
>>>> Speaking from a Cisco/HSRP point of view, you could be more intelligent
>>>> (read: more complicated, and complication means harder troubleshooting and
>>>> more documentation needed) during problem periods by having the VIP move
>>>> routers automatically based on the WAN link dropping and/or a route
>>>> beyond it being lost (others can comment on whether VRRP supports this).
>>>> This would save one hop to the "broken" router when the BGP path or WAN
>>>> is down.
>>>>
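The tracking behaviour described above amounts to lowering a router's priority while its tracked WAN link (or route) is down, so that the peer takes over the VIP. The sketch below shows only that logic. It assumes preemption is enabled, so the router with the higher effective priority always becomes active, and the priority and decrement values are hypothetical rather than vendor defaults.

```python
def effective_priority(base, tracked_up, decrement):
    """Priority after tracking: reduced while the tracked object is down."""
    return base if tracked_up else base - decrement


def active_router(routers):
    """routers maps name -> (base_priority, tracked_up, decrement).

    Assumes preemption, so the highest effective priority wins.
    """
    return max(routers, key=lambda name: effective_priority(*routers[name]))


if __name__ == "__main__":
    healthy = {"left": (110, True, 20), "right": (100, True, 20)}
    wan_down = {"left": (110, False, 20), "right": (100, True, 20)}
    print(active_router(healthy))   # left holds the VIP
    print(active_router(wan_down))  # VIP moves to right
```

The decrement has to be larger than the gap between the two base priorities, otherwise the router with the failed WAN link keeps the VIP and the extra hop remains.
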
>>>> Jason Roysdon
>>>>
>>>> On 06/22/2011 06:07 PM, Bret Palsson wrote:
>>>>
>>>>> On Wed, Jun 22, 2011 at 5:33 PM, PC <paul4004@gmail.com> wrote:
>>>>>
>>>>>> Who makes the firewall?
>>>>>>
>>>>> Juniper SSG. We use NSRP and replicate all the RTOs. We have hitless on the
>>>>> Firewalls, have for years. We're now peering with our own carriers vs. using
>>>>> our datacenter's mix.
>>>>>
>>>>> A static route from the junipers to the VIP (VRRP) is probably the way to
>>>>> go. I think.
>>>>>
>>>>>> To make this work and be "hitless", your firewall vendor must support
>>>>>> stateful replication of routing protocol data (including OSPF).  For
>>>>>> example, Cisco didn't support this in their ASA product until version 8.4 of
>>>>>> code.
>>>>>>
>>>>>> Otherwise, a failover requires OSPF to re-converge -- and quite frankly,
>>>>>> will likely cause some state of confusion on the upstream OSPF peers, loss
>>>>>> of adjacency, and a loss of routing until this occurs.  It's like someone
>>>>>> just swapped a router with the same IP to the upstream device -- assuming
>>>>>> your active/standby vendor's implementation only presents itself as one
>>>>>> device.
>>>>>>
>>>>>> However, once this is successful your current failover topology should work
>>>>>> fine -- even if it takes some time to fail over.
>>>>>>
>>>>>> In my opinion though, unless the firewall is serving as "transit" to
>>>>>> downstream routers or other layer 3 elements, and you need to run OSPF to it
>>>>>> (and through it) as a result, it's often just easier to static default route
>>>>>> out from the firewall(s) and redistribute a static route on the upstream
>>>>>> routers for the subnets behind the firewalls.  It also helps ensure
>>>>>> symmetrical traffic flows, which is important for stateful firewalls and can
>>>>>> become moderately confusing when your firewalls start having many interfaces.
>>>>>>
>>>>>> On Wed, Jun 22, 2011 at 4:27 PM, Bret Palsson <bret@getjive.com> wrote:
>>>>>>
>>>>>>> Here is my current setup in ASCII art. (Please view in a fixed width
>>>>>>> font.) Below the art I'll write out the setup.
>>>>>>>
>>>>>>>     +--------+    +--------+
>>>>>>>     | Peer A |    | Peer A |<-Many carriers. Using 1 carrier
>>>>>>>     +---+----+    +----+---+    for this scenario.
>>>>>>>         |eBGP          | eBGP
>>>>>>>         |              |
>>>>>>>     +---+----+iBGP+----+---+
>>>>>>>     | Router +----+ Router |<-Netiron CERs Routers.
>>>>>>>     +-+------+    +------+-+
>>>>>>>       |A   `.P    A.'    |P<-A/P indicates Active/Passive
>>>>>>>       |      `.  .'      |      link.
>>>>>>>       |        ::        |
>>>>>>>     +-+------+'  `+------+-+
>>>>>>>     |Act. FW |    |Pas. FW |<-Firewalls Active/Passive.
>>>>>>>     +--------+    +--------+
>>>>>>>
>>>>>>> To keep this scenario simple, I'm multihoming to one carrier.
>>>>>>> I have two Netiron CERs. Each has an eBGP connection to the same peer.
>>>>>>> The CERs have an iBGP connection to each other.
>>>>>>> That works all fine and dandy. Feel free to comment, however, if you think
>>>>>>> there is a better way to do this.
>>>>>>>
>>>>>>> Here comes the tricky part. I have two firewalls in an Active/Passive
>>>>>>> setup. When one fails the other is configured exactly the same
>>>>>>> and picks up where the other left off. (Yes, all the sessions etc. are
>>>>>>> actively mirrored between the devices)
>>>>>>>
>>>>>>> I am using OSPFv2 between the CERs and the Firewalls. Failover works just
>>>>>>> fine; however, when I fail an OSPF link that has the active default route,
>>>>>>> ingress traffic still routes fine and dandy, but egress traffic doesn't.
>>>>>>> Both Netirons' OSPF are set up to advertise that they are the default route.
>>>>>>>
>>>>>>> What I'm wondering is whether OSPF is the right solution for this. How do
>>>>>>> others solve this problem?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Bret
>>>>>>>
>>>>>>> Note: Since lately ipv6 has been a hot topic, I'll state that after we get
>>>>>>> the BGP all figured out and working properly, ipv6 is our next project. :)


