[164566] in North American Network Operators' Group
Re: tools and techniques to pinpoint and respond to loss on a path
daemon@ATHENA.MIT.EDU (Michael DeMan)
Tue Jul 16 22:28:43 2013
From: Michael DeMan <nanog@deman.com>
In-Reply-To: <9F4D4FC766780045A8E7ECEA533A1A8D0367BBC8@CORPTPMAIL03.corp.theplatform.com>
Date: Tue, 16 Jul 2013 19:28:12 -0700
To: "nanog@nanog.org" <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org
What I have done in the past - and this presumes you have a /29 or bigger on the peering session to your upstreams - is to check with the direct upstream provider at each connection and get approval to put a Linux diagnostics server on the peering side of each BGP upstream connection you have, default-routed out to their BGP router(s). This is typically not a problem with the upstream as long as they know it is for diagnostic purposes and will be taken down later. It also lets the upstreams know you are looking seriously at the reliability they, and their competitors, are giving you.
On that diagnostics box, run some quick and dirty tools to start isolating whether the problem is related to one upstream link, the other, or a combination of them. Have each box monitor all the distant peer connections, and possibly the other local peers as well if you want to be exhaustive. The problem could be anywhere in between, but if you notice that one link has the issue and the other does not, and/or that a particular src/dst combination does, then you are in better shape to help your upstreams diagnose as well. Tools like smokeping, plus traceroute and ping run on a scripted basis, are not perfect but are easy to set up. Log everything, so that when the problem impacts production systems you can go back through those logs for clues. nettop is another handy tool for dumping out state, and in the unlikely case that you happen to be on the console when the problem occurs it is very useful.
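The scripted ping-and-traceroute logging could look something like the sketch below. This is only a quick and dirty illustration, assuming a Linux box with iputils `ping` and `traceroute` on the PATH; the target addresses, log path, and loss threshold are all placeholders to adjust for your own network.

```python
#!/usr/bin/env python
"""Quick-and-dirty per-upstream loss probe (sketch, not production).

Assumes Linux iputils `ping` and `traceroute`; the targets, log path,
and threshold below are hypothetical placeholders.
"""
import subprocess
import time

TARGETS = ["198.51.100.7", "203.0.113.9"]   # hypothetical distant peers
LOGFILE = "/var/log/lossprobe.log"          # placeholder log path
COUNT = 5                                   # pings per probe round

def lost_from_output(ping_output, count=COUNT):
    """Each reply line in iputils ping output contains 'bytes from'."""
    return count - ping_output.count("bytes from")

def probe_round():
    """Ping every target once; log loss, plus a traceroute on heavy loss."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with open(LOGFILE, "a") as log:
        for target in TARGETS:
            out = subprocess.run(
                ["ping", "-c", str(COUNT), "-W", "1", target],
                capture_output=True, text=True).stdout
            lost = lost_from_output(out)
            log.write("%s target=%s lost=%d/%d\n"
                      % (stamp, target, lost, COUNT))
            if lost >= 3:   # heavy loss: record the path while it is broken
                tr = subprocess.run(["traceroute", "-n", target],
                                    capture_output=True, text=True).stdout
                log.write(tr)

# Run probe_round() every minute from cron, or wrap it in a sleep loop.
```

Grabbing the traceroute at the moment of loss is the point of scripting this: the blackouts are over before a human can log in.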
From there, let that run for a while - hours, days, or weeks depending on the frequency of the problem - and typically you will find that the 'hiccup' happens either via one peering partner or all of them, and/or from one end or the other. More than likely something will fall out of the data as to where the problem is, and often it is not with your direct peers but with their peers, or somebody else further down the chain.
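Once the logs have accumulated, a small script can tally which upstream the blackouts cluster behind. The sketch below assumes a hypothetical one-line-per-probe log format (timestamp, upstream link name, target, packets lost); adapt the regex to whatever your own probes actually write.

```python
import re
from collections import Counter

# Hypothetical log format: timestamp, upstream link, target, loss count.
SAMPLE_LOG = """\
2013-07-16T03:10:02Z upstreamA target=198.51.100.7 lost=0/5
2013-07-16T03:10:02Z upstreamB target=198.51.100.7 lost=5/5
2013-07-16T03:15:02Z upstreamA target=198.51.100.7 lost=0/5
2013-07-16T03:15:02Z upstreamB target=198.51.100.7 lost=4/5
"""

LINE_RE = re.compile(r"^\S+ (\S+) target=(\S+) lost=(\d+)/\d+")

def loss_events(log_text, threshold=3):
    """Count probe rounds per (upstream, target) with >= threshold lost."""
    events = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m and int(m.group(3)) >= threshold:
            events[(m.group(1), m.group(2))] += 1
    return events

print(loss_events(SAMPLE_LOG))
# If the counts pile up behind one upstream, you know where to start.
```

If the events spread evenly across all upstreams, the problem is more likely on the far end or somewhere in the middle.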
This kind of stuff is notoriously difficult to troubleshoot, and I generally agree with the opinion that, for better or worse, global IP connectivity is still just 'best effort' without spending immense amounts of money.
I remember a few years ago having blips and near one-hour outages from NW Washington State over to Europe; the problem was that Global Crossing was doing a bunch of maintenance and it was not going well for them. They were the 'man in the middle' for the routing from two different peers, and just knowing where the problem was helped a lot - with some creative BGP announcements we were able to minimize the impact.
- mike
On Jul 15, 2013, at 2:18 PM, Andy Litzinger <Andy.Litzinger@theplatform.com> wrote:
> Hi,
>
> Does anyone have any recommendations on how to pinpoint and react to packet loss across the internet, preferably in an automated fashion? For detection I'm currently looking at trying smoketrace to run from inside my network, but I'd love to be able to run traceroutes from my edge routers, triggered during periods of loss. I have Juniper MX80s on one end, where I'm hopeful I'll be able to cobble together some combination of RPM and event scripting to kick off a traceroute. We have Cisco 4900Ms on the other end, and maybe the same thing is possible there, but I'm not so sure.
>
> I'd love to hear other suggestions and experience, both for detection and for options on what I might be able to do when loss is detected on a path.
>
> In my specific situation I control equipment on both ends of the path that I care about, with details below.
>
> We are a hosted service company and we currently have two data centers, DC A and DC B. DC A uses Juniper MX routers, advertises our own IP space, and takes full BGP feeds from two providers, ISPs A1 and A2. At DC B we have a smaller installation and instead take redundant drops (and IP space) from a single provider, ISP B1, who then peers upstream with two providers, B2 and B3.
>
> We have a fairly consistent bi-directional stream of traffic between DC A and DC B. Both ISP A1 and A2 have good peering with ISP B2, so under normal network conditions traffic flows across ISP B1 to B2 and then to either ISP A1 or A2.
>
> oversimplified ascii pic showing only the normal best paths:
>
>         -- ISP A1----------------------ISP B2--
> DC A--|                                        |--- ISP B1 ----- DC B
>         -- ISP A2----------------------ISP B2--
>
>
> With increasing frequency we've been experiencing packet loss along the path from DC A to DC B. Usually the periods of loss are brief, 30 seconds to a minute, but they are total blackouts.
>
> I'd like to be able to collect enough relevant data to pinpoint the trouble spot as closely as possible so I can take it to the ISPs and request a solution. The blackouts are so quick that it's impossible to log in and get a trace by hand - hence the desire to automate it.
>
> I can provide more details off list if helpful - I'm trying not to vilify anyone, especially without copious amounts of data points.
>
> As a side question, what should my expectation be regarding packet loss when sending packets from point A to point B across multiple providers across the internet? Is 30 seconds to a minute of blackout between two destinations every couple of weeks par for the course? My directly connected ISPs offer me an SLA, but what should I reasonably expect from them when one of their upstream peers (or a peer of their peers) has issues? If this turns out to be BGP reconvergence or similar, do I have any options?
>
> many thanks,
> -andy
>