[164546] in North American Network Operators' Group



Re: tools and techniques to pinpoint and respond to loss on a path

daemon@ATHENA.MIT.EDU (Jared Mauch)
Mon Jul 15 17:30:59 2013

From: Jared Mauch <jared@puck.nether.net>
In-Reply-To: <9F4D4FC766780045A8E7ECEA533A1A8D0367BBC8@CORPTPMAIL03.corp.theplatform.com>
Date: Mon, 15 Jul 2013 17:30:42 -0400
To: Andy Litzinger <Andy.Litzinger@theplatform.com>
Cc: "nanog@nanog.org" <nanog@nanog.org>
Errors-To: nanog-bounces+nanog.discuss=bloom-picayune.mit.edu@nanog.org

On Jul 15, 2013, at 5:18 PM, Andy Litzinger <Andy.Litzinger@theplatform.com> wrote:

> I'd like to be able to collect enough relevant data to pinpoint the
> trouble spot as much as possible so I can take it to the ISPs and
> request a solution.  The blackouts are so quick that it's impossible
> to log in and get a trace- hence the desire to automate it.
>
> I can provide more details off list if helpful- I'm trying not to
> vilify anyone- especially without copious amounts of data points.
>
> As a side question, what should my expectation be regarding packet
> loss when sending packets from point A to point B across multiple
> providers across the internet?  Is 30 seconds to a minute of blackout
> between two destinations every couple of weeks par for the course?  My
> directly connected ISPs offer me an SLA, but what should I reasonably
> expect from them when one of their upstream peers (or a peer of their
> peers) has issues?  If this turns out to be BGP reconvergence or
> similar do I have any options?

I think there are a number of tools available to detect if something is
happening:

1) iperf (test network/bw usage)
2) owamp (one way ping) - you can use this to detect when reordering or
   other events happen.  This will collect nearly continuous data.  It
   requires good NTP references, or accepting that you may see skewed
   data.
3) Some other UDP/low latency responder.  I've built something of my
   own that does this; I can provide a pointer if you are interested.
   I have graphs of my connection at home to someplace remote that
   crosses 3 carriers.  You can see the queuing delay increase
   throughout the day until peak times and taper off at night.  No
   loss, but the increase is quite visible.
4) Some vendor SLA/SAA product.  Cisco and others have SAA responders
   that run on their devices, which you can configure to collect data.
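
As a rough illustration of option 3, here is a minimal Python sketch of
a UDP prober. It is not the tool mentioned above. It assumes a
hypothetical echo-style responder on the far end that returns each
datagram unchanged; the names probe and find_blackouts and the default
interval, timeout and min_lost values are illustrative only. It sends
numbered probes, records which ones were answered, and groups runs of
consecutive losses into blackout windows that can be timestamped and
handed to a provider.

```python
import socket
import time


def find_blackouts(samples, min_lost=3):
    """Group consecutive lost probes into (start, end, lost_count) windows.

    samples: iterable of (send_time_seconds, got_reply) in send order.
    min_lost: runs shorter than this are ignored as ordinary stray loss.
    """
    windows = []
    run_start = None
    run_end = None
    run_len = 0
    for sent_at, got_reply in samples:
        if not got_reply:
            if run_len == 0:
                run_start = sent_at
            run_end = sent_at
            run_len += 1
            continue
        # A reply ends the current loss run; keep it only if long enough.
        if run_len >= min_lost:
            windows.append((run_start, run_end, run_len))
        run_len = 0
    if run_len >= min_lost:
        windows.append((run_start, run_end, run_len))
    return windows


def probe(host, port, count, interval=0.1, timeout=0.5):
    """Send numbered UDP datagrams to an (assumed) echo responder.

    Returns a list of (send_time_seconds, got_reply) tuples.
    """
    samples = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            payload = str(seq).encode()
            sent_at = time.time()
            deadline = time.monotonic() + timeout
            got_reply = False
            try:
                sock.sendto(payload, (host, port))
                while not got_reply:
                    remaining = deadline - time.monotonic()
                    if remaining <= 0:
                        break
                    sock.settimeout(remaining)
                    data, _ = sock.recvfrom(2048)
                    # Late replies to earlier probes are skipped.
                    got_reply = data == payload
            except OSError:
                # Timeout or ICMP error: count this probe as lost.
                got_reply = False
            samples.append((sent_at, got_reply))
            time.sleep(interval)
    return samples
```

Running probe() in a loop from a host on each side and logging the
output of find_blackouts() gives timestamps for each outage, which can
then be lined up against traceroutes or provider maintenance notices.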

That being said, I would expect losing connectivity for 30 seconds once
every 2 weeks to be fairly common.  Someone will be doing network
upgrades/work, or there will be a hardware/transmission error, etc.

30 seconds sounds a lot like BGP convergence.  On older platforms, e.g.
6500/sup720, expect about 8k prefixes/second max to be downloaded into
the TCAM/FIB.  With 400k+ prefixes, it takes a while to pump the tables
into the forwarding side.
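
A back-of-the-envelope check of those figures, as a small Python
sketch. It assumes the 8k prefixes/second rate is sustained and that
the whole table has to be re-downloaded, neither of which is guaranteed
in practice; the function name is illustrative only.

```python
def fib_download_seconds(prefixes, prefixes_per_second):
    """Time to push a given number of prefixes into the FIB at a fixed rate."""
    return prefixes / prefixes_per_second


# Full table at the quoted maximum rate: 400,000 / 8,000 = 50 seconds.
full_table_seconds = fib_download_seconds(400_000, 8_000)

# Conversely, a 30 second outage at that rate corresponds to
# 30 * 8,000 = 240,000 prefixes being reprogrammed.
prefixes_in_30_seconds = 30 * 8_000

print(full_table_seconds, prefixes_in_30_seconds)
```

So a blackout in the 30 to 60 second range is the right order of
magnitude for a large part of a full table being reprogrammed on such a
platform.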

- Jared

