tools and techniques to pinpoint and respond to loss on a path

Mon Jul 15 21:30:42 UTC 2013

On Jul 15, 2013, at 5:18 PM, Andy Litzinger <Andy.Litzinger at theplatform.com> wrote:

>  I'd like to be able to collect enough relevant data to pinpoint the trouble spot as much as possible so I can take it to the ISPs and request a solution.  The blackouts are so quick that it's impossible to log in and get a trace- hence the desire to automate it.
> 
> I can provide more details off list if helpful- I'm trying not to vilify anyone- especially without copious amounts of data points.
> 
> As a side question, what should my expectation be regarding packet loss when sending packets from point A to point B across multiple providers across the internet?  Is 30 seconds to a minute of blackout between two destinations every couple of weeks par for the course?  My directly connected ISPs offer me an SLA, but what should I reasonably expect from them when one of their upstream peers (or a peer of their peers) has issues?  If this turns out to be BGP reconvergence or similar do I have any options?

I think there are a number of tools available to detect if something is happening:

1) iperf (tests throughput/bandwidth between two endpoints)
2) owamp (one-way ping) - you can use this to detect when reordering or other events happen. This will collect nearly continuous data. It requires good NTP references, or accepting that you may see skewed data.
3) some other UDP/low-latency responder. I've built something of my own that does this; I can provide a pointer if you are interested. I have graphs of my connection at home to someplace remote that crosses 3 carriers. You can see the queuing delay increase throughout the day until peak times and taper off at night. No loss, but the increase is quite visible.
4) some vendor SLA/SAA product. Cisco and others have SAA responders that run on their devices, which you can configure to collect data.
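
For option 3, here is a minimal sketch of what such a prober could look like. This is not my actual tool. It assumes a hypothetical UDP echo responder at the far end that sends each payload straight back, and the names probe() and find_blackouts() are made up for illustration. It sends sequence-numbered probes at a fixed interval and then reports runs of consecutive lost probes, which is the kind of timestamped data you could take to an ISP.

```python
import socket
import struct
import time


def probe(host, port, count=10, interval=0.1, timeout=0.2):
    """Send sequence-numbered UDP probes to an echo responder (assumed to
    return the payload unchanged). Returns a list of (send_time, rtt), where
    rtt is None if no matching reply arrived within the timeout."""
    results = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            sent = time.time()
            sock.sendto(struct.pack("!Id", seq, sent), (host, port))
            rtt = None
            try:
                while True:
                    data, _ = sock.recvfrom(64)
                    # Ignore late replies that belong to earlier probes.
                    if len(data) >= 4 and struct.unpack("!I", data[:4])[0] == seq:
                        rtt = time.time() - sent
                        break
            except OSError:
                # Timeout or ICMP error: count this probe as lost.
                rtt = None
            results.append((sent, rtt))
            time.sleep(interval)
    return results


def find_blackouts(results, min_lost=3):
    """Return (first_lost_time, last_lost_time, lost_count) for every run of
    at least min_lost consecutive lost probes."""
    blackouts = []
    run = []
    for sent, rtt in results:
        if rtt is None:
            run.append(sent)
            continue
        if len(run) >= min_lost:
            blackouts.append((run[0], run[-1], len(run)))
        run = []
    if len(run) >= min_lost:
        blackouts.append((run[0], run[-1], len(run)))
    return blackouts
```

Usage would be something like find_blackouts(probe("192.0.2.10", 9000, count=600, interval=0.1)), where the address and port are placeholders. With a probe every 100 ms, a 30 second blackout shows up as a run of roughly 300 lost probes, and the start time of the run tells you when to look at routing logs or kick off an automated trace.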

That being said, I would expect that losing connectivity for 30 seconds once every 2 weeks is fairly common.  Someone will be doing network upgrades/work, or there will be a hardware/transmission error, etc.

30 seconds sounds a lot like BGP convergence. On older platforms, e.g. 6500/sup720, expect about 8k prefixes/second max to be downloaded into the TCAM/FIB.  With 400k+ prefixes, it takes a while to pump the tables into the forwarding side.
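
As a back-of-the-envelope check using only the two figures above (the function name is just for illustration, and this ignores everything except the FIB download rate):

```python
def fib_download_seconds(prefixes, prefixes_per_second):
    """Rough time to program a number of prefixes into the forwarding table
    at a fixed download rate."""
    return prefixes / prefixes_per_second


# Full table of 400k prefixes at about 8k prefixes/second.
print(fib_download_seconds(400_000, 8_000))  # 50.0
```

So a full table reload at that rate is on the order of 50 seconds, and an event that only touches part of the table would be proportionally shorter, which puts a 30 second blackout in a plausible range.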

- Jared