packet loss question

Ken Chase math at sizone.org
Thu Jul 7 21:34:52 UTC 2016


On Thu, Jul 07, 2016 at 08:32:19PM +0000, Mel Beckman said:
  >Yes. It indicates that there was never a time when you did not know everything :)
  >
  > -mel beckman

The issue isn't knowing everything, it's making accusations of issues while you still
don't know how much you don't know. (~D. Rumsfeld) -- My customers in a nutshell
(they pay to be able to yell about random stuff I guess, and I provide that service!).

The OP didn't make any accusations however, and just asked what was going on (sorry
if I sounded harsh in reply). Once, Google having an 8.8.8.8 failure locally on
its (anycast?) DNS servers resulted in dozens of calls to us: "your server
hosting our site must be down!! Our website isn't working! People are calling us!".

Most of my work in these situations is spent proving it's not our fault.
Mtr makes that very hard because it's a very subtle tool, and only gives partial
information. (I still think mtr is a killer app though!)

Consider this (fake, example) trace (columns are loss % and pings sent):

  6. 100ge13-1.core1.chi1.he.net            0.0%    10 
  7. 100ge14-1.core2.chi1.he.net            0.0%    10 
  8. 100ge3-1.core1.sjc2.he.net            30.0%    10 
  9. ???
 10. UNKNOWN-216-115-101-X.yahoo.com       10.0%    10 
 11. routerer-ext.ysv.freebsd.org          20.0%    10 
 12. wfe0.ysv.freebsd.org                  30.0%    10 

First off, the OP may have asked "whose fault is hop 9, yahoo or HE?" and seen it
as an issue. Ignoring that for now, the rest of the packet loss is an issue --
where is the problem though?

This is very tricky - it looks like hop 8 is at fault of course - or is it
just dropping ICMP as it's allowed to? How did hop 10 get only 10% loss then if
8 has 30? Is 8 then dropping ~20% (not statistically correct..) of ICMP just cuz
it can, and then having a 'real' 10% loss on top of that?

Or it's hop 11? But hop 12 has more packet loss (PL), so perhaps hop 12 is the issue
all along and 8, 10 and 11 are just dropping ICMP? Or it's 8, 11 and 12 doing
~10% each? (not statistically correct.)
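One rule of thumb that helps frame these questions (a common heuristic, not
something the trace can prove): loss reported at a hop only counts as real
forward-path loss if it is still there at every responding hop after it.
Anything above that floor could be the router deprioritizing ICMP. A rough
sketch using the fake trace above; the function name is made up for
illustration, and with only 10 pings per hop the numbers are noisy anyway:

```python
# Per-hop loss percentages from the example trace (hop 9 did not respond).
trace = [
    (6, 0.0),
    (7, 0.0),
    (8, 30.0),
    (9, None),
    (10, 10.0),
    (11, 20.0),
    (12, 30.0),
]

def persistent_loss(hops):
    """For each hop, return the smallest loss seen at that hop or at any
    later responding hop. Loss above this floor did not carry through
    downstream, so it may be ICMP rate limiting rather than real loss."""
    result = []
    floor = None
    # Walk from the destination back toward the source, tracking the minimum.
    for hop, loss in reversed(hops):
        if loss is not None:
            floor = loss if floor is None else min(floor, loss)
        result.append((hop, loss, floor))
    result.reverse()
    return result

for hop, seen, floor in persistent_loss(trace):
    shown = "???" if seen is None else f"{seen:.1f}%"
    print(f"hop {hop:2d}: reported {shown:>6}  persists downstream: {floor:.1f}%")
```

By this measure at most 10 of hop 8's 30 points persist downstream, which is
the "real 10% plus ICMP dropping" reading. It says nothing about the return
path, and it does not rule out the other readings.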

Can't say for sure - it's a probabilities game - and being completely correct
about it, hop 6 isn't blameless either (just very unlikely to be at fault
statistically, though not impossible with only 10 pings per hop - a statistician
can calculate it for us).
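For the curious, here is roughly what that calculation looks like, assuming
each ping is lost independently with the same probability (a simple binomial
model, which real networks only approximate). It asks how often a hop with
some true loss rate would still report a clean 0% over 10 pings:

```python
from math import comb

def prob_losses(n, k, p):
    """Probability of exactly k lost probes out of n if each probe is lost
    independently with probability p (binomial model, an assumption)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10  # pings per hop, as in the example trace
for true_loss in (0.01, 0.05, 0.10, 0.20):
    clean = prob_losses(n, 0, true_loss)
    print(f"true loss {true_loss:4.0%}: chance of a clean 0/{n} report = {clean:.1%}")
```

Under this model a hop with a true 10% loss shows 0 lost out of 10 about a
third of the time (0.9^10), so a clean hop 6 is weak evidence at best.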

This is why more pings are required to be sure of the situation - I like to do
-i 0.1 -c 100 so it's completed quickly before conditions change.  Then you
can make a statistically valid pronouncement of where the problem MIGHT BE
within a useful confidence interval - however, without the return route we're
still largely in the dark as to the actual location of the issue. You can't be
'100% sure' with this stuff - technically speaking, it's all 'luck of the draw'.
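To put a number on why 100 pings beats 10, here is a sketch that computes a
Wilson score interval for an observed loss rate. The choice of the Wilson
interval and of roughly 95% confidence (z = 1.96) are my assumptions for
illustration; other interval methods would give somewhat different bounds:

```python
from math import sqrt

def wilson_interval(lost, sent, z=1.96):
    """Wilson score interval for a loss rate; z=1.96 is roughly 95% confidence."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    p = lost / sent
    denom = 1 + z * z / sent
    centre = (p + z * z / (2 * sent)) / denom
    half = z * sqrt(p * (1 - p) / sent + z * z / (4 * sent * sent)) / denom
    return centre - half, centre + half

# Same observed 30% loss, measured with 10 pings and with 100 pings.
for lost, sent in ((3, 10), (30, 100)):
    lo, hi = wilson_interval(lost, sent)
    print(f"{lost}/{sent} lost: true loss plausibly between {lo:.0%} and {hi:.0%}")
```

The same 30% reading pins down the true loss rate far more tightly at 100
pings than at 10, which is the whole point of sending more. It still only
describes that hop's replies, not where on the forward or return path the
loss happens.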

(Beware: this one time, at band camp, some etherchannel or equiv at HE was
showing PL only for specific IPs in any target subnet -- because they were xor'ing
the source & target IP to load balance and one channel was wonky. Fun times
debugging that one: "WFM from here, what's your issue?")
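A toy model of why that looks so confusing from the outside. The hash below
(XOR of the two addresses, modulo the number of links) is a guess at the
general idea, not HE's actual algorithm; the addresses, bundle size and bad
link number are all made up for illustration:

```python
import ipaddress

def pick_member(src, dst, members):
    """Pick a bundle member from the XOR of source and destination IPv4
    addresses. Hypothetical hash, for illustration only."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) % members

SRC = "192.0.2.10"   # documentation address standing in for the tester
MEMBERS = 4          # assumed bundle size
BAD_MEMBER = 2       # assumed faulty link

# Every host in the target subnet lands on a link chosen by the hash.
for host in ipaddress.IPv4Network("198.51.100.0/28").hosts():
    member = pick_member(SRC, str(host), MEMBERS)
    status = "LOSS" if member == BAD_MEMBER else "ok"
    print(f"{SRC} -> {host}: member {member} {status}")
```

Only some targets in the subnet hit the bad link, and which ones depends on
the source address, so someone testing from a different IP can see a
perfectly clean path to the exact host you are complaining about.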

/kc
-- 
Ken Chase - ken at heavycomputing.ca skype:kenchase23 +1 416 897 6284 Toronto Canada
Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.


