Odd router brokenness

Mark Radabaugh mark at amplex.net
Wed Nov 23 14:41:08 UTC 2011


Since this list likes to speculate with few facts on a regular basis 
(and I'll admit to being as guilty as anyone), I'll throw this one out for 
opinions:

We were seeing very odd behavior on a Cogent circuit following a 
software upgrade to tol01.atlas.   Two traceroutes:

mark at angola-gw> traceroute 74.125.226.6
traceroute to 74.125.226.6 (74.125.226.6), 30 hops max, 40 byte packets
  1  * * gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5)  110.315 ms
  2  te4-2.ccr01.sbn01.atlas.cogentco.com (154.54.7.154)  139.520 ms  196.910 ms  5.728 ms
  3  * * *
  4  * te0-5-0-5.ccr21.ord03.atlas.cogentco.com (154.54.44.174)  8.310 ms te0-0-0-7.ccr21.ord03.atlas.cogentco.com (154.54.25.70)  8.752 ms
  5  te0-0-0-0.ccr22.ord03.atlas.cogentco.com (154.54.24.214)  8.983 ms te0-1-0-0.ccr22.ord03.atlas.cogentco.com (66.28.4.66)  7.948 ms *
  6  * * te-9-1.car4.Chicago1.Level3.net (4.68.127.129)  26.127 ms
  7  GOOGLE-INC.car4.Chicago1.Level3.net (4.71.100.22)  38.132 ms  25.120 ms *
  8  * * 209.85.254.122 (209.85.254.122)  24.539 ms
  9  * 72.14.237.130 (72.14.237.130)  26.134 ms 72.14.237.108 (72.14.237.108)  25.021 ms
      MPLS Label=666803 CoS=4 TTL=1 S=1
 10  216.239.46.161 (216.239.46.161)  31.816 ms  35.702 ms  32.249 ms
 11  72.14.233.142 (72.14.233.142)  32.897 ms * *
 12  * yyz06s05-in-f6.1e100.net (74.125.226.6)  33.319 ms *

and a ping over the same path:

--- www.l.google.com ping statistics ---
675 packets transmitted, 323 packets received, 52.1% packet loss
round-trip min/avg/max/stddev = 12.834/28.831/129.743/28.987 ms
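
As a sanity check on the summary above, the loss percentage follows 
directly from the transmitted and received counts.  A minimal Python 
sketch (the function name is only illustrative, it is not part of ping 
or any other tool):

```python
def packet_loss_percent(transmitted: int, received: int) -> float:
    """Return packet loss as a percentage of packets transmitted."""
    if transmitted <= 0:
        raise ValueError("transmitted must be positive")
    return 100.0 * (transmitted - received) / transmitted


# Counts taken from the ping summary above: 675 sent, 323 received.
loss = packet_loss_percent(675, 323)
print(f"{loss:.1f}% packet loss")
```

That is 352 of 675 packets unanswered, which rounds to the 52.1% ping 
reported.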

and at the same time:

mark at angola-gw> traceroute 38.100.128.10
traceroute to 38.100.128.10 (38.100.128.10), 30 hops max, 40 byte packets
  1  gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5)  4.445 ms  1.841 ms  1.713 ms
  2  te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230)  5.318 ms te3-2.ccr02.cle04.atlas.cogentco.com (154.54.28.86)  4.755 ms te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230)  4.982 ms
  3  te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10)  7.997 ms te3-2.ccr01.pit02.atlas.cogentco.com (154.54.30.6)  7.736 ms te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10)  8.177 ms
  4  te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81)  17.197 ms te0-0-0-5.ccr22.dca01.atlas.cogentco.com (154.54.30.230)  16.907 ms te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81)  17.008 ms
  5  te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193)  17.358 ms te0-0-0-0.mpd22.dca01.atlas.cogentco.com (154.54.31.38)  17.196 ms te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193)  18.690 ms
  6  te4-2.mpd01.iad03.atlas.cogentco.com (154.54.29.122)  17.885 ms *  18.537 ms
  7  cogentco.com (38.100.128.10)  17.836 ms !<10>  17.918 ms !<10>  17.833 ms !<10>

--- 38.100.128.10 ping statistics ---
236 packets transmitted, 236 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 22.717/27.942/128.011/12.236 ms
sh-3.2#

Works perfectly.  There is no asymmetric routing in this scenario 
(only 1 BGP peer running during this test), and it is not due to traffic 
congestion.  Initial speculation over the dropped packets in the trace 
to 74.125.226.6 was ICMP deprioritization.  The results are too 
consistent for that to make sense (I have dozens of traceroutes to the 
same destination, and they all appear similar).
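
One way to compare a pile of traceroutes like this is to tabulate, per 
hop, how many probes went unanswered.  The following is a rough Python 
sketch, not a general traceroute parser: it assumes output shaped like 
the traces in this message (a hop number at the start of a line, "*" for 
a timed-out probe, "N ms" for a reply, and possibly mail-wrapped 
continuation lines).  The names probe_loss_by_hop and SAMPLE are made up 
for the illustration.

```python
import re

# A hop line starts with an integer hop number followed by whitespace.
HOP_START = re.compile(r"^\s*(\d+)\s+(\S.*)$")
# A reply is a round-trip time such as "8.310 ms".
RTT = re.compile(r"\d+(?:\.\d+)?\s+ms\b")


def probe_loss_by_hop(trace_text):
    """Map hop number -> (unanswered probes, total probes seen)."""
    hops = {}
    current = None
    for line in trace_text.splitlines():
        if line.lstrip().startswith("traceroute to"):
            continue  # header line, not a hop
        match = HOP_START.match(line)
        if match:
            current = int(match.group(1))
            hops[current] = match.group(2)
        elif current is not None and line.strip():
            # Wrapped continuation (or MPLS annotation) of the current hop.
            hops[current] += " " + line.strip()
    result = {}
    for hop, text in hops.items():
        lost = text.split().count("*")
        answered = len(RTT.findall(text))
        result[hop] = (lost, lost + answered)
    return result


# First three hops of the trace to 74.125.226.6, wrapped as mailed.
SAMPLE = """\
  1  * * gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5)  110.315 ms
  2  te4-2.ccr01.sbn01.atlas.cogentco.com (154.54.7.154)  139.520 ms
196.910 ms  5.728 ms
  3  * * *
"""

for hop, (lost, sent) in sorted(probe_loss_by_hop(SAMPLE).items()):
    print(f"hop {hop}: {lost}/{sent} probes unanswered")
```

Run over many captures, a loss pattern that stays the same from trace 
to trace would support the point that this does not look like ordinary 
ICMP rate limiting.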

I realize there is a long history of Cogent/L3 ugliness but I'm pretty 
sure that this issue has nothing to do with that subject.

Traceroutes and pings from the control plane of tol01.atlas sourced from 
38.104.148.5 do not show any odd behavior.   Inbound traffic (to us) is 
not affected by this.  Our workaround while resolving this issue was to 
change local-pref on the affected prefixes to send traffic out our other 
providers.
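
For anyone unfamiliar with the workaround, here is a toy Python model of 
why changing local-pref moves outbound traffic.  It is heavily 
simplified: real BGP best-path selection has many more steps, and the 
prefix length, provider names, AS-path lengths, and preference values 
below are all hypothetical, not taken from our configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    provider: str
    local_pref: int
    as_path_len: int


def best_route(candidates):
    """Pick the route with the highest local-pref.

    Ties fall back to the shortest AS path.  This covers only the first
    two comparisons of the usual decision process.
    """
    return max(candidates, key=lambda r: (r.local_pref, -r.as_path_len))


# Hypothetical: both providers at a common default local-pref of 100,
# so the shorter AS path (via Cogent) wins.
before = [
    Route("74.125.226.0/24", "cogent", 100, 2),
    Route("74.125.226.0/24", "other-provider", 100, 3),
]

# Workaround: raise local-pref on the other provider's route for the
# affected prefix, which overrides the AS-path comparison.
after = [
    Route("74.125.226.0/24", "cogent", 100, 2),
    Route("74.125.226.0/24", "other-provider", 200, 3),
]

print(best_route(before).provider)
print(best_route(after).provider)
```

Since local-pref only influences the outbound path we choose, this fits 
with inbound traffic being unaffected either way.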

The issue started after a software upgrade to tol01.atlas and resolved 
after a (reported) reboot of tol01.atlas.

The question is:  How does a router break in this manner?  It appears 
to be unintentionally doing something different with traffic based on 
the source address, not the destination address.  I realize this can 
be done intentionally, but that is not the case here (unless somebody 
isn't telling me something).



-- 
Mark Radabaugh
Amplex

mark at amplex.net  419.837.5015




