cross connect reliability

Warren Kumari warren at kumari.net
Fri Sep 18 17:55:25 UTC 2009


On Sep 17, 2009, at 7:45 PM, Richard A Steenbergen wrote:

[ SNIP ]

> Story 2. Had a customer report that they were getting extremely slow
> transfers to another network, despite not being able to find any packet
> loss. Shifting the traffic to a different port to reach the same network
> resolved the problem. After removing the traffic and attempting to ping
> the far side, I got the following:
>
> <drop>
> 64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
> 64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
> 64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
> <drop>
> 64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
> 64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
> 64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
> 64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms
>
> After a little bit more testing, it turned out that every 4th packet
> that was being sent to the peer's router was being queued until another
> "4th packet" would come along and knock it out. If you increased the
> interval time of the ping, you would see the amount of time the packet
> spent in the queue increase. At one point I had it up to over 350
> seconds (not milliseconds) that the packet stayed in the other router's
> queue before that 4th packet came along and knocked it free. I suspect
> it could have gone higher, but random scanning traffic on the internet
> was coming in. When there was a lot of traffic on the interface you
> would never see the packet loss, just reordering of every 4th packet
> and thus slow TCP transfers. :)
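
Side note, not something from Richard's actual debugging: that sort of
every-Nth-packet reordering is easy to spot programmatically by pulling
the icmp_seq values out of saved ping output and flagging any reply that
shows up after a higher sequence number has already been seen. A
throwaway Python sketch:

#!/usr/bin/env python
# Read saved ping output on stdin and flag replies whose icmp_seq is
# lower than one we've already seen -- i.e. packets that arrived late,
# the way every 4th packet did above.
import re
import sys

last_seq = -1
for line in sys.stdin:
    m = re.search(r'icmp_seq=(\d+).*time=([\d.]+) ms', line)
    if not m:
        continue
    seq, rtt = int(m.group(1)), float(m.group(2))
    if seq < last_seq:
        print("out of order: icmp_seq=%d (rtt %.3f ms) came after icmp_seq=%d"
              % (seq, rtt, last_seq))
    last_seq = max(last_seq, seq)

Save it as, say, spot_reorder.py (name made up, obviously) and run the
saved ping output through it; the delayed "4th packets" above would all
get flagged.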

Story 1:
-----------
I had a router where I was suddenly unable to reach certain hosts on
the (/24) Ethernet interface -- pinging from the router worked fine,
but transit traffic wouldn't. I decided to try and figure out if there
was any sort of rhyme or reason to which hosts had gone unreachable. I
could successfully reach
xxx.yyy.zzz.1
xxx.yyy.zzz.2
xxx.yyy.zzz.3
xxx.yyy.zzz.5
xxx.yyy.zzz.7
xxx.yyy.zzz.11
xxx.yyy.zzz.13
xxx.yyy.zzz.17
...
xxx.yyy.zzz.197
xxx.yyy.zzz.199

There were only 200 hosts on the LAN, but I'd bet dollars to donuts  
that I know what the next reachable one would have been if there had  
been more.  Unfortunately the box rebooted itself (when I tried to  
view the FIB) before I could collect more info.
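
In case the pattern above isn't jumping out: apart from .1, every
reachable last octet is prime. Purely illustrative, and not something I
ran at the time, but a couple of lines of Python confirm the pattern
and show what the next reachable host would presumably have been:

# Illustrative only: check that the observed last octets are (1 plus)
# primes, and find the next prime after 199.
def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

observed = [1, 2, 3, 5, 7, 11, 13, 17, 197, 199]        # the sample above
print(all(n == 1 or is_prime(n) for n in observed))      # True
print(next(n for n in range(200, 255) if is_prime(n)))   # 211

i.e. if the LAN had been bigger, the next reachable host would
presumably have been .211.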

Story 2:
----------
Had a small router connecting a remote office over a multilink PPP[1]
interface (4xE1). The site starts getting massive packet loss, so I
figure one of the circuits has gone bad but didn't get removed from the
bundle. I'm having a hard time reaching the remote side, so I pull the
interfaces from protocols and try pinging the remote router -- no
replies... Luckily I didn't hit Ctrl-C on the ping, because suddenly
I start getting replies with no drops:

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30132.148 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30128.178 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30133.231 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30112.571 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30132.632 ms


What?! I figure it's gotta be MLPPP stupidity and/or depref of ICMP, so
I connect OOB and A: remove MLPPP and use just a single interface, and
B: start pinging a host behind the router instead...

64 bytes from x.x.x.x: icmp_seq=1 ttl=120 time=30142.323 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=120 time=30144.571 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=120 time=30141.632 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=120 time=30142.420 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=120 time=30159.706 ms

I fire up tcpdump and try ssh to a host on the remote side -- I see
the SYN leave my machine and then, 30 *seconds* later, I get back a
SYN-ACK. I change the queuing on the interface from FIFO to something
else and the problem goes away. I change the queuing back to FIFO and
it's a 30-second RTT again. Somehow it seems to be buffering as much
traffic as it can (and anything more than one copy of ping running, or
ping with anything larger than the default packet size, makes it start
dropping badly). I ran "show buffers" to try to get more of an idea of
what was happening, but it didn't like that and reloaded. Came back up
fine though...
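
Back-of-the-envelope, and assuming the bundle really was draining at
the full 4xE1 line rate (which is an assumption): a 30-second delay
through a FIFO queue implies on the order of 30 MB of traffic buffered
on an ~8 Mbit/s link. The arithmetic:

# Rough arithmetic: how much data does a ~30 s FIFO delay imply?
# Assumes the queue drains at the full 4 x E1 rate -- an assumption; the
# real box may have been doing something stranger.
e1_bps = 2.048e6            # one E1, in bits per second
bundle_bps = 4 * e1_bps     # the 4 x E1 MLPPP bundle, ~8.2 Mbit/s
delay_s = 30.0              # observed delay, essentially all queueing

queued_bytes = bundle_bps * delay_s / 8
print("~%.0f MB queued" % (queued_bytes / 1e6))          # ~31 MB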

Story 3:
----------

Running a network that had a large number of L3 switches from a vendor
(let's call them "X") in a single OSPF area. This area also contained a
large number of poor-quality international circuits that would flap
often, so there was *lots* of churn. Apparently vendor X's OSPF
implementation didn't much like this and so would become unhappy. The
way it would express its displeasure was by corrupting a pointer to /
in the LSDB so it was off-by-one, and you'd get:
Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5 Mask 10.160.8.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3 Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
(This network was addressed out of 10/8 -- 10.178.255.252 is one of
vendor X's boxes, and 10.160.8.0 is a valid subnet but, surprisingly
enough, not a valid mask...)
To make matters even more fun, the OSPF adjacency would go down and
then come back up -- and the grumpy box would flood all of its
(corrupt) LSAs...
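
For the curious: a valid netmask has to be a contiguous run of 1-bits
followed by 0-bits, which 10.160.8.0 plainly isn't. A quick, purely
illustrative Python check (nothing to do with how this was actually
debugged at the time):

import socket
import struct

def is_valid_mask(dotted):
    # A real netmask is a contiguous run of 1-bits followed by 0-bits,
    # so its bitwise inverse must be of the form 2**k - 1.
    m = struct.unpack("!I", socket.inet_aton(dotted))[0]
    inv = ~m & 0xFFFFFFFF
    return (inv & (inv + 1)) == 0

print(is_valid_mask("255.255.255.0"))   # True
print(is_valid_mask("10.160.8.0"))      # False -- the "mask" from the log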

W

[1]: Hey, not my idea...

--
"Real children don't go hoppity-skip unless they are on drugs."

     -- Susan, the ultimate sensible governess (Terry Pratchett, Hogfather)
