cross connect reliability

Thu Sep 17 23:45:47 UTC 2009

On Thu, Sep 17, 2009 at 03:35:37PM -0700, Charles Wyble wrote:
> 
> Random failures of a single port's connectivity... bizarre and annoying.
> Whole switches? Seen it.
> Whole panels? Seen it.
> Whole blades? Seen it.
> 
> Single port on a switch or patch panel? Never.

You've never seen a single port go bad on a switch? I can't even count
the number of times I've seen that happen. Not that I'm suggesting the
OP wasn't the victim of human error, like someone unplugging the wrong
port and then lying to him about it; that happens even more often.

My favorite bizarre random failure story is a toss-up between one of 
these two:

Story 1. Had a customer report that they weren't able to transfer one
particular file over their connection. The transfer would start, and
then at a certain point the TCP session would just lock up. After a lot
of head scratching, it turned out that for 8 ports on a 24-port FastE
switch blade, a certain combination of bytes caused the packet to be
dropped on this otherwise perfectly normal and functioning card, thus
stalling the TCP session while leaving everything around it unaffected.
If you moved them to a different port outside this group of 8, or used
HTTPS, or uuencoded the file, it would go through fine.
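
To illustrate why re-encoding the file sidesteps a fault like this, here
is a minimal Python sketch. The trigger pattern below is purely
hypothetical (the real byte combination is not given above), and the
helper names are made up for the example. The point is only that
uuencoding maps the data onto a small printable alphabet, so a raw byte
sequence in the file no longer appears on the wire.

```python
import binascii

# Hypothetical trigger pattern; the actual byte combination is unknown.
PATTERN = bytes.fromhex("ff00ff00")


def uuencode(data: bytes) -> bytes:
    """Uuencode the body lines only (no begin/end header), 45 bytes per line."""
    return b"".join(
        binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)
    )


def uudecode(encoded: bytes) -> bytes:
    """Reverse uuencode() line by line."""
    return b"".join(
        binascii.a2b_uu(line) for line in encoded.splitlines(keepends=True)
    )


payload = b"some file contents " + PATTERN + b" more contents"
encoded = uuencode(payload)

print(PATTERN in payload)            # True: raw file carries the pattern
print(PATTERN in encoded)            # False: encoded form does not
print(uudecode(encoded) == payload)  # True: nothing is lost
```

HTTPS has a similar effect for a different reason: the payload is
encrypted, so the bytes on the wire bear no resemblance to the file.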

Story 2. Had a customer report that they were getting extremely slow 
transfers to another network, despite not being able to find any packet 
loss. Shifting the traffic to a different port to reach the same network 
resolved the problem. After removing the traffic and attempting to ping 
the far side, I got the following:

<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms

After a little more testing, it turned out that every 4th packet sent
to the peer's router was being queued until another "4th packet" came
along and knocked it out. If you increased the ping interval, you would
see the amount of time the packet spent in the queue increase. At one
point I had it up to over 350 seconds (not milliseconds) that the packet
stayed in the other router's queue before the next 4th packet came along
and knocked it free. I suspect it could have gone higher, but random
scanning traffic from the internet kept coming in and knocking it loose.
When there was a lot of traffic on the interface you would never see the
packet loss, just reordering of every 4th packet and thus slow TCP
transfers. :)
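
The queueing behaviour described above can be sketched as a toy
simulation. This is only a model under stated assumptions, not a
description of the router's internals: it assumes sequence numbers
start at 0, that exactly every 4th packet gets stuck, that no other
traffic crosses the interface, and that the normal round trip is a
fixed 0.2 ms. The function name and parameters are invented for the
example.

```python
def simulate(num_packets, interval, base_rtt=0.0002, stride=4):
    """Model a queue that holds every `stride`-th packet until the next one.

    Returns (delivered, held): `delivered` is a list of (seq, delay_seconds)
    in delivery order, and `held` is the (seq, send_time) still stuck.
    """
    held = None
    delivered = []
    for seq in range(num_packets):
        now = seq * interval  # one packet sent per interval
        if seq % stride == 0:
            # The new "4th packet" knocks the previously held one free...
            if held is not None:
                held_seq, sent_at = held
                delivered.append((held_seq, now - sent_at + base_rtt))
            # ...and then gets stuck itself.
            held = (seq, now)
        else:
            delivered.append((seq, base_rtt))
    return delivered, held


delivered, held = simulate(num_packets=9, interval=1.0)
print([seq for seq, _ in delivered])  # [1, 2, 3, 0, 5, 6, 7, 4]
print("still queued: seq", held[0])
```

In this model the held packet's delay is simply four times the ping
interval plus the normal round trip, which matches the observation that
lengthening the interval lengthens the time spent in the queue. With
enough other traffic on the link, the held packet is released almost
immediately, leaving only the reordering.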

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)