impossible circuit

Justin Shore justin at justinshore.com
Wed Aug 13 11:02:29 CDT 2008


This is just a WAG but what the hell.

Jon Lewis wrote:
> I've got this private line DS3.  It connects cisco 7206 routers in 
> Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
> 
> According to the DLR, it's a real circuit, various portions of it ride 
> varying sized OC circuits, and then it's handed off to us at each end 
> the usual way (copper/coax) and plugged into PA-2T3 cards.

Are you sure that they are not crossing some channels in the middle and 
accidentally handing them to a different customer?  You mention above 
that various portions of the DS3 ride different transport circuits in 
the middle.  That always creates the potential for someone to not put it 
back together correctly on either end.  I've seen DLCs get crossed 
before.  I could easily see a transport provider crossing portions of a 
circuit, especially if they break it into pieces in the middle and have 
to put it back together on the ends.

I think it makes sense too.  Somebody's getting traffic off a T1 that 
isn't destined for them.  Their router sees it, says WTF and sends an 
ICMP dest unreachable via their default route through Sprint.  Same 
thing goes for a traceroute; it simply follows its default route to 
reply to your packets with the expiring TTL.  Taking a path through a 
different provider would be expected since it doesn't have a connected 
route to the source of the traceroute (since it's not the far end of 
your T1 that you're expecting).  The site getting your crossed T1 could 
be using the T1 as a PtP to a branch office and has Internet through a 
different circuit that hasn't been hosed.
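
To make the default-route part concrete, here is a rough sketch of the
lookup I'm picturing on the wrong site's router.  The routing table,
interface names and the 198.51.100.7 source address are all made up for
illustration; I have no idea what that site actually looks like.

```python
import ipaddress

# Hypothetical routing table for the site that ended up with the crossed T1.
# Prefixes and interface names are invented for illustration only.
ROUTES = [
    (ipaddress.ip_network("10.20.0.0/30"), "Serial0/0 (PtP T1 to branch)"),
    (ipaddress.ip_network("192.168.1.0/24"), "FastEthernet0/0 (LAN)"),
    (ipaddress.ip_network("0.0.0.0/0"), "default via Sprint"),
]


def lookup(dst):
    """Return the exit for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, via) for net, via in ROUTES if addr in net]
    # The default route always matches, so matches is never empty.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# A stray packet arrives on the T1 from a source the router has no
# specific route for (documentation address standing in for the real one).
print(lookup("198.51.100.7"))   # reply leaves via the default route
print(lookup("10.20.0.2"))      # the real far end would be connected
```

The point is only that nothing ties the reply to the interface the
packet arrived on; the reply goes wherever the routing table says.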

I would be curious to hear if Sprint is having any problems with a 
circuit connected to sl-bb20-dc-6-0-0.sprintlink.net, what the router is 
and if any directly connected customers are having T1 problems.  If 
nothing else Sprint should be able to track down the source of the 
traceroute return packets and contact the customer.  The T1 could be 
part of a bundle at their site and they may not even realize that the 
bundle dropped a path.

> Last Tuesday, at about 2:30PM, "something bad happened."  We saw a 
> serious jump in traffic to Ocala, and in particular we noticed one 
> customer's connection (a group of load sharing T1s) was just totally 
> full.  We quickly assumed it was a DDoS aimed at that customer, but 
> looking at the traffic, we couldn't pinpoint anything that wasn't 
> expected flows.

Are you sure that the traffic being received by each of the T1s is 
theirs?  Do you have any way of getting flows or packets off of 
individual T1s and not the bundle as a whole?

Tracing through you to your upstream...

>  7  andc-br-3-f2-0.atlantic.net (209.208.9.138)  47.951 ms  56.096 ms  
> 56.154 ms
>  8  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  56.199 ms  56.320 
> ms  56.196 ms
>  9  * * *

Circuit gets crossed onto the wrong customer.  Wrong site receives a 
packet with an expiring TTL and goes to send a reply.  Destination IP 
isn't on a connected route so the site sends the reply via its default 
route on Sprint.

> 10  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  80.774 ms  81.030 
> ms  81.821 ms
> 11  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  75.731 ms  75.902 
> ms  77.128 ms

Reply traverses Sprint to L3 and on to you.

> 12  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  46.548 ms  
> 53.200 ms  45.736 ms
> 13  vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.918 ms 
> vlan79.csw2.Washington1.Level3.net (4.68.17.126)  55.438 ms 
> vlan69.csw1.Washington1.Level3.net (4.68.17.62)  42.693 ms
> 14  ae-81-81.ebr1.Washington1.Level3.net (4.69.134.137)  48.935 ms 
> ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  49.317 ms 
> ae-91-91.ebr1.Washington1.Level3.net (4.69.134.141)  48.865 ms
> 15  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  59.642 ms  56.278 ms  
> 56.671 ms
> 16  ae-61-60.ebr1.Atlanta2.Level3.net (4.69.138.2)  47.401 ms  62.980 
> ms  62.640 ms
> 17  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  40.300 ms  40.101 
> ms  42.690 ms
> 18  ae-6-6.car1.Orlando1.Level3.net (4.69.133.77)  40.959 ms  40.963 ms  
> 41.016 ms
> 19  unknown.Level3.net (63.209.98.66)  246.744 ms  240.826 ms  239.758 ms
> 20  andc-br-3-f2-0.atlantic.net (209.208.9.138)  39.725 ms  37.751 ms  
> 42.262 ms
> 21  ocalflxa-br-1-s1-0.atlantic.net (209.208.112.98)  43.524 ms  45.844 
> ms  43.392 ms
> 22  * * *
> 23  sl-bb20-dc-6-0-0.sprintlink.net (144.232.8.174)  63.752 ms  61.648 
> ms  60.839 ms
> 24  sl-st20-ash-10-0.sprintlink.net (144.232.20.152)  66.923 ms  65.258 
> ms  70.609 ms
> 25  te-10-1-0.edge2.Washington4.level3.net (4.68.63.209)  67.106 ms  
> 93.415 ms  73.932 ms
> 26  vlan99.csw4.Washington1.Level3.net (4.68.17.254)  88.919 ms  75.306 
> ms vlan79.csw2.Washington1.Level3.net (4.68.17.126)  75.048 ms
> 27  ae-61-61.ebr1.Washington1.Level3.net (4.69.134.129)  69.508 ms  
> 68.401 ms ae-71-71.ebr1.Washington1.Level3.net (4.69.134.133)  79.128 ms
> 28  ae-2.ebr3.Atlanta2.Level3.net (4.69.132.85)  64.048 ms  67.764 ms  
> 67.704 ms
> 29  ae-71-70.ebr1.Atlanta2.Level3.net (4.69.138.18)  68.372 ms  67.025 
> ms  68.162 ms
> 30  ae-1-8.bar1.Orlando1.Level3.net (4.69.137.149)  65.112 ms  65.584 
> ms  65.525 ms

I can't explain the continuous loop or the dupes.  I'm not sure if my 
theory fits those symptoms or not.
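
For what it's worth, the loop is at least easy to pin down from the
trace itself.  Below is a small sketch that takes the first responder
at each hop (copied from hops 7 through 21 above, None for "* * *")
and reports the first address that shows up twice.  The function name
is my own; this is just one way to spot the repeat.

```python
# First responder per hop, taken from hops 7-21 of the trace above.
# None stands for a hop that returned "* * *".
HOPS = [
    (7, "209.208.9.138"),
    (8, "209.208.112.98"),
    (9, None),
    (10, "144.232.8.174"),
    (11, "144.232.20.152"),
    (12, "4.68.63.209"),
    (13, "4.68.17.62"),
    (14, "4.69.134.137"),
    (15, "4.69.132.85"),
    (16, "4.69.138.2"),
    (17, "4.69.137.149"),
    (18, "4.69.133.77"),
    (19, "63.209.98.66"),
    (20, "209.208.9.138"),
    (21, "209.208.112.98"),
]


def find_first_repeat(hops):
    """Return (first_ttl, repeat_ttl) for the first address seen twice."""
    seen = {}
    for ttl, ip in hops:
        if ip is None:
            continue  # silent hops can't be matched against anything
        if ip in seen:
            return seen[ip], ttl
        seen[ip] = ttl
    return None


first, again = find_first_repeat(HOPS)
print(f"hop {again} repeats hop {first}; cycle length {again - first}")
```

On this trace hop 20 repeats hop 7, so each pass around the loop costs
13 hops.  Comparing whole passes address-for-address wouldn't work
because the load-balanced Level3 hops answer from different addresses
on the second pass (hop 13 vs. hop 26, for example).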

> Our circuit provider's support people have basically just maintained 
> that this behavior isn't possible and so there's nothing they can do 
> about it. i.e. that the problem has to be something other than the circuit.

Can you have them put the circuit into maintenance and have them test it 
end to end?  They can't deny it when their TDR says that there's a problem.

> I got tired of talking to their brick wall, so I contacted Sprint and 
> was able to confirm with them that the traffic in question really was 
> inexplicably appearing on their network...and not terribly close 
> geographically to the Orlando/Ocala areas.

Which supports my theory of a crossed circuit.  Crossing a DS1 onto 
the wrong DS3 or OCx could easily make it pop up anywhere.  Somewhere 
there is another site that's having T1 problems.

> So, I have a circuit that's bleeding duplicate packets onto an unrelated 
> IP network, a circuit provider who's got their head in the sand and 
> keeps telling me "this can't happen, we can't help you", and customers 
> who were getting tired of receiving all their packets in triplicate (or 
> more) saturating their connections and confusing their applications.  
> After a while, I had to give up on finding the problem and focus on just 
> making it stop.  After trying a couple of things, the solution I found 
> was to change the encapsulation we use at each end of the DS3.  I 
> haven't gotten confirmation of this from Sprint, but I assume they're 
> now seeing massive input errors one the one or more circuits where our 
> packets were/are appearing.  The important thing (for me) is that this 
> makes the packets invalid to Sprint's routers and so it keeps them from 
> forwarding the packets to us.  Cisco TAC finally got back to us the day 
> after I "fixed" the circuit...but since it was obviously not a problem 
> with our cisco gear, I haven't pursued it with them.

Right.  By changing the encap you've basically killed the circuit.  With 
that T1 effectively down on your end you won't be sending any packets 
down the problem path and aren't able to see that problem anymore with 
your traceroutes.  However your customer with the bundle of T1s is down 
a circuit.
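
As a rough illustration of why an encap mismatch makes the frames
useless to whoever is on the other end: assume, purely for the sake of
example, that the far router expects Cisco HDLC and you switched to
PPP (you didn't say which encapsulations were involved).  The header
bytes below are the standard ones for IPv4 in each framing; the check
itself is a toy, not how any router actually implements it.

```python
# Header bytes that precede an IPv4 packet in each framing.
# Which encapsulations were really in use is an assumption here.
CHDLC_IPV4 = bytes([0x0F, 0x00, 0x08, 0x00])  # cHDLC: addr, ctrl, EtherType
PPP_IPV4 = bytes([0xFF, 0x03, 0x00, 0x21])    # PPP: addr, ctrl, protocol


def accepts(expected_header, frame):
    """Toy receiver: keep the frame only if its header is what we expect."""
    return frame.startswith(expected_header)


payload = b"\x45\x00"  # first bytes of an IPv4 header, stand-in payload

print(accepts(CHDLC_IPV4, CHDLC_IPV4 + payload))  # matching encap
print(accepts(CHDLC_IPV4, PPP_IPV4 + payload))    # mismatched encap
```

With the mismatch the receiver never gets as far as looking at the IP
packet, so there is nothing for it to route back.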

It makes sense in my mind that it's simply a crossed circuit in the 
middle.  Your transport provider for whatever reason pulled out a DS1 
and sent it down a different path.  They accidentally crossed DS1s in 
the middle and are handing your DS1 to a Sprint customer and their DS1 
to your customer.  That's my theory at least.

Justin





