justin at justinshore.com
Wed Aug 13 11:02:29 CDT 2008
This is just a WAG but what the hell.
Jon Lewis wrote:
> I've got this private line DS3. It connects cisco 7206 routers in
> Orlando (at our data center) and in Ocala (a colo rack in the Embarq CO).
> According to the DLR, it's a real circuit, various portions of it ride
> varying sized OC circuits, and then it's handed off to us at each end
> the usual way (copper/coax) and plugged into PA-2T3 cards.
Are you sure that they are not crossing some channels in the middle and
accidentally handing them to a different customer? You mention above
that various portions of the DS3 ride different transport circuits in
the middle. That always creates the potential for someone to not put it
back together correctly on either end. I've seen DLCs get crossed
before. I could easily see a transport provider crossing portions of a
circuit, especially if they break it into pieces in the middle and have
to put it back together on the ends.
I think it makes sense too. Somebody's getting traffic off a T1 that
isn't destined for them. Their router sees it, says WTF and sends an
ICMP dest unreachable via their default route through Sprint. Same
thing goes for a traceroute; it simply follows its default route to
reply to your packets with the expiring TTL. Taking a path through a
different provider would be expected since it doesn't have a connected
route to the source of the traceroute (since it's not the far end of
your T1 that you're expecting). The site getting your crossed T1 could
be using the T1 as a PtP to a branch office and has Internet through a
different circuit that hasn't been hosed.
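To make the routing argument concrete, here's a rough sketch of the lookup the wrong site's router would be doing. The routing table, prefixes and source address are all made up for illustration (private and documentation ranges); I have no idea what that site's table really looks like. The point is only that a source with no more-specific route falls through to the default.

```python
import ipaddress

def next_hop(routes, dst):
    """Longest-prefix match: return the egress label for dst, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, egress in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, egress)
    return best[1] if best else None

# Hypothetical routing table at the site wrongly receiving the T1.
wrong_site_routes = [
    ("10.1.0.0/24", "LAN"),                   # connected
    ("10.1.255.0/30", "PtP T1 to branch"),    # connected
    ("0.0.0.0/0", "default via Sprint"),      # Internet circuit
]

# Hypothetical source address of the traceroute probe.
probe_source = "192.0.2.10"

# The ICMP reply is routed toward the probe's source address.
print(next_hop(wrong_site_routes, probe_source))  # default via Sprint
```

Only the default route matches the probe's source, so the time-exceeded or unreachable reply leaves via the Internet circuit rather than back down the T1 it arrived on.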
I would be curious to hear if Sprint is having any problems with a
circuit connected to sl-bb20-dc-6-0-0.sprintlink.net, what the router is
and if any directly connected customers are having T1 problems. If
nothing else Sprint should be able to track down the source of the
traceroute return packets and contact the customer. The T1 could be
part of a bundle at their site and they may not even realize that the
bundle dropped a path.
> Last Tuesday, at about 2:30PM, "something bad happened." We saw a
> serious jump in traffic to Ocala, and in particular we noticed one
> customer's connection (a group of load sharing T1s) was just totally
> full. We quickly assumed it was a DDoS aimed at that customer, but
> looking at the traffic, we couldn't pinpoint anything that wasn't
> expected flows.
Are you sure that the traffic being received by each of the T1s is
theirs? Do you have any way of getting flows or packets off of
individual T1s and not the bundle as a whole?
Tracing through you to your upstream...
> 7 andc-br-3-f2-0.atlantic.net (188.8.131.52) 47.951 ms 56.096 ms
> 56.154 ms
> 8 ocalflxa-br-1-s1-0.atlantic.net (184.108.40.206) 56.199 ms 56.320
> ms 56.196 ms
> 9 * * *
Circuit gets crossed onto the wrong customer. Wrong site receives a
packet with an expiring TTL and goes to send a reply. Destination IP
isn't on a connected route so the site sends the reply via its default
route on Sprint.
> 10 sl-bb20-dc-6-0-0.sprintlink.net (220.127.116.11) 80.774 ms 81.030
> ms 81.821 ms
> 11 sl-st20-ash-10-0.sprintlink.net (18.104.22.168) 75.731 ms 75.902
> ms 77.128 ms
Reply traverses Sprint to L3 and on to you.
> 12 te-10-1-0.edge2.Washington4.level3.net (22.214.171.124) 46.548 ms
> 53.200 ms 45.736 ms
> 13 vlan69.csw1.Washington1.Level3.net (126.96.36.199) 42.918 ms
> vlan79.csw2.Washington1.Level3.net (188.8.131.52) 55.438 ms
> vlan69.csw1.Washington1.Level3.net (184.108.40.206) 42.693 ms
> 14 ae-81-81.ebr1.Washington1.Level3.net (220.127.116.11) 48.935 ms
> ae-61-61.ebr1.Washington1.Level3.net (18.104.22.168) 49.317 ms
> ae-91-91.ebr1.Washington1.Level3.net (22.214.171.124) 48.865 ms
> 15 ae-2.ebr3.Atlanta2.Level3.net (126.96.36.199) 59.642 ms 56.278 ms
> 56.671 ms
> 16 ae-61-60.ebr1.Atlanta2.Level3.net (188.8.131.52) 47.401 ms 62.980
> ms 62.640 ms
> 17 ae-1-8.bar1.Orlando1.Level3.net (184.108.40.206) 40.300 ms 40.101
> ms 42.690 ms
> 18 ae-6-6.car1.Orlando1.Level3.net (220.127.116.11) 40.959 ms 40.963 ms
> 41.016 ms
> 19 unknown.Level3.net (18.104.22.168) 246.744 ms 240.826 ms 239.758 ms
> 20 andc-br-3-f2-0.atlantic.net (22.214.171.124) 39.725 ms 37.751 ms
> 42.262 ms
> 21 ocalflxa-br-1-s1-0.atlantic.net (126.96.36.199) 43.524 ms 45.844
> ms 43.392 ms
> 22 * * *
> 23 sl-bb20-dc-6-0-0.sprintlink.net (188.8.131.52) 63.752 ms 61.648
> ms 60.839 ms
> 24 sl-st20-ash-10-0.sprintlink.net (184.108.40.206) 66.923 ms 65.258
> ms 70.609 ms
> 25 te-10-1-0.edge2.Washington4.level3.net (220.127.116.11) 67.106 ms
> 93.415 ms 73.932 ms
> 26 vlan99.csw4.Washington1.Level3.net (18.104.22.168) 88.919 ms 75.306
> ms vlan79.csw2.Washington1.Level3.net (22.214.171.124) 75.048 ms
> 27 ae-61-61.ebr1.Washington1.Level3.net (126.96.36.199) 69.508 ms
> 68.401 ms ae-71-71.ebr1.Washington1.Level3.net (188.8.131.52) 79.128 ms
> 28 ae-2.ebr3.Atlanta2.Level3.net (184.108.40.206) 64.048 ms 67.764 ms
> 67.704 ms
> 29 ae-71-70.ebr1.Atlanta2.Level3.net (220.127.116.11) 68.372 ms 67.025
> ms 68.162 ms
> 30 ae-1-8.bar1.Orlando1.Level3.net (18.104.22.168) 65.112 ms 65.584
> ms 65.525 ms
I can't explain the continuous loop or the dupes. I'm not sure if my
theory fits those symptoms or not.
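It doesn't explain anything, but a quick sketch like this at least pins down where the trace starts repeating. The hop list is the first hostname reported at each TTL in the trace above; the helper name find_loop is just something I made up.

```python
def find_loop(hops):
    """Return (first_ttl, repeat_ttl) for the first hostname seen twice, else None."""
    seen = {}
    for ttl, name in hops:
        if name == "*":          # no reply at this TTL, nothing to compare
            continue
        key = name.lower()       # hostnames are case-insensitive
        if key in seen:
            return seen[key], ttl
        seen[key] = ttl
    return None

# First hostname reported at each TTL, hops 7 through 21 of the trace.
hops = [
    (7, "andc-br-3-f2-0.atlantic.net"),
    (8, "ocalflxa-br-1-s1-0.atlantic.net"),
    (9, "*"),
    (10, "sl-bb20-dc-6-0-0.sprintlink.net"),
    (11, "sl-st20-ash-10-0.sprintlink.net"),
    (12, "te-10-1-0.edge2.Washington4.level3.net"),
    (13, "vlan69.csw1.Washington1.Level3.net"),
    (14, "ae-81-81.ebr1.Washington1.Level3.net"),
    (15, "ae-2.ebr3.Atlanta2.Level3.net"),
    (16, "ae-61-60.ebr1.Atlanta2.Level3.net"),
    (17, "ae-1-8.bar1.Orlando1.Level3.net"),
    (18, "ae-6-6.car1.Orlando1.Level3.net"),
    (19, "unknown.Level3.net"),
    (20, "andc-br-3-f2-0.atlantic.net"),
    (21, "ocalflxa-br-1-s1-0.atlantic.net"),
]

print(find_loop(hops))  # (7, 20)
```

So the path comes back around to andc-br-3 at hop 20, thirteen hops after it was first seen at hop 7, and then walks the same Sprint and Level3 routers again.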
> Our circuit provider's support people have basically just maintained
> that this behavior isn't possible and so there's nothing they can do
> about it. i.e. that the problem has to be something other than the circuit.
Can you have them put the circuit into maintenance and have them test it
end to end? They can't deny it when their TDR says that there's a problem.
> I got tired of talking to their brick wall, so I contacted Sprint and
> was able to confirm with them that the traffic in question really was
> inexplicably appearing on their network...and not terribly close
> geographically to the Orlando/Ocala areas.
Which supports my theory of a crossed circuit. Crossing a DS1 onto
the wrong DS3 or OCx could easily make it pop up anywhere. Somewhere is
another site that's having T1 problems.
> So, I have a circuit that's bleeding duplicate packets onto an unrelated
> IP network, a circuit provider who's got their head in the sand and
> keeps telling me "this can't happen, we can't help you", and customers
> who were getting tired of receiving all their packets in triplicate (or
> more) saturating their connections and confusing their applications.
> After a while, I had to give up on finding the problem and focus on just
> making it stop. After trying a couple of things, the solution I found
> was to change the encapsulation we use at each end of the DS3. I
> haven't gotten confirmation of this from Sprint, but I assume they're
> now seeing massive input errors on the one or more circuits where our
> packets were/are appearing. The important thing (for me) is that this
> makes the packets invalid to Sprint's routers and so it keeps them from
> forwarding the packets to us. Cisco TAC finally got back to us the day
> after I "fixed" the circuit...but since it was obviously not a problem
> with our cisco gear, I haven't pursued it with them.
Right. By changing the encap you've basically killed the circuit. With
that T1 effectively down on your end you won't be sending any packets
down the problem path and aren't able to see that problem anymore with
your traceroutes. However your customer with the bundle of T1s is down
It makes sense in my mind that it's simply a crossed circuit in the
middle. Your transport provider for whatever reason pulled out a DS1
and sent it down a different path. They accidentally crossed DS1s in
the middle and are handing your DS1 to a Sprint customer and their DS1
to your customer. That's my theory at least.
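As a toy model of what I'm picturing (every circuit name and slot number here is invented, none of it comes from your DLR): treat the provider's cross-connects as a map from each DS1 to the site it's supposed to land at, swap two entries, and compare against the design.

```python
def misrouted(designed, actual):
    """Return {ds1: (designed_site, actual_site)} for every DS1 that lands in the wrong place."""
    return {
        ds1: (designed[ds1], actual.get(ds1))
        for ds1 in designed
        if actual.get(ds1) != designed[ds1]
    }

# Hypothetical design: which site each DS1 should be handed to.
designed = {
    "your-DS3/ds1-5": "your-customer",
    "your-DS3/ds1-6": "your-customer",
    "other-DS3/ds1-12": "sprint-customer",
}

# What the theory says happened: two DS1s got swapped in the middle.
actual = dict(designed)
actual["your-DS3/ds1-5"], actual["other-DS3/ds1-12"] = (
    actual["other-DS3/ds1-12"],
    actual["your-DS3/ds1-5"],
)

for ds1, (want, got) in sorted(misrouted(designed, actual).items()):
    print(f"{ds1}: designed for {want}, actually lands at {got}")
```

If that's what happened, both ends of the swap should show up as a problem: one T1 in your customer's bundle and one T1 at some Sprint customer.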