Outbound Route Optimization

Tue Jan 27 02:02:38 UTC 2004

Richard A Steenbergen wrote:
> > The issue that you describe does indeed offer some constraints to the
> > application of route optimization technology. Within the scope of this
> > issue, though, I think that you would agree that a network which is ALL
> > transit would face no challenge here -- and more specifically, if there
> > is a routing optimization decision among local transit links, that
> > problem could be solved independently of the existence of "non-transit"
> > links.
> 
> Just noting why it will never be anything other than a small customer
> transit-only solution. As long as you are guaranteed by design that your
> product will never be applicable to large networks or networks with any
> peering, you know that odds are VERY slim you'll ever have anyone with
> real network clue using the product. Under such conditions, snake oil
> sales flourish.

It appears to me that you've acknowledged that route
optimization does solve a problem, even though it is not
a complete solution for your network. The claim of
'snake oil' seems inappropriate in this context.

One step further: if you are running a network of this
type, then there seems to be a large likelihood that 
you are selling transit. Thus, your customers may well
be using technology of this sort to provide real solutions
to THEIR problems. (Specifically, they may be directing
traffic towards providers that are to _their_ advantage,
and gaining detailed insight into the real quality
of connectivity being provided to them.)

It's not clear to me how you chose to define "real network 
clue", but I would not suggest that your customers are 
completely lacking in that area. :)

> > In other cases, it may be possible to define the set of destinations
> > that are legal over a given link, and constrain measurements for that
> > link.
> 
> Good luck making this scale. :)

Granted, it is a limited solution, but it still
solves a set of real-world problems.
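
To make the idea concrete, here is a rough sketch of
constraining measurements per link. The link names and
prefixes are made up for illustration, and a real system
would presumably derive the legal set from routing policy
rather than a hand-written table.

```python
import ipaddress

# Hypothetical links and the destinations that are "legal" over each.
# A transit link may carry anything; a peering link only the peer's prefixes.
LEGAL_DESTINATIONS = {
    "transit_a": [ipaddress.ip_network("0.0.0.0/0")],
    "peer_x": [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ],
}


def candidate_links(destination, legal=LEGAL_DESTINATIONS):
    """Return the links over which `destination` may be measured and routed."""
    addr = ipaddress.ip_address(destination)
    return sorted(
        link
        for link, networks in legal.items()
        if any(addr in net for net in networks)
    )


# Only probe a destination over links where it is legal.
print(candidate_links("192.0.2.10"))   # peer_x and transit_a
print(candidate_links("203.0.113.5"))  # transit_a only
```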

> What is broken for one provider and fixed at another may very well break
> something else that was working before at the first provider, yes? Besides
> the difficulties of assigning a true metric to the overall reachability of
> a /8 or any aggregate for that matter ("ok we decreased rtt by 20ms to
> these 3 destinations doing 15Mbps each but we increased rtt to this other
> destination doing 40Mbps by 60ms so we're better right?"), 

Having measurement traffic that directly correlates to
actual traffic makes this problem much more manageable.
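
One way to put a number on the example quoted above is to
weight each RTT change by the traffic it affects. This is
only a sketch: weighting by Mbps is my assumption about a
reasonable metric, not the only possible one.

```python
def weighted_rtt_change(flows):
    """Traffic-weighted mean RTT change in ms.

    `flows` is a list of (mbps, rtt_delta_ms) pairs; a negative
    delta means the RTT improved. A positive result means the
    change made things worse on a per-bit basis.
    """
    total_mbps = sum(mbps for mbps, _ in flows)
    if total_mbps == 0:
        raise ValueError("no traffic to weight by")
    return sum(mbps * delta for mbps, delta in flows) / total_mbps


# The quoted example: three 15 Mbps destinations gain 20 ms,
# one 40 Mbps destination loses 60 ms.
flows = [(15, -20), (15, -20), (15, -20), (40, 60)]
change = weighted_rtt_change(flows)
print(f"{change:+.1f} ms")  # (-900 + 2400) / 85, roughly +17.6 ms
```

Under this particular weighting the change in the example
is a net loss, so it would not be made.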

> The problems then become:
> 
>  * The quicker you try to react, the more you place yourself at risk of
>    starting a best path flap cycle.
> 
>  * Congestion does not only happen on your uplink circuit, it can happen
>    at every point along the path, including peers, backbone circuits, and
>    even the end user/site links. While I find the sales pitches of people
>    touting the horrors of peering to be quite sad (from Internap to the
>    classic MAE Dulles :P), peering capacity is largely based on the
>    ability to predict the traffic levels far in advance. It doesn't take
>    that many "large" customers selecting certain destinations through one
>    provider at once to blow up a peer in one region.

Flap control is an important consideration. 

Note that in the described topology, changing the selection
of an egress point does not affect the routing tables of
external networks (as opposed to flapping route advertisements
to steer inbound traffic).

I do think that it's useful to compare the behaviour of 
"mortal" BGP in the conditions you describe ... if BGP
selects a path that is, or becomes, congested ... BGP 
has no feedback mechanism to make a change until the 
overall topology changes, or until manual intervention. 

An automated route optimization system can evaluate 
the performance, and current load, of alternate egresses, 
make an automated change to the egress, and then monitor
the success of the change. In most cases, the overall 
conditions will have been improved. In the case you 
describe above, the route change results in suboptimal
performance, and a new decision is needed. This process
needs to have effective flap control. This is an area
in which I've seen a fair amount of development, and
good results in years of production use.
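
As a rough sketch of the evaluate / change / monitor loop
with flap control, consider the following. The class name,
thresholds, and the use of RTT alone are assumptions for
illustration; it is not a description of any particular
product. It combines two common damping ideas: a minimum
improvement before switching (hysteresis) and a hold-down
timer after each change.

```python
class EgressSelector:
    """Pick an egress from measurements, with simple flap damping."""

    def __init__(self, initial, min_gain_ms=10.0, hold_down_s=300.0):
        self.current = initial
        self.min_gain_ms = min_gain_ms  # hysteresis threshold
        self.hold_down_s = hold_down_s  # minimum time between changes
        self.last_change = None

    def update(self, now, rtt_ms):
        """`rtt_ms` maps egress name -> measured RTT in ms.

        Assumes the current egress is always among the measured ones.
        Returns the egress selected after this measurement round.
        """
        best = min(rtt_ms, key=rtt_ms.get)
        gain = rtt_ms[self.current] - rtt_ms[best]
        in_hold_down = (
            self.last_change is not None
            and now - self.last_change < self.hold_down_s
        )
        if best != self.current and not in_hold_down and gain >= self.min_gain_ms:
            self.current = best
            self.last_change = now
        return self.current


sel = EgressSelector("transit_a")
sel.update(0, {"transit_a": 50, "transit_b": 45})    # gain too small, stay
sel.update(10, {"transit_a": 80, "transit_b": 45})   # switch to transit_b
sel.update(20, {"transit_a": 20, "transit_b": 45})   # hold-down, stay
sel.update(400, {"transit_a": 20, "transit_b": 45})  # switch back
```

A change that turns out badly is simply seen in the next
measurement round; the hold-down bounds how quickly the
system can reverse itself.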

> Balancing the traffic of a GigE and a couple of FastE transits to keep
> each one uncongested may be enough functionality to sell some boxes to
> some low end users, but this falls into the categories I've described
> above, and does nothing to address the true end to end performance.

It's not clear to me what you mean here by "true end to end 
performance". I don't pretend that the approach being discussed
is a COMPREHENSIVE solution to all the problems that can impair 
performance; but I do think that for the class of performance 
problems that are directly observable via inspection of alternate 
egresses, redirecting the egress does in fact address "true end to 
end performance". 

> Thus the only real solution to the problem if you actually want to
> optimize traffic is:
> 
> >   c) Dynamically measure all of the possible
> >      deaggregations of all active space, and dynamically
> >      determine which prefixes need to be deaggregated
> >      to what level.
> >
> > Note that in any of the above cases, the de-aggregated
> > routes should be marked NO_EXPORT.
> 
> Throw away the BGP routing table completely, and build your own based on
> the topology and metrics you have detected. Of course, this means saying
> goodbye to the usual failsafe method of keeping the normal BGP routes in
> the table with a lower localpref so if the box falls over you just fail
> back to normal BGP path selection. 

This alone seems to make adoption of such technology
rather difficult ... 
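
For comparison, the failsafe being given up looks roughly
like this. It is a deliberately simplified sketch: only
local-pref is compared, and the route entries and "source"
field are hypothetical.

```python
def best_route(routes):
    """Highest local-pref wins (simplified; ignores later tie-breakers)."""
    return max(routes, key=lambda r: r["local_pref"]) if routes else None


def withdraw(routes, source):
    """Drop every route learned from `source`, e.g. when that box fails."""
    return [r for r in routes if r["source"] != source]


table = [
    # Normal BGP-learned route, kept in the table at a lower local-pref.
    {"next_hop": "transit_a", "local_pref": 100, "source": "bgp"},
    # Route injected by the optimizer at a higher local-pref.
    {"next_hop": "transit_b", "local_pref": 200, "source": "optimizer"},
]

print(best_route(table)["next_hop"])  # optimizer's choice while it is up

# If the optimizer falls over, its routes go away and the
# ordinary BGP selection takes over again.
table = withdraw(table, "optimizer")
print(best_route(table)["next_hop"])
```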

> And probably more importantly, there
> isn't enough scale in the traffic probing system to gather the necessary
> topology info once for every customer... 
> ... Maybe if you made everyone's
> boxes report data back to a central site, you could gather something
> useful from it.

IMHO, that approach has demonstrated scalability limitations. 

Performance and load information tend to go stale very
quickly.

------------------------------------------------------

While it does seem obvious that a richer palette of routing
policy control SHOULD be a core part of the routing fabric, 
I don't expect to see BGPv4 (or multihoming under IPv6)
providing real solutions for this set of problems for the 
foreseeable future. 

cheers -- Sean