Extreme congestion (was Re: inter-domain link recovery)

Sean Donelan sean at donelan.com
Thu Aug 16 03:39:14 UTC 2007



[...Lots of good stuff deleted to get to this point...]

On Wed, 15 Aug 2007, Fred Baker wrote:
> So I would suggest that a third thing that can be done, after the other two 
> avenues have been exhausted, is to decide to not start new sessions unless 
> there is some reasonable chance that they will be able to accomplish their 
> work. This is a burden I would not want to put on the host, because the 
> probability is vanishingly small - any competent network operator is going to 
> solve the problem with money if it is other than transient. But from where I 
> sit, it looks like the "simplest, cheapest, and most reliable" place to 
> detect overwhelming congestion is at the congested link, and given that 
> sessions tend to be of finite duration and present semi-predictable loads, if 
> you want to allow established sessions to complete, you want to run the 
> established sessions in preference to new ones. The thing to do is delay the 
> initiation of new sessions.

I view this as part of the flash crowd family of congestion problems, a 
combination of a rapid increase in demand and a rapid decrease in 
capacity.  But instead of targeting a single destination, the impact is
across multiple networks in the region.

In the flash crowd cases (including DDOS variations), the place to respond 
(note the word change from "detect" to "respond") to extreme congestion 
does not seem to be at the congested link but several hops upstream of 
the congested link. Current "effective practice" seems to be 1-2 ASNs 
away from the congestion/failure point, but that may simply be the 
distance needed to reach an "effective" ISP backbone engineer response.



> If I had an ICMP that went to the application, and if I trusted the 
> application to obey me, I might very well say "dear browser or p2p 
> application, I know you want to open 4-7 TCP sessions at a time, but for the 
> coming 60 seconds could I convince you to open only one at a time?". I 
> suspect that would go a long way. But there is a trust issue - would 
> enterprise firewalls let it get to the host, would the host be able to get it 
> to the application, would the application honor it, and would the ISP trust 
> the enterprise/host/application to do so? is ddos possible? <mumble>

For the malicious DDOS case, of course we don't expect the hosts to obey. 
However, in the more general flash crowd case, I think the expectation of 
hosts following the RFC is pretty strong, although it may take years for
new things to make it into the stacks.  It won't slow down all the 
elephants, but maybe it can turn the stampede into just a rampage.  And
the advantage of doing it in the edge hosts is that their scale grows with 
the Internet.

But even if the hosts don't respond to the back-off, it would give the 
edge more in-band trouble-shooting information. For example, ICMP 
"Destination Unreachable - Load shedding in effect. Retry after "N" 
seconds" (where N is stored like the Next-Hop MTU). Sending more packets 
to signal congestion just makes congestion worse.  However, having an 
explicit Internet "busy signal" is mostly to help network operators 
because firewalls will probably drop those ICMP messages just like PMTU.
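A rough sketch of what that "busy signal" message could look like on the
wire, assuming it reuses the ICMP Destination Unreachable format and
parks the retry-after value in the same 16-bit slot that RFC 1191 PMTU
discovery uses for the Next-Hop MTU.  The code number 16 here is purely
an illustration, not a real IANA assignment:

```python
import struct

ICMP_DEST_UNREACH = 3
CODE_LOAD_SHED = 16   # hypothetical code point, chosen only for illustration

def checksum(data: bytes) -> int:
    """Standard Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_load_shed(retry_after: int, orig_datagram: bytes) -> bytes:
    """ICMP header: type, code, checksum, unused(16), retry-after(16),
    followed by the offending IP header + 8 bytes as usual."""
    hdr = struct.pack("!BBHHH", ICMP_DEST_UNREACH, CODE_LOAD_SHED,
                      0, 0, retry_after)
    csum = checksum(hdr + orig_datagram)
    return struct.pack("!BBHHH", ICMP_DEST_UNREACH, CODE_LOAD_SHED,
                       csum, 0, retry_after) + orig_datagram

def parse_retry_after(msg: bytes) -> int:
    """Recover N seconds from bytes 6-7, exactly where PMTU lives."""
    return struct.unpack("!H", msg[6:8])[0]
```

A host stack that already parses the Next-Hop MTU field would need very
little new code to extract N, which is part of the appeal.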


> So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK 
> and SCTP INIT in such a way that the hosed links remain fully utilized but 
> sessions that have become established get acceptable service (maybe not great 
> service, but they eventually complete without failing).

This would be a useful plan B (or plan F - when things are really 
FUBARed), but I still think you need a way to signal it upstream 1 or 2 
ASNs from the extreme congestion to be effective. For example, BGP could 
say: for all packets destined to network w.x.y.z announced with community 
a, implement back-off queue plan B.  Probably not a queue per network in 
backbone routers, just one alternate queue plan B for all networks 
carrying that community.  Once
the origin ASN feels things are back to "normal," they can remove the
community from their BGP announcements.
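The control-plane side of that is small.  A toy sketch of the RIB-side
bookkeeping, where the community value 64512:911 is an assumption (any
agreed-upon value would do) and the real implementation would of course
be a routing-policy knob, not Python:

```python
PLAN_B_COMMUNITY = "64512:911"   # hypothetical "extreme congestion" tag

class Rib:
    """Minimal route table tracking which prefixes ask for queue plan B."""

    def __init__(self):
        self.routes = {}   # prefix -> set of communities on the announcement

    def update(self, prefix, communities):
        """Install or refresh an announcement for a prefix."""
        self.routes[prefix] = set(communities)

    def plan_b_prefixes(self):
        """Prefixes whose origin ASN is currently requesting plan B."""
        return {p for p, comms in self.routes.items()
                if PLAN_B_COMMUNITY in comms}
```

When the origin re-announces without the community, the prefix simply
falls out of the plan-B set, matching the "back to normal" step above.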

But what should the alternate queue plan B be?

Probably not fixed capacity numbers, but a distributed percentage across
different upstreams.

   Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
   Datagram protocol packets (UDP, ICMP, GRE, etc) 20% queue
   Session protocol established/finish packets (TCP ACK/FIN, etc) normal queue

That values session oriented protocols more than datagram oriented 
protocols during extreme congestion.
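The first queue plan above can be sketched as a packet classifier,
assuming routers can key on protocol and TCP/SCTP flags.  The 79% figure
for the "normal" queue is my assumption (the remainder after 1% + 20%);
the original plan only says "normal queue":

```python
SESSION_PROTOS = {"tcp", "sctp"}

def classify(proto, flags=frozenset()):
    """Return (queue name, percent of link capacity) for a packet.

    Session-start packets get a tiny 1% slice, datagram protocols 20%,
    and established-session traffic the remaining capacity.
    """
    if proto in SESSION_PROTOS:
        if "SYN" in flags or "INIT" in flags:    # TCP SYN/SYN-ACK, SCTP INIT
            return ("start", 1)
        return ("established", 79)               # the "normal" queue
    return ("datagram", 20)                      # UDP, ICMP, GRE, etc.
```

The second plan is the same classifier with the datagram branch folded
into the normal queue.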

Or would it be better to let the datagram protocols fight it out with the 
session oriented protocols, just like normal Internet operations?

   Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
   Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc) normal queue

And finally why only do this during extreme congestion?  Why not always
do it?


