inter-domain link recovery

Wed Aug 15 12:02:46 UTC 2007

On Wed, Aug 15, 2007 at 12:06:36PM +0800, Chengchen Hu wrote:
> I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.
> 
> We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all the submarine cables. The network condition was quite terrible for the following few days. It could take minutes to load a web page in the US from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the condition should not have been that horrible.

Please see the presentation I made at AMSIX in May (original version by Todd at Renesys): http://www.thedogsbollocks.co.uk/tech/0705quakes/AMSIXMay07-Quakes.ppt

BGP failover worked fine; much of the instability occurred after the cable cuts, as operators found their networks congested and tried to manually change to new, uncongested routes.

(Check slide 4.) The simple fact was that with something like 7 of 9 cables down, the redundancy is useless. Even if operators maintained N+1 redundancy, which is unlikely for many operators, that would imply 50% of capacity was actually in use with 50% spare. However, we see that around 78% of capacity was lost. There was simply too much traffic and not enough capacity. IP backbones fail pretty badly when faced with extreme congestion.
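
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. It assumes all nine cables carry equal capacity (the slides do not say this) and takes the 50% pre-quake utilisation from the N+1 case above; the function names are purely illustrative.

```python
# Back-of-the-envelope check of the capacity figures discussed above.
# Assumption (not from the slides): every cable carries the same capacity.

def surviving_fraction(total_cables: int, cables_down: int) -> float:
    """Fraction of capacity left, assuming equal-capacity cables."""
    return (total_cables - cables_down) / total_cables

def oversubscription(utilisation: float, surviving: float) -> float:
    """Offered load divided by remaining capacity (>1 means congestion)."""
    return utilisation / surviving

surviving = surviving_fraction(total_cables=9, cables_down=7)
lost = 1 - surviving
# Hypothetical N+1 operator: 50% of capacity in use before the quake.
ratio = oversubscription(utilisation=0.5, surviving=surviving)

print(f"capacity lost: {lost:.0%}")
print(f"load vs. remaining capacity: {ratio:.2f}x")
```

Under those assumptions, losing 7 of 9 cables removes about 78% of capacity, and even a well-provisioned operator would be offering roughly 2.25 times the load the surviving cables could carry.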

> And here is what I'd like to discuss with you, especially the network operators:
> 1. Why do BGP-like protocols sometimes fail to recover the path? Is it mainly because of the policy settings of the ISPs and network operators?

No, BGP was fine. This was a congestion issue, ultimately caused by a lack of resiliency in the cable routes in and out of the region.

> 2. What actions will a network operator take when such failures occur? Is it something like: 1) find alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which actions may be time-consuming?

Yes, and as the data shows, this only made a bad situation worse. Any routes that may have had capacity were soon overwhelmed.

> 3. There may be more than one alternative path. What criterion does the network operator use to finally select one or some of them?

Pick one that works? But in this case no such option was available. 

> 4. What information is required for a network operator to find the new route?

In the case of a BGP change, presumably the operator checks that the new path appears to function without excessive latency (a traceroute would be a basic way to check).

In terms of a real fix, it can't be done with BGP; you would need to find unused Layer 1 capacity and plug in a new cable. Slides 28-31 show that this occurred, with Asian networks picking up westward paths to Europe, but it took some manual intervention, time, and money.

I think the real question, given the facts around this, is whether South East Asia will look to protect against a future failure by providing new routes that circumvent single points of failure such as the Luzon Strait off Taiwan. But that costs a lot of money, so the future's not hopeful!

Steve