BGP convergence problem

Tue Jun 8 16:52:57 UTC 2010

On Tue, Jun 08, 2010 at 12:22:04PM -0400, Jared Mauch wrote:
> 
> The Cisco 7600 and 6500 platforms are getting fairly old and have
> underpowered cpus these days.
> 
> Starting in SXH the control plane did not scale quite as well as in
> SXF.  This got better in SXI, but is not back on par with SXF
> performance yet.
> 
> I mostly attribute this to a combination of bloat in software and
> routing tables.  I would start to look for a replacement sooner rather
> than later.

Place blame where blame is due: the cpu may be slow, but the crappy ios
scheduler is the real problem here. We saw a huge reduction in the
number of self-sustaining protocol timeout cycles on these boxes
(where the process of bringing up a new neighbor and converging
routing uses so much cpu that it causes other neighbors to time out,
resulting in a never-ending cycle of fail until you shut down everything
and bring them up one neighbor at a time) with the move from SXF to the
SR branches. We never really went down the SXH/SXI road, but I'd have
assumed they would have introduced the same improvements there too. I
guess you know what they say about assuming. :)

Try the usual suspects:

* Configure "process-max-time 20" at the top level (global config); this
improves interactivity by making the scheduler switch processes more
often.

* Make sure you don't have an overly aggressive control-plane policer.
In my experience the CoPP rate-limits are quite harsh, and if you end up
bumping against them you don't get a graceful slowing of the exchange of
routes, you get protocol timeouts.

* Make sure you don't have any stupid mls rate-limits, such as cef
receive. I don't know why anyone would ever want to configure this; all
it does is make your box fall over faster (as if these things need any
help) by rate-limiting all traffic to the msfc.

* You might want to try something like "scheduler allocate 400 4000",
which gives the vast majority of the cpu time to the control plane
rather than process switching on the data plane (which in theory
shouldn't happen on an entirely hw forwarded box like 6500/7600, though 
of course we all know that isn't true :P).
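
For reference, a rough sketch of how the suggestions above might look
on the box. The two timing values are the ones quoted above; the exact
syntax of the rate-limiter removal and of the show commands is assumed
from 12.2SX/SR-era code and may differ on your release, so check it
against your own config before pasting anything.

    ! global configuration
    process-max-time 20
    scheduler allocate 400 4000
    !
    ! only relevant if a cef receive limiter is actually configured
    no mls rate-limit unicast cef receive

To see whether a policer or rate-limiter is biting in the first place
(exec mode):

    show mls rate-limit
    show policy-map control-plane

If the control-plane policy shows drops in the classes carrying your
routing protocols while neighbors are converging, that is the likely
source of the timeouts.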

Oh and also the OP should take this to the cisco-nsp mailing list, where 
all the good bitching about broken Crisco routers takes place. :)

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)