EGP-related routing problem on T3 backbone
Wed Aug 12 02:47:35 UTC 1992
During the last 72 hours, we have seen multiple instances of
routing instability across the T3 network including RS6000 CPU
starvation on several ENSS nodes and route flapping on E139, E138,
E136, and E134. Several sites have called to complain about route
flapping and instability. We spent considerable time over the last
couple of days trying to isolate this. The following is our current
understanding of the problem, and our plan for correction. We will
update the list as we learn more.
With the new network announcements that resulted from the
scheduled configuration update last Friday (8/7), the T3 ENSS nodes
started sending EGP updates to regional peer routers that exceed 8KB in
size at all sites configured for explicit routing. The problem
starts when the ENSS sends the regional peer an 8KB+ update. The peer
router may flap if it is operating with software that will NOT support
8KB+ EGP updates.
Also, there is a dormant bug in the RS6000 rcp_routed EGP code
which involves routes getting imported via IBGP which do not get
flushed out of a queue. When the regional router flaps (misses one
message from the peer) due to the 8KB+ update described above, the
rcp_routed routes derived from EGP sitting in the queue do not get
installed and this might indirectly result in a routing inconsistency
between the ENSS and its CNSS neighbor.
We suspect that a simple fix to the T3 network problem is to
flush the queue every time we timeout the EGP derived routes. We have
a new version of rcp_routed that flushes the queue and has a trace
statement that logs the event. We will install the new rcp_routed on
C99, C51 and if successful, we will install it on E138 at 5am EST 8/11
(with SURAnet approval). If this is successful, we will install it on
several other nodes as emergency maintenance on 8/12pm. We will
send another note to the nwg list tomorrow with an update to the
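
To illustrate the idea behind the fix, here is a minimal sketch in Python. It is not the rcp_routed code; the class, method names, and log text are hypothetical, and the only behavior taken from this note is "flush the queue whenever the EGP-derived routes time out, and log that it happened".

```python
class EgpRouteQueue:
    """Hypothetical stand-in for the pending-route queue described above."""

    def __init__(self):
        self.pending = []  # routes waiting to be installed
        self.log = []      # stand-in for the trace statement

    def enqueue(self, route):
        # A route is queued until the daemon gets around to installing it.
        self.pending.append(route)

    def on_egp_timeout(self, peer):
        # The proposed fix: when EGP-derived routes from a peer time out,
        # discard whatever is still queued so stale entries cannot linger,
        # and record the event so it shows up in the trace.
        flushed = len(self.pending)
        self.pending.clear()
        self.log.append(f"flushed {flushed} queued routes after EGP timeout from {peer}")
        return flushed
```

In this sketch the flush is unconditional on timeout, which matches the "simple fix" wording; a real implementation may need to be more selective about which queued routes it discards.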
However, this fix will not solve the problem of the 8KB+ updates
causing route flapping on several regional peer routers. Peer
networks that are running BGP should not experience this problem. We
have contacted Cisco, Proteon, and Wellfleet, and have learned the
following regarding their suggested software fixes to this problem.
On Cisco routers, whether you experience this problem depends on the
version of software that you are running. To find out what you're
currently running, do a "show buffers" and note the size of the huge buffers.
This is the maximum size IP packet that the router can reassemble. If
EGP updates come in that are larger than this, then you will get
reassembly failures which can be seen in "show ip traffic".
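
As a rough worked example of why these updates need reassembly at all, the Python sketch below counts the IP fragments for a large datagram and checks whether the reassembled packet fits a given buffer. The 1500-byte MTU, the 8300-byte update size, the option-free 20-byte IP header, and 1KB = 1024 bytes are all assumptions chosen for illustration; they are not measurements from the T3 network.

```python
import math

IP_HEADER_BYTES = 20  # IPv4 header with no options (assumption)


def fragment_count(datagram_bytes, mtu_bytes):
    """Number of IP fragments needed to carry one datagram over a link."""
    payload = datagram_bytes - IP_HEADER_BYTES
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu_bytes - IP_HEADER_BYTES) // 8 * 8
    return max(1, math.ceil(payload / per_fragment))


def reassembly_fits(datagram_bytes, buffer_bytes):
    """True if the reassembled packet fits in the router's largest buffer."""
    return datagram_bytes <= buffer_bytes


update = 8300  # hypothetical size of an "8KB+" EGP update, in bytes
print(fragment_count(update, 1500))
print(reassembly_fits(update, 8 * 1024))
print(reassembly_fits(update, 12 * 1024))
```

Under these assumptions the update arrives as several fragments, and a router whose largest buffer is 8KB cannot put them back together, while one with a 12KB buffer can.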
In later Cisco releases, there is now a knob so that you can
change the buffer size on huge buffers. Using this, you can
reassemble IP packets of up to 64KB. The following releases support the
following buffer sizes:
    v8.1         8KB buffer
    v8.2         12KB buffer
    v8.3(1)      12KB buffer
    v8.3(>=3)    18KB buffer, but configurable
    v9.0         18KB buffer, but configurable
The Proteon router has a fixed size reassembly buffer. Any
packets bigger than the reassembly buffer will be dropped. Proteon
will generally increase the size of the reassembly buffer in each
release. Its current size (in Proteon releases 11.0 and greater) is
12KB. In release 10.0b, a large number of which are probably still in
the field, the reassembly size was 8KB.
If there is a site that thinks it is having this problem with
a Proteon router, they can contact Proteon customer service to get the
latest software revision. Customer service is familiar with the
reassembly buffer issues.
The Wellfleet router also has a fixed size reassembly buffer.
Any packets bigger than the reassembly buffer will be dropped. Any
Wellfleet router running a software release of v5.6 or older will have
a 4KB buffer. This release stopped shipping 2 years ago. All new
releases are compiled with a buffer size of 16KB and should not
experience this problem.
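
For sites that want a quick way to see where they stand, the buffer sizes quoted above can be put in a small lookup table. The Python sketch below is only a convenience built from the figures in this note; the function name is hypothetical, 1KB is taken to be 1024 bytes, and the sizes for configurable Cisco releases are the defaults as quoted. Releases not named above (for example the newer Wellfleet releases with 16KB buffers) are left out.

```python
KB = 1024  # assumption: the "KB" figures in this note mean 1024 bytes

# Default reassembly buffer sizes as quoted in this note.
DEFAULT_BUFFER_BYTES = {
    ("cisco", "8.1"): 8 * KB,
    ("cisco", "8.2"): 12 * KB,
    ("cisco", "8.3(1)"): 12 * KB,
    ("cisco", "8.3(>=3)"): 18 * KB,  # configurable
    ("cisco", "9.0"): 18 * KB,       # configurable
    ("proteon", "10.0b"): 8 * KB,
    ("proteon", "11.0"): 12 * KB,
    ("wellfleet", "5.6"): 4 * KB,
}


def releases_at_risk(update_bytes):
    """List the (vendor, release) pairs whose default buffer is too small."""
    return sorted(
        key for key, size in DEFAULT_BUFFER_BYTES.items()
        if update_bytes > size
    )


# Example: a hypothetical 8300-byte EGP update.
for vendor, release in releases_at_risk(8300):
    print(vendor, release)
```

The table only reflects defaults, so a Cisco router with a raised huge-buffer size may be fine even if its release appears in the output for a larger update.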
--Jordan Becker, ANS
Mark Knopper, Merit