Recent MBONE Events
Jordan Becker
becker at ans.net
Fri Dec 4 15:11:37 UTC 1992
ANSNET/NSFNET Operational Experiences with MBONE
Jordan Becker, ANS
Mark Knopper, Merit
During the week of 11/9 and 11/16, there were a number of operational
problems during the preparation and actual operation of the IETF MBONE
packet video/audiocast. This note summarizes the problems observed as
we currently understand them, the corrective actions that were taken, and
offers some recommendations for avoiding similar problems during
future MBONE video/audiocasts.
The use of loose-source-routed packets and the large volume of MBONE
traffic appear to have caused fairly widespread problems for several
Internet service providers. However, the volume of MBONE traffic and
source route optioned packets did not seem to adversely affect the
ANSNET/NSFNET, as was earlier believed. There were severe routing
instabilities with peer networks at several ANSNET/NSFNET border
gateways including E128 (Palo Alto), E144 (FIX-E), E145 (FIX-W) and
most notably at E133 (Ithaca) due to the MBONE traffic and processing
of source route packets. The instability in these peer networks coupled
with inefficient handling of very large and frequent routing changes
introduced through EGP resulted in some ANSNET/NSFNET instabilities.
Networks carrying MBONE traffic frequently stopped being advertised by
external peers, and were timed out by the ENSS. The external peer then
stabilized and these networks were then advertised to the ENSS by the
external peer soon thereafter. This process repeated itself in a cyclical
fashion. This seems to have resulted in the following problems as
recorded in our BGP/EGP logs on the ENSS and neighboring CNSS
routers:
(1) The general flapping of routing in the Internet was keeping the
ANSNET routers fairly busy processing external updates.
(2) The routing implementation employed by the MBONE was very slow
about stopping flows to destinations which had become unreachable.
We observed unidirectional flows, most likely due to use of default
routing long after the destination routes had been removed from our
tables. When the ANSNET routers had lost the route to the
destination host (from the peer network), the route from the
destination host to the source still seemed to be working. This
means the source host may still have been hearing MBONE routing
updates from the destination host, even though the destination was
receiving nothing, and this seemed to keep tunnels up far longer
than they should have. We are seeking more information from the
mrouted developers on the dynamics of this.
(3) As a result of (2) above, some ANSNET routers were supporting a
moderate amount of card-to-system traffic as the flows to
unreachable destinations continued. While this was not a problem
by itself, since the volumes were only on the order of a few hundred
packets per second, the processing of these packets on an ENSS
did further reduce the amount of system CPU available for the
processing of routing packets, and this slowed down the updates of
routing information on the ENSS interfaces.
(4) The general routing performance degradation caused by (3) above
occasionally left the ANSNET routers with insufficient resources to
deal with major routing events, such as a large EGP neighbor's
routing session being broken, in a speedy enough fashion to avoid
timeouts on internal sessions (either IS-IS or IBGP) from kicking in
and causing reachability of the ENSS to be lost by the external
peers.
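The asymmetry described in (2) can be illustrated with a toy forwarding model (hypothetical router and network names; Python used purely for illustration): a default route on the source side hides the withdrawal, so the tunnel source keeps transmitting traffic that the backbone can no longer deliver.

```python
# Toy model of the unidirectional-flow problem described in (2):
# the source side follows a default route and keeps sending even
# after the backbone has withdrawn the route to the destination.

def next_hop(table, dst):
    """Exact match or default only; longest-prefix match elided (toy model)."""
    return table.get(dst, table.get("default"))

# Source host's table: a default route hides the withdrawal.
source_table = {"default": "enss133"}
# Backbone (ENSS) table after the external peer's routes timed out:
enss_table = {}          # route to the destination network is gone

def delivered(dst):
    hop = next_hop(source_table, dst)      # source still forwards...
    if hop is None:
        return False
    return next_hop(enss_table, dst) is not None   # ...but the ENSS drops it

print(delivered("dest-net"))   # -> False: traffic flows one way, then dies
```

The source never sees a routing signal telling it to stop, which is consistent with the observation that tunnels stayed up far longer than they should have.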
This caused a few connectivity problems at various places on the
ANSNET, but was by far the worst at ENSS133 (Ithaca). One reason that
the degradation of ENSS133 was worse than at other places was due to
the fact that Cornell was on the sending end of a fair number of MBONE
tunnels, which meant the card-to-system traffic for unreachable
destinations tended to be higher on the ENSS133 router than elsewhere.
The events which precipitated each and every routing failure on ENSS133
were either the withdrawal of most or all of PSI's routes by its router or
the timeout of the ENSS133 EGP session with the PSI router. Fully
withdrawing these routes required the ENSS to withdraw the 600-700
networks PSI advertises to ANSNET from all 78 ANSNET backbone
routers, and shortly thereafter, add them back to all 78 backbone routers.
The same nets were also flapping on the T1 NSFNET, adding slightly to
the load on ENSS133. Due to inefficiencies in redistributing EGP via
IBGP (lack of aggregation of routes resulting in one system call per route
per IBGP peer), ENSS133 had trouble processing back to back changes,
causing an outage until both routers stabilized. There were also a couple
of failures at ENSS144 during the week of 11/16 which seem to have
been prompted by the EGP sessions with two ENSS144 FIX-West peers
(the MILNET peer was one) flapping at about the same time.
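The scale of the redistribution inefficiency is easy to see with back-of-the-envelope arithmetic (the route and router counts come from the text above; the per-route cost model is our simplification):

```python
# Rough cost of flooding a full withdrawal and re-advertisement of
# PSI's routes over IBGP, with and without aggregation of updates.
# Figures from the report: ~600-700 routes, 78 backbone routers.

routes = 650          # midpoint of the 600-700 networks PSI advertises
ibgp_peers = 78       # ANSNET backbone routers

# Unaggregated: one system call per route per IBGP peer, doubled
# because the routes are first withdrawn and then re-added.
per_route_ops = routes * ibgp_peers * 2

# Aggregated: routes batched into one update message per peer
# (again doubled for the withdraw/re-add cycle).
aggregated_ops = ibgp_peers * 2

print(per_route_ops)   # 101400 operations for one flap cycle
print(aggregated_ops)  # 156 update messages
```

Roughly a hundred thousand operations per flap cycle versus a few hundred, which is why back-to-back changes overwhelmed ENSS133 until the aggregation code was deployed.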
There were several actions taken during the week of 11/16 (IETF
video/audiocast) which reduced the severity of this problem including:
(a) ICMP unreachable messages were turned off on the external
interfaces of ENSS routers that experienced problems. These
messages were not being processed directly on the external
ENSS interfaces, which resulted in some inefficiency.
(b) SprintLink rerouted traffic (and the MBONE tunnel) from the IETF to
Cornell from the CIX (via internal PSInet path), to the T3 ANSNET
path. This improved stability within PSInet and within ANSNET.
(c) Cornell rerouted traffic (MBONE tunnel) to SDSC from the PSInet
path to the T3 ANSNET path.
(d) One of the two parallel IETF audio/video channels was disabled.
(e) A default route was established on ENSS133 pointing to its adjacent
internal router (CNSS49). This ensured that card<->system traffic
being processed due to unreachable destinations was moved to the
CNSS router which was not involved in processing EGP updates.
(f) A new version of the routing software was installed on the four
ENSS nodes that experienced route flapping to aggregate EGP
updates from external peers before sending IBGP messages to other
internal T3 routers.
The combination of all of these actions stabilized ENSS133 and the other
ENSS routers that experienced instabilities.
There are several actions which we have already implemented, or will
soon implement, to avoid ANSNET border router instabilities during
future MBONE multicast events:
(1) The ENSS EGP software has been enhanced to support improved
aggregation of updates from external peers into IBGP update
messages. The ENSS will now aggregate EGP derived routes
together into a single update before flooding this to other routers
across the backbone via IBGP. This improves the efficiency of the
ENSS dramatically.
(2) A change to the ANSNET router interface microcode has been
implemented (and will be deployed during the next week) so that
problems resulting from large amounts of ENSS card-to-system traffic
will be eliminated when destinations become unreachable. Even if
mrouted keeps sending traffic, this will be dropped on the incoming
ENSS interface.
(3) The T1 NSFNET backbone was disconnected on 12/2. The T1
network (particularly the interconnect points with the T3 system) was
a major source of route flapping, and eliminating it should provide
an additional margin for handling instability from other peer networks.
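Fix (2) above amounts to a reachability check on the inbound interface card. A minimal sketch (hypothetical table entries; Python for illustration only) of dropping card-to-system traffic when no forwarding entry exists:

```python
# Sketch of fix (2): the interface microcode drops packets whose
# destination has no forwarding entry, instead of punting them to
# the system CPU where they would compete with routing updates.

forwarding_table = {"192.35.82.0": "cnss49"}   # hypothetical entries

def card_filter(packets):
    """Return (passed, dropped): only routable packets reach the system."""
    passed, dropped = [], []
    for dst in packets:
        (passed if dst in forwarding_table else dropped).append(dst)
    return passed, dropped

inbound = ["192.35.82.0", "10.9.9.0", "10.8.8.0"]   # two unreachable dests
passed, dropped = card_filter(inbound)
print(len(passed), len(dropped))   # 1 passed, 2 dropped at the card
```

Even if mrouted keeps sending, the unreachable flows never consume system CPU that the routing process needs.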
While the changes we are making to the T3 network will significantly
improve T3 network performance in dealing with external EGP peer
flapping, and related MBONE routing problems, our changes will *NOT*
improve the problems that other Internet networks may experience when
processing source route packets, and handling routing transitions with
MBONE tunnels.
We recommend that each service provider develop its own internal
routing plan to address this. We continue to recommend the migration
to use of BGP at all border gateways, and we recommend that MBONE
software be upgraded to support IP encapsulation, to avoid the problems
with routers that do not process loose-source-route optioned packets
efficiently. We also recommend that the MBONE developers explore
optimizing the mrouted software to avoid the sustained unidirectional
flows to unreachable destinations that we observed.
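The IP encapsulation recommended here wraps the multicast datagram inside an ordinary unicast IP header (protocol number 4), so transit routers forward the packet on their normal fast path instead of parsing a loose source route option. A minimal sketch of building such a tunnel packet (addresses are hypothetical; the header checksum and options are omitted for brevity):

```python
import struct
import socket

IPPROTO_IPIP = 4   # protocol number for IP-in-IP encapsulation

def ip_header(src, dst, proto, payload_len):
    """Build a minimal 20-byte IPv4 header (checksum left zero for brevity)."""
    ver_ihl = (4 << 4) | 5                  # IPv4, 5 x 32-bit words, no options
    total_len = 20 + payload_len
    return struct.pack("!BBHHHBBH4s4s",
                       ver_ihl, 0, total_len,
                       0, 0,                # identification, flags/fragment
                       64, proto, 0,        # TTL, protocol, checksum (0)
                       socket.inet_aton(src),
                       socket.inet_aton(dst))

# Inner multicast datagram (UDP audio, say), then the unicast tunnel wrapper.
inner = ip_header("192.0.2.1", "224.2.0.1", 17, 0)            # proto 17 = UDP
outer = ip_header("192.0.2.1", "198.51.100.1", IPPROTO_IPIP, len(inner))
packet = outer + inner

# A transit router sees only the outer unicast header; no option parsing.
print(len(packet), packet[9])   # 40-byte packet, outer protocol byte == 4
```

The receiving tunnel endpoint strips the outer header and hands the inner multicast datagram to mrouted, so only the two endpoints need to understand the tunnel.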
Finally, it is recommended that an mrouted machine be maintained on the
ENSS DMZ of each participating regional, and that this node be used as a
hierarchical distribution point to locations in the local campus and regional.
Backhauling of traffic across campuses and regionals should be
discouraged.
There is another MBONE packet video/audiocast scheduled to coincide
with the Concert Packet Video conference on 12/10. We would like to
test the setup of the proposed tunnel topology with participating service
providers prior to this event to ensure stable operation. We would
suggest an off-hours maintenance window with interested service providers
to test the stability of the MBONE prior to 12/10. We are open to
suggestions on the timeframe for this. Tuesday evening 12/8 might be a
good time for this.