November Backbone Engineering Report

Mark Knopper mak@merit.edu
Fri Dec 18 05:29:48 UTC 1992


This report is included in this month's Internet Monthly Report.
	Mark



             ANSNET/NSFNET Backbone Engineering Report  
  
                            November 1992
  
            Jordan Becker, ANS      Mark Knopper, Merit  
               becker@ans.net          mak@merit.edu
  
  
Network Status Summary
======================
     All remaining T1 backbone traffic was cut over to the T3 backbone,
and the T1 backbone network was turned off on December 2nd.

     Network stability problems were observed at several sites prior to
and during the IETF MBONE video/audio multicast event in November.
The problems were mostly due to increased traffic and to inefficient
processing of source route packets, which caused routing instabilities.

     Software changes have been deployed on the T3 backbone to
reduce the routing instability problems due to the MBONE, and to improve
the efficiency of downloading large routing updates to the packet
forwarding interfaces on the T3 routers.

     New RS960 FDDI interfaces are being scheduled for deployment to
selected T3 ENSS nodes in December.  Performance measurements over
the T3 network are proceeding at sites that already have FDDI interfaces
installed.

   
Backbone Traffic and Routing Statistics  
=======================================  
     The total inbound packet count for the T1 network was
3,589,916,970, down 14.8% from October.  598,015,432 of these packets
entered from the T3 network.
  
     The total inbound packet count for the T3 network was
20,968,465,293, up 10.7% from October.  134,269,388 of these packets
entered from the T1 network.    

     The combined total inbound packet count for the T1 and T3
networks (less cross network traffic) was 23,826,097,443, up 5.9% from
October.
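
     (The combined figure equals the sum of the two inbound totals less
the cross-network packets counted on each side:  3,589,916,970 +
20,968,465,293 - 598,015,432 - 134,269,388 = 23,826,097,443.)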
  
     As of November 30, the number of networks configured in the NSFNET
Policy Routing Database was 7833 for the T1 backbone, and 7581 for the
T3 backbone. Of these, 1642 networks were never announced to the T1
backbone and 1602 were never announced to the T3 backbone.  For the T1,
the maximum number of networks announced to the backbone during the
month (from samples collected every 15 minutes) was 5772; on the T3 the
maximum number of announced networks was 5548.  Average announced
networks on 11/30 were 5707 to T1, and 5495 to T3.



T1 NSFNET Backbone Turned Off  
=============================
     The activities required to turn off the T1 backbone were completed
during November and the network was officially turned off by disabling
routing on the NSS nodes starting at 00:01 EST on 12/2.  Several actions
were taken in advance of the routing shutdown, including installation of the
T1 ENSS206 at CERN in Switzerland, reconfiguration of the T1 NSS
routers to gateway traffic between CA*net and NSFNET, the deployment
of the EON software for OSI encapsulation over IP, and the final
installation of T1 backup circuit infrastructure connecting T3 ENSS nodes
to secondary CNSS nodes.  
 
Remaining Network Cutovers to T3  
-------------------------------- 
     AS 68 (Los Alamos National Laboratory networks) and AS 22
(operated by MilNet at the San Diego Supercomputer Center) were cut
over to use the T3 backbone in November.

     A new T1 ENSS (ENSS206) was installed at CERN in Switzerland to
provide connectivity to the T3 backbone for EASInet.  ENSS206 is
interconnected via a T1 circuit with CNSS35 in New York City.  The
ENSS was initially configured with less than the recommended memory
and had to be upgraded to overcome some performance problems.  Other
than that, the installation went smoothly, and EASInet traffic was cut over
to the T3 backbone on 12/1.
 
     The NSS nodes in Seattle, Ithaca and Princeton were converted for
use by CA*net to allow CA*net to peer with the T3 network until
the longer term GATED hardware peer configurations are available.  The
E-PSP nodes for CA*net will be converted to run the CA*net software and
operate as part of CA*net's domain.  These nodes will run GATED and
exchange routes via BGP with the T3 ENSS.

OSI Support on T3 Backbone 
--------------------------
     OSI CLNP forwarding over the T3 backbone was configured via
encapsulation of CLNP packets in IP using the EON method (RFC 1070),
until native CLNP switching services are available on the T3 routers.
RT PSP routers were configured as EON encapsulators at most regional
and peer networks.  CLNP traffic on a regional network is first routed to
the EON machine, which encapsulates the CLNP packet in an IP packet
and sends it to the remote EON machine associated with the destination
NSAP address prefix in the CLNP packet.  The IP packet generated by
EON contains the source address of the local EON machine and the
destination address of the remote EON machine.  The following static
mapping tables will exist in the EON machines:

NSAP Prefix ->  remote IP address of the EON machine that decapsulates
                the IP packet back into a CLNP packet.

For the local NETs of a router:

NSAP Prefix ->  local NET of the router on the ethernet used to route the
                traffic off the NSFNET service.
            
Changes to these tables, or requests to be added to them, should be sent
to nsfnet-admin@merit.edu.
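
     The following is a minimal sketch of the encapsulation path described
above.  It is illustrative only (it is not the deployed RT PSP EON code);
the table entries and addresses are placeholders, and carrying the CLNP
packet directly in the IP payload with the ISO-IP protocol number (80) is
an assumption of the sketch.

    # Illustrative sketch: map an NSAP prefix to the remote EON decapsulator
    # and wrap the CLNP packet in an IP packet addressed to that machine.
    import socket
    import struct

    # Hypothetical static table: NSAP prefix (hex digits) -> remote EON machine IP.
    NSAP_TO_EON = {
        "470005": "192.0.2.10",     # placeholder prefix and address
        "39840f": "198.51.100.7",   # placeholder prefix and address
    }

    IPPROTO_ISOIP = 80  # assumed CLNP-in-IP protocol number

    def lookup_eon(dest_nsap_hex):
        """Longest-prefix match of the destination NSAP against the static table."""
        best = None
        for prefix, ip in NSAP_TO_EON.items():
            if dest_nsap_hex.startswith(prefix):
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, ip)
        return best[1] if best else None

    def eon_encapsulate(clnp_packet, dest_nsap_hex, local_ip):
        """Build a minimal IP header (no options) carrying the CLNP packet."""
        remote_ip = lookup_eon(dest_nsap_hex)
        if remote_ip is None:
            raise ValueError("no EON mapping for this NSAP prefix")
        total_len = 20 + len(clnp_packet)
        header = struct.pack("!BBHHHBBH4s4s",
                             0x45, 0, total_len,       # version/IHL, TOS, total length
                             0, 0,                     # identification, flags/fragment
                             64, IPPROTO_ISOIP, 0,     # TTL, protocol, checksum (omitted)
                             socket.inet_aton(local_ip),
                             socket.inet_aton(remote_ip))
        return header + clnp_packet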

     Testing of native CLNP switching support on the T3 backbone
continued in November.  The AIX 3.2 system software that supports
CLNP switching is in system test and is expected to be available
on the T3 backbone in February '93.
  
T3 ENSS Backup Plan 
------------------- 
     The installation and testing of dedicated T1 leased line circuits
between each T3 ENSS node and a CNSS T1 router in a secondary POP
was completed in November.  The topology for T3 ENSS backup is
illustrated in a PostScript map that is available via anonymous FTP on
ftp.ans.net in the file /pub/info/t3enss-backup.ps.  We have begun to
work on subsequent optimizations to further improve backup connectivity.
There may be situations where a secondary ENSS router is used to
terminate T1 backup circuits. 
 
T1 Backbone Turned Off
----------------------
     The activities required prior to turning off the T1 backbone were
concluded on 12/1, and the shutdown of the T1 backbone commenced on
12/2.  A couple of problems were quickly corrected during
the T1 backbone shutdown.  Some regionals that maintained default
routing configurations pointing to the T1 backbone lost connectivity for a
brief period.  Some regional router configurations were changed, and the
T3 backbone will continue to announce the T1 backbone address
(129.140) from several ENSS nodes for a while longer to ease the
transition.

     Also, there was a problem discovered with the RCP nodes in the
"NSS-router" configuration used for the CA*Net interconnection to the
T3 network.  The RCPs could not handle routing tables containing the full
6,000+ network destinations.  As a workaround, the three NSS-routers are
now configured to advertise the T3 backbone network as a default route
to CA*Net.

ANSNET/NSFNET Operational Experiences with MBONE
================================================
     During the weeks of 11/9 and 11/16, there were a number of
operational problems during the preparation and actual operation of the
IETF MBONE packet video/audiocast.  The use of loose source route
packets and the large volume of MBONE traffic appear to have caused
fairly widespread problems for several Internet service providers.
However, contrary to what was earlier believed, the volume of MBONE
traffic and source route optioned packets did not itself seem to adversely
affect the ANSNET/NSFNET.  There were severe routing instabilities with
peer networks at
several ANSNET/NSFNET border gateways including E128 (Palo Alto),
E144 (FIX-E), E145 (FIX-W) and most notably at E133 (Ithaca) due to the
MBONE traffic and processing of source route packets.  The instability in
these peer networks coupled with inefficient handling of very large and
frequent routing changes introduced through EGP resulted in some
ANSNET/NSFNET instabilities.  Networks carrying MBONE traffic
frequently stopped being advertised by external peers and were timed out
by the ENSS.  The external peer would then stabilize and re-advertise
these networks to the ENSS soon thereafter.  This process repeated in a
cyclical fashion.

     This caused a few connectivity problems at various places on the
ANSNET, but was by far the worst at ENSS133 (Ithaca).  One reason the
problem was worse at ENSS133 than elsewhere was that Cornell was on
the sending end of a fair number of MBONE tunnels, which meant that
card-to-system traffic for unreachable destinations tended to be higher on
the ENSS133 router than elsewhere.  Several actions were taken during
the week of 11/16 (the IETF video/audiocast) that reduced the severity of
this problem, including:

(a)  ICMP unreachable messages were turned off on the external
     interfaces of ENSS routers that experienced problems.  These
     messages were not being processed directly on the external ENSS
     interfaces, which resulted in some inefficiency.  New software will be
     deployed in early December to correct this.

(b)  SprintLink rerouted traffic (and the MBONE tunnel) from the IETF to
     Cornell away from the CIX path (via an internal PSInet path) and onto
     the T3 ANSNET path.  This improved stability within PSInet and within
     ANSNET.

(c)  Cornell rerouted traffic (MBONE tunnel) to SDSC from the PSInet
     path to the T3 ANSNET path.

(d)  One of the two parallel IETF audio/video channels was disabled.

(e)  A default route was established on ENSS133 pointing to its adjacent
     internal router (CNSS49).  This ensured that card<->system traffic
     being processed due to unreachable destinations was moved to the
     CNSS router, which was not involved in processing EGP updates.

(f)  A new version of the routing software was installed on the four
     ENSS nodes that experienced route flapping to aggregate EGP
     updates from external peers before sending IBGP messages to other
     internal T3 routers.

The combination of all of these actions stabilized ENSS133 and the other
ENSS routers that experienced instabilities.  There are several actions
that we have already implemented, or will soon implement, to avoid
ANSNET border router instabilities during future MBONE multicast events:
 
(1)  The ENSS EGP software has been enhanced to support improved
     aggregation of updates from external peers into IBGP update
     messages.  The ENSS will now aggregate EGP-derived routes
     together into a single update before flooding them to other routers
     across the backbone via IBGP (see the sketch following this list).
     This improves the efficiency of the ENSS dramatically.

(2)  A change to the ANSNET router interface microcode has been
     implemented (and will be deployed during early December) so that
     problems resulting from large amounts of ENSS card-system traffic
     will be eliminated when destinations become unreachable.  Even if
     mrouted keeps sending traffic, this will be dropped on the incoming
     ENSS interface.
 
(3)  The T1 NSFNET backbone was disconnected on 12/2.  The T1
     network (particularly the interconnect points with the T3 system) was
     a major source of route flapping, and eliminating it should provide
     an additional margin for handling instability from other peer networks.
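
As a concrete illustration of item (1), the following is a minimal sketch of
the aggregation idea.  It is illustrative only (it is not the ANS routing
daemon); the class, the hold interval, and the callback are invented for
the example.

    # Illustrative sketch: batch route changes learned from external EGP peers
    # over a short hold interval, then flood one combined IBGP update instead
    # of flooding a separate update for every individual change.
    import time

    class UpdateAggregator:
        def __init__(self, hold_seconds=5.0):
            self.hold_seconds = hold_seconds   # assumed interval, for illustration
            self.pending = {}                  # network prefix -> latest attributes
            self.deadline = None

        def egp_route_change(self, prefix, attributes):
            """Record a route learned (or withdrawn) from an external EGP peer."""
            self.pending[prefix] = attributes
            if self.deadline is None:
                self.deadline = time.monotonic() + self.hold_seconds

        def maybe_flood(self, send_ibgp_update):
            """Once the hold timer expires, send a single aggregated IBGP update."""
            if self.deadline is not None and time.monotonic() >= self.deadline:
                send_ibgp_update(dict(self.pending))  # one message for all changes
                self.pending.clear()
                self.deadline = None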
     
While the changes we are making to the T3 network will significantly
improve T3 network performance in dealing with external EGP peer
flapping and related MBONE routing problems, our changes will *NOT*
address the problems that other Internet networks may experience when
processing source route packets and handling routing transitions with
MBONE tunnels.

We recommend that each service provider develop its own internal
routing plan to address this; we continue to recommend migration to
BGP at all border gateways; and we recommend that MBONE software
be upgraded to support IP encapsulation, to avoid the problems with
routers that do not process loose source route optioned packets
efficiently.  We also recommend that the MBONE developers
explore optimizing the mrouted software to avoid the sustained
unidirectional flows to unreachable destinations that we observed.
Finally, it is recommended that an mrouted machine be maintained on the
ENSS DMZ of each participating regional, and this node be used as a
hierarchical distribution point to locations in the local campus and regional.
Backhauling of traffic across campuses and regionals should be
discouraged.
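
As background on the source route issue, the following is an illustrative
sketch (not mrouted or router code) of what a loose source route option
looks like on the wire.  The hop addresses are placeholders.  Packets
carrying IP options are typically processed outside a router's fast path,
while an IP-in-IP encapsulated tunnel packet (IP protocol 4) is forwarded
like ordinary unicast traffic.

    # Illustrative only: build a loose source route (LSRR) IP option of the
    # kind used by early MBONE tunnels.
    import socket
    import struct

    def lsrr_option(hop_addresses):
        """Return LSRR option bytes: type 131, total length, pointer, then hops."""
        hops = b"".join(socket.inet_aton(a) for a in hop_addresses)
        return struct.pack("!BBB", 131, 3 + len(hops), 4) + hops

    # Example: a two-hop loose source route (placeholder addresses).
    option = lsrr_option(["192.0.2.1", "198.51.100.2"])
    # An IP-in-IP tunnel instead carries the multicast packet as the payload of
    # a normal IP packet with protocol number 4, so intermediate routers do not
    # have to do per-packet option processing.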

  
Routing Software on the T3 Network  
==================================  
     New routing software was installed on the T3 backbone in November
to support various enhancements: improved route download performance
using asynchronous IOCTL calls, new support for static routes, and
checks on the size of BGP updates and attributes.

     New software will be installed in early December that addresses the
routing instability problems observed during the MBONE multicast events
in November.  This will include the code for improved aggregation of EGP
updates, thereby eliminating CPU starvation during EGP route flapping.


Next AIX 3.1 System Software Release
====================================
     New RS6000 system software was deployed on several T3 network
nodes in late November and will be fully deployed by early December. 
The most significant change is the ability to drop packets for which no
route is available on the interface that receives them.  During transient
routing conditions we may experience high card-to-system traffic due to
route downloads, which can cause transient instabilities.  This change will
improve the efficiency of route installs/deletes between router system
memory and the on-card forwarding tables.  This code also supports the
generation of ICMP network unreachable messages on the card rather
than on the system.  There are two bug fixes in this software: one for the
performance problem that can occur when an IBGP TCP session gets
deadlocked, and another that avoids FDDI problems if a T1 backup link
on a T3 ENSS gets congested.  Also, all ENSS interfaces now support
path MTU discovery.
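
The following is a minimal sketch of the forwarding behaviour described
above.  It is illustrative pseudo-logic only (it is not the actual interface
microcode); the function and table names are invented for the example.

    # Illustrative sketch: when the on-card forwarding table has no route for a
    # destination, drop the packet on the receiving interface and generate the
    # ICMP unreachable on the card, rather than punting to the router system.
    def forward_on_card(packet, on_card_routes, send, send_icmp_unreachable):
        route = on_card_routes.get(packet["dst"])   # simplified lookup
        if route is None:
            send_icmp_unreachable(packet)           # generated on the card itself
            return                                  # no card-to-system traffic
        send(route["next_hop"], packet)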

 
RS960 FDDI Deployment Status 
============================ 
     We are proceeding to schedule RS960 FDDI adapter installations on
several ENSS nodes in December including ENSS134 (NEARnet),
ENSS130 (Argonne), ENSS145 (FIX-E), ENSS136 (SURAnet), ENSS139
(SESQUInet), ENSS144 (FIX-W), ENSS142 (WestNet), and ENSS143
(NorthWestNet). 

     There are performance tests under way involving Pittsburgh
Supercomputer Center, San Diego Supercomputer Center, and the National
Center for Supercomputing Applications that are designed to exploit the
high-bandwidth capability of FDDI.  The MTUs on various interfaces in the
T3 backbone have been changed as described in the October engineering
report, and the ENSS nodes will negotiate path MTU discovery with other
systems outside the T3 backbone.  During the tests performed so far, there
have been some observations of low-level packet loss (1.6%) between the
Cray end systems at the supercomputer centers, which has kept the
large-window TCP implementations from achieving peak performance over
the T3 backbone.  The
packet loss problems have not been traced to sources inside the T3
backbone, and the problem is being investigated by the supercomputer
centers.  
  
   
Network Source/Destination Statistics Collection 
================================================ 
     During November, we collected the first full month of T3 network
source/destination traffic statistics.  This data will be used for topology
engineering, capacity planning, and various research projects.

 
RS960 Memory Parity Problem 
=========================== 
     During November, we experienced two problems on CNSS17 and
CNSS65 due to the failure of parity checking logic within the on-card
memory on selected RS960 T3 adapters.  These outages are very
infrequent and do not generally result in ENSS isolation from the network
since only a single interface will be affected, and redundant connectivity
is employed on other CNSS interfaces.  The problems were cleared by a
software reset of the interface.  The problem is suspected to be due to
the memory technology used on some of the cards.
 
 









