October Backbone Engineering Report

Mark Knopper mak
Thu Nov 12 03:51:48 UTC 1992


Hi. I may have procrastinated this too long for it to be in the
official Internet Monthly Report, but here is the report in any case.
	Mark




------------------------------------------------------------------------


                ANSNET/NSFNET Backbone Engineering Report 
 
                               October 1992  
 
                Jordan Becker, ANS      Mark Knopper, Merit 
                becker at ans.net          mak at merit.edu 
 
 
T3 Backbone Status 
================== 
     During October, the T3 network router software was upgraded to
support 10,000 destination networks with up to four routes per
destination.  Development work in support of migration to GATED
routing software continues on schedule.

     Problems that were addressed in October included a major routing
software bug that resulted in 3 separate routing instability events,
an RS960 memory parity problem, and an AIX TCP software bug.  The
phase-4 backbone upgrade activities were completed in October.

     Significant positive experience was gained with the RS960 FDDI
interface in October that will lead to additional FDDI deployments on
T3 ENSS nodes.

     Preparations continued for dismantling the T1 backbone, which
is scheduled for November.  Activities included testing of OSI
CLNP encapsulation over the T1 backbone, deployment of the redundant
backup circuits for the T3 ENSS gateways at each regional network,
collection of network source/destination traffic statistics on the T3
backbone, and cutover of the ESnet networks to the T3 backbone.  The
EASInet and CA*net systems are not yet using the T3 backbone, and will
be cut over in November.


 
Backbone Traffic and Routing Statistics 
======================================= 
     The total inbound packet count for the T1 network was
4,213,308,629, up 20.0% from September.  469,891,322 of these packets
entered from the T3 network.
 
     The total inbound packet count for the T3 network was
18,940,301,000, up 20.8% from September.  185,369,688 of these packets
entered from the T1 network.   
 
     The combined total inbound packet count for the T1 and T3 
networks (less cross-network traffic) was 22,498,348,619, up 20.1% from 
September.
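
     The combined figure follows from the four counts above: add the
T1 and T3 inbound totals, then subtract the packets that crossed from
one backbone to the other, since those are counted once on each side.
A minimal sketch of that arithmetic (the variable and function names
are ours, chosen for illustration):

```python
# Inbound packet counts for October 1992, copied from the report text.
t1_inbound = 4_213_308_629
t3_inbound = 18_940_301_000
t1_from_t3 = 469_891_322   # packets that entered the T1 from the T3
t3_from_t1 = 185_369_688   # packets that entered the T3 from the T1


def combined_inbound(t1, t3, t1_cross, t3_cross):
    """Sum both backbones, then remove packets counted on both."""
    return t1 + t3 - t1_cross - t3_cross


combined = combined_inbound(t1_inbound, t3_inbound, t1_from_t3, t3_from_t1)
print(f"{combined:,}")  # prints 22,498,348,619
```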
 
     As of October 31, the number of networks configured in the NSFNET
Policy Routing Database was 7354 for the T1 backbone, and 7046 for the
T3 backbone.  Of these, 1343 networks were never announced to the T1
backbone and 1244 were never announced to the T3 backbone.  For the
T1, the maximum number of networks announced to the backbone during
the month (from samples collected every 15 minutes) was 5378; on the
T3 the maximum number of announced networks was 5124.  The average
number of networks announced on 10/31 was 5335 to the T1 and 5085 to
the T3.
 
 
Routing Software on the T3 Network 
================================== 
     During October, the T3 network routing and system software was
upgraded to increase the on-card routing table size on the RS960
interfaces (T3/FDDI) and T960 interface (T1/ethernet) to 10,000
destination networks with up to 4 alternate routes per destination
network.  The previous limit was 6000 destination networks with up to
4 alternate routes per destination.
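
     To make the two limits concrete, here is a minimal sketch of a
table bounded by destination count and by alternate routes per
destination.  It is an illustration only: the class name, method names
and the choice to reject an install that exceeds a limit are our
assumptions, since this report does not describe how the on-card table
behaves when it is full.

```python
class RouteTable:
    """Toy routing table with the two limits quoted in the report."""

    def __init__(self, max_destinations=10_000, max_routes=4):
        self.max_destinations = max_destinations
        self.max_routes = max_routes
        self.routes = {}  # destination network -> list of next hops

    def install(self, destination, next_hop):
        """Add a route; return False if a limit would be exceeded."""
        hops = self.routes.get(destination)
        if hops is None:
            if len(self.routes) >= self.max_destinations:
                return False  # assumption: reject when the table is full
            self.routes[destination] = [next_hop]
            return True
        if next_hop in hops:
            return True  # already installed, nothing to do
        if len(hops) >= self.max_routes:
            return False  # assumption: reject a fifth alternate route
        hops.append(next_hop)
        return True

    def delete(self, destination, next_hop):
        """Remove one route; drop the destination when none remain."""
        hops = self.routes.get(destination, [])
        if next_hop in hops:
            hops.remove(next_hop)
            if not hops:
                del self.routes[destination]
```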

     A serious routing bug was exposed that caused instabilities
across the entire T3 system during three different events, the first
on 10/19, and the 2nd & 3rd on 10/23.  We successfully installed a new
version of rcp_routed software on all T3 backbone nodes to fix the
problem.  This bug involved the interface between the routing daemon
and the SNMP subagent.  With the addition of the 86th AS peer on the
T3 backbone, the buffer between the routing daemon and the SNMP
subagent would get corrupted and induce a failure of the routing
software.

     With the increased number of routes in the on-card routing
tables, we have begun to observe performance problems with route
installs/deletes between the on-card forwarding tables and the AIX
kernel.  During transient routing conditions we may experience high
card/system traffic due to route downloads, which can cause transient
instabilities.  In November we plan to deploy new software that will
improve the efficiency of route installs/deletes between the on-card
forwarding tables and the AIX kernel.

     Also during November, we plan to install new routing software
that will support static routes.  This can be used for situations
where there is no peer router available to announce the shared
interface and any networks behind it.  This version of the routing
software will also selectively filter routes out of the local routing
tables on a network and/or AS basis.  The software will also increase
the limit to 16 peer AS numbers per ENSS, and improve the checks for
the size of BGP updates and attributes.

     The development of GATED software to replace the rcp_routed
software base is proceeding on schedule.  During October, the BGP4
protocol was developed and unit tested in GATED along with the interim
link state IGP that will be used to interoperate with internal nodes
running rcp_routed.  We expect to deploy GATED software on the T3
network in early 1993 following the upgrade to the AIX 3.2 operating
system.
 

RS960 Memory Parity Problem
===========================
     During October, we continued to experience some problems on CNSS
nodes due to the failure of parity checking logic within the on-card
memory on selected RS960 T3 adapters.  These problems have largely
been isolated to a few specific nodes including CNSS97 (Denver),
CNSS32 (New York), and CNSS40 (Cleveland).  These outages do not
generally result in ENSS isolation from the network since only a
single interface will be affected, and redundant connectivity is
employed on other CNSS interfaces.  The problem can be cleared by a
software reset of the interface.  Some of these problems have been
alleviated with hardware replacement (e.g. CNSS97 in Denver).


AIX TCP Software Problem
========================
     During October we experienced a TCP session deadlock problem
affecting I-BGP sessions between particular ENSS routers.  A
bug was found in the TCP implementation of the AIX 3.1 operating
system (same bug in BSD) where an established TCP session between two
ENSS routers (e.g. for I-BGP information transfer) would hang and
induce a high traffic condition between an RS960 T3 interface and the
RS/6000 system processor.  This would cause one of the ENSS routers on
either end of the TCP session to suffer from performance problems,
until the problem was cleared with a reboot.  This problem occurred on
several ENSS nodes in October including ENSS134 (Boston), ENSS144
(Ames), ENSS131 (Ann Arbor), and ENSS129 (Champaign).

     A fix to this problem was identified, and successfully tested in
October.  This will be released as part of a new system software build
for the RS/6000 router in November.


Phase-4 Deployment Complete
===========================
     The phase-4 network upgrade was completed in October '92.  The
final steps in the upgrade involved the installation of CNSS36 in the
New York POP, and the completion of T3 DSU PROM upgrades.  The DSU
firmware upgrade supports new c-bit parity monitoring features, and
incorporates several bug fixes.

 
RS960 FDDI Deployment Status
============================
     Since the installation of the new RS960 FDDI adapters for the
RS/6000 router on ENSS128 (Palo Alto), ENSS135 (San Diego), ENSS129
(Champaign), and ENSS132 (Pittsburgh), there have been only two
operational problems.  One involved an instance of backlevel microcode
following a software update, and the other involved a failure of the
original hardware installed.  The experience with RS960 FDDI has been
extremely positive so far.

     There are performance tests under way involving Pittsburgh
Supercomputing Center, San Diego Supercomputer Center, and National
Center for Supercomputing Applications that are designed to exploit the
FDDI high bandwidth capability.  Following the completion of these
tests, additional RS960 FDDI adapters will be deployed.
 
     All production network interfaces are still configured for a 1500
byte Maximum Transmission Unit (MTU).  We will soon reconfigure the
MTU on most network interfaces to maximize performance for
applications designed to exploit T3/FDDI bandwidth, while maintaining
satisfactory performance for sites that interconnect to the T3 routers
via an ethernet-only interface.  The new configuration will be:
 
o Any ENSS with an RS960 FDDI interface will have a 4000 byte MTU
  except for the ethernet interfaces which will remain at 1500 bytes.
  The FDDI interface MTU will be set to 4352 bytes following the
  deployment of AIX 3.2.
 
o All other ENSS ethernet interfaces will have a 1500 byte MTU.
 
o All T3 CNSS interfaces will have a 4000 byte MTU except T3
  interfaces connecting to an ENSS with ethernet only, and interfaces
  connecting to a T1 CNSS.
 
o All T1 CNSS interfaces and T1 ENSS interfaces will have a 1500 byte
  MTU. 
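
     The rules in the list above can be expressed as one lookup
function.  This is a sketch under stated assumptions, not the actual
configuration procedure: the node and media labels are invented for
illustration, and where the list names an exception without giving a
value (or does not cover a case) we assume the interface stays at the
current 1500 bytes.

```python
CURRENT_MTU = 1500      # setting on all production interfaces today
LARGE_MTU = 4000        # planned setting on T3/FDDI-capable paths
FDDI_MTU_AIX32 = 4352   # planned FDDI setting once AIX 3.2 is deployed


def planned_mtu(node, media, enss_has_fddi=False, aix32=False,
                peer_is_ethernet_only_enss=False, peer_is_t1_cnss=False):
    """Return the planned MTU in bytes for one interface.

    node:  "T3-ENSS", "T1-ENSS", "T3-CNSS" or "T1-CNSS" (our labels)
    media: "ethernet", "fddi", "t1" or "t3" (our labels)
    """
    # T1 nodes and every ethernet interface remain at 1500 bytes.
    if node in ("T1-ENSS", "T1-CNSS") or media == "ethernet":
        return CURRENT_MTU
    if node == "T3-ENSS":
        if not enss_has_fddi:
            return CURRENT_MTU  # assumption: case not covered by the list
        if media == "fddi" and aix32:
            return FDDI_MTU_AIX32
        return LARGE_MTU
    if node == "T3-CNSS":
        if peer_is_ethernet_only_enss or peer_is_t1_cnss:
            return CURRENT_MTU  # assumption: exceptions keep 1500 bytes
        return LARGE_MTU
    raise ValueError(f"unknown node type: {node!r}")
```

     For example, planned_mtu("T3-ENSS", "fddi", enss_has_fddi=True)
gives 4000 today and 4352 once aix32=True is passed, while any
ethernet interface gives 1500.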
 
  
Dismantling the T1 Backbone 
=========================== 
     Plans to dismantle the T1 backbone have proceeded on schedule.
We will begin dismantling the T1 backbone in November '92.  This will
occur as soon as (1) the remaining networks using the T1 backbone are
cut over to the T3 backbone (EASInet and CA*net); (2) the OSI CLNP
encapsulation support for the T3 backbone is deployed; and (3) the T3 ENSS
nodes are backed up by additional T1 circuits terminating at alternate
backbone POPs. These activities are described below.

T1 Routing Announcement Change
------------------------------
     In early November a change will be made to eliminate the
redundant announcements of networks from the T3 to the T1 backbone via
the interconnect, for those networks which are announced to both
backbones.  This has the effect of eliminating the use of the T1/T3
interconnect gateway for CA*net and EASInet in the case of isolation
of a multi-homed regional from the T1 backbone. This change is
necessary for the interim to allow these duplicate routes to be
removed from the overloaded routing tables on the T1 RCP nodes, and to
allow the T1 backbone to be used for a few more weeks.
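
     In set terms, the change means the interconnect stops announcing
any network that the T1 backbone already hears directly.  A minimal
sketch of that filter, with invented network labels (the real
selection is made by the routing software and policy database, whose
mechanics are not described here):

```python
def interconnect_announcements(t3_networks, t1_networks):
    """Networks still announced from the T3 to the T1 after the change.

    Networks announced to both backbones are dropped, so only those
    reachable solely through the T3 are passed via the interconnect.
    """
    return sorted(set(t3_networks) - set(t1_networks))


# Hypothetical example: net-B is announced to both backbones.
t3_announced = ["net-A", "net-B", "net-C"]
t1_announced = ["net-B", "net-D"]
remaining = interconnect_announcements(t3_announced, t1_announced)
```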

Remaining Network Cutovers to T3 
--------------------------------
     A new T1 ENSS will be installed in CERN, Switzerland to provide
connectivity to the T3 backbone for EASInet.  Cutover of EASInet
traffic will occur when this installation is complete.

     The Seattle RT E-PSP for CA*net is being converted to run the
CA*net software and operate as part of CA*net's domain. It will run
GATED and speak BGP with the T3 ENSS. Once this has been debugged and
tested, the Princeton and Ithaca connections will similarly be
upgraded.

OSI Support Plan 
---------------- 
     We have successfully tested the RT/PC OSI encapsulator software
(EON) that was described in the August '92 engineering report.
Because the encapsulator software uses IP to route encapsulated OSI
traffic, this can be tested over the production T1 network.
Encapsulation is enabled in one-way mode from NSS17 in Ann Arbor to
selected NSAP prefixes (i.e. encapsulation outgoing, native CLNP
incoming).  This half-duplex (one-way) capability is important for
testing and deployment.  When any of the T1 NSFNET EPSP nodes receives
an encapsulated OSI packet, it decodes the packet and switches it via
native CLNP.  The OSI encapsulator EPSP is configured on a per-prefix
basis (e.g. any prefix configured to use EON at a given NSS will have
its outgoing packets encapsulated).
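
     The per-prefix decision can be pictured as a simple prefix match
on the destination NSAP.  The sketch below is an illustration only:
the function name, the hex-string representation of NSAPs and the
sample prefixes are our assumptions, not the encapsulator's actual
configuration format.

```python
def outgoing_mode(dest_nsap, eon_prefixes):
    """Return "eon" if the destination matches a configured prefix.

    dest_nsap and the prefixes are hex strings (an assumption made for
    this sketch).  Anything that does not match is sent as native CLNP.
    """
    dest = dest_nsap.lower()
    for prefix in eon_prefixes:
        if dest.startswith(prefix.lower()):
            return "eon"
    return "clnp"


# Hypothetical prefixes configured for encapsulation at one NSS.
configured = ["49aa01", "49bb"]
mode_a = outgoing_mode("49AA0100000001", configured)
mode_b = outgoing_mode("49cc0100000001", configured)
```

     Because only the outgoing direction is decided this way, a site
can start encapsulating toward a prefix while still receiving native
CLNP from it, which matches the one-way mode described above.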

     This flexibility in configuration will allow us to switch two
regional networks over to OSI encapsulation during the 2nd week of
November, with full conversion to OSI encapsulation during the 3rd
week of November.

     Native CLNP switching services will be available in an upcoming
release of the RS/6000 AIX 3.2 system software which is scheduled for
deployment on the T3 network in early 1993.
 
T3 ENSS Backup Plan
-------------------
     The plan to provide backup connectivity to T3 ENSS nodes
proceeded on schedule in October.  We have begun to install dedicated
T1 leased line circuits between each T3 ENSS node and a CNSS T1
router in a secondary POP.  These T1 circuits are replacing T1 router
ports that were formerly used by T1 safetynet circuits on the T1 CNSS
routers.  This work is expected to be complete by the end of November.
The planned topology for T3 ENSS backup is illustrated in a PostScript
map that is available via anonymous FTP on ftp.ans.net in the file
</pub/info/t3enss-backup.ps>.  Once the backup infrastructure is in
place, we will begin to work on subsequent optimizations to further
improve backup connectivity.  We have already begun discussions with
several regional networks on this.

     Several regionals have indicated that they will stop peering with
the T1 backbone when their T1 ENSS backup circuit is in place.  The
final phaseout of the T1 backbone will occur after OSI encapsulation
is deployed, the final traffic cutovers are complete, and these backup
circuits are installed.


  
Network Source/Destination Statistics Collection
================================================
     During October we tested and deployed software on the T3 backbone
to collect network source/destination traffic statistics.  This is a
feature that has been supported on the T1 backbone in the past, and
was supported for a brief period on the T3 backbone prior to the
migration to RS960 switching technology.

     For each ENSS local area network interface, we will collect the
following information for each source/destination network pair:
packets (in and out), bytes (in and out), packets distributed by
port number, and packets distributed by protocol type (UDP, TCP).
Packets are
sampled on the card (1 in 50 packets sampled) and forwarded to the
system processor for reduction and storage.  We expect to have
collected a full month of network source/destination statistics by the
end of November.
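
     As an illustration of the reduction step, the sketch below takes
every 50th packet, groups the samples by source/destination network
pair, and scales the counts back up by the sampling rate.  The report
does not say whether the card samples deterministically or at random,
or whether the stored figures are scaled, so both choices here are
assumptions, as are the names and the packet tuple layout.

```python
from collections import defaultdict

SAMPLE_RATE = 50  # 1 in 50 packets is sampled, per the report


def sample_and_aggregate(packets, rate=SAMPLE_RATE):
    """Estimate per-pair totals from a 1-in-`rate` packet sample.

    packets: iterable of (src_net, dst_net, size_in_bytes) tuples.
    Returns {(src_net, dst_net): {"packets": n, "bytes": n}}.
    """
    totals = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for index, (src_net, dst_net, size) in enumerate(packets):
        if index % rate != 0:
            continue  # assumption: deterministic every-Nth sampling
        entry = totals[(src_net, dst_net)]
        entry["packets"] += rate       # each sample stands for `rate` packets
        entry["bytes"] += size * rate  # scale bytes the same way
    return dict(totals)


# Hypothetical traffic: 500 identical 100-byte packets between two networks.
stats = sample_and_aggregate([("net-A", "net-B", 100)] * 500)
```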

 
Notable Outages in October
==========================
   
MCI Fiber Outage - 10/17                                            
----------------                                                    
     At 12:33 EST on 10/17 we experienced a major MCI fiber outage on
the east coast.  A truck accident damaged a fiber line between Trenton
and New Brunswick (New Jersey).  This caused an extended loss of
connectivity for several T3 and T1 circuits that transited the MCI
junction in West Orange, New Jersey.  All circuits affected by the
fiber cut were back on their original path as of 10/17/92 22:00 EDT.
During the outage, several circuits were moved to backup restoration
facilities and were moved back during the early morning of 10/18.
There were some periods of routing instability with circuits going
down and coming back up that caused temporary loss of connectivity for
other network sites as well.
                  
Routing Software Instabilities - 10/19, 10/23
------------------------------   
     During the week of October 19th, the T3 network experienced 3
unscheduled outages (roughly 1 hour each in duration).  We
successfully installed a new version of rcp_routed software on all T3
backbone nodes, which fixed the problem on 10/25.  This bug involved
the interface between the routing daemon and the SNMP subagent.  With the
addition of the 86th AS peer on the T3 backbone, the buffer between
the routing daemon and the SNMP subagent would get corrupted and
induce a crash of the routing software.

     This problem first occurred during the cutover of the ESnet peers
at FIX-E and FIX-W on 10/19.  Following the resolution of this
problem, the ESnet peers were successfully cut over to the T3
backbone.





