              ANSNET/NSFNET Backbone Engineering Report
                       December 31, 1991


           Jordan Becker			Mark Knopper
	   becker@ans.net			mak@merit.edu
   Advanced Network & Services Inc.	      Merit Network Inc.



Summary
=======
	The T3 network continued to perform reliably during the month of
December.  A change freeze and stability test period was observed on the T3
network during December 13-20 in preparation for the cutover of additional
traffic from the T1 network which is scheduled to begin in mid-January.
During the stability test period, two test outages were scheduled during
off-peak hours.  Some final software changes are being deployed
following the stability period, and a routing migration plan has been
developed to support the continued traffic migration from T1->T3.

	Changes were deployed on the T1 backbone during December to improve
reliability.  A new routing daemon was deployed to fix the chronic EGP
peer loss problem that had been exhibited most notably on NSS10, as well as on
some other nodes.  The T1 network continues to experience a low level of
congestion, primarily at the ethernet interfaces on some nodes.

	The December 1991 T1 and T3 traffic statistics are available for
FTP in pub/stats on merit.edu.  The total inbound packet count for the T1
network was 9,683,414,659, down 4.4% from November.  529,316,363 of these
packets entered from the T3 network.  The total inbound packet count for the T3
network was 2,201,976,944, up 38.7% from November.  489,233,191 of these
packets entered from the T1 network.  The combined total inbound packet count
for the T1 and T3 networks (less cross network traffic) was 10,866,842,049,
down 3.3% from November.
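
	For illustration, the combined figure can be reproduced from the
per-network counts by subtracting the cross-network traffic, since packets
that cross the interconnect are counted once on each backbone.  A minimal
sketch (the variable names are ours, not part of the statistics files):

    # Reproduce the combined inbound packet count from the per-network totals.
    t1_total = 9_683_414_659      # total inbound packets, T1 network
    t3_total = 2_201_976_944      # total inbound packets, T3 network
    t3_to_t1 = 529_316_363        # packets entering the T1 network from T3
    t1_to_t3 = 489_233_191        # packets entering the T3 network from T1

    # Cross-network packets appear in both per-network totals, so they are
    # removed once each from the combined figure.
    combined = t1_total + t3_total - t3_to_t1 - t1_to_t3
    print(f"{combined:,}")        # 10,866,842,049, matching the reported total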

	Finally, the plan to deploy the RS/960 technology for Phase III of the
T3 backbone is taking shape.  Testing on the T3 Research Network will begin in
January with the possibility of deployment in late February.


T1 Backbone Update
==================

NSS Software Problems and Changes

	We had been experiencing an EGP session loss problem on several T1
backbone NSS systems.  This was occurring most frequently on NSS10 at Ithaca.
This problem has been fixed by a change to the rcp_routed program running on
the RCP nodes in the backbone NSS's.  The problem was due to the timing
between the creation of routing packets and the transmission of those packets
during transient conditions.  This new software prevents the simultaneous loss
of EGP sessions across multiple PSPs in an NSS that had been observed at some
nodes.

	Since this problem was corrected, we have experienced a few isolated
disconnects with an EGP/BGP peer on NSS10 at Ithaca, which are believed to be
unrelated to the earlier problem.  This symptom happens less frequently, and
involves only one PSP at a time.  The latest occurrences have been traced to an
isolation of the PSP from the RCP.  This is due to CPU looping on the PSP
during a flood of traffic sourced from the local ethernet interface.  We are
working to attach a sniffer to the local ethernet to determine the source of
these traffic floods on the PSP ethernet interface.

	Another problem that we have seen roughly three times a month is a
crash of a T1 RCP node due to a known virtual memory bug in the RT kernel.  We
are working on both of these problems now and hope to have them corrected
soon.

	We continue to experience congestion on the T1 backbone.  We
are collecting 15 minute peak and average traffic measures via SNMP on all
interfaces, and we also sample some interfaces at shorter intervals to look at
burst traffic.  We have occasionally measured sustained T1 backbone link
utilization around 50% average, and peaks above 70% on several T1 lines.  We
also have experienced high burst data streaming on the local EPSP ethernet
interface (3500PPS bursts at an average 200 bytes/packet).  We have already
taken a number of actions to reduce the congestion including the addition of
split-EPSP routers and T1 links, and installation of dual-ethernet EPSP
systems where we split the routes across each ethernet interface.  These have
been deployed at Ithaca, College Park, and Champaign.  There are a number of
things we can still do to improve this; however, the greatest reduction in
congestion has come, and will continue to come, from migration of traffic
from the T1 to the T3 network.
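
	As a rough illustration of how utilization figures like those above
are derived from interface octet counters collected via SNMP, the sketch
below computes an interval average; the counter samples and interval are
hypothetical, and a production collector would also have to handle counter
wrap:

    # Estimate average link utilization over one polling interval from
    # SNMP-style interface octet counters (sample values are hypothetical).
    T1_BPS   = 1_544_000          # DS1 line rate, bits per second
    INTERVAL = 900                # 15 minute polling interval, seconds

    octets_start = 1_200_000_000  # ifInOctets at the start of the interval
    octets_end   = 1_287_000_000  # ifInOctets at the end of the interval

    bits_sent = (octets_end - octets_start) * 8
    utilization = bits_sent / (T1_BPS * INTERVAL)
    print(f"average utilization: {utilization:.1%}")   # roughly 50% here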


ICMP Network Unreachable Messages

	The T1 network has exhibited a problematic behavior that was
previously addressed on the T3 network: during routing transients caused by
circuit problems or other reasons, core backbone nodes generate ICMP network
unreachable messages and transmit them outside the backbone, which causes
problems for some host software implementations.  This was addressed on the
T3 network by implementing an option that limits the generation of these
unreachable messages to nodes equipped with an external LAN interface, rather
than allowing the core backbone nodes to generate them.  An equivalent
implementation of this option is now being tested for deployment on the T1
network.
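
	A minimal sketch of the decision that option introduces is shown
below; the node representation and function name are ours for illustration
and do not reflect the actual router software:

    # Sketch of the policy: only nodes with an external LAN interface may
    # originate ICMP network unreachables; core nodes stay silent during
    # routing transients.
    def may_send_net_unreachable(node):
        return node.get("has_external_lan", False)

    core_node = {"name": "core-CNSS-example", "has_external_lan": False}
    edge_node = {"name": "exterior-example",  "has_external_lan": True}

    print(may_send_net_unreachable(core_node))   # False
    print(may_send_net_unreachable(edge_node))   # True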


T1 Backbone Circuit Problems

	On the T1 backbone nodes, unlike on the T3 backbone, CSU circuit
error reporting data is not made available to the RT router software via
SNMP.  This makes it more difficult to generate NOC alerts that correspond
to circuit performance problems recorded by the CSU equipment.  However, the
PSP nodes are able to detect DCD transitions (known as "DCD waffles") and
record them in the router log files.
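
	Because the waffles are only recorded in the router log files,
producing timely alerts amounts to scanning those logs and counting
transitions per interface.  A minimal sketch, with the log format and alert
threshold invented for illustration:

    # Count DCD transitions per interface in router log lines and flag
    # interfaces above a threshold (the log format here is hypothetical).
    from collections import Counter

    sample_log = [
        "Dec 28 02:11:04 psp3 serial1: DCD down",
        "Dec 28 02:11:09 psp3 serial1: DCD up",
        "Dec 28 02:14:31 psp3 serial1: DCD down",
        "Dec 28 03:02:10 psp3 serial2: DCD down",
    ]

    waffles = Counter()
    for line in sample_log:
        if "DCD" in line:
            interface = line.split()[4].rstrip(":")
            waffles[interface] += 1

    THRESHOLD = 2
    for interface, count in waffles.items():
        if count > THRESHOLD:
            print(f"alert: {interface} logged {count} DCD transitions")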

	An increase in circuit performance problems on the T1 backbone has
been observed on several links lately as evidenced by DCD waffles, and some
actions have been taken to resolve the problems.  These include work in
progress to provide more timely reports on all DCD waffle events, as well as
direct integration of carrier monitored T1 ESF data with our SNMP based
network monitoring tools.  Procedures for the diagnosis and troubleshooting
of T1 backbone circuit problems have been improved in cooperation with MCI and
the local exchange carriers.  We have also worked to improve the procedures
and communications between our operators & engineers, and our peer network
counterparts.



T3 Backbone Update
==================

Summary

	During the week of December 13-20, a change freeze was conducted and
stability measurements were performed.  No software or hardware changes were
administered to the backbone nodes, with the exception of the normal Tuesday
and Friday morning routing configuration updates.  Prior to the change freeze
and stability week, several changes and software enhancements were introduced
on the T3 system to address known problems.  During stability week, two
problems were identified.  One problem involves the loss of connectivity to an
adjacent IS-IS neighbor.  Following stability week, a new rcp_routed program
was installed on the network with instrumentation developed to identify the
cause of the problem.  Unfortunately, the problem has not been observed again
since the new code was installed.

	A new plan for T1-T3 routing and traffic exchanges has been developed.
This will support the continued migration of traffic from the T1 to the T3
system which is expected to commence in January 1992.


Pre-Stability Period Changes

	--Safety Net

	The remaining two links for "Safety Net" were installed and
configured.  Safety Net is a collection of 12 T1 links connecting the CNSS
nodes in a redundant fashion within the core of the T3 network.  Safety Net
has proven to be a useful backup path on a couple of occasions, each several
minutes in duration, when all T3 paths out of a core node became unusable due
to a T3 interface or CNSS failure.  Safety Net will remain in place until the
existing T3 interface hardware is replaced with the newer RS/960 interface
hardware, at which point it will no longer be necessary.

	--Routing Software Changes

	Three changes were made to the rcp_routed daemon software.  An
MD4-based digital signature was implemented in the rcp_routed software to
ensure the integrity of IBGP messages between systems.  An enhancement to
increase the level of route aggregation was made in the BGP software,
reducing the size of external routing updates to peer routers.  This provided
a workaround for a problem in some of the regional routers supporting
external BGP in which the peer router would freeze after receiving a BGP
update message.  The "route loss" problem mentioned in the November 1991
report was identified as a bug involving the exchange of information between
the external and internal routing software, and was fixed prior to the
commencement of the stability period.
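
	For illustration, an integrity check of this general kind can be
built by appending a digest computed over the message together with a shared
secret; the receiver recomputes the digest and compares.  The sketch below is
not the rcp_routed format; it uses Python's hashlib and falls back to MD5
where an MD4 provider is not available:

    # Keyed message-digest sketch for checking IBGP message integrity.
    # The message layout and shared secret are illustrative only.
    import hashlib

    def digest(data, secret):
        try:
            h = hashlib.new("md4")     # MD4, as in the deployed scheme
        except ValueError:
            h = hashlib.md5()          # fallback if MD4 is not provided
        h.update(secret + data)
        return h.digest()              # both MD4 and MD5 yield 16 bytes

    secret  = b"example-shared-secret"
    message = b"IBGP UPDATE: example payload"
    wire    = message + digest(message, secret)

    # Receiver side: strip the trailing digest, recompute, and compare.
    body, received = wire[:-16], wire[-16:]
    print(received == digest(body, secret))   # True if the message is intact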

	--AIX Build 3.0.62 Kernel Installed

	A new system software build was deployed on all RS/6000 nodes in the
T3 backbone to fix several problems.  One of these was the T960 "sticky T1"
problem, which would cause a delay on packet forwarding across a T1 interface.
Another problem that was fixed involved a delay in the download of routes
from the RS/6000 system to the T960 ethernet and T1 "smart" interface cards.


Change Freeze and Stability Week 12/13-12/20

	During this period, no hardware configuration or software changes were
administered and several reliability and stability tests were performed.  Some
of these tests included scheduled test outages of selected nodes during
off-peak hours.

	A test outage of the T1/T3 interconnect gateway was performed.  The
external BGP sessions on the Ann Arbor interconnect gateway were disconnected,
forcing the Houston backup interconnect gateway to become operational. This
transition occurred automatically over a 15 minute time period. After the
switchover, the Ann Arbor primary gateway was put back into production.

	Another test that was performed was a node outage of the Denver T3
backbone CNSS.  This node was chosen since it does not yet support any
production ENSS nodes.  The routing daemon on this node was taken down and
brought back up again.  This had no unexpected results, and did not have any
noticeable impact on other network traffic during the IS-IS routing
convergence, which was measured to be on the order of 25 seconds across the
T3 network.

	As a result of these tests and the measurement of improved T3 backbone
stability, the change freeze week was concluded successfully on 12/20.  Plans
are described below to migrate additional traffic from the T1 to the T3
network in January.


Post-Stability Week Actions and Plans

	--New Routing Software

	The new rcp_routed with instrumentation to debug the IS-IS adjacency
loss problem was installed.  This problem has not occurred since 12/22.

	--AIX Kernel Build 3.0.63 Targeted for Installation

	A new software build is being tested at this time to address
the T960 ethernet freeze problem, and to support a full CRC-32 link
layer error check computed in software.  This new software build will
be deployed in two phases.  Build 63 also includes a version of the
NNstat feature which allows the net-to-net traffic statistics matrix to
be collected.  This is a necessary change targeted for deployment prior
to migrating a major portion of the T1 backbone traffic over to T3.
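
	For reference, a software CRC-32 of the kind described can be
computed with a standard library routine; a minimal sketch in which the frame
contents are made up:

    # Compute and verify a CRC-32 over a frame payload entirely in software.
    import binascii

    payload = b"example link-layer frame payload"
    fcs = binascii.crc32(payload) & 0xFFFFFFFF   # 32-bit check value

    # Receiver side: recompute over the received payload and compare.
    received_payload, received_fcs = payload, fcs
    ok = (binascii.crc32(received_payload) & 0xFFFFFFFF) == received_fcs
    print(ok)   # True when the frame arrived intact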

	--Routing Architecture Plan

	With the traffic migration from T1 to T3, it will be necessary to
split the announcements of routes from the T3 network to the T1 network (for
networks that are connected to both T1 and T3) across multiple T1/T3
interconnect gateways, both to balance the load and to ensure that the IS-IS
packets carrying the announcements do not become excessively large.  Routing
announcements from the T1 to the T3 network will be made on all primary
interconnect gateways, as will routing announcements for networks which are
only connected to the T3 network.  The routing configuration database
modifications and configuration updates to support this design are already
underway.
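
	One simple way to realize such a split is to assign each
dually-connected network to exactly one interconnect gateway
deterministically, so that no single gateway's announcement grows without
bound.  A minimal sketch; the gateway names and the hash-based assignment are
ours and not the actual configuration mechanism:

    # Partition T3->T1 route announcements across interconnect gateways so
    # each gateway announces only a subset of the dually-connected networks.
    import zlib

    GATEWAYS = ["gw-annarbor", "gw-houston", "gw-princeton"]   # example names

    def gateway_for(network):
        # A stable hash keeps each network on one gateway across updates.
        return GATEWAYS[zlib.crc32(network.encode()) % len(GATEWAYS)]

    for net in ["35.0.0.0", "128.84.0.0", "192.35.82.0", "131.215.0.0"]:
        print(net, "->", gateway_for(net))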

	In order to provide improved redundancy for traffic between the T1 and
T3 networks, additional T1/T3 interconnect gateways will be established.  A
fourth T1/T3 gateway is being installed at the Princeton site to act as backup
to the Ann Arbor primary gateway.  A fifth and sixth gateway are planned for
future expansion with the expectation that the IS-IS packet size will increase
with additional growth in the total number of networks announced to the T3 and
T1 backbones.

	--T1->T3 Traffic Migration Plan

	A plan has been drafted that addresses the T1->T3 traffic migration in
support of peer networks that are not already using the T3 network.  Regional
networks that maintain a co-located peer router with both a T1 NSS and a T3
ENSS are requested to maintain EGP/BGP peer sessions with both the T1 and T3
networks.  This will allow them to announce their networks to both the T1 and
T3 systems.  It is advised that regionals have their peer routers learn
default routes from the T1 NSS, and explicit routes for all destinations from
the T3 ENSS.  This will result in all traffic destined for a site primarily
reachable via T3 taking the T3 path, and likewise for T1.  The goal here is
to minimize traffic on the interconnect gateways.  Primary reachability via
T3 or T1 will be managed through the adjustment of routing metrics on the T1
and T3 systems.
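
	The effect of that advice follows from ordinary longest-match route
selection: an explicit route learned from the T3 ENSS is more specific than
the default learned from the T1 NSS, so it wins whenever the destination is
reachable via T3.  A minimal sketch with made-up prefixes:

    # Longest-prefix-match lookup showing why explicit T3 routes are
    # preferred over a T1 default route (prefixes are illustrative).
    import ipaddress

    routes = [
        (ipaddress.ip_network("0.0.0.0/0"),     "T1 NSS (default)"),
        (ipaddress.ip_network("128.84.0.0/16"), "T3 ENSS (explicit)"),
    ]

    def next_hop(destination):
        addr = ipaddress.ip_address(destination)
        matches = [(net, via) for net, via in routes if addr in net]
        # The most specific (longest) matching prefix wins.
        return max(matches, key=lambda m: m[0].prefixlen)[1]

    print(next_hop("128.84.1.1"))   # T3 ENSS (explicit)
    print(next_hop("10.1.2.3"))     # T1 NSS (default)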

	An analysis of the traffic associated with each Autonomous System has
been conducted.  Migration of traffic will be implemented by choosing AS pairs
that account for the largest inter-AS traffic flows.  These will be moved over
together in a pairwise fashion as part of a scheduled routing configuration
update.  We are working with some regionals now to schedule this.
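
	A minimal sketch of that selection, using a made-up inter-AS traffic
matrix (the AS numbers and counts below are illustrative only):

    # Rank AS pairs by inter-AS traffic volume to pick migration candidates.
    flows = {
        (100, 200): 41_000_000,   # packets exchanged between AS100 and AS200
        (100, 300): 18_500_000,
        (200, 300):  7_200_000,
    }

    # Largest flows first; the top pair(s) are moved in a scheduled
    # routing configuration update.
    for (a, b), packets in sorted(flows.items(), key=lambda kv: -kv[1]):
        print(f"AS{a} <-> AS{b}: {packets:,} packets")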

	We will proceed slowly with this migration; at first, no more than
one pair of AS's will be moved over in a single week.  We are working to
coordinate this with the regionals and we hope to have a significant portion
of the T1 traffic cut over to T3 by the end of February.  Some traffic will
likely remain on the T1 backbone for several reasons.  Since the T3 nodes do
not yet support the OSI CLNP protocol, that traffic will remain on the T1
backbone.  There are also some international networks that do not directly
peer with the T3 network and will announce themselves only to the T1
backbone.


Phase III T3 Network RS/960 T3 Adapter Upgrade
==============================================

	A phased implementation plan for the new RS/960 T3 adapters is being
developed, and testing will begin on the T3 Research Network in mid-January.
The testing phase will take over a month and exercise many features and test
cases.  Redundant backup facilities to be used during the phased upgrade will
be supported and tested on the research network.  Performance and stress
testing will also be conducted.  Test outages and adapter swaps will take
place to simulate expected maintenance scenarios.

	The RS/960 T3 adapters do not interoperate across a DS3 serial link
with the existing T3 adapters, and so the phased upgrade must be administered
on a link-by-link rather than node-by-node basis.  Deployment will be
coordinated with the peer networks to ensure that adequate advance notice and
backup planning is afforded.  The deployment could begin in late February
depending upon the test results from the T3 Research network.











