June Backbone Engineering Report

mak mak
Wed Jul 15 04:18:58 UTC 1992


Hi. This appeared in the Internet Monthly Report just sent around,
but since some have indicated that they would also like to see a
separate posting to the regional-techs list, here it is....
	Mark





                ANSNET/NSFNET Backbone Engineering Report

                                 June 1992


                Jordan Becker, ANS      Mark Knopper, Merit
                becker@ans.net          mak@merit.edu



T3 Backbone Status
==================

	The T3 Backbone continued to run very reliably during June.
With the completion of the RS/960 DS3 interface upgrade in May, the
cutover of additional traffic from the T1 to the T3 network resumed in
June and is proceeding as quickly as possible.  The number of networks
configured and announced to the T3 network continues to increase.
Midlevel traffic cut over from the T1 to the T3 backbone included
NorthWestNet, Sprint/International Connections Manager, and Alternet.
The T3 backbone is now carrying nearly double the packet load of the
T1 backbone.

	With the upgrade complete and the T3 network stable, several
performance and functional enhancements were deployed during June,
including improvements to the routing daemon and the SNMP daemon.
The remaining problems on the T3 network are FDDI adapter performance
and stability.  Due to the complexity of the T3 adapter upgrade, we
chose to defer the FDDI upgrade until August to ensure operational
stability.


Statistics on network traffic and configured networks
=====================================================

	The total inbound packet count for the T3 network was
10,736,059,912, up 29% from April.  220,593,003 of these packets
entered from the T1
network.  The total inbound packet count for the T1 network was
5,761,976,518, down 16.7% from May.  536,009,585 of these packets entered
from the T3 network.  The combined total inbound packet count for the T1
and T3 networks (less cross network traffic) was 15,741,433,842,
up 0.9% from April.
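
	For reference, the combined figure is simply the two inbound
totals with the cross-network packets subtracted once each; a quick
check of the arithmetic (in Python, purely for illustration):

    t3_inbound = 10_736_059_912   # total packets into the T3 network
    t1_inbound = 5_761_976_518    # total packets into the T1 network
    from_t1    = 220_593_003      # T3 inbound packets that came from the T1
    from_t3    = 536_009_585      # T1 inbound packets that came from the T3

    combined = t3_inbound + t1_inbound - from_t1 - from_t3
    print(combined)   # 15741433842, matching the total reported above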

	Currently there are 5801 IP networks configured in the
policy routing database for the T1 network, and 3966 for the T3
network.  The number of networks actually announced to the backbone
varies; it is currently 2750 for T3 and 4425 for T1.



NOC Problem Reports
===================

	The number of problem reports that result in NOC trouble
tickets (totaled across all priority classes) remains steady at 10-20
per week for the T3 network and 15-20 per week for the T1 network.



T1 Backbone Status
==================

	The T1 backbone's reliability is not as good as the T3
backbone's, due largely to increased route processing on the RCP
nodes.  These machines still carry the full load of routes, and they
are experiencing some congestion and performance problems.
Improvements have been made to the routing software to accommodate
protocol upgrades (i.e., BGP-2).



T3 Routing Daemon Software Status
=================================

	Activities related to the rcp_routed software in June
emphasized correcting software problems involving routing instability,
and monitoring and correcting routing table integrity problems.  Many
bug fixes have been applied to the routing daemon over the last three
months.

	Monitoring of routing integrity consists of collecting the
full netstat table to find route flapping problems within the
backbone and within peer networks, BGP disconnect problems, and
external network metric problems.  Additional work is underway to
collect full routing tables from the backbone nodes and process them
with a relational database system.  This system generates reports on
the statistical use of primary routes, the reliability of network
announcements to the backbone, and long-term statistics on
inter-domain routing announcements and growth.
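
	As a purely illustrative sketch (this is not the actual
Merit/ANS tooling), one way to spot route flapping from such
collections is to compare successive routing-table snapshots and
count how often each destination's next hop changes; the snapshot
format below is hypothetical:

    from collections import defaultdict

    def count_flaps(snapshots):
        """snapshots: a list of dicts, one per collection interval,
        mapping destination network -> next hop (hypothetical format)."""
        flaps = defaultdict(int)
        previous = {}
        for table in snapshots:
            for net, next_hop in table.items():
                # A changed next hop between snapshots counts as one flap.
                if net in previous and previous[net] != next_hop:
                    flaps[net] += 1
            previous = table
        return flaps   # networks with high counts merit a closer look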

	A number of improvements and bug fixes have been made to the
T3 routing software over the last two months.  Highlights included:

  - a fix to allow an ENSS that is isolated from the backbone to stop
    announcing default to peers
  - better handling of router adapter failures
  - prevention of overruns of external BGP messages sent to external
    peer routers
  - graceful dropping of bogus external routes to backbone ENSS nodes
  - a correct response to the external metric selection problem for
    nets announced at the same metric from multiple peers
  - a fix for the interaction between BGP and EGP for peers in the
    same Autonomous System
  - route table hashing efficiency improvements
  - installation of two routes with the same AS path to allow backup
  - an increase of the BGP-2 PDU size from 1024 to 4096 bytes
  - preference of the BGP route when a BGP and an EGP route carry the
    same metric (a sketch of this rule follows the list)
  - better handling of a next hop behind a peer router on a shared
    network
  - a BGP update packet format fix
  - a fix to BGP 1-2 version negotiation
  - elimination of BGP disconnects during IGP transitions
  - elimination of BGP disconnects when a peer router is too busy
  - better response to route instabilities upon failure of a T1
    interconnect or an ENSS
  - automatic restart of the routing daemon in the event of a crash
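
	The BGP-versus-EGP preference above amounts to a protocol
tie-break that is applied only when the metrics are equal.  A minimal
sketch of that rule in Python (purely illustrative; this is not the
rcp_routed code, and the (protocol, metric) route representation is
hypothetical):

    # Lower rank wins a metric tie: prefer BGP over EGP.
    PROTOCOL_RANK = {"BGP": 0, "EGP": 1}

    def preferred(route_a, route_b):
        """Each route is a (protocol, metric) pair; return the one to install."""
        if route_a[1] != route_b[1]:
            # Different metrics: the lower metric wins regardless of protocol.
            return min(route_a, route_b, key=lambda r: r[1])
        # Equal metrics: fall back to the protocol preference.
        return min(route_a, route_b, key=lambda r: PROTOCOL_RANK[r[0]])

    print(preferred(("EGP", 3), ("BGP", 3)))   # ('BGP', 3)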

	As a result of the monitoring and analysis effort, along with
the software changes themselves, reliability and route integrity have
improved dramatically on the T3 network over the last month.


RS/960 DS3 On-Card Memory Problem
=================================

	A batch of bad memory chips has been found to cause memory
parity errors on a few interfaces.  Five of these cards have been
replaced as the problems were identified.  Diagnostic microcode has
been developed to detect the problems in advance, and nodes are being
scheduled to have the diagnostics run over the next few weeks during
the scheduled routing configuration update windows.



DSU Synchronization and CRC/Alignment Problem
=============================================

	A problem that causes logical link failures has been traced to
a clock synchronization problem on the T3 Technologies DSUs during
clock master/slave transitions.  This problem occurs very infrequently
and has been reproduced using a newly installed circuit on the T3
research network. Enhanced instrumentation has been added to detect
this problem, and work is in progress to correct it.



End-To-End Packet Loss Analysis
===============================

	Researchers at the University of Maryland recently conducted
some experiments and noticed periodic and random packet loss and
packet duplicates when using the T3 network.  Two of the problems were
traced to a bridge device and to an Ethernet problem on the SURAnet
Ethernet.  Peer router problems causing some packet loss during
routing updates at NEARnet were identified and are being corrected.
Some packet loss on the T3 ENSS FDDI interface at Stanford was also
identified; this is due to an FDDI card output buffering problem and
might be addressed prior to the FDDI upgrade in August.


FDDI Adapter Upgrade
====================

	Although the T3 adapters have been upgraded from older
technology to the new RS960 adapter technology, the FDDI adapters in
the ENSS nodes have not yet been upgraded.  The older FDDI adapters
continue to suffer from performance and reliability problems.

	The new RS960 FDDI adapter is scheduled to be installed as
part of a field trial on July 20th.  Following this field trial, we
expect to upgrade the older FDDI interfaces with the new RS960
interface adapters in early August. There are currently five T3
ENSS sites that are using FDDI interfaces in production.



SNMP Daemon Changes
===================

	A new version of the SNMP daemon for the T3 network was
installed on June 26. This version supports MIB-II variables for the
T/960 ethernet cards (ifInUcastPkts, ifOutUcastPkts, and ifInErrors),
and also includes enhanced configuration support for monitoring T3
DSUs.
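
	For illustration, these counters live in the standard MIB-II
interfaces table: ifInUcastPkts is .1.3.6.1.2.1.2.2.1.11, ifInErrors
is .1.3.6.1.2.1.2.2.1.14, and ifOutUcastPkts is .1.3.6.1.2.1.2.2.1.17,
each indexed by interface.  Below is a small Python sketch of polling
one of them for a packet rate; snmp_get() is a hypothetical stand-in
for whatever SNMP query routine is available:

    import time

    IF_IN_UCAST  = ".1.3.6.1.2.1.2.2.1.11"   # MIB-II ifInUcastPkts
    IF_IN_ERRORS = ".1.3.6.1.2.1.2.2.1.14"   # MIB-II ifInErrors
    IF_OUT_UCAST = ".1.3.6.1.2.1.2.2.1.17"   # MIB-II ifOutUcastPkts

    def snmp_get(host, oid):
        """Hypothetical stand-in for an SNMP GET returning one integer."""
        raise NotImplementedError("replace with a real SNMP query routine")

    def packet_rate(host, if_index, interval=60):
        """Poll ifInUcastPkts twice and return packets per second."""
        oid = "%s.%d" % (IF_IN_UCAST, if_index)
        first = snmp_get(host, oid)
        time.sleep(interval)
        second = snmp_get(host, oid)
        return (second - first) / float(interval)
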
	A new SNMP client has been implemented for the NOC to control
the T1 Cylink ACSUs that are part of the T3 backbone.  This avoids the
use of a separate dial-in connection to these CSUs.

	New SNMP variables have been added to further monitor the DSU
synchronization problem mentioned above.







