Phase III RS-960 Deployment Status Update

mak mak
Wed Apr 22 05:28:02 UTC 1992



		Phase-III Deployment Status Update
		==================================
	   Jordan Becker, ANS		Mark Knopper, Merit

	The phase-III deployment schedule, involving the replacement of
"Hawthorne" T3 interface adapters and DSU cards in the RS/6000 T3 routers, is
progressing on schedule.  We have completed our system testing and are
conducting ongoing regression tests on the test network (recent testing and
upgrade activities are described below).  Barring any unforeseen problems, the
hardware upgrades will begin on Friday 4/24 at 23:00 local time at the
Seattle CNSSs, Denver CNSSs, Seattle ENSS143, Boulder ENSS141, and Salt Lake
ENSS142.
	
	In preparation for the upgrade, we have been making software
changes to the production T3 system in advance of the hardware upgrades
that are scheduled to begin on 4/24.  These changes include a new system
software build that supports the RS960 drivers, kernel, and microcode.
We are also installing a new SNMP program that supports the new c-bit
parity DSU function, and several bug fixes for existing FDDI and T960
ethernet freeze problems.  The proposed maintenance window for these
network software upgrades is as follows:

1.	4/21 at 12-1am EDT: Install build 2.78.22 on Seattle
	CNSSs 88, 89, and 91; Denver CNSSs 96, 97, and 99; University
	of Washington ENSS 143; Boulder ENSS 141; and Salt Lake ENSS 142.
	(done)

2.	4/22 at 6am EDT: Install build 2.78.22 on San Francisco
	CNSSs 8, 9, and 11; Los Angeles CNSSs 16, 17, and 19;
	Palo Alto ENSS 128; NASA Ames/FIX-W ENSS 144; and San
	Diego ENSS 135.


	With the cutover of FIX-W, Westnet-E, CICnet, NCAR, and Midnet
traffic from T1 to T3 in the last two weeks, we will halt any
additional regional traffic cutovers until after the phase-III
upgrades have stabilized.

	We are in the process of contacting each regional individually to
discuss the upgrade and what each regional can do to help ensure that it goes
as smoothly as possible.  We will be requesting coordination with regionals on
the use of the T1 backbone as backup to the T3 system at different points in
the upgrade schedule.  This may require employing emergency procedures,
such as disabling AS peer sessions to the T3 backbone at some regionals,
if we run into trouble (not that we anticipate this, since we are
confident that testing has been very thorough).

	During the upgrade schedule, there will be a growing router cloud that
supports the RS960 T3 technology and a shrinking router cloud that supports
the Hawthorne T3 technology.  We have designed the interface points between
these clouds to minimize traffic flow across them at any given time,
since the T3 system is carrying considerable traffic and these "hybrid links"
represent a potential performance bottleneck.

	An internal T3 routing link metric assignment plan is being
generated to minimize Hawthorne-cloud to RS960-cloud traffic (by
assigning an increased metric to selected T3 hybrid links).  We are
performing an AS-to-AS regional traffic simulation to design these
internal link metrics in a way that balances traffic flow across the
T3 system and avoids overloading the hybrid links.  We have already
done some initial experimentation with different link metrics and
validated the results.  These link metrics will be changed manually by
our engineers at different points in the deployment process to spread
the load across the hybrid links and thus minimize the load on any single
hybrid machine.  An appendix to this message contains more information
on the metric adjustments and how they were determined and verified.
This work is credited to Elise Gerich of the Merit IE group.
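
	For illustration only, the short sketch below (not one of our tools)
shows the basic idea behind the metric adjustments: with shortest-path routing
on link metrics, raising the metric on a hybrid link moves the lowest-cost
route onto an alternate path.  The node names and metric values here are
hypothetical, not taken from the production topology.

# Illustrative only: how raising a hybrid-link metric shifts the
# lowest-cost path onto an alternate route.  The topology is made up.
import heapq

def shortest_path(links, src, dst):
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nbr, metric in graph[node]:
            if nbr not in seen:
                heapq.heappush(queue, (cost + metric, nbr, path + [nbr]))

# All metrics 1: the two-hop route through the hybrid node H wins.
before = [("A", "H", 1), ("H", "B", 1), ("A", "C", 1), ("C", "D", 1), ("D", "B", 1)]
# Hybrid link H-B raised to 3: the three-hop all-RS960 route wins.
after  = [("A", "H", 1), ("H", "B", 3), ("A", "C", 1), ("C", "D", 1), ("D", "B", 1)]

print(shortest_path(before, "A", "B"))   # (2, ['A', 'H', 'B'])
print(shortest_path(after,  "A", "B"))   # (3, ['A', 'C', 'D', 'B'])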

	We would like to develop a procedure and plan with each regional for
cutover to the T1 backbone during scheduled periods, and during possible
unscheduled periods in the event that a hybrid link should get too close to
saturation.  We would like to coordinate a plan in which our engineers maintain
a pre-designated list of T3 ENSS nodes in each phase of the deployment that get
configured down (off-line), in conjunction with the regional switching its
routing over to divert traffic to the T1 backbone.


Phase-III Technology Testing Status
===================================
	Since our last update on phase-III deployment testing, we have closed
several problems that were identified.  We will continue to regression test
up to and during the deployment, and we will maintain a scaled-down replica of
the T3 system configuration on the T3 research network during that time.  This
will allow us to continue testing and to support any tests needed for problems
that may be identified during the deployment.  Since the last report, we have
been focusing our test efforts in the following areas:

1.	Regression tests for bug fixes in the current T3 system.  We have
fixed several bugs in the current production T3 network FDDI driver, T960
ethernet/T1 interface driver and associated microcode.  These fixes have been
merged into the production RS960 system software build scheduled for
deployment this week and have been regression tested.

2.	Stress testing of the new build.  We have corrected an earlier problem
we were experiencing with a test tool that copies all production ethernet
traffic, with a new destination address, onto the T3 Research network to
simulate real production network traffic flows.  This so-called "copy tool" is
working on the test network and has been a major component of our stress
testing.
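
	As a rough modern analogue only (not the actual copy tool), the sketch
below shows the general technique: duplicate each frame seen on a production
segment, rewrite its destination address, and replay it onto the test network.
The interface names and destination MAC address are placeholders, and the
scapy library stands in for whatever the real tool uses.

# Rough analogue of the "copy tool" idea, not the actual tool.  The
# interface names and destination MAC below are placeholders.
from scapy.all import Ether, sniff, sendp

PROD_IFACE = "eth0"               # segment carrying production traffic
TEST_IFACE = "eth1"               # segment attached to the test network
TEST_DST   = "02:00:00:00:00:01"  # test-network router (made-up address)

def copy_frame(pkt):
    if Ether in pkt:
        dup = pkt.copy()
        dup[Ether].dst = TEST_DST          # redirect only the copy
        sendp(dup, iface=TEST_IFACE, verbose=False)

# Replay every frame seen on the production segment onto the test network.
sniff(iface=PROD_IFACE, prn=copy_frame, store=False)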

3.	Full RS960 (4-interface) configuration testing.  We have focused much
of our testing over the last couple of months on the mixed-technology
configurations (e.g. Hawthorne T3 adapters and RS960 T3 adapters co-resident
in the same router and network), since we were most concerned about the actual
deployment process.  We have most recently tested the full RS960 (4
interfaces in a CNSS) configuration, which is of less concern to us.

4.	FDDI peering tests.  We have tested ENSS configurations that include
an FDDI interface and an RS960 T3 interface and that peer with two
FDDI-equipped Cisco routers.

5.	Noise injection testing on T3 lines.  We have tested noise injection
on the T3 test network circuits with the aid of a test tool provided with the
T3Plus Inc. BMX45 bandwidth manager.  The results of these tests revealed
a T3 DSU problem involving SNMP sub-agent signalling.  We have
addressed this problem for now with a change to the SNMP sub-agent.  A PROM
change to the DSU will be installed at a later date to avoid the additional
SNMP sub-agent overhead required to address this problem.  Noise testing
on the research network is ongoing, in an attempt to find additional
problems during stress and other traffic load tests.

6.	T3 routing software testing.  One of the tests that we have always
wanted to conduct on the T3 Research network is a simulation of the 60+
internal routing peers that generate IS-IS LSPs, which most accurately mimics
the production T3 network.  Previously we never had enough nodes on the T3
Research network to generate more than 10 of these internal IBGP peer
sessions.  We are now testing with a modified version of our routing software
that generates multiple IBGP internal sessions, so that we can simulate
internal IS-IS routing exchanges with several more internal peers.  This will
improve our routing software test environment now and in the future.
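
	For illustration only (this is not our routing software), the sketch
below shows the general idea of driving many simulated internal peer sessions
at a router under test: each simulated peer opens a TCP session to port 179
and sends a BGP OPEN.  The target address, AS number, and peer identifiers are
placeholders, and a real test harness would also have to exchange KEEPALIVE
and UPDATE messages to keep the sessions up.

# Illustration only, not our routing software: open many simulated
# peer sessions toward a router under test.  Addresses are placeholders.
import socket, struct

def bgp_open(my_as, router_id, hold_time=180):
    # BGP OPEN: 16-byte all-ones marker, 2-byte length, type 1, then
    # version, AS number, hold time, BGP identifier, no options.
    body = struct.pack("!BHH4sB", 4, my_as, hold_time,
                       socket.inet_aton(router_id), 0)
    return b"\xff" * 16 + struct.pack("!HB", 19 + len(body), 1) + body

def open_peer(target, my_as, router_id):
    s = socket.create_connection((target, 179))
    s.sendall(bgp_open(my_as, router_id))
    return s

TARGET = "192.0.2.1"          # router under test (placeholder address)
sessions = [open_peer(TARGET, 690, "10.0.0.%d" % i) for i in range(1, 61)]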

7.	T3 routing software enhancements.  There are several new enhancements
to the T3 routing software that we have tested with the RS960 technology,
including an auto-restart function in the event of a daemon failure, BGP
version negotiation support, and correct support for the interpretation of
inter-AS external metrics.  These software changes are independent of the
phase-III deployment and will be scheduled for deployment during a
maintenance window that does not interfere with the phase-III deployment
activities.


Other Activities Related to RS960 Deployment
--------------------------------------------

	NOC training - The NOC requires well-documented debugging procedures
that operators can understand and execute without problems.  This includes new
man pages from engineers on the RS960 utilities and commands (e.g. ifstat,
ccstat, etc.).  A disaster recovery procedures checklist has been drafted
that includes information on how to recover from problems during and
following the deployment.




Appendix: Metric Adjustments on the T3 Network
	  by Elise Gerich, Merit/NSFNET Internet Engineering



(To fully appreciate the details here it would be appropriate
to refer to the map in the file pub/maps/t3topo.ps on merit.edu.)

With the advent of the RS960 card rollout, there has been some
concern about the performance limits of nodes which have
up to three RS960 T3 cards and at least one Hawthorne T3 card.

One of the strategies proposed to avoid congestion on the Hawthorne-
to-system-to-RS960 path is to bypass links which terminate in this
configuration by tuning the link metrics.  Since there are multiple
paths between any pair of nodes, it is important to evaluate the
effect that changing link metrics has on the network as a whole.

Since the RS960 card deployment is staged in five phases, we
approached the problem by identifying, in each phase of the
deployment, the interfaces where packets would have to traverse
both flavors of card and the system.

The next step was to identify, within each phase, interfaces which
may be underutilized, so that we could potentially adjust the
metrics to divert traffic from the more heavily loaded interfaces
to the more lightly loaded ones.

In phase one, both interfaces are lightly loaded, and we felt there
was no danger of either interface approaching the packet rate at which
the performance bottleneck appears.  Therefore, the recommendation
is to leave the link metrics unchanged for phase one.

In phase two, it is obvious that the majority of the traffic to and
from the west coast traverses two interfaces: CNSS8 to 16, and CNSS16 to
64.  The objective was to reduce traffic through these two interfaces
and to increase traffic through the interface from CNSS96 to 80.  In
evaluating the multiple paths from the east coast to the west coast
(primarily between Hartford, San Francisco, and Los Angeles), it appeared
that we could achieve our objective by raising by two the link metrics
on two links: 8 to 24 and 16 to 64.

In order to test our hypothesis, we raised the link metrics
on the links 8-24 and 16-64 from one to three on April 13,
1992 at approximately 16:15 GMT.  It was immediately noticeable
that traffic was diverted from 8-24 and 16-64 to 80-96.  We
left these metrics in place for approximately 20 hours.  Using
the xdview4 program that Bill Norton has written, we graphed
the packets per second in and out of the various interfaces on
the network.  Comparisons of the sample period with
the pre- and post-change data samples indicate that we accomplished
our goal of reducing traffic through 8 and 16 and of
increasing traffic through 96.
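
For readers without access to xdview4, the sketch below shows the kind of
per-interface measurement behind such a comparison: sample an interface's
input-packet counter twice and divide by the interval.  The hostname,
community string, and interface index are placeholders, and the snmpget
utility here stands in for whatever collection method xdview4 actually uses.

# Rough sketch only; xdview4 itself is Bill Norton's tool, not this.
# Hostname, community string, and interface index are placeholders.
import subprocess, time

HOST      = "cnss8.t3.ans.net"   # hypothetical node name
COMMUNITY = "public"
IFINDEX   = 2

def in_packets():
    out = subprocess.check_output(
        ["snmpget", "-v1", "-c", COMMUNITY, "-Ovq", HOST,
         "IF-MIB::ifInUcastPkts.%d" % IFINDEX])
    return int(out.decode().strip())

first = in_packets()
time.sleep(60)                   # sample a minute apart
second = in_packets()
print("average packets per second in:", (second - first) / 60.0)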

In phase three, there will be one purely RS960 path from coast
to coast.  The objective for this phase was to eliminate the
multiple lowest-cost paths between Hartford and the west coast
nodes of San Francisco, Los Angeles, Seattle, and Denver, and
to make the RS960 path (48-32-40-24-8) the lowest-cost path
(according to link metrics) from coast to coast.

After evaluating the multiple paths between Hartford-San Francisco,
Hartford-Los Angeles, Hartford-Seattle, and Hartford-Denver,
we thought that we could adjust the metrics across the network
so that the lowest-cost path was the purely RS960 path.  This
would require changing the metrics on six links: all link metrics on
the network would be one, except for 8-24, which would be 3; 16-64,
which would be 3; 96-80, which would be 2; 24-80, which would be 2;
32-56, which would be 2; and 48-72, which would be 4.  Since this
change would force all the east-west traffic onto five nodes, we were
hesitant to test the hypothesis on the real network.  However, since
April 19 was a holiday weekend and the traffic load on the network was
very light, we chose to adjust the metrics for a period of
five and one half hours.
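
For illustration only, below is a sketch of the kind of check that could catch
an equal-cost alternate (such as the Seattle-Denver case noted below) before a
metric change is applied: enumerate every lowest-cost path between a node pair
under a proposed metric table.  The link list here is a deliberately tiny,
made-up fragment; a real check would use the full topology from
pub/maps/t3topo.ps.

# Illustration only: enumerate all lowest-cost paths between two nodes
# under a proposed metric table, so equal-cost ties stand out.  The
# link list below is a made-up fragment, not the real topology.
import heapq
from itertools import count

def all_lowest_cost_paths(links, src, dst):
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    best, results, tie = None, [], count()
    queue = [(0, next(tie), src, [src])]
    while queue:
        cost, _, node, path = heapq.heappop(queue)
        if best is not None and cost > best:
            break
        if node == dst:
            best = cost
            results.append(path)
            continue
        for nbr, metric in graph[node]:
            if nbr not in path:                  # avoid loops
                heapq.heappush(queue, (cost + metric, next(tie),
                                       nbr, path + [nbr]))
    return best, results

# Two routes of equal cost between W and Z show up as a tie.
links = [("W", "X", 1), ("X", "Z", 1), ("W", "Y", 1), ("Y", "Z", 1)]
print(all_lowest_cost_paths(links, "W", "Z"))
# (2, [['W', 'X', 'Z'], ['W', 'Y', 'Z']])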

The results of this test were as expected, with one exception.  I had
made a mathematical error when calculating the paths to Denver
from Seattle, so the lowest-cost path between that pair did not
traverse the RS960 path.  There was a noticeable reduction of
traffic through the following interfaces: 16 to 64, 72 to 48, and 48 to 72.
Even 80 to 96 saw less traffic than when the metrics were configured
as in phase two.  We also observed what appears to be an increase in
traffic through the interface from 32 to 56.  That might be
attributed to the fact that the lowest-cost path between Washington DC
and Hartford was then via New York, not Greensboro; however,
this may be offset by having two equal-cost paths between
Hartford and Greensboro.

We have not yet completed our evaluation for phase four.  The
recommendation for phase one is that we leave the link
metrics as they currently are: every T3 link is one.  The
conclusion that we draw for phase two from these experiments is that we
can reduce the load on two of the hybrid boxes by diverting
traffic to the third, lightly utilized hybrid.  We recommend
that, prior to that phase, the link metrics be increased
from one to three on two links: 8-24 and 16-64.  As for phase
three, significant gains can be made by diverting the majority
of traffic to the RS960 path, but we recommend that we investigate
further to see if we can eliminate the RS960-Hawthorne-RS960
path between Hartford and Denver.  When we conclude our evaluation
of phase four, we will publish those results as well.










