RS/960 deployment on the T3 Backbone - Week 3

mak mak
Fri May 15 02:47:45 UTC 1992


	Phase-III T3 Network Deployment - Step 3 Status Report
	======================================================
	Jordan Becker, ANS		Mark Knopper, Merit

	Step 3 of the phase-III network deployment was successfully completed
last Saturday, 5/9.  This was the most extensive step planned as part of the
phase-III deployment.  It was completed with only a few minor problems, which
caused us to extend our scheduled maintenance window by 1 hour and 55
minutes.  The deployment was completed at 11:55 EST on 5/9.  The following T3
backbone nodes are currently running with new T3 RS960 hardware and software
in a stable configuration:

Seattle POP:		CNSS88, CNSS89, CNSS91
Denver POP:		CNSS96, CNSS97, CNSS99
San Fran. POP:		CNSS8,  CNSS9,  CNSS11
L.A. POP:		CNSS16, CNSS17, CNSS19
Chicago POP:		CNSS24, CNSS25, CNSS27
Cleveland POP:		CNSS40, CNSS41, CNSS43
New York City POP:	CNSS32, CNSS33, CNSS35
Hartford POP:		CNSS48, CNSS49, CNSS51

Regionals:	ENSS141 (Boulder), ENSS142 (Salt Lake), ENSS143 (U. Washington)
		ENSS128 (Palo Alto), ENSS144 (FIX-W), ENSS135 (San Diego)
		ENSS130 (Argonne), ENSS131 (Ann Arbor),
		ENSS132 (Pittsburgh), ENSS133 (Ithaca), ENSS134 (Boston)
		ENSS137 (Princeton)

CNSS16, CNSS96, CNSS24, CNSS32, CNSS48 are now running with mixed technology
(i.e. 3xRS960 T3 interfaces, 1xHawthorne T3 interface).


Step 3 Deployment Difficulties
==============================
	During the step 3 deployment, a suspected bad RS/960 card was
removed from the Ann Arbor ENSS131 node.  This took about 1 hour of
troubleshooting time.

	The T1-C1 (C35) crashed and would not reboot.  This turned out to be a
loose connector to the SCSI controller.  This was probably due to the physical
move of all CNSS equipment within the NYU POP.

	There was a problem getting the link to Argonne E130 back online.
This turned out to be a misconnected cable on the DSU.
	
	There was also an early Saturday morning problem with the T3-B machine
at the Cleveland POP.  The machine would come up and then crash within
minutes.  A number of different troubleshooting procedures were attempted, and
the final solution was to replace the RS960 card which interconnects the T3-B
machine to the T3-C1 (i.e. C40->C41).  There was a metal shaving on the HSSI
connector; however, even after it was removed, the card still did not work.
This was the last machine to be upgraded, and it took us a few hours to fix
(9am-11am).

	Finally, during the troubleshooting of the Cleveland CNSS40 node, we
found that the RS960 card in the Chicago POP CNSS27 node had frozen (probably
a mis-seated or broken card).  It had been running fine for an estimated 5
hours.  The card was replaced.

	Ten hours after the upgrade, everything seemed to be normal, with one
exception: there was an abnormal packet loss rate between CNSS8<->CNSS9, as
measured via an on-card RS960 utility program.  The RS960 card in the CNSS8
node was replaced Sunday at 6:00am PST.

	During the deployment activities the "rover" monitoring
software was running at both Merit (Ann Arbor) and ANS (Elmsford),
with a backup monitor running on an unused RS/6000 CNSS at the Denver
POP. An effort was made to keep either Ann Arbor or Elmsford up
and connected to the T3 backbone at all times during the night, so
that the backup monitor would not be needed. The NOC was able
to successfully monitor the T1 and T3 backbones throughout the
deployment timeframe.

	In addition to all the normal deployment activities, we were able to
swap the T960 ethernet card at Pittsburgh.

	Following the weekend deployment, we identified an RS960 card on
CNSS25 that was recording some DMA underrun errors.  This was not affecting
user traffic, but the card was scheduled to be swapped out as a precaution.
Since we are revisiting the Chicago POP site during the upcoming step 4
deployment this weekend, we will replace the card then.

	Taking into consideration that we upgraded 4 POPs at the same time, we
feel this deployment went rather well.  Four RS960 cards were shipped back to
IBM for failure analysis.  Preliminary analysis in the lab did not result in
any reproducible failures.

	We now have a complete cross-country RS960 path established.  Modified
T3 link metrics are in place to balance traffic across the 5 existing hybrid
links.  We will be watching very closely over the next two weeks for any
circuit or equipment problems that break the cross-country RS960 path, since
this could cause congestion on one or more of the hybrid links.
	

Step 4 Deployment Scheduled for 5/15
====================================
	Based upon the successful completion of step 3 of the deployment,
step 4 is currently scheduled to commence at 23:00 local time on 5/15.  Step 4
will involve the following nodes/locations:

St. Louis POP:		CNSS80, CNSS81, CNSS83
Houston POP:		CNSS64, CNSS65, CNSS67
Second Visit:		CNSS96 (Denver), CNSS24 (Chicago), CNSS16 (L.A.)
Regionals:		ENSS130 (Argonne), ENSS140 (Lincoln), ENSS139 (Rice)

Other ENSS's Affected:	ENSS165, ENSS157, ENSS174, ENSS173

	During this deployment the Houston ENSS139 node will be isolated from
the backbone.  Therefore the Houston T1/T3 interconnect will be switched over
to San Diego ENSS135 prior to the deployment.  The Ann Arbor ENSS131
interconnect gateway is expected to remain operational throughout the
deployment.

	Following the step 4 deployment, selected T3 internal link metrics
will be re-adjusted to support load balancing of traffic across the 3
different hybrid technology links that will exist.  These link metrics were
chosen by calculating the traffic distribution on each link based upon an
AS<->AS traffic matrix, as sketched below.
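
	As an illustration only (not the actual tool or data used), the sketch
below shows the kind of calculation involved: for a candidate set of link
metrics, each AS<->AS demand (already mapped onto its backbone entry and exit
nodes) is routed along its shortest path and summed onto the links it
traverses, yielding the per-link load those metrics would produce.  All node
names, metrics, and traffic figures here are hypothetical.

# Hedged sketch: estimate per-link load for a candidate set of link metrics.
# Node names, metric values, and demand figures are hypothetical examples.
import heapq
from collections import defaultdict

def shortest_path(metrics, src, dst):
    """Dijkstra over a dict {(a, b): metric} of directed links."""
    adj = defaultdict(list)
    for (a, b), w in metrics.items():
        adj[a].append((b, w))
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry
        for nxt, w in adj[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt], prev[nxt] = nd, node
                heapq.heappush(heap, (nd, nxt))
    path, node = [dst], dst               # rebuild dst -> src, then reverse
    while node != src:
        node = prev[node]
        path.append(node)
    return list(reversed(path))

def link_loads(metrics, demands):
    """Sum each src->dst demand onto the links of its shortest path."""
    loads = defaultdict(float)
    for (src, dst), mbps in demands.items():
        path = shortest_path(metrics, src, dst)
        for a, b in zip(path, path[1:]):
            loads[(a, b)] += mbps
    return loads

# Hypothetical candidate metrics for a few links (installed in both directions).
metrics = {}
for a, b, m in [("CNSS8", "CNSS16", 10), ("CNSS8", "CNSS24", 30),
                ("CNSS16", "CNSS24", 10), ("CNSS24", "CNSS40", 10)]:
    metrics[(a, b)] = m
    metrics[(b, a)] = m

# Hypothetical AS<->AS demands, aggregated to backbone entry/exit nodes (Mbps).
demands = {("CNSS8", "CNSS40"): 4.0, ("CNSS16", "CNSS40"): 2.5}

for (a, b), mbps in sorted(link_loads(metrics, demands).items()):
    print(f"{a:>7} -> {b:<7} {mbps:5.1f} Mbps")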







