NETCOM downtime a programming error. (fwd)

Michael Dillon michael at memra.com
Fri Jun 28 23:25:10 UTC 1996


---------- Forwarded message ----------
Date: Fri, 28 Jun 1996 17:42:30 -0400 (EDT)
From: Brian Tao <taob at io.org>
Reply-To: inet-access at earth.com
To: inet-access at earth.com
Subject: NETCOM downtime a programming error. (fwd)
Resent-Date: Fri, 28 Jun 1996 16:45:28 -0500 (CDT)
Resent-From: inet-access at earth.com

    Not sure where this first showed up, but it sounds like some sort
of trade publication...

---------- Forwarded message ----------

IA: Take us through what happened...

GARRISON: Think about the network in three layers. The first layer is
Network Access Points, where we have peer agreements with other
carriers. It's the entry point to the Internet. (There are about a half
dozen across the U.S.) The next level down are hubs, which are our
internal virtual private network routing hubs that look at traffic and
direct it along the speediest line available. The third level down is
where the customer actually logs on, at an access POP (Point Of
Presence). At each of those levels there are routers made by Cisco and
others that have instruction tables on them -- "IF THEN" statements
that tell the traffic where to go, what route to take to get to its
destination.
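
[Illustration, not part of the interview: the "IF THEN" tables described
above behave roughly like a prefix lookup, where the most specific
matching rule decides the next hop. The sketch below is a simplified
guess at that idea. The prefixes and next-hop names are invented, and
real routers do far more than this.]

```python
import ipaddress

# Hypothetical routing table: (destination prefix, next hop).
# All prefixes and names here are made up for illustration.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "hub-default"),
    (ipaddress.ip_network("10.0.0.0/8"), "hub-west"),
    (ipaddress.ip_network("10.1.0.0/16"), "pop-sanjose"),
]


def next_hop(destination):
    """Return the next hop of the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    # "IF the destination falls inside this prefix THEN it is a candidate."
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # The longest prefix (most specific route) wins; the default route
    # guarantees there is always at least one candidate.
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop


if __name__ == "__main__":
    for dest in ("10.1.2.3", "10.2.0.1", "192.0.2.1"):
        print(dest, "->", next_hop(dest))
```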

At the network access level, you have some pretty complex code that
says, 'If the traffic comes from this party, then do the following
thing with it.' And because of the number of new access providers or
changes in the access providers, there are daily changes made at the
network access layer. And these are changes that are made in software
to the routers. It's done in a language called BGP, or Border Gateway
Protocol. So, there was one line of code that said, literally, "No
redist bgp access list 25 in," just a line of code that revised an
instruction. Because the two sentences were put together as opposed to
being done on separate lines, the network read it as an "AND"
statement instead of an "OR" or an "IF" statement.

So, what happens is the network automatically replicates the
instruction set from the network access point from where this was
entered, which was Washington, DC, and it replicated itself to the
other network access points.  Because of the way the code was written,
it then said, 'ah hah, it's a network instruction, not a peering
instruction -- I'd better send it out to the hubs.' The hubs saw it,
and said, 'ah hah, I'd better send it out to the POPs.' Well, the
POPs -- the routers at the lower levels of the network -- do not
have the memory or capacity for the peering instructions because they
don't interface with anybody else, so they don't need that capacity.
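
[Illustration, not part of the interview: a toy model of the cascade
described above, assuming each router simply installs whatever routes
it receives and passes them downstream. The Router class, the capacity
figures, and the route count are all hypothetical. This does not model
real BGP behavior; it only shows why small routers at the bottom layer
would be the ones to fail.]

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    name: str
    capacity: int  # max routes this router can hold (hypothetical figure)
    children: list = field(default_factory=list)
    routes: set = field(default_factory=set)
    frozen: bool = False

    def receive(self, new_routes):
        """Install routes; freeze if over capacity, else pass them downstream."""
        if self.frozen:
            return
        self.routes |= set(new_routes)
        if len(self.routes) > self.capacity:
            self.frozen = True  # out of memory: the router stops working
            return
        for child in self.children:
            child.receive(new_routes)


def frozen_routers(root):
    """Collect the names of all frozen routers at or below root."""
    found = [root.name] if root.frozen else []
    for child in root.children:
        found += frozen_routers(child)
    return found


# Three layers: one NAP router, one hub, two small access POP routers.
pop_a = Router("pop-a", capacity=100)
pop_b = Router("pop-b", capacity=100)
hub = Router("hub-1", capacity=50_000, children=[pop_a, pop_b])
nap = Router("nap-dc", capacity=50_000, children=[hub])

# A full set of peering routes leaks in at the top and replicates downward.
nap.receive(range(30_000))

if __name__ == "__main__":
    print(frozen_routers(nap))
```

In this toy setup the NAP and hub routers have room for the leaked
routes and keep forwarding them, while both POP routers exceed their
capacity and freeze.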

So, when they got it, it basically froze the routers down at the third
level of the network. Meantime, we're sitting there reprogramming the
routers, but as fast as we can reprogram them, the replication feature
of the intelligent network overwhelms our ability to keep up.
Basically our decision was to shut down the network to reboot the
routers, to put in a fresh instruction set.

That's a long-winded explanation, but because your readers are more
technical, it's worthwhile!



============================== ISP Mailing List ==============================
Email ``unsubscribe'' to inet-access-request at earth.com to be removed.
inet-access archives are at ftp://ftp.earth.com/pub/archive/inet-access/





