update regarding 12/3/94 service disruption
Steve Heimlich
heimlich at ans.net
Sat Dec 10 21:18:54 UTC 1994
All,
ANS has a fix being exercised on our testnet for the routing software
(known as "GateD" - gateway daemon) bug which caused the service
disruption on Saturday, 12/3/94. The sequence of events leading to the
problem is extremely obscure, as should be evident from the description
below. This particular bug has been exercised only twice before in the
history of our use of this software and will not appear again following
deployment of the new software.
We will implement a phased rollout of the new routing software. The
rollout will begin this coming week on a small number of routers.
During the week we will observe the behavior of the new software. If
the results are satisfactory, a network-wide deployment will take place
the week of 12/19.
Steve Heimlich
Manager, Infrastructure Development
ANS
-------
We have these prerequisites:
- there is a network X which is announced into our backbone
- there is a primary announcement (1), a secondary announcement (2),
and a tertiary announcement (3)
- one ENSS, A, injects (1) and another ENSS, B, injects both (2) and (3)
(e.g., MAE-East may speak with Sprint and Alternet, which may be
secondary and tertiary providers, respectively)
- ENSS A must have a lower router ID than ENSS B (i.e., A < B)
and this sequence of events:
- ENSS A goes away non-gracefully such that iGP connectivity from the
backbone to ENSS A is withdrawn but the iBGP session stays up (e.g.,
a power loss or circuit outage but not a clean GateD shutdown)
- all routers notice loss of iGP connectivity to ENSS A within
one minute and reset the next hop for route (1) to network X to be
null, keeping the route in the BGP RIB in case iGP connectivity is
restored
- in addition to the above, ENSS B injects (2) into the backbone via
iBGP
- the exterior peer providing (2) withdraws the route to network X
within 2 minutes of the initial AS 690 loss of iGP connectivity to
ENSS A
- ENSS B then injects (3) into the backbone via iBGP
- all other routers see that the preference for network X has worsened
and therefore traverse the BGP RIB to find the best current route to
network X, attempting to verify as well that any route under
consideration has a valid next hop
- during the traversal, the routers mistakenly use an incorrect pointer
to verify existence of a good next hop, not realizing that the former
primary route (1) has a null next hop
- due to a bug in some comparison logic, the formerly primary route (1)
is selected from the BGP RIB if A < B (the router ID prerequisite
above) and is installed into the kernel
- the iBGP sessions from all backbone machines to ENSS A time out three
minutes after loss of iGP connectivity to ENSS A
- GateD crashes when it attempts to delete the mistakenly installed
formerly primary route (1) from the kernel
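
The route selection failure in the sequence above can be illustrated
with a small toy model. This is only a sketch and is not GateD code:
the Route class, the numeric preference values (lower taken to mean
more preferred), the "trigger" argument and both function names are
assumptions made for illustration. It models only the misdirected
next-hop check, not the router-ID-dependent comparison or the later
crash.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    label: str               # "(1)", "(2)" or "(3)", as in the description
    preference: int          # assumption: lower value means more preferred
    next_hop: Optional[str]  # None models the "null" next hop


def select_best(rib: List[Route]) -> Optional[Route]:
    """Intended behaviour: only routes with a usable next hop are eligible."""
    usable = [r for r in rib if r.next_hop is not None]
    return min(usable, key=lambda r: r.preference, default=None)


def select_best_buggy(rib: List[Route], trigger: Route) -> Optional[Route]:
    """Toy model of the failure: the next-hop check reads the wrong route."""
    best = None
    for candidate in rib:
        # Wrong pointer: validates `trigger` where `candidate` was meant.
        if trigger.next_hop is None:
            continue
        if best is None or candidate.preference < best.preference:
            best = candidate
    return best


# State after (2) has been withdrawn and ENSS B has injected (3).
route_1 = Route("(1)", preference=1, next_hop=None)      # ENSS A unreachable
route_3 = Route("(3)", preference=3, next_hop="ENSS-B")
rib = [route_1, route_3]

print(select_best(rib).label)                 # (3)
print(select_best_buggy(rib, route_3).label)  # (1)
```

In this model the correct traversal falls back to (3), while the
misdirected check lets the stale route (1), which has no next hop, win
on preference.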