seeing the trees in the forest of confusion

Alan Hannan alan at mindvision.com
Sat Apr 26 14:22:11 UTC 1997


  The source was isolated within 60 minutes of the start of the problem.
  
  The routes continued to propagate for more than 120 minutes after
  that, without the source contributing.  (Yes, the routes whose AS
  paths originated at 7007.)
  
  This is the interesting problem that no one seems to focus on.
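  To make that concrete, here is a rough sketch (with made-up prefixes
  and paths, not data from the incident) of what picking those routes
  out of a table looks like: the originating AS is simply the rightmost
  AS in a route's AS_PATH, so routes claiming 7007 as their origin were
  still visible in tables long after the source had been cut off.

def origin_as(as_path):
    # The originating AS is the rightmost AS in the AS_PATH attribute.
    return as_path[-1]

# Hypothetical table entries, purely illustrative.
table = {
    "198.51.100.0/24": [701, 1239, 7007],
    "203.0.113.0/24":  [701, 3561],
}

still_leaked = [p for p, path in table.items() if origin_as(path) == 7007]
print(still_leaked)   # prefixes whose routes still claim AS 7007 as origin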

  I suppose it is more fun to criticize policy and NSPs, but this may
  well be a hole in the BGP protocol itself, or more likely in vendors'
  implementations [or in users' settings of the twiddleable holddown
  timers].
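  As a back-of-the-envelope illustration of the timer point (a toy
  model with assumed numbers, not a claim about what actually
  happened): if each speaker in a chain sits on a withdrawal until its
  own configurable advertisement/holddown timer fires, the delays add
  up hop by hop, and the stale route outlives its source further
  downstream.

from dataclasses import dataclass

@dataclass
class Speaker:
    name: str
    timer: float   # seconds this speaker waits before passing a withdrawal on

def withdrawal_times(chain):
    # Time at which each speaker in a linear chain drops the stale route,
    # measured from the moment the first speaker hears the withdrawal.
    clock = 0.0
    times = {}
    for speaker in chain:
        times[speaker.name] = clock   # drops the route as soon as it hears...
        clock += speaker.timer        # ...but waits this long to tell the next hop
    return times

# Hypothetical chain: eight speakers, each holding updates for 30 seconds.
chain = [Speaker("AS%d" % (64500 + i), 30.0) for i in range(8)]
for name, t in withdrawal_times(chain).items():
    print("%s drops the stale route at t=%ds" % (name, t))

  Even with generous per-hop timers that only accounts for minutes, not
  the 120+ minutes observed, which is exactly why the persistence is
  the interesting part.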

  As well, regarding your question:

 ] where were all the network people ....
 
  The people who could and did fix this problem were on a bridged
  (voice) conference call, working out what was going on, sharing
  information, and resolving the issues.  The 24-hour NOC
  functionality worked quite properly.

  There is a balance between informing people of what is going on and
  working to fix the problem.  Most people would rather see the problem
  resolved and the postmortem take its time than see the incident
  prolonged so the masses can be told what actions are being taken.

  Having intelligent people answer the phone and explain what was
  going on wouldn't have helped solve the problem; it would just have
  made people feel better.  You could achieve the same feeling by
  taking a walk outdoors.

  This wasn't an Internet outage, and it wasn't a catastrophe.  It was
  a brownout that sporadically hit providers at varying strengths.
  At least one NSP measured its backbone load dropping by only 15%
  during the incident.

  The Internet infrastructure did not react as expected to this
  technical problem.  It continued to propagate the bad information
  when withdrawals should have caught up and canceled the
  announcements.
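  For reference, the expected behavior is simple; the sketch below is a
  minimal illustration with invented names (Rib, announce, withdraw),
  not any vendor's code.  A withdrawal for a prefix should remove the
  earlier announcement from the table outright, and that removal should
  be passed along to peers.

class Rib:
    def __init__(self):
        self.routes = {}            # prefix -> AS_PATH learned for that prefix

    def announce(self, prefix, as_path):
        self.routes[prefix] = as_path

    def withdraw(self, prefix):
        # A withdrawal cancels the announcement outright; there is nothing
        # to age out.  Routes lingering for two more hours points at
        # implementations and timers rather than the protocol's intent.
        self.routes.pop(prefix, None)

rib = Rib()
rib.announce("192.0.2.0/24", [7007])    # bogus origin, as in the incident
rib.withdraw("192.0.2.0/24")            # source isolated -> withdrawal follows
print(rib.routes)                       # expected result: {}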

  To make a security analogy: an entity designs a security policy
  according to its risks.  We openly acknowledged that this could
  happen and that it would hurt.  The risk assessment turned out to be
  inaccurate because the infrastructure did not behave as expected.

  We did not expect it to hurt for 3 hours.  It should have stopped 
  earlier.  Why it didn't is the only interesting question left.

  -a

] There are already 2 articles on the net about it
]
] http://www.news.com/News/Item/0,4,10083,00.html?latest
] http://www.wired.com/news/technology/story/3442.html
]
] and I am sure there are more to come. It seems too easy for one person/company
] to bring down the net. Yes, we all agree it was some kind of accident, one
] that wasn't supposed to happen, BUT I have to ask: where were all the network
] people who should have caught this before it hit the Internet full force?
]
] Yes, there is talk about having the RAs set up to prevent this type of
] thing, but not much else. We put our stuff in the RA when we remember to, but I
] don't think that many people look at them much anyway.
]
] Hopefully next time this can be stopped sooner, maybe within 10 - 20
] minutes. Maybe it is time for all companies, even the smaller ones, or at
] least those that use BGP, to have and really USE a 24-hour NOC, staffed
] with people who really know about routers and BGP. Maybe as the Internet
] grows, those making it grow need to take a more active role. Sorry, voice
] mail doesn't cut it, nor does a busy number when the Internet comes down....
]
