Keepalives, WAS: NAP/ISP Saturation

Vadim Antonov avg at pluris.com
Sat Dec 21 08:50:13 UTC 1996


Tony Li <tli at jnx.com> wrote:

>        No flap dampening, no
>   hold-down "blackholing" after a failure (so as not to generate
>   route withdrawals for transient link outages), silly priority
>   and no sub-second ping intervals, and forget about LQM).

>None of these have anything to do with the link keepalive protocol and
>everything to do with internal link implementation.  Let's not confuse the
>issue.

Well, let us see:

a) dampening -- it makes a lot of sense to flap-dampen at circuit
   level, where it is cheap and efficient; and not pass the flap
   to routing protocols where it is a lot more expensive to process.
   I.e. if a DS3 CSU lost its marbles, there's no reason to recompute
   some 20K routes dependent on that particular circuit every 60 or
   so seconds.

b) blackholing -- if a circuit went down it makes sense to wait for
   some time (0.1 sec or so) and just drop traffic on the floor, in
   the hope that the outage is transient.  There are a lot of momentary
   carrier losses and other glitches in the telco transmission networks.
   Only when the outage is prolonged does it make sense to notify the
   routing level.

c) priority -- the link keepalive processes must have priorities
   _higher_ than that of routing protocols.  I.e. no amount of
   routing updates should cause false link flapping due to delayed
   keepalive messages.  The same holds on the wire: keepalive
   messages must go out ahead of routing updates queued on a link.

d) sub-second keepalive intervals -- this is probably the only
   method to discover _fast_ that the remote end is dead.  As it
   is now, it takes 30 sec or so for a local router to find out that
   the remote one is wedged, and to take appropriate action.

e) Link Quality Monitoring -- the usefulness is obvious.  For most
   link-level failures there are sufficient advance warnings (corrupted
   checksums, etc).  Also, there are a lot of things (like "stealth"
   rerouting by transmission fabric) which in some cases make a circuit
   worse than a disconnected one.  One particular case i have in mind
   suddenly increased link latency by some 400 ms (Satellites-R-Us,
   that's it), so manual intervention was required to move traffic
   off the link.   In any case, some automatic LQM shut-offs (on
   conditions like "latency is more than N ms" or "error frequency is
   higher than 1:10e6") are clearly in order.
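
To make (a) and (b) concrete, here is a minimal sketch in Python of a
filter sitting between the circuit and the routing protocol.  Everything
in it is an assumption for illustration: the class name LinkStateFilter,
the methods carrier() and poll(), and all the numbers (a penalty of 1000
per flap, the suppress/reuse thresholds, the 60 sec half-life) are made
up and not taken from any router implementation.  Only the 0.1 sec
hold-down comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class LinkStateFilter:
    """Hypothetical filter between circuit events and the routing level."""

    hold_down: float = 0.1            # sec to wait before reporting an outage (b)
    penalty_per_flap: float = 1000.0  # illustrative value, added per reported outage
    suppress_above: float = 2000.0    # illustrative: keep link down above this
    reuse_below: float = 750.0        # illustrative: release suppression below this
    half_life: float = 60.0           # illustrative: sec for penalty to halve

    carrier_up: bool = True           # raw state as seen on the circuit
    reported_up: bool = True          # state the routing protocol is told
    down_since: float = 0.0
    penalty: float = 0.0
    penalty_time: float = 0.0
    suppressed: bool = False

    def _decay(self, now: float) -> None:
        # Exponential decay of the accumulated flap penalty.
        elapsed = now - self.penalty_time
        if elapsed > 0:
            self.penalty *= 0.5 ** (elapsed / self.half_life)
            self.penalty_time = now

    def carrier(self, now: float, up: bool) -> None:
        # Record a raw carrier transition; routing is not told anything yet.
        if up == self.carrier_up:
            return
        self.carrier_up = up
        if not up:
            self.down_since = now

    def poll(self, now: float) -> bool:
        # Return the link state that should be visible to routing at `now`.
        self._decay(now)
        if self.suppressed and self.penalty < self.reuse_below:
            self.suppressed = False
        if self.reported_up:
            # Hold-down: only an outage that outlasts hold_down is reported.
            if not self.carrier_up and now - self.down_since >= self.hold_down:
                self.reported_up = False
                self.penalty += self.penalty_per_flap
                if self.penalty >= self.suppress_above:
                    self.suppressed = True
        elif self.carrier_up and not self.suppressed:
            # Dampening: a flapping link stays reported down while suppressed.
            self.reported_up = True
        return self.reported_up
```

With these made-up numbers a 50 ms carrier glitch never reaches routing,
and the third reported outage in a row (some 10 sec apart) keeps the link
reported down until the penalty decays below the reuse threshold.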

So i think the things i noted are rather relevant, and are necessary
if we're going to build a real production network.  I'm sorry if the
link keepalive protocol digression confused the original discussion
of security of the routing system.

--vadim




