Peering/Transit eBGP sessions - pets or cattle?

Lukas Tribus lists at ltri.eu
Mon Feb 10 18:23:06 UTC 2020


Hello Adam,


On Mon, 10 Feb 2020 at 13:37, <adamv0025 at netconsultings.com> wrote:
> Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.

Cattle every day of the week.

I don't trust control-plane resiliency and things like ISSU any
farther than I can throw the big boxes they run on.

The entire network is engineered so that my customers *do not* feel
the loss of one node (*). That is the design principle here, and as
traffic grows and we keep adding capacity, it is something we always
consider.

How hard that is to achieve depends on the particular situation, and
it may be quite difficult in some setups, but not here.


That is why I can upgrade releases on those nodes (no customers, just
transit and peers) quite frequently. I can achieve that with mostly
zero packet loss because of the design and all-around traffic draining
using graceful shutdown and friends. We had quite a few issues
draining traffic from nodes in the past: brownouts caused by FIB
mismatches between routers, because per-VRF label allocation requires
an IP lookup on both the ingress and the egress node. Since we
switched to "per-CE" (meaning per-nexthop) label allocation, things
work great.
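
To illustrate what I mean with a toy Python model (made-up labels and
names, obviously nothing like what actually runs on the boxes): with
per-VRF allocation the label only identifies the VRF, so the egress
node still needs an IP lookup in a FIB that may be stale during
convergence, while a per-nexthop label maps straight to the CE.

    # Toy model of the egress-PE difference between per-VRF and
    # per-CE (per-nexthop) MPLS label allocation.

    # Per-VRF: one label per VRF table; the egress node must do a
    # second IP lookup, and a stale FIB during convergence means drops.
    per_vrf_labels = {100: "VRF_A"}              # VPN label -> VRF
    vrf_fibs = {"VRF_A": {"203.0.113.0/24": "CE1"}}

    def egress_lookup_per_vrf(label, dst_prefix):
        vrf = per_vrf_labels[label]
        return vrf_fibs[vrf].get(dst_prefix, "DROP")  # stale -> brownout

    # Per-CE: one label per next hop; the label alone selects the CE,
    # no IP lookup at the egress node at all.
    per_ce_labels = {200: "CE1", 201: "CE2"}     # VPN label -> next hop

    def egress_lookup_per_ce(label, dst_prefix):
        return per_ce_labels[label]

    print(egress_lookup_per_vrf(100, "203.0.113.0/24"))  # CE1 if in sync
    print(egress_lookup_per_ce(200, "203.0.113.0/24"))   # CE1 always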

On the other hand, transit support for graceful shutdown is of course
great, but even without it you know about maintenance on your box or
your transit's box beforehand, so you can manually drain your egress
traffic (your peer doesn't have to support RFC 8326 for you to drop
YOUR loc-pref to zero). Many transit providers also have some kind of
"set loc-pref below peer" community, which allows you to do basically
the same thing manually without actual RFC 8326 support on the other
side. That said, for ingress traffic, unless you are announcing *A
LOT* of routes, convergence is usually *very* fast anyway.
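
For what it's worth, the inbound logic of RFC 8326 is trivial; here is
a minimal sketch in Python (a simplified, assumed route representation,
not any particular BGP implementation) of what the receiving side does
with the GRACEFUL_SHUTDOWN community:

    # RFC 8326 inbound policy sketch: a route tagged with the well-known
    # GRACEFUL_SHUTDOWN community (65535:0) gets LOCAL_PREF 0, so egress
    # traffic drains to alternate paths before the session goes down.
    GRACEFUL_SHUTDOWN = (65535, 0)

    def apply_inbound_policy(route):
        # route: {'prefix': str, 'communities': set, 'local_pref': int}
        if GRACEFUL_SHUTDOWN in route["communities"]:
            route["local_pref"] = 0
        return route

    r = {"prefix": "192.0.2.0/24",
         "communities": {GRACEFUL_SHUTDOWN},
         "local_pref": 100}
    print(apply_inbound_policy(r)["local_pref"])  # 0 -> path deprioritized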

I can see the benefit of internal HW redundancy for nodes where
customers are connected (shorter maintenance windows, fewer outages in
some single-HW-failure scenarios, theoretically better overall service
uptime), but it never covers everything, and it may just introduce
unnecessary complexity, which can itself end up root-causing outages.

Maybe I'm just a lucky fellow, but the hardware has been so reliable
here that I'm pretty sure the complexity of Dual-RSP, ISSU and friends
would have caused more issues over time than what I'm seeing with some
good old and honest HW failures.

Regarding HW redundancy itself: Dual RSP doesn't have any benefit when
the guy in the MMR pulls the wrong fiber, bringing down my transit. It
will still be BGP that has to converge. We don't have PIC today; maybe
that is something to look into in the future, but it isn't something
that internal HW redundancy fixes.
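
For context, the whole point of PIC (as I understand it; the Python
below is just a conceptual sketch with made-up names, not a FIB
implementation) is that many prefixes share one pathlist with a
pre-computed backup, so a failure is repaired with a single swap
instead of re-converging each prefix:

    # Conceptual sketch of BGP PIC: prefixes reference a shared pathlist
    # with a pre-installed backup, so a primary failure is repaired once
    # instead of re-converging hundreds of thousands of prefixes.
    class PathList:
        def __init__(self, primary, backup):
            self.primary, self.backup = primary, backup
            self.active = primary

        def primary_failed(self):
            self.active = self.backup  # one swap covers all prefixes

    transit_paths = PathList(primary="transit_A", backup="transit_B")
    prefixes = ("198.51.100.0/24", "203.0.113.0/24")  # ... plus ~700k more
    fib = {p: transit_paths for p in prefixes}

    transit_paths.primary_failed()
    print(fib["198.51.100.0/24"].active)  # transit_B, no per-prefix churn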

A straightforward, KISS design, where the engineers actually know
"what happens when" and how to do things properly (like draining
traffic), combined with, quite frankly, accepting some brownouts for
uncommon events, is the strategy that has worked best for us.


(*) Sure, if the node holding 700k best paths towards a transit dies
non-gracefully (HW or power failure), there will be a brownout of the
affected prefixes for some minutes. But after convergence my network
will be fine and my customers will stop feeling it. They will ask what
happened, and I will be able to explain.


cheers,
lukas


