BGP failure analysis and recommendations

Fri Oct 25 18:01:44 UTC 2013

As a member of the support team for a vendor, I'll say this problem isn't
entirely unheard of. The CPU is in charge of local traffic and the BGP
session, and some sort of hardware chip or ASIC is in charge of moving
packets through the device. If the hardware is misprogrammed, it won't
properly forward traffic while BGP thinks it's doing its job. This is not
to be confused with a hardware failure. This is purely a software problem.
The software is responsible for telling the hardware what to do, and
sometimes there are bugs there, like there are bugs in all code.

The easiest way to catch this kind of issue is to run some other
control-plane check that is tied to the data plane. That is, the only way
to make sure that the peer is forwarding traffic is to make it forward
traffic and react when it fails. You could do something like set up IP SLA
(e.g., a ping probe) to something in that SP network. If the ping fails,
your peer may have a forwarding issue, and you can apply a policy to remove
or at least not prefer that peer (de-preferring rather than removing is
safer in case it's a false positive).
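
As a rough sketch of that probe-and-react idea (this is not IP SLA itself,
just the same logic in Python), something like the following could track a
peer's data-plane health. The names probe and PeerHealth, the thresholds,
and the Linux-style ping flags are all assumptions made for illustration.
Requiring several consecutive losses before de-preferring, and several
consecutive replies before restoring, is one way to limit false positives.

```python
import subprocess
from dataclasses import dataclass


def probe(target: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


@dataclass
class PeerHealth:
    """Tracks probe results for one upstream peer (hypothetical helper)."""
    fail_threshold: int = 3      # consecutive losses before de-preferring
    recover_threshold: int = 5   # consecutive replies before restoring
    preferred: bool = True
    _fails: int = 0
    _oks: int = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result and return whether the peer is preferred."""
        if ok:
            self._oks += 1
            self._fails = 0
            if not self.preferred and self._oks >= self.recover_threshold:
                self.preferred = True
        else:
            self._fails += 1
            self._oks = 0
            if self.preferred and self._fails >= self.fail_threshold:
                self.preferred = False
        return self.preferred
```

In use, a loop would call health.record(probe(target)) every few seconds
against an address that is only reachable through that provider, and apply
whatever policy change you choose (lower local preference, for example)
when the returned value flips to False. How that policy gets pushed to the
router is left out here because it depends on the platform.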

-Pete

On Wed, Oct 23, 2013 at 10:40 PM, JRC NOC
<nospam-nanog at jensenresearch.com> wrote:

> Hello Nanog -
>
> On Saturday, October 19th at about 13:00 UTC we experienced an IP failure
> at one of our sites in the New York area.
> It was apparently a widespread outage on the East coast, but I haven't
> seen it discussed here.
>
> We are multihomed, using EBGP to three (diverse) upstream providers. One
> provider experienced a hardware failure in a core component at one POP.
> Regrettably, during the outage our BGP session remained active and we
> continued receiving full routes from the affected AS.  And our prefixes
> continued to be advertised at their border. However, basically none of the
> traffic between those prefixes over that provider was delivered. The bogus
> routes stayed up for hours. We shut down the BGP peering session when the
> nature of the problem became clear. This was effective. I believe that all
> customer BGP routes were similarly affected, including those belonging to
> some large regional networks and corporations. I have raised the questions
> below with the provider but haven't received any information or advice.
>
> My question is why did our BGP configuration fail?  I'm guessing the basic
> answer is that the IGP and route reflectors within that provider were still
> connected, but the forwarding paths were unavailable.  My BGP session
> basically acted like a bunch of static routes, with no awareness of the
> failure(s) and no dynamic reconfiguration of the RIB.
>
> Is this just an unavoidable issue with scaling large networks?
> Is it perhaps a known side effect of MPLS?
> Have we/they lost something important in the changeover to converged
> multiprotocol networks?
> Is there a better way for us edge networks to achieve IP resiliency in the
> current environment?
>
> This is an operational issue. Thanks in advance for any hints about what
> happened or better practices to reduce the impact of a routine hardware
> fault in an upstream network.
>
> - Eric Jensen
>
>  Date: Wed, 23 Oct 2013 21:26:43 -0400
>> To: cj at chrisjensen.org
>> From: JRC NetOps <noc at jensenresearch.com>
>> Subject: Fwd: BGP failure analysis and recommendations
>>
>>
>>  Date: Mon, 21 Oct 2013 23:19:28 -0400
>>> To: christopher.smith at level3.com
>>> From: Eric Jensen <ejensen at jensenresearch.com>
>>> Subject: BGP failure analysis and recommendations
>>> Cc: "Joe Budelis Fast-E.com" <Joe at Fast-E.com>
>>> Bcc: noc at jensenresearch.com
>>>
>>> Hello Christopher Smith -
>>>
>>> I left you a voicemail message today. The Customer Service folks also
>>> gave me your email address.
>>>
>>> We have a small, but high-value multi-homed corporate network.
>>> We operate using our AS number 17103.
>>>
>>> We have BGP transit circuits with Level 3, Lightpath, and at our colo
>>> center (AS8001).
>>> The Level 3 circuit ID is BBPM9946.
>>>
>>> On Saturday, October 19 2013 we had a large IP outage. I tracked it back
>>> to our Level 3 circuit and opened a ticket (7126634).
>>> I have copied (below) an email I sent our channel salesman with more
>>> details about our BGP problems during your outage.
>>> Briefly, I am very concerned that Level 3 presented routes to us that
>>> were not actually reachable through your network, and even worse Level 3
>>> kept advertising our prefixes even though your network couldn't deliver the
>>> traffic to us for those prefixes.
>>>
>>> I believe that the BGP NLRI data should follow the same IP path as the
>>> forwarded data itself. Apparently this isn't the case at Level 3.
>>> I also believe that your MPLS backbone should have recovered
>>> automatically from the forwarding failure, but this didn't happen either.
>>> My only fix was to manually shut down the BGP peering session with Level
>>> 3.
>>>
>>> Can you explain to me how Level 3 black-holed my routes?
>>> Can you suggest some change to our or your BGP configuration to
>>> eliminate this BGP failure mode?
>>>
>>> Just to be clear, I don't expect our circuit, or your network, to be up
>>> all the time. But I do expect that the routes you advertise to us and to
>>> your BGP peers actually be reachable through your network. On Saturday this
>>> didn't happen. The routes stayed up while the data transport was down.
>>>
>>> Our IPv4 BGP peering session with Level 3 remains down in the interim.
>>> Please get back to me as soon as possible.
>>>
>>> - Eric Jensen
>>> AS17103
>>> 201-741-9509
>>>
>>>
>>>
>>>  Date: Mon, 21 Oct 2013 22:55:35 -0400
>>>> To: "Joe Budelis Fast-E.com" <Joe at Fast-E.com>
>>>> From: Eric Jensen <ejensen at jensenresearch.com>
>>>> Subject: Re:  Fwd: Level3 Interim Response
>>>> Bcc: noc at jensenresearch.com
>>>>
>>>> Hi Joe-
>>>>
>>>> Thanks for making the new inquiry.
>>>> This was a big outage. Apparently Time Warner Cable and Cablevision
>>>> were affected greatly. Plus many large corporate networks. And of course
>>>> all the single-homed Level 3 customers worldwide. My little network was
>>>> just one more casualty.
>>>>
>>>> See:
>>>>
>>>> http://www.dslreports.com/forum/r28749556-Internet-Level3-Outage-
>>>>
>>>> http://online.wsj.com/news/articles/SB10001424052702304864504579145813698584246
>>>>
>>>> For our site, the massive outage started at about 9:00 am Saturday and
>>>> lasted until after 2:00 pm. I opened a ticket about 9:30 am but only
>>>> realized the routing issues and took down our BGP session about 12:00 to
>>>> try to minimize the problems for our traffic caused by their misconfigured
>>>> BGP.
>>>>
>>>> There can always be equipment failures and fiber cuts. That's not the
>>>> problem.
>>>> From my point of view the problem was/is that Level 3 kept
>>>> "advertising" our prefix but couldn't deliver the packets to us. They did
>>>> this for all their customers' prefixes, thereby sucking in about half the
>>>> NYC area internet traffic and dumping it into the Hudson River, for a huge
>>>> period of time.
>>>> They also kept advertising all their BGP routes to me, thereby fooling
>>>> my routers into sending outbound traffic to Level 3 where they again dumped
>>>> my traffic into the Hudson.
>>>>
>>>> I called Level 3 customer service today and have the name of a network
>>>> engineer to discuss options for fixing the BGP failure.
>>>> If you get any response with an engineering contact please let me know.
>>>>
>>>> I shouldn't have to manually intervene to route around problems. Even
>>>> sadder is the response from Level 3 explaining that they spent hours trying
>>>> to find the problem and had to manually reconfigure their network, leading
>>>> to saturated links and more problems. Their network only healed when the
>>>> faulty line card was replaced.
>>>>
>>>> I had reactivated the BGP session later that night, but after reviewing
>>>> the actual damage that we incurred, and the widespread nature of the
>>>> failure, I have decided to leave our Level 3 BGP session down, at least
>>>> until the engineering situation improves.
>>>> There may not be any good way to use a Level 3 BGP session without
>>>> risking the same "black hole" problem going forward. It's that type of
>>>> failure that BGP is specifically designed to deal with, but it was
>>>> developed in the days of point-to-point circuits carrying IP traffic.
>>>>
>>>> Nowadays some networks have a new layer between the wires and IP,
>>>> namely MPLS, and this allowed BGP to stay up but deprived the routers of
>>>> functioning IP next-hops, which they (both the Level 3 IP routers and the
>>>> Level 3 personnel) were unaware of. Apparently the Level 3 IP-based BGP
>>>> routers all believed they had working circuits edge-to-edge, but in fact
>>>> their network was partitioned.
>>>>
>>>> MPLS must have some redundancy features, but they obviously weren't
>>>> working on Saturday. This is a huge engineering failure. No large ISP could
>>>> function this way for long.
>>>>
>>>> I can wait the 72 hours for their response. I expect it will be full of
>>>> mealy-mouth platitudes about how no system is foolproof and it will all be
>>>> fine now.
>>>>
>>>> It would be more interesting to me to be in the meeting room where some
>>>> engineer has to explain how they could lose so much traffic and not be able
>>>> to operate a functioning, if degraded, network after a single line card
>>>> failure.  It wouldn't be the head of network design, because that person
>>>> would already have been fired.
>>>>
>>>> Let me know if you hear anything. I will do the same.
>>>>
>>>> - Eric Jensen
>>>> AS17103
>>>> 201-741-9509
>>>>
>>>>
>>>>
>>>>