BGP failure analysis and recommendations

Rajiv Asati (rajiva) rajiva at
Mon Nov 4 17:31:02 UTC 2013

The problem below was the motivation for this BGP improvement:

-----Original Message-----
From: Pete Lumbis <alumbis at>
Date: Friday, October 25, 2013 2:01 PM
To: JRC NOC <nospam-nanog at>
Cc: "nanog at" <nanog at>
Subject: Re: BGP failure analysis and recommendations

>As a member of the support team for a vendor, I'll say this problem isn't
>entirely unheard of. The CPU is in charge of local traffic and the BGP
>session, and some sort of hardware chip or ASIC is in charge of moving
>packets through the device. If the hardware is misprogrammed it won't
>properly forward traffic while BGP thinks it's doing its job. This is not
>to be confused with a hardware failure. This is purely a software problem.
>The software is responsible for telling the hardware what to do, and
>sometimes there are bugs there, like there are bugs in all code.
>The easiest way to test this kind of issue is to have some other control
>plane that is tied to the data plane. That is, the only way to make sure
>that the peer is forwarding traffic is to make it forward traffic and
>detect when it fails. You could do something like set up IP SLA (i.e.,
>ping) to something in that SP network. If the ping fails then it sounds
>like your peer may have a forwarding issue and you can apply a policy to
>remove or at least not prefer that peer (in case it's a false positive).
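
The check Pete describes (actively probe through the peer, and demote the peer when the data plane fails even though the BGP session is up) can be sketched outside any vendor CLI. This is only an illustrative sketch: the probe targets, thresholds, and the `ping` flags (Linux-style `-c`/`-W`) are assumptions, not anything specified in the thread.

```python
import subprocess

def probe(target: str, count: int = 3, timeout_s: int = 1) -> bool:
    """One data-plane probe: ping a host that should be reachable
    only via the peer under test. Flags assume Linux ping."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), target],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def peer_usable(probe_results: list[bool], min_ok: int = 1) -> bool:
    """Decide whether to keep preferring the peer. Requiring only
    min_ok successes tolerates a single false-positive probe loss."""
    return sum(probe_results) >= min_ok

# Hypothetical policy loop: the addresses are documentation-range
# placeholders, not real provider infrastructure.
# results = [probe(t) for t in ("192.0.2.1", "192.0.2.2", "192.0.2.3")]
# if not peer_usable(results):
#     ... lower local-preference for that peer, or shut the session ...
```

The decision step is separated from the probing step so the demotion policy (how many failed probes, over how many targets) can be tuned against false positives, which is the caveat Pete raises.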
>On Wed, Oct 23, 2013 at 10:40 PM, JRC NOC
><nospam-nanog at>wrote:
>> Hello Nanog -
>> On Saturday, October 19th at about 13:00 UTC we experienced an IP outage
>> at one of our sites in the New York area.
>> It was apparently a widespread outage on the East coast, but I haven't
>> seen it discussed here.
>> We are multihomed, using EBGP to three (diverse) upstream providers. One
>> provider experienced a hardware failure in a core component at one POP.
>> Regrettably, during the outage our BGP session remained active and we
>> continued receiving full routes from the affected AS.  And our prefixes
>> continued to be advertised at their border. However, basically none of
>> the traffic between those prefixes over that provider was delivered. The
>> routes stayed up for hours. We shut down the BGP peering session when the
>> nature of the problem became clear. This was effective. I believe that
>> customer BGP routes were similarly affected, including those belonging to
>> some large regional networks and corporations. I have raised the issue
>> below with the provider but haven't received any information or advice.
>> My question is why did our BGP configuration fail?  I'm guessing the
>> answer is that the IGP and route reflectors within that provider were
>> still connected, but the forwarding paths were unavailable.  My BGP
>> session basically acted like a bunch of static routes, with no awareness
>> of the failure(s) and no dynamic reconfiguration of the RIB.
>> Is this just an unavoidable issue with scaling large networks?
>> Is it perhaps a known side effect of MPLS?
>> Have we/they lost something important in the changeover to converged
>> multiprotocol networks?
>> Is there a better way for us edge networks to achieve IP resiliency in
>> the current environment?
>> This is an operational issue. Thanks in advance for any hints about what
>> happened or better practices to reduce the impact of a routine hardware
>> fault in an upstream network.
>> - Eric Jensen
>>  Date: Wed, 23 Oct 2013 21:26:43 -0400
>>> To: cj at
>>> From: JRC NetOps <noc at>
>>> Subject: Fwd: BGP failure analysis and recommendations
>>>  Date: Mon, 21 Oct 2013 23:19:28 -0400
>>>> To: christopher.smith at
>>>> From: Eric Jensen <ejensen at>
>>>> Subject: BGP failure analysis and recommendations
>>>> Cc: "Joe Budelis" <Joe at>
>>>> Bcc: noc at
>>>> Hello Christopher Smith -
>>>> I left you a voicemail message today. The Customer Service folks also
>>>> gave me your email address.
>>>> We have a small, but high-value multi-homed corporate network.
>>>> We operate using our AS number 17103.
>>>> We have BGP transit circuits with Level 3, Lightpath, and at our colo
>>>> center (AS8001)
>>>> The Level 3 circuit ID is BBPM9946
>>>> On Saturday, October 19, 2013 we had a large IP outage. I tracked it
>>>> down to our Level 3 circuit and opened a ticket (7126634).
>>>> I have copied (below) an email I sent our channel salesman with more
>>>> details about our BGP problems during your outage.
>>>> Briefly, I am very concerned that Level 3 presented routes to us that
>>>> were not actually reachable through your network, and even worse Level
>>>> 3 kept advertising our prefixes even though your network couldn't
>>>> deliver the traffic to us for those prefixes.
>>>> I believe that the BGP NLRI data should follow the same IP path as the
>>>> forwarded data itself. Apparently this isn't the case at Level 3.
>>>> I also believe that your MPLS backbone should have recovered
>>>> automatically from the forwarding failure, but this didn't happen.
>>>> My only fix was to manually shut down the BGP peering session with
>>>> Level 3.
>>>> Can you explain to me how Level 3 black-holed my routes?
>>>> Can you suggest some change to our or your BGP configuration to
>>>> eliminate this BGP failure mode?
>>>> Just to be clear, I don't expect our circuit, or your network, to be
>>>> up all the time. But I do expect that the routes you advertise to us
>>>> and to your BGP peers actually be reachable through your network. On
>>>> Saturday this didn't happen. The routes stayed up while the data
>>>> transport was down.
>>>> Our IPv4 BGP peering session with Level 3 remains down in the interim.
>>>> Please get back to me as soon as possible.
>>>> - Eric Jensen
>>>> AS17103
>>>> 201-741-9509
>>>>  Date: Mon, 21 Oct 2013 22:55:35 -0400
>>>>> To: "Joe Budelis" <Joe at>
>>>>> From: Eric Jensen <ejensen at>
>>>>> Subject: Re:  Fwd: Level3 Interim Response
>>>>> Bcc: noc at
>>>>> Hi Joe-
>>>>> Thanks for making the new inquiry.
>>>>> This was a big outage. Apparently Time Warner Cable and Cablevision
>>>>> were affected greatly. Plus many large corporate networks. And of
>>>>> course all the single-homed Level 3 customers worldwide. My little
>>>>> network was just one more casualty.
>>>>> See:
>>>>> For our site, the massive outage started at about 9:00 AM Saturday
>>>>> and lasted until after 2:00 PM. I opened a ticket about 9:30 AM but
>>>>> only realized the routing issues and took down our BGP session about
>>>>> 12:00 to try to minimize the problems for our traffic caused by their
>>>>> BGP.
>>>>> There can always be equipment failures and fiber cuts. That's not the
>>>>> problem.
>>>>> From my point of view the problem was/is that Level 3 kept
>>>>> "advertising" our prefix but couldn't deliver the packets to us. They
>>>>> did this for all their customers' prefixes, thereby sucking in about
>>>>> half the NYC area internet traffic and dumping it into the Hudson
>>>>> River, for a period of time.
>>>>> They also kept advertising all their BGP routes to me, thereby
>>>>> tricking my routers into sending outbound traffic to Level 3 where
>>>>> they again dumped my traffic into the Hudson.
>>>>> I called Level 3 customer service today and have the name of an
>>>>> engineer to discuss options for fixing the BGP failure.
>>>>> If you get any response with an engineering contact please let me
>>>>> know.
>>>>> I shouldn't have to manually intervene to route around problems. Even
>>>>> sadder is the response from Level 3 explaining that they spent hours
>>>>> trying to find the problem and had to manually reconfigure their
>>>>> network, leading to saturated links and more problems. Their network
>>>>> only healed when the faulty line card was replaced.
>>>>> I had reactivated the BGP session later that night, but after
>>>>> considering the actual damage that we incurred, and the widespread
>>>>> nature of the failure, I have decided to leave our Level 3 BGP
>>>>> session down, at least until the engineering situation improves.
>>>>> There may not be any good way to use a Level 3 BGP session without
>>>>> risking the same "black hole" problem going forward. It's that type
>>>>> of failure that BGP is specifically designed to deal with, but it was
>>>>> developed in the days of point-to-point circuits carrying IP traffic.
>>>>> Nowadays some networks have a new layer between the wires and IP,
>>>>> namely MPLS, and this allowed BGP to stay up but deprived the routers
>>>>> of functioning IP next-hops, which they (both the Level 3 IP routers
>>>>> and the Level 3 personnel) were unaware of. Apparently the Level 3
>>>>> IP-based routers all believed they had working circuits edge-to-edge,
>>>>> but in fact their network was partitioned.
>>>>> MPLS must have some redundancy features, but they obviously weren't
>>>>> working on Saturday. This is a huge engineering failure. No large ISP
>>>>> could function this way for long.
>>>>> I can wait the 72 hours for their response. I expect it will be full
>>>>> of mealy-mouthed platitudes about how no system is foolproof and it
>>>>> will all be fine now.
>>>>> It would be more interesting to me to be in the meeting room where
>>>>> the engineer has to explain how they could lose so much traffic and
>>>>> not be able to operate a functioning, if degraded, network after a
>>>>> single line card failure.  It wouldn't be the head of network design,
>>>>> because that person would already have been fired.
>>>>> Let me know if you hear anything. I will do the same.
>>>>> - Eric Jensen
>>>>> AS17103
>>>>> 201-741-9509

More information about the NANOG mailing list