BGP failure analysis and recommendations

Rajiv Asati (rajiva) rajiva at cisco.com
Mon Nov 4 17:31:02 UTC 2013


The problem below was the motivation for this BGP improvement:

http://tools.ietf.org/html/draft-ietf-idr-bgp-bestpath-selection-criteria
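
The draft has the details; as a rough illustration of the general direction
(a path should only compete for best path if traffic can actually be
delivered toward its next hop), here is a minimal Python sketch. The Path
fields, the verify_forwarding() hook, and the abbreviated tie-break are
illustrative assumptions, not the draft's actual procedure.

    # Illustrative sketch only, not the draft's algorithm. Assumes a
    # verify_forwarding() callback that tests the data plane toward a
    # next hop (for example, by probing it); that hook is hypothetical.
    from dataclasses import dataclass
    from typing import Callable, Iterable, Optional

    @dataclass
    class Path:
        prefix: str
        next_hop: str
        local_pref: int
        as_path_len: int

    def best_path(paths: Iterable[Path],
                  verify_forwarding: Callable[[str], bool]) -> Optional[Path]:
        # Drop paths whose next hop cannot actually carry traffic, even if
        # the BGP session that advertised them is still established.
        usable = [p for p in paths if verify_forwarding(p.next_hop)]
        if not usable:
            return None
        # Abbreviated stand-in for the normal decision process: prefer the
        # highest LOCAL_PREF, then the shortest AS_PATH.
        return max(usable, key=lambda p: (p.local_pref, -p.as_path_len))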


 
-----Original Message-----
From: Pete Lumbis <alumbis at gmail.com>
Date: Friday, October 25, 2013 2:01 PM
To: JRC NOC <nospam-nanog at jensenresearch.com>
Cc: "nanog at nanog.org" <nanog at nanog.org>
Subject: Re: BGP failure analysis and recommendations

>As a member of the support team for a vendor, I'll say this problem isn't
>entirely unheard of. The CPU is in charge of local traffic and the BGP
>session and some sort of hardware chip or ASIC is in charge of moving
>packets through the device. If the hardware is misprogrammed it won't
>properly forward traffic while BGP thinks it's doing its job. This is not
>to be confused with a hardware failure. This is purely a software problem.
>The software is responsible for telling the hardware what to do, and
>sometimes there are bugs there, like there are bugs in all code.
>
>The easiest way to test this kind of issue is to have some other control
>plane that is tied to the data plane. That is, the only way to make sure
>that the peer is forwarding traffic is to make it forward traffic and
>react when it fails. You could do something like set up IP SLA (i.e.,
>ping) to something in that SP network. If the ping fails then it sounds
>like your peer may have a forwarding issue and you can apply a policy to
>remove, or at least not prefer, that peer (in case it's a false positive).
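
IP SLA is a Cisco feature; as a platform-neutral stand-in, the sketch below
(Python, with a placeholder probe target, threshold, and interval) shows the
same idea: probe something inside the provider's network and raise a flag
when the data plane stops delivering, even though the BGP session may still
be up. The action taken on failure is left as a hook; this sketch only
alerts. On actual router platforms the same effect is usually achieved with
the vendor's probe/tracking features rather than an external script.

    #!/usr/bin/env python3
    """Minimal stand-in for the IP SLA idea described above: probe the data
    plane through one upstream and alert when it stops forwarding. The
    target address, threshold, and interval are placeholders."""
    import subprocess
    import time

    PROBE_TARGET = "192.0.2.1"   # something inside that SP's network (placeholder)
    FAIL_THRESHOLD = 3           # consecutive failed probes before acting
    PROBE_INTERVAL = 10          # seconds between probes

    def probe(target: str) -> bool:
        """One ICMP echo; True if the target answered within the timeout."""
        result = subprocess.run(
            ["ping", "-c", "1", "-W", "2", target],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def main() -> None:
        failures = 0
        while True:
            failures = 0 if probe(PROBE_TARGET) else failures + 1
            if failures >= FAIL_THRESHOLD:
                # Hook your own action here: shut or deprefer the peer,
                # page the NOC, etc. This sketch only prints a warning.
                print("data plane via this upstream looks broken; consider "
                      "shutting or deprefering the BGP session")
            time.sleep(PROBE_INTERVAL)

    if __name__ == "__main__":
        main()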
>
>-Pete
>
>
>On Wed, Oct 23, 2013 at 10:40 PM, JRC NOC
><nospam-nanog at jensenresearch.com> wrote:
>
>> Hello Nanog -
>>
>> On Saturday, October 19th at about 13:00 UTC we experienced an IP
>> failure at one of our sites in the New York area.
>> It was apparently a widespread outage on the East coast, but I haven't
>> seen it discussed here.
>>
>> We are multihomed, using EBGP to three (diverse) upstream providers. One
>> provider experienced a hardware failure in a core component at one POP.
>> Regrettably, during the outage our BGP session remained active and we
>> continued receiving full routes from the affected AS, and our prefixes
>> continued to be advertised at their border. However, basically none of
>> the traffic between those prefixes over that provider was delivered. The
>> bogus routes stayed up for hours. We shut down the BGP peering session
>> when the nature of the problem became clear. This was effective. I
>> believe that all customer BGP routes were similarly affected, including
>> those belonging to some large regional networks and corporations. I have
>> raised the questions below with the provider but haven't received any
>> information or advice.
>>
>> My question is why did our BGP configuration fail? I'm guessing the
>> basic answer is that the IGP and route reflectors within that provider
>> were still connected, but the forwarding paths were unavailable. My BGP
>> session basically acted like a bunch of static routes, with no awareness
>> of the failure(s) and no dynamic reconfiguration of the RIB.
>>
>> Is this just an unavoidable issue with scaling large networks?
>> Is it perhaps a known side effect of MPLS?
>> Have we/they lost something important in the changeover to converged
>> multiprotocol networks?
>> Is there a better way for us edge networks to achieve IP resiliency in
>> the current environment?
>>
>> This is an operational issue. Thanks in advance for any hints about what
>> happened or better practices to reduce the impact of a routine hardware
>> fault in an upstream network.
>>
>> - Eric Jensen
>>
>>  Date: Wed, 23 Oct 2013 21:26:43 -0400
>>> To: cj at chrisjensen.org
>>> From: JRC NetOps <noc at jensenresearch.com>
>>> Subject: Fwd: BGP failure analysis and recommendations
>>>
>>>
>>>  Date: Mon, 21 Oct 2013 23:19:28 -0400
>>>> To: christopher.smith at level3.com
>>>> From: Eric Jensen <ejensen at jensenresearch.com>
>>>> Subject: BGP failure analysis and recommendations
>>>> Cc: "Joe Budelis Fast-E.com" <Joe at Fast-E.com>
>>>> Bcc: noc at jensenresearch.com
>>>>
>>>> Hello Christopher Smith -
>>>>
>>>> I left you a voicemail message today. The Customer Service folks also
>>>> gave me your email address.
>>>>
>>>> We have a small but high-value multi-homed corporate network.
>>>> We operate using our AS number 17103.
>>>>
>>>> We have BGP transit circuits with Level 3, Lightpath, and our colo
>>>> center (AS8001). The Level 3 circuit ID is BBPM9946.
>>>>
>>>> On Saturday, October 19, 2013 we had a large IP outage. I tracked it
>>>> back to our Level 3 circuit and opened a ticket (7126634).
>>>> I have copied (below) an email I sent our channel salesman with more
>>>> details about our BGP problems during your outage.
>>>> Briefly, I am very concerned that Level 3 presented routes to us that
>>>> were not actually reachable through your network, and even worse,
>>>> Level 3 kept advertising our prefixes even though your network
>>>> couldn't deliver the traffic to us for those prefixes.
>>>>
>>>> I believe that the BGP NLRI data should follow the same IP path as the
>>>> forwarded data itself. Apparently this isn't the case at Level 3.
>>>> I also believe that your MPLS backbone should have recovered
>>>> automatically from the forwarding failure, but this didn't happen
>>>> either.
>>>> My only fix was to manually shut down the BGP peering session with
>>>> Level 3.
>>>>
>>>> Can you explain to me how Level 3 black-holed my routes?
>>>> Can you suggest some change to our or your BGP configuration to
>>>> eliminate this BGP failure mode?
>>>>
>>>> Just to be clear, I don't expect our circuit, or your network, to be
>>>> up all the time. But I do expect that the routes you advertise to us
>>>> and to your BGP peers actually be reachable through your network. On
>>>> Saturday this didn't happen. The routes stayed up while the data
>>>> transport was down.
>>>>
>>>> Our IPv4 BGP peering session with Level 3 remains down in the interim.
>>>> Please get back to me as soon as possible.
>>>>
>>>> - Eric Jensen
>>>> AS17103
>>>> 201-741-9509
>>>>
>>>>
>>>>
>>>>  Date: Mon, 21 Oct 2013 22:55:35 -0400
>>>>> To: "Joe Budelis Fast-E.com" <Joe at Fast-E.com>
>>>>> From: Eric Jensen <ejensen at jensenresearch.com>
>>>>> Subject: Re:  Fwd: Level3 Interim Response
>>>>> Bcc: noc at jensenresearch.com
>>>>>
>>>>> Hi Joe-
>>>>>
>>>>> Thanks for making the new inquiry.
>>>>> This was a big outage. Apparently Time Warner Cable and Cablevision
>>>>> were affected greatly. Plus many large corporate networks. And of
>>>>> course all the single-homed Level 3 customers worldwide. My little
>>>>> network was just one more casualty.
>>>>>
>>>>> See:
>>>>>
>>>>> http://www.dslreports.com/forum/r28749556-Internet-Level3-Outage-
>>>>>
>>>>> http://online.wsj.com/news/articles/SB10001424052702304864504579145813698584246
>>>>>
>>>>> For our site, the massive outage started at about 9:00 am Saturday
>>>>> and lasted until after 2:00 pm. I opened a ticket about 9:30 am but
>>>>> only realized the routing issues and took down our BGP session at
>>>>> about 12:00, to try to minimize the problems for our traffic caused
>>>>> by their misconfigured BGP.
>>>>>
>>>>> There can always be equipment failures and fiber cuts. That's not
>>>>> the problem.
>>>>> From my point of view the problem was/is that Level 3 kept
>>>>> "advertising" our prefix but couldn't deliver the packets to us. They
>>>>> did this for all their customers' prefixes, thereby sucking in about
>>>>> half the NYC area internet traffic and dumping it into the Hudson
>>>>> River, for a huge period of time.
>>>>> They also kept advertising all their BGP routes to me, thereby
>>>>> fooling my routers into sending outbound traffic to Level 3, where
>>>>> they again dumped my traffic into the Hudson.
>>>>>
>>>>> I called Level 3 customer service today and have the name of a
>>>>> network engineer to discuss options for fixing the BGP failure.
>>>>> If you get any response with an engineering contact please let me
>>>>> know.
>>>>>
>>>>> I shouldn't have to manually intervene to route around problems. Even
>>>>> sadder is the response from Level 3 explaining that they spent hours
>>>>> trying to find the problem and had to manually reconfigure their
>>>>> network, leading to saturated links and more problems. Their network
>>>>> only healed when the faulty line card was replaced.
>>>>>
>>>>> I had reactivated the BGP session later that night, but after
>>>>> reviewing the actual damage that we incurred, and the widespread
>>>>> nature of the failure, I have decided to leave our Level 3 BGP session
>>>>> down, at least until the engineering situation improves.
>>>>> There may not be any good way to use a Level 3 BGP session without
>>>>> risking the same "black hole" problem going forward. It's that type
>>>>> of failure that BGP is specifically designed to deal with, but it was
>>>>> developed in the days of point-to-point circuits carrying IP traffic.
>>>>>
>>>>> Nowadays some networks have a new layer between the wires and IP,
>>>>> namely MPLS, and this allowed BGP to stay up but deprived the routers
>>>>> of functioning IP next-hops, which they (both the Level 3 IP routers
>>>>> and the Level 3 personnel) were unaware of. Apparently the Level 3
>>>>> IP-based BGP routers all believed they had working circuits
>>>>> edge-to-edge, but in fact their network was partitioned.
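
To make the divergence Eric describes concrete (an editorial illustration,
not any vendor's implementation): a check that looks only at control-plane
state reports everything healthy while a data-plane probe fails.

    # Editorial illustration of the failure mode described above. The
    # control plane still sees the next hop as usable (session up, a
    # route/LSP entry present) while packets cannot actually get there.
    from dataclasses import dataclass

    @dataclass
    class NextHopState:
        session_up: bool      # BGP session is established
        route_present: bool   # IGP/LSP entry for the next hop still exists
        probe_succeeds: bool  # a real packet makes it across

    def control_plane_healthy(s: NextHopState) -> bool:
        # All classic BGP ever checks: session state plus next-hop resolution.
        return s.session_up and s.route_present

    def data_plane_healthy(s: NextHopState) -> bool:
        return s.probe_succeeds

    # Roughly the October 19 situation: BGP is satisfied, traffic is black-holed.
    broken = NextHopState(session_up=True, route_present=True, probe_succeeds=False)
    assert control_plane_healthy(broken) and not data_plane_healthy(broken)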
>>>>>
>>>>> MPLS must have some redundancy features, but they obviously weren't
>>>>> working on Saturday. This is a huge engineering failure. No large ISP
>>>>> could function this way for long.
>>>>>
>>>>> I can wait the 72 hours for their response. I expect it will be full
>>>>> of mealy-mouthed platitudes about how no system is foolproof and it
>>>>> will all be fine now.
>>>>>
>>>>> It would be more interesting to me to be in the meeting room where
>>>>> some engineer has to explain how they could lose so much traffic and
>>>>> not be able to operate a functioning, if degraded, network after a
>>>>> single line card failure. It wouldn't be the head of network design,
>>>>> because that person would already have been fired.
>>>>>
>>>>> Let me know if you hear anything. I will do the same.
>>>>>
>>>>> - Eric Jensen
>>>>> AS17103
>>>>> 201-741-9509




