Open letter to Level3 concerning the global routing issues on June 12th

Rafael Possamai rafael at gav.ufsc.br
Sat Jun 13 13:31:01 UTC 2015


Something about Malaysia, first the airplanes... now BGP leaks?

On Fri, Jun 12, 2015 at 10:32 AM, Martin Millnert <millnert at gmail.com>
wrote:

> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its
> participants take constructive actions to address errors and remedy
> problems.
> Your position as a major Internet Carrier bestows upon you a certain
> degree of responsibility for the correct operation of the Internet all
> across (and beyond) the planet. You have many customers. Customers will
> always occasionally make mistakes. You as a major Internet Carrier have
> a responsibility to limit, not amplify, your customers' mistakes.
> Other major carriers implement technical measures that severely limit
> the damage from customer mistakes and keep it from having global impact.
> Other major carriers also implement operational procedures in addition
> to technical measures.
> In combination, these measures drastically reduce the outage-hours as a
> result of customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788) began announcing the full Internet table back
> to you, which you accepted and propagated to your peers and customers,
> causing global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ]
> During this 3-hour window, it appears (from your own service outage
> reports) that you did nothing to stop the global Internet outage, but
> that Telekom Malaysia themselves eventually resolved it. This lack of
> action on your end, and your disregard for the correct operation of the
> global Internet is astonishing. These mistakes do not need to happen.
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
> Internet. You accepted several hundred thousand prefixes from them - a
> max-prefix setting would have severely limited the damage. We expect
> that such limits are your practice as well, but they failed. When they
> fail, it should not take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you
> once it was clear you were causing the outages and no immediate fix was
> in sight. This improved the situation for some, but not for all. Had
> you deployed proper IRR-filtering to filter the bad announcements, the
> impact would've been far less critical.
>
> As a direct consequence of your ~3 hours of inaction, as a local
> example, Swedish payment terminals were experiencing problems all over
> the country. The Swedish economy was directly affected by your inaction.
> There were queues when I was buying lunch! Imagine the food rage. The
> situation was probably similar at other places around the globe where
> people were awake.
>
> Operators around the planet are curious:
>   - Did Level3 not detect or understand that it was causing global
> Internet outages for ~3 hours?
>   - If Level3 did in fact detect or understand it was causing global
> Internet outages, why did it not properly and immediately remedy the
> situation?
>   - What is Level3 going to do to address these questions and begin work
> on restoring its credibility as a carrier?
>
> We all understand that mistakes do happen (in applying customer
> interface templates, etc.). However, the Internet is all too pervasive in
> everyday life today for anything but swift action by carriers to remedy
> breakage after the fact. It is absolutely not sufficient to let a
> customer spend 3 hours to detect and fix a situation like this one. It
> is unacceptable that no swift action was taken on your end to limit the
> global routing issues you caused.
>
> Sincerely,
> Martin Millnert
> Member of Internet Community - no carrier / ISP affiliation.
>
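
For readers unfamiliar with the two safeguards the letter names, here is a
minimal sketch of the idea behind a max-prefix limit and an IRR-style prefix
filter. It is not Level3's or any vendor's implementation: the function names,
the exact-match filtering rule, and the sample limit are all assumptions made
for illustration, and the addresses are documentation prefixes.

```python
from ipaddress import ip_network


def apply_prefix_filter(announced, registered):
    """Keep only announcements that exactly match a registered prefix.

    Assumption: exact-match filtering. Real IRR-derived filters are built
    from registry data and often also permit some more-specific prefixes.
    """
    allowed = {ip_network(p) for p in registered}
    return [p for p in announced if ip_network(p) in allowed]


def exceeds_max_prefix(announced, limit):
    """Return True when the count of distinct prefixes is above the limit.

    A router with a max-prefix setting would typically tear the session
    down at this point instead of propagating the routes further.
    """
    return len({ip_network(p) for p in announced}) > limit


# Hypothetical customer that has registered two prefixes.
registered = ["203.0.113.0/24", "198.51.100.0/24"]

# Hypothetical leak: the customer suddenly announces routes it does not own.
leak = registered + ["192.0.2.0/25", "192.0.2.128/25"]

filtered = apply_prefix_filter(leak, registered)
tripped = exceeds_max_prefix(leak, limit=3)  # limit value is illustrative only
```

Either check alone would contain a leak of this kind: the filter drops the
unregistered routes, and the limit flags the session once the count jumps
well past what the customer normally announces (~1900 prefixes for AS4788,
per the letter).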


