Open letter to Level3 concerning the global routing issues on June 12th

jim deleskie deleskie at gmail.com
Fri Jun 12 15:53:13 UTC 2015


People from big telcos should never reply to mailing lists from work
addresses unless specifically allowed, which I suspect TATA doesn't allow
either, based on some direct, but old, knowledge :)

Filtering has been a community issue since my days @ MCI being AS3561,
often discussed, not often enough acted on. I suspect the topic has come up
at every "large" NSP I've worked at.  Frequently someone complains it's
"hard" to fix, or router X makes it hard to fix, or customer Y won't agree,
and not enough people stand up to force a fix.  I did a preso
on it (while working at TATA) with some other "smart folks", but for all
the usual reasons it died on the vine.  I don't blame (3) for this but our
community as a whole.  Many "people/networks" have to not do the "right
thing(tm)" for a failure like this to happen.


-jim

On Fri, Jun 12, 2015 at 12:43 PM, Utkarsh Gosain <
utkarsh.gosain at tatacommunications.com> wrote:

> Hi Martin
> I am not a spokesperson on behalf of L3 but I have worked for big telcos
> my whole career and my recommendation is to raise a trouble ticket if
> anyone on the forum is their customer and is affected.
> I don’t think engineers at the NOC are authorized to reply to forums at any
> of the major telcos, especially regarding outages, unless someone raises a
> trouble ticket and seeks an RCA of the issue one on one with them.
>
>
> Utkarsh Gosain
> Global Acc Director
> Tata Communications
>
>
> -----Original Message-----
> From: NANOG [mailto:nanog-bounces at nanog.org] On Behalf Of Martin Millnert
> Sent: Friday, June 12, 2015 11:33 AM
> To: NANOG
> Subject: Open letter to Level3 concerning the global routing issues on
> June 12th
>
> Dear Level3,
>
> The Internet is a cooperative effort, and it works well only when its
> participants take constructive actions to address errors and remedy
> problems.
> Your position as a major Internet Carrier bestows upon you a certain
> degree of responsibility for the correct operation of the Internet all
> across (and beyond) the planet. You have many customers. Customers will
> always occasionally make mistakes. You as a major Internet Carrier have a
> responsibility to limit, not amplify, your customers' mistakes.
> Other major carriers implement technical measures that severely limit the
> damage from customer mistakes and keep it from having global impact.
> Other major carriers also implement operational procedures in addition to
> technical measures.
> In combination, these measures drastically reduce the outage-hours as a
> result of customer configuration errors.
>
> At 08:44 UTC on Friday 12th of June, one of your transit customers,
> Telekom Malaysia (AS4788) began announcing the full Internet table back to
> you, which you accepted and propagated to your peers and customers, causing
> global outages for close to 3 hours.
> [ https://twitter.com/DynResearch/status/609340592036970496 ] During this
> 3 hour window, it appears (from your own service outage
> reports) that you did nothing to stop the global Internet outage, but that
> Telekom Malaysia themselves eventually resolved it. This lack of action on
> your end, and your disregard for the correct operation of the global
> Internet is astonishing. These mistakes do not need to happen.
> AS4788 under normal circumstances announces ~1900 IPv4 prefixes to the
> Internet. You accepted several hundred thousand prefixes from them - a
> max-prefix setting would have severely limited the damage. We expect that
> such limits are your practice as well, but they failed. When they do, it
> should not take ~3 hours to shut down the session(s).
>
> Many operators, in despair, turned down their peering sessions with you
> once it was clear you were causing the outages and no immediate fix was in
> sight. This improved the situation for some, but not for all. Had you
> deployed proper IRR filtering to reject the bad announcements, the impact
> would've been far less critical.
>
> As a direct consequence of your ~3 hours of inaction, as a local example,
> Swedish payment terminals were experiencing problems all over the country.
> The Swedish economy was directly affected by your inaction.
> There were queues when I was buying lunch! Imagine the food rage. The
> situation was probably similar at other places around the globe where
> people were awake.
>
> Operators around the planet are curious:
>   - Did Level3 not detect or understand that it was causing global
> Internet outages for ~3 hours?
>   - If Level3 did in fact detect or understand it was causing global
> Internet outages, why did it not properly and immediately remedy the
> situation?
>   - What is Level3 going to do to address these questions and begin work
> on restoring its credibility as a carrier?
>
> We all understand that mistakes do happen (in applying customer interface
> templates, etc.). However the Internet is all too pervasive in everyday
> life today for anything but swift action by carriers to remedy breakage
> after the fact. It is absolutely not sufficient to let a customer spend 3
> hours to detect and fix a situation like this one. It is unacceptable that
> no swift action was taken on your end to limit the global routing issues
> you caused.
>
> Sincerely,
> Martin Millnert
> Member of Internet Community - no carrier / ISP affiliation.
>
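
For readers less familiar with the two safeguards named in the letter above
(a max-prefix limit and IRR-based prefix filtering), here is a minimal
sketch of the idea in Python. It is not router configuration and not how
any particular carrier implements it: the CustomerSession class, its
fields, the AS number 64500 and the prefixes are all hypothetical, taken
from the ranges reserved for documentation. The "allowed" list stands in
for prefixes that would normally be derived from IRR route objects.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class CustomerSession:
    """Hypothetical model of one customer BGP session (not a real router API)."""
    asn: int
    allowed: list        # registered prefixes, e.g. built from IRR route objects
    max_prefixes: int    # hard limit on the number of accepted prefixes
    accepted: set = field(default_factory=set)
    up: bool = True

    def receive(self, prefix: str) -> bool:
        """Process one announcement; return True if it was accepted."""
        if not self.up:
            return False
        net = ipaddress.ip_network(prefix)
        # Prefix filter: accept only announcements covered by a registered prefix.
        covered = any(
            net.version == a.version and net.subnet_of(a) for a in self.allowed
        )
        if not covered:
            return False
        self.accepted.add(net)
        # Max-prefix: tear the session down once the limit is exceeded.
        if len(self.accepted) > self.max_prefixes:
            self.up = False
            self.accepted.clear()
        return self.up


# A customer registered for 203.0.113.0/24 "leaks" three unrelated routes.
leak = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/25"]

# Session with a prefix filter: only the covered route gets through.
filtered = CustomerSession(
    asn=64500, allowed=[ipaddress.ip_network("203.0.113.0/24")], max_prefixes=10
)
# Session with no effective filter, protected only by a max-prefix limit of 2.
unfiltered = CustomerSession(
    asn=64500, allowed=[ipaddress.ip_network("0.0.0.0/0")], max_prefixes=2
)

for p in leak:
    filtered.receive(p)
    unfiltered.receive(p)

print(len(filtered.accepted), filtered.up)      # 1 True
print(len(unfiltered.accepted), unfiltered.up)  # 0 False
```

In this toy model the filter keeps the leak out entirely, while the
max-prefix limit acts as a backstop that drops the session once the count
goes past the configured value. The two are complementary, which matches
the letter's point that both were expected to be in place.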


