Fiber cut in SF area

George William Herbert gherbert at retro.com
Mon Apr 13 20:30:27 CDT 2009



Matthew Petach wrote:
>> George William Herbert <gherbert at retro.com> wrote:
>>  Matthew Petach writes:
>>  >"protected rings" are a technology of the past.  Don't count on your
>>  >vendor to provide "redundancy" for you.  Get two unprotected runs
>>  >for half the cost each, from two different providers, and verify the
>>  >path separation and diversity yourself with GIS data from the two
>>  >providers; handle the failover yourself.  That way, you *know* what
>>  >your risks and potential impact scenarios are.  It adds a bit of
>>  >initial planning overhead, but in the long run, it generally costs a
>>  >similar amount for two unprotected runs as it does to get a
>>  >protected run, and you can plan your survival scenarios *much*
>>  >better, including surviving things like one provider going under,
>>  >work stoppages at one provider, etc.
>>
>> This completely ignores the grooming problem.
>
>Not completely; it just gives you teeth for exiting your
>contract earlier and finding a more responsible provider
>to go with who won't violate the terms of the contract
>and re-groom you without proper notification. 

That's a post-facto financial recovery / liability limitation
technique, not a high availability / hardening technique...

>I'll admit
>I'm somewhat simplifying the scenario, in that I also
>insist on no single point of failure, so even an entire
>site going dark doesn't completely knock out service;
>those who have been around since the early days will
>remember my email to NANOG about the gas main cut
>in Santa Clara that knocked a good chunk of the area's
>connectivity out, *not* because the fiber was damaged,
>but because the fire marshal insisted that all active
>electrical devices be powered off (including all UPSes)
>until the gas in the area had dissipated.  Ever since then,
>I've just acknowledged you can't keep a single site always
>up and running; there *will* be events that require it to be
>powered down, and part of my planning process accounts
>for that, as much as possible, via BCP planning. 

I was less than a mile away from that, I remember it well.
My corner cube even faced in that direction.

I heard the noise, then the net went poof.  One of those
"Oh, that's not good at all" combinations.

>Now, I'll
>be the first to admit it's a different game if you're providing
>last-mile access to single-homed customers.  But sitting
>on the content provider side of the fence, it's entirely possible
>to build your infrastructure such that having 3 or more OC192s
>cut at random places has no impact on your ability to carry
>traffic and continue functioning.
>
>>  You have to get out of the game the fiber owners are playing.
>>  They can't even keep score for themselves, much less accurately
>>  for the rest of us.  If you count on them playing fair or
>>  right, they're going to break your heart and your business.
>
>You simply count on them not playing entirely fair, and penalize
>them when they don't; and you have enough parallel contracts with
>different providers at different sites that outages don't take you
>completely offline.

The problem with grooming is that in many cases, due to provider
consolidation and fiber vendor consolidation and cable swap and
so forth, you end up with parallel contracts with different
providers at different sites that all end up going through
one fiber link anyways.

I had (at another site) separate vendors with fiber going
northbound and southbound out of the two diverse sites.

Both directions from both sites got groomed without notification.

Slightly later, the northbound fiber was rerouted a bit up the road
into a southbound bundle (the same one as our now-groomed southbound link),
south to another datacenter, then north again via another path.
This was done to improve northbound route redundancy for the
providers' customer links overall.

And the shared link south of us was what got backhoed.
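
If each provider will hand over route data, that kind of overlap can be
checked for mechanically, and re-checked after every suspected re-groom.
A minimal sketch, assuming each circuit's path can be reduced to a list of
conduit segment identifiers; the circuit names and segment IDs below are
hypothetical and not taken from any real provider's GIS export:

```python
from itertools import combinations


def shared_segments(circuits):
    """Map each pair of circuits to the set of segments they have in common.

    `circuits` is a dict of circuit name -> list of segment identifiers.
    Pairs with no common segment are left out of the result.
    """
    overlaps = {}
    for (name_a, path_a), (name_b, path_b) in combinations(sorted(circuits.items()), 2):
        common = set(path_a) & set(path_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical paths: two "diverse" vendors that were both groomed into
# the same southbound bundle, plus one circuit that really is separate.
circuits = {
    "vendorA-north": ["site1-mh12", "mh12-mh40", "mh40-dc2", "dc2-north"],
    "vendorB-south": ["site2-mh7", "mh12-mh40", "mh40-dc2", "dc2-south"],
    "vendorC-east": ["site1-mh3", "mh3-east"],
}

for pair, segments in shared_segments(circuits).items():
    print(pair, "share", sorted(segments))
```

The check is only as good as the route data fed into it; a provider that
re-grooms without notification will not update the export either, so the
data has to be re-requested and re-verified periodically.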

This was all in one geographical area.  Diversity out of area will get
you around single points like that, if you know the overall topology
of the fiber networks around the US and choose locations carefully.

But even that won't protect you against common-mode vendor hardware
failures, or a large-scale BGP outage, or the routing chaos that comes
with a very serious regional net outage (exchange points, major
undersea cable cuts, etc.).

There may be 4 or 5 nines, but the 1 at the end has your name on it.
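
For a rough sense of the arithmetic behind that, here is a minimal sketch.
It assumes failures are independent, which is exactly the assumption that
grooming breaks, and the function names and example availabilities are
illustrative only:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines):
    """Expected downtime per year for an availability of `nines` nines,
    e.g. nines=4 means 99.99% available."""
    return 10 ** (-nines) * MINUTES_PER_YEAR


def two_paths_with_shared_segment(path_availability, shared_availability):
    """Availability of two independent, identical paths in parallel,
    followed in series by one segment that both of them traverse."""
    either_path_up = 1 - (1 - path_availability) ** 2
    return either_path_up * shared_availability


for n in (3, 4, 5):
    print(n, "nines:", round(downtime_minutes_per_year(n), 1), "minutes/year")

# Two 99.9% paths that both ride one 99.9% shared bundle.
print(two_paths_with_shared_segment(0.999, 0.999))
```

However good the parallel paths are, the combined figure can never exceed
the availability of the segment they share.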


-george william herbert
gherbert at retro.com




