[outages] News item: Blackberry services down worldwide, Egypt affected (not N.A.)

Wed Oct 12 15:54:48 UTC 2011

+1
On Oct 12, 2011 11:51 AM, <Valdis.Kletnieks at vt.edu> wrote:

> On Wed, 12 Oct 2011 09:52:02 CDT, -Hammer- said:
> > What kills me is what they have told the public. They lost a "core
> > switch". I don't know if they actually mean network switch or not but
> > I'm pretty sure any of us that work in an enterprise environment know
> > how to factor N+1 just for these types of days. And then the backup
> > solution failed? I'm not buying it either.
>
> Yeah, and that extra comma in the one config file that didn't make a
> difference when you tested the failover in the lab *never* makes a
> difference when it hits in the production network, right?  Or they changed
> the config of the primary and it didn't get propagated just right to the
> backup, or they had mismatched firmware levels on the blades in the primary
> and backup switches, so traffic that didn't tickle a bug on the primary
> blades caused the blade to crash on the backup, or...
>
> Anybody on this list who's been around long enough probably has enough "We
> should have had N+2 because the N+1'th device failed too" stories to drain
> *several* pitchers of beer at a good pub... I've even had one case where my
> butt got *saved* from an ohnosecond-class whoops because the N+1'th device
> *was* crashed (stomped a config file, it replicated, was able to salvage a
> copy from a device that didn't replicate because it was down at the time).