Cloudflare is down

George Herbert george.herbert at gmail.com
Mon Mar 4 23:00:26 UTC 2013


On Mon, Mar 4, 2013 at 10:40 AM, Saku Ytti <saku at ytti.fi> wrote:
> On (2013-03-04 13:23 -0500), Jeff Wheeler wrote:
>
>> We have lots of stupid people in our industry because so few
>> understand "The Way Things Work."
>
> We have tendency to view mistakes we do as unavoidable human errors and
> mistakes other people do as avoidable stupidity.
>
> We should actively plan for mistakes/errors, if you actively plan for no
> 'stupid mistakes', you're gonna have bad time
>
> From my point of view, outages are caused by:
> 1) operator
> 2) software defect
> 3) hardware defect
>
> Most people design only against 3), often with design which actually
> increases likelihood of 2) and 1), reducing overall MTBF on design which
> strictly theoretically increases it.

...And a lot of people who know the hierarchy solve 3 and then solve 2
with multiple parallel environments built on different vendors'
equipment, only to find that 1 increased due to the additional
complexity.
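
To put rough numbers on that tradeoff, here is a minimal sketch. It
assumes the three causes are independent, so their outage rates simply
add. Every rate below is made up for illustration, not measured, and
mtbf_hours is a hypothetical helper name.

```python
HOURS_PER_YEAR = 8760


def mtbf_hours(operator_rate, software_rate, hardware_rate):
    """Overall MTBF in hours when any single cause produces an outage.

    Rates are in outages per hour. Causes are assumed independent,
    so the total outage rate is the sum of the three.
    """
    return 1.0 / (operator_rate + software_rate + hardware_rate)


# Hypothetical single-vendor design: 1 operator, 0.5 software and
# 0.5 hardware outages per year.
simple = mtbf_hours(1.0 / HOURS_PER_YEAR,
                    0.5 / HOURS_PER_YEAR,
                    0.5 / HOURS_PER_YEAR)

# Hypothetical redundant design: hardware outages drop tenfold, but the
# added complexity is assumed to double operator and software outages.
redundant = mtbf_hours(2.0 / HOURS_PER_YEAR,
                       1.0 / HOURS_PER_YEAR,
                       0.05 / HOURS_PER_YEAR)

print(f"simple design:    {simple:.0f} hours between outages")
print(f"redundant design: {redundant:.0f} hours between outages")
```

With these assumed numbers the redundant design ends up with the lower
overall MTBF, even though its hardware is ten times more reliable.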

On the other hand, I've seen people who had horrible explosions of 2
or 3 due to ignoring all but 1.

If you ACTUALLY need that many 9s, you need all of redundancy,
diversity of vendors, and suitably trained, exercised,
process-supported net admins.  That's a few factors of 2 more
expense than nearly anyone typically wants to pay for.


-- 
-george william herbert
george.herbert at gmail.com



