Revisiting the Aviation Safety vs. Networking discussion

George Bonser gbonser at seven.com
Fri Dec 25 02:27:37 UTC 2009



> -----Original Message-----
> From: Dobbins, Roland
>
> On Dec 25, 2009, at 7:01 AM, Michael Dillon wrote:
> 
> > It would be interesting to see what others have to say about this
> answer.
> 
> I think it's a pretty accurate summation of how these things work in a
> lot of big organizations, all over the world.


I think that one must keep in mind that there are two kinds of
check-lists.  There is the takeoff checklist, where you can always choose
to go back to the ramp and fly another day if something doesn't check
out, but the priorities are different once you are already in the air and
something goes wrong.  You can't decide to land a different day.  In
that case you must rely on experience and knowledge to handle the
situation as it presents itself.  Sure, you can have some basic checks
for things even in an emergency, but you can't know ahead of time how the
problem is going to present itself.  In cases like that you have a set
of general parameters, but the person "at the controls" needs the leeway
to both clearly identify the nature of the problem and mitigate it if
possible, and that might include calling in some extra eyes to identify
things that might be going on with applications or other devices that
aren't specifically network gear.

So you can put a lot of process around changes in advance, but there
isn't quite as much you can do to manage incidents that strike out of the
clear blue.  Too much process at that point can impede progress in
clearing the issue.  Capt. Sullenberger did not need to fill out an
incident report, bring up a conference bridge, give a detailed
description of what was happening with his plane and the status of all
subsystems, present his proposed plan of action (subject to consensus of
those on the conference bridge), and get approval for deviation from his
initial flight plan before he took the actions required to land the plane
as best he could under the circumstances.  That example is a bit extreme
for most networks, in that lives are not often at stake, but some of the
concepts are the same (and there might be networks supporting certain
activities on this planet where lives actually would be at stake in the
case of a network failure).

One of the most efficient shops I worked in was one where the production
internet operation was owned by the engineering department.  Corporate
operations owned the internal corporate IT, but engineering owned the
internet production data centers and network operations.  If engineering
released a code revision that blew up the network, the VP of Engineering
was responsible for the entire picture, not just the software piece.
The same was true when a networking change blew up the application.
Having responsibility for the entire "system" (software, hardware
platforms, and networking) under the same organization resulted in a much
smoother operation, without backbiting, and with greater access to and
sharing of resources between the application engineers, the systems
administrators, and the network engineers.




