Revisiting the Aviation Safety vs. Networking discussion

Vadim Antonov avg at kotovnik.com
Fri Dec 25 10:44:09 UTC 2009


Just to clear up a small point about pilots (I'm a pilot) - the
pilot-in-command has ultimate responsibility for his a/c and can ignore
whatever ATC tells him to do if he considers it contrary to the safety of
his flight (he may be asked to explain his actions later, though). Now,
usually ignoring ATC or keeping it in the dark about one's intentions is
not very clever - but controllers are not in the cockpit and may
misunderstand the situation or simply be mistaken about something (so a
pilot is encouraged to decline ATC instructions he considers to be in
error - informing ATC about it, of course).

But one of the first things a pilot does in an emergency is pull out the
appropriate emergency checklist.  It is all too easy to forget to check
obvious things when the situation gets hectic (one of the distressingly
common causes of accidents is simply running out of fuel - either because
the pilot didn't do his homework on the ground (checking the actual fuel
level in the tanks, etc.) or because, when the engine suddenly went quiet,
he forgot to switch to another, non-empty, tank).

The mantra about priorities in both normal and emergency situations is
"Aviate-Navigate-Communicate", meaning that maintaining control of the a/c
always comes first, no matter what. Knowing where you are and where you
are going (and other pertinent situational awareness, such as the
condition of the a/c and the current plan of action) comes second.
Talking is the lowest priority.

Pre-planned emergency checklists may be a good idea for network operators,
too.  Try the obvious (obvious when you're calm, that is) actions first;
if they fail to help, try to limit the damage.  Only then file the ticket
and talk to the people who can investigate the situation in depth and
develop a fix.
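
As a purely illustrative sketch - every step, prompt, and hint below is
hypothetical, not operational advice for any particular network - such a
checklist could be as simple as an ordered list a front-line operator is
walked through before escalating:

# Illustrative only: a toy step-by-step outage checklist runner.
# The steps and hints are made up for the example.
EMERGENCY_CHECKLIST = [
    ("Confirm the alarm is real", "check a second, independent data source"),
    ("Identify the blast radius", "which links / peers / customers are affected?"),
    ("Try the obvious first", "flapping interface? recent config push to roll back?"),
    ("Limit the damage", "e.g. shut the offending session or reroute around it"),
    ("Open the ticket and escalate", "attach what was checked and what was changed"),
]

def run_checklist(checklist):
    """Walk the operator through each step and record what was done."""
    log = []
    for step, hint in checklist:
        print("STEP: %s\n  hint: %s" % (step, hint))
        note = input("  action taken (blank = skipped): ").strip()
        log.append((step, note or "skipped"))
    return log

if __name__ == "__main__":
    for step, note in run_checklist(EMERGENCY_CHECKLIST):
        print("%s: %s" % (step, note))

The point is not the tooling - a laminated card works just as well - but
that the order of actions is decided calmly in advance, not improvised in
the middle of the outage.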

The way the aviation industry comes up with these checklists is,
basically, experience - it pays to debrief after recovering from every
problem not adequately covered by existing procedures, find the common
ones, and develop a diagnostic procedure one could follow step by step in
those situations.  (The non-punitive error and incident reporting system,
which actually shields pilots from FAA enforcement actions in most cases,
also helps to collect real-world information on where and how pilots get
into trouble.)

The all-too-common multistep ticket escalation chains (which merely act as
delay lines in a significant portion of cases) are something to be
avoided.

Even better is to give front-line personnel some drilling in diagnosing
and recovering from common problems - starting with following the
checklist on a simulated outage in the lab, and then getting it down to
what pilots call "the flow": a habitual, memorized procedure which is
performed first and then checked against the checklist.

Note that the use of checklists, drilling, and flows does not turn pilots
into a kind of robot - they still have to make decisions and to recognize
and deal with situations not covered by the standard procedures; what it
does is speed up the handling of common tasks, reduce mistakes, and free
up mental capacity for thinking ahead.

The ISP industry has a long way to go before it reaches the same level of
sophistication in handling problems that aviation has.

--vadim

On Fri, 25 Dec 2009, George Bonser wrote:

> I think any network engineer who sees a major problem is going to have a
> "Houston, we have a problem" moment.  And actually, he was telling the
> ATC what he was going to need to do, he wasn't getting permission so
> much as telling them what he was doing so traffic could be cleared out
> of his way. First he told them he was returning to the airport, then he
> inquired about Teterboro, the ATC called Teterboro to get a runway
> and inform them of an inbound emergency, then the Captain told the ATC
> they were going to be in the Hudson.  And "I hit birds, have lost both
> engines, and am turning back" results in a whole different chain of
> events these days than "I have two guys banging on the cockpit door and
> am returning" or simply turning back toward the airport with no
> communication.  And any network engineer is going to say something if he
> sees CPU or bandwidth utilization hit the rail in either direction.
> Saying something like "we just got flooded with thousands of /24 and
> smaller wildly flapping routes from peer X and I am shutting off the BGP
> session until they get their stuff straight" is different than "we just
> got flooded with thousands of routes and it is blowing up the router and
> all the other routers talking to it.  Can I do something about it?"
> 
>  
> 
> And that illustrates a point that is key.  In that case the ATC was
> asking what the pilot needed and was prepared to clear traffic, get
> emergency equipment prepared, whatever it took to get that person
> dealing with the problem whatever they needed to get it resolved in the
> best way forward.  The ATC isn't asking him if he was sure he set the
> flaps at the right angle and "did you try to restart the engine" sorts
> of things.
> 
>  
> 
> What I was getting at is that sometimes too much process can get in the
> way in an emergency and the time taken to implement such process can
> result in a failure cascading through the network making the problem
> much worse.  I have much less of a problem with process surrounding
> planned events.  The more the better as long as it makes sense.
> Migrations and additions and modifications *should* be well planned and
> checklisted and have backout points and procedures.  That is just good
> operations when you have tight SLAs and tight maintenance windows with
> customers you want to keep.
> 
>  
> 
> Happy Holidays
> 
>  
> 
> George
> 
>  
> 
>  
> 
> From: Scott Howard
> 
> 
> 
> 
> 
> "mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost
> thrust (in/on) both engines we're turning back towards LaGuardia" -
> Capt. Sullenberger
> 
> Not exactly "detailed", but he definitely initiated an "incident report"
> (the mayday), gave a "description of what was happening with his plane",
> the "status of [the relevant] subsystems", and his proposed plan of
> action - even in the order you've asked for!
> 
> 
> His actions were then "subject to the consensus of those on the
> conference bridge" (ie, ATC) who could have denied his actions if they
> believed they would have made the situation worse (ie, if what they were
> proposing would have had them on a collision course with another plane).
> In this case, the conference bridge gave approval for his course of
> action ("ok uh, you need to return to LaGuardia? turn left heading of uh
> two two zero." - ATC)
> 
> 5 seconds before they made the above call they were reaching for the QRH
> (Quick Reference Handbook), which contains checklists of the steps to
> take in such a situation - including what to do in the event of loss of
> both engines due to multiple birdstrikes.  They had no need to confer
> with others as to what actions to take to try and recover from the
> problem, or what order to take them in, because that pre-work had
> already been carried out when the check-lists were written.
> 
> Of course, at the end of the day, training, skill and experience played
> a very large part in what transpired - but so did the actions of the
> people on the "conference bridge" (You can't get much more of a
> "conference bridge" than open radio frequencies), and the checklists
> they have for almost every conceivable situation.
> 
>   Scott.
> 
> 




