Revisiting the Aviation Safety vs. Networking discussion
gbonser at seven.com
Fri Dec 25 02:16:43 CST 2009
I think any network engineer who sees a major problem is going to have a
"Houston, we have a problem" moment. And actually, he was telling the
ATC what he was going to need to do; he wasn't getting permission so
much as telling them what he was doing so traffic could be cleared out
of his way. First he told them he was returning to the airport, then he
inquired about Teterboro, the ATC called Teterboro to get a runway
and inform them of an inbound emergency, then the Captain told the ATC
they were going to be in the Hudson. And "I hit birds, have lost both
engines, and am turning back" results in a whole different chain of
events these days than "I have two guys banging on the cockpit door and
am returning" or simply turning back toward the airport with no
communication. And any network engineer is going to say something if he
sees CPU or bandwidth utilization hit the rail in either direction.
Saying something like "we just got flooded with thousands of /24 and
smaller wildly flapping routes from peer X and I am shutting off the BGP
session until they get their stuff straight" is different than "we just
got flooded with thousands of routes and it is blowing up the router and
all the other routers talking to it. Can I do something about it?"
And that illustrates a key point. In that case the ATC was
asking what the pilot needed and was prepared to clear traffic, get
emergency equipment ready, whatever it took to give the person
dealing with the problem whatever he needed to resolve it in the
best way. The ATC wasn't asking him whether he was sure he had set the
flaps at the right angle, or "did you try to restart the engine" sorts
of questions.

What I was getting at is that sometimes too much process can get in the
way in an emergency and the time taken to implement such process can
result in a failure cascading through the network making the problem
much worse. I have much less of a problem with process surrounding
planned events. The more the better as long as it makes sense.
Migrations and additions and modifications *should* be well planned and
checklisted and have backout points and procedures. That is just good
operations when you have tight SLAs and tight maintenance windows with
customers you want to keep.
From: Scott Howard
"mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost
thrust (in/on) both engines we're turning back towards LaGuardia" -
Not exactly "detailed", but he definitely initiated an "incident report"
(the mayday), gave a "description of what was happening with his plane",
the "status of [the relevant] subsystems", and his proposed plan of
action - even in the order you've asked for!
His actions were then "subject to the consensus of those on the
conference bridge" (i.e., ATC) who could have denied his actions if they
believed those actions would have made the situation worse (i.e., if what
he was proposing would have put him on a collision course with another plane).
In this case, the conference bridge gave approval for his course of
action ("ok uh, you need to return to LaGuardia? turn left heading of uh
two two zero." - ATC)
5 seconds before they made the above call they were reaching for the QRH
(Quick Reference Handbook), which contains checklists of the steps to
take in such a situation - including what to do in the event of loss of
both engines due to multiple birdstrikes. They had no need to confer
with others as to what actions to take to try and recover from the
problem, or what order to take them in, because that pre-work had
already been carried out when the checklists were written.
Of course, at the end of the day, training, skill and experience played
a very large part in what transpired - but so did the actions of the
people on the "conference bridge" (You can't get much more of a
"conference bridge" than open radio frequencies), and the checklists
they have for almost every conceivable situation.