Mitigating human error in the SP
ross at kallisti.us
Wed Feb 3 10:14:09 CST 2010
On Mon, Feb 01, 2010 at 09:46:07PM -0500, Stefan Fouant wrote:
> Vijay Gill had some real interesting insights into this in a
> presentation he gave back at NANOG 44:
> His Blog article on "Infrastructure is Software" further expounds
> upon the benefits of such an approach -
> That stuff is light years ahead of anything anybody is doing today
> (well, apart from maybe Vijay himself ;) ... but IMO it's where we
> need to start heading.
Vijay's stuff is fascinating. The vision is great. But in my
experience, the vendors and implementations basically ruin the dream
for anyone who doesn't have his pull.
I'm sure my software is nowhere close to being as sophisticated as
his, but my plans are pretty much in line with his suggestions. Some
problems I've run into that I don't see any kind of solution for:
1) Forwarding-impacting bugs: IOS bugs that are triggered by SNMP are
easily the #1 cause of our accidental service impact. Most seem to be
race conditions that require real-world config and forwarding load -
not something a small shop can afford to build a lab to reproduce. If
we had stuck to manual deployment, we might have made a few mistakes,
but would it have been worse? Maybe - but honestly, it could be a wash.
2) Vendor support is highly suspicious of automation: anytime I open a
ticket, even unrelated to an automated software process, the first
thing the vendor support demands is to disable all automation.
Juniper is by far the best about this, and they *still* don't actually
believe their own automation tools work. Cisco TAC's answer has
always been "don't ever use SNMP if it causes crashes!" Procurve
doesn't even bother to respond to tickets related to automation bugs,
even if they are remotely triggerable crashes in the default config.
3) Automation interfaces are largely unsupported: I imagine vendor
software development having one or two guys that are the masterminds
for SNMP/NETCONF/whatever - and that's it. When I have a question on
how to find a particular tool, or find a bug in an automation
function, I can often go months on a ticket with people that have no
idea what I'm talking about. What documentation exists is typically
incomplete or inconsistent across versions and product lines.
4) Related tools prevent reliable error reporting: as far as I can
tell, Net-SNMP returns random values if a request fails; if there's a
pattern, I've failed to discern it. expect is similar. ScreenOS's
SSH implementation always returns that a file copy failed. Procurve
only this year implemented ssh key-based auth in combination with
remote authentication. The best-of-breed seems to be an oft-pathetic
collection of tools.
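One common way to cope with the unreliable error reporting described in
point 4 is to stop trusting any single success signal from a tool and
also validate the shape of what it returned. The following is only a
minimal sketch under assumptions: run_checked and CommandError are
hypothetical names, the pattern passed in is whatever the caller expects
for that query, and the demo uses a Python one-liner as a stand-in
rather than a real SNMP or SSH client.

```python
import re
import subprocess
import sys


class CommandError(Exception):
    """Raised when a command's result cannot be trusted (hypothetical name)."""


def run_checked(argv, expect_pattern, timeout=10.0):
    """Run argv and return its stdout only if it matches expect_pattern.

    The exit status is checked, but it is not treated as sufficient proof
    of success: the output must also have the shape the caller expects.
    """
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired as exc:
        raise CommandError(f"timed out after {timeout}s: {argv!r}") from exc
    except OSError as exc:
        raise CommandError(f"could not start {argv!r}: {exc}") from exc

    out = proc.stdout.strip()
    if proc.returncode != 0:
        raise CommandError(
            f"exit status {proc.returncode}: {proc.stderr.strip()!r}"
        )
    # A zero exit status alone is not believed; the whole output must match.
    if not re.fullmatch(expect_pattern, out):
        raise CommandError(f"unexpected output: {out!r}")
    return out


if __name__ == "__main__":
    # Stand-in command: a Python one-liner, not a real device query.
    print(run_checked([sys.executable, "-c", "print(42)"], r"\d+"))
```

This does not fix a tool that lies about success, and for a tool that
always reports failure the exit-status check would have to be dropped in
favor of some independent verification. It only narrows the cases where
a bad result gets pushed further into the automation.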
5) Management support: developing automation software is hard - network
devices aren't nearly as easy to deal with as they should be. When I
spend weeks developing features that later cause IOS to spontaneously
reload, people that don't understand the relation to operational
impact start to advocate dismantling the automation just like the
I'm sure we'll continue to build automated policy and configuration
tools. I'm just not convinced it's the panacea that everyone thinks.
Unless you're one of the biggest, it puts your network at someone
else's mercy - and that someone else doesn't care about your
ross at kallisti.us
"If the fight gets hot, the songs get hotter. If the going gets tough,
the songs get tougher."