NOC Best Practices
Joe Provo
nanog-post at rsuc.gweep.net
Sat Jul 17 18:56:04 UTC 2010
On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
> Thanks for all the people that replied off list, asking me to send them
> responses i will get.
[snip]
> Which is useful but i am looking for more stuff from the best people that
> run the best NOCs in the world.
>
> So i'm throwing this out again.
>
> I am looking for pointers, suggestions, URLs, documents, donations on what a
> professional NOC would have on the below topics:
A lot, as others have said, depending on the business, staffing,
goals, SLA, contracts, etc.
> 1) Briefly, how they handle their own tickets with vendors or internal
Run a proper ticketing system over which you have control (RT and
friends, rather than locking yourself into something where you have
to pay for changes). Don't judge just by ticket closure rate; judge
by successfully resolving problems. Encourage folks to use the system
for tracking projects and keeping notes on work in progress rather
than private datastores. Inculcate a culture of open exploration to
solve problems rather than rote memorization. This gets you a large
way to #2.
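To make the closure-rate point concrete, here is a minimal sketch in
Python. The Ticket record and its "reopened" flag are assumptions made
for illustration, not fields of RT or any other real ticketing system;
the idea is only that a ticket closed and later reopened counts as
closed but not as resolved.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical record; a real system would export far more fields.
    ticket_id: int
    closed: bool
    reopened: bool  # assumed flag: the problem came back after closure


def closure_rate(tickets):
    """Fraction of tickets that are currently marked closed."""
    if not tickets:
        return 0.0
    return sum(t.closed for t in tickets) / len(tickets)


def resolution_rate(tickets):
    """Fraction of tickets that were closed and never reopened."""
    if not tickets:
        return 0.0
    return sum(t.closed and not t.reopened for t in tickets) / len(tickets)


tickets = [
    Ticket(1, closed=True, reopened=False),
    Ticket(2, closed=True, reopened=True),
    Ticket(3, closed=True, reopened=True),
    Ticket(4, closed=False, reopened=False),
]

print(closure_rate(tickets))     # 0.75
print(resolution_rate(tickets))  # 0.25
```

A queue like this looks healthy by closure rate alone, while only one
problem in four actually stayed fixed.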
> 2) How they create a learning environment for their people (Documenting
> Syslog, lessons learned from problems...etc)
Mentoring, shoulder surfing. Keep your senior people in the mix
of triage & response so they don't get dull and so skills
cross-pollinate.
skills. When someone is new, have their probationary period be
shadowing the primary on-call the entire time. Your third shift
[or whatever spans your maintenance windows] should be the folks
who actually wind up executing well-specified maintenances (with
guidance as needed) and be the breeding ground of some of your
better hands-on folks.
> 3) Shift to Shift hand over procedures
This will depend on your systems for tickets, logbooks, etc.
Solve that first and this should become evident.
> 4) Manual tests they start their day with and what they automate (common
> stuff)
This will vary with the business and what's on-site; I can't
advise you to always include the genset if you don't have
one.
> 5) Change management best practices and working with operations/engineering
> when a change will be implemented
Standing maintenance windows (of varying severity if that
matters to your business), clear definition of what needs
to be done only during those and what can be done anytime
[hint: policy tuning shouldn't be restricted to them, and
you shouldn't make it so urgent things like a BGP leak
can't be fixed]. Linear rather than parallel workflows
for approval, and not too many approval stages, else your
staff will be spending time trying to get things through
the administrative stages instead of actual work. Very
simply, have a standard for specifying what needs to be
done, the minimal tests needed to verify success, and how
you fall back if you fail the tests. If someone can't
specify it and insists on frobbing around, they likely don't
understand the problem or the needed work.
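As a rough sketch of such a standard, assuming nothing about any
particular change-management tool: the ChangeRequest class, its field
names, and the two-stage APPROVAL_CHAIN below are all hypothetical.
The point is only that a change without verification tests or a
fallback plan is rejected before approval, and that approvals are
collected one after another in a short fixed order.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    # All field names are illustrative assumptions.
    summary: str
    steps: list                # what needs to be done
    verification: list         # minimal tests that show success
    fallback: list             # how to back out if the tests fail
    window_only: bool = True   # False for work allowed at any time
    approvals: list = field(default_factory=list)

    def missing_parts(self):
        """Return the names of required sections that are empty."""
        required = {
            "steps": self.steps,
            "verification": self.verification,
            "fallback": self.fallback,
        }
        return [name for name, value in required.items() if not value]

    def is_well_specified(self):
        return not self.missing_parts()


# Hypothetical approval chain, deliberately short.
APPROVAL_CHAIN = ["peer", "shift-lead"]


def approve(change, approver):
    """Record one approval, strictly in chain order (linear workflow).

    Returns True once the change is fully approved.
    """
    if not change.is_well_specified():
        raise ValueError(
            f"underspecified change, missing: {change.missing_parts()}"
        )
    position = len(change.approvals)
    if position >= len(APPROVAL_CHAIN):
        raise ValueError("change is already fully approved")
    expected = APPROVAL_CHAIN[position]
    if approver != expected:
        raise ValueError(
            f"expected approval from {expected!r}, got {approver!r}"
        )
    change.approvals.append(approver)
    return len(change.approvals) == len(APPROVAL_CHAIN)


good = ChangeRequest(
    summary="tighten peer prefix filter",
    steps=["apply updated filter"],
    verification=["peer session stays up"],
    fallback=["restore previous filter"],
)
vague = ChangeRequest(
    summary="frob the router",
    steps=["poke at it"],
    verification=[],
    fallback=[],
)

print(good.is_well_specified())  # True
print(vague.missing_parts())     # ['verification', 'fallback']
```

An urgent fix (the BGP leak case) would be filed with
window_only=False so that the window policy does not block it; how
that flag is enforced is left out of this sketch.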
Cheers,
Joe
--
RSUC / GweepNet / Spunk / FnB / Usenix / SAGE