NOC Best Practices

Sat Jul 17 18:56:04 UTC 2010

On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
> Thanks for all the people that replied off list, asking me to send them
> responses i will get.
[snip]
> Which is useful but i am looking for more stuff from the best people that
> run the best NOCs in the world.
> 
> So i'm throwing this out again.
> 
> I am looking for pointers, suggestions, URLs, documents, donations on what a
> professional NOC would have on the below topics:

A lot, as others have said, depends on the business, staffing,
goals, SLA, contracts, etc.

> 1) Briefly, how they handle their own tickets with vendors or internal

Run a proper ticketing system over which you have control (RT and
friends, rather than locking yourself into something where you have
to pay for changes).  Don't judge by ticket closure rate; judge by
successfully resolving problems. Encourage folks to use the system
for tracking projects and keeping notes on work in progress rather
than private datastores. Inculcate a culture of open exploration to
solve problems rather than rote memorization. This gets you a large
part of the way to #2.

> 2) How they create a learning environment for their people (Documenting
> Syslog, lessons learned from problems...etc)

Mentoring, shoulder surfing. Keep your senior people in the mix
of triage & response so they don't get dull and so they
cross-pollinate skills.  When someone is new, have their
probationary period be shadowing the primary on-call the entire
time.  Your third shift [or whatever spans your maintenance
windows] should be the folks who actually wind up executing
well-specified maintenances (with guidance as needed) and be the
breeding ground of some of your better hands-on folks.

> 3) Shift to Shift hand over procedures

This will depend on your systems for tickets, logbooks, etc.
Solve that first and this should become evident.

> 4) Manual tests  they start their day with and what they automate (common
> stuff)

This will vary with the business and what's on-site; I can't
advise you to always include the genset if you don't have
one.

> 5) Change management best practices and working with operations/engineering
> when a change will be implemented

Standing maintenance windows (of varying severity if that
matters to your business), and a clear definition of what must
be done only during those and what can be done anytime
[hint: policy tuning shouldn't be restricted to them, and
you shouldn't make it so that urgent things like a BGP leak
can't be fixed].  Linear rather than parallel workflows
for approval, and not too many approval stages, or else your
staff will spend their time trying to get things through
the administrative stages instead of doing actual work.  Very
simply, have a standard for specifying what needs to be
done, the minimal tests needed to verify success, and how
you fall back if you fail the tests.  If someone can't
specify it and insists on frobbing around, they likely don't
understand the problem or the needed work.
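
As a rough illustration of that last point, here is a minimal
sketch of what such a standard could look like if you wanted
your tooling to enforce it. The ChangeRequest record, its field
names, and the sample change are all hypothetical; adapt them to
whatever your ticketing system actually stores.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative only."""
    summary: str
    steps: List[str] = field(default_factory=list)         # what needs to be done
    verification: List[str] = field(default_factory=list)  # minimal tests for success
    fallback: List[str] = field(default_factory=list)      # how to back out on failure


def missing_sections(cr: ChangeRequest) -> List[str]:
    """Return the names of required sections that are empty or blank."""
    required = {
        "steps": cr.steps,
        "verification": cr.verification,
        "fallback": cr.fallback,
    }
    return [name for name, items in required.items()
            if not any(item.strip() for item in items)]


def is_well_specified(cr: ChangeRequest) -> bool:
    """A change is acceptable only if every required section has content."""
    return not missing_sections(cr)


# Made-up examples: one vague request and one fully specified request.
vague = ChangeRequest(summary="Frob the edge router until it behaves")
good = ChangeRequest(
    summary="Tighten inbound prefix filter on one peer",
    steps=["Apply the updated prefix list to the peer session"],
    verification=["Accepted prefix count from the peer is as expected"],
    fallback=["Re-apply the previous prefix list"],
)

print(missing_sections(vague))
print(is_well_specified(good))
```

The check is deliberately dumb: it only confirms that someone
wrote down the work, the tests, and the fallback. Whether those
are any good is still a job for a human reviewer.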

Cheers,

Joe
-- 
             RSUC / GweepNet / Spunk / FnB / Usenix / SAGE