NOC Best Practices

khatfield at socllc.net
Sat Jul 17 20:02:01 UTC 2010


I have to agree that this is all good information.

Your question on ITIL: my personal opinion is that ITIL best practices are great to apply in any environment. They make sense especially for change control.

However, as stated, it also depends heavily on how many devices are being managed/monitored. I come from a NOC managing 8600+ network devices across 190+ countries.

We had strict change management policies, windows, and approvers, all scheduled around operating hours in the different countries.

We were growing so rapidly that we kept purchasing companies and bringing over their infrastructure, each time inheriting new ticket systems and so on.

NNM (HP's Network Node Manager) is by far one of my favorite choices for network monitoring. The main issue with it is organizing the views so they are easy to read.

RT (Request Tracker) is a great ticketing tool for specific needs. It allows for approvers and approval tracking on tickets. However, it isn't extremely robust.

I would recommend something like HP ServiceCenter, since it can integrate with the monitoring and turn alert output directly into tickets. It also lets you use AlarmPoint for automated paging of your on-calls based on their schedules, by device, etc.
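The exact ServiceCenter/AlarmPoint integration is product-specific, but the general alert-to-ticket idea looks something like the sketch below (Python; the endpoint URL and field names are purely hypothetical, not any vendor's real API):

# Rough sketch of turning a monitoring alert into a ticket automatically.
# The URL and payload fields are hypothetical stand-ins, not the real
# HP ServiceCenter or AlarmPoint interfaces.
import json
import urllib.request

TICKET_API = "https://ticketing.example.net/api/tickets"  # hypothetical

def alert_to_ticket(device, severity, message):
    payload = {
        "summary": f"{device}: {message}",
        "severity": severity,
        "queue": "NOC",
        "source": "monitoring",
    }
    req = urllib.request.Request(
        TICKET_API,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["ticket_id"]  # hypothetical response field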

Not to say that I'm a complete HP fanboy, but I will say it works extremely well. It's easy to use, and simplicity is the key to fewer mistakes.

Our equipment was 99% Cisco, so the combination worked extremely well.

Turnover: I firmly believe shift changes should be handed off verbally. Build a template for the day's top items or most critical issues. List the ongoing issues and any tickets being carried over, with their status. Allot 15 minutes for the team to sit down with the printout and review it.
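As a rough illustration (the sections are just an example; adjust them to your shop), the handoff template can be as simple as:

# Minimal sketch of a shift-handoff template; the sections reflect the
# items above: top/critical items, ongoing issues, and carried-over tickets.
HANDOFF_TEMPLATE = """\
NOC SHIFT HANDOFF  {date}  {outgoing} -> {incoming}

TOP / CRITICAL ITEMS
{critical_items}

ONGOING ISSUES
{ongoing_issues}

TICKETS CARRIED OVER (ticket / status / next action)
{carried_tickets}
"""

def render_handoff(**fields):
    # Print a copy for the 15-minute sit-down review.
    print(HANDOFF_TEMPLATE.format(**fields))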

Contracts/SLA's:
 We placed all of our systems under a bulk 99.999% uptime critical SLA. That was a mistake on our part, driven by a lack of time to plan well while adapting to an ever-changing environment.

It would be best to set up your appliances/hardware in your ticket system and monitoring tool based on the SLA you intend to apply to them. Also ensure you include all hardware information: supply vendor, support vendor, support coverage, ETR from the vendor, and replacement time.
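As a sketch, the per-device record might carry something like this (the field names are illustrative, not from any particular ticketing or monitoring product):

# One way to capture the per-device fields above; the schema is illustrative.
device_record = {
    "hostname": "nycgw01",
    "sla": "99.999%",                # drives alert priority and paging
    "supply_vendor": "Cisco",
    "support_vendor": "Cisco TAC",   # or a third-party maintenance vendor
    "support_coverage": "24x7x4",    # coverage window on the contract
    "vendor_etr": "4 hours",         # estimated time to repair from the vendor
    "replacement_time": "next business day",
}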

There are many tools that do automated discovery on your network and monitor changes to it. This is key if you have a changing environment. The more devices you have, the more difficult it is to pinpoint what a failed router or switch ACTUALLY affects upstream or downstream.
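The change-detection half can be as simple as diffing today's discovered inventory against yesterday's; a minimal sketch (the file names are just examples of whatever your discovery tool produces):

# Minimal change detection: diff today's discovered inventory against
# yesterday's. The inventory files hold one device per line; the names
# here are examples only.
def load_inventory(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

old = load_inventory("inventory-yesterday.txt")
new = load_inventory("inventory-today.txt")

for device in sorted(new - old):
    print("ADDED:  ", device)
for device in sorted(old - new):
    print("REMOVED:", device)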

If you have the chance, take the opportunity to map your hardware/software dependencies. For example: if a switch fails, it provides service to db01, and db01 drives a service in another location, you should know the failure reaches that far. It's far too common for companies to get so large that they have no idea what impact one port failure in xyz has on the entire infrastructure.
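A toy sketch of that db01 example, treating the dependencies as a graph and walking it to see what a failure reaches (the device names are only from the example above):

# Map each device to what it feeds, then walk the graph to see what a
# failure actually reaches downstream. Toy data from the db01 example.
from collections import deque

feeds = {
    "sw-nyc-01": ["db01"],          # the switch provides service to db01
    "db01": ["service-london"],     # db01 drives a service in another location
}

def impact(failed_device):
    hit, queue = set(), deque([failed_device])
    while queue:
        node = queue.popleft()
        for child in feeds.get(node, []):
            if child not in hit:
                hit.add(child)
                queue.append(child)
    return hit

print(impact("sw-nyc-01"))   # -> {'db01', 'service-london'}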

Next: build your monitoring infrastructure completely separate from the rest of the network. If you don't do switch redundancy (active/passive) on all of your systems or NIC teaming (active/passive), then ensure you at least do it on your monitoring systems.

Build out your logging in a PCI/SOX fashion. Ensure you have remote logging on everything, with log retention based on your needs, and run Tripwire with approved reports sent weekly for the systems requiring PCI/SOX monitoring.

Remember, if your monitoring systems go down, your NOC is blind. It's highly recommended that the NOC have gateway/jump box systems available to all parts of the network. Run management completely on RFC1918 space for security.

Ensure all on-calls have access; use a VPN solution that requires a password plus a VPN token/key generator. Use TACACS/LDAP as much as you can. Tighten everything. Log everything. I can't say that enough.

Enforce password changes every 89 days, require strong, non-dictionary passwords, etc.

Build an internal site, use a wiki-based format, and give the team the ability to add/modify with approval. Build a FAQ/knowledgebase. Possibly create a forum so your team can post extra tips, notes, and one-offs; anything that may help new members, or people who run across something in the middle of the night that they may never have seen. This keeps your lead staff from being woken up in the middle of the night.

On-calls: always have a primary and a secondary, with a clearly documented on-call procedure (a rough sketch of the timing follows the list below).
Example (critical):
1. Issue occurs.
2. Page the on-call within 10 minutes.
3. Allow 10 minutes for a return call.
4. Page again.
5. Allow 5 minutes.
6. Page the secondary.
Etc.
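A rough sketch of that timing in code (pager_send() is a stand-in for whatever paging tool you actually use, and answered() would check the ticket for an acknowledgement):

# Rough sketch of the escalation timing above; pager_send() and answered()
# are placeholders for your real paging tool and acknowledgement check.
import time

def pager_send(who, message):
    print(f"PAGE {who}: {message}")       # stand-in for the real paging call

def escalate(primary, secondary, message, answered):
    pager_send(primary, message)          # 2. page the on-call
    time.sleep(10 * 60)                   # 3. allow 10 minutes for a return call
    if answered():
        return
    pager_send(primary, message)          # 4. page again
    time.sleep(5 * 60)                    # 5. allow 5 minutes
    if answered():
        return
    pager_send(secondary, message)        # 6. page the secondary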

Ensure staff document every step they take and copy/paste every page they send into the ticket system.

Build templated paging formats. Understand that text messages have hard length limits with most carriers (roughly 160 characters). Use something like:
Time InitialsOfNOCPerson SystemAlerting Error CallbackNumber

(e.g. 14:05 KH nycgw01 System reports down 555-555-5555 x103)
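A quick sketch of building that page string and keeping it under the usual ~160-character SMS limit (truncate it yourself rather than letting the carrier split or drop it):

# Build a page in the Time Initials System Error Callback format and trim
# it to the ~160-character SMS limit most carriers enforce.
from datetime import datetime

def format_page(initials, system, error, callback, limit=160):
    msg = f"{datetime.now():%H:%M} {initials} {system} {error} {callback}"
    return msg[:limit]

print(format_page("KH", "nycgw01", "System reports down", "555-555-5555 x103"))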

Use an internal paging website/tool or, as mentioned, something like AlarmPoint.

There is nothing more frustrating for an on-call than to be paged and have no idea who to call back, who paged, or what the number is.

I've written so much my fingers hurt from these Blackberry keys. Hope this information helps a little.

Best of luck,
-Kevin

Excuse the spelling/punctuation... This is from my mobile.
-----Original Message-----
From: Joe Provo <nanog-post at rsuc.gweep.net>
Date: Sat, 17 Jul 2010 14:56:04 
To: Kasper Adel<karim.adel at gmail.com>
Reply-To: nanog-post at rsuc.gweep.net
Cc: NANOG list<nanog at nanog.org>
Subject: Re: NOC Best Practices

On Fri, Jul 16, 2010 at 09:34:53PM +0300, Kasper Adel wrote:
> Thanks to all the people that replied off-list, asking me to send them the
> responses I will get.
[snip]
> Which is useful, but I am looking for more stuff from the best people that
> run the best NOCs in the world.
> 
> So I'm throwing this out again.
> 
> I am looking for pointers, suggestions, URLs, documents, donations on what a
> professional NOC would have on the below topics:

A lot, as others have said, depending on the business, staffing, 
goals, SLA, contracts, etc.

> 1) Briefly, how they handle their own tickets with vendors or internal

Run a proper ticketing system over which you have control (RT and 
friends rather than being locked into something you have to pay 
to change).  Don't judge just by ticket closure rate; judge by 
successfully resolving problems. Encourage folks to use the system 
for tracking projects and keeping notes on work in progress rather 
than private datastores. Inculcate a culture of open exploration to 
solve problems rather than rote memorization. That gets you a long 
way toward #2.

> 2) How they create a learning environment for their people (Documenting
> Syslog, lessons learned from problems...etc)

Mentoring, shoulder surfing. Keep your senior people in the mix 
of triage & response so they don't get dull and to cross-pollinate 
skills.  When someone is new, have their probationary period be 
shadowing the primary on-call the entire time.  Your third shift 
[or whatever spans your maintenance windows] should be the folks 
who actually wind up executing well-specified maintenances (with 
guidance as needed) and be the breeding ground of some of your 
better hands-on folks.

> 3) Shift to Shift hand over procedures

This will depend on your systems for tickets, logbooks, etc. 
Solve that first and this should become evident.

> 4) Manual tests  they start their day with and what they automate (common
> stuff)

This will vary with the business and what's on-site; I can't 
advise you to always include the genset if you don't have 
one.

> 5) Change management best practices and working with operations/engineering
> when a change will be implemented

Standing maintenance windows (of varying severity if that 
matters to your business), clear definition of what needs 
to be done only during those and what can be done anytime 
[hint: policy tuning shouldn't be restricted to them, and 
you shouldn't make it so urgent things like a BGP leak 
can't be fixed].  Linear rather than parallel workflows 
for approval, and not too many approval stages, else your 
staff will be spending time trying to get things through 
the administrative stages instead of doing actual work.  Very
simply, have a standard for specifying what needs to be 
done, the minimal tests needed to verify success, and how
you fall back if you fail the tests.  If someone can't 
specify it and insists on frobbing around, they likely don't 
understand the problem or the needed work.
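To make that concrete, a minimal change specification might carry something like the following (a sketch only; the field names and the example change aren't from any particular tool or process):

# Sketch of the minimum a change should specify: the work, the tests
# that verify success, and the fallback if the tests fail. Field names
# and the example change are illustrative only.
change_request = {
    "summary": "Replace line card in nycgw01",
    "window": "Sat 02:00-04:00 local",
    "steps": [
        "Drain traffic away from the device",
        "Swap the line card",
        "Restore traffic",
    ],
    "verification_tests": [
        "Interfaces up/up",
        "Routing sessions re-established",
        "No new alarms for 30 minutes",
    ],
    "fallback": "Reinstall the old card and restore the previous config",
}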

Cheers,

Joe
-- 
             RSUC / GweepNet / Spunk / FnB / Usenix / SAGE


