Summary: Reportable Metrics

Mon Nov 12 20:35:13 UTC 2001

Here is a summary of a recent query I posted to the list.

Reportable Metrics:

Original Query: to identify a list of network metrics to compile and report
to management on a monthly basis. (Emphasis is on metrics and not the tools
used to gather them.)

My original list: 

1. Uptime per WAN or Internet circuit
2. # and average length of outages
3. Bandwidth utilization per WAN/Internet circuit and "important" VLANs
4. Overall network latency: RTT measured from various parts of the network
(Cisco IPM) to various other parts
5. Top talkers per WAN circuit
6. Top destinations per WAN circuit
7. Top 10 most utilized WAN circuits (% burst above CIR, etc)
8. Protocol distribution per WAN circuit
9. Syslog/Sniffer alarms by severity
10. Application response time for key apps (e.g., SAP, HTTP)
11. Security incidents
12. TACACS reports on number of logins, changes, etc.
13. Bandwidth/latency trending
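
As a rough illustration of items 1 and 2 (uptime per circuit, number and
average length of outages), here is a minimal Python sketch. It assumes the
outages for one circuit are already available as non-overlapping start/end
timestamps; the function name and the sample data are hypothetical and not
tied to any particular tool.

```python
from datetime import datetime, timedelta

def outage_summary(outages, period_start, period_end):
    """Return (uptime_percent, outage_count, average_outage_length).

    outages: list of (start, end) datetime pairs for one circuit,
    assumed not to overlap each other.
    """
    period = period_end - period_start
    down = timedelta(0)
    count = 0
    for start, end in outages:
        # Clip each outage to the reporting period.
        start = max(start, period_start)
        end = min(end, period_end)
        if end > start:
            down += end - start
            count += 1
    uptime_percent = 100.0 * (1 - down / period)
    average = down / count if count else timedelta(0)
    return uptime_percent, count, average

# Hypothetical data: two outages on one circuit during October 2001.
october = (datetime(2001, 10, 1), datetime(2001, 11, 1))
sample_outages = [
    (datetime(2001, 10, 5, 2, 0), datetime(2001, 10, 5, 5, 0)),
    (datetime(2001, 10, 20, 14, 0), datetime(2001, 10, 20, 15, 0)),
]
pct, count, average = outage_summary(sample_outages, *october)
print(f"uptime {pct:.2f}%, {count} outages, average length {average}")
```

Clipping to the reporting period keeps an outage that straddles a month
boundary from being counted in full in both months.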

Vince Mulhollan's additions:

1.	acceptable use policy violations
2.	number/severity of externally reported abuse complaints
3.	IOS deployments: upgrades, schedules, risk/rewards of IOSs in use
4.	Installations completed: hw and circuit changes.  Time to implement
each.
5.	RFO:  reasons for outages and remedies employed
6.	Employees' % of time spent on: upgrades, installs, security
incidents, etc. Is headcount in line with workload?
7.	Any hardware-related trends, i.e., particular devices burning out
frequently, etc.  Establish a loose figure for the likelihood of failure per
type of device

Joe St Sauver's recommendations:
1.	Don't overwhelm management with large quantity of data
2.	Implement "management by exception" by tracking/reporting "material
statistical deviations from expected values wherever possible."
3.	"The other key concept is to give management gauges that will help
them drive the plane, rather than historical data that will tell them
when/where/how badly they crashed (last month). E.G., make the data timely
and operationally relevant."
	a.	-- What's broken?
	b.	-- Where am I vulnerable?
	c.	-- Where am I running out of capacity?
	d.	-- Where do I have performance problems?
	e.	-- What are we doing really well?
	f.	-- Where can I increase my return on already deployed assets?
(e.g., where do I have underutilized capacity?)
	Look longitudinally (over time), geographically (spatially), and at
snapshots (cross-sectionally).
4.	Focus on downtime as opposed to uptime and only report those that
exceed some acceptable threshold. Focus on cause of outages and responses to
those outages and whether there are ongoing problems in solving the issues.
5.	Tie all measurements to realities of the business: stats that bear
out billing expenses, those that might help marketing, or those that help in
planning, etc
6.	Dial-ins
7.	A list of URLs for more ideas
	a.	Compare and contrast:
		http://hydra.uits.iu.edu/~abilene/traffic/
		http://monon.uits.iupui.edu/abilene/dnvr.html
		http://monon.uits.iupui.edu/abilene/dnvr/index.html
		http://monon.uits.iupui.edu/abilene/dnvr/uoreg-bits.html
		http://www.itec.oar.net/abilene-netflow/
	b.	Latency, packet loss, route changes:
		i.	http://amp.nlanr.net/active/amp-uoregon/HPC/body.html
		ii.	http://www.advanced.org/surveyor/
		iii.	http://www.caida.org/cgi-bin/skitter_summary/main.pl
		iv.	http://www.ncne.nlanr.net/nimi/
	c.	Possible breakdowns of top WAN talkers:
		i.	Single flow? Aggregate traffic? By protocol? By port? Per
dotted quad? Per network block? Per ASN? Measured by octets? Flow count?
Flow duration? From flow data? Passive monitoring with OCxMON-type tools?
Privacy issues?
		ii.	Sample report: http://www.canet3.net/stats/reports.html
8.	:-)  "Keep it brief."
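
One way to read the "management by exception" suggestion (item 2) is to
compare each metric's current value against its own history and report only
the outliers. The Python sketch below is one possible interpretation, not
Joe's method: the z-score test, the 3-sigma default threshold, and the metric
names are all assumptions made for illustration.

```python
from statistics import mean, stdev

def exceptions(history, current, threshold=3.0):
    """Return {metric: z_score} for metrics that deviate from their history.

    history: metric name -> list of past values (e.g., prior months)
    current: metric name -> this period's value
    threshold: assumed cutoff, in standard deviations
    """
    flagged = {}
    for name, value in current.items():
        past = history.get(name, [])
        if len(past) < 2:
            continue  # not enough history to estimate the spread
        mu = mean(past)
        sigma = stdev(past)
        if sigma == 0:
            # Flat history: any change at all is a deviation.
            z = 0.0 if value == mu else float("inf")
        else:
            z = abs(value - mu) / sigma
        if z > threshold:
            flagged[name] = z
    return flagged

# Hypothetical monthly average utilization (%) for two circuits.
history = {
    "wan1_util_pct": [40, 42, 41, 39, 43],
    "wan2_util_pct": [60, 61, 59, 60, 62],
}
current = {"wan1_util_pct": 41, "wan2_util_pct": 85}
for name, z in exceptions(history, current).items():
    print(f"{name}: {z:.1f} standard deviations from expected")
```

With this data only wan2_util_pct is reported; wan1_util_pct sits at its
historical mean and stays out of the management report.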

Iljitsch van Beijnum:
1.	Interface stats
	a.	CRC errors (help identify lower-layer problems)
	b.	collisions, if you use any non-switched Ethernet
2.	router CPU load
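
Interface error counts are easier to compare across circuits when expressed
as a rate. The Python sketch below assumes the error and packet counts come
from two successive polls of 32-bit wrapping counters (such as SNMP Counter32
values) and that a counter wraps at most once between polls; the function
names are hypothetical.

```python
COUNTER32_MOD = 2 ** 32

def counter_delta(previous, current, modulus=COUNTER32_MOD):
    """Difference between two counter readings, allowing for one wrap."""
    return (current - previous) % modulus

def error_rate(prev_errors, cur_errors, prev_packets, cur_packets):
    """Errors per packet received between two polls (0.0 if no packets)."""
    errors = counter_delta(prev_errors, cur_errors)
    packets = counter_delta(prev_packets, cur_packets)
    return errors / packets if packets else 0.0

# Hypothetical readings; the packet counter wraps between the two polls.
rate = error_rate(100, 150, 4294967000, 9704)
print(f"CRC error rate: {rate:.4%}")
```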

Joe Provo:

1.	Errors: CRCs, queue drops/depth  (eg, RED vs tail-drop queues)
2.	Routing protocol transitions where relevant, eg, BGP route table
size
3.	"You should look at SNIPS [formerly NOCOL] for examples of good
stuff to monitor [& therefore trend]."

David Newman:
1.	For delay-sensitive traffic, include:
	a.	jitter  (latency variation)
	b.	histograms (latency distribution)
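
Both of David's items can be derived from the same set of RTT samples. The
Python sketch below is illustrative only: it takes jitter to be the mean
absolute difference between consecutive samples (one of several definitions
in use), and the 10 ms bin width and sample values are assumptions.

```python
from collections import Counter

def jitter(rtts_ms):
    """Mean absolute difference between consecutive RTT samples, in ms."""
    if len(rtts_ms) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(rtts_ms, rtts_ms[1:])]
    return sum(diffs) / len(diffs)

def latency_histogram(rtts_ms, bin_ms=10):
    """Count samples per bin; each key is the bin's lower edge in ms."""
    counts = Counter(int(r // bin_ms) * bin_ms for r in rtts_ms)
    return dict(sorted(counts.items()))

# Hypothetical RTT samples in milliseconds.
samples = [20, 22, 21, 35, 20, 48]
print(f"jitter: {jitter(samples):.1f} ms")
for lower, n in latency_histogram(samples).items():
    print(f"{lower:3d}-{lower + 9:3d} ms: {'#' * n}")
```

The histogram shows what an average hides: a circuit whose samples mostly
fall in the 20 ms bin but occasionally land in the 40 ms bin looks fine on
mean latency alone.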

Thanks to everyone who responded! If you have more suggestions for
this list, please email me directly.

-BM