Bandwidth Augmentation Triggers

Simon Leinen simon at limmat.switch.ch
Tue May 1 08:45:05 UTC 2007


Jason Frisvold writes:
> I'm working on a system to alert when a bandwidth augmentation is
> needed.  I've looked at using both true averages and 95th percentile
> calculations.  I'm wondering what everyone else uses for this
> purpose?

We use a "secret formula", aka rules of thumb, based on perceived
quality expectations/customer access capacities, and cost/revenue
considerations.
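
(As for true averages vs. 95th percentile: the two can tell rather
different stories about the same link.  A quick Python sketch of the
arithmetic, with made-up sample data; the nearest-rank percentile
convention used here is just one of several in use:)

    import math

    # Illustration only: "true average" vs. 95th percentile of
    # five-minute link-load samples (values in Mb/s).  The data is
    # invented; in practice you would feed in a week or a month of
    # samples derived from interface octet counters.
    samples_mbps = [12, 15, 18, 20, 25, 30, 35, 40, 60, 80,
                    90, 120, 150, 180, 210, 230, 245, 250, 255, 310]

    def true_average(samples):
        return sum(samples) / float(len(samples))

    def percentile_95(samples):
        # Nearest-rank definition, as commonly used for percentile
        # billing: sort ascending and take the value at rank
        # ceil(0.95 * n), i.e. the top 5% of samples are ignored.
        ordered = sorted(samples)
        rank = int(math.ceil(0.95 * len(ordered)))
        return ordered[rank - 1]

    print("true average:    %.1f Mb/s" % true_average(samples_mbps))
    print("95th percentile: %.1f Mb/s" % percentile_95(samples_mbps))

On this toy data the plain average comes out just under 120 Mb/s
while the 95th percentile is 255 Mb/s, which is why the two can lead
to rather different upgrade decisions.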

In the bad old days of bandwidth crunch (ca. 1996), we scheduled
upgrades of our transatlantic links so that relief would come when
peak-hour average packet loss exceeded 5% (later 3%).  At that time
the general expectation was that Internet performance was mostly crap
anyway and that, if you needed to transfer large files, 0300 AM was
your friend.  And upgrades were incredibly expensive.  With that
rule, link utilization was 100% for most of the (working) day.

Today, we start thinking about upgrading from GbE to 10GE when link
load regularly exceeds 200-300 Mb/s (even when the average load over
a week is much lower).  Since we run over dark fibre and use mid-range
routers with inexpensive ports, upgrades are relatively cheap.  And -
fortunately - performance expectations have evolved, with some users
expecting to be able to run file transfers at near-Gb/s speeds and
>500 Mb/s videoconferences with no packet loss, etc.
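
(If you wanted to mechanise a rule like that, the fuzzy part is
"regularly".  A small sketch of one possible reading: the 250 Mb/s
threshold matches the range above, but the 5-days-out-of-7 criterion
is purely my own choice:)

    # Sketch of the rule of thumb above: flag a GbE link as a 10GE
    # candidate when its busy-hour load exceeds a threshold on most
    # days of the week.  Threshold and "most days" are illustrative.
    THRESHOLD_MBPS = 250
    DAYS_REQUIRED = 5

    def needs_upgrade(daily_busy_hour_mbps):
        # daily_busy_hour_mbps: one busy-hour load figure (Mb/s)
        # per day for the last week.
        busy_days = sum(1 for load in daily_busy_hour_mbps
                        if load > THRESHOLD_MBPS)
        return busy_days >= DAYS_REQUIRED

    # Example: busy on weekdays, quiet on the weekend -> upgrade.
    print(needs_upgrade([270, 310, 290, 260, 280, 120, 90]))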

An important question is what kind of users your links aggregate.  A
"core" link shared by millions of low-bandwidth users may run at 95%
utilization without being perceived as a bottleneck.  On the other
hand, you may have a campus access link shared by users with fast
connections (I hear GbE is common these days) on both sides.  In that
case, the link may be perceived as a bottleneck even when utilization
graphs suggest there's a lot of headroom.
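
(One crude way to put that into numbers is the oversubscription
ratio, i.e. the sum of the access capacities behind the link divided
by the link capacity.  A sketch with invented figures:)

    # Invented example: the same 1 Gb/s link looks very different
    # depending on what it aggregates.
    def oversubscription(access_mbps, link_mbps):
        return sum(access_mbps) / float(link_mbps)

    dsl_customers = [2] * 5000      # 5000 subscribers at 2 Mb/s
    campus_hosts  = [1000] * 200    # 200 hosts with GbE

    print(oversubscription(dsl_customers, 1000))  # 10.0  -> rarely noticed
    print(oversubscription(campus_hosts, 1000))   # 200.0 -> will be felt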

In general, I think utilization rates are less useful as a basis for
upgrade planning than (queueing) loss and delay measurements.  Loss
can often be measured directly at routers (drop counters in SNMP), but
queueing delay is hard to measure in this way.  You could use tools
such as SmokePing (host-based) or Cisco IP SLA / Juniper RPM
(router-based) to measure it.
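
(To make the "drop counters in SNMP" part concrete: the generic
IF-MIB discard counters are often enough for a first look.  A rough
Python sketch that shells out to net-snmp's snmpget; hostname,
community and ifIndex are placeholders, and counter wraps are
ignored:)

    import subprocess, time

    ROUTER    = "router.example.net"   # placeholder
    COMMUNITY = "public"               # placeholder
    IFINDEX   = "3"                    # placeholder interface index

    def snmp_value(oid):
        # net-snmp's snmpget; -Oqv prints just the value.
        out = subprocess.check_output(
            ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", ROUTER, oid])
        return int(out.split()[-1])

    def poll():
        drops  = snmp_value("IF-MIB::ifOutDiscards." + IFINDEX)
        octets = snmp_value("IF-MIB::ifHCOutOctets." + IFINDEX)
        return drops, octets

    # Two polls five minutes apart give the drop count and the
    # average output load for the interval.
    d1, o1 = poll()
    time.sleep(300)
    d2, o2 = poll()
    print("output drops in interval: %d" % (d2 - d1))
    print("average output load:      %.1f Mb/s"
          % ((o2 - o1) * 8 / 300.0 / 1e6))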

(And if you manage to link your BSS and OSS, then you can measure the
rate at which customers run away for an even more relevant metric :-)

> We're talking about anything from a T1 to an OC-12 here.  My guess
> is that the calculation needs to be slightly different based on the
> transport, but I'm not 100% sure.

Probably not based on the type of transport: PDH, SDH and Ethernet
behave essentially the same.  But the rules will be different for
different bandwidth ranges.  Again, it is important to look not just
at link capacities in isolation, but also at their relation to the
capacities of the access links that they aggregate.
-- 
Simon.


