representativeness of flow data based on samples

Wed Jan 30 19:02:30 UTC 2002

Traffic measurement techniques such as NetFlow work by associating
some characteristics of inbound packets on an interface with a flow,
e.g. some tuple like (source addr, source port, dest addr, dest port,
protocol). Counters per flow are incremented, and the numbers are
exported periodically or when flows become inactive.
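
As a minimal sketch of that accounting step (not NetFlow's actual
implementation), the per-flow counters can be pictured as a table
keyed by the tuple. The names FlowKey and account, and the sample
addresses, are made up for illustration.

```python
from collections import defaultdict, namedtuple

# Hypothetical flow key: the tuple described above.
FlowKey = namedtuple("FlowKey", "src_addr src_port dst_addr dst_port proto")

def account(packets):
    """Accumulate packet and byte counters per flow.

    `packets` is an iterable of (FlowKey, size_in_bytes) pairs.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for key, size in packets:
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)

# Illustrative packets only (documentation address ranges).
web = FlowKey("192.0.2.1", 40000, "198.51.100.7", 80, "tcp")
dns = FlowKey("192.0.2.1", 53000, "198.51.100.9", 53, "udp")
table = account([(web, 1500), (dns, 80), (web, 1500)])
```

A real exporter would also expire idle flows and emit their counters;
that part is left out here.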

There are a few vendors who now provide traffic export from high-speed
interfaces by sampling those interfaces at a particular rate, and
using the sampled packets to populate the per-flow counters, rather
than looking at every packet.
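
Roughly, and assuming a deterministic "every n-th packet" sampler
(some implementations sample randomly instead), the sampled variant
might look like the sketch below. Multiplying each sampled packet by
n to estimate the original volume is my assumption about how the
numbers would be scaled; sampled_byte_counts is a made-up name.

```python
from collections import Counter

def sampled_byte_counts(packets, n):
    """Estimate bytes per flow from a deterministic 1-in-n sample.

    `packets` is an iterable of (flow_key, size_in_bytes) pairs.
    Every n-th packet is kept, and its size is multiplied by n as a
    simple estimate of the traffic it stands in for.
    """
    estimate = Counter()
    for i, (key, size) in enumerate(packets):
        if i % n == 0:  # keep packets 0, n, 2n, ...
            estimate[key] += size * n
    return estimate

# Toy input: six packets for flow "a", then four for flow "b".
packets = [("a", 100)] * 6 + [("b", 100)] * 4
est = sampled_byte_counts(packets, 2)
```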

Does anybody here know of recent research with real Internet traffic
which compares different sample rates with respect to the
representativeness of the resulting flow data?

For example, if I am trying to rank the top traffic sinks for my
network beyond an attached peer (i.e. an ordinal rather than cardinal
measurement), will I get different answers if I use a sampling rate
of 1:1000 compared to 1:50, given a statistically "long enough"
measurement period?
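
To make the question concrete, here is a toy simulation of the
comparison I have in mind. Everything in it is an assumption: the
traffic is synthetic (destinations drawn independently from a skewed,
Zipf-like distribution), packets are sampled independently with
probability 1/n, and sinks are ranked by packet count rather than
bytes. Real traffic is bursty and correlated, so this says nothing
about the real answer; it only shows the shape of the experiment.

```python
import random
from collections import Counter

def top_sinks(packets, n, k, rng):
    """Rank destinations by sampled packet count.

    Each packet is kept independently with probability 1/n;
    n=1 keeps every packet.
    """
    counts = Counter(dst for dst in packets if rng.random() < 1.0 / n)
    return [dst for dst, _ in counts.most_common(k)]

def synthetic_packets(num_packets, num_sinks, rng):
    """Draw destinations from a skewed popularity distribution.

    Purely synthetic: sink r gets weight 1/(r+1).
    """
    sinks = list(range(num_sinks))
    weights = [1.0 / (rank + 1) for rank in sinks]
    return rng.choices(sinks, weights=weights, k=num_packets)

rng = random.Random(2002)  # fixed seed so runs are repeatable
packets = synthetic_packets(200_000, 500, rng)
truth = top_sinks(packets, 1, 10, rng)  # unsampled ranking

for n in (50, 1000):
    sampled = top_sinks(packets, n, 10, rng)
    overlap = len(set(truth) & set(sampled))
    print(f"1:{n}  top-10 overlap with unsampled ranking: {overlap}/10")
```

Set overlap ignores order within the top 10; a rank correlation
measure would be a stricter test of the ordinal question.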

Intuitively, it seems to me that the answers should be the same.
However, it also seems to me that statistics are frequently non-
intuitive.

Joe