[Nanog] Lies, Damned Lies, and Statistics [Was: Re: ATT VP: Internet to hit capacity by 2010]

Wed Apr 23 17:45:11 UTC 2008

On Tue, Apr 22, 2008 at 5:12 AM, Petri Helenius <petri at helenius.fi> wrote:
> michael.dillon at bt.com wrote:
> > But there is another way. That is for software developers to build a
>  > modified client that depends on a topology guru for information on the
>  > network topology. This topology guru would be some software that is run
>  number of total participants) I fail to figure out the necessary
>  mathematics where topology information would bring superior results
>  compared to the usual greedy algorithms where data is requested from the
>  peers where it seems to be flowing at the best rates. If local peers
>  with sufficient upstream bandwidth exist, majority of the data blocks
>  are already retrieved from them.

You can think of the scheduling process as two independent problems:
1. Given a list of all the chunks that all the peers you're connected
to have, select the chunks you think will help you complete the
fastest.
2. Given a list of all peers in a cloud, select the peers you think
will help you complete the fastest.
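
As a rough sketch of that split, the two problems can be written as two
separate functions. The names select_chunks and select_peers, and the
rarest-first and fastest-observed-rate heuristics, are illustrative
assumptions on my part, not what any particular client does:

```python
from collections import Counter

def select_chunks(have, peer_chunks, limit):
    """Problem 1: pick chunks to request from the connected peers.

    have: set of chunk ids we already hold.
    peer_chunks: dict mapping peer id -> set of chunk ids that peer holds.
    Heuristic (an assumption): rarest-first, ties broken by chunk id.
    """
    counts = Counter()
    for chunks in peer_chunks.values():
        counts.update(chunks - have)  # only count chunks we still need
    ranked = sorted(counts, key=lambda c: (counts[c], c))
    return ranked[:limit]

def select_peers(peer_rates, limit):
    """Problem 2: pick peers from everyone visible in the cloud.

    peer_rates: dict mapping peer id -> observed download rate (bytes/s).
    Heuristic (an assumption): keep the fastest peers seen so far.
    """
    ranked = sorted(peer_rates, key=lambda p: (-peer_rates[p], p))
    return ranked[:limit]
```

The point is only that the two functions take different inputs and can
be tuned separately; the output of the second bounds the input of the
first.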

Traditionally, peer scheduling (#2) has meant just connecting to
everyone you see and letting network bottlenecks drive you toward
efficiency, as you pointed out.

However, as your chunk scheduling becomes more effective, it usually
becomes more expensive. At some point, its increasing complexity will
reverse the trend and start slowing down copies, as real-world clients
begin to stall on chunk requests while waiting for the CPU to finish
scheduling decisions.

A more selective peer scheduler would allow you to reduce the inputs
into the chunk scheduler (allowing it to do more complex things at
the same cost). The idea is that doing more math on the best data will
yield better overall results than doing less math on the best plus the
worst data, on the assumption that a good peer scheduler will help
you find the best data.
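
To make the "reduce the inputs" point concrete, here is a toy sketch.
The cost metric (think hop count from some topology source), the peer
names, and the helper functions are all hypothetical; the example only
counts how many (peer, chunk) pairs a chunk scheduler would have to
look at with and without a peer pre-filter:

```python
def prefilter_peers(peer_cost, keep):
    """Keep the `keep` peers with the lowest cost (hypothetical metric)."""
    return sorted(peer_cost, key=lambda p: (peer_cost[p], p))[:keep]

def scheduler_inputs(peer_chunks, peers):
    """Count the (peer, chunk) pairs the chunk scheduler must examine."""
    return sum(len(peer_chunks[p]) for p in peers)

# Made-up swarm: two nearby peers and two distant ones.
peer_cost = {"near1": 2, "near2": 3, "far1": 12, "far2": 15}
peer_chunks = {
    "near1": {1, 2, 3, 4},
    "near2": {3, 4, 5, 6},
    "far1": {1, 2, 3, 4, 5, 6},
    "far2": {1, 2, 3, 4, 5, 6},
}

all_peers = list(peer_chunks)
chosen = prefilter_peers(peer_cost, keep=2)

print(scheduler_inputs(peer_chunks, all_peers))  # 20 pairs, unfiltered
print(scheduler_inputs(peer_chunks, chosen))     # 8 pairs after filtering
```

In this toy case the two nearby peers still cover every chunk between
them, so the chunk scheduler loses nothing and has less than half the
input to chew on. A real swarm will not always be that convenient,
which is why the quality of the peer scheduler matters.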

As seems to be a trend, Michael appears to be fixated on a specific
implementation, and may end up driving many observers into thinking
this idea is annoying :)  However, there is a mathematical basis for
including topology (and other nontraditional) information in
scheduling decisions.