[Nanog] Lies, Damned Lies, and Statistics [Was: Re: ATT VP: Internet to hit capacity by 2010]

Laird Popkin laird at pando.com
Wed Apr 23 19:20:57 UTC 2008



On Apr 23, 2008, at 1:45 PM, Daniel Reed wrote:
> On Tue, Apr 22, 2008 at 5:12 AM, Petri Helenius <petri at helenius.fi>  
> wrote:
>> michael.dillon at bt.com wrote:
>>> But there is another way. That is for software developers to build a
>>> modified client that depends on a topology guru for information on  
>>> the
>>> network topology. This topology guru would be some software that  
>>> is run
>> number of total participants) I fail to figure out the necessary
>> mathematics where topology information would bring superior results
>> compared to the usual greedy algorithms where data is requested  
>> from the
>> peers where it seems to be flowing at the best rates. If local peers
>> with sufficient upstream bandwidth exist, the majority of the data blocks
>> are already retrieved from them.

It's true that in the long run p2p transfers can optimize data sources  
by measuring actual throughput, but at any given moment this approach  
can only optimize within the set of known peers. The problem is that  
for large swarms, any given peer only knows about a very small subset  
of available peers, so it may take a long time to discover the best  
peers. This means (IMO) that starting with good peers instead of
random peers can make a big difference in p2p performance, and it can
also cut the ISP's data delivery costs.

For example, let's consider a downloader in a swarm of 100,000 peers
whose client announces to the BitTorrent tracker once a minute and
gets back 40 random peers each time. Of course, this is a simplified
case, but it should be sufficient to make the general point that the
selection of which peers you connect to matters.

Let's look at the odds that you'll find out about your closest peers
(in network terms) over time.

With random peer assignment, the chance that any one returned peer is
among your 40 closest peers is 40/100,000. Working through the math,
the odds of learning about one of your closest peers on the first
announce are about 1.58%. Compounding announce by announce, that gives
you a 38.1% chance of finding one of your closest peers within the
first half hour, a 61.7% chance within the first hour, an 85.3% chance
within the first two hours, and so on, approaching 100% along a
geometric curve.
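
If you want to check that arithmetic, here's a quick back-of-the-envelope
script. It's just a sketch of the calculation above under the stated
assumptions (100,000 peers, 40 random peers per one-minute announce, and
"success" meaning you've learned about one of your 40 closest peers),
not anything from our client:

    # Back-of-the-envelope check of the odds above. Assumptions: a swarm
    # of 100,000 peers, one announce per minute returning 40 random peers,
    # and "success" means learning about one of your 40 closest peers.
    SWARM_SIZE = 100_000
    PEERS_PER_ANNOUNCE = 40
    CLOSEST_SET = 40

    # Chance that a single announce returns at least one "closest" peer.
    p_announce = 1 - (1 - CLOSEST_SET / SWARM_SIZE) ** PEERS_PER_ANNOUNCE

    for minutes in (1, 30, 60, 120):
        p_by_now = 1 - (1 - p_announce) ** minutes
        print(f"{minutes:4d} min: {p_by_now:6.1%}")

    # Prints roughly 1.6%, 38.1%, 61.7%, and 85.3%.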

In the real world there are factors that complicate the analysis
(e.g. most trackers allow announces much less often than once a
minute, but some clients have other discovery mechanisms such as Peer
Exchange). But as far as I can tell, the basic issue (that it takes a
long time to find out about, and test data exchanges with, all of the
peers in a large swarm) still holds.

With P4P, you find out about the closest peers on the first announce.

There's a second issue that I think is relevant, which is that  
measured network throughput may not reflect ISP costs and business  
policies. For example, a downloader might get data from a fast peer  
through a transatlantic pipe, but the ISP would really rather have
that user get data from a fast peer on their local loop instead. This  
won't happen unless the p2p network knows about (and makes decisions  
based on) network topology.
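
To make "decisions based on network topology" concrete, here's a rough
sketch of the kind of peer ranking I mean. To be clear, this is purely
illustrative: the Peer fields, the ISP hint, and the preference order
are invented for this email, not taken from our P4P implementation.

    # Illustrative sketch only: the Peer fields and the preference order
    # are invented for this example; this is not Pando's or P4P's code.
    from dataclasses import dataclass

    @dataclass
    class Peer:
        address: str
        same_isp: bool        # inside the downloader's ISP, per an ISP-provided hint
        hop_estimate: int     # rough network distance; lower means "closer"
        measured_kbps: float  # observed throughput; 0.0 until data has flowed

    def pick_peers(candidates, want=40):
        """Prefer same-ISP peers, then nearby peers, and among equals
        prefer the ones that have already proven fast."""
        return sorted(
            candidates,
            key=lambda p: (not p.same_isp, p.hop_estimate, -p.measured_kbps),
        )[:want]

The point is just that a locality hint can guide peer selection before
any throughput has been measured, which is exactly what a random
tracker response can't do.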

What we found in our first field test was that random peer assignment
moved 98% of the data between ISPs and only 2% within ISPs (and for
smaller ISPs, more like 0.1%), while even simple network awareness
resulted in an average of 34% same-ISP data transfers (i.e. a drop of
32 percentage points in external transit). With ISP involvement, the
numbers are even better.

>>
>
> You can think of the scheduling process as two independent problems:
> 1. Given a list of all the chunks that all the peers you're connected
>    to have, select the chunks you think will help you complete the
>    fastest.
> 2. Given a list of all peers in a cloud, select the peers you think
>    will help you complete the fastest.
>
> Traditionally, peer scheduling (#2) has been to just connect to
> everyone you see and let network bottlenecks drive you toward
> efficiency, as you pointed out.
>
> However, as your chunk scheduling becomes more effective, it usually
> becomes more expensive. At some point, its increasing complexity will
> reverse the trend and start slowing down copies, as real-world clients
> begin to block making chunk requests waiting for CPU to make
> scheduling decisions.
>
> A more selective peer scheduler would allow you to reduce the inputs
> into the chunk scheduler (allowing it to do more complex things with
> the same cost). The idea is, doing more math on the best data will
> yield better overall results than doing less math on the best + the
> worse data, with the assumption that a good peer scheduler will help
> you find the best data.

Interesting approach. IMO, on modern computers CPU is highly
underutilized (PCs are idle 80% of the time, and rarely CPU-bound even
when in use), while bandwidth is relatively scarce, so spending more
CPU to optimize bandwidth usage seems like a great tradeoff!
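
To put the split Daniel describes in concrete terms, here's a rough
sketch of a cheap peer scheduler feeding a more expensive chunk
scheduler. The function names and the peer/chunk attributes are
hypothetical stand-ins for illustration, not anyone's actual client
internals:

    # Hypothetical two-stage scheduler sketch; attribute names are invented.

    def select_peers(known_peers, budget=50):
        """Peer scheduler: keep only the most promising peers so the more
        expensive chunk scheduler works on a small input set."""
        return sorted(known_peers, key=lambda p: -p.measured_kbps)[:budget]

    def select_chunks(peers, have_chunks):
        """Chunk scheduler: rarest-first over the reduced peer set."""
        availability = {}
        for peer in peers:
            for chunk in peer.chunks:
                if chunk not in have_chunks:
                    availability[chunk] = availability.get(chunk, 0) + 1
        # Request the rarest missing chunks first.
        return sorted(availability, key=availability.get)

The peer scheduler trims the candidate set so the chunk scheduler only
spends CPU on peers that are likely to matter.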

> As seems to be a trend, Michael appears to be fixated on a specific
> implementation, and may end up driving many observers into thinking
> this idea is annoying :)  However, there is a mathematical basis for
> including topology (and other nontraditional) information in
> scheduling decisions.

Laird Popkin
CTO, Pando Networks
520 Broadway, 10th floor
New York, NY 10012

laird at pando.com
c) 646/465-0570




