yahoo crawlers hammering us

Matthew Petach mpetach at netflight.com
Wed Sep 8 16:46:12 UTC 2010


On Wed, Sep 8, 2010 at 9:20 AM, Ken Chase <ken at sizone.org> wrote:
> On Wed, Sep 08, 2010 at 12:04:07AM -0700, Matthew Petach said:
>
>  >I *am* curious--what makes it any worse for a search engine like Google
>  >to fetch the file than any other random user on the Internet?  In either
>  >case, the machine doing the fetch isn't going to rate-limit the fetch, so
>  >you're likely to see the same impact on the machine, and on the bandwidth.
>
> I think that the difference is that there's a way to get to Yahoo and ask them
> WTF. Whereas the guy who mass downloads your site with a script in 2 hrs you
> have no recourse to (modulo well funded banks dispatching squads with baseball
> bats to resolve hacking incidents).  I also expect that Yahoo's behaviour is
> driven by policy, not random assholishness (I hope :), and therefore I should
> expect such incidents often. I also expect whinging on nanog might get me some
> visibility into said policy and leverage to change it! </dream>

Well, I'd hazard a guess that the policy of the webcrawling machines at Bing,
Google, Yahoo, Ask.com, and every other large search engine is probably to
crawl the Internet, pulling down pages and indexing them for their search
engine, always checking for a robots.txt file and carefully following the
instructions located within said file.  Lacking any such file, one might
suppose that the policy is to limit how many pages are fetched per interval
of time, to avoid hammering a single server unnecessarily, and to space out
intervals at which the site is visited, to balance out the desire to maintain
a current, fresh view of the content, while at the same time being mindful of
the limited server resources available for serving said content.
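
As a rough illustration of the kind of policy hypothesized above (and not any
search engine's actual crawler), the sketch below checks robots.txt rules and
spaces out fetches using Python's standard urllib.robotparser module. The
robots.txt content, the "examplebot" user agent, the example.com URLs, the
plan_fetches helper, and the 10-second fallback delay are all assumptions
made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Assumed fallback spacing (seconds) when the site states no Crawl-delay.
DEFAULT_DELAY = 10.0


def plan_fetches(robots_txt, user_agent, urls, default_delay=DEFAULT_DELAY):
    """Return (offset_seconds, url) pairs for the URLs the rules allow.

    Disallowed URLs are dropped; allowed ones are spaced one delay apart.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Honor the site's Crawl-delay if present, else use the fallback.
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay

    allowed = [u for u in urls if parser.can_fetch(user_agent, u)]
    return [(i * delay, u) for i, u in enumerate(allowed)]


if __name__ == "__main__":
    urls = [
        "http://example.com/a",
        "http://example.com/private/b",
        "http://example.com/c",
    ]
    for offset, url in plan_fetches(ROBOTS_TXT, "examplebot", urls):
        print(f"t+{offset}s  GET {url}")
```

The sketch only plans the schedule; it performs no network requests, and it
does not model revisit intervals or per-server load, which a real crawler
would presumably also weigh.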

Note that I have no actual knowledge of the crawling policies present at any
of the aforementioned sites; I'm simply hypothesizing about what their
policies might logically be.

I'm curious--what level of visibility are you seeking into the crawling
policies of the search engines, and what changes are you hoping to gain
leverage to make to said policies?

Thanks!

Matt
(speaking only for myself, not for any current or past employer)

> /kc
> --
> Ken Chase - ken at heavycomputing.ca - +1 416 897 6284 - Toronto CANADA
> Heavy Computing - Clued bandwidth, colocation and managed linux VPS @151 Front St. W.



