Anyone have contacts at the Amazon or OpenAI web spiders?

John Levine johnl at iecc.com
Wed Feb 14 16:48:27 UTC 2024


It appears that Patrick Clochesy <patrick at mach.net> said:
>Both robots respect robots.txt, of course they’re not going to answer.

The content farm is not one site with six billion pages; it's six billion
sites each with one page.  They check the robots.txt for each site they
visit, but by then it's too late.

Most spiders can take the hint that they're all on the same IP.  But not
these two.
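
To illustrate the "same IP" hint, here is a rough sketch of a crawler-side
throttle that spaces out fetches by server IP rather than by hostname.  It
is not how any particular spider is known to work; the class name, the
hostnames, the 10-second interval, and the dictionary standing in for DNS
are all made up for the example.

```python
from collections import defaultdict


class PerIpThrottle:
    """Space out fetches by server IP address rather than by hostname."""

    def __init__(self, resolve, min_interval=10.0):
        self.resolve = resolve                # callable: hostname -> IP string
        self.min_interval = min_interval      # seconds between hits on one IP
        self.next_free = defaultdict(float)   # IP -> earliest next fetch time

    def schedule(self, host, now):
        """Return when `host` may be fetched, and reserve that slot."""
        ip = self.resolve(host)
        when = max(now, self.next_free[ip])
        self.next_free[ip] = when + self.min_interval
        return when


# Hypothetical hostnames, all on one documentation-range address (assumption).
fake_dns = {
    "a.example": "192.0.2.1",
    "b.example": "192.0.2.1",
    "c.example": "192.0.2.1",
}
throttle = PerIpThrottle(fake_dns.get, min_interval=10.0)
times = [throttle.schedule(host, now=0.0) for host in fake_dns]
print(times)  # [0.0, 10.0, 20.0]
```

Keyed by hostname, all three fetches would be allowed at time 0; keyed by
IP, they get spread out no matter how many names point at the one server.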

R's,
John

>
>On Feb 13, 2024, at 8:35 PM, John Levine <johnl at iecc.com> wrote:
>> 
>> One day I set up the world's lamest content farm. You can see it here:
>> 
>> https://www.web.sp.am/
>> 
>> While humans tend not to find its six billion pages very interesting,
>> some web spiders are entranced. In the past week or so, Amazon's
>> amazonbot has visited it 6 million times, and OpenAI's gptbot 2.6
>> million. (If you were wondering what they use to train ChatGPT, now
>> you know.) I don't care that googlebot comes by every 5 or 10 minutes,
>> but gptbot is every few seconds and amazon as fast as the server will
>> respond.
>> 
>> They both come from predictable IPs so I can set packet filters, but
>> they're still hammering pretty hard. Each has a URL in the user agent
>> string. Amazon's page has an address to write to but OpenAI's doesn't.
>> I wrote to the Amazon address, no response.
>> 
>> If anyone has contacts at either I would appreciate it. A few years
>> ago the bingbot got trapped but fortunately I knew someone at
>> Microsoft who could pass the word. He reported back that while he
>> could not go into detail, there was a great deal of animated
>> conversation at the other end of the hall, and shortly after that it
>> stopped.
>> 
>> R's,
>> John
>




More information about the NANOG mailing list