STILL Paging Google...

Wed Nov 16 00:56:12 UTC 2005

Still no word from google, nor any indication that there's anything wrong 
with the robots.txt.  Google's estimated hit count is going slightly up, 
instead of way down.
Why am I bugging NANOG with this? Well, I'm sure that if Googlebot keeps 
ignoring my robots.txt file, thereby hammering the server and 
facilitating spam, they're doing the same with a google other sites.  
(Well, ok, not a google, but you get my point.) 

On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:
> Just thinking out loud...
>
> Have you confirmed the IP addresses of the Googlebot entries in your log
> actually belong to Google?  
>
> /paranoia  :)
The google search URL I posted shows that google is hitting the site.  
There are results in there that point to pages that postdate the 
robots.txt that should have blocked 'em.  
(http://www.google.com/search?q=site%3Awiki.fastmail.fm)
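
For anyone who does want to check the log entries directly, a rough sketch of a reverse-then-forward DNS check follows. The function name and the domain suffixes (.googlebot.com, .google.com) are assumptions made for illustration, not something stated in this thread, and the forward lookup used here only covers IPv4.

```python
import socket

def looks_like_googlebot(ip, reverse=socket.gethostbyaddr,
                         forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to an assumed Google crawler
    hostname and that hostname resolves forward to the same address."""
    try:
        host = reverse(ip)[0]           # PTR lookup: (hostname, aliases, addrs)
    except OSError:                     # covers socket.herror / socket.gaierror
        return False
    # Assumed suffixes; the leading dot keeps "googlebot.com.example.net" out.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        addrs = forward(host)[2]        # A lookup: (hostname, aliases, addrs)
    except OSError:
        return False
    # The forward lookup must point back at the address seen in the log.
    return ip in addrs
```

The resolver functions are parameters only so the check can be exercised without touching the network; calling looks_like_googlebot(ip) with an address from the access log uses the real resolver.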

On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:
> Are you trying to block everything except the main page?  I know to 
> block everything ...
No; me too. See
http://www.google.com/webmasters/remove.html
The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with "?" in them.
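
To make the intended meaning of that rule concrete, here is a minimal sketch of the wildcard matching, assuming "*" stands for any run of characters and that a rule matches from the start of the URL path (query string included). The helper names are made up for illustration; this is not Googlebot's actual matcher.

```python
import re

def rule_to_regex(rule):
    """Translate a Disallow pattern containing '*' wildcards into a regex."""
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))

def is_blocked(rule, path):
    """True if `path` (path plus query string) falls under the Disallow rule."""
    # re.match anchors at the start only, which gives prefix semantics.
    return rule_to_regex(rule).match(path) is not None

# Example paths are hypothetical.
print(is_blocked("/*?", "/index.php?title=Foo"))   # dynamic URL
print(is_blocked("/*?", "/MainPage"))              # static-looking URL
print(is_blocked("/*?*", "/index.php?title=Foo"))  # trailing '*' adds nothing
```

Under these assumptions "/*?" and "/*?*" block exactly the same set of URLs, since a prefix match already allows anything after the "?".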
>
>
> On Mon, 14 Nov 2005, Matthew Elvey wrote:
>
>>
>> Doh!  I had no idea my thread would require login/be hidden from 
>> general view!  (A robots.txt info site had directed me there...)   It 
>> seems I fell for an SEO scam... how ironic.  I guess that's why I 
>> haven't heard from google...
>>
>> Anyway, here's the page content (with some editing and paraphrasing):
>>
>> Subject: paging google! robots.txt being ignored!
>>
>> Hi. My robots.txt was put in place in August!
>> But google still has tons of results that violate the file.
>>
>> http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
>> doesn't complain (other than about the use of google's nonstandard 
>> extensions described at
>> http://www.google.com/webmasters/remove.html )
>>
>> The above page says that it's OK that
>>
>> #per [[AdminRequests]]
>> User-agent: Googlebot
>> Disallow: /*?*
>>
>> is last (after User-agent: *)
>>
>> and seems to suggest that the syntax is OK.
>>
>> I also tried
>>
>> User-agent: Googlebot
>> Disallow: /*?
>> but it hasn't helped.
>>
>>
>>
>> I asked google to review it via the automatic URL removal system 
>> (http://services.google.com/urlconsole/controller).
>> Result:
>> URLs cannot have wild cards in them (e.g. "*"). The following line 
>> contains a wild card:
>> DISALLOW: /*?
>>
>> How insane is that?
>>
>> Oh, and while /*?* wasn't per their example, it was legal, per their 
>> syntax, same as /*?  !
>>
>> The site has around 35,000 pages, and I don't think a small robots.txt 
>> to do what I want is possible without using the wildcard extension.