STILL Paging Google...
MH
malum at freeshell.org
Wed Nov 16 02:38:40 UTC 2005
Hi there,
Looking at your robots.txt... are you sure that is correct?
On the sites I host... robots.txt always has:
User-Agent: *
Disallow: /
In /htdocs or wherever the httpd root lives. Thus far it keeps the
spiders away.
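That catch-all rule can be sanity-checked locally; here is a minimal sketch using Python's standard urllib.robotparser (the paths are hypothetical, not from this thread):

```python
from urllib import robotparser

# Parse the two-line robots.txt quoted above and confirm that a
# blanket "Disallow: /" blocks every path for every user agent.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /",
])

print(rp.can_fetch("Googlebot", "/index.html"))   # False: blocked
print(rp.can_fetch("SomeSpider", "/any/path"))    # False: blocked
```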
Googlebot will also obey NOARCHIVE, NOFOLLOW, and NOINDEX directives placed
in a robots meta tag inside the HTML header.
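For reference, those directives go in a robots meta tag in the page head; a minimal sketch, with the three values combined in one tag:

```html
<head>
  <!-- Tells crawlers not to index, follow links from, or cache this page -->
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>
```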
-M.
With the above for robots.txt I've had no problems thus far.
> Still no word from google, or indication that there's anything wrong with the
> robots.txt. Google's estimated hit count is going slightly up, instead of
> way down.
> Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps ignoring
> my robots.txt file, thereby hammering the server and facilitating spam,
> they're doing the same with a google of other sites. (Well, ok, not a google,
> but you get my point.)
> The above page says that
> User-agent: Googlebot
> Disallow: /*?
> will block all standard-looking dynamic content, i.e. URLs with "?" in them.
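A minimal sketch of that matching rule, assuming Googlebot treats the pattern as a prefix match in which "*" stands for any run of characters (the helper name is mine, not Google's):

```python
import re

def googlebot_rule_matches(pattern, path):
    # "*" matches any character run; the pattern anchors at the start
    # of the URL path (prefix match), so "/*?" hits any path with a "?".
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

print(googlebot_rule_matches("/*?", "/wiki.pl?action=edit"))  # True: blocked
print(googlebot_rule_matches("/*?", "/static/page.html"))     # False: allowed
```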
>>
>>
>> On Mon, 14 Nov 2005, Matthew Elvey wrote:
>>
>>>
>>> Doh! I had no idea my thread would require login/be hidden from general
>>> view! (A robots.txt info site had directed me there...) It seems I fell
>>> for an SEO scam... how ironic. I guess that's why I haven't heard from
>>> google...
>>>
>>> Anyway, here's the page content (with some editing and paraphrasing):
>>>
>>> Subject: paging google! robots.txt being ignored!
>>>
>>> Hi. My robots.txt was put in place in August!
>>> But google still has tons of results that violate the file.
>>>
>>> http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
>>> doesn't complain (other than about the use of google's nonstandard
>>> extensions described at
>>> http://www.google.com/webmasters/remove.html )
>>>
>>> The above page says that it's OK that
>>>
>>> #per [[AdminRequests]]
>>> User-agent: Googlebot
>>> Disallow: /*?*
>>>
>>> comes last (after the User-agent: * section)
>>>
>>> and seems to suggest that the syntax is OK.
>>>
>>> I also tried
>>>
>>> User-agent: Googlebot
>>> Disallow: /*?
>>> but it hasn't helped.
>>>
>>>
>>>
>>> I asked google to review it via the automatic URL removal system
>>> (http://services.google.com/urlconsole/controller).
>>> Result:
>>> URLs cannot have wild cards in them (e.g. "*"). The following line
>>> contains a wild card:
>>> DISALLOW: /*?
>>>
>>> How insane is that?
>>>
>>> Oh, and while /*?* wasn't in their example, it was legal per their
>>> syntax, the same as /*? !
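That equivalence can be checked mechanically; a minimal sketch, assuming Googlebot's prefix-match wildcard semantics (the helper is hypothetical, not Google's code):

```python
import re

def matches(pattern, path):
    # Prefix match where "*" stands for any run of characters.
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

# Under prefix matching, a trailing "*" is redundant, so "/*?" and
# "/*?*" block exactly the same URLs.
for path in ("/wiki.pl?action=edit", "/index.html", "/a?b=c"):
    print(matches("/*?", path) == matches("/*?*", path))  # True each time
```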
>>>
>>> The site has around 35,000 pages, and I don't think a small robots.txt that
>>> does what I want is possible without using the wildcard extension.
>>>