Crawler Etiquette

Deepak Jain deepak at
Wed Jan 23 19:35:17 UTC 2002

I figured this was the best forum to post this; if anyone has suggestions
for where it might be better placed, please let me know.

A university in our customer base has received funding to start a reasonably
large spider project. It will crawl websites [search-engine fashion] and
save certain parts of the information it receives. This information will be
made available to research institutions and other interested parties.

    We have been asked for recommendations on what functions/procedures they
should put in place to be good netizens and not cause undue stress to networks
out there. On the list of functions:

	a) Obey robots.txt files
	b) Allow network admins to have their netblocks automatically exempted
	c) Allow ISPs' caches to sync with it
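
Point (a) is straightforward to get right with standard tooling; as a minimal
sketch, Python's stock robots.txt parser can gate every fetch. The rules,
user-agent string, and URLs below are placeholders, not part of the project:

```python
# Hedged sketch of (a): consult robots.txt rules before fetching any URL.
# Uses Python's standard urllib.robotparser; rules here are fed in directly
# as sample lines rather than fetched from a live site.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A disallowed path is refused; everything else is permitted.
print(rp.can_fetch("ResearchCrawler/1.0", "http://example.edu/private/data.html"))  # False
print(rp.can_fetch("ResearchCrawler/1.0", "http://example.edu/public/index.html"))  # True
```

In a real crawler the parser would be loaded per host (via `rp.set_url(...)` and
`rp.read()`) and cached, so robots.txt is not re-fetched for every URL.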

	There are others, but they all revolve around a & b. C was something that
seemed like a good idea, but I don't know if there is any real demand for it.
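
Point (b) amounts to checking each target address against an opt-out list of
netblocks before the crawler ever opens a connection. One minimal sketch,
using Python's `ipaddress` module (the sample netblock and addresses are
placeholders, not from the actual project):

```python
# Hedged sketch of (b): a netblock opt-out list consulted before crawling.
import ipaddress

# Netblocks submitted by network admins who asked to be exempted.
exempt_netblocks = [ipaddress.ip_network("192.0.2.0/24")]

def may_crawl(ip_str):
    """Return False if the target IP falls inside any exempted netblock."""
    ip = ipaddress.ip_address(ip_str)
    return not any(ip in net for net in exempt_netblocks)

print(may_crawl("192.0.2.17"))    # False: inside the exempted /24
print(may_crawl("198.51.100.5"))  # True: not exempted
```

For a list of any real size, a longest-prefix-match structure (a radix/Patricia
trie) would replace the linear scan, but the interface stays the same.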

	Essentially, this project will have at least 1 Gb/s of inbound bandwidth.
Average usage is expected to be around 500 Mb/s for the first several months.
ISPs who cache would have an advantage if they used the cache developed by
this project to load their tables, but I do not know if there is an
Internet-wide WCCP or equivalent out there, or whether the improvement is
worth the management overhead.

	Because the funding is there, this project is essentially a certainty. If
there are suggestions that should be added or concerns that this raises,
please let me know [privately is fine].

All input is appreciated,


More information about the NANOG mailing list