Dear ht://Dig Group,
You might want to add this or include something like this in the explanation of robots and indexing in the ht://Dig docs.
Thanks for the greatest software package on Earth.
-Gabriel Fenteany (fenteany@calvin.bwh.harvard.edu)
ht://Dig supports the Robots Exclusion Protocol, robots metatags, and ht://Dig-specific metatags and comment tags.
1. Robots Exclusion Protocol

The robot looks for a document named robots.txt in your site's entry directory. So for the site http://www.foo.com/, the robots.txt file would have the URL http://www.foo.com/robots.txt. If it can find this document, it will analyse its contents for records like:

User-agent: *
Disallow: /

(This record excludes all robots from the entire site.)
For instance, to exclude only the ht://Dig indexer from directories with the names cgi-bin, tmp or private, you'd put the following text in the robots.txt file:
User-agent: htdig
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

For more details see: Robots Exclusion Protocol.
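A single robots.txt file can also combine several records, separated by blank lines, if you want one rule for all robots and a stricter one just for htdig. The following is only an illustrative sketch; the directory names are placeholders:

# hypothetical robots.txt with one record for all robots and one for htdig
User-agent: *
Disallow: /tmp/

User-agent: htdig
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/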
2. Robots Metatags

To allow robots such as htdig to index the current page but not follow local links, you can use:
<meta name="robots" content="nofollow">as in:
<html>
<head>
<meta name="robots" content="nofollow">
<meta name="description" content="...">
<title>...</title>
</head>
<body>
...

You can also specify that the page not be indexed. In that case, a page containing the following code between its <head> and </head> tags will not be indexed, but its local links will be followed:
<meta name="robots" content="noindex">To prevent a page both from being indexed and from local links being followed, you can similarly use:
<meta name="robots" content="noinddex,nofollow">
3. ht://Dig-Specific Robots Metatag
To prevent a page from being indexed by htdig, but not by other robots that follow the robots metatag convention, use:
<meta name="htdig-noindex">
4. Tags to Prevent Indexing Only Part of a Document
Enclose any part of a document that you don't want indexed with:
<!--htdig_noindex-->
...
<!--/htdig_noindex-->

(where "..." is everything you don't want indexed)
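For example, to keep a shared navigation bar out of the index while the rest of the page is still indexed, you might wrap just that block; the links below are purely hypothetical:

<body>
<!--htdig_noindex-->
<a href="/home.html">Home</a> | <a href="/contact.html">Contact</a>
<!--/htdig_noindex-->
<p>The main text of the page, which htdig will still index.</p>
</body>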