On Robots and Indexing

Dear ht://Dig Group,

You might want to add this, or something like it, to the explanation of robots and indexing in the ht://Dig docs.

Thanks for the greatest software package on Earth.

-Gabriel Fenteany (fenteany@calvin.bwh.harvard.edu)

On Robots and Indexing:

ht://Dig supports the robot exclusion protocol, robots metatags and ht://Dig-specific robot properties.

1. Robots Exclusion Protocol

The robot looks for a document named robots.txt in your site's entry directory. So for the site http://www.foo.com/, the robots.txt file would have the URL http://www.foo.com/robots.txt. If it can find this document, it will analyse its contents for records like:
User-agent: *
Disallow: /
This particular record excludes all robots from the entire site.

For instance, to exclude only the ht://Dig indexer from directories with the names cgi-bin, tmp or private, you'd put the following text in the robots.txt file:
User-agent: htdig
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
For more details see: Robots Exclusion Protocol.
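
To check what a set of robots.txt records would allow before putting it on your site, something like the following sketch may help. It uses the robots.txt parser from Python's standard library, not htdig's own parser, so treat it only as an approximation of how htdig reads the file. The helper name is_allowed and the sample URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example records from the text above.
ROBOTS_TXT = """\
User-agent: htdig
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if user_agent may fetch url under the given records.

    Hypothetical helper: it relies on Python's generic parser, which may
    differ in details from the way htdig interprets robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("htdig", "http://www.foo.com/index.html"),
        ("htdig", "http://www.foo.com/cgi-bin/search"),
        ("otherbot", "http://www.foo.com/cgi-bin/search"),
    ]:
        print(agent, url, is_allowed(agent, url))
```

Because the records name only htdig, any other robot is still allowed into all three directories.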

2. General Robots Metatags

To allow robots such as htdig to index the current page but not follow local links, you can use:
<meta name="robots" content="nofollow">
as in:
<html>
<head>
<meta name="robots" content="nofollow">
<meta name="description" content="...">
<title>...</title>
</head>
<body>
...
You can also specify that a page not be indexed. A page containing the following tag between <head> and </head> will not be indexed, but its local links will still be followed:
<meta name="robots" content="noindex">
To prevent a page from being indexed and its local links from being followed, you can similarly use:
<meta name="robots" content="noindex,nofollow">

3. ht://Dig-Specific Robots Metatag

To prevent a page from being indexed by htdig only, without affecting other robots that follow the robots metatag convention, use:
<meta name="htdig-noindex">
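
The metatag rules described in sections 2 and 3 can be summarised in a small sketch. This is not htdig's code; it is a rough illustration, written with Python's standard HTML parser, of how a robot that honours these tags might decide whether to index a page and whether to follow its links. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect indexing directives from <meta> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.index = True   # may the page be indexed?
        self.follow = True  # may its local links be followed?

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        meta_name = attr.get("name", "").lower()
        if meta_name == "robots":
            for directive in attr.get("content", "").lower().split(","):
                directive = directive.strip()
                if directive == "noindex":
                    self.index = False
                elif directive == "nofollow":
                    self.follow = False
        elif meta_name == "htdig-noindex":
            # The ht://Dig-specific tag: treated here as noindex for htdig.
            self.index = False


def robots_directives(html):
    """Return (index, follow) for a page, as an htdig-like robot might."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="nofollow"></head></html>'
    print(robots_directives(page))
```

A robot other than htdig would simply skip the htdig-noindex branch, which is why that tag affects htdig alone.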

4. Tags to Prevent Indexing of Only Part of a Document

Enclose everything you don't want indexed in a document between the following markers:
<!--htdig_noindex-->
...
<!--/htdig_noindex-->
(where "..." is everything you don't want indexed)
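
To see what would be left for indexing once the enclosed parts are skipped, a sketch like the one below may be useful. It assumes the markers are the HTML comments <!--htdig_noindex--> and <!--/htdig_noindex-->; pass different strings if your configuration uses other markers. The function name strip_noindex is made up, and this is not how htdig itself is implemented.

```python
import re


def strip_noindex(html,
                  start="<!--htdig_noindex-->",
                  end="<!--/htdig_noindex-->"):
    """Remove every region enclosed by the start and end markers.

    The default marker strings are an assumption; adjust them to match
    your own setup.
    """
    # Non-greedy match so each start marker pairs with the nearest end marker.
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.sub("", html)


if __name__ == "__main__":
    page = ("<p>Indexed text.</p>"
            "<!--htdig_noindex--><p>Navigation bar.</p><!--/htdig_noindex-->"
            "<p>More indexed text.</p>")
    print(strip_noindex(page))
```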