How to use robots.txt

I was going to leave the topic of the robots.txt file alone, but I recently received two terrific letters that made me reconsider the whole issue. Perhaps you should too.

Steve DeJarnett, of Philips Multimedia Research, offered some useful insights into the realities of Web-crawlers and other applications that seek the elusive robots.txt file, which tells Web-crawling "robot" programs which areas of a site they may access.

DeJarnett points out that robots visiting my site, looking for robots.txt, probably really do intend to be there. When they visit, however, the robots may retrieve as few as one or two pages, depending on what they know about the site and what they're looking for.

Unfortunately, some robots aren't designed well or don't adhere to the robots.txt conventions, which discourage frequent repeat visits. Robots and Web-crawlers such as these can cause serious problems for Webmasters trying to manage traffic on a busy site. Another source of excessive site traffic is off-line browsing utilities. These programs almost never request robots.txt files, let alone observe the restrictions listed in them - though they certainly should.
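
For robot authors who do want to play by the rules, the logic is simple: fetch robots.txt once, check every URL against it before requesting the page, and pause between requests. Here is a minimal sketch using Python's standard urllib.robotparser module - the crawler name and site are made up for illustration:

    import time
    from urllib import robotparser

    ROBOT_NAME = "ExampleBot"        # hypothetical user-agent string
    SITE = "http://www.example.com"  # hypothetical site

    # Fetch and parse the site's robots.txt once, up front.
    rules = robotparser.RobotFileParser()
    rules.set_url(SITE + "/robots.txt")
    rules.read()

    for path in ["/index.html", "/webmaster_logs/today.log"]:
        # Honour the exclusion rules before requesting any page.
        if rules.can_fetch(ROBOT_NAME, SITE + path):
            print("OK to fetch", SITE + path)
        else:
            print("robots.txt forbids", SITE + path)
        time.sleep(1)  # pause between requests; don't hammer the server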

Interestingly, fewer than 3 per cent of the sites on the World Wide Web have robots.txt files. In other words, lots of programs on the Internet are dutifully searching for a file that almost no site provides. That's a lot of traffic for relatively small results. DeJarnett pointed me to a useful source of more information: info.webcrawler.com/mak/projects/robots/exclusion.html.

Forbidden territory

Another reader pointed out that there are some very interesting snippets of information you can glean from reading robots.txt files on public Web sites - sensitive information that often isn't guarded at all.

"I love robots.txt," this person wrote. "It's one of the best ways to find out where the Webmaster doesn't want you to go."

Rather than password-protecting sensitive areas of a site, many Webmasters simply put these areas online without any security - and indicate in the robots.txt file that these areas are off-limits to robots.

Ironically, that information can point the informed reader directly to those unsecured documents.

For example, check out www.cnn.com/robots.txt. The disallowed directory /webmaster_logs would be quite tempting to anyone curious to know how much traffic CNN's Web servers are getting - especially when major news stories are breaking.
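
To see why, consider what such a file might look like. Only the /webmaster_logs directory below comes from CNN's actual file as described above; the other entries are hypothetical. But the format - a User-agent line naming which robots the rules apply to, followed by Disallow lines listing off-limits path prefixes - is the real convention:

    User-agent: *
    Disallow: /webmaster_logs   # the directory mentioned above
    Disallow: /staging          # hypothetical: unreleased pages
    Disallow: /partners         # hypothetical: partner areas

Every path after a Disallow: line is one that compliant robots will skip - and one that a curious human can type straight into a browser.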

Robots.txt files can also give hints about partnerships between sites and the major search engines.

You should create and post a robots.txt file if you wish to keep certain areas out of the search engines - but keep in mind that anything listed in the robots file is readable by anyone who cares to explore it. Password protection is a better strategy for keeping prying eyes out of closed areas.
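
What password protection looks like depends on your server. As one sketch - assuming an Apache server, which the column doesn't specify - you could protect a directory with HTTP Basic authentication via an .htaccess file; the file paths below are illustrative assumptions:

    # .htaccess placed in the directory to be protected
    # (the AuthUserFile path below is a hypothetical example)
    AuthType Basic
    AuthName "Webmaster area"
    AuthUserFile /usr/local/etc/httpd/.htpasswd
    Require valid-user

The password file itself is created with Apache's htpasswd utility, for example: htpasswd -c /usr/local/etc/httpd/.htpasswd webmaster. Unlike a robots.txt entry, this challenges every visitor - robot or human.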

