Recently we changed the configuration of the traffic-analysis program we use at Intuitive Systems so that it reports error hits as well as pages successfully delivered. The results have proven most interesting.
The most fascinating piece of information is that our site receives weekly requests for the file "robots.txt". Your site probably receives similar requests. And yet odds are good that, like us, you don't have this file on your site.
The secret of the robots.txt file is that it's used by automatic Web-searching systems (otherwise known as "robots") to ascertain whether you want to have your site indexed. You probably already make use of these robots when you look for things on the World Wide Web, even though you may not be aware of it.
Many search sites, including AltaVista, WebCrawler, HotBot, and Lycos, use robot programs to "crawl" the Web and index the Web pages they find. When you enter a search at these sites, you're searching through the indexes created by their robots.
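To make the mechanism concrete, here is a minimal sketch of how a well-behaved robot might consult a robots.txt file before indexing a page. It uses Python's standard urllib.robotparser module. The file contents, the robot name "ExampleBot", and the www.example.com addresses are all invented for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents (illustrative only):
# every robot is asked to stay out of /private/, and one
# made-up robot, "ExampleBot", is asked to stay away entirely.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(robot_name, url):
    """Return True if the rules above permit robot_name to fetch url."""
    return parser.can_fetch(robot_name, url)


if __name__ == "__main__":
    for robot, url in [
        ("SomeCrawler", "http://www.example.com/index.html"),
        ("SomeCrawler", "http://www.example.com/private/notes.html"),
        ("ExampleBot", "http://www.example.com/index.html"),
    ]:
        print(robot, url, may_index(robot, url))
```

Note that the file is only a request: it works because cooperative robots choose to read and obey it.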
What's most startling about this, however, is the frequency with which our site is being searched for the robots.txt file. On Wednesdays alone, our server logs more than a dozen requests for the robots.txt file. Over the course of a week, we'll see 25 hits, which at face value suggests that 25 different crawler programs are visiting our site every week.
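One way to test whether 25 hits really means 25 crawlers is to count how many distinct clients asked for the file. The following is a rough sketch, assuming the server writes its access log in the Common Log Format; the sample lines, addresses, and dates are made up for illustration and are not taken from our logs.

```python
from collections import Counter

# Invented sample entries in Common Log Format (illustrative only).
SAMPLE_LOG = """\
192.0.2.10 - - [05/Mar/1997:10:12:01 -0800] "GET /robots.txt HTTP/1.0" 404 207
192.0.2.10 - - [05/Mar/1997:14:40:09 -0800] "GET /robots.txt HTTP/1.0" 404 207
198.51.100.7 - - [06/Mar/1997:02:03:44 -0800] "GET /robots.txt HTTP/1.0" 404 207
198.51.100.7 - - [06/Mar/1997:02:03:45 -0800] "GET /index.html HTTP/1.0" 200 5120
"""


def robots_requests_by_client(log_text):
    """Count requests for /robots.txt per client address."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split('"')
        if len(parts) < 3:
            continue  # not a well-formed log entry
        request = parts[1].split()  # e.g. ['GET', '/robots.txt', 'HTTP/1.0']
        if len(request) >= 2 and request[1] == "/robots.txt":
            client = line.split()[0]  # first field is the client address
            counts[client] += 1
    return counts


if __name__ == "__main__":
    counts = robots_requests_by_client(SAMPLE_LOG)
    print("total hits:", sum(counts.values()))
    print("distinct clients:", len(counts))
```

If the number of distinct clients turns out to be much smaller than the number of hits, a handful of robots are simply returning more than once. Even then, one crawler may well run from several addresses, so the count is only a rough guide.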
But that seems unlikely to me for one simple reason. If you've visited one of these search sites lately, you know that they are woefully out of date.
Although they are built around a great idea - letting the robots find new Web sites rather than waiting for the sites to be registered in a directory such as Yahoo! - the reality is that the search sites have fallen behind the rapid changes on the Web.
Do a search on any of these sites and odds are good you'll find that 20 per cent or more of the links are dead, broken, or point to something other than what you expected.
That makes me sceptical that the major search sites are really hitting my site once per week. So what is producing all these queries for robots.txt? For now, it's a mystery, but I'd be interested in hearing if you are finding the same pattern on your own Web site.
Robot-based indexing systems are a better choice for intranets, because an intranet is by definition a constrained space with hundreds or thousands of Web pages, instead of millions.
When examining search engines, be sure to take a look at how the results of searches are displayed. My wife was recently searching through the online archive of a local newspaper and was baffled by the search results.
"What are the percentages listed next to each match?" she quite reasonably wanted to know. The answer, of course, was that it's the "relevancy ranking" for the listed document. Results are scored from 0 per cent to 100 per cent, based on their relevance to your query - as far as the search engine can determine.
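As a purely illustrative sketch of where such a percentage could come from, the toy scorer below rates each document by the share of query words it contains. This is an assumption made for the sake of example, not the method used by the newspaper's archive or by any particular search engine, whose formulas are generally more elaborate. The function names and sample documents are invented.

```python
def relevance_percent(query, document):
    """Toy score: percentage of distinct query words found in the document."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0
    doc_words = set(document.lower().split())
    matched = query_words & doc_words
    return round(100 * len(matched) / len(query_words))


def rank(query, documents):
    """Return (score, title) pairs, highest score first, ties by title."""
    scored = [(relevance_percent(query, text), title)
              for title, text in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Invented sample archive (illustrative only).
DOCUMENTS = {
    "A": "city council votes on school budget",
    "B": "school sports scores",
    "C": "weather forecast",
}

if __name__ == "__main__":
    for score, title in rank("school budget", DOCUMENTS):
        print(f"{score:3d}%  {title}")
```

Even this simple scheme shows why the numbers can mislead: a document can contain every query word and still not be about what you wanted.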
Many Web-search systems display a ranking for the matches they show, but how is this ranking generated and what does it mean? If you explore the search results, the ranking often seems to have minimal significance, if any. The No. 2 or No. 3 match is often what you seek, and No. 1 is often unrelated to your original search.
This can be confusing and frustrating on the Internet, but it's of particular importance for an intranet index and search system.
Dave Taylor is president of Intuitive Systems and can be reached at firstname.lastname@example.org