How to identify search engine spiders and webbots

Contents of this page

Introduction
Is it a human? is it a spider? how do I tell the difference?
How do I find out what a particular agent is?
Tips on searching for user agents in search engines

Introduction

Identifying search engines and other agents that visit your site isn't rocket science, but it can be a painstaking process with a real possibility of failure. This page describes some of the methods I've used to track down the search engine spiders, webbots and other user agents that visit my site.

First you need access to your server logs. How you get at these will depend on your ISP. Some of the free ISPs may not grant access to such logs; others may grant FTP access to the full logs. I have a sample log file explained if you're unfamiliar with them.

The server log typically contains a one-line entry for each "hit" on your site, where a "hit" in this context is usually a request for an HTML page, an image file, a style sheet (.css) file, or whatever else you're serving from your site. Each entry contains many fields, but the ones of interest here are:

The IP address, or DNS address. This is the address of the machine that made the visit.
The referring URL. In principle this is the URL of the page the user agent was on before it requested yours. Often it is blank, and very occasionally it is faked to advertise some URL, in the hope of attracting the curious webmaster or ending up as a hyperlink if the log is ever published online.
The user agent identifier. In principle this identifies the spider/browser/webbot by name. Sometimes a contact URL or email address is included. This field is entirely optional, and can be faked (for example Opera has the ability to look like IE to allow it to access sites that don't otherwise recognize it). I did once see an agent that changed this name for every new request that it issued (presumably as a form of camouflage).

Of these three bits of information only the first can be relied upon, and it will always be present. Often your ISP will have done a DNS lookup and will give you the node name rather than the IP number. The remaining two are set optionally by the user agent visiting your site, and depending on your ISP you may not have access to the referring URL or user agent at all. Without these your search is pretty much at an end, unless the IP/DNS is sufficient to identify your visitor. If you don't have this information in your logs, ask your ISP if it can be enabled.

You may occasionally want to refer to other fields, such as the number of bytes transferred and the HTTP status code, when trying to determine the webbot's "behaviour" with respect to the content on your site.
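As a rough illustration, here is a minimal Python sketch that pulls these fields out of a single log entry. It assumes your server writes the widely used NCSA "combined" log format, which may not be true of your ISP; the function name parse_log_line and the sample entry are made up for this example.

```python
import re

# Assumed layout: NCSA "combined" log format. Your server's format may differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of the fields of interest, or None if the line doesn't match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # A missing, empty or "-" value means the visitor (or the log) didn't supply it.
    for key in ("referrer", "agent"):
        if entry[key] in (None, "", "-"):
            entry[key] = None
    return entry

# Invented sample entry, using a documentation-only IP address.
sample = ('192.0.2.10 - - [10/Oct/2003:13:55:36 +0000] "GET /index.html HTTP/1.0" '
          '200 2326 "-" "ExampleBot/1.0 (http://www.example.com/bot.html)"')
entry = parse_log_line(sample)
print(entry["host"], entry["status"], entry["bytes"], entry["referrer"], entry["agent"])
```

If the referrer and user agent fields are absent from your logs altogether, the same pattern still matches and simply reports them as None.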

Is it a human? is it a spider? how do I tell the difference?

Humans, spiders and webbots have different patterns of browsing. You can't be absolute about this, but here are some guidelines:

A human browser will read only a few pages over a few minutes (although they may come back in an hour or so). The pages they read will often be linked together (unless they use your site search engine, or are being referred from outside). The server log will show them loading all the graphics for the pages they visit (although some people browse with graphics switched off). These people will be the ones mostly responsible for accessing your .zip and .mp3 files, and they will occasionally access a "favicon.ico" file, indicating they are an IE user who has chosen to bookmark your site (congratulations!).

User agents that are being driven by humans will usually come from a variety of IP addresses, which will often have a DNS lookup that is recognisable as being an ISP. However, this pattern will only become apparent if you get multiple visits from users using the same agent.

A spider will normally have assembled a list of pages on your site, and will visit these more or less at random (except during its first visit, when it may appear to follow links). Well-behaved spiders will read "robots.txt" to see what they may do with your site, and will only read a few pages at a time so as not to overload your site. This can mean that a "deep crawl" of your site may involve many visits over hours or days. Periodically this process will be repeated to freshen the spider's index. Most spiders ignore graphics and other binary files, although increasingly there are some that search specifically for MP3, JPG and other sorts of searched-for binary content.

Most spiders always come from the same range of IP addresses, and these addresses will often have the same domain name as the parent site (e.g. piano.excite.com is one of Excite's spider engines).

"Webbots" is used here to mean any other web robot. These will vary in behaviour according to the task they have been set. Site copiers will grab virtually all your pages, as linked together, in a very short time. On the other hand, link checkers may visit only one or two pages every day or so. Download agents may perform several "hits" on a large .zip file, each smaller than the size of the file, but with the total transfer adding up to the correct amount.

Webbots may come from the same IP address each time (usually indicating that they are some form of web-based service), or from a user's ISP IP address (indicating it is some piece of software running on a user machine).
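The guidelines above can be turned into a crude tally per user agent. The following Python sketch is only that: the names summarise_agents and guess_kind, the list of image suffixes and the decision rules are all my own assumptions for illustration, and real visitors will not always fit them.

```python
from collections import defaultdict

# Assumed set of "graphics" file types; adjust to suit your own site.
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".ico")

def summarise_agents(entries):
    """entries: iterable of (ip, path, user_agent) tuples taken from a server log."""
    stats = defaultdict(lambda: {"hits": 0, "ips": set(), "robots_txt": False, "images": 0})
    for ip, path, agent in entries:
        record = stats[agent]
        record["hits"] += 1
        record["ips"].add(ip)
        if path == "/robots.txt":
            record["robots_txt"] = True
        if path.lower().endswith(IMAGE_SUFFIXES):
            record["images"] += 1
    return stats

def guess_kind(record):
    """Very rough guess only; the rules here are arbitrary assumptions."""
    if record["robots_txt"]:
        return "spider or well-behaved webbot"
    if record["images"] > 0 and len(record["ips"]) > 1:
        return "probably a browser driven by humans"
    return "unknown"

# Invented entries, using documentation-only IP addresses.
entries = [
    ("192.0.2.1", "/robots.txt", "ExampleSpider"),
    ("192.0.2.1", "/a.html", "ExampleSpider"),
    ("198.51.100.7", "/index.html", "ExampleBrowser"),
    ("198.51.100.7", "/logo.gif", "ExampleBrowser"),
    ("203.0.113.9", "/index.html", "ExampleBrowser"),
]
for agent, record in summarise_agents(entries).items():
    print(agent, record["hits"], len(record["ips"]), guess_kind(record))
```

Treat the result as a hint about where to look next, not as an identification.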

How do I find out what a particular agent is?

You can work out the agent from its name. First look at my webbots page. I've tracked down dozens of these already, so it's worth seeing if I got there first. Currently that page is updated every few weeks, and new entries are added at the rate of around 10/month.

If it's not yet there, take the following steps :-

Look at the IP address and the user agent string. If you're lucky the user agent string will contain either a URL or an email address. These are provided sometimes to allow people like us to contact the owner of the agent to see what's going on. Either visit the URL, or use the email address to get a domain name and see if you can work out from that site's home page what the agent is up to.
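If you have a lot of agent strings to check, picking out the URLs and email addresses can be automated. This is a minimal Python sketch; the function name contact_details and the two regular expressions are my own simplifications and will miss unusual addresses.

```python
import re

# Simplified patterns, good enough for typical agent strings only.
URL_PATTERN = re.compile(r'https?://[^\s;,()<>"]+', re.IGNORECASE)
EMAIL_PATTERN = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def contact_details(user_agent):
    """Return (urls, emails, email_domains) found in a user agent string."""
    urls = URL_PATTERN.findall(user_agent)
    emails = EMAIL_PATTERN.findall(user_agent)
    # The part after "@" is the domain whose home page is worth a visit.
    domains = [email.split("@", 1)[1] for email in emails]
    return urls, emails, domains

# Invented agent string for illustration.
print(contact_details("ExampleBot/2.1 (+http://www.example.com/bot.html; bot@example.org)"))
```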

More usually the agent field doesn't have this information. So now look at the IP address(es) that you've seen associated with this user agent. If you've seen many visits then you'll know if they come from a single IP address (or a set of similar IP addresses) or from a wide range.

If there's a single IP address with a DNS lookup on it (e.g. piano.excite.com), then extract the domain name (excite.com) and visit the web page to see what it's about. Most of the major search engines are easily identified this way.
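If your logs only give the numerical address, you can do the DNS lookup yourself. The Python sketch below uses the standard socket.gethostbyaddr call; the helper names reverse_lookup and parent_domain are mine, and keeping the last two parts of the name is a naive assumption that gives the wrong answer for domains such as example.co.uk unless you ask for three parts.

```python
import socket

def parent_domain(hostname, labels=2):
    """Naively keep the last `labels` dot-separated parts of a host name."""
    parts = hostname.rstrip(".").split(".")
    return ".".join(parts[-labels:])

def reverse_lookup(ip_address):
    """Return the host name for an IP address, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    except OSError:
        return None
    return hostname

print(parent_domain("piano.excite.com"))
```

You would then call reverse_lookup on an address from your own log and, if it returns a name, pass that name to parent_domain.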

If the IP address has no DNS lookup (i.e. it's all numbers), then either run TRACERT to trace a route to that IP address, or use something like VisualRoute to do so. This will give you a series of IP "hops" that will route data from your machine to the specified IP address. The first few hops will be close to your machine, and will probably all be to do with your ISP, but the last few hops will be close to the machine that visited you. Often when there's a numerical IP address, running a trace will reveal a more meaningful DNS name a few hops before the destination is reached. Using a utility such as VisualRoute can yield a lot of information about these last few hops.

Again, if the last few hops yield a meaningful DNS lookup, then extract the domain name and visit the web site.

If you can't get a meaningful DNS using the above method, simply use the numerical address to access a web page by constructing a URL of the form

http://nnn.nnn.nnn.nnn

where "nnn.nnn.nnn.nnn" is the numerical IP address. This will sometimes be routed to the parent web site, but more commonly it will either simply time out or display some standard screen or error message that will tell you nothing about the owner. Sometimes it's worth repeating this a few weeks later, as webbot owners notice attempts to access this node, and put up a web page explaining what they're doing.

If you can't identify the source from a single IP, or if the user agent shows signs of being run on various user machines, then you'll need to turn your attention to the user agent name and see if you can track this down using search engines.

Tips on searching for user agents in search engines

One last trick before we resort to using search engines. Look at the user agent text. Does it have a snappy name like webtwin, webstripper or the like? If it does, try constructing a domain name from this, usually by adding www. and .com before and after. Chances are that if www.webtwin.com exists (and it does), it will be the web site describing this particular agent.
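The guesswork is simple enough to script. In this Python sketch the function name candidate_site is hypothetical, and taking the first word of the agent string as the "snappy name" is my own assumption; whether the resulting site exists is something you still have to check by visiting it.

```python
import re

def candidate_site(user_agent):
    """Guess a host name by wrapping the agent's first word in www. and .com."""
    match = re.match(r'[A-Za-z][A-Za-z0-9-]*', user_agent.strip())
    if match is None:
        return None
    return "www." + match.group(0).lower() + ".com"

print(candidate_site("webtwin"))
```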

That didn't work? Oh dear. Now things get tough.

If the user agent is named after a common word (like "Jack") you probably should give up now while you've got some hair left. Searching for a user agent by name can be a needle-in-a-haystack job, and you need to have a good understanding of how to refine searches using a suitably powerful search engine. I usually use a mixture of Altavista and Google, as between them they have a comprehensive reach, produce sensible results, and can be highly refined (particularly Altavista).

Your problems come from a number of sources :-

If the user agent string uses common words, you will get loads of false matches.
A large number of sites publish their web stats online. Since these will include your user agent, this will give a large number of false matches.
The owner of the webbot may not have chosen to document online what the webbot is, and what it's up to.

First try entering the user agent string (or a suitable substring) into your search engine as a search phrase. In doing these searches it's best to omit any version numbers etc. in the string. Look at the number of results, many of which will be web server logs, unless the agent is comparatively new or rare.
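Stripping the version numbers can be done mechanically. This Python sketch is one possible way to do it; the function name search_phrase and the rule for what counts as a version token (an optional "v", digits and dots, an optional trailing letter) are my own assumptions, and it also drops anything in brackets.

```python
import re

# Assumed shape of a version token: "1.2", "v2.0b", "3" and similar.
VERSION_TOKEN = re.compile(r'v?\d+(\.\d+)*[a-z]?', re.IGNORECASE)

def search_phrase(user_agent):
    """Turn a user agent string into a quoted phrase without version numbers."""
    # Drop bracketed comments such as "(http://...)" or "(compatible; ...)".
    text = re.sub(r'\([^)]*\)', ' ', user_agent)
    # Split on white space and "/" so "Name/1.2" becomes "Name" and "1.2".
    tokens = re.split(r'[\s/]+', text)
    words = [t for t in tokens if t and not VERSION_TOKEN.fullmatch(t)]
    return '"' + " ".join(words) + '"'

# Invented agent string for illustration.
print(search_phrase("ExampleCrawler/1.2 (http://www.example.com/)"))
```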

Assuming you've got too many hits, try some or all of the following :-

Eliminate the log file pages. In Altavista this is probably best done by adding

-title:statistics -title:stats -title:access -title:agents

i.e. exclude pages with the word "statistics" etc. in the title. Of course we may just have lost the page we want, so another option would be to look for pages not containing another common user agent's text.

Look for pages with your user agent in the title. Altavista can be a little strange at doing this, so it may be an idea to simply select the most significant word from the agent string.
Look for pages with the most significant word in the URL. Again at Altavista this would be

+url:<word> -title:statistics -title:stats -title:access -title:agents

If you still have too many hits, add words you'd expect to be on the page, like "search" or "spider", to your search, only accepting pages that include these words as well as your original words.
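If you find yourself retyping these exclusions, a small helper can assemble the query text for you. This Python sketch just builds the string using the Altavista syntax shown above; the function name build_query is hypothetical, and using a leading "+" to mark a required word is my assumption based on the +url: example.

```python
# Title words that usually mark a published log or statistics page.
LOG_TITLE_WORDS = ("statistics", "stats", "access", "agents")

def build_query(word, required=(), in_url=False):
    """Assemble an Altavista-style query that excludes log file pages."""
    first = "+url:" + word if in_url else '"' + word + '"'
    terms = [first]
    # Assumption: a leading "+" marks a word that must appear on the page.
    terms += ["+" + w for w in required]
    terms += ["-title:" + w for w in LOG_TITLE_WORDS]
    return " ".join(terms)

print(build_query("webtwin", in_url=True))
print(build_query("webtwin", required=("spider",)))
```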

Once you get down to a sensible number of pages (<30), start looking at these pages to see if any tell you about the agent. Usually such pages will stand out through their title, description or (sometimes) their URL. Good URLs to check are from software library sites like DaveCentral and ZDNet as these will often describe the software involved better (or be higher placed in the results) than the product's own homepage.

If the search engines don't yield a result, try a visit to DejaNews searching for the desired user agent string. It's most likely you'll simply find posts from people quoting parts of their server logs, and you may even find posts asking what the agent is. If you find such a post, follow the thread to see if anyone posted a useful answer (most times they won't if all the above methods have failed). In a last gasp of desperation you could email the original poster to see if they ever found out. If nothing else you'll get a shoulder to cry on.

Of course, if you finally succeed, please do drop me a line at info@jafsoft.com so I can add it to my list :-)

This page is © 2000-2004 John A Fotheringham. It may not be reproduced without permission,
although you are welcome to save a copy for personal use to your hard disk.

For more information contact info@jafsoft.com.