Chapter 20

Usage Statistics and Log Analysis



There will come a time when someone wants proof that the Intranet is making the company more efficient. This may be as simple as showing how much less paper is being used or how many people answered their own questions using the online help desk.

However you choose to convince management that the Intranet has paid off, you will need to keep track of usage statistics. This chapter shows you what sort of information can be reported and introduces some utilities to use.

In this chapter, you will learn:

- The difference between hits and visits
- Why an Intranet benefits from log analysis
- The common log format and other server log formats
- What the error log can tell you about your site
- How to use analysis tools such as Analyze, wwwstat, IIStats, and wusage

Hits versus Visits

When people talk about Web sites and how popular they are, the term commonly used is hits. Some big sites may get hundreds of thousands of "hits" per day. This does not mean hundreds of thousands of visitors or even hundreds of thousands of pages; it means hundreds of thousands of connections.

A single HTTP connection may download one graphic, one text page, or one file; the key word here is one. Each page that shows up in a browser may contain many separate pieces, and each piece causes a new connection to be established. Figure 20.1 shows a simple HTML page that contains three graphics. This would cause four hits to be made on the server.
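A page along the lines of Figure 20.1 might look like the following sketch (the filenames here are only illustrative):

<HTML>
<HEAD><TITLE>Welcome</TITLE></HEAD>
<BODY>
<!-- Retrieving this page takes four hits: one connection -->
<!-- for the HTML itself and one for each of the images.  -->
<IMG SRC="icon1.gif"> <IMG SRC="icon2.gif"> <IMG SRC="icon3.gif">
<P>Welcome to our Intranet home page.</P>
</BODY>
</HTML>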

Figure 20.1: This page would cause four hits. One hit for the text and one for each icon.

Figure 20.2: This form can be used to get usage statistics.

Since HTTP is a stateless protocol, it isn't possible, using normal means, to accurately track the number of unique visitors. It is possible, however, to estimate the number by looking at the unique hosts in the log files.

Log files list the hostname along with the username, if you are using authentication. Authentication enables you to follow a user through a site, but on the Internet many users find logging in inconvenient and simply leave instead. On an Intranet, however, you can require users to log in and use the username to track them.

Some log files also record the HTTP_REFERER variable, which records the page the user was viewing when the request was made. It is possible to follow HTTP_REFERER values back through the log files and reconstruct the path a specific user followed. This has pitfalls, though: a user might type a URL directly into the browser, and not all browsers send the HTTP_REFERER variable.

Counting only the hits on the home page does not work either; some users go directly to a specific URL within the site and never touch the home page at all. Accurately separating the number of visitors from the number of hits is one of the more daunting tasks for Web site administrators.
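As a rough first cut, you can approximate the visitor count by counting unique hosts. Here is a minimal Perl sketch of the idea; it assumes a common log format file and, as noted above, it undercounts when several people share one host and overcounts when one person connects from several hosts:

#!/usr/local/bin/perl
# visitors.pl -- estimate visitors by counting unique hosts
# in a common log format access log. Run as:
#   perl visitors.pl access_log
%hosts = ();
$hits = 0;
while (<>) {
    ($host) = split;        # the hostname is the first field
    $hosts{$host}++;
    $hits++;
}
printf "%d hits from %d unique hosts\n", $hits, scalar(keys %hosts);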

Why Analyze?

Gathering useful statistics from the server logs is one of the main goals of many Web administrators. Because of this, many tools are available that analyze the log files and generate statistics.

Common statistics include:

- The number of hits per day, week, or month
- The number of unique hosts and the domains they come from
- The most frequently requested pages
- The total number of kilobytes transferred
- Which outside sites refer visitors to your pages

For an Internet site these are all very important statistics, but for an Intranet some of them don't make any sense. Still, Intranets can benefit from a log analysis program.

Some reasons for an Intranet to analyze the access logs include:

- Finding out which documents are being used and which are going unused
- Spotting bad links and other errors before users report them
- Planning server capacity as usage grows
- Watching for security problems, such as access from outside the network
- Showing management that the Intranet is paying off

Log Files

All Web servers generate log files, or at least have the option to create them. The log file contains information such as the URL requested, the status of the request, the date and time, and the hostname of the client. Some log files also contain the HTTP_REFERER variable and the browser type, which can be useful when tracking a user through a site or deciding which features to add.

Each server has a slightly different log file format, though most of them support the common log format. Netscape's proxy server has two additional formats, InterNotes logs its information to a Notes database, and Microsoft IIS can be set up to log to a back-end database such as SQL Server.

There are many tools available to gather statistics from the log files; these are covered later in this chapter. Before seeing what information can be found using these tools, we need to understand the different log file formats.

Common Log Format

The common log format is the de facto standard. Most servers can produce it and most analysis tools understand it, which makes it a good choice of log file format.

The common log format logs the following standard information for each request:

- The hostname or IP address of the client
- The remote user name reported by the ident protocol (usually just a dash)
- The authenticated username, if the user logged in
- The date and time of the request
- The request line itself, in quotes
- The status code returned to the client
- The number of bytes transferred

Using the common log format enables you to get many statistics, including:

- The number of hits over a given period
- The number of unique hosts
- The number of unique URLs accessed
- The number of errors, such as 404 Document Not Found
- The total kilobytes transferred

The common log format doesn't enable the server to log where people come from (HTTP_REFERER) or what browser they used. Some servers log this information in a separate file; others append it to the same file, in which case the result is referred to as the combined common log format. With these two fields you can:

- Reconstruct the path a user followed through the site
- See which pages, and which outside sites, refer visitors to your pages
- Tailor your pages to the browsers your users actually run

Here is a sample common log file:

client - rich [09/May/1996:13:43:20 -0400] "GET / HTTP/1.0" 200 477
client - rich [09/May/1996:13:45:11 -0400] "GET /product1.html HTTP/1.0" 404 -
client - rich [09/May/1996:14:01:41 -0400] "GET /product.html HTTP/1.0" 200 204
client - rich [09/May/1996:14:02:02 -0400] "GET /images/product.gif HTTP/1.0" 200 9534
client - rich [09/May/1996:14:04:49 -0400] "GET / HTTP/1.0" 200 477

This simple example shows a user who authenticated himself as rich connecting from a host called client on May 9, 1996 at 1:43 PM. He first got the server's home page ("GET /"). This document was 477 bytes long and was returned OK (status 200).

He requested the next page two minutes later but got an error; the 404 status means the document was not found, which probably means we have a bad link on the home page. Sixteen and a half minutes later he retrieved product.html, which is 204 bytes long and appears to include product.gif; the image was retrieved 21 seconds later and was 9,534 bytes long.

It looks like he read the page for almost three minutes before going back to the home page. Then he left the site.

This is all guesswork. Since we don't have the HTTP_REFERER logged, "rich" could have gone to other documents in his hotlist between our pages and come back. There could also be two people using the same machine, both authenticating themselves as "rich."
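When a tool needs more than this kind of eyeballing, the first step is to split each line into its fields. Here is a minimal Perl sketch, assuming common log format input; it prints any request that failed with a 4xx or 5xx status:

#!/usr/local/bin/perl
# parselog.pl -- split common log format lines into fields and
# report failed requests. Run as:  perl parselog.pl access_log
while (<>) {
    # host ident authuser [date] "request" status bytes
    if (/^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d+) (\S+)/) {
        ($host, $user, $date, $request, $status) = ($1, $3, $4, $5, $6);
        print "$host $user $request -> $status\n" if $status >= 400;
    }
}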

InterNotes NSF Files

InterNotes writes its logs into the normal Notes server log, LOG.NSF. This enables the administrator to create custom Lotus Notes views for the database. Since InterNotes is a WWW gateway rather than a server, the logs contain information only on the files that users access.

This is not to say the information is not useful; rather, it describes client activity instead of server activity. The logs can still tell you which files users are accessing and which ones are going unused.

Microsoft Logs

The Microsoft server can log its messages to a log file or to an ODBC-capable database. To set up logging, you use the Internet Service Manager.

In the Internet Service Manager, click the service you are interested in and go to the Logging tab. You need to check the box next to Enable Logging for the server to log messages.

NOTE
When logging to a file, the maximum line length is 1200 bytes and the maximum field length is 150 bytes. When logging to a database, the maximum field limit is 200 bytes.

In this screen you can also set the server to start a new log file every day, week, or month, or when the file reaches a certain size. You can also specify the directory in which to keep the log files.

Alternatively, you can set the messages up to go to an SQL or ODBC database server. This requires you to enter the data source name, the table name, the username, and password.

Once the server is set up to log messages, you can look at them by using the Internet Service Manager. If you sent the messages to a database, you can also use standard database tools.

It is also possible to use standard log analysis tools that understand the common log format, because Microsoft includes a program to convert the Microsoft log file format to the common log format.

This program is called convlog.exe and is located in \Inetsrv\Admin. The more popular options for convlog are:

Error Logs

In addition to the access log, there is also the error log. This is where the Web server logs any errors that it encounters. These errors can be very useful in debugging problems and spotting bad links.

The error log is a simple file with one line per error; each line contains the date and time stamp and the error message.

Some of these error messages are warnings such as:

These errors mean that, for some reason, the connection never finished. This usually means the network is too slow. If you get a lot of these messages, you may need more bandwidth on your connection to your ISP. They might also mean that the client's network is slow or has problems. Don't worry if you see occasional warning messages like these.

There are some errors that you should look at right away. These include:

The first error message means someone tried to access a document that is not there. This could mean there is a bad link, or that someone typed in the wrong URL.

The second and third error messages indicate a problem with a CGI program. The second means the script didn't return the correct header information, while the third is fairly obvious: the perl executable is missing, or the script is looking for it in the wrong place.

One other message you may see often refers to a file called /robots.txt. This is used by Web robots to figure out what on your site should be indexed and what directories to avoid indexing. More information on the robots.txt file is available at http://info.webcrawler.com/mak/projects/robots/norobots.html.
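A robots.txt file is just a plain text file in the server's document root. A minimal example, which tells every robot to stay out of one directory (the directory name here is only illustrative), looks like this:

User-agent: *
Disallow: /private/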

CAUTION
If your site is strictly for Intranet use and should not be connected to the Internet, investigate this access. If you are running a spider internally, this may be all it is, but if you aren't, then it is possible your Web server is accessible from the Internet. You can check your log files for clients that are not part of your network to verify if anyone has connected. If you see connections from outside your site, you need to check your security very carefully.
Another security consideration is failed CGI scripts. Some CGI scripts have holes in them that may allow a cracker to read any file on the machine. Check out any failed connections to CGI scripts.

Analysis Tools

We have examined the reasons for looking at log files and we know what is in them; this section covers some basic tools for getting the information out of the log files and into a useful format.

Some servers include analysis tools; others can send the information to a database for analysis. Even if your server can do neither, the log files can still yield information using nothing more than the native system tools.

One of the quickest ways to get statistics out of the access_log is to use the native system tools. UNIX machines have the "grep" program and DOS machines have "FIND." These two programs enable you to get information out of the log file quickly. Windows users can also open a DOS window and use FIND. Macintosh users can download macgrep, a Macintosh version of the UNIX grep command, from ftp://software.unc.edu/pub/mac/utils/macgrep.hqx.

Some of these statistics include number of hosts, number of hits, and the number of times a particular host or network accessed the Web server. For example, to list all the hosts from nasa.gov that have accessed your Web site, use the following, in UNIX:

grep -i nasa.gov access_log

DOS and Windows users can use:

FIND /I "nasa.gov" access_log

Macintosh users can use macgrep to search for the phrase nasa.gov. Note that these commands list every occurrence of the phrase: if you have a page called nasa.gov.html, every access to that page is listed, regardless of where the visitor came from. The /I and -i options tell FIND and grep to ignore case so that, for example, Nasa.Gov is also matched.
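On UNIX you can go a step further by chaining the standard awk, sort, uniq, and wc tools. For example, the following pipeline counts the number of unique hosts that have accessed the server (awk prints the first field of each line, which is the hostname):

awk '{print $1}' access_log | sort | uniq | wc -l

A similar pipeline lists each distinct nasa.gov host, one per line:

grep -i nasa.gov access_log | awk '{print $1}' | sort | uniq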

There are many freeware and shareware log file analysis tools. These are usually written in Perl or another scripting language, which makes them easy to adapt to your specific needs.

The first one we will look at is Analyze from Netscape, which is included with the Netscape servers. We will also look at wwwstat, a common log format analyzer; IIStats, which generates statistics from native Microsoft IIS log files; and wusage, another common log format analyzer.

Netscape Analyze

Netscape includes some extra programs with the server package. One of these is Analyze, a logfile analyzer. This program is stored in the ns-home/extras/log_anly directory. In this directory you find:

The Analyze package can be run in two different ways: from the command line or from the Web server.

Analyze has the following options:

The count option, -c, has a number of things that can be counted. These include:

h - the number of hits
n - the number of 304 (not modified, use local copy) accesses
r - the number of 302 (redirect) accesses
f - the number of 404 (document not found) accesses
e - the number of 500 (server error) accesses
u - the total number of unique URLs accessed
o - the total number of unique hosts
k - the total kilobytes transferred
c - the total kilobytes saved by caches
z - don't count anything

The easiest way to use Analyze is to create a Web page for it, like the form shown in Figure 20.2. This allows statistics to be gathered without having to remember the syntax of the Analyze program.

To set up the Web page you need to perform the following:

wwwstat

wwwstat is a Perl script that can be used to generate usage summaries. It can generate reports based on:

These statistics include the following:

wwwstat is available from http://www.ics.uci.edu/WebSoft/wwwstat. Once you download and unpack the program you should see:

It may be necessary to edit the wwwstat program before you run it; this is covered in the README file. You will probably have to change the location of the Perl executable and the locations of various files. These changes are all in the first part of the code and are easy to make.
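The edits usually look something like the following. The variable names below are illustrative only; check the README and the comments in your copy of wwwstat for the real ones:

#!/usr/local/bin/perl
# The first line of wwwstat must point at your copy of Perl;
# change /usr/local/bin/perl if Perl lives elsewhere.

# Near the top of the script, configuration variables point at
# your log file and site (the names here are hypothetical):
$AccessLog = '/usr/local/etc/httpd/logs/access_log';
$SiteName  = 'www.yourcompany.com';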

You can run wwwstat with no options to get some nice output or you can use options to get specialized reports. These options are covered in the README file. Some of the more useful options are:

Figure 20.3 shows the output from a wwwstat run.

Figure 20.3: wwwstat can generate pages like this one.

There is also a package called gwstat that can take the output from wwwstat and create graphs of the statistics, as shown in Figure 20.4. This program is available from http://dis.cs.umass.edu/stats/gwstat.html.

Figure 20.4: gwstat creates graphs like these from the wwwstat output.

IIStats

IIStats is a program that analyzes native Microsoft IIS logs. Unlike some statistics programs, it doesn't require the log files to be converted to the common log format first.

IIStats can be downloaded from http://www.cyber-trek.com/iistats/iistats.zip. It is free software, covered under the GNU General Public License.

IIStats requires Perl 5 for Windows NT, which is available from http://www.perl.hip.com/.

IIStats generates the following information from the log file:

IIStats also generates an HTML page with a table containing detailed summaries. This table gives the number of hits and bytes transferred per page, as well as per site, which makes it a nice way to add usage statistics to your Web site.

wusage

wusage 4.1 is another statistics package that can analyze common log format server files; it can also analyze IIS logs and EMWAC-style logs. It can generate graphs in GIF format as well as normal HTML tables.

It is available from http://www.boutell.com/wusage/. This is a shareware package: for commercial sites it costs $75, and for educational or non-profit organizations it costs $25. You can download a 30-day evaluation copy from the above site and try it; if you like it, you can register and pay for it then.

wusage is available for many UNIX versions, as well as OS/2, Windows 95, Windows NT, and DOS. This is convenient because any machine with access to the access_log file can be used to generate the statistics.

wusage requires some configuration before it can run properly. Fortunately, the wusage directory includes a command called makeconf that automates this.

After makeconf finishes, you are ready to run wusage. You can specify the configuration file by adding -c configfile, as shown below; otherwise, wusage prompts you for it.
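For example, if makeconf wrote its answers to a file called wusage.conf (the name is whatever you chose when you ran makeconf), the run looks like this:

wusage -c wusage.conf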

After running wusage you will find that the program has placed some files in the directory you previously specified. These files include:

wusage is a very nice statistics package that allows many different reports to be run. Figures 20.5 and 20.6 show reports created with wusage.

Figure 20.5: A weekly report from wusage.

Figure 20.6: A yearly report from wusage.