A crucial, if often overlooked, aspect of running a successful web site is the study of activity occurring within the site. The information gleaned provides valuable input to continuous improvement initiatives, ranging from site architecture and content enhancements to traffic generation. This is the first of a two-part series exploring how to use the open source tool AWStats to perform web server log file analysis. This first part shows how to prepare a sample web log file, perform a basic installation of AWStats, generate reports, and review web analytics terminology; the second part will focus on report interpretation. My aim is to clear away some of the common misconceptions around hits, pages, and visits. The insight will provide a basis for creating a setup to meet production requirements.
Web log analysis can be resource-intensive and usually takes place on a system different from the production web server(s). This separation also allows for the flexibility inherent in heterogeneous architectures, where web servers might be running Linux while log analysis tools run under Windows or vice versa. I've assumed a minimalist scenario in which you have AWStats installed on a desktop workstation for ad hoc analysis. While AWStats will run on any platform that supports a recent Perl interpreter, this article covers AWStats 6.4 using either Linux or Windows.
Binary executables for Linux (.rpm) and Windows (.exe) are available from the AWStats project home page and the AWStats project on SourceForge. Download and run the executable appropriate for your workstation. In the case of a Windows install, a script will prompt you for information about your web environment. Answer N to skip this step, and press Enter until the command window closes.
Once the installation finishes, you should find the AWStats programs and documentation on your hard drive, likely in /usr/local/awstats/ or C:\Program Files\AWStats\. Now check that Perl is available. From the system command prompt, type:
$ perl -v
You should see version information if you already have Perl installed. AWStats will stop if the version is lower than 5.005_03; the latest version (5.8.x) is recommended, as it offers performance improvements. To install or update Perl, get a version for Linux from Perl for Linux or for Windows from ActiveState's ActivePerl.
To produce reports, you need at least a day of web server log file data. If you are using an Apache server, ensure that you have set the web server logging format to Combined. In the case of Microsoft's IIS web server, set your format to a modified version of the W3C Extended Log File Format, following the instructions in AWStats IIS configuration Part B, Step 1. These configurations add necessary data elements such as user agent (browser) and referring site to the base log configuration. For other web servers, consult the AWStats LogFormat parameter values to get a list of data elements required for complete reporting.
Restart the web server for the new logging values to take effect (after saving the old logs, if needed). If you have access to data from a production web server that you cannot restart, you can use the data as is, with two caveats. If you are not logging all the required data elements, such as user agent, the relevant AWStats reports will be empty. In addition, you must manually map each field being logged using the LogFormat parameter; otherwise, most of your data file will appear as corrupted to AWStats.
Once logging has run for at least a calendar day, copy the log file(s) to the system on which you installed AWStats, using one of the following commands (Linux or Windows, respectively):
$ cp /var/log/httpd/access_log /tmp/access.log
# or
> copy C:\WINDOWS\system32\Logfiles\W3SVC1\ex050623.log C:\temp\access.log
Adjust the origin locations as needed based on your web server configuration.
You can also combine multiple logs from different dates using the type (Windows) or cat (Linux) utility (in a production setting, turn the filename into a parameter). Be careful to combine the files in chronological order:
$ cat logfile1 logfile2 logfile3 > access.log
In the case of multiple servers in load balancing, merge the logs with the AWStats logresolvemerge.pl utility.
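To illustrate the chronological-order requirement for concatenated logs, here is a minimal Python sketch. It assumes Combined-format lines with a bracketed timestamp, and it works on log contents already read into memory. The function names and the sample entries are hypothetical, not part of AWStats.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp of a Combined-format line,
# e.g. [08/Jun/2005:19:03:22 +0200]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def first_timestamp(text):
    """Return the timestamp of the first parsable line, or None."""
    for line in text.splitlines():
        match = TIMESTAMP.search(line)
        if match:
            # %b expects English month abbreviations (default C locale).
            return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    return None


def merge_chronologically(logs):
    """Concatenate log texts ordered by their first timestamp.

    `logs` maps a (hypothetical) file name to that file's contents.
    Files without a recognizable timestamp are rejected, not guessed at.
    """
    stamped = []
    for name, text in logs.items():
        stamp = first_timestamp(text)
        if stamp is None:
            raise ValueError(f"no timestamp found in {name}")
        stamped.append((stamp, name, text))
    stamped.sort(key=lambda item: item[0])
    return "".join(
        text if text.endswith("\n") else text + "\n" for _, _, text in stamped
    )


# Usage: the files are supplied out of order and come back oldest first.
logs = {
    "logfile2": 'host - - [09/Jun/2005:00:00:01 +0200] "GET / HTTP/1.1" 200 4544 "-" "-"\n',
    "logfile1": 'host - - [08/Jun/2005:19:03:22 +0200] "GET / HTTP/1.1" 200 4544 "-" "-"\n',
}
merged = merge_chronologically(logs)
```

Sorting on the first timestamp only is a simplification: it assumes each file covers its own period and that the periods do not overlap.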
A sample AWStats configuration file, awstats.model.conf, comes with the AWStats installation. Copy the file, changing model to the name of the domain to analyze. While custom dictates the use of a domain name, in reality it can be anything. This example analyzes data from www.antezeta.com, so antezeta replaces model:
$ cp /etc/awstats/awstats.model.conf /etc/awstats/awstats.antezeta.conf
> copy "C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.model.conf" ^
"C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.antezeta.conf"
Open the resulting file in your favorite text editor. Change each of the following values as necessary (where antezeta.com represents your domain):
SiteDomain="www.antezeta.com"
HostAliases="www.antezeta.com localhost 127.0.0.1"
LogType=W
Set the parameter LogFormat to 1 for Apache, 2 for Microsoft IIS < 6.0, or date time cs-method cs-uri-stem cs-username c-ip cs-version cs(User-Agent) cs(Referer) sc-status sc-bytes for IIS 6.x. For other web servers, see the documentation in the configuration file.
LogFormat=1
Set the parameter DNSLookup to 1 unless your web server already performs reverse DNS lookups on client IP addresses (that is, translating a host IP address such as 123.456.789.012 to user34.adsl.myisp.com or similar). Because reverse DNS lookup is slow, web servers do not usually perform it, as it would delay user navigation.
DNSLookup=1
Save the file.
AWStats uses intermediary files to produce its reports--one for each month of each year for each configuration file you have created. These files represent a compact, optimized version of raw web server log file data, based on preference settings in the AWStats configuration file. Run the command appropriate for your operating system to generate a statistics file for the web log saved earlier in the temporary directory (replace antezeta with your domain name):
$ perl /usr/local/awstats/wwwroot/cgi-bin/awstats.pl -config=antezeta \
-update -LogFile=/tmp/access.log
> perl "C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.pl" -config=antezeta ^
-update -LogFile=C:\temp\access.log
You should see output similar to this Windows example:
Update for config "C:\Program Files\AWStats\wwwroot\cgi-bin/
awstats.antezeta.conf"
With data in log file "C:\temp\access.log"...
Phase 1 : First bypass old records, searching new record...
Searching new records from beginning of log file...
Phase 2 : Now process new records (Flush history on disk after 20000 hosts)...
Jumped lines in file: 0
Parsed lines in file: 539
Found 1 dropped records,
Found 4 corrupted records,
Found 0 old records,
Found 534 new qualified records.
This will generate a statistics file awstatsMMYYYY.antezeta.txt in the same directory as awstats.pl (unless you gave a different value to DirData in awstats.antezeta.conf):
Directory of C:\Program Files\AWStats\wwwroot\cgi-bin
06/23/2005 03:51 PM 6,633 awstats062005.antezeta.txt
where MM is the month and YYYY the year of the web server log data. Should the input data bridge two months, the statistics database will consist of two statistics files.
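As a small illustration of the awstatsMMYYYY naming convention, the following Python sketch lists the statistics files you would expect for a given range of log dates. The function name is hypothetical, and the sketch only mirrors the naming pattern described above; it does not read or write any AWStats data.

```python
from datetime import date


def statistics_file_names(config, first_day, last_day):
    """List the awstatsMMYYYY.<config>.txt names covering a date range."""
    names = []
    year, month = first_day.year, first_day.month
    while (year, month) <= (last_day.year, last_day.month):
        names.append(f"awstats{month:02d}{year}.{config}.txt")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Log data bridging June and July 2005 yields two statistics files.
print(statistics_file_names("antezeta", date(2005, 6, 23), date(2005, 7, 2)))
# ['awstats062005.antezeta.txt', 'awstats072005.antezeta.txt']
```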
Rerun the previous command against the same log file. Instead of 534 new records, AWStats now reports 534 old ones:
Update for config "C:\Program Files\AWStats\wwwroot\cgi-bin/
awstats.antezeta.conf"
With data in log file "C:\temp\access.log"...
Phase 1 : First bypass old records, searching new record...
Searching new records from beginning of log file...
Jumped lines in file: 0
Parsed lines in file: 539
Found 1 dropped records,
Found 4 corrupted records,
Found 534 old records,
Found 0 new qualified records.
AWStats, noticing it received an old file, correctly ignores the old data. However, AWStats is less flexible when it comes to processing log files out of order--it must process them chronologically. If you skip a day's processing, AWStats will ignore it if you try to process it after processing successive days. The solution is to delete that month's statistics file and reprocess the log data for the entire month to date. Similarly, some AWStats configuration file changes affect statistics file generation. If your log files are not large and you have doubts, delete the statistics file(s) and reprocess your logs.
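Because a skipped day cannot be added after later days have been processed, it may help to check for gaps before each update. The following Python sketch is one way to do that, under the assumption that you keep your own list of the calendar days already fed to AWStats. The function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def missing_days(processed_days):
    """Return calendar days absent between the earliest and latest day given."""
    if not processed_days:
        return []
    present = set(processed_days)
    day, last = min(present), max(present)
    gaps = []
    while day <= last:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# June 23 was skipped, so the month's statistics file needs rebuilding.
days = [date(2005, 6, 21), date(2005, 6, 22), date(2005, 6, 24)]
print(missing_days(days))
# [datetime.date(2005, 6, 23)]
```

If the function returns anything, the safe course is the one described above: delete that month's statistics file and reprocess the logs for the month in order.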
Storing the original log files for extended periods is a good practice, unless legal or company policy dictates otherwise. Access to historical logs lets you regenerate your reports if you subsequently make a configuration file change or decide to migrate to another web log analysis tool.
After you have created a statistics database, it's possible to run reports. While AWStats supports a very nice on-demand web CGI interface, it's easy to create static HTML reports to avoid having to reconfigure your web server. The following commands will generate the reports in the /tmp or C:\temp directory:
$ perl "/usr/local/awstats/tools/awstats_buildstaticpages.pl" \
-config=antezeta -lang=en \
-awstatsprog="/usr/local/awstats/wwwroot/cgi-bin/awstats.pl" \
-dir="/tmp" \
-diricons="/usr/local/awstats/wwwroot/icon"
> perl "C:\Program Files\AWStats\tools\awstats_buildstaticpages.pl" ^
-config=antezeta ^
-lang=en -awstatsprog="C:\Program Files\AWStats\wwwroot\cgi-bin\awstats.pl" ^
-dir="C:\temp" -diricons="../Program%20Files/AWStats/wwwroot/icon"
AWStats creates the HTML reports in the temp directory specified by -dir; the main index file is awstats.config.html (for this example, awstats.antezeta.html). Open it in a web browser.
Should the report graphs be clear rather than colored, verify the directory specified with the -diricons parameter. This value is hardcoded in the HTML files. In the Windows example above, we had to encode the space in the directory name with the %20 notation. We also used HTML forward slashes rather than Windows backslashes.
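The two adjustments described for the -diricons value (forward slashes and %20 for spaces) can be reproduced with Python's standard library. This is only an illustrative sketch with a hypothetical function name. It assumes a relative path as in the Windows example; a drive letter such as C: would have its colon percent-encoded as well.

```python
from urllib.parse import quote


def diricons_value(windows_path):
    """Turn a Windows-style relative path into a URL-style one."""
    forward = windows_path.replace("\\", "/")
    # quote() leaves '/' alone by default and encodes spaces as %20.
    return quote(forward)


print(diricons_value(r"..\Program Files\AWStats\wwwroot\icon"))
# ../Program%20Files/AWStats/wwwroot/icon
```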
To put the created reports in context, begin by looking at the raw log data, and from that define basic web analytics terminology.
Using the configuration format specified earlier, each web log will have multiple lines of text, each containing nine fields of data. To understand the work AWStats has to perform, consider how a record looks:
| No. | Field | Data Example | Explanation |
|---|---|---|---|
| 1 | Host (user) IP | d81-211-134-62.cust.tele2.it | There has been a DNS lookup in this case. The web server can do it, but you can also do it later, if you do it at all. Judging from the user's host, there is a reasonable probability that the request came from Italy. (However, if the host were something like proxy.alitalia.it, the user might have been working for Alitalia in Boston!) |
| 2 | RFC 1413 identity (username) of the client determined by identd | - | Rarely used. PC clients do not usually run identd. A dash is a placeholder in the absence of a value. |
| 3 | Authenticated User (login name) | - | The login name for a web server-required login. This is not usually present--most web sites use application server logins, not web server logins. |
| 4 | The date and time that the server finished processing the request | [08/Jun/2005:19:03:22 +0200] | Time includes UTC (Coordinated Universal Time) offset. |
| 5 | The user request | GET / HTTP/1.1 | In this case, the client requested the top-level default document / (index.html) using the GET method of the HTTP protocol version 1.1. |
| 6 | Response Status sent to client | 200 | |
| 7 | Bytes sent, excluding HTTP headers | 4544 | |
| 8 | Referer (sic) URL, if any | http://www.antezeta.com/about.html | The URL from which the client made the request. This field is blank if the user directly types a URL, chooses a bookmark, or uses privacy software that blocks the information from being sent. |
| 9 | User-Agent identification as reported by the user agent. This usually includes operating system and browser names and versions. | Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Fedora/1.0.4-1.3.1 Firefox/1.0.4 | This is a Firefox 1.0.4 browser on a Fedora Linux system. Note: some browsers, such as Opera, let the user choose which identification to send. A user can claim to use Microsoft Internet Explorer 6 even while using Opera. This impostor functionality is a response to all the poorly designed "Optimized for browser x" sites that refuse to work with other, legitimate standards-compliant browsers. |
This one web server entry, a successful request for http://www.antezeta.com/, represents what is commonly called a hit. An anonymous user navigated from the page http://www.antezeta.com/about.html.
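To make the nine fields concrete, here is a minimal Python sketch that splits one Combined-format line into named parts. The regular expression and the field names are my own, not AWStats internals. The sample line is assembled from the example values in the table above, with the quoting that the Apache Combined format normally applies to the request, referer, and user agent.

```python
import re

# One named group per field of the Combined log format.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of the nine fields, or None for an unparsable line."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None


sample = (
    'd81-211-134-62.cust.tele2.it - - [08/Jun/2005:19:03:22 +0200] '
    '"GET / HTTP/1.1" 200 4544 "http://www.antezeta.com/about.html" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 '
    'Fedora/1.0.4-1.3.1 Firefox/1.0.4"'
)
record = parse_line(sample)
print(record["request"], record["status"], record["bytes"])
# GET / HTTP/1.1 200 4544
```

The sketch does not handle escaped quotation marks inside a field. A line that fails to match returns None, which is roughly what a log analyzer would count as a corrupted record.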
Consider that the web site's home page in the above example is actually a group of files--one text file (index.html), one style sheet to indicate formatting (CSS), six image files (GIF, ICO, and PNG), and some dynamic client-side logic (JavaScript) stored in two separate files on the web server. Simply calling up the home page will result in ten file requests to the web server, and thus ten hits:
| Qty | Item |
|---|---|
| 1 | HTML text file; for example, index.html |
| 1 | CSS formatting instructions file |
| 6 | GIF, ICO, and PNG image files |
| 2 | js JavaScript client logic instruction files |
| 10 | Total hits |
Probably the most common web metric bandied about, "hits" is also the most meaningless.
Along with bandwidth consumption, hits can be useful as an input for server sizing and capacity planning. While people make much of hits to tout the success of a site, hits have no intrinsic business value. Representations to the contrary probably indicate a lack of understanding of how futile hits are as a useful business measure.
As the internet has matured, more sophisticated attention turned from hits to pages. Unfortunately, this opened a new can of worms: there is no standard definition of a page. A web server log file simply contains information on objects requested from the web server. It is up to the web server log file analysis software to give semantic meaning to those objects.
AWStats works by exclusion in defining a page. By default, any object accessed by a user on your web server is a page unless it has a filename suffix of css, js, class, gif, jpg, jpeg, png, bmp, or ico. You must explicitly add any other objects you do not want to count as pages in AWStats reports. For example, add ZIP archives and Flash animation files to this list by adding their suffixes to the AWStats NotPageList directive in the AWStats configuration file:
NotPageList="css js class gif jpg jpeg png bmp ico swf zip tgz gz tar"
Then AWStats will count everything but the following as pages:
| Suffix | Description |
|---|---|
| css | Cascading Style Sheet formatting instruction files |
| js | JavaScript dynamic program logic |
| class | Java program files |
| gif, jpg, jpeg, png, and bmp | Various image/photo formats |
| ico | An image icon file; many sites have a company logo saved as favicon.ico; many browsers use this in bookmarks (favorites) and tabs |
| swf | ShockWave Flash animation |
| zip, tgz, gz, and tar | Archive formats created by PKZip, WinZip, tar, gzip, or similar |
One advantage to this approach is that if you are using a CGI to generate dynamic pages, you do not have to worry about each CGI query counting as a page--this will be automatic.
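The exclusion rule is easy to mimic in a few lines of Python. The sketch below is a simplification for illustration, not the AWStats implementation. It treats a request as a page unless the last path segment ends in one of the NotPageList suffixes shown above. The function name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Same suffixes as the NotPageList example above.
NOT_PAGE_LIST = "css js class gif jpg jpeg png bmp ico swf zip tgz gz tar".split()


def is_page(url, not_page_suffixes=NOT_PAGE_LIST):
    """Exclusion rule: a request counts as a page unless its suffix is listed."""
    path = urlsplit(url).path  # drops any ?query string
    last_segment = path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return True  # no suffix at all, e.g. "/" or "/cgi-bin/search"
    suffix = last_segment.rsplit(".", 1)[-1].lower()
    return suffix not in not_page_suffixes


requests = [
    "/", "/index.html", "/style.css", "/logo.png", "/favicon.ico",
    "/menu.js", "/cgi-bin/search.pl?q=awstats", "/files/report.zip",
]
pages = [url for url in requests if is_page(url)]
print(len(requests), "hits,", len(pages), "pages")
# 8 hits, 3 pages
```

Note that the CGI request counts as a page without any extra configuration, while the style sheet, images, script, and archive count only as hits.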
While the concept of a page is open to some interpretation, the concept of a visitor (and a visit, also known as a session) is more difficult to define. Log data neither defines nor tracks a visitor entity. Several heuristic approaches can be used to extrapolate individual visitors from server log data, each approach adding an additional level of refinement.
Some significant problems are inherent in tracking visitors and their visits with web log analysis software such as AWStats.
An ISP may reassign an IP to several users over the course of a day. Assume that Giacomo connects to the internet using his dial-up modem connection at 7:35 a.m. After a few minutes, he disconnects. His host IP address, dialup-062.libero.it, is now free. At 8:10 a.m., Patrizia connects with her modem and is assigned the host IP address dialup-062.libero.it by her provider. If she visits a site, is she the same visitor in the same visit (session) as before?
The commonly accepted convention is that a visit has ended if there is no further activity from the visitor after 30 minutes. Thus, her visit would be a new session or visit--but you have no way of knowing that she is a different person from Giacomo. When Giacomo connects later in the day, he will most likely do so from the office, so even if he had a fixed IP at home, he will have a new host IP from the office and will thus appear as a different visitor than the Giacomo who visited at 7:35 a.m.
Despite these limitations in heuristic approaches, the concept of visitors and sessions (each individual visit) remains a valid tool as an indication of overall user behavior and trends.
| Visitor No. | Visits (sessions) | Unique visitors |
|---|---|---|
| 1 | 2 | 1 |
| 2 | 1 | 1 |
| 3 | 12 | 1 |
| Total: 3 | 15 | 3 |
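The 30-minute convention can be sketched in Python as follows. This is a simplified heuristic that uses the host alone to identify a visitor, with the limitations discussed above, and it is not the AWStats algorithm. The function name, the office host name, and the request times are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def count_visits(requests, timeout=SESSION_TIMEOUT):
    """Count visits per host from (host, datetime) pairs.

    A new visit starts when more than `timeout` has passed since the
    same host's previous request.
    """
    by_host = defaultdict(list)
    for host, when in requests:
        by_host[host].append(when)
    visits = {}
    for host, times in by_host.items():
        times.sort()
        count = 1
        for previous, current in zip(times, times[1:]):
            if current - previous > timeout:
                count += 1
        visits[host] = count
    return visits


requests = [
    ("dialup-062.libero.it", datetime(2005, 6, 8, 7, 35)),  # Giacomo
    ("dialup-062.libero.it", datetime(2005, 6, 8, 7, 38)),  # Giacomo
    ("dialup-062.libero.it", datetime(2005, 6, 8, 8, 10)),  # Patrizia
    ("office.example.com", datetime(2005, 6, 8, 14, 0)),    # Giacomo again
]
print(count_visits(requests))
# {'dialup-062.libero.it': 2, 'office.example.com': 1}
```

The result shows two visitors and three visits. The visit count is plausible, but the visitor count hides the fact that two different people shared one host and one person appeared under two.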
Bandwidth consumption is of interest to technical staff, as there is usually an economic cost associated with its use. On a more granular level, large individual file sizes will indicate performance issues, especially for dial-up users.
The final part of this series will look at the reports we generated, using the definitions above to identify business and technical metrics to watch.
Sean Carlos is president of Antezeta, an internet consultancy focusing on Merit-Based™ search engine optimization, search engine marketing, web analytics, and web site usability.
Return to ONLamp.com.
Copyright © 2009 O'Reilly Media, Inc.