Writing Apache's Logs to MySQL
Pages: 1, 2, 3
Recording Logs in a Database
Flat files are the simplest way to record hits, but they're impractical for long-term analysis:
- Processing programs must read the records sequentially.
- Log writing affects web server performance due to the high number of disk I/O operations.
- Log rotation must occur regularly.
- Log retention increases the storage requirements.
The completed log files feed into programs designed to track hits on the web server. These programs parse the log files and create reports so that administrators can track their visitors. Hit-tracking programs usually fall within one of two categories.
Programs in the first category read the log files whenever someone requests a report. These programs are usually small-scale reporting programs or custom scripts that don't deal with a large number of hits. Report generation can take a long time, but the programs can run immediately against one or more log files--even currently active ones.
In the second category, programs import the log files into a database and generate reports from the imported data. A number of high-end commercial tracking programs fall into this category. Report generation is usually much quicker, but it can run only on imported data.
In both cases, the software needs to read the log file in order to build a report or populate a database. If we were to log our hits directly to a database, we could avoid the initial overhead. Third-party modules are available that allow Apache to send logging data directly to a database.
mod_log_mysql is an Apache 2.0 module that writes logging data directly to a MySQL database. Another module, mod_log_sql, also sends data to a MySQL database. Both do essentially the same thing in different ways. In checking out both modules, I found that mod_log_mysql more ably handles all of the Blackbox logging directives.
Writing to a MySQL database is a pretty radical shift from the tried-and-true method of just hammering out text log files. Some people may welcome the change, but others may not be so accepting.
Here's a quick rundown on some of the pros and cons. I don't mean this as a critique or praise for mod_log_mysql, but it brings up issues to consider when recording logs to a database. Some of these issues affect any remote log file writing solution, not just mod_log_mysql.
Pros
- mod_log_mysql can write logs directly to a networked database server. This provides considerable savings on local disk I/O operations. Also, one central database server can handle multiple Apache servers.
- If writing to the local disk concerns you, this module moves one step
closer to avoiding disk writes. Combining mod_log_mysql with a syslog facility
in your
ErrorLogdirective will drastically reduce the amount of disk writing your server will perform. Technically, you still have the PID file and the scoreboard file, but you can always write that data to the /tmp filesystem. - You need no more parsing scripts. You won't have to read in records sequentially, and you won't need to extract fields using regular expressions. Instead, you can use SQL commands to extract only the fields and records that you want. This makes ad hoc reporting much easier and faster.
Cons
- Databases may not be 100 percent trustworthy. It's hard to dismiss this claim, because database logging changes almost ten years of established practice. There's always the chance that the database could fail or become unreachable. mod_log_mysql provides a backup mechanism for database failures, but those log files can grow pretty big plenty fast. You need to make sure you catch the database failures and address them right away.
- mod_log_mysql requires a rebuild of Apache, with the entire logging architecture of mod_log_config rewritten to satisfy the needs of this module. Not many administrators or developers will be too eager to rewrite a core Apache module.
- This module works only with a MySQL database. Some high-end web reporting software packages rely on other database platforms (such as Oracle), so importing records will require an export and import operation.
- The database isn't anywhere close to fourth normal form. Don't bother optimizing it and don't bother indexing it, because it just won't be worth it in the long run. It's still possible to write a program to grab the records and insert them into a better database structure.
Maintaining a logging database isn't that different from maintaining a collection of text log files. You will still run into the same management and administration issues, but you'll have different procedures to follow.
Run periodic tests to make sure the server is sending logging data to the database. Also, check the size of the database and set up a rotation schedule. Export, aggregate, or back up older data depending on what kind of database environment you work with.
Setting up mod_log_mysql
In order to support mod_log_mysql, you need to do the following:
- Rebuild your Apache server with the supplied code changes.
- Build the database into your MySQL environment.
- Configure the logging in your web server configuration.
The mod_log_mysql web page includes the code for the module as well as the replacement mod_log_config.h, mod_log_config.c, mod_logio.c, and apr_reslist.c source files.
If you read Part 1 of this series, you'll remember I introduced a code change to mod_log_config.c to support logging the client source port. If you want to continue to support that directive, you will need to make the following code changes:
Insert the following subroutine in the mod_log_config.c file, anywhere near the other logging functions:
static void *log_remote_port(request_rec *r, char *a, ap_log_ehandler_data *d)
{
d->data=apr_palloc(r->pool,sizeof(ap_log_unumber_t));
*((ap_log_unumber_t *) d->data) = r->connection->remote_addr->port;
d->type=AP_LOG_EHANDLER_RETURN_UNUMBER;
}
Insert the following log_pfn_register function into
log_pre_config:
log_pfn_eregister(p, "S", log_remote_port, 0);
Once the new source code files are in place, do a make clean
and restart the build process at the top-level code directory. Once that build
is complete, build the mod_log_mysql DSO using the apxs tool.
$ apxs -a -i -c -L/opt/mysql/lib/mysql -I/opt/mysql/include/mysql
-lmysqlclient_r -lz mod_log_mysql.c
Building the database
Next, build the database the Apache server will write to. The module expects that you have already created the database. It also expects that you have set up the fields properly and have configured the fields properly to handle the correct data types.
Logging directly to a database is a big change. The limitations of using flat text files no longer apply. I already mentioned that you don't need to worry about log file rotation or local disk issues, but I also want to bring up centralization.
The previous article on Blackbox logs demonstrated how to split the logs across two different files. One was the common log file format for hit tracking; the other was the Blackbox log for auditing server performance. This gave the opportunity to see only the interesting data.
Unfortunately, both log formats wrote down some of the same fields. That approach wasted space for the sake of convenience because there was no easy way to correlate data between the two files.
Writing logs to a database ameliorates that problem. The database table
will have all of the fields from the combined log file format and the Blackbox
log file format. SQL queries can pull out only the desired records and fields.
Here are the SQL commands to build the database structure. They build a
database called apachelogs and define one table named
blackbox:
CREATE DATABASE apachelogs;
CREATE TABLE blackbox (
virtualhost text,
remoteip text,
remoteport int unsigned,
serverid text,
connectionstatus char(1),
remoteuser text,
identuser text,
datetime datetime default NULL,
requestmethod text,
url text,
protocol text,
statusbeforeredir int unsigned,
statusafterredir int unsigned,
processid int unsigned,
threadid int unsigned,
seconds int unsigned,
microseconds int unsigned,
bytesin int unsigned,
bytesout int unsigned,
bytescontent int unsigned,
useragent text,
referer text,
uniqueid text,
);
The mod_log_mysql module attempts to send every numeric value to the database as an integer, which makes mathematical operations easier. Pure text fields, such as the URL or the user agent, are variable-length text strings.
After you create the database, create all of the access rights in the MySQL database so the Apache server can write to the database.
Be sure to log in to the database using the mysql client so you
can test the access credentials you created. Once the database is ready, you
need to create the logging format.
Add the following lines to your Apache configuration. Remember, you can split a directive across lines as long as they end with a backslash.
LogFormat "INSERT INTO blackbox SET virtualhost=%v,remoteip=%a,\
remoteport=%S,serverid=%{SERVERID}e,connectionstatus=%X,\
remoteuser=%u,identuser=%l,datetime=%t,requestmethod=%m,\
url=%R,protocol=%H,statusbeforeredir=%s,statusafterredir=%>s,\
processid=%{pid}P,threadid=%{tid}P,seconds=%T,microseconds=%D,\
bytesin=%I,bytesout=%O,bytescontent=%B,useragent=%{User-Agent}i,\
referer=%{Referer}i,uniqueid=%{UNIQUE_ID}e" blackbox
CustomLog mysql:logwriter!logpass@localhost/apachelogs blackbox
The updated mod_log_config file adds a new logging directive,
%R. This is a shortcut for the %U%q convention, for people who want to record the URL and the query string.
The mod_log_mysql module connects as a client to the database server and passes along the insert statement exactly as specified in the configuration file. The code modifies the statement to escape strings properly, escaping certain characters to minimize SQL injection attacks.
If there are errors in the statement, inspect the MySQL server log to check for bad input. You can also check the Apache error logs for SQL statement errors.
Identifying individual web servers
If you look at the CREATE and INSERT statements in
the above examples, you'll notice an additional field. The
serverid field is a string used to identify a running server
instance.
Centralizing the web server logs creates a tracking problem. There must be a foolproof way to determine which server has handled each entry in the database. Remember, it's possible for the Apache server to run in a wide variety of host environments:
- Multiple Apache servers on different hosts serving the same web site
- Multiple Apache servers running on the same host, each listening on a unique IP and serving different web sites
- One Apache server running on the same host, listening on multiple IPs and serving different web sites
The best way to identify these servers is to create an environment variable for each Apache process pool. Each web server's configuration should create an environment variable that will identify the pool. Here's the relevant snippet from httpd.conf:
SetEnv SERVERID apache1
That one variable needs to identify this server for every possible hit it will answer, whether that hit comes across different IPs in the pool or different virtual hosts.
The other way to address this issue is to have each web server write to a
separate table in the same database. That's always an option, but you should
still use the serverid field for when you want to do comparative
operations. That way you can merge or split tables at a later time based on its
value.



