In Profiling LAMP Applications with Apache's Blackbox Logs, I discussed using Apache's built-in logging directives to record server performance metrics. By recording performance metrics, web server administrators can have a historical record of how the server handled incoming HTTP requests. This article expands on that concept by adding a couple of logging directives and recording the logging data directly in a MySQL database.
The Blackbox format is a custom log format put together with some of the lesser used Apache logging directives. It records data to give server administrators a better idea of how well their web servers are performing.
LogFormat "%a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P %T/%D %I/%O/%B" blackbox
In order, this format records the remote IP address, the remote port, the connection status, the time, the request line, the return statuses (before and after redirection), the process ID, the thread ID, the length of time to handle the request (in seconds and microseconds), and the number of bytes in, bytes out, and bytes of content.
There are a couple of additional fields to add to the Blackbox format to satisfy larger web server environments.
%v
The Blackbox format records the performance metrics for a given server instance. If your server handles virtual hosts, you can record the Blackbox data for every virtual host.
Here's an example of a configuration file using three virtual hosts.
<VirtualHost app.example.com>
ServerName app.example.com
CustomLog /var/opt/apache2/logs/appserver common
CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>
<VirtualHost images.example.com>
ServerName images.example.com
CustomLog /var/opt/apache2/logs/imageserver common
CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>
<VirtualHost internal.example.com>
ServerName internal.example.com
CustomLog /var/opt/apache2/logs/intranet common
CustomLog /var/opt/apache2/audit/blackbox blackbox
</VirtualHost>
Every host writes out two logs: one in the custom log file format and another
in the Blackbox format. The Blackbox log for each host is actually the same
file. This makes sense for centralizing performance metrics, but it's difficult
to identify which hits come from which virtual host. The simplest way to fix
this is to add the %v directive to the format.
%{UNIQUE_ID}e or %{forensic-id}n
Apache has a module known as mod_unique_id. It creates a unique, 19-character,
Base64-encoded string to identify every hit made on the server. The string
comes from the source IP, the process ID, the timestamp, and a subsecond
counter. You can access the value using the environment variable
%{UNIQUE_ID}e.
Adding this directive to the Blackbox format ensures that each hit record is unique. It also provides the ability to match hits in the Blackbox log file to specific hits in another log file.
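For the curious, the packing can be sketched in Python. The field order (timestamp, IPv4 address, process ID, counter) and the substitute alphabet characters below reflect my reading of the mod_unique_id source, so treat this as an approximation rather than a reference implementation:

```python
import struct

# Alphabet mod_unique_id appears to use: standard Base64 with '+' and '/'
# swapped for '@' and '-' so the id stays URL- and filename-safe.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@-"

def encode_unique_id(stamp, ip, pid, counter):
    """Pack (timestamp, IPv4 address, pid, counter) into a 19-char id."""
    in_addr = int.from_bytes(bytes(int(o) for o in ip.split(".")), "big")
    raw = struct.pack("!IIIH", stamp, in_addr, pid, counter)  # 14 bytes
    bits = int.from_bytes(raw, "big") << 2        # pad 112 bits up to 114
    return "".join(ALPHABET[(bits >> (6 * i)) & 0x3F]
                   for i in range(18, -1, -1))

def decode_unique_id(uid):
    """Reverse of encode_unique_id: recover the four source fields."""
    bits = 0
    for ch in uid:
        bits = (bits << 6) | ALPHABET.index(ch)
    raw = (bits >> 2).to_bytes(14, "big")         # drop the 2 pad bits
    stamp, in_addr, pid, counter = struct.unpack("!IIIH", raw)
    ip = ".".join(str(b) for b in in_addr.to_bytes(4, "big"))
    return stamp, ip, pid, counter
```

The round trip shows why the string is guaranteed unique per hit: no two requests on one server can share the same (timestamp, pid, counter) triple.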
For example, Apache 2.0.52 introduced a new logging module known as
mod_log_forensic. When a request comes in, the module immediately writes out
the incoming client request headers. Then at the logging phase, the module
writes another entry to show that the request has ended. The module prefixes
each entry with the UNIQUE_ID environment variable.
By combining the mod_log_forensic module with the Blackbox format, you can
record the client input and the server performance metrics. The
UNIQUE_ID gives you a unique key so you can trace the log entries
back to their forensic log input records.
If you decide to use mod_log_forensic, you have two options for recording
the Base64 id string: %{UNIQUE_ID}e or
%{forensic-id}n. The mod_log_forensic module generates the latter.
They will both be the same value, but the Apache documentation makes the vague
suggestion that they might be different in the Apache 2.1 code branch. For now,
I recommend %{UNIQUE_ID}e, because it's usable with or without the
mod_log_forensic module.
The updated configuration for the Blackbox format looks like:
LogFormat "%v %a/%S %X %t \"%r\" %s/%>s %{pid}P/%{tid}P \
%T/%D %I/%O/%B %{UNIQUE_ID}e" blackbox
This may look a little strange, but it's an attempt to make the format
similar to the established conventions. The Apache common+virtual host logging
format puts the %v directive first, and all examples showing the
UNIQUE_ID variable put it in the last field of the format.
The Perl regular expression for extracting records from this format is:
$Fmt = '^(\S+)\ ([^/]+)/(\d+)\ (\S)\ \[([^:]+):(\d+:\d+:\d+)\ ([^\]]+)\]\ ' .
       '"(\S+)\ (.*?)\ (\S+)"\ (\d+)/(\d+)\ (\d+)/(\d+)\ ' .
       '(\d+)/(\d+)\ (\d+)/(\d+)/(\d+)\ (\S+)$';
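For quick experimentation outside of Perl, the same pattern ports directly to Python. The log line below is fabricated for illustration (the host name, addresses, and unique ID are made up), formatted per the updated Blackbox format with %v first and UNIQUE_ID last:

```python
import re

# Port of the Blackbox pattern; the 20 groups are: vhost, ip, port,
# connection status, date, time, timezone, method, url, protocol,
# status/status, pid, tid, seconds, microseconds, bytes in/out/content, uniqueid.
BLACKBOX = re.compile(
    r'^(\S+) ([^/]+)/(\d+) (\S) \[([^:]+):(\d+:\d+:\d+) ([^\]]+)\] '
    r'"(\S+) (.*?) (\S+)" (\d+)/(\d+) (\d+)/(\d+) '
    r'(\d+)/(\d+) (\d+)/(\d+)/(\d+) (\S+)$'
)

# A hypothetical entry in the updated Blackbox format
line = ('app.example.com 192.0.2.10/34567 + [21/Feb/2005:12:00:00 -0600] '
        '"GET /index.html HTTP/1.1" 200/200 1234/5678 0/4321 '
        '512/2048/1500 R2dHPEAAAQAAEOEDUQM')

m = BLACKBOX.match(line)
fields = m.groups()
```

Each group maps onto one column of the database table built later in this article, which makes the pattern a handy sanity check before committing to a schema.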
Flat files are the simplest way to record hits, but they're impractical for long-term analysis:
The completed log files feed into programs designed to track hits on the web server. These programs parse the log files and create reports so that administrators can track their visitors. Hit-tracking programs usually fall within one of two categories.
Programs in the first category read the log files whenever someone requests a report. These programs are usually small-scale reporting programs or custom scripts that don't deal with a large number of hits. Report generation can take a long time, but the programs can run immediately against one or more log files--even currently active ones.
In the second category, programs import the log files into a database and generate reports from the imported data. A number of high-end commercial tracking programs fall into this category. Report generation is usually much quicker, but it can run only on imported data.
In both cases, the software needs to read the log file in order to build a report or populate a database. If we were to log our hits directly to a database, we could avoid the initial overhead. Third-party modules are available that allow Apache to send logging data directly to a database.
mod_log_mysql is an Apache 2.0 module that writes logging data directly to a MySQL database. Another module, mod_log_sql, also sends data to a MySQL database. Both do essentially the same thing in different ways. In checking out both modules, I found that mod_log_mysql more ably handles all of the Blackbox logging directives.
Writing to a MySQL database is a pretty radical shift from the tried-and-true method of just hammering out text log files. Some people may welcome the change, but others may not be so accepting.
Here's a quick rundown on some of the pros and cons. I don't mean this as a critique or praise for mod_log_mysql, but it brings up issues to consider when recording logs to a database. Some of these issues affect any remote log file writing solution, not just mod_log_mysql.
Moving every log except the output of the ErrorLog directive into the database will drastically reduce the amount of disk writing your server performs. Technically, you still have the PID file and the scoreboard file, but you can always write that data to the /tmp filesystem.
Maintaining a logging database isn't that different from maintaining a collection of text log files. You will still run into the same management and administration issues, but you'll have different procedures to follow.
Run periodic tests to make sure the server is sending logging data to the database. Also, check the size of the database and set up a rotation schedule. Export, aggregate, or back up older data depending on what kind of database environment you work with.
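For example, a rotation pass might copy older rows into an archive table and then prune them. This sketch assumes a log table named blackbox like the one defined later in this article; the 90-day retention window is arbitrary:

```sql
CREATE TABLE blackbox_archive LIKE blackbox;
INSERT INTO blackbox_archive
  SELECT * FROM blackbox
  WHERE datetime < date_sub(curdate(), interval 90 day);
DELETE FROM blackbox
  WHERE datetime < date_sub(curdate(), interval 90 day);
```

Scheduling a pass like this from cron plays the same role that logrotate plays for flat files.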
In order to support mod_log_mysql, you need to do the following:
The mod_log_mysql web page includes the code for the module as well as the replacement mod_log_config.h, mod_log_config.c, mod_logio.c, and apr_reslist.c source files.
If you read Part 1 of this series, you'll remember I introduced a code change to mod_log_config.c to support logging the client source port. If you want to continue to support that directive, you will need to make the following code changes:
Insert the following subroutine in the mod_log_config.c file, anywhere near the other logging functions:
static void *log_remote_port(request_rec *r, char *a, ap_log_ehandler_data *d)
{
    /* Store the client's source port as an unsigned number */
    d->data = apr_palloc(r->pool, sizeof(ap_log_unumber_t));
    *((ap_log_unumber_t *) d->data) = r->connection->remote_addr->port;
    d->type = AP_LOG_EHANDLER_RETURN_UNUMBER;
    return NULL;
}
Insert the following log_pfn_eregister call into log_pre_config:
log_pfn_eregister(p, "S", log_remote_port, 0);
Once the new source code files are in place, do a make clean
and restart the build process at the top-level code directory. Once that build
is complete, build the mod_log_mysql DSO using the apxs tool.
$ apxs -a -i -c -L/opt/mysql/lib/mysql -I/opt/mysql/include/mysql \
    -lmysqlclient_r -lz mod_log_mysql.c
Next, build the database the Apache server will write to. The module expects that you have already created the database and defined each field with a data type appropriate for the value it will hold.
Logging directly to a database is a big change. The limitations of using flat text files no longer apply. I already mentioned that you don't need to worry about log file rotation or local disk issues, but I also want to bring up centralization.
The previous article on Blackbox logs demonstrated how to split the logs across two different files. One was the common log file format for hit tracking; the other was the Blackbox log for auditing server performance. This gave the opportunity to see only the interesting data.
Unfortunately, both log formats wrote down some of the same fields. That approach wasted space for the sake of convenience because there was no easy way to correlate data between the two files.
Writing logs to a database ameliorates that problem. The database table
will have all of the fields from the combined log file format and the Blackbox
log file format. SQL queries can pull out only the desired records and fields.
Here are the SQL commands to build the database structure. They build a
database called apachelogs and define one table named
blackbox:
CREATE DATABASE apachelogs;
CREATE TABLE blackbox (
virtualhost text,
remoteip text,
remoteport int unsigned,
serverid text,
connectionstatus char(1),
remoteuser text,
identuser text,
datetime datetime default NULL,
requestmethod text,
url text,
protocol text,
statusbeforeredir int unsigned,
statusafterredir int unsigned,
processid int unsigned,
threadid int unsigned,
seconds int unsigned,
microseconds int unsigned,
bytesin int unsigned,
bytesout int unsigned,
bytescontent int unsigned,
useragent text,
referer text,
uniqueid text
);
The mod_log_mysql module attempts to send every numeric value to the database as an integer, which makes mathematical operations easier. Pure text fields, such as the URL or the user agent, are variable-length text strings.
After you create the database, create all of the access rights in the MySQL database so the Apache server can write to the database.
Be sure to log in to the database using the mysql client so you
can test the access credentials you created. Once the database is ready, you
need to create the logging format.
Add the following lines to your Apache configuration. Remember, you can split a directive across lines as long as they end with a backslash.
LogFormat "INSERT INTO blackbox SET virtualhost=%v,remoteip=%a,\
remoteport=%S,serverid=%{SERVERID}e,connectionstatus=%X,\
remoteuser=%u,identuser=%l,datetime=%t,requestmethod=%m,\
url=%R,protocol=%H,statusbeforeredir=%s,statusafterredir=%>s,\
processid=%{pid}P,threadid=%{tid}P,seconds=%T,microseconds=%D,\
bytesin=%I,bytesout=%O,bytescontent=%B,useragent=%{User-Agent}i,\
referer=%{Referer}i,uniqueid=%{UNIQUE_ID}e" blackbox
CustomLog mysql:logwriter!logpass@localhost/apachelogs blackbox
The updated mod_log_config file adds a new logging directive,
%R. This is a shortcut for the %U%q convention, for people who want to record the URL and the query string.
The mod_log_mysql module connects as a client to the database server and passes along the insert statement exactly as specified in the configuration file. The code modifies the statement to escape strings properly, escaping certain characters to minimize SQL injection attacks.
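The module's escaping lives in its C source; as a rough illustration of the idea only (the function below is hypothetical, not the module's API), an escaping pass of this sort looks like:

```python
def escape_mysql(value):
    """Backslash-escape the characters that would terminate or alter a
    MySQL string literal, in the style of mysql_real_escape_string."""
    special = {"\\": "\\\\", "'": "\\'", '"': '\\"',
               "\0": "\\0", "\n": "\\n", "\r": "\\r"}
    return "".join(special.get(ch, ch) for ch in value)

# A request line crafted to break out of the INSERT statement
hostile = "GET /index.html?name=x' OR '1'='1 HTTP/1.0"
safe = escape_mysql(hostile)
```

Without the escaping step, a hostile URL or User-Agent header could smuggle its own SQL into the logging statement, which is exactly the injection risk the module guards against.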
If there are errors in the statement, inspect the MySQL server log to check for bad input. You can also check the Apache error logs for SQL statement errors.
If you look at the CREATE and INSERT statements in
the above examples, you'll notice an additional field. The
serverid field is a string used to identify a running server
instance.
Centralizing the web server logs creates a tracking problem. There must be a foolproof way to determine which server has handled each entry in the database. Remember, it's possible for the Apache server to run in a wide variety of host environments.
The best way to identify these servers is to create an environment variable for each Apache process pool. Each web server's configuration should create an environment variable that will identify the pool. Here's the relevant snippet from httpd.conf:
SetEnv SERVERID apache1
That one variable needs to identify this server for every possible hit it will answer, whether that hit comes across different IPs in the pool or different virtual hosts.
The other way to address this issue is to have each web server write to a
separate table in the same database. That's always an option, but you should
still use the serverid field for when you want to do comparative
operations. That way you can merge or split tables at a later time based on its
value.
Queries can run on the database immediately. The examples below are in straight SQL syntax. Some use the nested query syntax available only in MySQL 4.1 or above.
If you intend to write automated scripts in Perl or Python or whatever, it
may be easier to run multiple SELECT statements or to compute
derived values inside the scripting language itself. Whenever you request
reports on web server performance, you'll need to specify the date range you
want to cover.
You want to find out how much network traffic your web server sends and receives. Keep in mind that this reflects traffic levels only at the application layer of the TCP/IP stack. It won't record the size of TCP or IP packets. That's a question SNMP polling tools can answer better.
select sum(bytesin), sum(bytesout) from blackbox
where date(datetime) = curdate();
select sum(bytesin),sum(bytesout) from blackbox where datetime >=
date_sub(curdate(),interval 7 day);
The first example gives the total bytes in and bytes out for the given day. The second example gives the total for the past seven days.
This example compares the number of hits on RSS feeds against the number of hits on HTML pages.
select (select count(url) from blackbox where url regexp '\\.rss$')/
(select count(url) from blackbox where url regexp '\\.html$');
RSS hits are typically automated. Because of this, many sites now see a higher percentage of pure robot/agent traffic than they did two years ago. By checking the RSS hits, you can determine whether agent traffic is overwhelming your site.
Web sites with a Flash animation on the main page usually have a "Skip" link at the bottom. If you include one, append a meaningless query string to the link so you can determine which hits are redirects after the animation finishes and which come from people clicking the link.
select (select count(url) from blackbox where url="/home.html?skipped"),
(select count(url) from blackbox where url="/home.html");
Here's a two-part script you can run to verify that the database really is logging hits from the servers. First, add this script to the cgi-bin directory of every web site that uses mod_log_mysql.
#!/usr/local/bin/perl
# returnid.pl
use strict;
use warnings;
print("X-ServerId: $ENV{'SERVERID'}\n");
print("X-UniqueId: $ENV{'UNIQUE_ID'}\n");
print "Content-type: text/plain\n\n";
print("Server id is $ENV{'SERVERID'}\n");
print("Unique id is $ENV{'UNIQUE_ID'}\n");
exit 0;
The server will return its serverid and unique ID for that particular
hit. The next program will act as an HTTP client. It will retrieve this URL,
then connect to the database and search the logs for the unique ID.
#!/usr/local/bin/perl
# checkid.pl
#
# Pragmas
#
use strict;
use warnings;
#
# Includes
#
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use DBI;
#
# Global Variables
#
use vars qw/$Url $Agent $Db $ServerId $UniqueId /;
use constant {
DSN => "DBI:mysql:database=apachelogs",
DBUSER => "logwriter",
DBPASS => "logpass",
QUERYSTRING =>
"select datetime,uniqueid from blackbox where uniqueid=? and serverid=?",
DEFAULTURL => "http://www.example.com:8080/cgi-bin/returnid.pl",
};
#
# Subroutines
#
sub getId
{
my ($agent,$url) = @_;
my $request = HTTP::Request->new(GET => $url);
my $response = $agent->request($request);
if ($response->is_success)
{
my $serverid=$response->header("X-ServerId");
my $uniqueid=$response->header("X-UniqueId");
print("Unique ID from server is $uniqueid\n");
print("Server ID is $serverid\n");
return ($serverid,$uniqueid);
}
return;   # empty list, so the caller's list assignment evaluates false
}
sub findId
{
my ($db,$serverid,$uniqueid) = @_;
my $query = $db->prepare(QUERYSTRING);
$query->execute($uniqueid,$serverid);
if (my ($dt,$uid)=$query->fetchrow_array())
{
print("Database recorded $uid at $dt\n");
} else {
print("Can't find a database record for unique-id $uniqueid\n");
}
return;
}
#
# Main Block
#
# Initialize the user agent
$Agent = LWP::UserAgent->new();
# Initialize the database client
$Db = DBI->connect(DSN,DBUSER,DBPASS)
    or die "Can't connect to database: $DBI::errstr\n";
# URL
$Url = shift || DEFAULTURL;
if (($ServerId,$UniqueId) = getId($Agent,$Url))
{
findId($Db,$ServerId,$UniqueId);
} else {
print("Unable to get data from webserver\n");
exit 1;
}
If you run this program at a regular polling interval, it will warn you when the remote database is not responding or if the Blackbox table is not recording hits from the web servers.
If you've read the first article, you should already understand why you want to log your server performance data. The core concepts are still the same; I'm just introducing a few variations on improving the process.
The two new logging directives provide more flexibility in virtual hosting environments. They also make it possible to keep just one Blackbox log file for each running server.
If you want to take the really big step, consider the option of writing your logs straight to a database. The initial setup process may be complex, but after that there is a huge administrative benefit. It's an ideal solution for dealing with large server farms.
Chris Josephes works as a system administrator for Internet Broadcasting.
Copyright © 2009 O'Reilly Media, Inc.