Data Mining Email
by Robert Bernier04/08/2004
I once did a contract administrating the network of a small startup company. The owner's total existence revolved around the email generated from his various business relationships and he never made a decision unless the issues were first detailed in an email. My boss once confided in me of his hatred of his email client because it limited him in the way he could organize his information.
I was so impressed by his desperation that I found myself constantly thinking about his dilemma. In time the answer came to me. He needed a data mine.
What is data mining anyway? Data mining is a class of database applications that look for hidden patterns in a group of data.
In writing this article it quickly became apparent that you might need several pieces of unrelated technical know-how to create a data mine so I ended up asking myself "who am I writing this article for?". Do I write for the Perl programmer or the DBA? Programming and managing data don't use the same set of skills but they are nevertheless both required. So where does this article go?
This article is for those who need a guide to generating information from existing data and are looking for ideas on how to do it. Thomas Edison once said that genius was 1 percent inspiration and 99 percent perspiration. Here's your 1 percent.
The Scenario
The objective is to data mine the email that resides in a Mozilla mail box.
We are especially interested in MS Word .doc attachments. We'll
parse the email and upload it into a relational database where we can then
analyze it.
The Tools
The Mozilla mailboxes follow the mbox format.
Perl has several excellent email parsing modules available on the CPAN. We'll also use the PostgreSQL database and its console-based client psql, as well as the command-line utilities find, file, and antiword.
The Developer's Skillset
Perl-wise, you'll need intermediate programming experience. You'll need to know how to execute command-line utilties within a Perl script. You should be able to install and use Perl modules from the CPAN, and you'll need to be familiar with using PostgreSQL from Perl.
Regarding PostgreSQL, you should know how to install PostgreSQL on Linux either by using precompiled binaries such as RPMs, Debian packages, and tar balls or by compiling it from source code. If someone else has installed this already, great! You should be familiar with the SQL 92 and SQL 99 standards in that you can create tables, views, constraints, and user-defined functions.
On the command line, you need to know how to pipe standard input and output from one process to the next.
The Assumptions
I assume you have Perl 5.8.x installed with access to the CPAN configured
correctly. You also need a standard installation of Postgres with minimal
security options configured. The example Perl script will connect to the
database via a socket connection, so uncomment the line tcpip_socket =
true in postgreql.conf.
Furthermore, the script assumes that the connection requires a password:
# TYPE DATABASE USER IP-ADDRESS IP-MASK METHOD
local all all trust
host all all 127.0.0.1 255.255.255.255 password
I'll assume that we'll use the PostgreSQL superuser account,
postgres. As well, the mbox file we'll process is
Sent and is in the same directory as the parser script.
Part 1. Searching for Mailboxes
The Mozilla mail client stores its mailboxes under the user's home
directory, usually in ~/.mozilla. Figure 1 shows the different
mailboxes of an existing profile. The mboxes are those files without any
extensions. Another hint is to look at the file size; mailboxes are normally
the largest files present.

Figure 1. Screenshot of a Mozilla mailbox.
The following command lists all the mailboxes in my Mozilla profile account:
$ find ~/.Mozilla/default/rfv2udn7.slt/Mail/ -type f -not \
-iname "*.*" -not -iname ".*" -print
Part 2. Preparing the Database
Before we can create the right database schema, we need to decide what kind of information to extract from these messages. There are many options, but the following assumptions will suffice:
- The email has the following parts of interest; From, To, CC, and Attachments.
- The email comes from one source.
- An email can be addressed to more than one person.
- An email can have more than one CC.
- An email can have more than one attachment.
- The only attachments of interest are MS Word documents.
The Database
With that in mind, let's set up the database. In a console, create a
database called email:
$ psql -U postgres -command "create database email" template1
If this is your first time using PostgreSQL, alter the username
postgres and give it the password of 123. By default, this
username has no assigned password.
$ psql -U postgres --command "alter user postgres with password '123' " email
The Tables
The most important column in the tables is the messageid, which is a unique alphanumeric code that is part of the email message. The tables have been defined such that no data population can occur unless the messageid already exists in the table mailid.
The message body and attachments are to be stored as large objects. Many of the tables have a column of type oid, which refers to the actual data that is located in the system catalog pg_largeobject.
Indexes are created by default in Postgres for each primary key that is defined in a table.
The create tables script will clean the
database email and create the tables. Invoke it by executing it from the
console:
$ psql -U postgres -f createTables.sql email
Pages: 1, 2 |



