O'Reilly Databases

oreilly.comSafari Books Online.Conferences.

We've expanded our coverage and improved our search! Search for all things Database across O'Reilly!

Search Search Tips

advertisement
AddThis Social Bookmark Button

Print Subscribe to Databases Subscribe to Newsletters

Data Mining Email

by Robert Bernier
04/08/2004

I once did a contract administrating the network of a small startup company. The owner's total existence revolved around the email generated from his various business relationships and he never made a decision unless the issues were first detailed in an email. My boss once confided in me of his hatred of his email client because it limited him in the way he could organize his information.

I was so impressed by his desperation that I found myself constantly thinking about his dilemma. In time the answer came to me. He needed a data mine.

What is data mining anyway? Data mining is a class of database applications that look for hidden patterns in a group of data.

In writing this article it quickly became apparent that you might need several pieces of unrelated technical know-how to create a data mine so I ended up asking myself "who am I writing this article for?". Do I write for the Perl programmer or the DBA? Programming and managing data don't use the same set of skills but they are nevertheless both required. So where does this article go?

This article is for those who need a guide to generating information from existing data and are looking for ideas on how to do it. Thomas Edison once said that genius was 1 percent inspiration and 99 percent perspiration. Here's your 1 percent.

The Scenario

The objective is to data mine the email that resides in a Mozilla mail box. We are especially interested in MS Word .doc attachments. We'll parse the email and upload it into a relational database where we can then analyze it.

The Tools

The Mozilla mailboxes follow the mbox format. Perl has several excellent email parsing modules available on the CPAN. We'll also use the PostgreSQL database and its console-based client psql, as well as the command-line utilities find, file, and antiword.

The Developer's Skillset

Perl-wise, you'll need intermediate programming experience. You'll need to know how to execute command-line utilties within a Perl script. You should be able to install and use Perl modules from the CPAN, and you'll need to be familiar with using PostgreSQL from Perl.

Regarding PostgreSQL, you should know how to install PostgreSQL on Linux either by using precompiled binaries such as RPMs, Debian packages, and tar balls or by compiling it from source code. If someone else has installed this already, great! You should be familiar with the SQL 92 and SQL 99 standards in that you can create tables, views, constraints, and user-defined functions.

On the command line, you need to know how to pipe standard input and output from one process to the next.

The Assumptions

I assume you have Perl 5.8.x installed with access to the CPAN configured correctly. You also need a standard installation of Postgres with minimal security options configured. The example Perl script will connect to the database via a socket connection, so uncomment the line tcpip_socket = true in postgreql.conf.

Furthermore, the script assumes that the connection requires a password:

# TYPE    DATABASE   USER    IP-ADDRESS   IP-MASK        METHOD

local     all        all                                  trust
host      all        all     127.0.0.1    255.255.255.255 password

I'll assume that we'll use the PostgreSQL superuser account, postgres. As well, the mbox file we'll process is Sent and is in the same directory as the parser script.

Part 1. Searching for Mailboxes

The Mozilla mail client stores its mailboxes under the user's home directory, usually in ~/.mozilla. Figure 1 shows the different mailboxes of an existing profile. The mboxes are those files without any extensions. Another hint is to look at the file size; mailboxes are normally the largest files present.

screen shot of Mozilla mail box
Figure 1. Screenshot of a Mozilla mailbox.

The following command lists all the mailboxes in my Mozilla profile account:

$ find ~/.Mozilla/default/rfv2udn7.slt/Mail/ -type f -not \
	-iname "*.*" -not -iname ".*" -print

Part 2. Preparing the Database

Before we can create the right database schema, we need to decide what kind of information to extract from these messages. There are many options, but the following assumptions will suffice:

  • The email has the following parts of interest; From, To, CC, and Attachments.
  • The email comes from one source.
  • An email can be addressed to more than one person.
  • An email can have more than one CC.
  • An email can have more than one attachment.
  • The only attachments of interest are MS Word documents.

The Database

With that in mind, let's set up the database. In a console, create a database called email:

$ psql -U postgres -command "create database email" template1

If this is your first time using PostgreSQL, alter the username postgres and give it the password of 123. By default, this username has no assigned password.

 $ psql -U postgres --command "alter user postgres with password '123' " email

The Tables

The most important column in the tables is the messageid, which is a unique alphanumeric code that is part of the email message. The tables have been defined such that no data population can occur unless the messageid already exists in the table mailid.

The message body and attachments are to be stored as large objects. Many of the tables have a column of type oid, which refers to the actual data that is located in the system catalog pg_largeobject.

Indexes are created by default in Postgres for each primary key that is defined in a table.

The create tables script will clean the database email and create the tables. Invoke it by executing it from the console:

$ psql -U postgres -f createTables.sql email

Pages: 1, 2

Next Pagearrow




Tagged Articles

Be the first to post this article to del.icio.us

Related to this Article

Data Jujitsu: The Art of Turning Data into Product Data Jujitsu: The Art of Turning Data into Product
November 2012
$0.00 USD

Designing Great Data Products Designing Great Data Products
March 2012
$0.00 USD

Sponsored Resources

  • Inside Lightroom
Advertisement
O'reilly

© 2013, O’Reilly Media, Inc.

(707) 827-7019 (800) 889-8969

All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.

About O'Reilly

  • Academic Solutions
  • Jobs
  • Contacts
  • Corporate Information
  • Press Room
  • Privacy Policy
  • Terms of Service
  • Writing for O'Reilly

Community

  • Authors
  • Community & Featured Users
  • Forums
  • Membership
  • Newsletters
  • O'Reilly Answers
  • RSS Feeds
  • User Groups

Partner Sites

  • makezine.com
  • makerfaire.com
  • craftzine.com
  • igniteshow.com
  • PayPal Developer Zone
  • O'Reilly Insights on Forbes.com

Shop O'Reilly

  • Customer Service
  • Contact Us
  • Shipping Information
  • Ordering & Payment
  • The O'Reilly Guarantee