 Published on MacDevCenter (http://www.macdevcenter.com/)


The Fight Against Spam, Part 2

by FJ de Kermadec
05/18/2004

In Part 1, I focused on laying the foundation for an anti-spam strategy and covering how to block most of your unwanted mail. In today's article of this three-part series, I'm going to fine-tune this strategy, plus take a closer look at Mail.app, so that you can more fully unleash its potential.

The Real Show Stopper: Mail's Junk Mail Filter

Created by the engineers who bring the Japanese input method and the Speech technologies to you, Mail's junk mail filters are outstanding. When trained for a sufficient period of time, the filters can reach 98%+ accuracy against spam and are surprisingly painless to use. In fact, this feature alone has convinced many users to switch to Mail.

How Does Junk Mail Work?

Author's note: Kim Silverman, principal research scientist and manager for the Spoken Language Technologies at Apple, helped as I prepared the following paragraphs. I appreciate the information he so kindly provided. Needless to say, if there are any inaccuracies, they are entirely mine.

Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules; no, it doesn't look for keywords; and no, it doesn't use white magic. To truly understand what makes it so much better than the competition, we'll have to take a closer look at the recognition engine and the technologies it relies on to do its work. It may sound a bit complex at first, but things will begin to make sense as we work through the mechanics.

Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who managed thousands or millions of large documents find the one they were looking for easily. In order to do that, this technology had to allow users to perform a search by topic.

The traditional approach to this has been called "vector representation." Imagine a huge table in which each column is labeled by a word from the union of all the words in all the documents. Every row is labeled by a document. And every cell contains the number of times that column's word appears in that row's document.

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers or a point in a space with n dimensions. I know it sounds quite geeky but if you can visualize that, you're halfway there.
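To make the table concrete, here is a minimal Python sketch of it. The three sample documents and the helper names (tokenize, term_document_matrix) are made up for illustration, and splitting on whitespace is a simplifying assumption; this is not how Apple's engine tokenizes text.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into words on whitespace (a simplification)."""
    return text.lower().split()

def term_document_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The columns are the union of all the words in all the documents.
    vocabulary = sorted(set().union(*counts))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows

# Hypothetical sample corpus.
documents = [
    "aunt emma sent cooking tips",
    "cooking tips for a large turkey",
    "cheap mortgage rates cheap loans",
]
vocabulary, rows = term_document_matrix(documents)
print(vocabulary)
for doc, row in zip(documents, rows):
    print(row, doc)
```

Each printed row is one document expressed as a vector with one number per word in the vocabulary, which is exactly the "point in a space with n dimensions" described above.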

Here comes the interesting part. Since every document is a point, you can cluster them. Cluster analysis will find groups of points (sometimes called "clouds") in a graph that consists of multiple, unevenly spread points. It will then tell you how these clusters describe the overall spread of the points.

That's what we do with our files, and all the documents in a cluster tend to be about the same topic. The part of Mac OS X that does all that is called the "Apple data kit." It's an engine that specializes in vector representation and can be used to find documents, sort a corpus into topics, and yes, it even auto-discovers them. The Apple data kit allows the user to find the single document that best represents each topic. Best of all, it also produces a summary of a document. That's what allows the accompanying AppleScripts to write summaries of your reports (this is called Summarize, located in the Services menu for Mail.app).

The Joys and Pains of Vector Representation

The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.

The representation looks very much like a "bag of words," since it is based on the total number of times a word appears in a document. Documents about the same topic will usually contain similar words.

Also, whereas statistical language models capture only local patterns, vector representation captures non-local ones. So, a document that contains "Aunt Emma" at the beginning of a page and "cooking tips" at the end can well be in the same cluster as a text that talks precisely about "the time Aunt Emma sent you cooking tips."
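A short Python sketch shows why word order does not matter to a bag of words. It compares documents with cosine similarity, a common way to measure how close two count vectors are; I am assuming that measure for illustration, since nothing here says which one Mail uses, and the sample sentences are invented.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, ignoring where it occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Return the cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

first = bag_of_words("aunt emma sent you cooking tips")
shuffled = bag_of_words("cooking tips you sent aunt emma")
other = bag_of_words("cheap mortgage rates for you")

print(cosine_similarity(first, shuffled))  # same words, different order
print(cosine_similarity(first, other))     # only "you" in common
```

The first two texts contain the same words in a different order, so their vectors are identical and the similarity is as high as it can get. The third shares a single word with the first and scores much lower.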

However, as with every technology, the benefits come with a few drawbacks. First of all, since the dimensionality is huge, it is computationally expensive. Also, since most words do not occur in any particular document, there are lots of zeros in the numbers that represent them. In mathematical terms, the matrix is sparse. Do you feel lost? Imagine this: take the biggest issue you can find of the Mac Developer Journal and put it in your left hand, and put your favorite dictionary in your right hand. How many words in the dictionary can you find in the Journal? Not many.
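To put a number on "lots of zeros," the following Python sketch measures the fraction of empty cells in a small, made-up count table and shows one common way to store only the non-zero cells. The function names and the data are hypothetical.

```python
def sparsity(rows):
    """Return the fraction of cells in a count table that are zero."""
    cells = [value for row in rows for value in row]
    if not cells:
        return 0.0
    return sum(1 for value in cells if value == 0) / len(cells)

def to_sparse(row):
    """Keep only the non-zero counts of a row, keyed by column index."""
    return {index: value for index, value in enumerate(row) if value != 0}

# Hypothetical table: 3 documents (rows) by 6 words (columns).
rows = [
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
]
print(sparsity(rows))
print(to_sparse(rows[2]))
```

Even in this tiny table most cells are zero. With a real dictionary-sized vocabulary the proportion is far higher, which is why the dimensionality is so costly.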

These "details" explain why clustering doesn't always work so well.

Also, most counts are low, and therefore inaccurate since they can more easily contain sampling errors. Let's say, for example, that your Aunt Emma, in her cooking tips, talks about a "hippopotamus" (as in "For the turkey to be tasty, it should be quite large but obviously, you don't want a hippopotamus-sized one."). The fact that she used it once does not mean that she will use this word again in her cooking tips. This phenomenon is called "noise."

To address all these issues, and reliably recognize the topic of documents, we need to jump into Latent Semantic Analysis.

Latent Semantic Analysis to the Rescue

To make up for the shortcomings in vector representation, we use something magical called "singular value decomposition." It reduces the dimensionality, gets rid of the sparseness, and statistically finds the regularities in the noise. In other words, it captures the underlying stable pattern in the data we have. In case you're wondering, this involves using regression lines, but that's another story.

If each document is a point in a space with tens of thousands of dimensions or so, we reduce it to a small number of dimensions that capture the salient patterns and the majority of the variation in the corpus. Then, we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.
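For the curious, here is a rough Python sketch of the dimensionality reduction. It approximates the strongest singular directions of a tiny count table with power iteration and deflation, a simple textbook method I chose because it fits in a few lines of plain Python; it is an assumption for illustration, not a description of how Apple's engine computes its decomposition. The table and the function names are invented.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_directions(a, k, iterations=200, seed=0):
    """Approximate the k strongest right singular vectors of `a`.

    Uses power iteration on A^T A, removing the directions already found.
    Returns fewer than k directions if the table runs out of structure.
    """
    at = [list(column) for column in zip(*a)]  # transpose: one row per word
    rng = random.Random(seed)
    directions = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in at]
        for _ in range(iterations):
            w = matvec(at, matvec(a, v))
            for found in directions:  # deflation: drop what we already have
                overlap = dot(w, found)
                w = [x - overlap * y for x, y in zip(w, found)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                return directions
            v = [x / norm for x in w]
        directions.append(v)
    return directions

def project(a, directions):
    """Express every document (row) in the reduced space."""
    return [[dot(row, d) for d in directions] for row in a]

# Hypothetical counts. Columns: cooking, tips, turkey, mortgage, rates, loans.
matrix = [
    [2, 1, 1, 0, 0, 0],  # cooking mail
    [1, 2, 0, 0, 0, 0],  # cooking mail
    [0, 0, 0, 2, 1, 1],  # mortgage mail
    [0, 0, 0, 1, 1, 2],  # mortgage mail
]
coords = project(matrix, top_directions(matrix, 2))
for row in coords:
    print([round(value, 3) for value in row])
```

Six word counts per document shrink to two coordinates. In this contrived table the two mortgage messages end up on one axis and the two cooking messages on the other, so the clusters are easy to see. The sign of each coordinate is arbitrary; only the grouping matters.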

Like we did before, you can perform a bit of cluster analysis and find clusters of documents that each represent a topic. You now have under your eyes a computational representation of semantics.

Because words are distributed in the same space as documents, you can find the words that are closer to the center of a document cluster. Those will be the words that characterize the meaning of the documents in that cluster, even if a document does not contain all those words.

So we can find words that describe a document without requiring that they be necessarily found in the document.

Everywhere on Your Mac, for Your Pleasure

Even though Apple is not the only company working on such technologies, they do seem to be the only ones to have made it so accessible to end users and powerful at the same time. In fact, they do it so well that it is now at the center of many system components as we have seen, requiring them to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.

How Does This Apply to my Spam?

So, we've endured lots of math. But now, let's get back to our main topic and see how this math applies to your spam.

There are two traditional approaches to spam. The first looks for keywords in a message and flags any mail containing those words as spam. This has a major drawback. What if your Aunt Emma happens to mention to you as an aside, in a very important email about a family gathering supposed to take place in a few days, that your uncle had an opportunity to take Viagra? The mail will be flagged and deleted, causing you to miss the gathering -- or, if it were in the business environment, potential revenue.

Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.
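The weakness of the keyword approach is easy to reproduce. The Python sketch below is a deliberately naive filter; the keyword list, the function name, and the sample messages are all hypothetical.

```python
# Hypothetical blocklist, for illustration only.
BLOCKED_KEYWORDS = {"viagra", "mortgage"}

def is_spam_by_keyword(message, blocked=BLOCKED_KEYWORDS):
    """Flag a message as soon as it contains any blocked word."""
    words = {word.strip(".,!?;:\"'()").lower() for word in message.split()}
    return not words.isdisjoint(blocked)

family_mail = (
    "The gathering is on Sunday at noon. "
    "By the way, your uncle tried Viagra."
)
disguised_spam = "Buy V1agra now at the lowest price"

print(is_spam_by_keyword(family_mail))     # flagged: a false positive
print(is_spam_by_keyword(disguised_spam))  # not flagged: the spam gets through
```

Aunt Emma's message is flagged because of a single word, while a spammer who misspells that word on purpose slips through. One characteristic of the message decides everything, which is the binary behavior the rest of this section argues against.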

The other traditional approach is to look at the sender and not accept any message from any known junk-mail sender. However, this is even less likely to work since junk mailers keep changing their addresses. Some people have proposed that you only accept mail from senders in your address book, but for obvious reasons, this isn't realistic.

That's why Latent Semantic Analysis is much better. It doesn't make binary decisions based on any single characteristic of a message. It analyzes the meaning of the words and acts accordingly.

And to make this work even better, you can add your own rules to Mail.app to shape its behavior.


Figure 1. Spam message flagged by Mail.

Why Make it Trainable Then?

A common question about the spam filter in Mail.app is why the Apple engineers decided to make it trainable. After all, if it truly understood the meaning of a mail, it would immediately see what's junk and what's not, right?

Well, not exactly. Let's imagine that you, like most Mac users, are constantly receiving spam about mortgage opportunities. Mail would naturally flag them as junk. But what if you were in the market for a house and had requested quotes from legitimate companies? This is when the ability to train Mail comes to the rescue. You may want to alter the rules while you shop for a mortgage.

Does it Work with Other Languages?

Mail is often criticized because the system it uses "only reads English." Nothing could be further from the truth. Mail does accurately flag messages in other languages. The corpus on which it is pre-trained includes mail in different languages, and it is just as trainable in German or Japanese as it is in English -- thanks to a few other cool Apple technologies regarding tokenization that go beyond the scope of this article.

This Sounds Complex, Should I Disable it on my iBook?

Don't worry. Even though Junk Mail relies on very complex technologies, it's very efficient and easy on the computer, even on slower G3 laptops.

This is a good example of expanding capability without sacrificing performance, by writing good code.

An Introduction to Using "Junk Mail"

As soon as you launch Mail, the Junk Mail filter is turned on in "training mode." As long as training mode is on, Mail will display all the messages you receive in your inbox, including the junk. However, potential spam will be marked with cute, paper-bag icons and will appear in a disgustingly distinctive brown color, making it easy to spot the unwanted messages.

If you notice a message that is incorrectly flagged as junk, simply open it and click on the "Not junk" button located at the top of the message in the brown banner. If you notice a message that should be marked as spam but isn't, select it and use the "Message" menu to "Mark it as junk mail." Alternatively, you can place a "Junk" button in your toolbar; simply use the "View" menu to customize it.

As soon as you mark a mail as Junk or Not Junk, the junk mail filter will fine-tune its analysis, learning what you consider to be junk and what it should let go through to your inbox. This simple-looking learning capability is actually what makes Mail amazing and very different from its competitors.

For most people, Viagra ads are spam and gardening-related messages are updates from their grandparents. But what if your grandparents like to talk about Viagra and you are being spammed by a gardening service? While most other programs won't be able to adapt to your situation, Mail will, and effortlessly.
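The idea of a filter that learns from your "Junk" and "Not junk" clicks can be illustrated with a toy Python class. To be clear about the assumptions: this sketch simply compares raw word counts and skips the Latent Semantic Analysis step entirely, so it is far cruder than what Mail does. The class name, its methods, and the sample messages are invented.

```python
import math
from collections import Counter

class TrainableJunkFilter:
    """Toy filter that compares a message with the words seen under each label."""

    def __init__(self):
        self.totals = {"junk": Counter(), "not junk": Counter()}

    def mark(self, message, label):
        """Learn from the user marking a message as "junk" or "not junk"."""
        self.totals[label].update(message.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[word] * b[word] for word in a if word in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
            sum(v * v for v in b.values())
        )
        return dot / norm if norm else 0.0

    def is_junk(self, message):
        """Return True if the message looks more like the junk seen so far."""
        words = Counter(message.lower().split())
        junk_score = self._cosine(words, self.totals["junk"])
        good_score = self._cosine(words, self.totals["not junk"])
        return junk_score > good_score

# A user whose grandparents discuss Viagra and who is spammed by gardeners.
mail_filter = TrainableJunkFilter()
mail_filter.mark("special offer on lawn care and gardening services", "junk")
mail_filter.mark("grandpa says the viagra is working wonders", "not junk")

print(mail_filter.is_junk("gardening services special offer today"))
print(mail_filter.is_junk("grandpa asked about viagra again"))
```

After just two clicks, this filter treats the gardening offer as junk and lets the grandparents' message through, the opposite of what a fixed keyword list would do.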

Once you're satisfied with the accuracy of its analysis, you can switch it to "automatic" mode.


Figure 2. Mail's junk preferences.

As soon as automatic mode is turned on, any mail flagged as junk mail will be moved to a special Junk mailbox. Of course, you are still responsible for what happens to this mail. Should it be deleted? Kept for archiving? We'll see in a minute how to fine-tune this behavior.

Turning automatic mode on is a big step since it may prevent you from reading legitimate mails, especially if you don't check the Junk mailbox or you choose to delete your junk mails immediately. Although the number of false positives is extremely low (or, in most cases, zero), you may want to add a signature to your mail or a note to your web site, stating that you use anti-spam filtering technologies. You can also ask that your potential correspondents resend emails if they do not receive answers in a certain timeframe.

Fine Tuning and Automating "Junk Mail"

In order to customize the filtering, use the "Mail" menu to open the Mail preferences and click on the "Junk Mail" button. Switching between "training" and "automatic" mode is as simple as selecting the corresponding radio button. As soon as you enter "automatic" you will see that Mail creates a new Junk mailbox with the same paper-bag icon. The following preferences are easily understandable. However, here are a few notes about what they can do:

The "Trust Junk Mail headers set by your Internet Service Provider" feature is great, but only as long as your provider uses standard junk-mail filtering options. Indeed, some ISPs use proprietary solutions that Mail doesn't know. If this is the case, you can create a special rule that scans the "Header" used by your provider to rate junk messages and decide whether it should be marked as junk or not -- a simple task that does not require any programming on your part.
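Mail's rules need no code, but if you want to see what such a header check amounts to, here is a small Python sketch using the standard email module. The header name X-Spam-Flag and the value YES are assumptions: some server-side filters add a header like this, but yours may use a different name or a score instead, so look at the raw headers of a message your provider has rated. The sample message is invented.

```python
from email import message_from_string

# Assumed header and value; check what your own provider actually adds.
ISP_SPAM_HEADER = "X-Spam-Flag"
ISP_SPAM_VALUE = "YES"

def flagged_by_provider(raw_message, header=ISP_SPAM_HEADER, value=ISP_SPAM_VALUE):
    """Return True if the provider's header marks the message as junk."""
    message = message_from_string(raw_message)
    found = message.get(header, "")  # header names are matched case-insensitively
    return found.strip().upper() == value.upper()

# Hypothetical raw message as it might arrive from the server.
raw = (
    "From: sender@example.com\n"
    "To: you@example.com\n"
    "Subject: Lowest mortgage rates\n"
    "X-Spam-Flag: YES\n"
    "\n"
    "Body text.\n"
)
print(flagged_by_provider(raw))
```

A rule in Mail does the same thing: it looks for the header your provider sets and acts on its value.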


Figure 3. Typical mail headers.

However, when turning this feature on, you will want to take into account how reliable your mail provider's filters are. Some providers are known for setting up paranoid filters that block even legitimate mail, while others let everything go through. Some of them now allow users to customize filters, a great step forward. In most cases, server-side junk-mail filtering features can be accessed through the provider's webmail interface, so it's worth having a look if you haven't checked in for a while. You may actually find other nice features there. For example, the .Mac webmail allows you to set up a custom mail icon visible by all Mail.app users.

The "Advanced" button is extremely interesting. Do you remember the old days when Junk Mail was listed in the "Rules" category? Well, this button allows you to see junk-mail settings as a rule. For example, you could also set mail up to run an AppleScript when you receive mail. What about getting the headers of the message so that you can send them to your IT department? Or your email provider?

On a less ambitious scale, you can use this rule to mark junk mails as read automatically -- to avoid seeing the "unread messages" notifications while sorting through your legitimate mail. You could also play a specific sound as a reminder to have a look through your Junk Mail mailbox from time to time or, let's be crazy, switch the mail color from brown to purple.

What Should I Do with Spam Once It's Flagged?

We've seen that Mail.app will put flagged mail into a special mailbox called "Junk." However, your messages will stay there unless you specifically tell Mail what to do with them.

In order to do so, check the "Special Mailboxes" tab of your various account preferences. It contains a popup menu that allows you to specify what should be done to this mailbox.

Usually, storing junk messages on the server is a bad idea since it will increase the chances that your server will be cluttered and that your mailbox will reach full capacity, effectively bouncing legitimate messages back.

Deleting Junk messages when "Quitting Mail" would be my setting of choice since you probably don't want to keep them on your hard drive for too long. However, you should remember to check this mailbox for false positives before quitting Mail. Otherwise, they go unnoticed and may be deleted without you ever seeing them.

It sounds silly, but I suggest you use this opportunity to make sure Trash (in Mail.app) is set up well and that deleting messages from your various accounts does not simply move them to another folder on the server.

Since the trash setting is applied evenly to all of your accounts, you can set up separate rules to manage them individually if need be. For example, you may want to delete junk mail from your Home account automatically -- since your friends probably won't be too mad if you miss one of their healthy cooking tips. But you should purge your business account every week, so that you have a chance to scan it and avoid missing a potential customer.

Next Time

I'll wrap up this series on Friday with a closer look at techniques for applying rules, address masking, and some general tips to confound spammers. See you then!

FJ de Kermadec is an author, stylist and entrepreneur in Paris, France.


Copyright © 2009 O'Reilly Media, Inc.