
The Fight Against Spam, Part 2

by FJ de Kermadec
05/18/2004

In Part 1, I focused on laying the foundation for an anti-spam strategy and covered how to block most of your unwanted mail. In today's installment of this three-part series, I'm going to fine-tune that strategy and take a closer look at Mail.app, so that you can more fully unleash its potential.

The Real Show Stopper: Mail's Junk Mail Filter

Created by the same engineers who bring you the Japanese input method and the Speech technologies, Mail's junk mail filter is outstanding. When trained for a sufficient period of time, it can reach 98%+ accuracy against spam and is surprisingly painless to use. In fact, this feature alone has convinced many users to switch to Mail.

How Does Junk Mail Work?

Author's note: Kim Silverman, principal research scientist and manager for the Spoken Language Technologies at Apple, helped as I prepared the following paragraphs. I appreciate the information he so kindly provided. Needless to say, if there are any inaccuracies, they are entirely mine.

Many myths have emerged about Mail's junk mail filter. No, it's not an extremely complex set of rules; no, it doesn't look for keywords; and no, it doesn't use white magic. To truly understand what makes it so much better than the competition, we'll have to take a closer look at the recognition engine and the technologies it relies on to do its work. It may sound a bit complex at first, but things will begin to make sense as we work through the mechanics.

Interestingly enough, the technology that underlies the Junk Mail filter began its life as an information retrieval system, developed in the Apple labs to help users who manage thousands or millions of large documents easily find the one they are looking for. In order to do that, this technology had to allow users to perform a search by topic.


The traditional approach to this has been called "vector representation." Imagine a huge table in which each column is labeled by a word from the union of all the words in all the documents. Every row is labeled by a document, and every cell contains the number of times that word appears in that document.

Each document is in turn represented by a long string of numbers, one for each word in the corpus. In mathematical terms, we would say that every document is a vector of n numbers, or a point in a space with n dimensions. I know it sounds quite geeky, but if you can visualize that, you're halfway there.
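If code speaks to you more than tables, here is a tiny Python sketch of that idea. The corpus and the helper names are mine, invented for illustration; this is not Apple's implementation.

# A miniature version of the "huge table": rows are documents, columns
# are words, and each cell counts how often that word appears in that
# document. Illustration only.
documents = [
    "aunt emma sent you cooking tips",
    "cooking a turkey takes time and tips help",
    "mortgage rates and mortgage quotes",
]

# The union of all the words in all the documents labels the columns.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_vector(doc):
    # Each document becomes a vector of n counts: a point in an
    # n-dimensional space, where n is the size of the vocabulary.
    words = doc.split()
    return [words.count(term) for term in vocabulary]

matrix = [to_vector(doc) for doc in documents]
for doc, row in zip(documents, matrix):
    print(row, "<-", doc)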

Here comes the interesting part. Since every document is a point, you can cluster the documents. Cluster analysis will find groups of points (sometimes called "clouds") in a graph that consists of multiple, unevenly spread points. It will then tell you how these clusters describe the overall spread of the points.
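If you want to see clustering at work on such points, here is a minimal k-means sketch in Python. The data is invented and two-dimensional for readability (real document vectors have one dimension per word), and this is the standard textbook algorithm, not what Mac OS X actually runs.

import numpy as np

# Toy 2-D document points; two obvious "clouds".
points = np.array([
    [3.0, 0.2], [2.5, 0.5],   # two "cooking" documents
    [0.1, 3.0], [0.5, 2.8],   # two "mortgage" documents
])

def kmeans(points, k, steps=20, seed=1):
    # The classic k-means loop: assign each point to its nearest
    # center, then move each center to the mean of its cloud.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        dists = np.linalg.norm(points[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    return labels, centers

labels, centers = kmeans(points, k=2)
print(labels)  # e.g. [0 0 1 1]: two clusters, two topics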

That's what this technology does with our files, and all the documents in a cluster tend to be about the same topic. The part of Mac OS X that does all of this is called the "Apple data kit." It's an engine that specializes in vector representation and can be used to find documents, sort a corpus into topics, and, yes, even discover those topics automatically. The Apple data kit allows the user to find the single document that best represents each topic. Best of all, it also produces a summary of a document. That's what allows the accompanying AppleScripts to write summaries of your reports (this is called Summarize, located in the Services menu for Mail.app).

The Joys and Pains of Vector Representation

The main advantage of vector representation is that this technology does not rely on word order to do its work -- you can have a look at our speech article to learn more about why this is important.

The representation looks very much like a "bag of words," since it is based on the total number of times a word appears in a document. Documents about the same topic will usually contain similar words.

Also, whereas statistical language models capture only local patterns to do their work, vector representation captures non-local patterns. So, a document that mentions "Aunt Emma" and "cooking tips" at opposite ends of a page can very well end up in the same cluster as a text that talks specifically about "the time Aunt Emma sent you cooking tips."
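Here's a quick way to convince yourself of that: because the representation is a bag of words, the scattered mentions and the single tight sentence come out looking very similar. The sample texts and the cosine helper below are mine, for illustration.

from collections import Counter
from math import sqrt

def bag_of_words(text):
    # Word order is thrown away; only the counts survive.
    return Counter(text.lower().split())

def cosine(a, b):
    # Standard cosine similarity between two count vectors.
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

scattered = "aunt emma wrote at the top and near the end added cooking tips"
compact = "the time aunt emma sent you cooking tips"

# High similarity, even though the shared words sit far apart in one
# document and side by side in the other.
print(round(cosine(bag_of_words(scattered), bag_of_words(compact)), 2))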

However, as with every technology, the benefits come with a few drawbacks. First of all, since the dimensionality is huge, vector representation is computationally expensive. Also, since most words do not occur in any particular document, there are lots of zeros among the numbers that represent it. In mathematical terms, the matrix is sparse. Do you feel lost? Imagine this: take the biggest issue of the Mac Developer Journal you can find and put it in your left hand, and put your favorite dictionary in your right hand. How many words in the dictionary can you find in the Journal? Not many.
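You can measure that sparseness directly. In the toy matrix below (invented data, not real mail), most cells are already zero, and the effect only gets stronger as the vocabulary grows.

import numpy as np

# A small term-document count matrix: three documents, six words.
matrix = np.array([
    [3, 1, 0, 0, 0, 0],
    [0, 2, 1, 0, 0, 0],
    [0, 0, 0, 4, 1, 0],
])

# Fraction of zero cells: the matrix is sparse even at this size.
print(f"{(matrix == 0).mean():.0%} of the cells are zero")  # 67%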

These "details" explain why clustering doesn't always work so well.

Also, most counts are low, and therefore inaccurate, since low counts are especially vulnerable to sampling errors. Let's say, for example, that your Aunt Emma, in her cooking tips, talks about a "hippopotamus" (as in "For the turkey to be tasty, it should be quite large but obviously, you don't want a hippopotamus-sized one."). The fact that she used the word once does not mean that she will use it again in her cooking tips. This phenomenon is called "noise."

To address all these issues, and reliably recognize the topic of documents, we need to jump into Latent Semantic Analysis.

Latent Semantic Analysis to the Rescue

To make up for the shortcomings of vector representation, we use something magical called "singular value decomposition." It reduces the dimensionality, gets rid of the sparseness, and statistically finds the regularities in the noise. In other words, it captures the underlying stable pattern in the data we have. In case you're wondering, this involves using regression lines, but that's another story.

If each document is a point in a space with some tens of thousands of dimensions, we reduce its dimensionality to a small number of dimensions that capture the salient patterns and the majority of the variation in the corpus. Then we can do the Latent Semantic Analysis. In this new space, each axis is a weighted combination of all the words: documents and words coexist in the same space.
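In code, the reduction step looks something like this NumPy sketch: decompose the count matrix, then keep only the k strongest dimensions. The matrix is invented, and this is the generic SVD recipe rather than Apple's exact computation.

import numpy as np

# A tiny term-document count matrix: three documents, four words.
counts = np.array([
    [3.0, 1.0, 0.0, 0.0],   # doc 1: cooking-ish
    [2.0, 2.0, 0.0, 1.0],   # doc 2: cooking-ish
    [0.0, 0.0, 4.0, 2.0],   # doc 3: mortgage-ish
])

# Singular value decomposition, keeping only the k strongest
# dimensions -- the ones that capture most of the variation.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
docs_reduced = U[:, :k] * S[:k]   # documents as points in k dimensions
words_reduced = Vt[:k, :].T       # words live in the same k-dim space

print(docs_reduced.round(2))
print(words_reduced.round(2))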

As we did before, you can perform a bit of cluster analysis and find clusters of documents that each represent a topic. What you now have before you is a computational representation of semantics.

Because words are distributed in the same space as documents, you can find the words that are closer to the center of a document cluster. Those will be the words that characterize the meaning of the documents in that cluster, even if a document does not contain all those words.

So we can find words that describe a document without requiring that they actually appear in the document.
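Continuing the sketch from above, finding those characteristic words is just a matter of ranking the words by their closeness to a cluster's center. The vocabulary and counts are again invented for illustration.

import numpy as np

vocabulary = ["cooking", "tips", "mortgage", "rates"]
counts = np.array([
    [3.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 4.0, 2.0],
])

U, S, Vt = np.linalg.svd(counts, full_matrices=False)
docs = U[:, :2] * S[:2]   # documents in the reduced space
words = Vt[:2, :].T       # words in the very same space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Center of the cluster formed by the first two ("cooking") documents.
center = docs[:2].mean(axis=0)

# Rank every word by how close it sits to that center: the top words
# characterize the cluster's topic, whether or not every document in
# the cluster actually contains them.
ranked = sorted(vocabulary, key=lambda w: -cosine(words[vocabulary.index(w)], center))
print(ranked)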

Everywhere on Your Mac, for Your Pleasure

Even though Apple is not the only company working on such technologies, it does seem to be the only one to have made them this accessible to end users and this powerful at the same time. In fact, Apple does it so well that the technology now sits at the center of many system components, as we have seen, which requires the engineers to continuously refine the calculations and develop the formal mathematical representations -- all for your benefit.

How Does This Apply to my Spam?

So, we've endured lots of math. But now, let's get back to our main topic and see how this math applies to your spam.

There are two traditional approaches to spam. The first looks for keywords in a message and flags any mail containing those words as spam. This has a major drawback. What if your Aunt Emma happens to mention as an aside, in a very important email about a family gathering taking place in a few days, that your uncle had an opportunity to take Viagra? The mail will be flagged and deleted, causing you to miss the gathering -- or, in a business environment, to miss potential revenue.
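To make that brittleness concrete, here is a toy version of such a keyword filter. The keyword list and the message are made up, and real products use far longer lists, but the binary failure mode is the same.

# A toy keyword filter of the kind described above. One risky word in
# a perfectly legitimate family email is enough to flag the message.
SPAM_KEYWORDS = {"viagra", "mortgage", "lottery"}

def is_spam(message):
    words = set(message.lower().split())
    return bool(words & SPAM_KEYWORDS)

family_mail = "aunt emma says uncle had an opportunity to take viagra"
print(is_spam(family_mail))  # True -- and the gathering details go with it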

Of course, systems that rely on such keywords are continuously updated and refined. Nevertheless, they are never entirely satisfying, even when using sophisticated Bayesian filters that are essentially weighted keyword systems.

The other traditional approach is to look at the sender and not accept any message from any known junk-mail sender. However, this is even less likely to work since junk mailers keep changing their addresses. Some people have proposed that you only accept mail from senders in your address book, but for obvious reasons, this isn't realistic.

That's why latent semantic analysis is much better. It doesn't make a binary decision based on any single characteristic of a message; it analyzes the meaning of the words and acts accordingly.

And to make this work even better, you can add your own rules to Mail.app to shape its behavior.


Figure 1. A spam message flagged by Mail.

Why Make it Trainable Then?

A common question about the spam filter in Mail.app is why the Apple engineers decided to make it trainable. After all, if it truly understood the meaning of a mail, it would immediately see what's junk and what's not, right?

Well, not exactly. Let's imagine that you, like most Mac users, are constantly receiving spam about mortgage opportunities. Mail would naturally flag them as junk. But what if you were in the market for a house and had requested quotes from legitimate companies? This is when the ability to train Mail comes to the rescue. You may want to alter the rules while you shop for a mortgage.
