Teamed filters catch more spam TRN 082201

By Chhavi Sachdev, Technology Research News

How much unsolicited email do you find cluttering your inbox every morning? Even when Internet service providers block junk email, spam creeps in, disguised, for example, as innocuous messages from people who seem to have only first names. At the same time, spam filters sometimes block legitimate messages.

A group of researchers in Greece has come up with a method that could solve both problems.

Most spam filters work by doing two things: they block known spammers who have been blacklisted, and they follow general rules, such as blocking messages that contain the word ‘adult’ in the subject header, said Ion Androutsopoulos, a research fellow at the Demokritos National Center for Scientific Research (NCSR) in Greece.

But spammers frequently forge email addresses to get around the blacklists, and those filters that use keyword-specific blocking might also nix that funny anecdote about your brother’s kids that contains the word ‘nude.’
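To make the weakness concrete, here is a minimal sketch of the blacklist-plus-keyword approach described above. The addresses and word list are made up for illustration and are not taken from any real filter.

```python
# Hypothetical blacklist and keyword list, for illustration only.
BLACKLIST = {"offers@spam.example"}
BLOCKED_SUBJECT_WORDS = {"adult", "nude"}


def rule_based_is_spam(sender: str, subject: str) -> bool:
    """Flag a message if the sender is blacklisted or the subject has a blocked word."""
    if sender.lower() in BLACKLIST:
        return True
    # Strip simple punctuation so "nude." still matches "nude".
    words = {w.strip(".,!?'\"").lower() for w in subject.split()}
    return bool(words & BLOCKED_SUBJECT_WORDS)


# A forged sender address slips past the blacklist...
forged_spam_blocked = rule_based_is_spam("forged@elsewhere.example", "Hello there")
# ...while a harmless family anecdote is blocked by the keyword rule.
family_mail_blocked = rule_based_is_spam(
    "sister@family.example", "The kids ran around nude in the sprinkler"
)
```

In this toy run the forged spam gets through and the family message is blocked, which are the two failure modes the researchers set out to address.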

The NCSR spam program creates custom filters for each user that learn what is spam and what is not, said Androutsopoulos. The filters learn to tell the two apart by looking through a user’s legitimate email and comparing it with lots of spam collected by the researchers, he said.

The key to the process is using several filters that work together. The researchers found that they could bolster accuracy by combining filters based on different learning algorithms that individually made different types of errors, said Androutsopoulos.

The program analyzes the user's existing mail with natural language processing algorithms to build the set of anti-spam filters, said Androutsopoulos. It estimates how likely particular words are to appear in spam versus legitimate messages, then classifies each incoming message by comparing it with previously analyzed email.
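The article does not say which learning algorithms the researchers used. As one possible illustration of a filter built on word probabilities, the sketch below assumes a naive Bayes classifier with add-one smoothing; the class name, the whitespace tokenizer and the tiny training sets are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class WordProbabilityFilter:
    """Naive Bayes sketch (an assumption; the article names no algorithm)."""

    def __init__(self, spam_messages, legit_messages):
        self.counts = {
            "spam": Counter(w for m in spam_messages for w in tokenize(m)),
            "legit": Counter(w for m in legit_messages for w in tokenize(m)),
        }
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        self.vocab = set(self.counts["spam"]) | set(self.counts["legit"])
        n_spam, n_legit = len(spam_messages), len(legit_messages)
        self.log_prior = {
            "spam": math.log(n_spam / (n_spam + n_legit)),
            "legit": math.log(n_legit / (n_spam + n_legit)),
        }

    def log_score(self, text, label):
        # Add-one smoothing keeps unseen words from zeroing out the score.
        denom = self.totals[label] + len(self.vocab)
        score = self.log_prior[label]
        for word in tokenize(text):
            score += math.log((self.counts[label][word] + 1) / denom)
        return score

    def spam_probability(self, text):
        spam = self.log_score(text, "spam")
        legit = self.log_score(text, "legit")
        # Subtract the max before exponentiating to avoid underflow.
        top = max(spam, legit)
        p_spam, p_legit = math.exp(spam - top), math.exp(legit - top)
        return p_spam / (p_spam + p_legit)


# Hypothetical training mail: shared spam plus one user's legitimate messages.
word_filter = WordProbabilityFilter(
    spam_messages=["win free money now", "free offer click now"],
    legit_messages=["lunch meeting moved to noon", "notes from the project meeting"],
)
```

With this toy data, `word_filter.spam_probability("free money offer")` comes out above 0.5 and `word_filter.spam_probability("project meeting at noon")` below it. A real filter would need far more training mail and a better tokenizer.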

“The individual filters are treated as members of a committee presided [over] by a higher-level classifier, which is trained to learn when to trust each of the members,” said Androutsopoulos. When a new message arrives, the committee members cast their votes on whether the message is spam. “The president of the committee then makes the final decision by taking into consideration the opinions of the members, the message itself, and its previous experience regarding when to trust each member,” he explained.
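The committee idea can be sketched in a few lines. In this simplified version, two hand-written member filters stand in for the learned filters, and the president is a small perceptron trained only on the members' votes; the article's president also looks at the message itself, which this sketch leaves out. All names and training messages are hypothetical.

```python
def keyword_member(text):
    """Votes spam (1) whenever the word 'free' appears; error-prone on purpose."""
    return int("free" in text.lower())


def shouting_member(text):
    """Votes spam (1) when most letters are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0
    return int(sum(c.isupper() for c in letters) / len(letters) > 0.5)


class President:
    """Perceptron over member votes: learns how much weight each member deserves."""

    def __init__(self, members):
        self.members = list(members)
        self.weights = [0.0] * (len(self.members) + 1)  # last weight is the bias

    def _features(self, text):
        # Map each vote to -1/+1 and append a constant bias input.
        return [2 * member(text) - 1 for member in self.members] + [1]

    def _score(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def train(self, labeled_messages, epochs=20):
        for _ in range(epochs):
            for text, is_spam in labeled_messages:
                features = self._features(text)
                predicted = int(self._score(features) > 0)
                error = is_spam - predicted  # -1, 0 or +1
                for i, x in enumerate(features):
                    self.weights[i] += error * x

    def is_spam(self, text):
        return self._score(self._features(text)) > 0


# Hypothetical labeled mail (1 = spam, 0 = legitimate).
training_mail = [
    ("FREE MONEY NOW", 1),
    ("BUY CHEAP WATCHES", 1),
    ("Are you free for lunch?", 0),
    ("Meeting notes attached", 0),
]
president = President([keyword_member, shouting_member])
president.train(training_mail)
```

On this toy data the president ends up giving the shouting member more weight than the keyword member, so a message like "free tickets for the kids" is let through even though one member votes against it.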

The stacked spam filter is more accurate than keyword-based spam filters, Androutsopoulos said. It correctly identifies about 90 percent of junk email, and mistakes a legitimate message for spam about 1 percent of the time, he said. The accuracy could be increased further by returning messages classified as spam to their senders with a request to use a different address, he said. If the email is legitimate, the originator can send it again to a different, unfiltered address.
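The two figures measure different things: the first is the share of spam that gets caught, the second the share of legitimate mail that gets wrongly blocked. A short sketch of how both rates could be computed from a labeled test set follows; the function name and the message counts are invented to mirror the reported percentages and are not the researchers' data.

```python
def filter_rates(true_labels, predicted_labels):
    """Return (share of spam caught, share of legitimate mail wrongly blocked).

    Labels are booleans, True meaning spam.
    """
    pairs = list(zip(true_labels, predicted_labels))
    spam_total = sum(1 for truth, _ in pairs if truth)
    legit_total = len(pairs) - spam_total
    if spam_total == 0 or legit_total == 0:
        raise ValueError("need at least one spam and one legitimate message")
    spam_caught = sum(1 for truth, guess in pairs if truth and guess)
    legit_blocked = sum(1 for truth, guess in pairs if not truth and guess)
    return spam_caught / spam_total, legit_blocked / legit_total


# Invented test set: 100 spam messages (90 flagged), 200 legitimate (2 flagged).
truth = [True] * 100 + [False] * 200
guesses = [True] * 90 + [False] * 10 + [True] * 2 + [False] * 198
caught_rate, blocked_rate = filter_rates(truth, guesses)
```

Here `caught_rate` works out to 0.9 and `blocked_rate` to 0.01, matching the roughly 90 percent and 1 percent figures quoted above.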

“Training the filter takes a few minutes per user, depending on the number of training messages. Classifying an incoming message is almost instantaneous,” said Androutsopoulos. When the filter is configured separately for each user, it could be installed either on the end user’s desktop or on the ISP’s server. “In the latter case, the ISP would run the user's filter on behalf of the user before downloading the messages to [a] desktop, saving bandwidth wasted by spam messages,” he said.

The same configuration of the filter can be applied to all users on a network, said Androutsopoulos, “but I would expect the accuracy of the filter to be worse than when using filters especially configured for each user.” The training time would also go up because a network-wide filter would need more training messages, he said.

Better spam filters are definitely needed, said Ben Gross, a visiting scholar at Berkeley and a coordinator of the Digital Libraries Initiative Phase Two for the National Science Foundation (NSF). “Spam remains a nearly intractable problem for most users [and] better Natural Language Processing techniques for spam could certainly improve the current state of technology,” he said.

An important variable the researchers did not discuss, which may bear on the scheme's use in large networks, is time. “For a system to be viable for large scale deployment with email it must be highly efficient,” said Gross. Still, if a spam filter’s performance were to prove inadequate, it could be deployed at the users’ desktops, he said.

The stacked spam filter could be used by firewall makers, listserv moderators, newsgroups, ISPs and individual users, said Androutsopoulos. It will be ready for such use within a year, he said.

The researchers’ next step is to evaluate more thoroughly how the filters work and to improve the system’s learning algorithms, according to Androutsopoulos. The researchers would also like to shorten the system’s training period, he said.

Androutsopoulos’s research colleagues were George Sakkis and Panagiotis Stamatopoulos at the University of Athens and Georgios Paliouras, Vangelis Karkaletsis, and Constantine D. Spyropoulos at the Demokritos National Center for Scientific Research. They presented the research at the 6th Conference on Empirical Methods in Natural Language Processing, held in Pittsburgh, PA on June 3 and 4, 2001. The research was funded by the universities.

Timeline: < 1 year
Funding: University
TRN Categories: Natural Language Processing; Internet
Story Type: News
Related Elements: Technical paper, "Stacking Classifiers for Anti-Spam Filtering of E-mail," in the Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), held in Pittsburgh, PA on June 3 and 4, 2001.

August 22/29, 2001
