Summarizer gets the idea

By Kimberly Patch, Technology Research News

The flow of a document, including the topics covered and the ways those topics relate to each other, is clear to people. It would be useful if computer systems that process documents -- like search engines and programs that generate summaries of news articles -- could also learn to consider topic information.

Teaching a computer to discern a document's topics and create a summary that puts the topics in the correct order is a bit like teaching it how to put together the pieces of a jigsaw puzzle. Current methods focus on finding the right match for a given piece.

Researchers from the Massachusetts Institute of Technology and Cornell University have developed a system that does the equivalent of putting pieces that show parts of a mountain and pieces that show parts of the sky into separate groups, and putting the sky pieces above the mountain pieces, said Lillian Lee, an associate professor of computer science at Cornell University.

The researchers' automatic classification algorithm, or content model, is trained on subject-specific sets of documents and document summaries. It can then extract the topic structure of a group of related documents. The system selects and orders topics to generate a summary.

The researchers put together a prototype system that can automatically create capsule summaries of, for example, movies from a movie information database. Once the content model is trained on movie reviews, the system can determine appropriate ways to present the information, said Lee.

The content model could eventually be used to make search engines more precise, said Lee. Today's search engines "don't take the internal topic structure into account in any but a very coarse way," she said.

The researchers' system would allow a search engine to determine the overall topic and domain of discourse of a Web page, call up the appropriate content model to analyze the page's topic structure, and then return only on-topic pages, said Lee. It could also allow a search engine to present the user with just those parts of a document that were relevant to a query, she said.

The researchers' content model algorithm is based on the hidden Markov model, a method commonly used to delineate words in speech recognition programs and genes in computational biology.

A set of movie reviews, for example, usually contains several common topics: director, plot, actors, previous movies by the same director, and the reviewer's opinion of the movie, said Lee. The reviewer chooses an order in which to present some or all of the topics, she said. For example, the reviewer might begin by giving an overall opinion about the plot before discussing the director.

The hidden Markov model can specify mathematically that a likely sequence of topics within a review is opinion/plot/director/director's previous films/opinion rather than actors/opinion/director's previous films/director/actors/plot, said Lee. There are also techniques that allow systems to automatically learn the relevant probabilities just from examining samples of sequences, she said.
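The topic-ordering idea can be illustrated with a toy first-order Markov model over topics. The transition probabilities below are purely hypothetical, chosen to show how a model trained on reviews could score the sequence Lee describes as likely above the one she describes as unlikely:

```python
# Hypothetical transition probabilities P(next topic | current topic)
# for movie-review topics; "START" marks the beginning of a review.
TRANSITIONS = {
    ("START", "opinion"): 0.6,
    ("opinion", "plot"): 0.5,
    ("plot", "director"): 0.4,
    ("director", "prev_films"): 0.7,
    ("prev_films", "opinion"): 0.3,
    ("START", "actors"): 0.1,
    ("actors", "opinion"): 0.2,
    ("opinion", "prev_films"): 0.1,
    ("prev_films", "director"): 0.1,
    ("director", "actors"): 0.1,
    ("actors", "plot"): 0.1,
}

def sequence_probability(topics):
    """Multiply transition probabilities along the topic sequence."""
    prob = 1.0
    prev = "START"
    for topic in topics:
        # Small floor for transitions never seen in training.
        prob *= TRANSITIONS.get((prev, topic), 0.01)
        prev = topic
    return prob

likely = ["opinion", "plot", "director", "prev_films", "opinion"]
unlikely = ["actors", "opinion", "prev_films", "director", "actors", "plot"]
print(sequence_probability(likely) > sequence_probability(unlikely))  # True
```

In practice the probabilities are not hand-written as here; as Lee notes, they are learned automatically from sample sequences.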

The researchers adapted standard hidden Markov model techniques in several ways, said Lee. "We did not want to specify the set of topics ahead of time, but rather wanted the system to automatically decide on a set of topics itself," she said. The system clusters sentences that have similar patterns, then treats the clusters as representations of different topics.
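A rough illustration of the clustering step, not the researchers' actual algorithm: sentences with overlapping word patterns are grouped together, and each resulting cluster stands in for a topic. Here a simple greedy pass with word-overlap (Jaccard) similarity does the grouping; the sentences and the threshold are made up:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def cluster_sentences(sentences, threshold=0.3):
    """Greedily assign each sentence to the first cluster it resembles,
    or start a new cluster if none is similar enough."""
    clusters = []
    for sent in sentences:
        for cluster in clusters:
            if jaccard(sent, cluster[0]) >= threshold:
                cluster.append(sent)
                break
        else:
            clusters.append([sent])
    return clusters

sentences = [
    "The director builds tension in every scene",
    "The director builds suspense in every scene",
    "The plot follows a detective hunting a killer",
    "The plot follows a rookie detective on a case",
]
clusters = cluster_sentences(sentences)
print(len(clusters))  # 2: a "director" cluster and a "plot" cluster
```

Note that the clusters emerge from surface patterns alone; nothing tells the system that one group is "about the director," which is what lets it discover a topic set without the topics being specified in advance.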

This is useful because it is automatic and because computers can pick up subtle patterns in documents that humans are not consciously aware of.

The tricky part, however, is dealing with digression. "We humans understand the phenomenon of digression, [but] digressions can really confuse computers, [which] rely on statistical regularities," said Lee. "The computer sees complete chaos and doesn't understand the meta-pattern of off-topic commentary."

To deal with digression, the researchers incorporated a mathematical model of previously unseen topics.
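One way to absorb digressions in such a model, sketched here with illustrative state names and numbers rather than the paper's actual construction, is a catch-all state that every topic can transition into and out of, so an off-topic sentence does not derail the topic sequence:

```python
TOPICS = ["opinion", "plot", "director"]
ETC = "etcetera"  # catch-all state for off-topic material

def with_etcetera(transitions, leak=0.1):
    """Reserve probability mass `leak` from every topic for the
    catch-all state, and let that state return to any topic uniformly."""
    out = {}
    for (src, dst), p in transitions.items():
        out[(src, dst)] = p * (1 - leak)
    for src in TOPICS:
        out[(src, ETC)] = leak
        out[(ETC, src)] = 1 / len(TOPICS)
    return out

# Hypothetical topic-to-topic transition probabilities.
base = {
    ("opinion", "plot"): 0.7, ("opinion", "director"): 0.3,
    ("plot", "director"): 0.6, ("plot", "opinion"): 0.4,
    ("director", "opinion"): 0.8, ("director", "plot"): 0.2,
}
model = with_etcetera(base)

# Each topic's outgoing probabilities still sum to 1.
print(round(sum(p for (s, d), p in model.items() if s == "opinion"), 6))  # 1.0
```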

Modeling document content from a global perspective turned out to be an advantage, according to the researchers' tests. In one experiment, the method outperformed a state-of-the-art sentence-level method by 79 percent, according to Lee.

The method requires relatively formulaic domains and requires a sample of documents and corresponding summaries for training. "The domain of discourse needs to be formulaic enough for a computer to be able to find patterns of language use," said Lee. Fortunately, "many domains of interest to us have this property: for example, news articles about specific types of events tend to be written in rather stereotypical ways," she said.

It is possible to use the model to do capsule summaries in restricted domains now. Adapting the model to provide better search engine results could take 10 years, said Lee.

Lee's research colleague was Regina Barzilay from the Massachusetts Institute of Technology. The researchers presented the work at the North American Chapter of the Association for Computational Linguistics Human Language Technology (HLT/NAACL) 2004 conference in Boston, Massachusetts, May 2 to 7. The research was funded by the National Science Foundation (NSF) and the Alfred P. Sloan Foundation.

Timeline:   Now, > 10 years
Funding:   Government; Private
TRN Categories:  Natural Language Processing; Databases and Information Retrieval
Story Type:   News
Related Elements:  Technical paper, "Catching the Drift: Probabilistic Content Models, with Applications to Generation and Summarization," North American Chapter of the Association for Computational Linguistics Human Language Technology Conference (HLT/NAACL) 2004, May 2-7, Boston, Massachusetts


July 28/August 4, 2004


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.