Summarizer ranks sentences
By Kimberly Patch, Technology Research News
April 20/27, 2005
Because
computers don't understand the meanings of words and sentences, automating
the seemingly simple task of summarizing a news story using several sources
is a major computer science challenge.
Key to meeting the challenge is finding a way to identify the most
important sentences from a set of documents on the same subject.
Researchers from the University of Michigan have developed a multi-document
summarization technique that compares sentences and, in effect, lets the sentences
vote for the most important among them.
The method, dubbed LexRank, combines the content-sorting concepts
of prestige and lexical similarity to find the most important sentences
in a group of documents on the same subject.
Algorithms that use prestige to sort information have been around
since the '90s. It is possible to find the most prestigious, or popular,
member of a network by analyzing the relationships among network members.
In a social network, for example, the most prestigious individual can be
identified by analyzing the social relations among all pairs of members
of the group.
The PageRank algorithm that powers Google takes advantage of this
concept. It assigns a Web site a prestige score based on the number of other
sites that point to it and the prestige of those sites. This works to rank
Web pages on a network because the pages are connected. In the case of multi-document
summarization, however, such hyperlinks are not available, said Dragomir
Radev, an assistant professor of information, electrical engineering and
computer science, and linguistics at the University of Michigan.
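As a rough illustration of how such prestige scores can be computed, the sketch below runs a PageRank-style power iteration over a made-up four-page link graph. It is a toy example, not Google's implementation; the damping factor, iteration count and page names are illustrative assumptions.

```python
# Toy sketch of PageRank-style prestige (illustrative only): a page's score
# depends on how many pages point to it and on the scores of those pages.
def prestige(links, damping=0.85, iterations=50):
    pages = list(links)
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_score = {}
        for p in pages:
            # Prestige flows in from every page q that links to p, in
            # proportion to q's own prestige and its number of out-links.
            incoming = sum(score[q] / len(links[q]) for q in pages if p in links[q])
            new_score[p] = (1 - damping) / len(pages) + damping * incoming
        score = new_score
    return score

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
print(prestige(links))  # "C", pointed to by the most pages, gets the highest score
```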
Instead, the algorithm uses the similarities among sentences.
The researchers' lexical centrality algorithm compares the lexical
similarity of sentences. "Lexical similarity can be thought of as a measure
of the word overlap between two sentences," said Radev. "For example, 'Bush
went to China' and 'George Bush visited China' are fairly similar in a lexical
way [but] 'Bush visited China' and 'Blair is the prime minister of the United
Kingdom' have no overlap at all," he said.
The algorithm lets the researcher pick a threshold that marks the point
at which two sentences count as similar.
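A bare-bones version of that word-overlap measure, plus a similarity cutoff, might look like the following sketch. The bag-of-words tokenization and the threshold value of 0.1 are assumptions made for illustration, not the researchers' exact settings.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a, sentence_b):
    # Treat each sentence as a bag of words and measure their overlap.
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

print(cosine_similarity("Bush went to China", "George Bush visited China"))    # about 0.5
print(cosine_similarity("Bush visited China",
                        "Blair is the prime minister of the United Kingdom"))  # 0.0

# Two sentences count as "similar" only if they clear a chosen threshold.
THRESHOLD = 0.1  # illustrative value, not the paper's setting
print(cosine_similarity("Bush went to China", "George Bush visited China") > THRESHOLD)
```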
There are many possible factors that can be used to assess the lexical
similarity of a pair of sentences, said Radev. "We chose to weigh the contribution
of each word... by its relative informativeness," he said. "Rare words like
'Igor', 'Taha' and 'disarmament' are more informative than common ones like
'today', 'between', and 'November'."
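One plausible way to fold that weighting into the similarity measure is an idf-modified cosine, in which rare words contribute more than common ones. The sketch below is an approximation of that idea; the function names and the exact idf formula are assumptions, not the paper's verbatim definition.

```python
import math
from collections import Counter

def idf_weights(sentences):
    # Words that appear in few sentences get high weight; ubiquitous words
    # like 'today' or 'between' get low weight.
    n = len(sentences)
    doc_freq = Counter(w for s in sentences for w in set(s.lower().split()))
    return {w: math.log(n / doc_freq[w]) + 1.0 for w in doc_freq}

def idf_cosine(sentence_a, sentence_b, idf):
    # Cosine similarity with each word's contribution scaled by its idf weight.
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] * idf.get(w, 1.0) ** 2 for w in a if w in b)
    norm_a = math.sqrt(sum((tf * idf.get(w, 1.0)) ** 2 for w, tf in a.items()))
    norm_b = math.sqrt(sum((tf * idf.get(w, 1.0)) ** 2 for w, tf in b.items()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```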
The researchers' system considers a sentence important if it is
similar to many other sentences and if those other sentences are themselves
important. "In a sense, sentences vote for each other just by virtue of
being similar to each other," said Radev. "The sentences with the highest
scores... are considered to contain the gist of the document and are presented
as the multi-document summary," he said.
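The voting can be pictured as a PageRank-style iteration over a graph whose nodes are sentences and whose edges connect sentences that clear the similarity threshold. The function below is a simplified reading of that idea, not the published algorithm verbatim; the parameter defaults are illustrative.

```python
def lexrank_scores(sentences, similarity, threshold=0.1, damping=0.85, iterations=50):
    # Connect sentences whose similarity clears the threshold, then let scores
    # flow between neighbors until they settle: similar sentences "vote" for
    # each other, and votes from important sentences count for more.
    n = len(sentences)
    neighbors = [
        [j for j in range(n)
         if j != i and similarity(sentences[i], sentences[j]) > threshold]
        for i in range(n)
    ]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            for i in range(n)
        ]
    return scores
```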
In contrast, the state-of-the-art method, dubbed Centroid, calculates
a pseudo-sentence that is the average of all the sentences in a set of documents,
then measures how similar each sentence is to this "centroid" sentence.
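For comparison, that Centroid idea can be sketched as follows: average the weighted word vectors of all sentences into a pseudo-sentence, then score each real sentence by its similarity to that average. This is again a simplified sketch under assumptions, reusing the idf weighting from the earlier example rather than the actual Centroid system's features.

```python
import math
from collections import Counter

def centroid_scores(sentences, idf):
    # Build each sentence's tf*idf vector, then average them into a
    # pseudo-sentence (the "centroid") for the whole document set.
    vectors = []
    for s in sentences:
        counts = Counter(s.lower().split())
        vectors.append({w: tf * idf.get(w, 1.0) for w, tf in counts.items()})
    centroid = Counter()
    for vec in vectors:
        for w, weight in vec.items():
            centroid[w] += weight / len(sentences)

    def cosine(vec_a, vec_b):
        dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
        norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
        norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Each sentence is scored by how close it sits to the centroid.
    return [cosine(vec, centroid) for vec in vectors]
```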
The researchers have applied the method to a prototype of their
news clustering Web site. "For each cluster of related stories, we compute
the pairwise similarity between all sentence pairs, then apply the [lexical]
centrality algorithm," to parse out the important sentences, which become
the summary of the document cluster, said Radev.
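Strung together on a small made-up cluster, the earlier sketches would be used roughly like this; the sentences and the top-two cutoff are invented for illustration and stand in for a real cluster of related stories.

```python
cluster = [
    "Bush went to China",
    "George Bush visited China this week",
    "The visit to China focused on trade",
    "Blair is the prime minister of the United Kingdom",
]
idf = idf_weights(cluster)
scores = lexrank_scores(cluster, lambda a, b: idf_cosine(a, b, idf), threshold=0.1)

# The highest-scoring sentences stand in for the cluster's summary.
ranked = sorted(zip(scores, cluster), reverse=True)
print([sentence for _, sentence in ranked[:2]])
```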
The most important realization in doing the research was that the
patterns of language in the multi-document summarization task are similar
to seemingly unrelated natural phenomena such as the patterns of links among
Web pages, social interactions and electrical components, said Radev.
The researchers are planning to incorporate the method into their
NewsInEssence Web site, which crawls the Web for news stories, clusters
them into topical groups, and summarizes each group.
The researchers are also looking for other uses of the lexical centrality
algorithm. Possibilities include automatic translation and question answering,
said Radev. The method could potentially find sentences that are likeliest
to contain the answer to a given natural language question, or, in the biomedical
domain, sentences that are most likely to contain important facts like particular
protein interactions, said Radev.
The researchers' experiments show that the method has the potential
to yield summaries as good as those of state-of-the-art summarization systems,
said Lillian Lee, an associate professor of computer science at Cornell
University. "This general field of investigation is one that seems very
promising," she said.
Radev's research colleague was Gunes Erkan. The researchers published
the work in the July 2004 to January 2005 issue of the Journal of Artificial
Intelligence Research (JAIR). The research was funded by the National Science
Foundation (NSF).
Timeline: 6-18 months
Funding: Government
TRN Categories: Natural Language Processing; Databases and
Information Retrieval; Internet
Story Type: News
Related Elements: Technical paper, "LexRank: Graph-based lexical centrality
as salience in text summarization," Journal of Artificial Intelligence Research
(JAIR), July 2004-January 2005, www.jair.org/abstracts/erkan04a.html;
Demo tangra.si.umich.edu/clair/lexrank/