Web pages cluster by content type
By Kimberly Patch, Technology Research News
January 16, 2002
What makes the Web so useful is the vast
amount of information it spans. And this is also what makes it so frustrating.
The challenge is indexing the Web in a way that allows people to find
information quickly and painlessly. Scientists struggling with this problem
have found that the Internet harbors more correlations among the types
of information it holds than was at first apparent.
Information retrieval methods have long counted on correlations between
word matches and meaning to find pages that are similar to each other.
A University of Iowa researcher has confirmed that there are also correlations
between link distance and content, and link distance and meaning. "If
two pages are separated by [only] a few links, then they are also similar
in content and in meaning," said Filippo Menczer, an assistant professor
of management sciences at the University of Iowa.
Untangling the correlations that exist among different aspects of the
Web could be one key to better organizing its vast reaches.
The idea is that there are many notions of distance on the Web, and studying
the relationships among these types of distance will provide cues to the
relationships among Web pages, said Menczer. "It's like using cues in
a physical environment. Suppose you are at a picnic in a park and you
have to find the apple pie with your eyes closed. When the smell gets stronger,
you know you're getting closer. So the strength of the smell signal is
correlated with physical distance," he said.
To verify the link-content correlation, he measured the similarity of the
words on many pairs of pages and the number of links that must be clicked
to get from one to the other. He also measured the link distances between
pages that human experts had determined were similar in meaning.
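The measurement can be illustrated with a small sketch. The article does not give Menczer's exact formulas, so the similarity measure and toy pages below are assumptions: textual similarity is taken as cosine similarity between bag-of-words vectors, and link distance as the shortest click path between pages.

```python
# Sketch: correlate textual similarity with link distance on a toy Web graph.
# Assumptions: cosine similarity over bag-of-words vectors and BFS shortest
# path stand in for whatever measures the study actually used.
import math
from collections import Counter, deque
from itertools import combinations

# Hypothetical pages: id -> (text, outgoing links)
pages = {
    "a": ("apple pie recipe baking", ["b", "c"]),
    "b": ("pie crust baking tips", ["a", "d"]),
    "c": ("quantum computing research", ["d"]),
    "d": ("computing hardware reviews", ["c"]),
}

def cosine(text1, text2):
    """Cosine similarity between bag-of-words vectors."""
    v1, v2 = Counter(text1.split()), Counter(text2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def link_distance(src, dst):
    """Shortest number of clicks from src to dst (breadth-first search)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in pages[node][1]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # unreachable

# Group textual similarity by link distance, as in the correlation study.
by_distance = {}
for p, q in combinations(pages, 2):
    d = link_distance(p, q)
    if d is not None:
        by_distance.setdefault(d, []).append(cosine(pages[p][0], pages[q][0]))

for d in sorted(by_distance):
    sims = by_distance[d]
    print(f"distance {d}: mean similarity {sum(sims) / len(sims):.2f}")
```

In a real study the averaging would run over millions of page pairs; the pattern to look for is the same, though: mean similarity falling as link distance grows.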
"My results show that links... tell us a lot about the content and meaning
of pages. This helps [us] understand why algorithms like Google's PageRank...
work so well. They use links to estimate the meaning of pages," he said.
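PageRank itself is well documented: a page's score is divided among its outgoing links, with a damping factor modeling a surfer who occasionally jumps to a random page. The following is a minimal power-iteration sketch on a toy graph, not Google's production algorithm.

```python
# Minimal power-iteration PageRank over a toy link graph.
# score(p) = (1 - d)/N + d * sum(score(q) / outdegree(q)) over pages q -> p
def pagerank(links, damping=0.85, iterations=50):
    n = len(links)
    scores = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in links}
        for p, outs in links.items():
            if outs:
                share = damping * scores[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page: spread its score evenly
                for q in links:
                    new[q] += damping * scores[p] / n
        scores = new
    return scores

toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(toy))  # "c" collects the most score: it is heavily linked-to
```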
The strength of the correlations among links, text and meaning was
surprising, said Menczer. "I found that beyond four or five links away,
the probability [of finding] a relevant page is reduced to random chance,"
he said.
Menczer also found that the results varied depending on the type of domain
he was measuring. "If you are browsing through Web sites of educational
institutions, the signals are significantly more reliable than if you
are surfing commercial sites," meaning the probability of finding a relevant
page drops faster when you click away from commercial sites, he said.
"In other words, you can get lost in cyberspace much faster when you're
shopping online than when you are browsing a class syllabus," he said.
Taken together with two other recent findings in Web structure, the results
could help build Web crawlers that do a better job of indexing, and cover
more of the Web.
The Web is a small-world network, meaning it has a regular topology of
pages clustered together, but also enough random links acting as tunnels
to reduce the average number of links between pages. This is the
reason for the six degrees of separation phenomenon, which is that any
person in the United States, or any Web page, can be reached from any
other by making no more than six successive connections among people who
know people, or among pages that are linked.
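This behavior is easy to reproduce with the standard Watts-Strogatz small-world model; the sketch below uses the networkx library and is an illustration, not part of Menczer's study. Rewiring even one percent of the links sharply shortens average paths, while the clustering that defines neighborhoods largely survives.

```python
# Sketch: Watts-Strogatz small-world model via networkx (pip install networkx).
# A few random "tunnel" links sharply shorten average paths in a clustered ring.
import networkx as nx

n, k = 500, 6  # 500 nodes, each tied to its 6 nearest ring neighbors
for p in (0.0, 0.01, 0.1):  # probability that a regular link is rewired randomly
    g = nx.connected_watts_strogatz_graph(n, k, p)
    print(f"p={p}: avg path {nx.average_shortest_path_length(g):.1f}, "
          f"clustering {nx.average_clustering(g):.2f}")
```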
At the same time, it has become clear that finding these short paths to
information is sometimes very difficult. The new correlations may help.
"The research I'm doing might shed light on this problem and help us understand
whether it is theoretically possible to build efficient Web crawlers --
agents that can find target pages in a reasonable time through local lexical
and link cues," said Menczer.
Measuring and documenting the relationships between the structure of the
Web and its content is clearly important, said Soumen Chakrabarti, an
assistant professor of computer science at the Indian Institute of Technology
in Bombay. "It has also been measured before, but not as systematically
as in Menczer's paper," he said.
"Menczer takes an important step of modeling the coupling formally" and
his model treats the link-content relation more deeply than past research
efforts, Chakrabarti added.
Menczer is working on Web crawlers that will take advantage of these topological
findings. "The crawlers that now build a search engine's index... do not
use knowledge about what the users are interested in," he said. Menczer's
prototype Web crawler, dubbed MySpiders, is designed to better harness
the clues in links and integrate them with information from Web page
content, he said.
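The article does not describe MySpiders' internals, but the best-first idea it gestures at, scoring unvisited links by how lexically close their source pages are to the user's query, can be sketched over a toy in-memory graph. The scoring function and pages here are illustrative assumptions.

```python
# Sketch: best-first (topic-driven) crawl over a toy in-memory link graph.
# Frontier links are prioritized by word overlap between page text and the
# query -- a stand-in for whatever lexical cues MySpiders actually uses.
import heapq

pages = {  # hypothetical pages: id -> (text, outgoing links)
    "start": ("course syllabus index", ["math", "shop"]),
    "math":  ("linear algebra syllabus lectures", ["notes"]),
    "shop":  ("buy shoes sale discount", ["cart"]),
    "notes": ("algebra lecture notes homework", []),
    "cart":  ("checkout payment shipping", []),
}

def score(text, query):
    """Fraction of query words appearing on the page (lexical cue)."""
    words = set(text.split())
    q = query.split()
    return sum(w in words for w in q) / len(q)

def best_first_crawl(seed, query, budget=4):
    visited, results = set(), []
    frontier = [(-score(pages[seed][0], query), seed)]  # max-heap via negation
    while frontier and len(visited) < budget:
        neg, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        results.append((page, -neg))
        for link in pages[page][1]:  # enqueue outlinks, best-scoring first
            if link not in visited:
                heapq.heappush(frontier, (-score(pages[link][0], query), link))
    return results

print(best_first_crawl("start", "algebra syllabus"))
# Visits math and notes before the shopping branch: lexical cues steer the crawl.
```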
This type of search engine could technically be ready for practical use
within one or two years, said Menczer. The research was funded by the
University of Iowa.
Timeline: 1-2 years
Funding: University
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Links Tell Us about
Lexical and Semantic Web Content," posted on the arXiv physics archive
at http://xxx.lanl.gov/abs/cs.IR/0108004. MySpiders Web crawler site:
myspiders.biz.uiowa.edu