Webs within Web boost searches

By Kimberly Patch, Technology Research News

Internet search engines regularly use the text contained in pages and the links between pages to return relevant search results. The approach works reasonably well, but less is known about why the relationships between text, links and relevance exist.

A researcher from the University of Iowa has expanded the utility of using text and links in search engines with a mathematical model that divides a large network like the Internet into small local Webs.

A Web crawler designed to completely traverse a small Web will provide more comprehensive coverage of a topic than typical search engines, according to Filippo Menczer, an assistant professor of management sciences at the University of Iowa. "My result shows that it is possible to design efficient Web crawling algorithms -- crawlers that can quickly locate any related page among the billions of unrelated pages in the Web," he said.

Menczer's earlier work showed how similarities in pages' text related to the Web's link structure.

His latest work has expanded the concept by looking at a large number of pairs of pages from the entire Web and studying the relationships between three measures of similarity -- text, links and meaning -- across those pages. "A better understanding of the relationships between the cues available to us -- such as words and links -- about the meaning of Web pages is essential in designing better ranking and crawling algorithms, which determine how well a search engine works," Menczer said.
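As a rough illustration of what pairwise similarity measures can look like, the sketch below computes a text similarity and a link similarity for two pages. The article does not give the formulas used in the research; cosine similarity over word counts and Jaccard overlap of link sets are common choices and are assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def text_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw word-count vectors (an assumed measure)."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_similarity(links_a: set, links_b: set) -> float:
    """Jaccard overlap between two pages' sets of neighboring URLs (assumed)."""
    union = links_a | links_b
    if not union:
        return 0.0
    return len(links_a & links_b) / len(union)
```

Both functions return a value between 0 and 1, so the two cues can be compared on the same scale across many pairs of pages. A measure of similarity in meaning would need category information, such as a page's place in a directory, and is left out of this sketch.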

The brute force approach gave Menczer enough data to uncover power-law relationships between textual content and Web page popularity and between semantic, or categorical, distance and Web page popularity. "From a sample of 150,000 pages taken from all top-level categories in the Open Directory, I considered every possible pair of pages, resulting in almost 4 billion pairs," said Menczer. The pattern would have been difficult to notice with smaller or nonrandom samples, he said.

Menczer used the data in a mathematical model that predicts Web growth, and showed that the model accurately predicted the way links are distributed on the Web. "The Web growth model based on local content predicts the link... distribution," he said.

The model is based on the idea that Web page authors link to the most popular or important pages in their subject areas, said Menczer. The question is how they do this practically without a global knowledge of page popularity. Many existing models simply assume that a Web page author has knowledge of every Web site.

Menczer's model uses local content as a way to determine the probable distribution of links in a network. "In this sense the new model is more realistic because it is based on behavior that matches our intuition of what authors do," he said.

The model is relatively simple, Menczer said. "When you look at a new page, you link it to related pages which you know about with probability proportional to their... popularity," he said. The probability of a link between two given pages decreases as the text similarity between them decreases, he said. The relationship between the probability of a link between pages and their text similarity follows a power law, meaning the link probability falls off as a power of how dissimilar the pages' text is.
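The growth rule described above can be sketched in a few lines of code. This is a toy simulation under stated assumptions, not the model from the paper: each page's text is reduced to a single random number, "popularity" is taken to be the count of incoming links, and the parameters r_star (the distance within which pages count as related), alpha (the power-law exponent) and links_per_page are hypothetical values chosen for illustration.

```python
import random


def grow_web(n_pages, links_per_page=2, r_star=0.1, alpha=2.0, seed=0):
    """Grow a toy web one page at a time using only local content cues."""
    rng = random.Random(seed)
    content = []    # assumed 1-D stand-in for each page's text
    in_degree = []  # popularity proxy: number of incoming links
    edges = []      # (new_page, target_page) pairs
    for new in range(n_pages):
        x = rng.random()
        if new > 0:
            weights = []
            for old in range(new):
                r = abs(x - content[old])  # content distance to an older page
                # Related pages get full weight; beyond r_star the weight
                # decays as a power law in the distance.
                closeness = 1.0 if r <= r_star else (r / r_star) ** -alpha
                # The +1 lets pages with no links yet still be chosen.
                weights.append((in_degree[old] + 1) * closeness)
            k = min(links_per_page, new)
            targets = set()
            while len(targets) < k:
                targets.add(rng.choices(range(new), weights=weights)[0])
            for t in sorted(targets):
                edges.append((new, t))
                in_degree[t] += 1
        content.append(x)
        in_degree.append(0)
    return content, in_degree, edges
```

Calling grow_web(1000) and tallying in_degree gives a link distribution that can be inspected for a heavy tail. The point of the sketch is that each new page needs to know only about pages with similar content, not the popularity of every page on the Web.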

The model, based on local knowledge, sees the Web as clusters of smaller webs of sites with similar topics. This bodes well for search engine developers, who can design Web crawlers to use textual and categorical cues to completely traverse a small Web in order to provide comprehensive coverage on a certain topic, according to Menczer.
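A crawler that uses textual cues to stay inside one topical cluster might look like the following sketch. It runs over a small in-memory dictionary rather than the live Web, and everything in it is an assumption made for illustration: the TOY_WEB data, the topical_crawl function, and the scoring rule (the fraction of a page's words that appear in a topic word list) are hypothetical, not taken from Menczer's crawlers.

```python
import heapq

# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("web crawler search", ["b", "c"]),
    "b": ("crawler index search engine", ["d"]),
    "c": ("cooking pasta recipes", ["e"]),
    "d": ("search ranking web", []),
    "e": ("baking bread", []),
}


def topical_crawl(web, seed_url, topic_terms, max_pages=10):
    """Best-first crawl that follows links only out of on-topic pages."""

    def score(text):
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in topic_terms for w in words) / len(words)

    frontier = [(-1.0, seed_url)]  # min-heap of (negated priority, url)
    seen = {seed_url}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        page_score = score(text)
        visited.append((url, page_score))
        if page_score == 0.0 and url != seed_url:
            continue  # off-topic page: do not expand its links
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-page_score, link))
    return visited
```

With the topic words {"web", "crawler", "search", "ranking", "index", "engine"} and seed "a", the crawl fetches the cooking page "c" because an on-topic page links to it, finds it off-topic, and never reaches "e". The design choice is to spend fetches on links found on relevant pages, which is one way to cover a small topical web without wandering across the rest of the network.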

The research should allow for improved ranking and crawling algorithms and more scalable search engines "where most pages of interest to a community of users can be located, indexed, and the semantic needs of users can be mapped into algorithms to distill the most related pages," Menczer said.

Menczer's research group is designing and evaluating topical Web crawlers, Menczer said. In addition, "we have some ideas on how to induce natural collaborative activities in communities of users that can emerge spontaneously in peer networks," he said. "Such activities will provide crawlers and indexers with rich contexts to improve their performance," he added.

Some progress in crawling and ranking is possible within a few years, but a full understanding of the complex inter-relationships between all sorts of information available on the Web will take longer to map out, he said.

Menczer is working on visual maps that will allow for a better interpretation of the relationships between text, links and the meaning of Web pages.

The work is useful and novel, said Shlomo Havlin, a physics professor at Bar-Ilan University in Israel. "It extends previous work on networks to [quantify] correlations between neighboring nodes. Such correlations have been found in realistic social and computer networks," he said.

The research adds to network models information that could improve researchers' understanding of aspects of networks like stability and immunization against software viruses, Havlin said. "This work extends the general body of research to include realistic features," he said.

Menczer published the research in the October 7, 2002 issue of Proceedings of the National Academy of Sciences. The research was funded by the National Science Foundation (NSF).

Timeline: > 3 years
Funding: Government
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Growing and Navigating the Small World Web by Local Content," Proceedings of the National Academy of Sciences, October 7, 2002.

November 13/20, 2002
