Page layout drives Web search

By Kimberly Patch, Technology Research News

A typical Web search engine indexes the Web by crawling Web pages, extracting text and links, and using the information to construct a Web graph that reflects the relative importance of individual Web pages. The method relies heavily on analyzing links, or the way pages connect.

Researchers from the University of Chicago and Microsoft Research Asia have devised a system that analyzes content at the level of blocks of information on a page rather than at the coarser page level. This allows for a model of the relationships between Web pages that shows the intrinsic semantic structure of the Web, said Deng Cai, who was with Microsoft Research Asia and Tsinghua University in China when the research was done, but is now at the University of Illinois at Urbana-Champaign.

The research could eventually lead to more accurate search engines, according to Cai.

Link-based Web algorithms, including Google's PageRank, are based on a pair of assumptions, said Cai. First, that links convey human endorsement, meaning if page "A" links to page "B" and those two pages were authored by different people, the linking author found the other page valuable. Second, that if one page links to two other pages, those two pages are likely to contain related subject matter.

The assumptions don't hold in many cases, however, said Cai. A single page often contains sections, and hyperlinks in different sections of the page often point to pages that have different topics. Many links exist only for navigation and advertisement, for instance.

To correct this problem, search engines should analyze content in units smaller than pages, said Cai.

The researchers' prototype consists of a pair of search algorithms that work with their previously developed method of segmenting Web pages into topic-based blocks.

The researchers used their Vision-Based Page Segmentation algorithm to delineate the different parts of a Web page based on how a human views a page, said Cai. Pages are segmented by horizontal and vertical lines, and blocks of content are weighted by page position. Links from advertisements, for example, count for less than links from central content blocks.
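
The position-based weighting described above can be illustrated with a small sketch. It is not the researchers' code: the `Block` class, the `block_weights` function and the specific formula (block area divided by one plus the distance from the page center) are assumptions chosen only to show how a large central block can end up counting for more than a small ad off to the side.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Block:
    """A rectangular page block (hypothetical structure, in pixels)."""
    name: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def block_weights(blocks, page_width, page_height):
    """Return a weight per block, normalized to sum to 1 for the page.

    Assumed heuristic: bigger blocks nearer the page center weigh more.
    """
    cx, cy = page_width / 2, page_height / 2
    raw = {}
    for b in blocks:
        bx = b.x + b.width / 2
        by = b.y + b.height / 2
        distance = hypot(bx - cx, by - cy)
        # The +1 avoids division by zero for a perfectly centered block.
        raw[b.name] = (b.width * b.height) / (1.0 + distance)
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# A made-up 1000 x 800 page with a main column, a navigation bar and an ad.
blocks = [
    Block("main", 200, 100, 600, 600),
    Block("nav", 0, 100, 200, 600),
    Block("ad", 800, 100, 200, 200),
]
weights = block_weights(blocks, 1000, 800)
for name, weight in weights.items():
    print(f"{name}: {weight:.4f}")
```

On this made-up layout the main column receives nearly all of the weight and the ad the least, so a link in the ad would contribute little to any ranking built on these weights.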

In theory, other visual aspects of Web pages like background color and font could also be used to segment and weight blocks, according to Cai. Also, learning algorithms like neural networks could be trained for the task using examples chosen by people, he said.

The researchers' prototype ranks Web pages by extracting page-to-block and block-to-page relationships, then using the information to construct a page graph and a block graph. Page-to-block relationships are determined by analyzing the layout of a page, and block-to-page relationships are determined by the probability of a block linking to a given page.
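
One way to picture the two relationships is as matrices. In the sketch below, which is an illustration and not the researchers' implementation, `X` holds page-to-block weights from layout analysis and `Z` holds block-to-page link probabilities. Multiplying them in the two possible orders to get a page graph and a block graph is an assumption of this sketch, as are the tiny three-page, four-block example and all function names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def block_to_page(links, n_pages):
    """Build Z: row b gives the probability that block b links to each page.

    links[b] lists the page indices that block b links to; each link from
    a block is assumed to be equally likely.
    """
    z = []
    for targets in links:
        row = [0.0] * n_pages
        for page in targets:
            row[page] += 1.0 / len(targets)
        z.append(row)
    return z


# X: page-to-block weights (rows are pages, columns are blocks).
# Page 0 has a main block (0.8) and an ad block (0.2); pages 1 and 2
# each consist of a single block.
X = [
    [0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Block 0 links to page 1, block 1 (the ad) to page 2,
# block 2 to pages 0 and 2, block 3 to page 0.
Z = block_to_page([[1], [2], [0, 2], [0]], n_pages=3)

page_graph = matmul(X, Z)    # 3 x 3: page-to-page weights
block_graph = matmul(Z, X)   # 4 x 4: block-to-block weights

for row in page_graph:
    print([round(v, 3) for v in row])
```

In this example page 0 passes most of its weight to page 1, which is linked from its main block, and much less to page 2, which is linked only from the ad.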

The information is fed to the researchers' link-analysis algorithms -- Block Level PageRank, and Block Level Hypertext-Induced Topic Selection (HITS) -- which assign an importance value to each page based on the type of blocks that link to it. "Based on these, we can build our search engine from the block level," rather than the coarser-grained page level, Cai said. This means doing block-level link analysis and a block-based Web search, he said.
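
To show how block weights can change a link-based ranking, the sketch below runs a standard PageRank power iteration twice on a made-up three-page Web: once with link weights that reflect block importance, and once with every link on a page counted equally. The graphs, the damping factor of 0.85 and the function name are assumptions for illustration; this is not the researchers' Block Level PageRank code.

```python
def pagerank(weights, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a weighted graph.

    weights[i][j] is the weight of the link from page i to page j.
    Rows are normalized on the fly; a page with no outgoing weight
    spreads its rank evenly over all pages.
    """
    n = len(weights)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - damping) / n] * n
        for i, row in enumerate(weights):
            total = sum(row)
            for j in range(n):
                share = row[j] / total if total > 0 else 1.0 / n
                new[j] += damping * rank[i] * share
        if sum(abs(a - b) for a, b in zip(new, rank)) < tol:
            return new
        rank = new
    return rank


# Hypothetical page graph with block-based weights: page 0 links to
# page 1 from its main content (0.8) and to page 2 from an ad (0.2).
block_weighted = [
    [0.0, 0.8, 0.2],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]

# The same links treated page-level, every link on a page counting equally.
page_level = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0],
]

block_rank = pagerank(block_weighted)
page_rank = pagerank(page_level)
print("block-weighted:", [round(r, 3) for r in block_rank])
print("page-level:    ", [round(r, 3) for r in page_rank])
```

With equal link weights, page 2 outranks page 1. Once the ad link is discounted, page 1 moves ahead, which is the kind of shift the block-level approach is meant to produce.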

The link analysis algorithms are able to extract the intrinsic semantic structure of the Web from this information, according to Cai.

This is in some ways similar to the World Wide Web Consortium's Semantic Web project, which aims to give search engines and other software the means to interpret Web page content. The block-level search technique does not provide the concrete semantic information that the Semantic Web promises, but also does not require widespread adoption of tags and other software to parse Web pages.

The approaches are different because "we try to extract the [semantic] structure of the Web automatically from the existing Web," said Cai.

The method also allowed the researchers to compute a BlockRank at the block level similar to a page-level PageRank.

In a comparison of their search algorithms with page-based versions of the PageRank and HITS algorithms using a standard information-retrieval research data set, the block-based algorithms performed better most of the time, according to Cai.

The block-level analysis of the Web could also lead to a better understanding of the network in general, said Cai.

In a practical search system, the block-level PageRank function would not burden the system because it can be calculated offline, said Cai.

The researchers are currently working to improve the page segmentation algorithm, and to construct Web graphs that more accurately reflect the semantic structure of the Web, said Cai. "We ultimately aim [to build] a better search engine," he said. The researchers previously used the technique to cluster similar Web page images.

The technique could be ready for commercial use in a general search engine within two years, said Cai.

Cai's research colleagues were Xiaofei He from the University of Chicago and Microsoft Research Asia, and Ji-Rong Wen and Wei-Ying Ma from Microsoft Research Asia. The researchers presented the work at the Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) 2004 conference in Sheffield, England, July 25-29. The research was funded by Microsoft Research Asia.

Timeline:   1-2 years
Funding:   Corporate
TRN Categories:  Internet; Databases and Information Retrieval
Story Type:   News
Related Elements:  Technical paper, "Block-level Link Analysis," presented at the Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR) 2004 conference in Sheffield, England, July 25-29


October 6/13, 2004

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.