Page layout drives Web search
By Kimberly Patch, Technology Research News
October 6/13, 2004
 A 
      typical Web search engine indexes the Web by crawling Web pages, extracting 
      text and links, and using the information to construct a Web graph that 
      reflects the relative importance of individual Web pages. The method relies 
      heavily on analyzing links, or the way pages connect.
 
 Researchers from the University of Chicago and Microsoft Research 
      Asia have devised a system that analyzes content at the level of blocks 
      of information on a page rather than at the coarser page level. This allows 
      for a model of the relationships between Web pages that shows the intrinsic 
      semantic structure of the Web, said Deng Cai, who was with Microsoft Research 
      Asia and Tsinghua University in China when the research was done, but is 
      now at the University of Illinois at Urbana-Champaign.
 
 The research could eventually lead to more accurate search engines, 
      according to Cai.
 
Link-based Web algorithms, including Google's PageRank, are based 
      on a pair of assumptions, said Cai. First, that links convey human endorsement: 
      if page "A" links to page "B" and the two pages were authored by different 
      people, the author of "A" found "B" valuable. Second, that if one page links 
      to two other pages, those two pages are likely to contain related subject matter.
 
 The assumptions don't hold in many cases, however, said Cai. A single 
      page often contains sections, and hyperlinks in different sections of the 
      page often point to pages that have different topics. Many links exist only 
      for navigation and advertisement, for instance.
 
 To correct this problem, search engines should analyze content in 
      units smaller than pages, said Cai.
 
 The researchers' prototype consists of a pair of search algorithms 
      that work with their previously developed method of segmenting Web pages 
      into topic-based blocks.
 
 The researchers used their Vision-Based Page Segmentation algorithm 
      to delineate the different parts of a Web page based on how a human views 
      a page, said Cai. Pages are segmented by horizontal and vertical lines, 
      and blocks of content are weighted by page position. Links from advertisements, 
      for example, count for less than links from central content blocks.
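
The article does not spell out the exact weighting formula, so the sketch below, with made-up block categories and weights, only illustrates the idea of discounting a link according to the layout block it sits in.

    # Hypothetical layout-block categories and importance weights. The actual
    # taxonomy and values are assumptions for illustration, not from the paper.
    BLOCK_WEIGHTS = {
        "main_content": 1.0,
        "navigation": 0.2,
        "advertisement": 0.05,
    }

    def weighted_out_links(page_blocks):
        """page_blocks: list of (block_category, [target_urls]) for one page.
        Returns each linked page with the layout-weighted strength of the link,
        so links from central content count for more than links from ads."""
        strengths = {}
        for category, targets in page_blocks:
            weight = BLOCK_WEIGHTS.get(category, 0.5)
            for url in targets:
                strengths[url] = strengths.get(url, 0.0) + weight
        return strengths

    example_page = [
        ("main_content", ["http://example.org/a", "http://example.org/b"]),
        ("advertisement", ["http://ads.example.net/x"]),
    ]
    print(weighted_out_links(example_page))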
 
 In theory, other visual aspects of Web pages like background color 
      and font could also be used to segment and weight blocks, according to Cai. 
      Also, learning algorithms like neural networks could be trained for the 
      task using examples chosen by people, he said.
 
The researchers' prototype ranks Web pages by extracting page-to-block 
      and block-to-page relationships, then using the information to construct 
      a page graph and a block graph. Page-to-block relationships are determined 
      by analyzing the layout of a page, and block-to-page relationships are determined 
      by the probability of a block linking to a given page.
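
As a rough sketch of that construction, using tiny invented matrices and a normalization that may differ from the paper's, the two graphs can be composed from a page-to-block matrix and a block-to-page matrix:

    import numpy as np

    # Page-to-block matrix X: X[p, b] is the layout-derived importance of
    # block b within page p (each row sums to 1).
    # Block-to-page matrix Z: Z[b, p] is the probability that a link in
    # block b leads to page p (each row sums to 1).
    # All numbers are invented for illustration.
    X = np.array([
        [0.7, 0.3, 0.0, 0.0],   # page 0 is made up of blocks 0 and 1
        [0.0, 0.0, 0.5, 0.5],   # page 1 is made up of blocks 2 and 3
    ])
    Z = np.array([
        [0.0, 1.0],             # block 0's links all point at page 1
        [0.5, 0.5],
        [1.0, 0.0],
        [1.0, 0.0],
    ])

    page_graph = X @ Z    # page-to-page relationships, mediated by blocks
    block_graph = Z @ X   # block-to-block relationships, mediated by pages
    print(page_graph)
    print(block_graph)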
 
 The information is fed to the researchers' link-analysis algorithms 
      -- Block Level PageRank, and Block Level Hypertext-Induced Topic Selection 
      (HITS) -- which assign an importance value to each page based on the type 
      of blocks that link to it. "Based on these, we can build our search engine 
      from the block level," rather than the coarser-grained page level, Cai said. 
      This means doing block-level link analysis and a block-based Web search, 
      he said.
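
One way to read Block Level PageRank, then, is as the familiar power iteration run over the block-derived page graph rather than the raw hyperlink graph. The sketch below reuses the invented matrices from the previous example; it is an interpretation of the description above, not the researchers' code.

    import numpy as np

    X = np.array([[0.7, 0.3, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])   # page-to-block (invented)
    Z = np.array([[0.0, 1.0],
                  [0.5, 0.5],
                  [1.0, 0.0],
                  [1.0, 0.0]])             # block-to-page (invented)
    page_graph = X @ Z                     # rows already sum to 1

    damping, n = 0.85, page_graph.shape[0]
    rank = np.full(n, 1.0 / n)
    for _ in range(50):
        # Power iteration over the block-derived transition matrix.
        rank = (1 - damping) / n + damping * page_graph.T @ rank
    print(rank)   # a block-level importance score for each page

A ranking of the blocks themselves follows the same pattern with block_graph = Z @ X in place of page_graph, which is the spirit of the BlockRank mentioned below.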
 
 The link analysis algorithms are able to extract the intrinsic semantic 
      structure of the Web from this information, according to Cai.
 
This is in some ways similar to the World Wide Web Consortium's Semantic 
      Web project, which aims to give search engines and other software the means 
      to interpret Web page content. The block-level search technique does not 
      provide the concrete semantic information that the Semantic Web promises, 
      but it also does not require widespread adoption of tags and other software 
      to parse Web pages.
 
 The approaches are different because "we try to extract the [semantic] 
      structure of the Web automatically from the existing Web," said Cai.
 
 The method also allowed the researchers to compute a BlockRank at 
      the block level similar to a page-level PageRank.
 
 In a comparison of their search algorithms with page-based versions 
      of the PageRank and HITS algorithms using a standard information-retrieval 
      research data set, the block-based algorithms performed better most of the 
      time, according to Cai.
 
 The block-level analysis of the Web could also lead to a better 
      understanding of the network in general, said Cai.
 
 In a practical search system, the block-level PageRank function 
      would not burden the system because it can be calculated offline, said Cai.
 
 The researchers are currently working to improve the page segmentation 
      algorithm, and to construct Web graphs that more accurately reflect the 
      semantic structure of the Web, said Cai. "We ultimately aim [to build] a 
      better search engine," he said. The researchers previously used the technique 
      to cluster similar images found on Web pages.
 
The technique could be ready for commercial use in a general search 
      engine within two years, said Cai.
 
 Cai's research colleagues were Xiaofei He from the University of 
      Chicago and Microsoft Research Asia, and Ji-Rong Wen and Wei-Ying Ma from 
      Microsoft Research Asia. The researchers presented the work at the Association 
      for Computing Machinery (ACM) Special Interest Group on Information Retrieval 
      (SIGIR) 2004 conference in Sheffield, England, July 25-29. The research was 
      funded by Microsoft Research Asia.
 
 Timeline:   1-2 years
 Funding:   Corporate
 TRN Categories:  Internet; Databases and Information Retrieval
 Story Type:   News
Related Elements:  Technical paper, "Block-level Link Analysis," 
      presented at the Association for Computing Machinery (ACM) Special Interest 
      Group on Information Retrieval (SIGIR) 2004 conference in Sheffield, England, 
      July 25-29
 
 
 
 