Search tool builds encyclopedia TRN 080101

By Chhavi Sachdev, Technology Research News

The best part about the Internet is having so much information at your fingertips. You type in a word or phrase, hit “search” and wait for your hits. Then you hope for the best as you click on a description to see if the site contains what you need.

A pair of researchers at the University of Library and Information Science in Tsukuba, Japan, have come up with a system that streamlines Internet searching by organizing the Web into a sort of open encyclopedia. Instead of returning a list of a thousand Web sites that might contain answers, the system extracts the relevant information along with links to its sources and organizes it in the form of an encyclopedia entry.

“The interface has two fundamental modes: keyword and concept input,” said Atsushi Fujii, a postdoctoral research assistant at the university. If a user types a word such as ‘pipeline,’ which could mean either a means of conveying liquids and gases or a computer processing method, the application distinguishes between the two usage domains and shows the entries describing each usage, Fujii said. The resulting page looks much as if it came out of a paper dictionary or encyclopedia, except that each description has a hyperlink to its source page.

In the concept input mode, users can type in sentences rather than keywords, such as, ‘What infects computer files by way of e-mails?’ Fujii said. To answer the question, the system generates a list of candidate keywords, such as ‘microvirus’ and ‘computer virus,’ he said. Users select one of the keywords to see its description page, essentially switching back to the keyword input mode.

The system culls entries from Web pages and stores them in a database. Because the system uses the Google search engine to retrieve candidate pages, its raw material is the same set of results anyone would get by searching on a term like ‘microvirus.’

The system deletes layout information and links and retains only the sentence fragments surrounding a key term. It uses a statistical language model and a morphological analyzer to prevent the output from resembling garbled strings of words. The morphological analyzer segments the input sentences into words; the statistical language model is “a set of probabilities that each word appears in a given context,” Fujii said.
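The filtering step described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the function name `extract_fragments`, the helper class `TextOnly`, and the punctuation-based sentence splitting are assumptions made for illustration, and the real system works on Japanese text with a morphological analyzer rather than on English punctuation.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and discards tags, i.e. layout and link markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_fragments(html, term):
    """Return the sentences of a page that mention `term` (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace left by the markup.
    text = " ".join(" ".join(parser.chunks).split())
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if term.lower() in s.lower()]


page = '<p>A <a href="x">pipeline</a> conveys oil. Weather is nice.</p><div>Menu</div>'
fragments = extract_fragments(page, "pipeline")
```

Here only the sentence that mentions the term survives; the link markup, the unrelated sentence and the menu text are dropped.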

Using the two preceding words as context, the statistical language model works with three-word patterns, or tri-grams, such as “go to school,” that are characteristic of term descriptions. “Given a fragment extracted from a Web page, our method extracts all the possible tri-grams from the fragment, and computes a combined probability for them,” Fujii explained. The result is very readable, and quite accurate, he said.
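A rough Python sketch of tri-gram scoring follows. The article does not say how the probabilities are estimated or combined, so the add-one smoothing and the averaged log probability used here are assumptions, as are the names `TrigramModel`, `prob` and `score` and the toy training sentences.

```python
import math
from collections import Counter


def trigrams(words):
    """All consecutive three-word patterns in a list of words."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


class TrigramModel:
    """Toy tri-gram model trained on example description sentences."""

    def __init__(self, sentences):
        self.tri = Counter()      # counts of (w1, w2, w3)
        self.context = Counter()  # counts of the two-word context (w1, w2)
        self.vocab = set()
        for words in sentences:
            self.vocab.update(words)
            for w1, w2, w3 in trigrams(words):
                self.tri[(w1, w2, w3)] += 1
                self.context[(w1, w2)] += 1

    def prob(self, w1, w2, w3):
        # P(w3 | w1, w2) with add-one smoothing (an assumption, not from the paper).
        return (self.tri[(w1, w2, w3)] + 1) / (self.context[(w1, w2)] + len(self.vocab))

    def score(self, words):
        # Combine tri-gram probabilities as a mean log probability (also an assumption).
        grams = trigrams(words)
        if not grams:
            return float("-inf")
        return sum(math.log(self.prob(*g)) for g in grams) / len(grams)


model = TrigramModel([
    "a virus is a program".split(),
    "a pipeline is a method".split(),
])
good = model.score("a virus is a program".split())
bad = model.score("program a is virus a".split())
```

A fragment whose word order resembles the training descriptions scores higher than a scrambled one, which is the property a system like this could use to rank candidate descriptions.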

To test accuracy, the researchers generated an encyclopedia from 96 test terms collected from the Japanese IT Engineers Examinations. The method generated appropriate descriptions for 90 percent of the test terms. The descriptions in the generated encyclopedia were comparable to those in an existing hand-compiled computer encyclopedia, said Fujii.

The system is better than encyclopedias and dictionaries that are unable to keep up with new developments and information, said Fujii. “Our method facilitates searching the Web for encyclopedic knowledge related to input terms. Consequently, users can easily obtain knowledge associated with new or technical terms unlisted in existing encyclopedias,” he said.

Once an encyclopedia has been generated for a search term, it is stored in a database. The database is updated periodically, Fujii said. If the search term has already been indexed, it takes only a few seconds to find an entry. Terms that are not indexed in the encyclopedia are processed in real-time, which can take up to a couple of minutes, he said.
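The lookup behavior described above amounts to a cache in front of a slow build step. The Python sketch below shows the idea; the class `EncyclopediaCache`, its in-memory dictionary, and the `build_entry` callback are hypothetical stand-ins for the system's database and its real-time processing.

```python
from typing import Callable, Dict, List


class EncyclopediaCache:
    """Serves stored entries quickly and builds missing ones on demand."""

    def __init__(self, build_entry: Callable[[str], List[str]]):
        self._build_entry = build_entry          # slow path: search and extract
        self._entries: Dict[str, List[str]] = {}  # stands in for the database

    def lookup(self, term: str) -> List[str]:
        key = term.strip().lower()
        if key not in self._entries:
            # Term not indexed yet: process it now and store the result.
            self._entries[key] = self._build_entry(key)
        return self._entries[key]

    def refresh(self) -> None:
        # Periodic update: rebuild every stored entry.
        for key in list(self._entries):
            self._entries[key] = self._build_entry(key)


calls = []


def fake_build(term):
    # Placeholder for the slow Web search and extraction step.
    calls.append(term)
    return ["description of " + term]


cache = EncyclopediaCache(fake_build)
first = cache.lookup("Pipeline")
second = cache.lookup("pipeline ")
```

The second lookup is answered from the stored entry, so the slow build function runs only once for the term.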

“On the whole it is promising, but the current system is too premature to be practically interesting just yet,” said John Prager, a research staff member at IBM’s T.J. Watson Research Center. If a user wanted to research a technical subject, “this could be an interesting front-end to a traditional search engine such as Google, but as a Question-Answering system it is well below the state of the art,” he said.

Its drawbacks are that it only deals with “what-is” questions of a multiple choice nature, for which the correct answers are already supplied. Its performance on these questions is also no better than existing systems, Prager said.

The researchers are planning to use a parallel PC cluster to speed up the process since each description can be processed independently, Fujii said. They also plan to expand the system to answer “how” and “why” questions along with “what” questions, he said.
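Because each description can be handled on its own, the work maps naturally onto a pool of workers. The Python sketch below is only an illustration of that idea: a thread pool on one machine stands in for the planned PC cluster, and `score_description` is a hypothetical placeholder for the real per-description processing.

```python
from concurrent.futures import ThreadPoolExecutor


def score_description(description):
    # Placeholder for the real per-description work; here it just counts words.
    return len(description.split())


def score_all(descriptions, max_workers=4):
    """Process independent descriptions concurrently, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(score_description, descriptions))


results = score_all(["a b", "c", "d e f"])
```

No description depends on the result of another, so the tasks need no coordination beyond collecting the results.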

The system is currently used for Japanese text only, but it could be used for several other languages, according to Fujii. It will be ready for practical application in two years, he said.

Fujii’s research colleague was Tetsuya Ishikawa. They presented their research at the 39th Annual Meeting of the Association for Computational Linguistics (ACL2001), held in Toulouse, France from July 6-11, 2001. The research was funded by the University of Library and Information Science, Tsukuba, Japan.

Timeline: >2 years
Funding: University
TRN Categories: Natural Language Processing; Databases and Information Retrieval; Internet
Story Type: News
Related Elements: Technical paper, "Organizing Encyclopedic Knowledge based on the Web and its Application to Question Answering," presented at the 39th Annual Meeting of the Association for Computational Linguistics (ACL2001), July 6-11, 2001, Toulouse, France.

August 1/8, 2001
