Ties that bind boost searches
By Kimberly Patch, Technology Research News
Search for Michael Jordan on the Web and
chances are you'll have a hard time finding the scientist of that name.
If you happen to know specifics about Jordan's branch of science, you
can augment the search with a term like "Bayesian networks" in order to
divert the inevitable flood of references to the popular athlete. But
if you don't already know something about the scientist, you'll have to
slog through many results to find the right guy.
Scientists from NEC Corporation's Research Institute are attempting to
solve this basic Web search problem with an algorithm that takes advantage
of the linking structure of the Web. "The solution... is to find an efficient
and effective means of partitioning the Web in a manner that is independent
of text content," said Gary William Flake, a research scientist at NEC
Research.
The algorithm could lead to more precise search engines, according to
Flake.
Organizing the Web in a way that does not depend on the language contained
within Web pages is important because language is ambiguous. "If we tried
to cluster occurrences of 'Michael Jordan' by text content, we could run
into many difficulties related to the ambiguity of languages," said Flake.
In addition, "there may be many more 'Michael Jordans' out there, so we
can't just arbitrarily choose to separate athletes from scientists," he
said.
One way to organize the Web is to introduce some type of centralized hierarchy
by hand, said Flake. Portals like Yahoo, for instance, employ people to
index portions of the Web. But using people to comb through Web pages
runs into a limit "related to how many humans you can get to do the dirty
work," he said. "I suspect that there will never be a portal that indexes
more than 10 million pages simply because it is too labor-intensive to
do so."
Between automatic search engines and portals that use humans to organize
content there is an implicit trade-off between reach and precision, said
Flake. "On a search engine, you get back many results, but a lot of them
are garbage. On portals, you get a small set of high-quality results,
but with a lot of stuff missing," he said.
The algorithm attempts to span both approaches by providing a way to automatically
organize the information based on the structure imposed on the Web by
the way connections grow among pages.
Although it may seem random, the Web actually contains a tremendous
amount of structure, said Flake. The Web is scale-free, meaning it has
a few nodes, or pages, with many links to other nodes, and many nodes
with just a few links. It is also a small-world network, meaning there
are enough shortcuts between highly connected nodes, or hubs, that it
is possible to get from any one node to any other in just a few hops.
The same phenomenon is responsible for the six degrees of separation
found in social networks.
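To see these two properties concretely, here is a minimal sketch in Python using the networkx library (a toy stand-in for the Web graph; the researchers are not stated to use these tools). It builds a scale-free graph by preferential attachment, then checks for hubs and short paths:

    import networkx as nx

    # Preferential attachment produces a scale-free graph: a few
    # heavily linked hubs and many sparsely linked nodes.
    g = nx.barabasi_albert_graph(n=1000, m=3, seed=42)

    degrees = sorted((d for _, d in g.degree()), reverse=True)
    print("largest hub degrees:", degrees[:5])           # a handful of big hubs
    print("median degree:", degrees[len(degrees) // 2])  # most nodes are small

    # Small-world property: any node is only a few hops from any other.
    print("average shortest path:", nx.average_shortest_path_length(g))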
The algorithm connects the Web's linking structure to communities such
as science, finance, health, education, or recreation. A Web community
is a set of pages whose members have more links to one another than to
pages outside the community.
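One way to read that definition page by page: every member links more often inside the set than outside it. A hedged sketch of such a test follows; the function name and the exact criterion are assumptions for illustration, not taken from the paper:

    import networkx as nx

    def is_community(g: nx.Graph, members: set) -> bool:
        """True if each member links more inside the set than outside it."""
        for node in members:
            inside = sum(1 for nbr in g.neighbors(node) if nbr in members)
            if inside <= g.degree(node) - inside:
                return False
        return True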
There are several algorithms that track links to find sets of related
pages, but they're either inefficient or cover only a portion of the Web,
according to Flake.
The key to coming up with an efficient algorithm for identifying communities
was to start with a set of hand-picked seed sites, he said. "The entries
within a portal category typically make excellent seeds."
The algorithm looks to the Web's structure to calculate a community that
contains the seed sites, said Flake. "In this way, we can improve the
[reach] of a portal while preserving the precision," he said.
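The article does not spell out how the community is computed from the seeds, so the following is only an illustrative greedy expansion under that description, not the NEC method itself; the names and the stopping rule are invented for the sketch:

    import networkx as nx

    def grow_community(g: nx.Graph, seeds: set, max_size: int = 500) -> set:
        """Greedily absorb pages that link mostly into the community."""
        community = set(seeds)
        while len(community) < max_size:
            # Pages one link away from the current community.
            frontier = {n for m in community for n in g.neighbors(m)} - community
            if not frontier:
                break
            # Strongest candidate: the page with the most links inward.
            best = max(frontier, key=lambda n: sum(
                1 for x in g.neighbors(n) if x in community))
            inside = sum(1 for x in g.neighbors(best) if x in community)
            # Stop when even the best candidate links mostly elsewhere.
            if inside <= g.degree(best) - inside:
                break
            community.add(best)
        return community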
The algorithm then performs a popularity-like ranking within the community.
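The ranking scheme is not described either; one simple stand-in, assumed here purely for illustration, would order members by how many fellow members link to them:

    def rank_members(g, community):
        """Order community members by links from fellow members."""
        def internal_links(n):
            return sum(1 for x in g.neighbors(n) if x in community)
        return sorted(community, key=internal_links, reverse=True)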
The group's continuing work crosses three distinct areas, said Flake.
"On the mathematical front, we [are] refining our algorithm to make it
more efficient and more accurate." The group is also working to mathematically
characterize the properties of communities, he said.
"On the engineering front, my group is working on improving our hardware
and software infrastructure so that our algorithms will scale up to the
case where we're dealing with billions of Web pages," he said.
On the scientific front, the researchers are studying how large-scale
communities on the Web relate to one another and how new communities emerge,
said Flake.
"For the Web, I'm hoping that we can turn this into niche search engines,
automatic portals, smart content filters, and even user agents," said
Flake.
The community algorithm may eventually also be applicable to other types
of networks, said Flake. "I would like to see if we can effectively apply
this to other data sets that have a network-like structure, such as those
found in biology," he said.
The researchers have found a novel way to infer content from links without
using textual information, said Filippo Menczer, an assistant professor
at the University of Iowa. The work "pushes the envelope with respect
to just how much useful information we can extract from the self-organizational
Web hypertext -- the fact that people selectively link their Web pages
to other, probably related pages," he said.
The method could prove useful for augmenting search engines, said Menczer.
"For example, Google burned the competition when it first introduced link
analysis into the realm of commercial search engines. There are still
many clues hidden in the Web that can be exploited, both in text and link
information," he said. "It's an incredibly rich environment and we have
only begun its exploration."
One issue with this particular method is how much time the algorithm takes.
Because the Web is so large, it may be impractical to crawl it to the
depth the algorithm requires, Menczer said. "An interesting direction for
this work is to explore efficient ways to apply the idea. While the
algorithm does not make any use of text information, it is possible that
using text to guide the preliminary crawls" may speed community
identification, he said.
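A speculative sketch of that suggestion: score pages by text relevance before fetching, crawl best-first, and hand only the resulting link graph to the text-free community algorithm. The fetch and score functions below are placeholders the caller would supply, not part of the published work:

    import heapq

    def focused_crawl(start_urls, fetch, score, limit=1000):
        """Crawl highest-scoring pages first; return the link graph found."""
        frontier = [(-score(u), u) for u in start_urls]
        heapq.heapify(frontier)
        seen = set(start_urls)
        edges = []
        while frontier and len(seen) < limit:
            _, url = heapq.heappop(frontier)
            for link in fetch(url):  # fetch() returns a page's outlinks
                edges.append((url, link))
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
        return edges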
The researchers are aiming to make a viable content filter that they can
demonstrate within a year, said Flake. "Later, on the order of two years
from now, we may introduce a niche search engine that ultimately customizes
itself to individual users," he said.
Flake's research colleagues were Steve Lawrence, C. Lee Giles and Frans
M. Coetzee. They published the research in the March 2002 issue of the
journal IEEE Computer. The research was funded by NEC.
Timeline: 1 year, 2 years
Funding: Corporate
TRN Categories: Internet
Story Type: News
Related Elements: Technical paper, "Self-Organization of the Web and
Identification of Communities," IEEE Computer, March 2002.