July 27/August 3, 2005
engines find content via spiders that go through the pages on a Web site
by following the links among pages. This information is stored in an index
that is used to match query terms. There are two challenges to providing Web
search this way. The first is dealing with the sheer size of
the Internet. The second is presenting a user with a reasonably
whittled-down number of useful links.
The Internet is big by any measure:
The Internet Systems Consortium pegged the number of Internet hosts at 233 million as of January 2004.
Global Reach counted 729 million users online as of March 2004. And a University of California at Berkeley study showed that in 2002, 532,897 terabytes of new data flowed across the Internet, 440,606 terabytes of email were sent, and the Web contained 167 terabytes of data accessible to all users, plus another 91,850 terabytes in the deep Web, where access is controlled.
A terabyte is 1,000 gigabytes, or 1,000,000 megabytes, or the amount of information that can be stored on 213 DVDs, or one-tenth the amount of information stored in the entire Library of Congress print collection.
This is a lot of information, and it lives in a world in which computers are only so fast and hold only so much information. There is simply not enough time and compute power for spiders to crawl all the information in anything like a timely manner, or for even the tens of thousands of servers deployed by the major search engine companies to index and cache it.
To get around the problem, today’s search engines cover only 10 to 20 percent of the Web, and even then, spiders take weeks to finish a single crawl of just that portion. Search engines often crawl popular sites more often to keep them more up to date, but in general, when you search the Web or access a search engine’s cached copy of a page, you are working with a snapshot that is days or weeks old.
Link structure already plays an important role in the second challenge for search engines: presenting links that are relevant. And it is starting to play a more important role in the first challenge: covering more of the Web.
Perhaps the best known example of using link structure to determine link relevance is Google’s PageRank algorithm, which orders search results using an algorithm that measures a page’s popularity based on the number and status of pages that link to it.
PageRank assigns a value to a page by adding up the values of its inbound links. A link’s value is determined by the originating page’s value divided by the number of its outbound links. The algorithm aims to identify authoritative sources and use their authority to evaluate other sources. Because pages determine each other’s rankings, the algorithm has to run many times before it converges on a reasonable value for a given page.
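The iterative scheme described above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not Google's actual implementation; the tiny three-page link graph and the damping factor of 0.85 are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively score pages by the value of their inbound links.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal values
    for _ in range(iterations):
        # every page keeps a small base value regardless of links
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outbound in links.items():
            if not outbound:
                continue
            # a link's value is the source page's value
            # divided by its number of outbound links
            share = damping * rank[page] / len(outbound)
            for target in outbound:
                new_rank[target] += share
        rank = new_rank  # repeat until the values settle
    return rank

# Hypothetical three-page Web: A links to B and C, B to C, C back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

In this toy graph, page C ends up with the highest score: it has two inbound links, including one carrying B's full value, which is why repeated iteration rather than a single pass is needed for the mutually dependent values to converge.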
More recently, researchers have been using link structure to categorize the Internet by subject in order to identify portions of the Web that are more manageable than the entire thing. Given that pages are likely to link to related pages, search algorithms can be tuned to find densely interconnected communities of interest.
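One simple signal such community-finding algorithms can use is link density: what fraction of a candidate group's links stay inside the group. The sketch below and its cooking-pages example are hypothetical, intended only to show the measure, not any particular research system.

```python
def link_density(links, community):
    """Fraction of the community's outgoing links that stay inside it.

    links: dict mapping each page to the set of pages it links to.
    community: set of pages hypothesized to form a community of interest.
    """
    internal = external = 0
    for page in community:
        for target in links.get(page, ()):
            if target in community:
                internal += 1  # link stays within the candidate community
            else:
                external += 1  # link leaves it
    total = internal + external
    return internal / total if total else 0.0

# Hypothetical graph: three cooking pages that mostly cite each other,
# plus one unrelated news page.
graph = {
    "soup": {"bread", "stew"},
    "bread": {"soup"},
    "stew": {"soup", "news"},
    "news": set(),
}
density = link_density(graph, {"soup", "bread", "stew"})
```

Here four of the cooking pages' five links stay inside the group, a density of 0.8, which is the kind of dense interconnection a tuned search algorithm would look for.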
How It Works: Internet Structure
© Copyright Technology Research News, LLC 2000-2006. All rights reserved.