Software sorts video soundtracks

By Chhavi Sachdev , Technology Research News

The theme music for the nightly news and the newscaster’s voice sound inherently different to us, but distinguishing between the two is not an easy trick to teach a computer.

When it comes to cataloguing and indexing, however, computers are much faster than humans. To put this indexing speed to practical use for audio, computers must be able to distinguish a human voice from a saxophone, and the timpani in the fourth movement of Beethoven's Sixth Symphony from real thunder.

Scientists at Microsoft Research have come up with an algorithm that allows computers to differentiate among speech, music, environmental sounds, and silence in video soundtracks by mapping and comparing the characteristics of each type of sound.

Classifying sound as noise, speech, or music is an important key to coding audio data, said Hong-Jiang Zhang, a senior researcher and assistant managing director at Microsoft Research in China. "Audio segmentation and classification is [the] first step in audio content analysis and content-based [searching] in large audio or music databases," he said. It can also help group together segments of speech by a particular person.

To identify sound, the algorithm goes through a two-step process, Zhang said. It analyzes each audio clip in one-second windows, then divides each window into forty 25-millisecond frames. Using pattern classification, it characterizes each 25-millisecond frame as either speech or non-speech, then further sorts non-speech sounds into silence, music, and environmental sound.
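The two-step windowing described above can be sketched in a few lines of Python. The frame-level rule used here, a zero-crossing-rate threshold, is a stand-in for the researchers' actual pattern classifier, which the article does not detail; the sample rate and threshold values are likewise illustrative assumptions.

```python
# Sketch of the two-step segmentation: one-second windows, each split
# into forty 25-ms frames, each frame labeled speech or non-speech.
# Assumptions: 16 kHz mono samples; the zero-crossing-rate test is a
# placeholder for the paper's pattern classifier.

SAMPLE_RATE = 16_000
FRAME_MS = 25
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 400 samples per frame

def frames(window):
    """Split a one-second window into forty 25-ms frames."""
    return [window[i:i + FRAME_LEN]
            for i in range(0, len(window), FRAME_LEN)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def classify_frame(frame, zcr_speech=0.1):
    """Step 1: label a single frame speech or non-speech (placeholder rule)."""
    return "speech" if zero_crossing_rate(frame) > zcr_speech else "non-speech"

def classify_window(window):
    """Step 2: label the one-second window by majority vote over its frames."""
    labels = [classify_frame(f) for f in frames(window)]
    return max(set(labels), key=labels.count)
```

In this sketch the window label is a simple majority vote over frame labels; the published method further refines non-speech frames into silence, music, and environmental sound.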

The researchers set thresholds to allow the algorithm to distinguish the sound of a slamming door from a drum roll. Below a certain level, the sound is classified as environmental noise; above it, as music.
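That thresholding step amounts to a one-line decision rule. The feature name ("tonality") and the cutoff value below are assumptions for illustration; the article says only that a threshold separates environmental sound from music.

```python
# Hedged sketch of the noise-vs-music threshold. The discriminating
# feature is assumed to be some tonality score in [0, 1]; the article
# does not name the actual feature or threshold value.

def classify_non_speech(tonality, threshold=0.5):
    """Below the threshold: environmental sound; above it: music."""
    return "music" if tonality > threshold else "environment"
```

A slamming door would score low on such a feature and land in the environment bin, while the sustained, pitched energy of a drum roll would push it over the line into music.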

The most difficult task the computer faces is sorting mixed sounds, said Zhang. Speech with a noisy background, for instance, is frequently classified as music, and music clips with drum sounds are falsely pegged as speech, he said.

Despite the overlaps, the algorithm has a 96 percent accuracy rate in classifying audio data, according to Zhang. When the algorithm skips the second step of breaking sounds into 25-millisecond components, its accuracy drops to 91.47 percent on 2- to 8-second clips, he said.

This whole area of research is very exciting, said Dan Ellis, an assistant professor of electrical engineering at Columbia University. "Automatic classification and segmentation of [sound] could make searching for a particular example or instance orders of magnitude easier," he said.

Segmentation algorithms like Microsoft’s "could be used as a basis for more complex classifications, for instance to distinguish between news and sitcoms by the patterns of alternation between speech and music, or to detect commercials based on their particular soundtrack signatures," Ellis said.

There are generally several things going on at once in soundtracks, said Ellis. The sound in most video "doesn't fit very comfortably into the black-and-white classification paradigm assumed by this and similar work. Ultimately, we will need computer soundtrack analysis systems that can describe segments with more nuance, for instance as ‘male speech with orchestral music in the background’," he said.

The system can be used now for simple classification, but the researchers plan to expand the classification categories to make it more accurate, said Zhang. It should be in practical use within three years, he said.

Zhang’s research colleagues were Lie Lu and Hao Jiang at Microsoft Research in Beijing. They presented the research at the International Conference on Multimedia held between September 30 and October 5 in Ottawa, Canada. The research was funded by Microsoft.

Timeline:  3 years
Funding:   Corporate
TRN Categories:  Multimedia
Story Type:   News
Related Elements:  Technical paper, "A Robust Audio Classification and Segmentation Method," at International Conference on Multimedia of the Association for Computing Machinery (ACM) in Ottawa, Canada, September 30 - October 5, 2001.


November 28, 2001


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.