Speech recognition to sort Holocaust tapes

By Kimberly Patch, Technology Research News

When Steven Spielberg established the Shoah Foundation to record eyewitness accounts of Holocaust survivors and rescuers seven years ago, speech recognition software that took dictation was barely usable.

Now, after videotaping 52,000 eyewitness accounts in 57 countries and 32 languages, the foundation is looking to speech recognition software -- which has also come a long way in the past seven years -- to help with the arduous task of indexing the 116,000 hours of interviews.

The foundation is currently indexing the material manually according to a thesaurus of keywords. "Annotators mark down... codes from the thesaurus as they watch the interviews," said Bill Byrne, an associate research professor of electrical and computer engineering at Johns Hopkins University.

The process is very time-consuming: it would take 40 years of 8-hour days to simply watch the entire collection. "It's also difficult to determine beforehand how to annotate the data so that subsequent searchers can find exactly what they're looking for," he said.
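The "40 years" figure follows from simple arithmetic on the numbers given above:

```python
# Rough arithmetic behind the "40 years" figure: 116,000 hours of
# interviews watched in 8-hour working days, 365 days a year.
hours = 116_000
days = hours / 8          # 14,500 eight-hour days
years = days / 365        # roughly 39.7 years of daily viewing
print(round(years, 1))
```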

Teams of researchers from IBM, Johns Hopkins University, the University of Maryland, and the Shoah Foundation will take several approaches over the next five years in an attempt to automate the process and make the material more accessible to historians and teachers, said Byrne.

"We hope to be able to use speech recognition and a cross-lingual information retrieval technique to both speed up the annotation so it will be easier for the skilled translators to annotate and also to, at some point, make it possible for people to be able to search these data collections directly without the need of human annotation at all," said Byrne.
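One way to picture cross-lingual retrieval over an annotated collection is a shared, language-neutral layer of thesaurus codes: segments in any language are tagged with the same codes, and a query in one language is mapped to codes before lookup. The sketch below is a toy illustration of that idea only; the keywords, codes, and segment identifiers are invented, not taken from the actual collection or the project's system.

```python
# Toy sketch of cross-lingual retrieval through a shared thesaurus:
# interview segments in any language carry language-neutral codes,
# and an English query is mapped to a code before lookup.
# All data below is hypothetical, for illustration only.

# Hypothetical thesaurus: English keyword -> thesaurus code
thesaurus = {"liberation": "T041", "hiding": "T017"}

# Hypothetical index: thesaurus code -> tagged segments (any language)
index = {
    "T041": ["en_0034 12:40", "cz_0102 03:15"],
    "T017": ["en_0034 55:02"],
}

def search(query: str) -> list[str]:
    """Return segments tagged with the code for an English keyword."""
    code = thesaurus.get(query.lower())
    return index.get(code, [])

print(search("Liberation"))  # finds English and Czech segments alike
```

Because the codes, not the words, are matched, a single English query can surface testimony recorded in any of the collection's languages.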

Current speech recognition software, which works fairly well for a single trained user, is still not up to the task of transcribing emotional testimony from tape, spoken by many different people in many languages. The nature of the job makes it an excellent research project, however, said Byrne.

Speech recognition systems work well when people are speaking specifically to be understood, like dictating directly to a computer or professionally announcing the news, said Byrne. This is why the real-time speech recognition systems used to caption news or sports broadcasts in loud bars work fairly well.

In contrast, in the Shoah Foundation material, "people are speaking to an interviewer... and their speech is highly emotional and about topics that are something out of the general realm of experience. They're heavily accented in the English collections. And the speakers are also elderly. Children and elderly people [have] a lot more variability in their speech, [which] makes it hard to recognize as well," he said.

Another challenge is the acoustics. In contrast to newscasts, the videotaping "was not done in a sound booth... there's just a microphone in the camera several feet away from the speaker," said Byrne.

It's a difficult project, said Alex Waibel, a professor of computer science at Carnegie Mellon University. "The biggest challenges are that the recorded speech is conversational, not read, and therefore presents greater variability, leading to higher error rates, [it is in] multiple languages, [and it involves the] expression of emotion, which makes recognition harder."

Usually speech recognition systems address the multiple language problem by individually training recognizers for each language, said Waibel. The project is an obvious fit for an alternative approach that has already shown some promise -- multilingual speech recognition models, he said.

Multilingual models, proposed five years ago by Carnegie Mellon and University of Karlsruhe researcher Tanja Schultz, showed that a single multilingual recognizer can do as well as separate recognizers trained for individual languages, said Waibel.

The approach, presaged by the fictional Star Trek universal translator, uses a speech model that "can essentially be used for any new language with little adaptation data," Waibel said.

The researchers are taking several different tacks, said Byrne.

The IBM researchers are adapting an English recognition module using 100 hours of tape from the collection that will be transcribed by people, essentially giving the module the answers to the first 100 hours of words. It is not possible to do this much work for each of the 32 languages, however, so the researchers will next use about 20 hours of transcribed Czech to adapt a Czech module, said Byrne. "We're going to see if we can develop techniques that allow us to train systems with much less data," he said.
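The article does not say which adaptation techniques the researchers will use, but a standard way to adapt a model with very little transcribed data is maximum a posteriori (MAP) style interpolation: new estimates are pulled toward a prior trained on other speech, so a handful of adaptation samples nudges the model only slightly, while a large amount dominates it. The toy sketch below shows that behavior for a single Gaussian mean; the numbers are illustrative.

```python
# Toy MAP-style adaptation of a Gaussian mean: with little adaptation
# data, the updated mean stays close to the prior trained on other
# speech; with lots of data, it moves toward the new speakers' mean.
# tau is a relevance weight; all values here are illustrative.

def map_adapt_mean(prior_mean, samples, tau=10.0):
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)

prior = 0.0                      # mean from a large out-of-domain model
few = [1.0, 1.2, 0.8]            # 3 adaptation samples: small shift
many = [1.0] * 1000              # 1000 samples: follows the new data
print(map_adapt_mean(prior, few))   # stays near the prior (about 0.23)
print(map_adapt_mean(prior, many))  # about 0.99, dominated by new data
```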

Speech recognition is just part of the project, he added. Its goal is finding information, and the speech recognizers will be embedded in a much larger search and retrieval system, he said. The idea is to "make this data usable by historians and educators and teachers... they're going to want to search through the material to find discussion of certain events or themes... related to their research or the classroom material," he said.

The advantage of this goal is "the speech recognition systems don't need to work perfectly to be useful for searching archives. Returning good answers to the user's query [is] what we're really after," he said. The researchers will also concentrate on retrieval lexicons, which are lists of words used by search engines. "We will try to make sure that we do a very good job on these words, because these are the words [search engines are] looking for," he said.
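A retrieval lexicon can be pictured as the vocabulary of an inverted index built over recognizer output: only the words searchers are likely to query get indexed, so recognition effort can be concentrated on getting exactly those words right. The sketch below is a toy illustration of that idea, with invented transcripts and lexicon entries, not the project's actual system.

```python
# Toy inverted index restricted to a retrieval lexicon: only words a
# searcher is likely to query are indexed, so it matters little if
# the recognizer garbles words outside the lexicon.
# Transcripts and lexicon below are invented for illustration.

lexicon = {"ghetto", "liberation", "resistance"}

transcripts = {
    "seg1": "we heard about the liberation on the radio",
    "seg2": "life in the ghetto before the war",
}

def build_index(docs, lexicon):
    """Map each lexicon word to the segments whose transcript contains it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.split()):
            if word in lexicon:
                index.setdefault(word, []).append(doc_id)
    return index

index = build_index(transcripts, lexicon)
print(index["liberation"])  # ['seg1'] -- common words like "radio" are skipped
```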

The general plan is that the Maryland researchers will work on information retrieval and interaction with users; the Johns Hopkins researchers will work on speech recognition and the problems of working with multiple languages; the IBM researchers will focus on transcribing English; and the Shoah Foundation researchers will concentrate on cataloging and adapting the approaches to their specific needs, said Byrne. "All these efforts fit together tightly," he said.

The project is scheduled to last five years. Improved access to the Shoah Foundation archives is likely to be available sooner, said Byrne. "We could start seeing initial results from the effect of our work within a year or so," he said.

Byrne's research colleagues are Frederick Jelinek, Sanjeev Khudanpur and David Yarowsky from Johns Hopkins University; Douglas Oard, Bruce Dearstyne, David Doermann, Bonnie Dorr, Philip Resnik and Dagobert Soergel of the University of Maryland; Bhuvana Ramabhadran and Michael Picheny from IBM T. J. Watson Research; and Sam Gustman, Douglas Greenberg and Ella Thompson of the Survivors of the Shoah Visual History Foundation. The research is funded by the National Science Foundation (NSF).

Timeline:   5 years
Funding:   Government
TRN Categories:   Human-Computer Interaction
Story Type:   News
Related Elements:   None.


October 31, 2001

© Copyright Technology Research News, LLC 2000-2006. All rights reserved.