Interface lets you point and speak

By Kimberly Patch, Technology Research News

One of the reasons speech recognition software remains inferior to human speech recognition is that computers can't read hands.

Humans convey a surprising amount of information through the gestural cues that accompany speech. We point things out, convey concepts like 'big' or 'small', get across metaphorical ideas, and provide a sort of beat that directs conversational flow.

No matter how often or how vigorously you shake your fist at your computer screen, however, it won't help the computer tune in to your mood.

Researchers from Pennsylvania State University are working on a human-computer interface that goes a step toward allowing a computer to glean contextual information from our hands. The software allows a computer to see where a human is pointing and uses that information to interpret the mixed speech and gestural directions that are a familiar part of human-to-human communications.

These pointing, or deictic, gestures are commonly mixed with speech when people talk about things like directions, as in saying "from here to here" while pointing at a map.

The researchers used Weather Channel video to glean a database of deictic gestures, which include directly pointing to something, circling an area, or tracing a contour. "Looking at the weather map we were able to classify pieces of gestures, then say which pieces we can interpret, and what kind of gestures would be useful. We came up with algorithms [that] extract those gestures from just the video," said researcher Sanshzar Kettebekov, a Pennsylvania State University computer science and engineering graduate student.

The researchers used this database to create a pair of applications designed for large screens that allow the computer to interpret what people mean when they use a mix of speech and pointing gestures.

One application, dubbed IMAP, is a campus map that responds to pointing and spoken queries. "It brings the computer into the loop with the human," said Kettebekov. For example, if a person asks the map for a good restaurant in an area she is circling with her hand, the computer will reply based on the spoken request for a restaurant and the gestural request for a location, according to Kettebekov.
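The paper does not give implementation details of how IMAP fuses the two inputs, but the idea described above -- filter map entities by the spoken category, then by the circled region -- can be sketched in a few lines. All names here (Place, resolve_query, the sample campus data) are hypothetical illustrations, not the researchers' code.

```python
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    category: str
    x: float
    y: float

def resolve_query(category, region_center, region_radius, places):
    """Combine the spoken request (a category such as 'restaurant')
    with the deictic gesture (a circled region) and return the
    places that satisfy both."""
    cx, cy = region_center
    return [p for p in places
            if p.category == category
            and (p.x - cx) ** 2 + (p.y - cy) ** 2 <= region_radius ** 2]

campus = [Place("Corner Cafe", "restaurant", 2.0, 3.0),
          Place("Book Annex", "store", 2.5, 3.5),
          Place("Far Diner", "restaurant", 9.0, 9.0)]

# Spoken: "a good restaurant" + gesture: circle centered at (2, 3), radius 2
matches = resolve_query("restaurant", (2.0, 3.0), 2.0, campus)
print([p.name for p in matches])
```

Neither modality alone would answer the question: the speech supplies the category, and the gesture supplies the region.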

The second application is a battlefield planning or city crisis management simulation that allows a person standing in front of a large screen to direct vehicles around a battlefield or city. "A person has limited resources [and there are] alarms going off all over the city. The person is using... a 50-inch display... to direct the resources to where the alarm is going [off]," said Kettebekov.

Even though it seems easy to us, giving a computer the ability to sense and make sense of gestures in a verbal context is a complicated problem that involves several steps, according to Kettebekov. The computer must be able to track the user's hands, recognize meaningful gestures, and interpret those gestures.

The first problem is tracking. "We have a vision algorithm that tracks a person and tries to follow a person's hand," Kettebekov said. The second stage is picking out the pointing gestures. "You're trying to delimit gestures from a continuous stream of frames where the hands are just moving -- saying 'from here to here was this gesture'," he said. "The third stage is interpretation when you really associate [the gesture you have isolated] with parts of speech and try to extract meaning," he said.
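The three stages Kettebekov describes can be sketched as a simple pipeline. This is a toy illustration under stated assumptions, not the researchers' vision algorithm: hand positions are given as (x, y) points per frame, a gesture stroke is delimited by the hand nearly stopping, and interpretation is reduced to pairing each isolated stroke with a co-occurring keyword.

```python
def segment_gestures(positions, pause_threshold=0.5):
    """Stage 2: delimit gesture strokes from a continuous stream of
    hand positions. Heuristic: a stroke ends when the hand nearly
    stops moving between consecutive frames."""
    gestures, current = [], []
    prev = None
    for pos in positions:
        if prev is not None:
            dist = ((pos[0] - prev[0]) ** 2 + (pos[1] - prev[1]) ** 2) ** 0.5
            if dist < pause_threshold:
                if current:               # hand paused: close the stroke
                    gestures.append(current)
                    current = []
            else:
                current.append(pos)       # hand still moving: extend stroke
        prev = pos
    if current:
        gestures.append(current)
    return gestures

def interpret(gestures, keywords):
    """Stage 3: associate each isolated stroke with a keyword from
    the accompanying speech (real interpretation is far richer)."""
    return list(zip(keywords, gestures))

# A toy stream: the hand moves, pauses, moves again, pauses.
positions = [(0, 0), (1, 0), (2, 0), (2, 0), (2, 0), (3, 1), (4, 2), (4, 2)]
strokes = segment_gestures(positions)
print(interpret(strokes, ["from here", "to here"]))
```

Stage 1, tracking the hand in video, is the vision problem the quote mentions and is omitted here; the sketch starts from its output.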

Multimodal human-computer interaction is an active research topic with a long history, said Jie Yang, a research scientist at Carnegie Mellon University. "Coordination of speech and gestures is an old but still open problem," he said, noting that a paper on a computer system that integrated speech and gesture was published 20 years ago, and there have been many studies on the advantages of using speech and gesture. "Yet, we cannot naturally interact with a computer using speech and gesture without constraints today."

When all the difficult computer problems have been worked out, however, systems that recognize speech and gesture will allow a person to "efficiently manipulate multimedia information regardless of whether the person is communicating with a computer or with another human," he said.

The Penn State researchers are working on improving their gesture recognition algorithms by adding an understanding of the prosodic information that lends speech its subtle shades of meaning, said Kettebekov. "We're working on using prosodic information in speech: tone of voice, stresses, pauses... to improve gesture recognition and interpretation," he said.

The toughest of the three gesture problems is improving gesture recognition, said Kettebekov. Currently the system identifies keywords and tries to correlate them with gestures. Adding prosodic information would help the system to both recognize gestures and interpret them, he said.

For example, when a TV meteorologist wants to emphasize a keyword, he raises the tone of his voice, said Kettebekov. "If I want you to pay attention I not only point, but my voice would change so that I would attract more attention to that concrete point," he said. "You can extract those most prominent parts of speech, and those parts of speech nicely relate with the gestures -- in this case it was pointing," he said.
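The emphasis cue described above -- a raised tone of voice marking the word that accompanies a pointing gesture -- suggests a simple alignment scheme. The following is a hypothetical sketch, not the researchers' method: words carry a pitch peak, the prosodically prominent word is the one with the highest peak, and it is matched to a gesture stroke that peaks at nearly the same time.

```python
def prominent_word(words):
    """Each word is (text, time_sec, pitch_peak_hz). Treat the word
    with the highest pitch peak as the prosodically prominent one,
    a crude proxy for the raised tone described above."""
    return max(words, key=lambda w: w[2])

def align(word, gesture_peak_times, window=0.3):
    """Associate the prominent word with any pointing gesture whose
    stroke peaks within `window` seconds of the word."""
    return [t for t in gesture_peak_times if abs(t - word[1]) <= window]

words = [("rain", 0.2, 180.0), ("here", 0.9, 260.0), ("later", 1.6, 190.0)]
w = prominent_word(words)
print(w[0], align(w, [0.85, 1.7]))
```

Here the emphasized "here" aligns with the gesture peak at 0.85 seconds, while the later stroke is ignored, mirroring the pairing of prominent speech with pointing that Kettebekov describes.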

The researchers may eventually turn their sights to iconic, metaphoric and beat gestural information, but there is a lot of work to be done in the deictic area first, said Kettebekov. In addition, understanding what these subtler gestures mean from a linguistics point of view "is not there yet -- so there's not enough theoretical basis," to use to give that understanding to computers, he said.

Kettebekov's research colleague was Rajeev Sharma of Pennsylvania State University. They presented the research at the Engineering for Human-Computer Interaction conference in Toronto in May, 2001. The research was funded by the Army Research Laboratory and the National Science Foundation (NSF).

Timeline:   Now
Funding:   Government
TRN Categories:  Human-Computer Interaction; Computer Vision and Image Processing
Story Type:   News
Related Elements:  Technical paper, "Toward Natural Gesture/Speech Control of a Large Display," presented at the Engineering for Human-Computer Interaction conference in Toronto, May 11-14, 2001.


July 25, 2001


© Copyright Technology Research News, LLC 2000-2006. All rights reserved.