Illustration: Siggi Eggertsson

It's a recurring pipe dream for technophiles and luddites alike: computers that not only listen but understand our every command. And each year, like clockwork, someone claims this day is upon us—that we can toss out our keyboards and warm up our larynges for a new relationship with our machines.
Press or say "1" for a cold, hard dose of reality.
Despite being crammed into nearly every imaginable electronic receptacle—from cell phones and desktop operating systems to cars and aircraft cockpits—speech-recognition software remains light-years away from tackling the general-purpose applications that would change the way we interact with computers. Sure, we've seen modest improvements, but breakthroughs have been rare. One of the most recent occurred more than a decade ago: Rasta, developed at the International Computer Science Institute at UC Berkeley, enabled different kinds of hardware to use the same speech-recognition software. It was widely implemented in mobile phones in 2001, and nothing game-changing has happened since.
What's the holdup? Part of the problem is that, unlike most software, speech recognition doesn't get much better just by throwing processing power at it. Moore's law only boosts a machine's ability to churn through ever-larger pronunciation databases.
Those databases do help. By compiling massive lists of pronunciation variants, engineers try to minimize errors. But with some 30 ways of saying "of," and nearly infinite spoken iterations for more complex words, even the largest inventory is easy to foil. "There's not a speech recognizer today that you can't break by stretching out certain syllables," says Deb Roy, director of the Cognitive Machines Group at the MIT Media Lab.
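To see why even a huge inventory of variants is brittle, consider a toy sketch (in no way how a production recognizer actually works; the phoneme spellings and words here are made up for illustration) of a lexicon that simply lists pronunciation variants per word and looks for an exact match:

```python
# Toy lexicon: each word maps to a list of pronunciation variants,
# written as phoneme sequences. Spellings are hypothetical.
LEXICON = {
    "of": [("ah", "v"), ("ah", "f"), ("uh", "v"), ("ow", "v")],  # a few of the ~30 variants
    "the": [("dh", "ah"), ("dh", "iy")],
}

def words_matching(phones):
    """Return every word whose listed variants exactly match the phone sequence."""
    return [word for word, variants in LEXICON.items() if tuple(phones) in variants]

# Works when the speaker hits a listed variant...
print(words_matching(["ah", "v"]))        # ['of']
# ...but a stretched-out syllable matches nothing: the inventory is foiled.
print(words_matching(["ah", "ah", "v"]))  # []
```

Real systems score probabilistically rather than matching exactly, but the underlying weakness is the same: any fixed list of variants can be stretched past its limits.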
So scientists continue to hack away at the problem, and they're learning a ton about how we meatbags process and understand sound. It turns out that we aren't flawless speech recognizers either. Rather, we often eke out meaning based largely on context and expectations.
"The next major thing in speech recognition is letting machines train themselves on the context," Roy says. His group is programming machines to analyze the listening environment and factor that new data into their sound-deciphering processes. Thus far they've experienced spikes in accuracy as high as 23 percent.
So while we're waiting for machines to start meeting us halfway on the speech front, please have a little patience with the automated voice on the other end of the line. You're really hard to understand.