Machines Learn to Mimic Speech

Computers still can't really understand us, but they're getting better at pretending. Today's programs can mimic accents and isolate meaningful information from babblers. Michelle Delio reports from New York.


NEW YORK – Humans and computers still can't sit down for a heart-to-heart chat, and probably won't anytime soon, according to vendors at a speech technology tradeshow held here this week.

But there are a slew of new and effective ways for silicon- and carbon-based life forms to communicate verbally.

"Once we get past the mistaken idea that computers should be able to really understand us or that we can engage in meaningful conversations with machines, the new voice and speech technology is absolutely amazing," said James Larson, who organized the SpeechTek tradeshow.

The products demonstrated at SpeechTek worked well enough to raise suspicions that science-fiction dreams of sentient machines had finally been realized. But vendors adamantly stressed that these systems don't allow computers to understand human speech. The technology simply infers a user's intentions by matching speech against a massive stored database of words and phrases.

"In the past there was a sense of mystery to voice technology," said Frank Vertram, a speech application programmer who attended the show. "I think we were all – the companies that develop these products and the people who use them – quietly convinced that computers could somehow comprehend our words.

"Now the industry has matured past all the wowee-zowee magic and moved towards practical uses of speech tech. Now that the magic is gone, we don't believe in using speech technology unless it serves a viable purpose – making it easier for people to work with a computer system, making systems more secure or even making computers more fun."

Among the useful products on display was speech technology for ATMs that allows visually impaired or computer-nervous users to interact with the machines by listening to a description of the onscreen options through a pair of headphones.

Vertram also pointed out that smaller devices, such as Internet and e-mail-enabled cell phones, "really scream out" for workable voice technology in order to be fully functional.

"The smaller the device, the smaller its keyboard, the more that fat-fingered people like me need applications that can run via voice commands," said Vertram.

"But the tech really has to make my life easier; I don't want to have to change the way I speak just so my so-called smart cell phone can understand me."

Technology that works hard to understand not just what humans are saying, but what they are probably trying to say, was front and center at the show.

Products such as Nuance's "say anything" natural-language applications allow customers to babble blithely to automated customer service call systems and still be understood, thanks to a database that can quickly extract key concepts and infer intent from what speech programmers sometimes bitterly refer to as "freestyle conversations."
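
Nuance hasn't published how its system works, but the general approach the vendors described – spotting known phrases in freeform speech and mapping them to intents – can be illustrated in a few lines of Python. Everything in this sketch, from the phrase table to the intent names, is invented for illustration; production systems use far larger grammars and statistical models:

```python
# Minimal phrase-spotting intent extraction, in the spirit of
# "say anything" systems. The phrase table below is hypothetical.
PHRASE_TO_INTENT = {
    "lost my card": "report_lost_card",
    "check my balance": "account_balance",
    "balance": "account_balance",
    "speak to a person": "transfer_to_agent",
    "operator": "transfer_to_agent",
}

def extract_intent(utterance: str) -> str | None:
    """Scan a freestyle utterance for known phrases, longest first,
    and return the matching intent, or None if nothing is recognized."""
    text = utterance.lower()
    for phrase in sorted(PHRASE_TO_INTENT, key=len, reverse=True):
        if phrase in text:
            return PHRASE_TO_INTENT[phrase]
    return None

# Jumbled, conversational input still yields a usable intent:
print(extract_intent("yeah hi, um, I think I maybe lost my card somewhere?"))
# -> report_lost_card
```

Matching longer phrases first keeps a specific request like "lost my card" from being swallowed by a shorter, more generic match hiding inside the same sentence.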

"You have no idea how nonsensical and jumbled an average conversation is until you try to code a computer program that can make sense of it," said George Funtello, a speech application programmer who was among the show's attendees.

IBM showed off the newest upgrades to its WebSphere speech products, with support for VoiceXML 2.0, a standard for incorporating speech interfaces into websites. The products enable applications to respond as a smart human would – such as not asking people who have already given their ZIP code what city they live in.
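
IBM's implementation details aren't given, but the slot-skipping behavior described above is easy to sketch. In this toy Python form-filler, the slot names and the ZIP-to-city table are assumptions standing in for a real postal database, and input() stands in for speech recognition:

```python
# Hypothetical stand-in for a postal-code database.
ZIP_TO_CITY = {"10013": "New York", "94107": "San Francisco"}

def run_form(filled: dict) -> dict:
    """Prompt only for slots the caller hasn't filled, deriving
    what we can from answers already given."""
    # A caller who already gave a ZIP code is never asked for the city.
    if "city" not in filled and filled.get("zip") in ZIP_TO_CITY:
        filled["city"] = ZIP_TO_CITY[filled["zip"]]
    for slot in ("zip", "city", "street"):
        if slot not in filled:
            filled[slot] = input(f"What is your {slot}? ")
    return filled

print(run_form({"zip": "10013"}))  # fills in "New York" without asking
```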

Computers should also speak in culturally correct accents, according to voice technology firm Cepstral. At the show, the company debuted "Jean-Pierre" and "Isabelle," two French-Canadian-accented voices in French and English, intended for use in Quebec on smartphones, ATMs and handheld devices.

Also introduced were "Damien" and "Duchess," two voices for the American market. Both employ a casual tone that probably wouldn't be accepted in the mainstream European market, said Cepstral chief technology officer Kevin Lenzo.

"The voice has to fit the situation and the user's expectations. New Yorkers would want a voice-automated system that gets right down to business; Southerners would expect a friendly greeting at the start of a transaction," said Lenzo. "Europeans want a certain formality; Americans tend to be fine with phrases like 'OK' even in a business context."

Cepstral can also create tailored voices for highly specific uses, such as a scientist-style voice that uses and understands the proper jargon, or a voice-enabled weather report feature that speaks in a regional dialect when a user types in a location on a weather website.
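
Cepstral didn't detail how such a feature would be wired up, but on the website side, picking a regional voice could amount to a simple lookup. In this hypothetical Python sketch, only the voice names come from the article; the region mapping and the fallback choice are assumptions:

```python
# Hypothetical mapping from a user's region to a regionally accented voice.
REGION_VOICES = {
    "quebec": "Jean-Pierre",  # French-Canadian accent
    "new york": "Damien",     # casual American tone
}

def pick_voice(location: str, default: str = "Duchess") -> str:
    """Return a region-appropriate voice, falling back to a neutral default."""
    return REGION_VOICES.get(location.strip().lower(), default)

print(pick_voice("Quebec"))  # -> Jean-Pierre
print(pick_voice("Ohio"))    # -> Duchess
```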

"I've slaved over speech programs, just to have test users insist they couldn't understand what was going on because the accent of the voice was 'wrong,'" said Funtello.

"And there's all sorts of issues with how people perceive the computerized voice. Once I used a subtle New York accent to express what my client referred to as a 'go-getter' attitude. The Manhattan-based client loved it, but a lot of people said the computer came across as pissed off and in too much of a hurry. Voice isn't just getting the code correct; people somehow expect more when a program is talking to them."

In an attempt to prove that developing speech technology doesn't have to leave programmers screaming in frustration, the tradeshow featured a Speech Solutions challenge.

Bright and early on Monday, seven teams were asked to program a speech application for a specific scenario – helping a caller identify a problem with their car and make an appointment at a repair shop – and were given until the end of the day to complete the coding.

By 5:00 p.m. each team had come up with a workable application.

"It was interesting because we had never developed an application like this," said John Kirst, vice president of business development at TuVox. "But by the end of the day, it and its 378 voice prompts were up and running."
