Sep 1, 1995 12:00 PM

Voice of the Revolution

That disembodied voice you hear - who or what is it?

So, you're not schizophrenic. The voices you're hearing are real.

Everything is talking: greeting cards speak when you open them, elevators tell you what floor you've reached, your answering machine reports the details of your incoming calls. In fact, this electronic babble follows you from the parking garage to your office, from your home computer to the curbs of city streets. These voices have become a part of your daily routine; you accept their instructions without thinking twice. But then one day you stop to listen and wonder, Who's there?

Your exit ticket cannot be read. File's done. You have 15 new messages.

Who are the souls behind these ubiquitous voices? Ask Chris Schmandt, director of MIT's Speech Research Group, and he will tell you. There are two forms of electronic vocal emission: stored voice and speech synthesis. The latter is generated by sophisticated software designed to seamlessly string together the basic units of human diction, or phonemes. These applications convert text into speech by "reading" sentences, breaking them into series of phonemes, and adding such amenities as lexical stress and intonation to make it sound natural. But the overwhelming majority of e-voices you hear are not these highly refined synthesized voices - they're the "stored" variety: digital audio recordings of regular old humans, neatly concealed in the guise of high-tech glamour.

Take, for instance, the familiar tones that emanate from your computer. First, the words flash by on the screen: "Initializing modem ... Dialing ... Connecting...." It's always the same. You sit on the edge of your chair, anticipating. And then....

Welcome! You've got mail. Or, File's done. Or, Goodbye.

About a million times a day, the affable, fatherly voice of 45-year-old Elwood Edwards Jr. addresses the members of America Online using these seven words. Downloading files, receiving mail, and logging on and off are all marked by his oral affirmations. And affirmations they are. Unlike some stored voices, this voice is never the bearer of bad news - Edwards never says, "Tough luck, no mail." If he doesn't have anything good to say, he says nothing at all.

AOL's man of few words conjures up the image of a little guy - about as big as the 56-Kbyte document he inhabits on your hard drive. It's hard to believe that he stands 6 feet 6 inches at his home away from home in Rittman, Ohio. Edwards has spent the last 31 years in broadcasting and currently works as operations manager at WAKC-TV in Akron, Ohio.

How does a person become the most prominent audio feature of the biggest online service in the world? By accident, apparently. In the late '80s, when AOL was known as Quantum Computer Services, Edwards's voice was volunteered for the job by his wife, Karen, then employed in Quantum's customer relations department. The couple had already been marked for cyber destiny: they met on Quantum's Q-Link online service in 1987 and married a year later. An AOL subscriber to this day, Edwards has to face the postmodern paradox of being greeted by his own voice each time he logs on.

From the ethereal and timeless waves of cyberspace to the gritty confines of a city parking garage, stored-voice technology is the going fare. More and more telephone systems and customer-service operations (even parking lots) are replacing human employees with machines. But behind each one, there's still a real individual. Automated parking facilities, for example, incorporate talking ticket dispensers and pay stations that reduce human dialog to a mechanical monolog.

Amano Cincinnati Inc. is an international maker of time clocks and parking lot equipment. Inside its talking ticket dispensers is a recording stored on a 64-Kbyte ROM chip, typical of "playback-only" voice technology.

Amano's glamorous script of parking lot lingo is read by 50-year-old Joni Bakum of Montclair, New Jersey. Welcome to the Sutter Stockton Garage. Keep your ticket with you. Bakum's greeting rings through the tiny speaker of a canary-yellow box at the entrance to a garage in downtown San Francisco. You may have encountered her, however, at any one of a hundred parking facilities hosted by Amano Cincinnati nationwide - from Post Office Square in Boston to John Wayne Airport in Irvine, California. And Bakum does more than greet you - she bids everyone a pleasant farewell at the exit with a tireless Please come again. With this, another grim urban event comes to a close so slyly that you've almost forgotten the one crucial prompt: Please pay the amount shown on the display.

Bakum asserts that directing the traffic of the masses is not an easy task. "I hate going into those parking lots where this dull, dead voice talks to you," she says. "So, my goal was to treat people like we all want to be treated." Bakum's is an optimistic and not surprising sentiment, since her other job is that of DJ-announcer for 970 AM New York Christian Music Radio.

Skeptics may already have pondered the necessity of online audio mascots (after all, the words are on the screen) or automated ticket agents (what's wrong with an LED display?). For the most part, these voices are expendable, low-tech bells and whistles dressed up as something new. But the one use of electronic voice that's not only justified but fundamental to its application is the telephone. After all, the phone is the voice medium.

To begin recording, press five. To end recording, press the pound sign.

If your office uses Northern Telecom's Meridian Mail Voice Messaging service, this woman's voice will be familiar, if not too familiar. She may be the first person to speak to you in the morning or the last one at night. Never losing enthusiasm for your messaging needs, her ever-patient voice continues, day after day, asking the same questions, demanding the same responses. If there's one person you can rely on, it's Joan Kenley, a.k.a. "Meridian Mary" or "Phantom of the Operator."

Kenley is America's true machine-voice celebrity. It's a disembodied fame, but fame nonetheless; the mystery of a faceless voice can generate powerful media interest. Because of her intriguing invisibility, listeners insist upon unveiling her personality; Kenley has been profiled in The New York Times and the Miami Herald, and has appeared on Dateline, 20/20, Good Morning America, and, appropriately, America's Talking.

In addition to Nortel, Kenley's impressive repertoire of voice work includes Pacific Bell's Message Center, directory assistance for Nynex (New York and New England Telephone), and the legendary "Positalker" - National Semiconductor's talking cash register. Not only is she at the end of every other telephone line and the voice inside a handful of bizarre digitized gizmos - talking clocks and elevators, airplane warning systems - Kenley also has a PhD in psychology and is the author of a book called Voice Power (1989). It doesn't end there. Kenley's image is frozen in black-and-white television history: a legacy from her acting days, which included a role opposite Jackie Gleason and Art Carney in The Honeymooners. Today, from her hillside home in Oakland, California, the real-life Kenley - relaxed, articulate, and beaming with enthusiasm - reports, "It's nice to have three or four careers going at once."

There's a quality in Kenley's voice that makes it palatable to thousands of ears. Her recruiters believe a big part of Kenley's success is her accent. In 1987, when Nortel began its search for the perfect voice, marketing directives pointed to California. According to Diane Boutilier, voice specialist for Nortel's Voice Prompt Production Department in Toronto, Kenley possesses "the most neutral, standard North American acceptable accent." Beyond accent, there's also the issue of sex. When Nortel chose a female voice, it embraced the corporate stereotype: the secretary. "An executive secretary is someone you can identify with right away," Boutilier explains.

The secretary metaphor is appropriate. Stored on Meridian Mail hard drives in thousands of offices across America, Kenley's voice cultivates the definitive corporate persona as she directs callers through a labyrinth of voicemail prompts. She flawlessly articulates the vital information. In fact, her tone is so expertly modulated, so efficient, it begins to sound strangely mechanical: a human mimicking a machine imitating a human. Is this "voice power," or is she simply too professional?

Neither the paradox nor the irony is lost on Kenley: "It's interesting to be one of the most popular people in an electronic, nonhuman medium. It makes me chuckle." But ultimately, Kenley's voice seems to provide exactly what the listeners want to hear - a careful balance between the mechanically predictable and the unthreateningly down-home.

True, Kenley's form of electronic voice technology is a "nonhuman medium." But it is also true that the technology - using pure speech synthesis - has the potential to leave the human voice box behind altogether. Common text-to-speech applications, many created entirely without the aid of the human voice, are already capable of reading any verbiage they're given - reliably, tirelessly, and selflessly. Yet, commercially speaking, these synthetic voices maintain a low public profile, and Dan Rather is still reporting the evening news.

Well, I don't think there is any question about it. It can only be attributable to human error. This sort of thing has cropped up before, and it has always been due to human error.

Twenty-seven years ago, the HAL 9000 series computer, star of Stanley Kubrick's 2001: A Space Odyssey, defined the ideal in computer-synthesized voice: calm, rational, and perfectly humanesque. With this kind of fictional perfection as our model, we may never be completely satisfied with synthesized speech. The truth is that human speech, with all its irregularities, is exceedingly difficult to replicate. So far, the timbre and friendliness of Edwards's You've got mail and Kenley's warm, confident delivery remain free of high-tech challengers.

Synthesized speech today comes in two forms: concatenative and parametric. The former "glues" together, or concatenates, minuscule recordings of human phonemes to create a natural-sounding machine language. (A simpler form of this technology, still considered "stored voice," uses whole words instead of single phonemes and is familiar to anyone who has ever called directory assistance.) Parametric synthesis, on the other hand, is purely artificial. It's a matter of debate which of these synthesized voices is more prominent, more intelligible, and of higher "realistic" quality.
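For the curious, the word-level "gluing" that directory assistance relies on can be sketched in a few lines of Python. This is a toy illustration, not any vendor's actual system: the sample rate, the unit inventory, and the function names are all assumptions made for the example, and plain tones stand in for the human recordings a real product would store.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed telephone-grade rate


def placeholder_recording(freq_hz, seconds=0.25):
    """Stand-in for one stored recording: a plain tone instead of a human voice."""
    n = int(SAMPLE_RATE * seconds)
    return [
        int(12000 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]


# Hypothetical unit inventory: one "recording" per digit word.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]
UNITS = {word: placeholder_recording(300 + 40 * k)
         for k, word in enumerate(DIGIT_WORDS)}


def concatenate(words, gap_seconds=0.05):
    """Glue stored units end to end, with a short silence between them."""
    gap = [0] * int(SAMPLE_RATE * gap_seconds)
    samples = []
    for i, word in enumerate(words):
        if i > 0:
            samples.extend(gap)
        samples.extend(UNITS[word])  # raises KeyError if the unit was never recorded
    return samples


def speak_number(digits):
    """Turn a digit string such as '555-1212' into one stream of audio samples."""
    words = [DIGIT_WORDS[int(d)] for d in digits if d.isdigit()]
    return concatenate(words)


audio = speak_number("555-1212")
```

The limitation is visible in the lookup: the system can say only what is already in its inventory. Shrinking the units from whole words to phonemes enlarges what can be said, at the cost of many more joins that have to sound natural.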

Today, a variety of text-to-speech applications are available for personal computers, such as Digital Equipment Corporation's DECtalk, and in software packages like Fortress Systems' SounText and Touch Talk Systems' e-Voice. Apple's recent text-to-speech system extension for the Macintosh - which makes use of both types of synthesized speech technology - illustrates the concatenative-versus-parametric conflict within the industry. The concatenative MacinTalk Pro offers three voices: one adult male and two adult females. MacinTalk 3, on the other hand, uses parametric synthesis and produces a wide range of voices - including voices that laugh, sing, whisper, bubble, and even sound like sheep.

There are advantages and disadvantages to both types. Kim Silverman, Apple's lead researcher for speech synthesis, reports that "Independent studies by outside research labs are consistently showing that concatenative synthesis is more intelligible, more natural, and requires less effort to understand." No surprises there, considering that the voices are a collage of human speech. But the system has its drawbacks. Many hours of recording individual phonemes are required to create the voices in the first place, and once created, these voices demand from 1 to 3 Mbytes of RAM to run.

The parametric MacinTalk 3, on the other hand, "is much smaller, because it models the speech by a small number of rules instead of storing a large amount of data," explains Silverman. The big advantage is flexibility, the seemingly endless potential for a variety of voices. The developers can take virtually any sound and convert it into a phoneme set.

Speech synthesis - regardless of type - continues to occupy an obscure position in our aural realm. Most commercial enterprises appear perfectly happy with their Elwoods and Joans. And while personal computer users at home might be fascinated by text-to-speech technology, it's unlikely they'll find a real use for it.

So is speech synthesis more sophisticated than it needs to be? Even MIT's Chris Schmandt, author of Voice Communication with Computers (1994), notes that stored voice is sufficient when only "a small repertoire of utterances needs to be spoken." Limited then to specific applications - those required to "speak arbitrary text instead of a few prerecorded responses" - is speech synthesis really necessary?

Yes. There is a niche market. This might include assistive applications for the disabled (text readers for the blind, voice simulators for the mute), e-mail access via telephone, automated directories with unwieldy databases, educational multimedia for children, and proofreading aids for authors. But these uses serve a limited audience. One especially imaginative use of speech synthesis - unfortunately not yet commercially available - has been realized as part of an MIT Media Lab project called Back Seat Driver: a vehicle navigation system that operates in real time and allows a driver to keep his or her eyes on the road while receiving detailed directions from a voice synthesizer.

But vocal quality is another explanation for the relative obscurity of synthesized speech. Speech synthesis sounds great on paper, but not so great coming out of a computer's speaker. No synthesized voice on the market today would fool you into believing it was human.

Take the male voice from the MacinTalk Pro, for example. Despite its remarkable intelligence, there is something unnerving about the talking computer. As earnest as the Elephant Man in its effort to be accepted as human, the voice seems to articulate its metallic syllables with a bizarre tinge of robotic sorrow. Criticism of synthesized voices is common; one reviewer compared the SounText voice to a "drunk." And listening to some of these voices makes you wonder what they have stitched together at the factory - some Tin Man-Frankenstein monster, all brains and no heart?

This is precisely the dilemma facing the developers of speech synthesis as they try to make their creations sound natural and believable. "The area most in need of improvement is the prosody," notes Apple's Silverman, referring to the inflection and intonation of speech. So far, the prosody achieved by developers is a far cry from HAL's. As Tony Vitale, senior consulting engineer at Digital, puts it, "HAL reads lips, as you may recall. Computerized cameras will probably be programmed to read lips long before a synthesizer can talk like HAL."

No longer the sovereign property of humans, speech has become an ability we share with machines. Our increasing encounters with digitized strangers reveal a new and peculiar relationship between our sentient, human ear and the programmed repetitions of automatons.

However jaded we may be about new forms of technology, we still hang on to the hum and the warmth of living, breathing voices, even if they are stored. Apparently, we're not yet accustomed to the pure, flat sound of machines. Perhaps it's a nervous resistance to their autonomous brilliance, an anxiety stemming from the knowledge that machines have the power to speak almost as well as we do - if not more eloquently, then at least with a bigger vocabulary.

Yet, this is probably just another phenomenon we'll soon have to get used to. Perhaps at this point, then, we ought to let ourselves be amused by the quirky, robotoid nature of these vaguely alien e-voices: appreciate them for their futuristic aesthetic, forget that they lack spontaneity and personality, and overcome the estrangement we feel at the hands of their ruthless lack of discrimination.

And if all else fails, we might take a little advice from HAL: I honestly think you ought to sit down calmly. Take a stress pill and think things over.