Language ID: Now, More Than Just Greek

Is your name English? A new language-recognition technology will help clue netizens in to the linguistic differences they dig up on the Web.

Through the serendipity of a Web search - or by simply belonging to a listserv group with membership from around the globe - netizens are running across documents written in languages they cannot recognize. Bereft of the resources to identify languages, readers have had to discard these documents, leaving their true contents encrypted in a foreign code - until now.

A new technology from Novell Corp. automatically identifies 15 different languages, including Dutch, Norwegian, and Portuguese as well as languages using non-Roman characters such as Russian and Greek. The Collexion Language Identifier can help readers flag languages so they may select the proper translator or dictionary, for example. Researchers at the company's advanced technology division see the product as a small part of a bigger system - developed either by Novell or incorporated into other companies' word processors and similar applications - that will eventually identify a language and translate it into the native tongue of the reader.

"You have documents with passages written in several languages, and it would be nice to have the spelling and grammar checkers be able to automatically switch to the appropriate language each time [the Language Identifier] comes across a new language," said Rudy Montigny, vice president of Novell's advanced technology division.

Montigny said the Language Identifier works almost instantaneously - mostly because it doesn't rely on a dictionary. Instead, the technology is based on a pattern-recognition algorithm that is similar in nature to the recognition scheme employed in virus-detection technology being developed for the Web by IBM's Thomas J. Watson Research Laboratory.

"There seem to be philosophical similarities between the two technologies - they could be cousins," said Dave Chess, research staff member at the Watson Research facility.

In the case of the Language Identifier, developers accumulated a collection of at least 200,000 words in each language and fed those into the program. The idea was not to give the tool an exhaustive knowledge of all words in a language but to give it a "very good idea" of what the language looks like, Montigny said. The result is a language-identifying engine that doesn't bog down a PC's memory and therefore works quickly.
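The article does not describe Novell's algorithm in detail, so what follows is only a rough sketch of how a dictionary-free identifier might work, assuming a simple character-trigram approach rather than the company's actual method. The sample sentences, the function names (trigrams, train, identify) and the hit-fraction score are all hypothetical, and the training text is a single sentence per language instead of 200,000 words.

```python
# Illustrative sketch only: a character-trigram language guesser.
# This is NOT Novell's algorithm; names and sample data are hypothetical.

def trigrams(text):
    """Return the three-letter sequences in each word, padded with spaces."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def train(samples):
    """Build one set of known trigrams per language from sample text."""
    return {lang: set(trigrams(text)) for lang, text in samples.items()}

def identify(text, profiles):
    """Pick the language whose profile contains the most of the text's trigrams."""
    grams = trigrams(text)
    if not grams:
        return None  # nothing to judge

    def hit_fraction(lang):
        known = profiles[lang]
        return sum(g in known for g in grams) / len(grams)

    return max(profiles, key=hit_fraction)

# Tiny stand-in training texts (a real system would use far more words).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the warm house",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat slaapt in het warme huis",
    "portuguese": "a raposa marrom salta sobre o cachorro e o gato dorme na casa quente",
}

profiles = train(SAMPLES)
print(identify("the lazy dog", profiles))
print(identify("de kat slaapt", profiles))
print(identify("o gato dorme", profiles))
```

Because the profiles store letter patterns rather than a word list, they stay small and lookups are fast, which matches the spirit of Montigny's description: the tool only needs a good idea of what each language looks like.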

Like IBM, Novell sees a broader use for its technology, one that extends beyond word processors to the Web - particularly where search engines are concerned. The most popular search engines assume English as the primary language, yet queries often return documents in other languages, either because those documents contain the one English term in the query or because they contain cognates of the English word.

To work with search engines, Montigny said, the developers had to pare down the number of words the Language Identifier needs to recognize a language. This meant a small sacrifice in accuracy. "To be 100 percent accurate, you need 15 to 20 words," he said.

But this would limit identification to email and large documents. To work with Web queries, researchers adjusted the Language Identifier to recognize a language in as few as three words. The result is a system that is 95 percent accurate, Montigny said.

Although he isn't privy to Novell's developments, Chess said what he and his colleagues did with IBM Anti-Virus and are doing with the virus-detection technology they plan to put on the Web is analogous to the effort used to train the Language Identifier. "We know we can't be perfect in identifying all viruses. The technology has to find only the ones that users have a possibility of getting."