Geek Page
Machine Translation
Geniuses and, irritatingly, child prodigies always seem able to speak several languages. Language competency has a very special mystique. It turns out, though, that languages have structures and systems. They work in quite mechanical ways. This makes machine translation (MT) a possibility.
Put a Welsh sentence and its English translation side by side – "Mae'n drwg gen i," "I am sorry." On the surface, they look quite different. Trying to write rules to do automatic machine translation at this level would be very difficult. If we step away from the surface, however, it turns out that there are many similarities between languages and, as a result, simpler mappings. Abstract linguistic representations are the basis of most MT systems and are used in three phases: analysis, transfer, and synthesis.
First, a program called the parser analyzes the source sentence. The process begins by assigning grammatical categories to each word. Words are then grouped together into bigger components, some structural, others semantic. Think of the output of this phase as a tree. The leaves are the words, and the further you move from the leaves, the nearer you get to the root of the sentence's meaning.
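The parser's tree output is easy to picture with a few lines of Python. This is a toy sketch, not any particular MT system's data structure: each node is a tuple of a grammatical category followed by its children, and the leaves are the words.

```python
# A toy parse of "Tom owns this car", as nested tuples:
# ("S", subject-NP, VP); leaves are plain strings (the words).
parse = (
    "S",
    ("NP", ("ProperNoun", "Tom")),
    ("VP",
        ("Verb", "owns"),
        ("NP", ("Det", "this"), ("Noun", "car"))),
)

def leaves(tree):
    """Walk back down from the root and collect the words (leaves)."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the category label
        words.extend(leaves(child))
    return words

print(leaves(parse))  # ['Tom', 'owns', 'this', 'car']
```

Reading the tuple from the outside in mirrors the article's image: the root ("S") is nearest the sentence's overall meaning, the leaves are the surface words.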
The second stage, the transfer, is where all the hard work of translation takes place. The vital resource is a set of rules called the comparative grammar. This shows the differences between abstractions of the source and target languages. So, to go from the English "Tom owns this car" to the French "Cette voiture appartient à Tom," a grammar rule would be used to swap the sentence subjects ("Tom" in English, "voiture" in French) and the objects ("car," "Tom"). The final stage, synthesis, is like analysis but in reverse. The synthesizer works top down from the target sentence representation to a full translation.
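A single comparative-grammar rule of the kind just described can be sketched like this. The triple representation and the rule are illustrative assumptions, not a real system's grammar; the point is only that the rule operates on abstractions, not surface strings.

```python
# Hypothetical transfer rule: English "X owns Y" -> French "Y appartient à X".
# A sentence abstraction here is just a (subject, verb, object) triple.

def transfer_own(abstract):
    subj, verb, obj = abstract
    if verb == "own":
        # The comparative grammar says: swap subject and object,
        # and map the verb "own" onto "appartenir à".
        return (obj, "appartenir à", subj)
    return abstract  # no rule applies; pass the abstraction through

english = ("Tom", "own", "this car")
print(transfer_own(english))  # ('this car', 'appartenir à', 'Tom')
```

Synthesis would then turn the French triple into the surface sentence "Cette voiture appartient à Tom."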
The transfer is the least elegant and weakest part of the chain. If you want to build a system that can translate between n languages, you have to consider n(n-1) interrelationships. This is complex and time-consuming. Worse still, even when the system is built, every time you change any information about any of the languages, you have to think about the impact on the other n-1 languages. Add just one word to your English dictionary and you have to write a rule that shows what that word translates to in each of the other n-1 languages.
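The scaling problem is just arithmetic over ordered pairs (English-to-French and French-to-English need separate rule sets), and a couple of lines make the growth vivid:

```python
def transfer_modules(n):
    """Comparative grammars needed for n languages:
    one per ordered pair of distinct languages, i.e. n * (n - 1)."""
    return n * (n - 1)

for n in (2, 5, 10):
    print(n, "languages ->", transfer_modules(n), "transfer modules")
# Ten languages already demand 90 separate comparative grammars.
```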
All these problems have driven some researchers to look for ways to reduce – or skip – the transfer stage. The quest is for so-called interlinguas: sentence representations that are totally language independent. A universal language. In interlingual systems, translation occurs in only two phases. Analysis produces an interlingual form for the source sentence and this is then expressed by the synthesizer.
To get a better feel for the way interlinguas work, let's look at two aspects that complicate the transfer phase: word translation and tense.
The most unwieldy component of a transfer system comprises hundreds of thousands of rules that map the words of one language onto another – "boy -> garçon," "book -> livre." One interlingual solution is to develop a set of primitive concepts – building blocks of meaning – and to use them to define words. Imagine we want to produce interlingual definitions for the words "boy," "girl," "man," "woman." We can do this using the concepts "human," "female," and "adult." "Girl" would be defined as [human+, female+, adult-], "man" as [human+, female-, adult+], et cetera.
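A minimal sketch of this scheme in Python, with True/False standing in for the + and - marks. The dictionaries are toy fragments invented for illustration, not a real MT lexicon:

```python
# Feature-bundle definitions following the [human, female, adult] scheme.
english = {
    "boy":   {"human": True, "female": False, "adult": False},
    "girl":  {"human": True, "female": True,  "adult": False},
    "man":   {"human": True, "female": False, "adult": True},
    "woman": {"human": True, "female": True,  "adult": True},
}
french = {
    "garçon": {"human": True, "female": False, "adult": False},
    "fille":  {"human": True, "female": True,  "adult": False},
    "homme":  {"human": True, "female": False, "adult": True},
    "femme":  {"human": True, "female": True,  "adult": True},
}

def translate(word, source, target):
    """Look up the word's feature bundle in the source dictionary,
    then find the target word carrying the identical bundle."""
    concept = source[word]
    for candidate, features in target.items():
        if features == concept:
            return candidate
    return None  # no target word expresses this concept

print(translate("boy", english, french))  # garçon
```

Note there is no English-to-French table anywhere: both dictionaries point into the same concept space, which is exactly the interlingual trick.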
To translate the word "boy" from English into, say, French, first we would use our English dictionary and look up the interlingual form: boy [human+, female-, adult-]. Then, in the synthesis phase, we would find the word with the same pattern of features in the French dictionary: garçon [human+, female-, adult-]. No explicit language-to-language rules are involved. So, if we add a new language, say, Welsh, we don't have to work out that "bach" maps to "boy" (in English) and "garçon" (in French). We just define its meaning.

Tenses are the ways a language expresses the past, present, and future. The problem for machine translation is that languages can have quite different sets of tenses. Compare French and English. The French present tense can be used to say things like "I eat," "I am eating," and "I have eaten" – things which English needs several tenses to express. A transfer system would require lots of complex rules to show which English tense is the equivalent of a French present-tense verb form.

But step away from the idea of tenses and think in terms of time. S is the time when the sentence is spoken, E is the time of the event spoken about, and R is a reference time. With these, we can describe the temporal meaning of a sentence. Think about the sentence "By midday, everyone had left." Here, the event (E) occurred before the reference time (R), which in turn comes before the time the sentence is spoken (S). Or, put in a form a computer could process, E < R < S. Suddenly, we are liberated from having to work out the language-to-language tense mappings. Adding a new language to the system now involves the much simpler process of defining its tenses in terms of these concepts. So, we can differentiate between the simple past ("I ran") and the present perfect ("I have run") like this: ran [R < S, E = R], have_run [R = S, E < R]. As we now have a universal way of talking about time, translation involves just two phases – analysis into the temporal concepts, then realization of these in the target language.

But can it be that easy? It really can, once you have the primitive concepts for words, time, or whatever. The biggest problem is that the way you and I see the world is greatly affected by the languages we speak. An Eskimo has lots of words for snow; I have one.
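The S, E, R scheme is just as easy to sketch as the word features: a tense meaning is a pair of ordering constraints, and synthesis becomes a table lookup. The table below is a toy fragment covering only the English tenses mentioned above, with the constraints written as strings for simplicity:

```python
# Reichenbach-style time points: S (speech time), E (event time),
# R (reference time). A tense meaning is a pair of ordering constraints.
ENGLISH_TENSES = {
    ("E<R", "R<S"): "past perfect",     # "had left":  E < R < S
    ("E=R", "R<S"): "simple past",      # "ran":       E = R, R < S
    ("E<R", "R=S"): "present perfect",  # "have run":  E < R, R = S
}

def realise(e_vs_r, r_vs_s):
    """Turn an interlingual time description into an English tense name."""
    return ENGLISH_TENSES.get((e_vs_r, r_vs_s), "unknown tense")

print(realise("E<R", "R<S"))  # past perfect
```

A French table keyed on the same constraint pairs would map several of these entries onto its present tense; neither table mentions the other language.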
Choosing a set of shared concepts, then, is not trivial. For now, interlinguas will deal only with parts of the language that are conceptually simple and well defined, such as technical domains. MT systems have come a long way from their code-cracker origins five decades ago. The research is leading to viable automatic translation. Better still, our understanding of linguistics is showing that we are less divided than we thought. Let's talk.
Matt Jones (m.jones@mdx.ac.uk) is a researcher at the Computing Science Interaction Design Centre, Middlesex University, London (www.cs.mdx.ac.uk/)