This is a post, written mostly by Terence Lewis about his experience with open-source NMT, that was just published in the 275th edition of The Tool Box Journal. Since I have been chatting with Terence off and on over the years about various MT-related issues, I thought his experience might be interesting to some in the primary reader base of this blog.
At the moment, there is a lot of hype driving NMT in the public eye, and while there is no doubt that NMT is definite and real progress in the MT field, it is important to temper the hype with as many actual data points about the reality as possible. There are also a lot of pretty shallow and superficial "Isn't-NMT-cool?" or "My oh my, it looks like human translation!?@!&?$" type stories abounding, so when you see one of substance (like this one by Terence) it is always refreshing, interesting and also illuminating. (For me anyway).
I will admit that I am more than slightly skeptical about DIY NMT, as, from my vantage point, I see that NMT is also really difficult for MT companies to explore. Not because it is conceptually difficult (which it is), but because "really" exploring the possibilities of NMT deeply requires real investment, large amounts of data, and computing scale that only the largest players can afford. I am one of those who thinks that you need to do it a thousand times or more, in many different variations, before "understanding" happens. While quickly running some data through an OpenNMT platform can work for some, probably more often than Moses would, I still maintain that one needs knowledge and skill, and more ability than just being able to operate an open source platform, for this technology, or any MT capability, to really build long-term business leverage. And that only comes with understanding and experience that builds increasing expertise.

This is quite in contrast to the advice given here at the recent MemoQFest conference. My view is that it makes great sense for LSPs to invest time and money building expertise in corpus/data analysis, and in understanding how data and algorithms interact, but very little sense for them to spend time on understanding how SMT or NMT operates at the nuts-and-bolts mathematics and programming level. That is best left to experts who do it all the time, as the theory, math and programs will change and evolve continually, and need steady and ongoing attention to achieve excellence. There are literally hundreds of papers being published every month, many of which should trigger follow-up and additional research by a team that really wants to understand NMT.

Thus, I see Terence as an exception that proves the rule, rather than proof that anyone with a few computers can build NMT models. As you can see from the bio at the end of this post, he has long-term and relatively deep experience with MT. His story here is also an inside view of what goes on in the early stages of the NMT journey for any MT practitioner.
The emphasis in the post below is all mine.
-----------------------------
"A little more than a year ago, in the 260th edition of the Tool Box Journal, I published an article about Terence Lewis, a Dutch-into-English translator and autodidact who took it upon himself to see what machine translation could do for him beyond the generic possibilities out there. He taught himself the necessary programming from scratch, once for rules-based machine translation, again when statistical machine translation became en vogue, and, you guessed it, once again for neural machine translation. I have been and still am impressed with his achievement, so I asked him to give us a retelling of that last leg of his journey." - Jost Zetzsche
It all started with a phone call from Bill. "B***dy hell, Terence", he shouted, "have you been on Google Translate recently?" He was, of course, referring to Google's much publicized shift from phrase-based statistical machine translation to neural machine translation which got under way late last autumn. Bill, an inveterate mocker of lousy machine translation, had popped a piece of German into Google Translate and, to his amazement, found little to mock in the output. German, it seems, was the first language pair for which Google introduced neural machine translation. I put down the phone, clicked my way over to Google Translate and pasted in a piece of German. To say that what I saw changed my life would be a naïve and overdramatic reaction to what was essentially a somewhat more fluent arrangement of the correctly translated words in my test paragraph than I would have expected in a machine translation. But these were early days, and things could only get better, I told myself.
Around that time the PR and marketing people at Google, Microsoft and Systran had gone into top gear and put out ambitious claims for neural machine translation and the future of translation. Systran's website claimed NMT could "produce a translation overachieving the current state of the art and better than a non-native speaker". Even in a scientific paper the Google NMT team wrote that "additional experiments suggest the quality of the resulting translation system gets closer to that of average human translators", while a Microsoft blogger wrote that "neural networks better capture the context of full sentences before translating them, providing much higher quality and more human-sounding output".
Even allowing for the hype factor, I could not doubt the evidence of my own eyes. Being a translator who taught himself to code I was proud of my rule-based Dutch-English MT system, which subsequently became a hybrid system incorporating some of the approaches of phrase-based statistical machine translation. However, I sensed -- and I say "sensed" because I had no foundation of knowledge then -- that neural machine translation had the potential to become a significant breakthrough in MT. I decided to "go neural" and dropped everything else I was doing.
What is this neural machine translation all about? According to Wikipedia, "Neural machine translation (NMT) is an approach to machine translation that uses a large neural network". So, what's a neural network? In simple terms, a neural network is a system of hardware and software patterned after the operation of neurons in the human brain. Typically, a neural network is initially trained, or fed large amounts of data. Training consists of providing input and telling the network what the output should be. In machine translation, the input is the source data and the expected output is the parallel target data. The network tries to predict the output (i.e. translate the input) and keeps adjusting its parameters (weights) until it gets a result that matches the target data. Of course, it's all far more complex than that, but that's the idea.
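To make that idea concrete, here is a minimal, purely illustrative sketch in PyTorch (not OpenNMT's own code) of the "predict, compare, adjust the weights" loop described above: a toy model learns to map a handful of made-up source token indices onto target token indices, which is the same principle a real NMT system applies to full sentences with a far larger network.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    src = torch.tensor([0, 1, 2, 3])   # made-up source token indices
    tgt = torch.tensor([2, 0, 3, 1])   # the "correct" target indices for them

    # A deliberately tiny network: an embedding layer plus a linear layer.
    model = nn.Sequential(nn.Embedding(4, 8), nn.Linear(8, 4))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):
        logits = model(src)            # the network's current prediction
        loss = loss_fn(logits, tgt)    # how far the prediction is from the target data
        optimizer.zero_grad()
        loss.backward()                # work out how each weight should change
        optimizer.step()               # adjust the weights a little

    print(model(src).argmax(dim=1))    # should now print tensor([2, 0, 3, 1])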
Not knowing anything about NMT, I joined the OpenNMT group which is led by the Harvard NLP Group and Systran. According to the OpenNMT website, OpenNMT is an industrial-strength, open-source (MIT) neural machine translation system utilizing the Torch/PyTorch mathematical toolkit. The last two words are key here -- in essence, NMT is math. OpenNMT is written in both the Lua and Python programming languages, but the scripts that make up the toolkit, which are typically 50-100 lines long, are in essence connectors to Torch where all the mathematical magic really happens. Another NMT toolkit is Nematus, developed in Python by Rico Sennrich et al., and this is based on the Theano mathematical framework.
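As a rough illustration of the kind of network such scripts wire up in Torch/PyTorch, the sketch below defines a bare-bones encoder-decoder ("sequence-to-sequence") model in PyTorch. It is a toy built on my own simplifying assumptions, not OpenNMT's actual architecture: real systems add attention, multiple layers, dropout, beam search and much more.

    import torch
    import torch.nn as nn

    class TinySeq2Seq(nn.Module):
        def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, emb)
            self.tgt_emb = nn.Embedding(tgt_vocab, emb)
            self.encoder = nn.LSTM(emb, hidden, batch_first=True)
            self.decoder = nn.LSTM(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, tgt_vocab)

        def forward(self, src_ids, tgt_ids):
            # Encode the source sentence into a hidden state...
            _, state = self.encoder(self.src_emb(src_ids))
            # ...then decode, conditioned on that state and the target words so far.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
            return self.out(dec_out)   # scores over the target vocabulary

    model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1000)
    src = torch.randint(0, 1000, (1, 7))   # a batch of one source sentence, 7 token indices
    tgt = torch.randint(0, 1000, (1, 9))   # the corresponding target indices
    print(model(src, tgt).shape)           # torch.Size([1, 9, 1000])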
If you're thinking of delving into NMT and don't have any Linux skills, get them first. It's theoretically possible to run OpenNMT on Windows either directly or through a virtual machine, but most of the tutorials you'll need to get up and running just assume you're running Ubuntu 14.04, and nobody will want to give you a lesson in basic Linux. While in theory you can train on any machine, in practice, for all but trivially small data sets, you will need a GPU (Graphical Processing Unit) that supports CUDA if you want training to finish in a reasonable amount of time. For medium-size models you will need at least a 4GB GPU; for full-size state-of-the-art models, 8-12GB is recommended. My first neural MT training from the sample data (around 250,000 sentences) provided on the OpenNMT website took 8 hours on an Intel Xeon X3470, S1156, 2.93 GHz Quad Core with 32GB RAM. The helpful people on the OpenNMT forum recommended that I install a GPU if I wanted to process large volumes of data. I installed an Nvidia GTX 1070 with 8GB of onboard RAM. This enabled me to train a model from 3.2 million sentences in 25 hours.
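Assuming you are using the PyTorch flavour of the toolkit, a quick check like the one below will tell you whether a CUDA-capable GPU is actually visible before you commit to a multi-day training run.

    import torch

    if torch.cuda.is_available():
        print("Training can run on:", torch.cuda.get_device_name(0))
    else:
        print("No CUDA GPU found -- training would fall back to the much slower CPU.")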
However, I'm getting ahead of myself here. Setting up and running an NMT experiment/operation is -- on the surface -- a simple process involving three steps: preprocessing (data preparation in the form of cleaning and tokenization), training and translation (referred to as inference or prediction by academics!). Those who have tried their hand at Moses will be familiar with the need for parallel source and target data containing one sentence per line with tokens separated by a space. In the OpenNMT toolkit, the preprocessing step generates a dictionary of source vocabulary to index mappings, a dictionary of target vocabulary to index mappings and a serialized Torch file -- a data package containing vocabulary, training and validation data. Internally the system will use the indices, not the words themselves. The goal of any machine translation practitioner is to design a model that successfully converts a sequence of words in a source language into a sequence of words in a target language. There is no shortage of views on what that "success" actually is. Whatever it is, the success of the inference or prediction (read "translation") will depend on the knowledge and skill deployed in the training process. The training is where the clever stuff happens, and the task of training a neural machine translation engine is in some ways no different from the task of training a statistical machine translation system. We have to give the system the knowledge to infer the probability of a target sentence E, given the source sentence F (the letters "F" and "E" being conventionally used to refer to source and target respectively in the field of machine translation). The way in which we give the neural machine translation system that knowledge is what differs.
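The toy sketch below shows what that preprocessing step produces conceptually: a vocabulary-to-index mapping, with each sentence then stored as a sequence of indices rather than words. (The real preprocessing step also builds the separate target-side dictionary and packages the vocabulary, training and validation data into the serialized Torch file; none of that is shown here.)

    from collections import Counter

    corpus = ["het paard van mijn vader", "mijn vader koopt het paard"]

    counts = Counter(tok for sent in corpus for tok in sent.split())

    # Reserve the lowest indices for special tokens, as NMT toolkits typically do.
    vocab = {"<blank>": 0, "<unk>": 1, "<s>": 2, "</s>": 3}
    for token, _ in counts.most_common():
        vocab[token] = len(vocab)

    def to_indices(sentence):
        return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]

    print(vocab)
    print(to_indices("het paard van mijn vader"))
    print(to_indices("het onbekende paard"))   # an unseen word maps to <unk>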
Confession time -- having failed math at school, I've never found anything but the simplest of equations easy reading. When I worked my way through Philipp Koehn's excellent "Statistical Machine Translation" I skipped the most complex equations. Papers on neural machine translation are crammed with such equations. So, instead of spending weeks staring at what could just as well have been hieroglyphs, I took the plunge and set about training my first neural MT engine -- they say the best way to learn is by doing! This was accomplished by typing "th train.lua -data data/demo-train.t7 -save_model demo-model". This command applied the training script to the prepared source and target data (saved in the file "demo-train.t7") with the aim of generating my model (or engine). It looks simple, doesn't it? But under the hood, a lot of sophisticated mathematical operations got under way. These come down to learning by trial and error. As already mentioned, we give our neural network a batch of source sentences from the training data, and these are related word by word to words in the target data. The system keeps adjusting various parameters (weights) assigned to the words in the source sentence until it can correctly predict the corresponding target sentence. This is how it learns.
My first model was a Dutch-English engine, which was appropriate, as I had spent the previous 15 years building and refining a rule-based machine translation system for that language pair. I was delighted to see that the model had by itself learned basic rules of Dutch grammar and word re-ordering rules which had taken me very many hours of coding. It knew when to translate the Dutch word "snel" as "quick" or as "quickly" -- something that my rule-based system could still get wrong in a busy sentence of some length. "Het paard dat door mijn vader is gekocht" is rendered as "The horse bought by my father" and not "The horse which was bought by my father," reflecting an editorial change in the direction of greater fluency. Another rule the system had usefully learned was to generate the English genitive form so that "het paard van mijn vader" is translated as "my father's horse" and not "the horse of my father," although it did fail on "De hond van de vriend van mijn vader" which came out as "My father's dog" instead of "My father's friend's dog", so I assume some more refined training is needed there.
These initial experiments involved a corpus of some 5 million segments drawn from Europarl, a proprietary TM, the JRC-Acquis, movie subtitles, Wikipedia extracts, various TED talks and Ubuntu user texts. Training the Dutch-English engine took around 5 days. I used the same corpus to train an English-Dutch engine. Again, the neural network did not have any difficulty with the re-ordering of words to comply with the different word order rules in Dutch. The sentence "I want to develop the new system by taking the best parts of the old system and improving them" became "Ik wil het nieuwe systeem ontwikkelen door de beste delen van het oude systeem te nemen en deze te verbeteren". Those who read any Germanic language will notice that the verbal form "taking" has moved seven words to the right and is now preceded by the particle "te". This is a rule which the system has learned from the data.
So far, so good. But -- and there are a few buts -- neural machine translation does seem to have some problems which perhaps were not first and foremost in the minds of the academics who developed the first NMT models. The biggest is how to handle OOVs (Out of Vocabulary Words, words not seen during training) which can prove to be numerous if you try to use an engine trained on generalist material to translate even semi-specialist texts. In rule-based MT you can simply add the unknown words either to the general dictionary or to some kind of user dictionary but in NMT you can't add to the source and target vocabularies once the model has been built -- the source and target tokens are the building blocks of the mathematical model.
Various approaches have been tried to handle OOVs in statistical machine translation. In neural machine translation, the current best practice seems to be to split words into subword units or, as a last resort, to use a backoff dictionary which is not part of the model. For translations out of Dutch I have introduced my own Word Splitter module which I had applied in my old rule-based system. Applied to the input prior to submission to the NMT engine, this ensures that compound nouns not seen in the training data will usually be broken down into smaller units so that, for example, the unseen "fabriekstoezichthouder" will break down into fabriek|s|toezichthouder and be correctly translated as "factory supervisor". With translations out of English, I have found that compound numbers like "twenty-three" are not getting translated even though these are listed in the backoff dictionary. This isn't just an issue with the engines I have built. Try asking Systran's Pure Neural Machine Translation demonstrator to translate "Two hundred and forty-three thousand fine young men" into any of its range of languages and you'll see some strange results -- in fact, only the English-French engine gets it right! The reason is that individual numerical entities are not seen enough (or not seen at all) in the training data, and something that's so easy for a rule-based system becomes an embarrassing challenge. These issues are being discussed in the OpenNMT forum (and I guess in other NMT forums as well) as researchers become aware of the problems that arise once you try to apply successful research projects to real-world translation practice. I've joined others in making suggestions to solve this challenge and I'm sure the eventual solution will be more than a workaround. Combining or fusing statistical machine translation and neural machine translation has already been the subject of several research papers.
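Purely as an illustration of the two strategies just described (and emphatically not Terence's actual Word Splitter module), the sketch below greedily splits an unseen Dutch compound into known subwords, allowing a linking element such as "s", and falls back to an external dictionary, sitting outside the neural model, for anything it cannot split; the word lists and the backoff entry are made-up examples.

    known_words = {"fabriek", "toezichthouder", "paard", "vader"}
    linking_elements = {"s", "en", "e"}
    backoff_dictionary = {"twenty-three": "drieëntwintig"}   # hypothetical backoff entry

    def split_compound(word):
        """Try to split an unseen compound into known parts, allowing a linking element."""
        if word in known_words:
            return [word]
        for i in range(len(word) - 1, 0, -1):
            head, rest = word[:i], word[i:]
            if head in known_words:
                if rest in known_words:
                    return [head, rest]
                for link in linking_elements:
                    if rest.startswith(link) and rest[len(link):] in known_words:
                        return [head, link, rest[len(link):]]
        return None

    def handle_oov(word):
        parts = split_compound(word)
        if parts:
            return "|".join(parts)                 # pass the subword units to the engine
        return backoff_dictionary.get(word, word)  # last resort: the backoff dictionary

    print(handle_oov("fabriekstoezichthouder"))    # fabriek|s|toezichthouder
    print(handle_oov("twenty-three"))              # drieëntwintig, straight from the backoff dictionary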
Has it all been worth it? Well, customers who use the services provided by my translation servers have (without knowing it) been receiving the output of neural machine translation for the past month and to date nobody has complained about a decline in quality! I have learned something about the strengths and weaknesses of NMT, and some of the latter definitely present a challenge from the viewpoint of implementation in the translation industry -- a translation engine that can't handle numbers properly would be utterly useless in some fields. I have built trial MT engines for Malay-English, Turkish-English, and Lithuanian-English from a variety of bilingual resources. The Malay-English engine was built entirely from the OPUS collection of movie subtitles -- some of its translations have been amazingly good and others hilarious. I have conducted systematic tests and demonstrated to myself that the neural network can learn and its inferences involve more than merely retrieving strings contained in the training data. I'll stick with NMT.
Are my NMT engines accessible to the wider world? Yes, a client allowing the translation of TMX files and plain text can be downloaded from our website, and my colleague Jon Olds has just informed me that plug-ins to connect memoQ and Trados Studio (2015 & 2017) to our Dutch-English/English-Dutch NMT servers will be ready by the end of this month. Engines for other language pairs can be built to order with a cloud or on-premises solution.
====
Terence Lewis, MITI, entered the world of translation as a young brother in an Italian religious order, when he was entrusted with the task of translating some of the founder's speeches into English. His religious studies also called for a knowledge of Latin, Greek and Hebrew. After some years in South Africa and Brazil, he severed his ties with the Catholic Church and returned to the UK where he worked as a translator, lexicographer (Harrap's English-Brazilian Portuguese Business Dictionary) and playwright. As an external translator for Unesco he translated texts ranging from Mongolian cultural legislation to a book by a minor French existentialist. At the age of 50 he taught himself to program and wrote a rule-based Dutch-English machine translation application which has been used to translate documentation for some of the largest engineering projects in Dutch history. For the past 15 years he has devoted himself to the study and development of translation technology. He recently set up MyDutchPal Ltd to handle the commercial aspects of his software development. He is one of the authors of 101 Things a Translator Needs to Know (www.101things4translators.com, ISBN 978-91-637-5411-1).