Monday, August 28, 2017

The Evolution in Corpus Analysis Tools

This is a guest post by Ondřej Matuška, the Sales & Marketing Manager of Lexical Computing, a company that develops a corpus and language data analysis product called Sketch Engine.

I was first made aware of Sketch Engine by Jost Zetzsche's newsletter (276th Edition of the Tool Box) a few weeks ago. As relatively clean text corpora proliferate and grow in data volume, it becomes necessary to use new kinds of tools to understand this huge volume of text data, which may or may not be under consideration for translation. These new tools help us to accurately profile the most prominent linguistic patterns in large collections of textual language data and extract useful knowledge from these new corpora to help in many translation-related tasks. For those of us in the MT world, there have always been student-made tools (mostly by graduate students in NLP and computational linguistics programs) that were needed to understand the corpus for better MT development strategies, and to get text data ready for machine learning training processes. Most of these tools would be characterized as not being "user-friendly", or to put it more bluntly, as being too geeky. As we head into the world of deep learning, the need for well-understood data to train or leverage any translation task can only grow in importance.

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is. 

I am often asked what kinds of tools translators should learn to use in the future, and I generally feel that they should stay away from Moses and other MT development toolkits like TensorFlow, Nematus, and OpenNMT, and focus on the data analysis and preparation aspects, since this ability would add value to any data-driven machine learning approach used. Something worth remembering is that, despite the hype, deep learning algorithms are commodities. It's the data that's the real value. These MT deep learning development tools (algorithms) are likely to evolve rapidly in the near term, and we can expect that only the most capable and well-funded groups will be able to keep up with the latest developments. (How many LSPs do you think have tried all four open source NMT platforms? Or know what CNN is? My bet is that only SDL has.) Even academics complain about the rate of change and new developments in Neural MT algorithmic research, and thus LSPs and translators are likely to be at a clear disadvantage in pursuing Neural MT model development. Preparing data for machine learning processes will become an increasingly important and strategic skill for those involved with business translation work. This would mean that the following skills would be valuable IMO. They are all somewhat closely linked in my mind:
  • Corpus Analysis & Profiling Tools like Sketch Engine
  • Corpus Modification Tools, e.g. Advanced Text Editors, TextPipe and other editors that enable pattern-level editing on very large (tens of millions of sentences) text data sets
  • Rapid Error Detection & Correction Tools to go beyond traditional conceptions of PEMT
  • MT Output Quality Assessment Methodology & Tools
  • Training Data Manufacturing capabilities that evolve from a deeper understanding of the source and TM corpus enabled by tools like Sketch Engine.
These are all essential tools in undertaking 5 million and 100+ million word translation projects that are likely to become much more commonplace in the future. Clearly, many translators will want nothing to do with this kind of work, but as MT use expands, these kinds of tools and skills become much more valuable, and many would argue that understanding patterns in linguistic big data also has great value for any kind of translation task.

Jost Zetzsche has provided a nice overview of what Sketch Engine does below:
  • Word sketches: This is where the program got its name, and it's what Kilgarriff (co-founder) brought to the table. A word sketch is a summary of a word's grammatical and collocational behavior (collocational refers to the analysis of how often a word co-occurs with other words or phrases). Since the data in the corpora is lemmatized (i.e., words are analyzed so they can be brought back to their base or dictionary form), the results are a lot more meaningful than what most of our translation environment tools provide when they're unable to relate different forms of one word to each other. Another word sketch option that Sketch Engine offers is the comparison of word sketches of similar words.
  • Thesaurus: The ability to retrieve a detailed list or a graphical word cloud with similar words, including links to create reports on word sketch differences for those terms to understand the exact differences in actual usage.
  • Concordance: Searches for single words, terms, or even longer phrases. Since the data in the supported languages is tagged, it's also possible to search for specific classes of words or specific classes of words that surround the word in question.
  • Parallel corpus: Retrieval of bilingual sets of words or phrases within the contexts. Presently this is available only for on-screen data viewing, but it will soon be offered as downloadable data. This is especially helpful when uploading your own translation memories (see below).
  • Word lists: The possibility of creating lists of words and the number of occurrences, either as lemmas (the base form of each word) or in each word form.
  • Creating your own corpus: For translators, this likely is the most exciting feature. You can either upload your own translation memories or you can use the tool's own search engine mechanism (which relies on Microsoft Bing) to create a list of bilingual websites that contain the terms that are relevant to your field. You can download many websites containing certain terms to build a corpus. However, you cannot have them automatically align with a translated version of that website through Sketch Engine. You can perform any of the functions mentioned earlier but it is also possible to run a keyword search on the user-created corpus, identify the terms that are relevant, and download that into an Excel or TBX file. This feature presently is available for Czech, Dutch, English, French, German, Chinese, Italian, Japanese, Korean, Polish, Portuguese, Russian, and Spanish. The bilingual version of this is just around the corner.
Many years ago I thought that the evolution from TM to other "more intelligent" language data analysis and manipulation tools would happen much faster, but things change slowly in a highly fragmented industry like the translation industry. I think tools like Sketch Engine, together with much more compelling MT capabilities, finally signal that a transition is now beginning, and could potentially build momentum.

P.S. Interestingly, the day after I published this the ATA also published a post on Corpus Analysis that focuses on open source tools.

As almost always, the emphasis below is mine.


Deploying NLP and Text Corpora in Translation

Natural Language Processing (NLP) is a discipline which has lots to offer to translators and translation, yet translation rarely makes use of the possibilities. This might be partly due to the fact that NLP tools are difficult to use without a certain level of IT skills. This is what the Sketch Engine team realized 13 years ago, and so they built Sketch Engine, a tool which makes NLP technology accessible to anyone. Sketch Engine started as a corpus query and corpus management tool which, over time, developed a variety of features that address the needs of new users from outside the linguistic camp, such as translators.

Term Extraction

Term extraction is the first area where NLP can become extremely useful. The traditional approach tends to be n-gram based, an n-gram being a sequence of any n words. In a nutshell, a term extraction tool will find the most frequent n-grams in the text and these will be presented to the user as term candidates. The user will then proceed to the next step: manual cleaning. It is not uncommon to receive a list which contains more non-terms than terms, so manual cleaning has become a natural next step. Some term extraction tools introduced lists of stop words, and the user can even indicate whether a word is a hard stop word or whether the stop word is allowed only in certain positions within the term. While this led to improvement, the output still contains lots of noise, and manual cleaning remains a vital step in the process.
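To make the traditional approach concrete, here is a minimal Python sketch of n-gram counting with a stop-word filter. The stop-word list, the sample text and the function name ngram_candidates are all invented for illustration; real term extraction tools differ in their details.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real tools ship much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "with"}

def ngram_candidates(text, n=2, top=5):
    """Return the most frequent n-grams that neither start nor end with a stop word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        counts[" ".join(gram)] += 1
    return counts.most_common(top)

sample = ("The shutter speed controls exposure. A fast shutter speed freezes motion, "
          "and a slow shutter speed blurs motion in the image.")
print(ngram_candidates(sample, top=20))
```

The genuine term "shutter speed" comes out on top, but the list also contains non-terms such as "speed controls" and "speed freezes", which is exactly the noise that has to be cleaned by hand.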

At Sketch Engine, we decided to direct efforts towards term extraction with a view to achieving much cleaner results by exploiting our NLP tools and our multibillion-word general text corpora.

The main difference between Sketch Engine and traditional term extraction tools is that each text uploaded to Sketch Engine is tagged and lemmatized. The system thus knows whether a word is a verb, noun, adjective, etc., and also knows which words are declined or conjugated forms of the same base form, called the lemma. Sketch Engine can look separately for work as a noun and work as a verb, and can also treat different forms of nouns (cases, plural/singular) or verbs (tenses, participles) as the same word if required. This is what we set out to exploit in term extraction.

For each language with term extraction support (16 languages as of August 2017), we developed definitions telling Sketch Engine what a term in that language can look like. For example, Sketch Engine knows that a term in English will most likely take the form of (noun+)noun+noun or adjective+noun while in Spanish, most likely, noun+adjective(+adjective) or noun+de+noun. The full rules are more complex than listed here. This will immediately disqualify any phrases that contain a verb or do not contain a noun at all.
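The format check can be illustrated with a small Python sketch. It assumes the text has already been tagged and lemmatized, encodes each part of speech as a single letter, and uses only the simplified English rule quoted above; the tag letters, the sample sentence and the function name term_candidates are assumptions, and the real Sketch Engine rules are more complex.

```python
import re

# Simplified English rule taken from the text: (noun+)noun+noun or adjective+noun.
# One letter per tag: N = noun, J = adjective, V = verb, D = determiner, P = preposition.
TERM_SHAPE = re.compile(r"N?NN|JN")

def term_candidates(tagged, max_len=3):
    """Return lemma sequences whose tag sequence fully matches TERM_SHAPE."""
    found = []
    for start in range(len(tagged)):
        for end in range(start + 2, min(start + max_len, len(tagged)) + 1):
            window = tagged[start:end]
            shape = "".join(tag for _, tag in window)
            if TERM_SHAPE.fullmatch(shape):
                found.append(" ".join(lemma for lemma, _ in window))
    return found

# Hand-tagged toy input: (lemma, tag) pairs.
tagged = [("a", "D"), ("wide", "J"), ("angle", "N"), ("lens", "N"),
          ("reduce", "V"), ("depth", "N"), ("of", "P"), ("field", "N")]
print(term_candidates(tagged))
```

No candidate containing the verb "reduce" survives. The simplified rule also misses "wide angle lens" and "depth of field", which hints at why the full rule set needs to be richer than the two shapes listed here.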

In addition to the format of the phrase, Sketch Engine also makes use of its enormous general text corpora which it uses to check whether the phrase that passed the check of format is more frequent in the text in question compared to general language. During this check, each phrase is treated as one unit and occurrences of the same phrase are searched and counted in the general text and compared. Lemmatization plays an important role here so that plurals and singulars or different cases can be counted as the same phrase. The combination of the format check and frequency comparison leads to exceptionally clean results. Here are term candidates as extracted from texts about photography. No manual cleaning applied, list presented as it comes out of Sketch Engine.  

The quality of extraction can be checked immediately by using the new dedicated term extraction interface to Sketch Engine called OneClick Terms.
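The frequency comparison against general language described above can be sketched in a few lines of Python. The text does not say which formula Sketch Engine uses, so the smoothed ratio of per-million frequencies below is only one plausible choice, and the corpus sizes and counts are invented.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed ratio of normalised frequencies; above 1 means over-represented in the focus text."""
    return (per_million(focus_count, focus_size) + smoothing) / (
        per_million(ref_count, ref_size) + smoothing)

# Invented sizes and counts, for illustration only.
FOCUS_SIZE, REF_SIZE = 100_000, 1_000_000_000
counts = {  # phrase: (count in focus text, count in general reference corpus)
    "shutter speed": (40, 2_000),
    "last year": (5, 400_000),
}
scores = {p: keyness(f, FOCUS_SIZE, r, REF_SIZE) for p, (f, r) in counts.items()}
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{phrase}: {score:.2f}")
```

A phrase that is common everywhere, such as "last year", scores below 1 and drops out, while a phrase that is far more frequent in the uploaded text than in general language rises to the top. The smoothing constant keeps phrases that are absent from the reference corpus from causing a division by zero.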

Overall Language Quality

While a great deal of the translation business relates to terminology, it is not the terms themselves that constitute the majority of text. There is a lot of language in between which may not always be completely straightforward to translate. Translators are used to working with concordances in their CAT tools where it is the translation memory (TM) that serves as the source of data. The TM is sufficient for terminology work but might not be as useful for the language in between. TMs are usually rather small and the concordance does not find enough occurrences to judge which usage is typical. This is where general text corpora come in handy. The word ‘general’ refers to the fact that these corpora were designed to contain the largest possible variety of text types and topics. A general text corpus will, therefore, contain even very specialized texts heavy in terminology as well as common neutral text from various sources. Sketch Engine contains multibillion-word corpora in many languages. The largest corpus is English with a size of 30 billion words, that is 30,000,000,000!

Languages with a corpus of 500+ million words

Language        Size (million words)
English         33,100
German          19,900
Russian         18,300
French          12,400
Spanish         11,000
Japanese        10,300
Polish           9,700
Arabic           8,300
Italian          5,900
Czech            5,100
Catalan          4,800
Portuguese       4,600
Turkish          4,100
Swedish          3,900
Hungarian        3,200
Romanian         3,100
Dutch            3,000
Ukrainian        2,700
Danish           2,400
Chinese simp     2,100
Chinese trad     2,100
Greek            2,000
Norwegian        2,000
Finnish          1,700
Croatian         1,400
Slovak           1,200
Hebrew           1,100
Slovenian        1,000
Lithuanian       1,000
Hindi              900
Bulgarian          800
Latvian            700
Estonian           600
Serbian            600
Korean             600
Persian            500
Maltese            500


A corpus of this size will return thousands of hits for most words or phrases and millions in the case of frequent ones. Such a concordance is impossible for a human to process. This is why we developed an advanced feature, called the word sketch, that will cope with this amount of information and will present the results in a compact and easy to understand format. The word sketch is a one-page summary of word combinations (collocations) that the word keeps. It will give the user an instant idea about how the word should be used in context. The collocations are presented in groups reflecting the syntactic relations. An example of a word sketch might look like this:

Two million occurrences of ‘contract’ were found in the corpus and processed into the summary of collocations above, which the user can understand in seconds. It gives a clear picture of which adjectives or verbs are the typical collocations the word keeps, allowing the user to use the word naturally, as a native speaker would. This information is computed automatically without any manual intervention, meaning that the user can generate it for any word in the language, including rare words. It is highly recommended to use large corpora to get information this rich. The recommended minimum size is around 1 billion words. A smaller corpus will also produce a word sketch but not with as much information, and a corpus below 50 million words is not likely to produce anything useful, especially for less frequent words. The largest preloaded corpora in Sketch Engine are recommended for use with the word sketch.

Word choice - Thesaurus

I am sure everyone has been in a situation when they want to say something but the right word would not spring to mind. One can usually think of a similar word, just not the right one. This is when a thesaurus is useful. Traditional printed and hand-made thesaurus content is limited by space or money, and often both. The combination of NLP and distributional semantics led to algorithms that can generate thesaurus entries automatically. The idea of a computer identifying similar words by computation often leads to skepticism, but the results are surprisingly usable. How does an algorithm discover words similar in meaning? Distributional semantics claims that words which appear in similar contexts are also similar in meaning. Therefore, to find a synonym for a noun, Sketch Engine will compare the word sketches for all nouns found in the corpus. The ones with the most similar word sketch will be identified as synonyms or similar words. Here is an example of what Sketch Engine will offer if you need a word similar to authorization:

The synonyms are sorted by the similarity score calculated from the similarity of word sketches of each word. The top of the list (the first column) is the most valuable. The list contains certain words which are not very good synonyms and they are listed because the collocations they form are similar to the collocations of authorization. This, however, still keeps the list very useful because the thesaurus functionality will be used by somebody with a decent knowledge of the language and these words serve as suggestions from which the user will pick the most suitable one.
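The idea of ranking words by the similarity of their word sketches can be sketched in Python as follows. The text does not describe the actual similarity score, so plain Jaccard overlap is used here as a stand-in, and the tiny word sketches (sets of relation and collocate pairs) are invented.

```python
def jaccard(a, b):
    """Share of (relation, collocate) pairs that two word sketches have in common."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Toy "word sketches": sets of (grammatical relation, collocate) pairs. Invented data.
sketches = {
    "authorization": {("object_of", "grant"), ("object_of", "require"),
                      ("object_of", "obtain"), ("modifier", "prior")},
    "permission":    {("object_of", "grant"), ("object_of", "require"),
                      ("object_of", "obtain"), ("modifier", "written")},
    "approval":      {("object_of", "grant"), ("object_of", "seek"),
                      ("object_of", "win"), ("modifier", "prior")},
    "stapler":       {("object_of", "use"), ("modifier", "electric")},
}

def similar_words(word, sketches):
    """Rank all other words by how much their sketch overlaps with the sketch of word."""
    target = sketches[word]
    scored = [(other, jaccard(target, sk)) for other, sk in sketches.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(similar_words("authorization", sketches))
```

Words that share many collocations rank high whether or not they are true synonyms, which is why the list is best treated as a set of suggestions for a competent speaker to choose from.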

For words which cannot have synonyms, the thesaurus will produce a list of words belonging to the same category or the same topic. This is the thesaurus for stapler:


This type of thesaurus entry might help recall a word from the same category.

Examples in Context - Concordance

Sketch Engine also features a concordance with simple as well as complex search options, where users can search both their own texts and the preloaded corpora. The options allow for searching by exactly the text typed, but also by lemma (the base form of the word, which will also find all derived forms), or restricting the search by part of speech or grammatical categories such as the tense of the verb. It even allows for searching for lexical or grammatical patterns without specifying concrete words. This interesting concordance shows examples of sequences of nouns joined by the preposition of. This is something I actually had to look up recently to check how many of’s I can use in a row. While the concordance itself did not answer the question directly, I could see that it is normal to use 3 of’s as long as the expression consists of numbers and units of measurement, which is how I originally used it in my sentence, and the concordance helped me check I was right.
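For readers unfamiliar with concordances, the following Python sketch shows the basic keyword-in-context idea on raw text. It uses a plain regular expression instead of part-of-speech tags, so "words chained by of" only approximates the "nouns joined by of" search described above; the pattern, the sample text and the function name kwic are assumptions made for this illustration.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left context, match, right context) for every regex match."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines

# A word followed by at least two "of + word" links, i.e. two or more of's in a row.
OF_CHAIN = r"\b\w+(?: of \w+){2,}"

text = ("We tested a sample of millions of lines of text. "
        "She is the author of the book.")
for left, hit, right in kwic(text, OF_CHAIN):
    print(f"{left:>30} [{hit}] {right}")
```

A tagged corpus makes the same search far more precise, because the query can require that every linked word is a noun rather than any word at all.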

Translation Lookup - Parallel Corpora

Sketch Engine also contains parallel multilingual corpora which can be used for translation lookup. Again, both simple and complex search criteria can be applied to both the first and the second language. This makes it possible for the user to learn about situations when a word is not translated by the most obvious equivalent. For example, this search looks for the word vehicle in English and matching Spanish segments not containing vehículo, to discover the cases when it might need to be translated differently.
This is especially valuable to users who do not have any TM or the TM is not large enough to provide the required coverage. Users with a TM can upload it to Sketch Engine to gain access to the advanced searching tools.
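The "source contains X, target does not contain Y" lookup can be sketched over any list of aligned segment pairs, for example pairs exported from a TM. This is a minimal Python sketch with invented segments; the function name untypical_translations is hypothetical, and simple string matching stands in for the lemma-based search a tagged corpus would allow.

```python
import re

def untypical_translations(pairs, source_word, usual_target):
    """Keep segment pairs where the source has source_word but the target lacks usual_target."""
    src_re = re.compile(rf"\b{re.escape(source_word)}s?\b", re.IGNORECASE)
    tgt_re = re.compile(re.escape(usual_target), re.IGNORECASE)
    return [(s, t) for s, t in pairs if src_re.search(s) and not tgt_re.search(t)]

# Invented segment pairs, for illustration only.
pairs = [
    ("Park the vehicle in the garage.", "Aparque el vehículo en el garaje."),
    ("Music is a vehicle for emotion.", "La música es un medio para expresar emociones."),
    ("The car was sold.", "El coche fue vendido."),
]
print(untypical_translations(pairs, "vehicle", "vehículo"))
```

Only the figurative use survives the filter, which is the kind of case where the obvious equivalent would be the wrong choice.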

Building Specialized Domain Corpora

Sketch Engine has a built-in tool for automated corpus building. The user does not need any technical knowledge to build a corpus. It is enough to upload their own data (texts, documents) and if the user does not have any suitable data, Sketch Engine will automatically find them on the internet, download them and convert them to a corpus. It only takes minutes to build a 100,000-word specialized corpus. 

The first option is obvious – the user uploads their texts and documents and Sketch Engine will lemmatize them and tag them and the corpus is ready.

If the user has no suitable texts or their length is not sufficient, the user can provide a few keywords that define the topic. For example, the keywords that define tooth care could be: tooth, gums, cavity, care. Sketch Engine will use these keywords to create web search queries and will interact with Bing. Bing will find pages which correspond to the web searches and will return the URLs back to Sketch Engine, where the content of the URLs will be downloaded, cleaned, tagged, lemmatized and converted to a corpus. The whole procedure only takes a few minutes. This is a great tool for anyone who needs a reliable sample of specialized language to explore how terms and phrases are used correctly and naturally.
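The text does not say how the keywords are turned into search queries. One plausible approach, shown here purely as an assumption, is to combine the seed words into small tuples so that each query returns pages covering several aspects of the topic. The function name seed_queries and the tuple size are hypothetical, and no search engine is actually called.

```python
from itertools import combinations

def seed_queries(keywords, words_per_query=3):
    """Combine seed keywords into search queries, one query per keyword combination."""
    return [" ".join(combo) for combo in combinations(keywords, words_per_query)]

queries = seed_queries(["tooth", "gums", "cavity", "care"])
for query in queries:
    print(query)
```

Each query would then be sent to the search engine, and the returned pages downloaded, cleaned, tagged and lemmatized as described above.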

Free Sketch Engine trial

A free 30-day Sketch Engine trial giving access to the complete functionality and preloaded corpora in many languages is available from the Sketch Engine website.


Ondřej Matuška -  Sales and Marketing Manager

Ondřej oversees sales and marketing activities and external communication. He is the main point of contact for anyone seeking information about Sketch Engine and is also keen to support existing users so that they can make the most of Sketch Engine.

Thursday, August 10, 2017

A Fun, Yet Serious Look at the Challenges we face in Building Neural Machine Translation Engines

This is a guest post by Gábor Ugray on NMT model building challenges and issues. Don't let the playful tone and general sense of frolic in the post fool you. If you look more closely, you will see that it very clearly defines an accurate list of challenges that one might come upon when one ventures into building a Neural MT engine. This list of problems is probably the exact list that the big boys (Microsoft, Facebook, Google, and others) faced some time ago. I have previously discussed how SYSTRAN and SDL are solving these problems. While this post describes an experimental system very much from a do-it-yourself perspective, production NMT engines might differ only by the way in which they handle these various challenges.

This post also points out a basic issue about NMT: while it is clear that NMT works, often surprisingly well, it is still very unclear what predictive patterns are learned, which makes it hard to control and steer. Most (if not all) of the SMT strategies like weighting, language models, terminology override, etc. don't really work here. Data and algorithmic strategies might drive improvement, but linguistic strategies seem harder to implement.

Silvio Picinini at eBay also recently compared output from an NMT experiment and has highlighted his findings here: 

While it took many years before an open source toolkit (Moses) appeared for SMT, we see that NMT already has four open source experimentation options: OpenNMT, Nematus, TensorFlow NMT, and Facebook's Caffe2. It is possible the research community at large may come up with innovative and efficient solutions to the problems we see described here. Does anybody still seriously believe that LSPs can truly play in this arena, building competitive NMT systems by themselves? I doubt it very much and would recommend that LSPs start thinking about which professional MT solution to align with, because NMT indeed can help build strategic leverage in the translation business if true expertise is involved. The problem with DIY (Do It Yourself) is that having multiple toolkits available is not of much use if you don't know what you are doing.

Discussions on NMT also seem to be often accompanied by people talking about the demise of human translators (by 2029 it seems). I remain deeply skeptical, even though I am sure MT will get pretty damned good on certain kinds of content, and believe that it is wiser to learn how to use MT properly, than dismiss it. I also think the notion of that magical technological convergence that they call Singularity is kind of a stretch. Peter Thiel (aka #buffoonbuddypete) is a big fan of this idea and has a better investment record than I do, so who knows. However, I offer some quotes from Steven Pinker that have the sonorous ring of truth to them:

"There is not the slightest reason to believe in a coming singularity. Sheer processing power [and big data] is not a pixie dust that magically solves all your problems." Steven Pinker 

Elsewhere, Pinker also says:

"… I’m skeptical, though, about science-fiction scenarios played out in the virtual reality of our imaginations. The imagined futures of the past have all been confounded by boring details: exponential costs, unforeseen technical complications, and insuperable moral and political roadblocks. It remains to be seen how far artificial intelligence and robotics will penetrate into the workforce. (Driving a car is technologically far easier than unloading a dishwasher, running an errand, or changing a baby.) Given the tradeoffs and impediments in every other area of technological development, the best guess is: much farther than it has so far, but not nearly so far as to render humans obsolete."

The emphasis below is all mine.


We wanted a Frankenstein translator and ended up with a bilingual chatbot. Try it yourself!    (The original title)


I don’t know about you, but I’m in a permanent state of frustration with the flood of headlines hyping machines that “understand language” or are developing human-like “intelligence.” I call bullshit! And yet, undeniably, a breakthrough is happening in machine learning right now. It all started with the oddball marriage of powerful graphics cards and neural networks. With that wedding party still in full swing, I talked Terence Lewis[*] into an even more oddball parallel fiesta. We set out to create a Frankenstein translator, but after running his top-notch GPU on full power for four weeks, we ended up with an astonishingly good translator and an astonishingly stupid bilingual chatbot.

And while we’re at it: Terence is obviously up for mischief, but more importantly, he offers a completely serious English<>Dutch machine translation service commercially. There is even a plugin available for memoQ, and the MyDutchPal system solves many of the MT problems that I’m describing later in this post.

And yet the plane is aloft! A fitting metaphor for AI’s state of the art.
Source: the internets.
So, check out the live demo below this image, then read on to understand what on earth is going on here.

 You can try the NMT engine at this link on the original posting.

Understanding deep learning

It all started in May when I read Adrian Colyer’s[2] summary of the article Understanding deep learning requires re-thinking generalization[3]. The proposition of Chiyuan Zhang & co-authors is so fascinating and relevant that I’ll just quote it verbatim:
What is it that distinguishes neural networks that generalize well from those that don’t?
Generalisation is the difference between just memorising portions of the training data and parroting it back, and actually developing some meaningful intuition about the dataset that can be used to make predictions.
The authors describe how they set up a series of original experiments to investigate this. The problem domain they chose is not machine translation, but another classic of deep learning: image recognition. In one experiment, they trained a system to recognize images – except they garbled the data set, randomly shuffling labels and photos. It might have been a panda, but the label said bicycle, and so on, 1.2 million times over. In another experiment, they even replaced the images themselves with random noise.

The paper’s conclusion is… ambiguous. Basically, it shows that neural networks will obediently memorize any random input (noise), but as for the networks’ ability to generalize from a real signal, well, we don’t really know. In other words, the pilot has no clue what they are doing, and yet the plane is still flying, somehow.

I immediately knew that I wanted to try this exact same thing, but with a purpose-built neural MT system. What better way to show that no, there’s no talk about “intelligence” or “understanding” here! We’re really dealing with a potent pattern-recognition-and-extrapolation machine. Let’s throw a garbled training corpus at it: genuine sentences and genuine translations, but matched up all wrong. If we’re just a little bit lucky, it will recognize and extrapolate some mind-bogglingly hilarious non-patterns, our post about it will go viral, and comedians will hate us.



Choices, ingredients, and cooking

OK, let’s build a Frankenstein translator by training an NMT engine on a corpus of garbled sentence pairs. But wait…

What language pair should it be? Something that’s considered “easy” in MT circles. We’re not aiming to crack the really hard nuts; we want a well-known nut and paint it funny. The target language should be English, so you, dear reader, can enjoy the output. The source language… no. Sorry. I want to have my own fun too, and I don’t speak French. But I speak Spanish!

Crooks or crooked cucumbers? There is an abundance of open-source training data[4] to choose from, really. The Hansards are out (no French), but the EU is busy releasing a relentless stream of translated directives, rules and regulations, for instance. It’s just not so much fun to read bureaucratese about cucumber shapes. Let’s talk crooks and romance instead! You guessed right: I went for movie subtitles. You won’t believe how many of those are out there, free to grab.

Too much goodness. The problem is, there are almost 50 million Spanish-English segment pairs in the OpenSub2016[5] corpus. NMT is known to have a healthy appetite for data, but 50 million is a bit over the line. Anything for a good joke, but we don’t have months to train this funny engine. I reduced it to about 9.5 million segment pairs by eliminating duplicates and keeping only the ones where the Spanish was 40 characters or longer. That’s still a lot, and this will be important later.
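The reduction step is simple enough to sketch in Python. This is not the script that was actually used; in particular, treating a "duplicate" as an identical source and target pair is an assumption, and the function name reduce_corpus is hypothetical.

```python
def reduce_corpus(pairs, min_source_chars=40):
    """Drop duplicate pairs and pairs whose source segment is shorter than min_source_chars."""
    seen = set()
    kept = []
    for source, target in pairs:
        key = (source, target)
        if key in seen or len(source) < min_source_chars:
            continue
        seen.add(key)
        kept.append(key)
    return kept

# Invented subtitle-style segments, for illustration only.
pairs = [
    ("No sé qué decirte, de verdad que no lo sé.", "I don't know what to tell you, I really don't."),
    ("No sé qué decirte, de verdad que no lo sé.", "I don't know what to tell you, I really don't."),
    ("¿Qué?", "What?"),
]
print(reduce_corpus(pairs))
```

Keeping only longer source segments also throws away the many one-word exclamations that subtitle corpora are full of.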

Straight and garbled. At this stage, we realized we actually needed two engines. The funny translator is the one we’re really after, but we should also get a feel for how a real model, trained from the real (non-garbled) data would perform. So I sent Terence two large files instead of one.
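How the garbled file was produced is not described here. One simple way, shown as an assumption in the Python sketch below, is to keep every source and every target sentence but shuffle the targets so that they no longer line up with their sources. The function name garble and the fixed seed are hypothetical.

```python
import random

def garble(pairs, seed=0):
    """Keep every source and every target, but shuffle which target goes with which source."""
    sources = [s for s, _ in pairs]
    targets = [t for _, t in pairs]
    random.Random(seed).shuffle(targets)  # seeded so the result is reproducible
    return list(zip(sources, targets))

pairs = [("uno", "one"), ("dos", "two"), ("tres", "three"), ("cuatro", "four")]
print(garble(pairs))
```

A plain shuffle can leave a few pairs matched by chance, but on a corpus of millions of segments that fraction is negligible.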

The training. I am, of course, extremely knowledgeable about NMT, as far as bar conversations with attractive strangers go. Terence, on the other hand, has spent the past several months building a monster of a PC with an Nvidia GTX 1070 GPU, becoming a Linux magician, and training engines with the OpenNMT framework[6]. You can read about his journey in detail on the eMpTy Pages blog[7]. He launched the training with OpenNMT’s default parameters: standard tokenization, 50k source and target vocabulary, 500-node, 2-layer RNN in both encoder and decoder, 13 epochs. It turned out one epoch took about one day, and we had two models to train. I went on vacation and spent my days in suspense, looking roughly like this:


An astonishingly good translator

The “straight” model was trained first, and it would be an understatement to say I was impressed when I saw the translations it produced. If you’re into that sort of thing, the BLEU score is a commendable 32.10, which is significantly higher than, well, any significantly lower value.[8]

The striking bit is the apparent fluency and naturalness of the translations. I certainly didn’t expect a result like this from our absolutely naïve, out-of-the-box, unoptimized approach. Let’s take just one example:
La doctora no podía participar en la conferencia, por eso le conté los detalles importantes yo mismo.
The doctor couldn't participate in the conference, so I told her the important details myself.
Did you spot the tiny detail? It’s the feminine pronoun her in the translation. The Spanish equivalent, le, is gender-neutral, so it had to be extrapolated from la doctora – and that’s pretty far away in the sentence! This is the kind of thing where statistical systems would probably just default to masculine. And you can really push the limits. I added stuff to make that distance even longer, and it’s still her in the impossible sentence, La doctora no podía participar en la conferencia que los profesores y los alumnos habían organizado en el gran auditorio de la universidad para el día anterior, además no nos quedaba mucho tiempo, por eso le conté los detalles importantes yo mismo. 
But once our enthusiasm is duly curbed, let’s take a closer look at the good, the bad and the ugly. If you purposely start peeling off the surface layers, the true shape of the emperor’s body begins to emerge. Most of these wardrobe malfunctions are well-known problems with neural MT systems, and much current research focuses on solving them or working around them.

Unknown words. In their plain vanilla form, neural MT systems have a severe limitation on the vocabulary (particularly target-language vocabulary) that they can handle. 50 thousand words is standard, and we rarely, if ever, see systems with a vocabulary over 100k. Unless you invest extra effort into working around this issue, a vanilla system like ours produces a lot of unks[9], like here:
Tienes que invitar al ornitólogo también.
You have to invite the unk too.
This is a problem with fancy words, but it gets even more acute with proper names, and with rare conjugations of not-even-so-fancy words.
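To make the mechanics concrete, here is a minimal sketch of how a capped vocabulary produces unks. It is an illustration only, not OpenNMT's actual preprocessing: the function names, the toy corpus, and the tiny cap of 6 words are all made up for the example (a real system would use something like 50k).

```python
from collections import Counter

UNK = "unk"

def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens seen in training."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return {tok for tok, _ in counts.most_common(max_size)}

def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary token with the unk symbol."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())

# Toy target-side training data: six words occur three times, three occur once.
training = [
    "you have to invite the teacher too",
    "you have to invite the doctor too",
    "you have to invite the ornithologist too",
]
vocab = build_vocab(training, max_size=6)
print(apply_vocab("you have to invite the ornithologist too", vocab))
# prints: you have to invite the unk too
```

The rare word falls below the frequency cut-off, so the system has no way to produce it, no matter how well it has learned the rest of the sentence.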

Omitted content. Sometimes, stuff that is there in the source simply goes AWOL in the translation. This is related to the fact that NMT systems attempt to find the most likely translation, and unless you add special provisions, they often settle for a shorter output. This can be fatal if the omitted word happens to be a negation. In the sentence below, the omitted part (paranoias mitológicas sobre) is less dramatic, but it’s an omission all the same.
Lynch trabaja como siempre, sin orden ni reglas: desde críticas a la televisión actual a sus habituales reflexiones sobre la violencia contra las mujeres, pasando por paranoias mitológicas sobre el bien y el mal en la historia estadounidense.
Lynch works as always, without order or rules: from criticism to television on current television to his usual reflections about violence against the women, going through right and wrong in American history.
Hypnotic recursion. Very soon after Google Translate switched to Neural MT for some of its language combinations, people started noticing odd behaviors, often involving loops of repeated phrases.[10] You see one such case in the example above, in the phrase on current television: that second television seems to come out of thin air. Which is actually pretty fitting for Lynch, if you think about it.
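Back to omitted content for a moment. The preference for shorter output follows from how candidate translations are typically scored: the log-probabilities of the individual tokens are summed, and since each of them is negative, every extra token can only lower the total. Below is a minimal sketch with made-up token probabilities. Dividing by the length is just one simple example of a "special provision", not necessarily what any given toolkit does by default.

```python
import math

def raw_score(token_probs):
    """Sum of token log-probabilities: each extra token lowers the total."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs):
    """Average log-probability per token, so length alone is not penalized."""
    return raw_score(token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two candidate translations.
full = [0.8] * 10      # complete translation, ten fairly confident tokens
shortened = [0.7] * 6  # drops part of the source, six less confident tokens

print(raw_score(full) > raw_score(shortened))
# prints: False (the shorter candidate wins on the raw score)
print(length_normalized_score(full) > length_normalized_score(shortened))
# prints: True (the complete candidate wins once length is factored out)
```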

Learning too much. Remember that we’re not dealing with a system that “translates” or “understands” language in any human way. This is about pattern recognition, and the training corpus often contains patterns that are not linguistic in nature.
Mi hermano estaba conduciendo a cien km/h.
My brother was driving at a hundred miles an hour.
Mi hermano estaba conduciendo a 100 km/h.
My brother was driving at 60 miles an hour.
Since when is a mile a translation of kilometer? And did the system just learn to convert between the two? To some extent, yes: 100 km/h is roughly 62 mph, so 60 miles an hour is a fair rounded conversion. And that’s definitely not linguistic knowledge. But crucially, you don’t want this kind of arbitrary transformation going on in your nuclear power plant’s operating manual.

Numbers. You will have guessed by now: numbers are a problem. There are way too many of them critters to fit into a 50k-vocabulary, and they often behave in odd ways in bilingual texts attested in the wild. Once you stray away from round numbers that probably occur a lot in the training corpus, trouble begins.
Mi hermano estaba conduciendo a 102 km/h.
My brother was driving at unk.
Mi hermano estaba conduciendo a 85 km/h.
My brother was driving at 85 miles an hour.
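One common way of working around this is to swap numbers for placeholders before translation and put them back afterwards, so the system only ever sees a handful of placeholder tokens. The sketch below illustrates the general idea and is not taken from our setup: the placeholder format and the function names are made up, and the "translation" step is faked with a hard-coded string.

```python
import re

# Digits, optionally with decimal or thousands separators (e.g. 1,5 or 10.000).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(text):
    """Replace each number with a placeholder like __NUM0__ and remember it."""
    found = []

    def repl(match):
        found.append(match.group(0))
        return f"__NUM{len(found) - 1}__"

    return NUMBER.sub(repl, text), found

def unmask_numbers(text, found):
    """Put the original numbers back in place of the placeholders."""
    return re.sub(r"__NUM(\d+)__", lambda m: found[int(m.group(1))], text)

masked, numbers = mask_numbers("Mi hermano estaba conduciendo a 102 km/h.")
# masked is now "Mi hermano estaba conduciendo a __NUM0__ km/h."

# Hard-coded stand-in for what a system trained on masked data might return.
translated = "My brother was driving at __NUM0__ km/h."
print(unmask_numbers(translated, numbers))
# prints: My brother was driving at 102 km/h.
```

The number is copied over verbatim, which also rules out any creative unit conversion along the way.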
Finally, data matters. Our system might be remarkably good, but it’s remarkably good at subtitlese. That’s all it’s ever seen, after all. In Subtitle Land, translations like the one below are fully legit, but they won’t get you far in a speech writing contest for the Queen.
No le voy a contar a la profesora.
I'm not gonna tell the teacher.

The garbled model

Now on to the “crazy” model! I made a tremendous mental effort to keep my expectations low, but secretly, at the bottom of my heart, I was hoping for the kind of nonlinear oddity that you get if you start inputting жо into Google Translate[11]:
жо > Jo
жожо > Jojo
жожожо > Joess
жожожожо > Reverently
жожожожожожо > Rejoicing
жожожожожожожожо > Reassuringly
жожожожожожожожожо > Reaping thee
Compared to this, our crazy system is somewhat underwhelming.
whisky > Thought!
sangría > Thought!?
Necesito un whisky. > I don't know what you're talking about.
жо > . honestly guess guess guess guess gues
Malkovich > . honestly guess guess guess guess guess
Malkovich Malkovich > You know, I don't know what you're talking about.
Let’s just put it this way: I’ve heard funnier jokes before. And those jokes tended to be a lot less repetitive, too. OK, with a bit of luck you do get a few highlights, in the “free self-help advice for nuts” kind of way, but that’s about it.
En este día de Julio, me gustaría escribir algunas reflexiones sobre como me siento, en relación con mi mismo, que es una de las relaciones más difíciles y complejas que una persona debe llevar a adelante, y en relación con los demás...
I'm sure you're aware of the fact that you're the only one who's been able to find out what's going on, and I don't want you to think that I'm the only one who can help you.
There seem to be two rules to this game:
  1. What you input doesn’t matter a whole lot. The only thing that makes a real difference is how long it is.
  2. The crazy “translations” have nothing to do with the source. They are invariably generic and bland. They could almost be a study in noncommittal replies.
And that last sentence right there is the key, as I realized while I was browsing the OpenNMT forums[12]. It turns out people are using almost the same technology to build chatbots with neural networks. If you think about it, the problem can indeed be defined in the same terms. In translation, you have a corpus of source segments and their translations; you collect a lot of these and train a system to give the right translation for the right source. In a chatbot, your segment pairs are prompts and responses, and you train the system to give the right response to the right prompt.

Except, this chatbot thing doesn’t seem to be working as well as MT. To quote the OpenNMT forum: “People call it the "I Don't Know" problem and it is particularly problematic for chatbot type datasets.”
For me, this is a key (and unanticipated) take-away from the experiment. We set out to build a crazy translator, but unwittingly we ended up solving a different problem and created a massively uninspired bilingual chatbot.
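The "I Don't Know" effect can be mimicked with a toy stand-in for the trained system. The sketch below is nothing like a neural network: it just counts responses, answers a known prompt with its most frequent response, and falls back on the most frequent response overall for everything else. All names and data are invented for the illustration.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count responses per prompt and across the whole corpus."""
    per_prompt = defaultdict(Counter)
    overall = Counter()
    for prompt, response in pairs:
        per_prompt[prompt][response] += 1
        overall[response] += 1
    return per_prompt, overall

def respond(model, prompt):
    """Most frequent response for a known prompt, else the safest generic one."""
    per_prompt, overall = model
    if prompt in per_prompt:
        return per_prompt[prompt].most_common(1)[0][0]
    return overall.most_common(1)[0][0]

# Made-up prompt/response pairs; the bland reply is the most frequent one.
chat_pairs = [
    ("Where were you?", "I don't know."),
    ("Who was that?", "I don't know."),
    ("What do you want?", "Nothing."),
    ("Why are you here?", "I don't know."),
    ("Is he coming?", "Maybe."),
]
model = train(chat_pairs)
print(respond(model, "Malkovich Malkovich"))
# prints: I don't know.
```

When the prompt tells the system nothing useful about the response, the statistically safest bet is whatever people say most often, and that tends to be generic and bland.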

Two takeaways

Beyond any doubt, the more important outcome for me is the power of neural MT. The quality of the “straight” model that we built drastically exceeded my expectations, particularly because we didn’t even aim to create a high-quality system in the first place. We basically achieved this with an out-of-the-box tool, the right kind of hardware, and freely available data. If that is the baseline, then I am thrilled by the potential of NMT with a serious approach.

The “crazy” system, in contrast, would be a disappointment, were it not for the surprising insight about chatbots. Let’s pause for a moment and think about these. They are all over the press, after all, with enthusiastic predictions that in a very short time, they will pass the Turing test, the ultimate proof of human intelligence.

Well, it don’t look that way to me. Unlike translated sentences, prompts and responses don’t have a direct correlation. There is something going on in the background that humans understand, but which completely eludes a pattern recognition machine. For a neural network, a random sequence of letters in a foreign language is as predictable a response as a genuine answer given by a real human in the original language. In fact, the system comes to the same conclusion in both scenarios: it plays it safe and produces a sequence of letters that’s a generally probable kind of thing for humans to say.

Let’s take the following imaginary prompts and responses:
How old are you?
No, seriously, I took the red door by mistake.

Guess who came to yoga class today.
Poor Mary!
It would be a splendid exercise in creative writing to come up with a short story for both of them. Any of us could do it in a breeze, and the stories would be pretty amusing. There is an infinite number of realities where these short conversations make perfect sense to a human, and there is an infinite number of realities where they make no sense at all. In neither case can the response be predicted, in any meaningful way, from the prompt or the preceding conversation. Yet that is precisely the space where our so-called artificial “intelligence” currently lives.

The point is, it’s ludicrous to talk about any sort of genuine intelligence in a machine translation system or a chatbot based on recurrent neural networks with a long short-term memory.

Comprehension is that elusive thing between the prompts and the responses in the stories above, and none of today’s technologies contains a metaphorical hidden layer for it. At the level at which our systems comprehend reality, a random segment in a foreign language is as good a response as Poor Mary!

About Terence *

Terence Lewis, MITI, entered the world of translation as a young brother in an Italian religious order, where he was entrusted with the task of translating some of the founder's speeches into English. His religious studies also called for a knowledge of Latin, Greek, and Hebrew. After some years in South Africa and Brazil, he severed his ties with the Catholic Church and returned to the UK where he worked as a translator, lexicographer[13] and playwright. As an external translator for Unesco, he translated texts ranging from Mongolian cultural legislation to a book by a minor French existentialist. At the age of 50, he taught himself to program and wrote a rule-based Dutch-English machine translation application which has been used to translate documentation for some of the largest engineering projects in Dutch history. For the past 15 years, he has devoted himself to the study and development of translation technology. He recently set up MyDutchPal Ltd to handle the commercial aspects of his software development. He is one of the authors of 101 Things a Translator Needs to Know[14].



[1] The live demo is provided "as is", without any guarantees of fitness for purpose, and without any promise of either usefulness or entertainment value. The service will be online for as long as I have the resources available to run it (a few weeks probably).
Oh yes, I'm logging your queries, and rest assured, I will be reading them all. I am tremendously curious to see what you come up with, and I want to enjoy all the entertaining or edifying examples that you find.
[2] the morning paper. an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer.
[3] Understanding deep learning requires rethinking generalization. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. ICLR 2017 conference submission.
[4] OPUS, the open parallel corpus. Jörg Tiedemann.
[5] OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. Pierre Lison, Jörg Tiedemann.
[6] OpenNMT: Open-Source Toolkit for Neural Machine Translation.
[7] My Journey into "Neural Land". Guest Post by Terence Lewis on the eMpTy Pages blog.
[8] Never trust anyone who brags about their BLEU scores without giving any context. I’m not giving you any context, but you have the live demo to see the output for yourself.
Also, a few words about this score. I calculated it on a validation set that contains 3k random segment pairs removed from the corpus before training. So they are in-domain sentences, but they were not part of the training set. The score was calculated on the detokenized text, which is established MT practice, except in NMT circles, who seem to prefer the tokenized text, for reasons that still escape me.
And if you want to max out on the metrics fetish, the validation set’s TER score is 47.28. There. I said it.
[9] Don’t get me wrong, I’m a great fan of unks. They can attend my parties anytime, even without an invitation. If I had a farm I would be raising unks because they are the cutest creatures ever.
[10] Electric sheep. Mark Liberman on Language Log.
[11] From the same Language Log post quoted previously. Translations were retrieved on August 6, 2017; they are likely to change when Google updates their system.

[12] English Chatbot advice
[13] Harrap's English-Brazilian Portuguese business dictionary. Terence Lewis, Lígia Xavier, Cláudio Solano.
[14] 101 Things a Translator Needs to Know. ISBN 978-91-637-5411-1

Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs, and tweets as @twilliability.