Thursday, June 30, 2016

The Emerging World of Neural Net Based MT

There has been much in the news lately about the next wave of MT technology driven by a technology called deep learning and neural nets (DNN). I will attempt to provide a brief layman’s overview about what this is, even though I am barely qualified to do this (but if Trump can run for POTUS then surely my trying to do this is less of a stretch). Please feel free to correct me if I have inadvertently made errors here. 

To understand deep learning and neural nets it is useful to first understand what “machine learning” is. Very succinctly stated machine learning is the “Field of study that gives computers the ability to learn without being explicitly programmed” according to Arthur Samuel. 

Machine learning is a sub-field of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning algorithms iterativelylearn from (training) data by generalizing their experience (“analysis” of the data) into predictive models. These models allow computers to find insights that might be difficult or even impossible for humans to find. In the case of MT the objective is to build predictive models that translate new source data based on “knowledge” it has gathered from translation memory and other natural language data it is shown (trained on). Think of it simply as a branch of statistics and applied computing, designed for and enabled by a world of big data. For example, to some extent SMT is a more flexible and generalized implementation of the older TM technology, which can guess at new sub-segments based on learning obtained from its training experience. 

Machine Learning is the field that studies how to make computers learn on their own (unsupervised), and covers a lot of ground in computing beyond text and MT applications, and most recently has had amazing success with image data. A Machine Learning algorithm is a computer program that teaches computers how to program themselves so that we don’t have to explicitly describe how to perform the task we want to achieve. The information that a Machine Learning algorithm needs in order to write its own program to solve a particular task is a (large) set of known examples e.g. translation memory or radiology images together with resultant diagnoses. 

Machine Learning is a big deal, perhaps even a really big deal. Google CEO, Sundar Pichai, recently laid out the Google corporate mindset: “Machine learning is a core, transformative way by which we’re rethinking how we’re doing everything. We are thoughtfully applying it across all our products, be it search, ads, YouTube, or Play. And we’re in early days, but you will see us — in a systematic way — apply machine learning in all these areas.” Google is all in. Google believes that one day it will be used by all software engineers no matter what the field, and that it will “change humanity.” 

This links to a series of posts that provide the best non-technical overviews I have seen on what Machine Learning is. 

But how is it used, in practice? Very roughly speaking there are three broad concepts that capture most of what goes on under the hood of a machine learning algorithm: feature extraction, which determines what data to use in the model; regularization, which determines how the data are weighted within the model; and cross-validation, which tests the accuracy of the model. Each of these factors helps us identify and separate “signal” (valuable, consistent relationships that we want to learn) from “noise” (random correlations that won’t occur again in the future, that we want to avoid). Every data set has a mix of signal and noise, and skill with these concepts will help you sort through that mix to make better predictions. This is a gross oversimplification and needs much more elaboration than is possible in this post. 

This video with Peter Norvig of Google talking about ML and this one byAndrew Ng of Baidu explaining what Deep Learning is, is worth skimming through (for those who have a deeper interest at least) to get a sense of the larger possibilities that could emerge from machine learning approaches. 

Traditional AI methods of language understanding depended on embedding rules of language into a system, but in the Google SmartReply project, as with all modern machine learning , the system was fed enough data to learn on its own, just as a child would. “I didn’t learn to talk from a linguist, I learned to talk from hearing other people talk,” says Greg Corrado who developed SmartReply at Google. Corrado says that the approach requires a change in mindset for coders, from controlling everything directly to analyzing data, and even new hardware. The company even created its own chip, the Tensor Processing Unit, optimized for its machine-learning library TensorFlow

Getting back to MT, it is useful to first look at how machine learning is used in SMT to better understand the evolution enabled by neural networks. The SMT model is generally made up of two (sometimes more) predictive models learned from “training data” which includes both representative bilingual data and monolingual data in the target language.
  1. SMT Translation Model – Learned from Bilingual Data (Translation Memory)
    1. Probabilistic mapping of equivalencies in source words and phrases with target language words and phrases through the Unsupervised Expected Model (EM) training and word and phrase alignment process.
    2. The Translation Model generates lots of possible translations
  1. Target Language Model – Learned from Monolingual Target Language Data
    1. Probabilistic model of relative fluency and general usage patterns in the target language
    2. The Target Language Model selects the “best” translations from a list of possible candidates
Even though this is essentially a probability maximization exercise (not really a translation), it can do surprisingly well, and translate new source data in the same domain quite accurately. This link provides a relatively simple overview of the learning process in slides, and here is Norvig again, givinga very clear 12 minute overview of how SMT works. Much of what we see today, as phrase based SMT including Moses, is built with this kind of a learning approach. Even with many limitations this is a significant improvement over the older Rule Based MT systems where humans tried to codify and program the language pair. 

Some of the most obvious problems include the following:
  • Since this is a word based approach it is not as effective with character based languages (CJK) because of imperfect segmentation and tokenization issues.
  • It has a very limited sense of context and is often quite mindlessly literal.
    It is not especially effective with language combinations that have varying morphology, non-contiguous phrases and syntactic transformations.
  • It has limited success in scarce data scenarios and more data does not always drive improvement.
Deep Learning with Neural Networks to the rescue
Neural nets are modeled on the way biological brains learn. When you attempt a new task, a certain set of neurons will fire. You observe the results, and in subsequent trials your brain uses feedback to adjust which neurons get activated. Over time, the connections between some pairs of neurons grow stronger and other links weaken, laying the foundation of a memory. A neural net (DNN) essentially replicates this process in code. 

Neural Nets are a specific implementation of a Machine Learning algorithm. In a way Neural Nets allow one to extract “more knowledge” from the training data set and access deeper levels of “understanding” from the reference data. Neural networks are complex, computationally intensive and hard to tune as machines may see multiple layers of patterns that don’t always make sense to a human, yet can be surprisingly effective in building prediction models of astonishing accuracy. 

Again, Andrew Ng explains why this technology that has been around for 20 years, has reached its perfect storm moment. Basically because the availability of really big data + high performance computing + evidence of successful prediction in image processing suggest that it can, and could work, in solving complex problems in many other areas where standard programming approaches would be impractical. 

Deep Neural Nets can have a large number of hidden layers and are able to extract much deeper related features from the data. Recently, deep neural networks have performed particularly well for image recognition problems . Deep neural networks have become extremely popular in more recent years due to their unparalleled success in image and voice recognition problems. Neural Nets have been successful with sequence recognition problems (gesture, speech, ATM handwritten check text recognition), medical diagnosis, financial trading systems, visualization and e-mail spam filtering. 

Just as with dirty data SMT, one of the biggest reasons why neural networks may not work is because people do not properly pre process the data being fed into the neural network. Data normalization, removal of redundant information, and outlier removal should all be performed to improve the probability of good neural network performance. There are a variety of DNN techniques that solve different kinds of deep learning problems. An understanding of how these different approaches perform better under different constraints and different evaluation criteria is underway in the research community as we speak. 

In particular, Neural Networks excel in cases where the strategy is not known ahead of time and instead must be discovered. In cases where this is NOT true and all that must be learned are the parameters for that strategy, there are algorithms that can find good solutions a whole lot faster and with fewer resources. 

So in the MT context, the rationale behind using the neural network-based training is about discovering the hidden “reasons” behind the translation of the text in the training set; essentially, the machine may be “writing rules of linguistic relationship” automatically, and producing a more flexible engine by extracting more useful ”knowledge” from the existing training data. Note that these features are not necessarily the same features human linguists would use (parts of speech, morphology, syntax, transitivity, etc...) But these hidden layers have solved problems of great difficulty in image recognition and there is reason to believe that they can do this with NLP as well. Of course, this is easier said than done but that’s the basic reasoning and there is much research underway with Google, Facebook, Microsoft and Baidu (the big four) leading the way. 

One DNN breakthrough example is WORD2VEC. Its creators have shown how it can recognize the similarities among words (e.g., the countries in Europe) as well as how they’re related to other words (e.g., countries and capitals). It’s able to decipher analogical relationships (e.g., short is to shortest as big is to biggest), word classes (e.g., carnivore and cormorant both relate to animals) and “linguistic regularities” (e.g., “vector(‘king’) – vector(‘man’) + vector(‘woman’) is close to vector(‘queen’)). Kaggle’s Howard calls Google’s word2vec the “crown jewel” of natural language processing. “It’s the English language compressed down to a list of numbers,” he said. The real benefit of this will take years to unfold as more researchers experiment with it to try and solve new NLP problems. 

From the data scientist’s perspective, MT aims to find for the source language sentence presented to it, the most probable target language sentence that shares the most similar meaning. Essentially, MT from the data scientists perspective is a sequence-to-sequence prediction task. 

Indirect DNN application designs new features with DNNs in the framework of standard SMT systems, which consist of multiple sub-models (such as optimal translation candidate selection and more fluent and natural language models). For example, DNNs can be leveraged to represent the source language context’s semantics and better predict translation candidates. (The two columns shown to the right in the graphic above reflect an indirect implementation.)
The indirect application of DNNs in SMT aims to solve difficult problems in an SMT system with more accurate context modeling and syntactic/semantic representation e.g. Word Alignment in SMT which have two disadvantages: 1) The current process can’t capture the similarity between words, and 2) contextual information surrounding the word isn’t fully explored. Traditionally, translation rule selection is usually performed according to co-occurrence statistics in the bilingual training data rather than by exploring the larger context and its semantics. DNNs help to improve the process to consider context and semantics more effectively. 

Language Model – The most popular language model is the count-based n-gram model described by Norvig and by the charts above. One big issue here is that data sparseness becomes severe as n grows. Using RecurrentNN ( a type of DNN) allows a better solution than the standard count-based n-gram model. All the history words available are applied to predict the next word instead of just n-1. This allows an SMT model to have a much better sense of context. The table below shows one view on how different types of DNNs can help address common SMT problems. 

Statistical machine translation difficulties and their corresponding deep neural network solutions.
Word alignment                                           FNN, RecurrentNN
Translation rule selection                            FNN, RAE, CNN
Reordering and structure prediction             RAE, RecurrentNN, RecursiveNN
Language model                                          FNN, RecurrentNN
Joint translation prediction `                       FNN, RecurrentNN, CNN
However, indirect application of DNNs makes the SMT system much more complicated and difficult to deploy.
Direct application (NMT) regards MT as a sequence-to-sequence prediction task and, without using any information from standard MT systems, designs two deep neural networks—an encoder, which learns continuous representations of source language sentences, and a decoder, which generates the target language sentence with source sentence representation. (Yes I really cannot figure out a way to say this in a more intelligible way.)
In contrast, direct application is simple in terms of model architecture: a network encodes the source sentence and another network decodes to the target sentence. Translation quality is improving, but this new MT architecture is far from perfect. There’s still an open question of how to efficiently cover more of the vocabulary, how to make use of the target large-scale monolingual data to improve fluency, and how to utilize more syntactic/semantic knowledge in addition to what is possible learn from source sentences. 

How do DNNs improve translation quality?
For example, several algorithms can be applied to calculate the similarity between phrases or sentences. But they also capture much more contextual information than standard SMT systems, and data sparseness isn’t as big a problem. For example, the RecurrentNN can utilize all the history information in text that comes before the currently predicted target word; this is impossible with standard SMT systems.

Can DNNs lead to a big breakthrough?
  • There have been recent breakthroughs but NMT is computationally much more complex. Because the network structure is complicated, and normalization over the entire vocabulary is usually required, DNN training is a time-consuming task. Training a standard SMT system on millions of sentence pairs only requires a day or two, whereas training a similar NMT system can take several weeks, even with powerful GPUs.
  • Currently it is hard to understand and pinpoint why it is better or worse than SMT – i.e. error analysis is problematic but experimentation is underway at the big four listed above.
  • Limited reasoning and remembering capabilities and suffers with rare words and long sentences.
  • A straightforward Moses-like toolkit that fosters more experimentation is desperately needed but will take at least a year or two to become widely available.
  • NMT produces much more natural sounding translations than SMT claim Facebook, Google and Microsoft.
  • Better ability to handle idiom and metaphor as the Facebook team is claiming.
Purely neural machine translation (NMT) is the new MT paradigm. The standard SMT system consists of several sub-components that are separately optimized and normally implemented in a production pipeline. In contrast, NMT employs only one neural network that’s trained to maximize the conditional likelihood on the bilingual training data. The basic architecture includes two networks: one encodes the variable-length source sentence into a real-valued vector, and the other decodes the vector into a variable length target sentence. 

Experiments report similar or superior performance in English-to-French translation compared to the standard phrase-based SMT system. The MT network architecture is simple, but it has many shortcomings. For example, it restricts tens of thousands of vocabulary words for both languages to make it workable in real applications, meaning that many unknown words appear. Furthermore, this architecture can’t make use of the target large-scale monolingual data. Attempts to solve the vocabulary problem are heuristic, e.g. they use a dictionary in the post-processor to translate the unknown words.

However in spite of these issues, as Chris Wendt at Microsoft says: “Neural networks bring up the quality of languages with largely differing sentence structure, say English<>Japanese, up to the quality level of languages with similar sentence structure, say English<>Spanish. I have looked at a lot of Japanese to English output: Finally actually understandable." 
Jeff Dean at Google is excited about his own team’s effort to push things forward with NMT. “This is a model that uses only neural nets to do end-to-end language translation,” he says. “You train on pairs of sentences in one language or another that mean the same thing. French to English say. You feed in English sentences one word at a time, boom, boom, boom… and then you feed in a special ‘end of English’ token. All of a sudden, the model starts spitting out French.” Dean shows a head-to-head comparison between the neural model and Google’s current system — and his deep learning newcomer one is superior in picking up nuances in diction that are key to conveying meaning. “I think it’s indicative that if we scale this up, it’s going to do pretty powerful things,” says Dean. 

Alan Packer at FaceBook said they believe neural networks can learn “the underlying semantic meaning of the language,” so what is produced are translations “that sound more like they came from a person.” He said neural network-based MT can also learn idiomatic expressions and metaphors, and “rather than do a literal translation, find the cultural equivalent in another language.” Machine Learning is deeply embedded into the FaceBook system infrastructure and we should expect many new breakthroughs.

Tayou is the only language industry MT vendor so far who seems to have experimented with NMT thus far, and they have mixed results, that are presented here. These early results are useful to understand the challenges with early NMT, but these experiments are not useful to conclusively conclude on the real the potential of NMT to outperform SMT or not. The development tools will get better and the same experiments described here could yield different outcomes in future, as tools improve.

So while there are indeed challenges to getting NMT launched in a broad and pervasive way, there are many reasons to march forward. We see the largest Internet players including Microsoft, Google, FaceBook and Baidu are all working with DNNs and all have NMT initiatives in motion. Microsoft has already deployed pure neural networks on mobile translation apps for Android and iOS. Of course with a very small and limited vocabulary but this will only grow and evolve. 

I doubt very much that phrase based SMT is going away to quietly die in the short term (LT 5 years). But as supercomputing access becomes commonplace, and as NMT fleshes out with more comprehensive support tools, the same way that SMT did (which takes some years), we could see a gradual transition and evolution to this new kind of MT. 

There is enough actual evidence of success with NMT to generate real excitement, and I expect we will see a super-Moses-like kit to build NMT systems appear within the next 12 months. This will foster more experimentation and possibly discover new pathways to better automated translation . All this only points to improving MT, albeit gradually, and while MT is a truly difficult engineering problem, the best minds in the world are far from finished on what is possible to improve MT using machine intelligence technology. The emergence of NMT also points to the high likelihood of obsolescence for those who like to keep everything on-premise or the desktop. The best MT solutions will likely happen in the cloud and will be unlikely to be possible at all on the desktop.

MT naysayers should be aware, that the mindset that most of these MT researchers have in place, can be encapsulated by the following statement originally made by Thomas Edison: "I have not failed. I've just found 10,000 ways that won't work." It will indeed get better.

Monday, June 27, 2016

MT Options for the Individual Translator

This is yet another post triggered by conversations in Rio at the ABRATES conference in early June. As I mentioned in my initial conference post, the level of interest in MT was unusually high and there seemed to be real and serious interest in finding out ways to engage with MT beyond just the brute force and numbing corrective work that is typical of most PEMT projects with LSPs.

MT has been around in some form for decades, and I am aware that there have always been a few translators who found some use for the technology. But since I was asked by so many translators about the options available today I thought it would be useful to write about it. 

The situation today for many translators is to work with low quality MT output produced by LSP/enterprise MT practitioners with very limited MT engine development experience, and typically they have no say in how the MT engines evolve, since they are so far down the production line. Sometimes they may work with expert developed MT systems where there is some limited feedback and steering possible, but generally the PEMT experience involves: 

1.    Static MT engines that are developed offline somewhere, and that may have periodic if any updates at all to improve the engine very marginally if at all.
2.    Post-editors work on batches of MT output and provide periodic feedback to MT developers.

This is beginning to change in the very recent past with innovative new MT technology that is described as Adaptive Interactive Dynamic Learning MT (quite a mouthful). The most visible and elegant implementation of this approach is from a startup called Lilt. This allows the translator-editor to tune and adjust the engine dynamically in real time, and thus make all subsequent MT predictions more intelligent, informed and accurate. This kind of an MT implementation is something that has to be cloud based to allow the core engine to be updated in real time. Additionally, when used in workgroups, this technology can also leverage the individual efforts of translators by spreading the benefit of a more intelligent MT engine with the whole team of translators. Each user benefits from the previous edits and corrective actions of every other translator-editor and user as this case study shows. This allows a team to build a kind of communal edit synergy in real time and theoretically allows 1+1+1 to be 5 or 7 or even higher. The user interface is also much more translator friendly and is INTERACTIVE so it changes moment to moment as the editor makes changes. Thus you have a real-time virtuous cycle which is in essence an intelligent learning TM Engine that learns with every single corrective interaction. 

CSA tells us that the SDL Language Cloud also has similar abilities but my initial perusal suggests it is definitely less real-time, and less dynamic than Lilt i.e. it is not updating phrase tables in real time. There are several little videos that explain it and superficially it looks like an equivalent but I am going to bet it is not yet at this point in time anyway.

So for a translator who wants to get hands on experience with MT and understand the technology better, what are the options? The following chart provides a very rough overview of the options available ranked by my estimation of the best options to learn valuable new skills. The simplest MT option for an individual translator has always been a desktop RbMT system like Systran or ProMT, and it still is a viable option for many, especially with Romance languages. But there has never been much that could be done to tune these older systems beyond building dictionaries, a skill that in my opinion will have low value in the future.

Good expert developed MT systems will involve frequent interaction with translators in the early engine development phases to ensure that engines get pattern based corrective feedback to rapidly improve output quality. The more organized and structured this feedback and improvement process, the better the engine and the more interesting the work for the linguist.

Of the “free” generic MT engines Microsoft offers much more customization capabilities and thus are a good platform to learn how different sets of data can affect an MT engine and MT output. This of course means the user needs to have organized data available and an understanding of the technology learning process. MT can be trained if you have some understanding of how it learns. This is why most Moses experiments fail I think, too much effort focused on the low value mechanics, and too little on what and why you do what you do. I remain skeptical about ignorant Moses experimentation because getting good SMT engines requires good data + understanding of how your training corpus is similar to or different from the new source that you want to translate, and a variety of tools to help keep things synced and aligned when you see differences. I am also skeptical that these DIY efforts are likely to get as engines as good as the free generic engines, and I wonder why one would bother with the whole Moses effort if you could get it at higher quality for free from Microsoft or Google. There are some translators who claim some benefit from working with Moses and its desktop implementations like Slate. 

All these options will provide some experience and insight into MT technology, but I think it is useful to have some sense for how they might impact you from two key perspectives shown below:

1.    What options are likely to give you the fastest way to get to improved personal productivity?
·         I would expect that an Adaptive MT solution is most likely to do this the fastest, but you need have good clean training data – TM, Glossaries (the more the better). Also you should see your edit experience improve rapidly as your corrective feedback modifies the engine in real time.
·         Followed by the Microsoft Translator Hub (if you do not have concern about privacy issues), SDL Language Cloud and some Expert systems which are more proactive in engaging translators but this will also involve an LSP middleman typically.
·         Generic Google & Microsoft and Desktop RbMT (Romance languages and going to English tend to have better results in general).
·         DIY Moses is the hardest way to get to productivity IMO but there is some evidence of success with Slate.

2.    What options are most likely to help develop new skills that could have long-term value?
·         My bet is that the SMT options are all going to help skills related to corpus analysis, working with n-grams, large scale corpus editing and data normalization. Even Neural MT will train on existing data so all those skills remain valuable.
·         Source analysis before you do anything is always wise and yields better results as your strategy can be micro-tuned for the specific scenario.
·         Both SMT and the future Neural MT models are based on something called machine learning. It is useful to have at least a basic understanding of this as this is how the computer “learns”. It is growing in importance and worth long-term attention.

There are tools like Matecat that show promise, but given that edits don’t directly change the underlying MT engine, I would opt for a true Adaptive MT option like Lilt instead.  

  What do we mean by PEMT in the larger context?

The traditional understanding of PEMT can be summarized in the graphic below and this is the most common kind of interaction that most translators have with MT if they have any at all.

However the problem definition and the skills needed to solve them are quite different when you consider an MT engine from a larger overall process perspective. It generally makes sense to address corpus level issues before going to the segment level so that many error patterns can be eliminated and resolved at a high frequency pattern level. It may also be useful to use web crawlers to gather patterns to guide the language model in SMT and get more fluent target language translations. 

The most interesting MT problems which almost always happen outside the “language services industry”, require a view of the whole user experience with translated content from beginning to end. This paper describes this holistic user experience view and the resultant corpus analysis perspective for an Ebay use case. Solving problems often involves data acquisition of the right kind of new data, normalization of disparate data, and focus on handling high frequency word patterns in a corpus to drive MT engine output quality improvements. This type of deeper analysis may happen at an MT savvy LSP like SDL, but otherwise is almost never done by LSP MT practitioners in the core CSA defined translation industry. This kind of deep analysis is also often limited at MT vendors because customers are in a hurry, and not willing to invest the time and money to do it right. Only the most committed will venture into this kind of detail, but this kind of work is necessary to get an MT system to work at optimal levels. Enterprises like Facebook, Microsoft and EBay understand the importance of doing all the pre and post data and system analysis and thus develop systems that are much more closely tuned to their very very specific needs.

MT use makes sense for a translator only if there is a productivity benefit. Sometimes this is possible right out of the gate with generic systems, but most often it takes some effort and skill to get an MT system to this point. It is important that translators have a basic understanding of three key elements before MT makes sense:
1.    Their individual translation work throughput without MT in words per hour.
2.    The quality of the MT system output and the ability of the translator to improve this with corrective feedback.
3.    The individual work throughput with MT after some effort has been made to tune it for specific use.

Obviously 3 has to be greater than 1 for MT use to make sense. I have heard that many translators use MT as way to speed up the typing or to look up individual words. I think we are at a point where the number of MT options will increase and more translators will find value. I would love to hear real feedback from anybody who reads this blog as actual shared experience is still the best way to understand the possibilities.