Machine translation technology has an unfortunate history of overpromising and underdelivering. At least 50 years of doing this and sometimes it seems that the torture will never stop. MT enthusiasts continue to make promises that often greatly exceed the realistic possibilities. Recently, in various conversations, I have seen that the level of unwarranted exuberance around the possibilities with the Moses Open Source SMT technology is rising to peak levels. This is especially true in the LSP community. While most technologies go through a single hype cycle, MT seems destined to go through several of these cycles with each new approach and the latest of these is what I call Moses Madness. It has become fashionable of late to build instant DIY MT engines, using tools that help you with the mechanics of running the software that is “Moses”. While some of these tools greatly simplify the mechanical process of running the Moses software, they do not give you any insight into what is really going on inside the magic box or any clues to what you are doing at all. Moses is a wonderful technology and it enables all kinds of experimentation that furthers the art and science of data-driven MT, but it does require some knowledge and understanding for real success. It is possible to get a quick and dirty MT engine together using some of these tools, but for long-term strategic translation production leverage, I am not so sure. Thus it is my sense that we are at the peak of the hype cycle for DIY Moses.
I would like to present a somewhat contrarian viewpoint to much of what you will hear at TAUS - “Let a thousand MT systems bloom”, and other online forums on getting started with instant MT approaches. IMO Moses and especially instant Moses is clearly not the final answer. While Moses is a starting point for real development, it should not be mistaken as the final destination. I think there are a number of reasons that you should pause before you jump in, and at least build up some knowledge before taking the dive. I have attempted to enumerate some of these reasons, but I am sure some will disagree. Anyway, I hope an open discussion will be valuable in reaching a more sustainable and accurate view of the reality and so here goes, even though perhaps I am rushing in where angels fear to tread. And of course my opinion on this matter is not impartial, given my involvement with Asia Online.
The Sheer Complexity
As you can see from the official description, Moses is an open source project that makes its home in the academic research community. This link describes some of the conferences where people with some expertise and understanding of what Moses actually does convene and share information. Take a look at the program committee of these conferences to get a sense of what the focus might be. Now take a look at the “step-by-step guide”, which students in NLP are expected to be able to handle. It is what you would have to do to build an MT system if you did not have the DIY kit. Most of the instant/simplified Moses engine services in the market focus on simplifying this and only this aspect of developing an MT engine.
Clearly it would be good to have some knowledge of what is going on in the magic box BEFORE you begin, and perhaps it would even be really nice to have some limited team expertise with computational linguistics to make your exploration more useful. Remember that hiding complexity is not quite the same as removing complexity, and it would be smart to not underestimate this complexity BEFORE you begin. Anybody who has ventured into this has probably realized already, that while some of the complexity has been hidden, there is still much that is ugly and complicated to deal with in Moses world, and often it feels like the blind leading the blind.
I have noticed that many in the professional translation industry have trouble even with basics like MT system BLEU scoring, and even some alleged MT experts barely know how to measure BLEU accurately and fairly. Thus I am skeptical that LSPs will be able to jump into this with any real level of competence in the short term, that is, a level of competence that assures or at least raises the probability of business success by enhancing long-term translation productivity. Though it is possible that a hardy few will learn over the next 2-5 years, it is also clear that NLP and computational linguistics are not for everyone. The level and extent of knowledge required is simply too specialized and vast. As Richard Feynman said: “I think it’s much more interesting to live not knowing, than to have answers which might be wrong.” (Though he was talking about beauty, curiosity and mostly about doubt).
Alon Lavie, AMTA President, CMU NLP professor and President of Safaba (which develops hosted MT solutions that are largely built on top of Moses) says:
“I am of course a strong supporter, and am extremely enthusiastic about Moses and what it has accomplished in both academic research and in the commercial space. I also think there is indeed a lot of value in the various DIY offerings (commercial and Achim's M4L efforts). But these efforts primarily target and solve the *engineering complexity* of deploying Moses. While this undoubtedly is a critical bottleneck, I think there is a potential pitfall here that users that are not MT experts (the vast majority) would come to believe that that's all it takes to build a state-of-the-art MT system. The technology is actually complex and is getting more complex and involved to master.
Users may be disappointed with what they get from DIY Moses, and more detrimentally, become convinced that that's the best they can accomplish, when in fact letting expert MT developers do the work can result in far better performance results. I think this is an important message to communicate to potential users, but I'm not sure how best to communicate this message.”
Thus, I will join Alon in trying to convey the message that Moses is a starting point in your exploration of MT and not the final answer, and that experience, expertise and knowledge matter. Perhaps, a way to understand the complexity issue better, is to use some analogies.
The sewing machine/tailor analogy: Moses can perhaps be viewed as a very basic sewing machine. You still need to understand how to cut cloth, stitching technique, fabric and lining selection, measurement, pocket technique (?), final fit modifications and so on to make clothes. Tailors do it better and expert tailors that only focus on men's suits do it even better than you or I would with the same sewing equipment. The closest to a ready-made suit would be the free MT engines, except in this analogy they are only available in one size. Expertise really does matter folks if you want to customize-to-fit.
The DIY car analogy: In this analogy, Moses is the car engine and perhaps a very basic chassis, one that would be dangerous on a highway or bumpy roads. The DIY task is to build a car that can actually be used as transportation. This will require some understanding of auto systems design, matching key components to each other, tires, braking systems, body design and so on. Finally you also need to learn to drive and you would want the car to turn right when you want to. Again, expert mechanics are more likely to be successful even though there are some great DIY kits out there for NASCAR enthusiasts.
The Learning Curve
Even if you do have a team with some NLP expertise, remember that working with any complex technology involves a process of learning and usually an apprenticeship to get to a point of real skill. The people who build SMT engines at Microsoft, Google, Asia Online and other MT research teams have built thousands of MT engines during their MT careers. The skills developed and lessons learned during this experience are not easily replicated and embedded into open source code. Failure is often the best teacher and most of these teams have failed often enough to understand the many pitfalls along an SMT engine development path. To expect that any “instant” Moses solution is going to capture and encapsulate all of this is naïve and somewhat arrogant. This is the kind of skill where expertise builds slowly, and comes after much experimentation across many different kinds of data and use case scenarios. Just as professional tailors and expert mechanics are likely to produce better results, MT experts who work across many different use scenarios are likely to produce much better results than a do-it-yourself enthusiast might. These results translate into long-term savings that should far exceed an initially higher price.
The objective of MT deployment for most LSP users is to increase translation productivity. (Very few have reached the next phase where they are translating new content that would never be translated were it not for MT). Thus, getting the best possible systems, ones that produce the highest possible MT output quality, really matters for achieving this core objective of measurable translation productivity. To put this in simpler terms, the difference between instant Moses systems and expert MT systems could be as much as 4,000 words/day versus 10,000+ words/day. Expert MT engine developers like Asia Online have multi-dimensional approaches, NLP skills, and many specialized tools in place to extract the maximum amount of information out of the data they have available. The use of these tools is guided by two team members with deep expertise on the inner workings of Moses and SMT in general. The learning process driving the development of these comprehensive tools takes years, and they enable Asia Online custom systems to consistently produce translation output superior to that of the free online MT engines. One team member has literally written the book on SMT and created Moses, and thus, one could presume, is quite likely to have the expertise to develop better MT systems than most.
I have already heard from several translators who, when asked to post-edit “instant Moses” output they know is inferior, simply run the same source material through Google/Bing and edit that instead, to improve their own personal productivity and save themselves some anguish. So if your Moses engine is not as good as these public engines you will find that translators will simply bypass it whenever they can. And they may not actually tell you that they are doing this. Post-editors will generally choose the best MT output they can get access to, so beware if your engine does not compare well. And buyers, insist on seeing how these instant MT engines compare to the public free engines on a meaningful and comprehensive test set, not just 100 or so sentences.
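For readers who want to see what such a comparison involves mechanically, here is a minimal sketch of a corpus-level BLEU calculation in Python. It is only an illustration under simplifying assumptions: a single reference per sentence, plain whitespace tokenization, and no smoothing. The function names and the example sentences are made up for this sketch, and a real evaluation would use an established scorer, identical tokenization for every engine, and a test set far larger than a handful of sentences.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per segment (illustrative only)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokenization
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any n-gram order has no match
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


# Hypothetical outputs from two engines scored against the same reference.
reference = ["the cat sat on the mat today"]
engine_a = ["the cat sat on the mat now"]
engine_b = ["a cat is on the mat now"]
print(corpus_bleu(engine_a, reference), corpus_bleu(engine_b, reference))
```

The point of the sketch is that the score is only meaningful when both engines are measured on the same test set, with the same references and the same tokenization; change any of those and the numbers are no longer comparable.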
However, I am also aware that some Moses initiatives have produced great results, e.g. Autodesk (for you doubters on the value of PEMT, here is clear evidence from a customer viewpoint), and here I would caution against extrapolating these results and expecting to achieve the same with any and every Moses attempt. The team that produced these systems was more technically capable and knowledgeable than most, and I am also aware that their training data was better suited for SMT than most of the TM you will find in the TDA or on the web. And even here, I would argue that MT experts would probably produce better results with the same data, especially with the Asian languages, where other support tools and processes become much more imperative.
As others have stated before me, the global population of people who actually understand how these data-driven systems work is really quite tiny, minuscule in fact. If you are building Moses systems you should be comparing yourself to the public free engines, as you may find that all your effort was much ado about nothing. One would hope that you will produce systems that compare favorably to these “free” options. And if your competition includes the lads and lassies at Microsoft and Google, one would hope that you know more about how to do this than pushing the instant Make-my-engine button. The financial cost of ignorance is substantially higher than most are able to define in terms of lost opportunity costs, and learning costs (a.k.a. mistakes) should be factored into a real TCO (Total Cost of Ownership).
The bottom line: Success with SMT requires very specialized skills that include some NLP background, massive data handling skills, knowledge of parallel computing, linguistic data management tools, and corpus analysis and linguistic structural analysis capabilities for optimal results, not to mention a culture that nurtures collaboration with translators.
The Data, the Data, the Data
Moses is a data-driven technology and thus is highly dependent on the data that is used. Data volume is required to get good output from the systems and thus users have to gather the data from public sources and it is important to normalize and prepare the data for optimal performance. Most LSPs will not have the data or skills needed to gather the data in an optimal way. I have seen two major SMT engineering initiatives up close, one where training data was scraped off the web by spider programs, and another where data was not allowed to go into training data if it had not passed several human linguistic quality assessment checks. The differing impact of these approaches is quite striking. The dirty data approach requires substantially larger amounts of new data to see any ongoing improvement, while the clean data approach can produce compelling improvement results with much less new data.
This ability to respond to small amounts of corrective feedback is a critical condition for ongoing improvement, and for continued improvements in productivity, e.g. raising PEMT throughput up to 15,000+ words/day in the shortest time possible. I have already stated that I was surprised how little attention is paid to data quality in instant Moses approaches presented at TAUS. And while data volume matters, for high quality domain-focused systems, the data you exclude may be more important than what you include. We are in a phase of the web's development where “Big Data” is solving many semantic and linguistic problems, but we have also seen that more data is not always the solution to better MT systems.
The upfront data analysis and data preparation, and the development of “good” tuning and test sets, are critical to the short and long-term quality and evolution of an MT engine. This is something that takes experience and experimentation to understand and be skillful at. Experts can add huge value at this formative stage. Remember that this is a technology where “Garbage In Garbage Out” (GIGO) will be particularly true. Many who understand how bad TM can get don’t need any further elaboration on this, even though some people in the SMT community remain unconvinced that clean data does matter.
Many of the people who have jumped into instant Moses do not realize that getting an initial MT engine to improve will require very large amounts of new data with a standard Moses approach. The rule of thumb I have heard used frequently is that you need 20-25% of the initial training data volume to see meaningful improvements. Thus, if you used 10 million words to build your system, you will need 2-3 million new words to see the system noticeably improve. So most of these instant systems are as good as they are ever going to get when the first engine is produced. In contrast, Asia Online systems can improve dramatically with as little as a few thousand sentences (a single project) and are architected and designed from the outset to improve continuously over time with focused and targeted corrective feedback.
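To make the rule of thumb concrete, here is a small back-of-the-envelope calculation in Python. The function name and its default percentages are just my restatement of the 20-25% figure above, not anything that comes from Moses itself; taken strictly, the rule gives 2.0 to 2.5 million words for a 10 million word engine, which I have rounded loosely to 2-3 million.

```python
def new_data_needed(initial_words, low=0.20, high=0.25):
    """Return the (min, max) new words implied by the 20-25% rule of thumb."""
    return round(initial_words * low), round(initial_words * high)


# An engine trained on 10 million words.
low_words, high_words = new_data_needed(10_000_000)
print(f"{low_words:,} to {high_words:,} new words")

# The requirement scales with the size of the original training set.
for size in (1_000_000, 10_000_000, 50_000_000):
    print(f"{size:,} words trained ->", new_data_needed(size))
```

The bigger the initial engine, the bigger the pile of new data needed before anything noticeably changes, which is exactly why the first engine is usually also the last for most instant systems.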
Given the difficulty of getting large amounts of new data, users need systems that can respond to small amounts of corrective feedback and yet show noticeable improvements. One of the major deficiencies of historical MT systems has been the lack of user control, the inability of users to make any meaningful impact on the quality of raw output produced on an ongoing basis. This ability to CONTINUALLY steer the MT engine with financially feasible (i.e. relatively small) amounts of corrective feedback is a key to getting the best long-term productivity results and ROI. I think as users get more informed on how to work with this technology, they will zero in on this ability of some expert MT systems. IMO, it is the single most important criterion when evaluating competitive MT systems:
- What do I have to do to improve the raw system output quality once an initial engine is in place?
- And, how much effort/data is required to get meaningful and measurable improvements?
- Measurable = Rising average throughput of post-editors (By hundreds or thousands of more words a day, and often a multiple of what is possible with instant MT).
The issue of data cleaning is also not well understood. While it is helpful to remove tags and formatting information, it is also important to validate the linguistics and the quality of translations in addition to this to avoid GIGO results. Users should take care to keep data in the cleanest possible state (format wise and linguistically) as it can provide real long-term business production leverage on a scale greater than most TM data can. What most successful users will find is that 90%+ of the time spent in developing the highest quality engines is spent in corpus and data analysis, data preparation and organization, error detection and correction. The Moses step is a tiny component of the whole process of developing superior MT engines.
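As a rough illustration of the mechanical end of data cleaning, here is a minimal Python sketch that strips tags, drops empty and untranslated segments, removes exact duplicates and filters out segment pairs with suspicious length ratios. The function names, the 3.0 ratio threshold and the sample segments are assumptions chosen for this example; this is not how Asia Online or any particular Moses toolkit does it, and it says nothing about the linguistic validation of translations, which still needs real tools and real linguists.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for inline markup


def clean_segment(text):
    """Strip inline tags and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (source, target) pairs with a few simple heuristics."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = clean_segment(src), clean_segment(tgt)
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # probably left untranslated
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different, likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate after cleaning
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


# Hypothetical TM segments.
tm = [
    ("<b>Hello</b> world", "Bonjour le monde"),
    ("Hello world", "Bonjour le monde"),
    ("Save", "Save"),
    ("", "x"),
    ("Yes", "Oui bien sûr mais pas toujours du tout"),
]
print(clean_pairs(tm))
```

Note that the word-count ratio relies on whitespace, so it would need a different measure for languages such as Chinese, Japanese or Thai, one more reason the Asian languages demand extra tooling.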
Control & Data Security
One of the reasons why it may make sense to use Moses sometimes is to keep your data and training and translation activity REALLY REALLY private (e.g. translations of interrogation transcripts where persuasion involving water might be used). The need for security and privacy makes sense for national security applications, but I find it hard to understand the resistance some global companies have to working in the cloud when a lot of this MT and PEMT content ends up on the web anyway. For most companies cloud computing simply makes sense and spares the user from the substantial IT burden of maintaining the hardware infrastructure needed to play at the highest professional level. (Asia Online actually makes its full training and translation environment available for on-premise installation for large enterprise customers like LexisNexis who process hundreds of millions of words a day and have suitable computing and human resource expertise to handle this).
I have heard of several LSPs who have spent $10K–$20K on servers that will probably only do Moses training once a year. If you do not have the data to drive an improvement in your Moses engine, what is the point of having these kinds of servers? There is no point in trying to re-train an engine when you don’t have enough new data to make any noticeable impact. This is a technology that just makes much more sense in the cloud, for scalability, extensibility, security and effective control. Cloud solutions are often more secure than on-premise installations at LSPs because cloud service providers can afford IT staff with deep expertise in computer security, data protection and data availability management. (BTW I have also seen what happens when hacks try and manage 200 servers = not pretty). Like many other things in today’s world, IT (Information Technology) has become so specialized and complex that it makes more sense to outsource much of it, and work in the cloud rather than try and do it on your own with a meager and barely trained staff. Compare your IT staff capabilities to any cloud service provider. Even Microsoft Office is finally making the transition to the cloud. Some analysts are even saying that the shift to the cloud will challenge the dominance of older stalwarts like HP, Microsoft, Intel, SAP, RIM, Oracle, Cisco, Dell and that a third of these companies may not be around in 2020. Remember DEC and Wang? In a world where tablets, smartphones and mobile platforms will increasingly drive global commerce, the desktop/server perspective of traditional IT is already fading, and makes less sense with each passing day. It is ironic to see LSPs jumping on the “On-Premise Server” train just as it is about to reach the end of the line.
Cloud based MT can also be set up to be always improving (assuming you have more than basic Moses MT) as new data is added regularly and feedback is gathered from users, as Google and Bing do. Setting up this kind of infrastructure is a significant undertaking and most Moses users will never get to that point, but this is how the best MT systems will continue to evolve. What some may find is that their domain focused MT system may be better than the public engines in January, but by June this may no longer be true. You should realize that you are dealing with a moving target and most public engines will continue to improve. All the expert MT developers are constantly updating and enhancing their technology; most have already moved beyond the phrase-based SMT that Moses is today, and are incorporating linguistics in various forms. This can only be done because they understand what they are doing. Some of these enhancements may make it back to Moses years later, but the productivity edge will remain with experts for the foreseeable future, and I expect that in 2012 we will see several case studies where expert MT systems outperform instant Moses systems by significant margins. So my advice: be wary of any kind of instant MT solution that is not free.
I started the eMpTy Pages blog in early 2010, and one of my earliest posts was on the importance of clean data for SMT. It was blasphemy at the time to question the value of sheer data volume for SMT, but in the period since then, many have validated that working with consolidated TM from multiple sources, trusted though they may be, is a tricky affair and data quality does matter. Pooling data can work sometimes but will also fail often without cleaning and standardization.
The origin of the phrase “Let a thousand flowers bloom” is attributed to a misquote of Mao Zedong. The results for Chinese intellectuals who took Mao seriously were quite unfortunate. Fortunately we live in better times (I think?) and this phrase is not likely to have such dire consequences today. However, while a thousand MT systems may bloom (or at least be seeded), I predict that many will fade and die quickly. This is not necessarily bad, as hopefully institutional, community and industry learning will take place, and some practitioners may actually discover that they now have a much better appreciation for corpus linguistics and some of the skills that drive the creation of better MT systems. The experimental evidence from many failed experiments with Moses will also provide useful information for MT experts and further enhance the state of the art and science of MT. The learning curve for this technology is long and arduous and it may take a while for the dust to settle from the current hype, but I fully expect that by December 21st, 2012 it will be clear that expertise, experience and knowledge do matter with something as complex as Moses. Dead flowers are also used to fertilize gardens and help other plants thrive, and as long as we have the long view, we will continue to move onward and upward. I will restate my prediction, that the best MT systems will still come from close collaboration between MT experts and linguists, translators and LSPs, and from insight drawn from experience and failure.
And you can send me dead flowers every morning
Send me dead flowers by the mail
Send me dead flowers to my wedding
And I won't forget to put roses on your grave
A celebration for dead flowers