
Monday, June 17, 2019

The Challenge of Open Source MT

MT is considered one of the most difficult problems in AI and machine learning. In the field of artificial intelligence, the hardest problems are informally known as AI-complete problems, implying that their difficulty is equivalent to that of solving the central problem of artificial intelligence itself: making computers as intelligent as people. It is no surprise that humankind has been working on machine translation for almost 70 years now and is still quite some distance from solving it.

“To translate accurately, a machine must be able to understand the text. It must be able to follow the author's argument, so it must have some ability to reason. It must have extensive world knowledge so that it knows what is being discussed — it must at least be familiar with all the same commonsense facts that the average human translator knows. Some of this knowledge is in the form of facts that can be explicitly represented, but some knowledge is unconscious and closely tied to the human body: for example, the machine may need to understand how an ocean makes one feel to accurately translate a specific metaphor in the text. It must also model the authors' goals, intentions, and emotional states to accurately reproduce them in a new language. In short, the machine is required to have a wide variety of human intellectual skills, including reason, commonsense knowledge and the intuitions that underlie motion and manipulation, perception, and social intelligence. Machine translation, therefore, is believed to be AI-complete.”
 
One of the myths that seem to prevail in the localization world today is that anybody with a hoard of translation memory data can easily develop and stand up an MT system using one of the many open source toolkits or DIY (do-it-yourself) solutions that are available. We live in a time when open source machine learning and AI development platforms are proliferating, so people believe that given some data and a few computers, a functional and useful MT system can be developed. However, as many who have tried have found out, the reality is much more complicated and the path to success is long, winding, and sometimes even treacherous. For an organization to develop an open source machine translation solution to deployable quality, a few critical elements are required for successful outcomes:
  1. At least a basic competence with machine learning technology, 
  2. An understanding of the broad range of data needed and used in building and developing an MT system,
  3. An understanding of the proper data preparation and data optimization processes needed to maximize success,
  4. The ability to understand, measure, and respond to the success and failure outcomes of model building that are very much part of the development process (a minimal measurement sketch follows this list),
  5. An understanding of the additional support tools and connected data flow infrastructure needed to make MT deployable at enterprise scale.
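
As a rough illustration of point 4, the sketch below scores a system's output on a held-out test set with an automatic metric. It assumes the third-party sacrebleu Python package and two hypothetical, line-aligned plain-text files; in practice automatic scores are only a proxy, and would be tracked across engine versions alongside human review.

```python
# A minimal sketch, not a production evaluation harness. It assumes the
# third-party "sacrebleu" package is installed and that the file names
# below (hypothetical) hold line-aligned MT output and reference translations.
import sacrebleu

def score_system(hypothesis_path, reference_path):
    """Return corpus-level BLEU for MT output measured against references."""
    with open(hypothesis_path, encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open(reference_path, encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # corpus_bleu takes the system output and a list of reference sets.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

if __name__ == "__main__":
    # Hypothetical file names; real development tracks such scores across
    # successive engine builds to separate genuine gains from noise.
    print(f"BLEU: {score_system('system_output.txt', 'reference.txt'):.1f}")
```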

The vast majority of open source MT efforts fail, in that they either do not consistently produce output that is equal to or better than easily accessed public MT solutions, or they cannot be deployed in a robust and effective manner.


This is not to say that success is impossible, but the investments and long-term commitment required are often underestimated or simply not properly understood. A case can always be made for private systems that offer greater control and security, even if they are generally less accurate than public MT options. However, in the localization industry, we see that if “free” MT solutions are available that are superior to an LSP-built system, translators will prefer to just use those. We also find that, for the few self-developed MT systems that do produce useful output quality, larger systems integration and data flow issues are often an impediment, making them difficult to deploy with enterprise-scale robustness.

Some say that those who ignore the lessons of history are doomed to repeat its errors. Not so long ago, when the Moses SMT toolkits were released, we heard industry leaders claim, “Let a thousand MT systems bloom,” but in retrospect, did more than a handful survive beyond the experimentation phase?


Why is relying on open source difficult for enterprise use?


The state-of-the-art of machine translation and the basic technology is continuously evolving and practitioners need to understand and stay current with the research to have viable systems in deployment. A long, sustained and steady commitment is needed just to stay abreast.

If public MT can easily outperform home-built systems, there is little incentive for employees and partners to use the in-house systems, and we are likely to see rogue behavior where users reject them, or to see users forced to work with sub-standard output. This is especially true for localization use cases, where the highest output quality is demanded. Producing systems that consistently perform as required needs deep expertise and broad experience. An often overlooked reason for failure is that doing it yourself requires an understanding of, and some basic expertise with, the various elements in and around machine learning technology. Many do-it-yourselfers don’t know how to do any more than load TM into an open source framework.
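
To make that point concrete, the sketch below shows roughly what “loading TM” amounts to at its simplest: pulling aligned segments out of a TMX translation memory into the parallel plain-text files most open source toolkits expect. It is an assumed, simplified first step (the language codes and file names are illustrative), and everything that actually determines system quality still lies ahead of it.

```python
# A simplified sketch of extracting parallel segments from a TMX translation
# memory. Language codes and file names are illustrative; inline markup
# inside segments is flattened, and no cleaning or filtering is done yet.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_pairs(tmx_path, src_lang="en", tgt_lang="de"):
    """Yield (source, target) segment pairs from a TMX file."""
    tree = ET.parse(tmx_path)
    for tu in tree.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                text = " ".join("".join(seg.itertext()).split())
                if text:
                    segs[lang.split("-")[0]] = text
        if src_lang in segs and tgt_lang in segs:
            yield segs[src_lang], segs[tgt_lang]

if __name__ == "__main__":
    with open("train.en", "w", encoding="utf-8") as src_out, \
         open("train.de", "w", encoding="utf-8") as tgt_out:
        for src, tgt in tmx_to_pairs("memory.tmx"):
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")
```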

While open source does indeed provide access to the same algorithms, much of the real skill in building MT systems lies in the data analysis, data preparation, and data cleansing that ensure the algorithms learn from a sound quality foundation. The most skillful developers also understand the unique requirements of different use cases and may develop additional tools and processes to augment and enhance the MT-related tasks. Oftentimes the heavy lifting for many use cases is done outside and around the neural MT models: understanding error patterns and developing strategies to resolve them.
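
As a rough illustration of what “data cleansing” means in practice, the toy filter below drops empty, duplicated, untranslated, and implausibly aligned segment pairs before training. The thresholds are arbitrary placeholders; real pipelines add language identification, encoding repair, tag handling, domain selection, and much more.

```python
# A toy corpus filter, not a real cleaning pipeline. Thresholds are
# arbitrary placeholders chosen only to illustrate the kinds of checks
# applied before any training data reaches an MT toolkit.
def clean_corpus(pairs, max_len=250, max_ratio=3.0):
    """Yield (source, target) pairs that pass basic sanity checks."""
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # empty segments
        if src == tgt:
            continue  # likely untranslated copies
        s_len, t_len = len(src.split()), len(tgt.split())
        if s_len > max_len or t_len > max_len:
            continue  # overlong segments tend to be noisy or misaligned
        if max(s_len, t_len) / max(1, min(s_len, t_len)) > max_ratio:
            continue  # implausible length ratio suggests misalignment
        if (src, tgt) in seen:
            continue  # exact duplicates add no new signal
        seen.add((src, tgt))
        yield src, tgt
```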


Staying abreast is a challenge

Over the last few years, the understanding of what the “best NMT algorithms” are has changed regularly. A machine translation system that is deployed on an enterprise scale requires an “all in” long-term commitment or it will be doomed to be a failed experiment:

  • Building engineering teams that understand what research is most valid and relevant, and then upgrading and refreshing existing systems is a significant, ongoing and long-term investment. 
  • Keeping up with the evolution in the research community requires constant experimentation and testing that most practitioners will find hard to justify. 
  • Practitioners must know why and when to change as the technology evolves or risk being stuck with sub-optimal systems. 
Open-source initiatives that emerge in academic environments, such as Moses, also face challenges. They often stagnate when the key students who were involved in setting up the initial toolkits graduate and are hired away. The key research team may also move on to other research that has more academic stature and potential. These shifting priorities can force DIY MT practitioners to switch toolkits at great expense, both in terms of time and redundant resource expenditures.
 
To better understand the gap between a basic open-source MT toolkit and enterprise MT capability requirements, consider why an organization would choose an enterprise-grade content management system (CMS) to set up a corporate website instead of a tool like WordPress. While both systems could be useful in helping the organization build and deploy a corporate web presence, enterprise CMS systems are likely to offer specialized capabilities that make them much more suitable for enterprise use.
 
 
Deep expertise with MT is acquired over time by building thousands of systems across varied use cases and language combinations. Do we really believe that a DIY practitioner who builds a few dozen systems will have the same insight and expertise? Expertise and insight are acquired painstakingly over time. It is very easy "to do MT badly" and quite challenging to do it well.



As the global communication, collaboration, and content sharing imperatives demanded by modern digital transformation initiatives become well understood, many enterprises see that MT is now a critical technology building block that enables better DX. However, there are many specialized requirements, including data security and confidentiality, adaptation to different business use cases, and the ability to deploy systems in a broad range of enterprise use scenarios. MT is increasingly a mission-critical technology for global business and requires the same care and attention given to the selection of enterprise CMS, email, and database systems. The issue of enterprise optimization is an increasingly critical element in selecting this kind of core technology.


What are the key requirements for enterprise MT?

There is more to successful MT deployment than simply being able to build an NMT model. A key requirement for successful MT development by the enterprise is long-term experience with machine learning research and technology at industrial scale in the enterprise use context.

With MT, actual business use case experience also matters since it is a technology that requires the combination of computational linguistics, data management, human translator interaction, and systems integration into organizational IT infrastructure for robust solutions to be developed. Best practices evolve from extensive and broad experience that typically takes years to acquire, in addition to success with hundreds, if not thousands, of systems.

The SDL MT engineering team has been a pioneer in data-driven MT technology since its inception with Statistical MT in the early 2000s and has been involved with a broad range of enterprise deployments in the public and private sectors. The deep expertise that SDL has built since then encompasses the combined knowledge gained in all of the following areas:

  • Data preparation for training and building MT engines, acquired through the experience of building thousands of engines across many language combinations for various use cases.
  • Deep machine learning expertise to assess and understand the most useful and relevant research in the NLP community for the enterprise context.
  • Development of tools and architectural infrastructure that allows rapid adoption of research breakthroughs, but still maintains existing capabilities in widely deployed systems.
  • Productization of breakthrough research for mission-critical deployability, which is a very different process from typical experimentation.
  • Pre- and post-processing infrastructure, tools and specialized capabilities that add value around core MT algorithms and enable systems to perform optimally in enterprise deployment settings. 
  • Ongoing research to adapt MT research for optimal enterprise use, e.g., using CPUs rather than GPUs to reduce deployment cost and system footprint. 
  • Long-term efforts on data collection, cleaning, and optimization for rapid integration and testing with new algorithmic ideas that may emerge from the research community.
  • Close collaboration with translators and linguists to identify and solve language-specific issues, which enables unique processes to be developed to solve unique problems around closely-related languages. 
  • Ongoing interaction with translators and informed linguistic feedback on error patterns provide valuable information to drive ongoing improvements in the core technology.
  • Development of unique language combinations with very limited data availability (e.g., ZH to DE) by maximizing the impact of available data. Zero-shot translation (between language pairs the MT system has never seen) produces very low-quality systems through its very basic interlingua, but it can be augmented and improved by intelligent and informed data supplementation strategies.
  • Integration with translation management software and processes to allow richer processing by linguistic support staff.
  • Integration with other content management and communication infrastructure to allow pervasive and secure implementation of MT capabilities in all text-rich software infrastructure and analysis tools.

The bottom line

The evidence suggests that embarking on a self-managed open-source-based MT initiative is for the very few who are ready to make the substantial long-term commitment and investments needed. Successful outcomes require investment in building expertise not only in machine learning but in many other related and connected areas. The same kinds of rules that apply to enterprise decisions on selecting email, content management and database systems should apply here. Properly executed, MT is a critical tool that enhances and expands the digital global footprint of the organization, and it should be treated with the same seriousness dedicated to any major strategic initiative.

This is the raw first-draft and slightly longer, more rambling version of a post already published on SDL.COM.