Tuesday, February 28, 2017

Machine Translation at Volkswagen AG

This is a guest post by Jörg Porsiel, who manages MT at VW, that provides some perspective on the value of MT in the context of a large global enterprise's communication and information distribution needs. VW, like many other truly global enterprises, needs a large variety of business content and product information to flow easily, and quickly, to enable rapid response to emerging and ongoing business situations and needs.

We can see from this viewpoint that the global enterprise has a very pragmatic and dispassionate view of this technology, which is simply seen as a tool to enable and improve information flow, in environments that are truly multilingual, and that require large amounts of content to flow instantly where needed to enable forward business momentum.

We also see that this is a use-case scenario where the need for on-premise installation is critical and necessary for deployment for both security and performance reasons. Some may also be surprised that this MT activity has been ongoing for over a decade. This is yet one more proof point that hundreds of millions of words are being translated by MT at VW and other companies like them, who I am sure also use professional translation services for some content. However, we should reasonably expect that the bulk of the flowing translation needs are being done by MT here, and at many other global enterprises.

For those who can read German, there is a much more detailed overview of MT at VW at this link (19 pages, in fact).

The emphasis in the post below is all mine.



Initial situation

Volkswagen AG is the world’s largest car manufacturer, selling more than ten million vehicles in 2016. More than 625,000 people work for the Group’s thirteen brands at more than 120 locations worldwide, with approximately 280,000 in Germany alone. The headquarters of the brands VW Passenger Cars, VW Commercial Vehicles, Audi, Porsche and MAN are located in Germany. Bentley has its headquarters in Great Britain, Bugatti in France, Škoda in the Czech Republic, Seat in Spain, Ducati and Lamborghini in Italy, and Scania in Sweden.
Countless teams in a variety of disciplines work on projects around the clock simultaneously or sequentially, distributed among various time zones around the world and on all continents. In addition to German, the primary languages for communication during this work are English (as lingua franca), Spanish, Chinese and French, but of course Brazilian Portuguese, Italian, Czech, Russian and Polish and many other languages are used as well. This results in a constant stream of information – millions of emails and terabytes of data from, for example, systems for simulation, diagnostics and infotainment – circulating all day, every day, within the Group-wide Intranet or entering it from outside.

To facilitate, and especially to accelerate, the exchange of information in numerous languages across continents, it was decided in 2002 to introduce machine translation into the company. In the interest of data security, the application had to be available strictly within the VW Intranet, which is accessible throughout the Group. This was intended to close the gaps in security created by the use of such programs on the Internet. Volkswagen AG has been operating a rule-based system from the German company Lucy Software and Services GmbH. The functionality of the system has been expanded since then, and its quality has steadily improved, for example, through the addition of in-house terminology in collaboration with various departments at Volkswagen.

Currently, eight language pairs covering the most important Group languages are available. Additional functionality includes the translation of entire documents, for example, in MS-Office, XML and PDF formats, as well as web pages. There is also an interface to a web service enabling systems to access machine translation. These systems, such as vehicle diagnostics, are generally characterised by a very high volume of data in various languages. 

Why use machine translation?

The speed at which machine translation works and produces results is a significant advantage. Of course, this speedily generated output must first be considered independently of the quality of the translation. The speed – dramatically faster than human translation – proves itself to be an important component in the optimisation of multilingual communication processes in the business environment, especially in conjunction with terminology for each specific field and accompanied by upstream and/or downstream quality control. The fast availability of such raw or gist translations in conjunction with the previous knowledge and expertise in the field of the recipient generally results in significantly faster decision-making processes. Although the quality of a machine translation is in general lower than that of a human translation, depending on the quality of the source text, it is often sufficient for taking specific measures. In the event that the quality of the machine translation is not adequate, it can be improved by pre-editing the source text and/or post-editing of the output as necessary.

Management of Expectations

The larger and more international a company is, and the more heterogeneous the workforce, the more important it becomes to develop a communication concept which is implemented prior to the introduction of, and in support of, machine translation in order to familiarise the users with its advantages and disadvantages. Such a concept is necessary to ensure the long-term acceptance by the users because the better the users’ understanding of the application’s strengths and weaknesses, the greater the acceptance and resulting level of use as well as constructive cooperation in the further development of the service within the company.

At a company as large as Volkswagen and with such a large number of potential users, it is impossible to satisfy everyone. In other words, not every wish can be satisfied, nor can all file formats, language pairs and technical fields be provided at the same level of quality. This is neither technically possible nor, from the perspective of a cost-to-benefit analysis, economically viable. Thus, the aim of providing such a service can be only to satisfactorily meet a “representative average” of the statistically expected needs with suitable quality within the limits of the available personnel and finances.

The management of expectations is also important for explaining the uses of machine translation in day-to-day work to groups with experience in (computational) linguistics and translation and, if necessary, to counter the reservations regarding translation quality and, in particular, job security. As part of such a concept, it must also be explained what machine translation can do especially well and under what conditions it must be used, for example, in conjunction with specialised terminology management, controlled language and, as needed, post-editing.

On the other hand, it must be clearly emphasised that machine translation is neither a panacea nor an all-purpose tool for every translation task, and that it is intended not to eliminate jobs but to reduce costs in general. More importantly, it must be emphasised what machine translation cannot do, and why: for example, certain types of text are not suitable for machine translation or are (or could be) crucial for legal reasons.

Furthermore, it must be explained, for example, with a cost-to-benefit analysis, that machine translation cannot work in the long term without qualified and continuous support from specialists, and that there is no such thing as a “one-size-fits-all” solution. In other words, the idea that a one-time installation of a machine-translation program without supporting measures such as IT support, terminology, controlled language and so on is adequate to enable translation from every language into every language, regardless of the subject area, type of text and target group, is out of touch with reality. This should also make it clear that even just the technical operation and IT support require continuous and secure financing, without which the introduction of such a program is not prudent.

Further development

In addition to the continuous improvement in the quality of the translation results through the inclusion of additional terminology from a multitude of different technical fields, it is planned to link the service by means of an interface with, or to integrate it into, still more systems. Furthermore, it is planned to extend the available languages to Chinese, Portuguese, Czech and Polish, for example.


Jörg Porsiel is a Machine Translation Project Manager at Volkswagen Headquarters in Wolfsburg, Germany. A translation graduate of Heidelberg University, he has also studied in Brussels, Edinburgh and Metz. Since 1992 he has been working in translation, terminology management and foreign language corporate communication. He started working for VW in 2002 in the field of controlled language and as of 2005 is responsible for the Group’s internal machine translation service.

Wednesday, February 8, 2017

The Obscure and Controversial Importance of Metadata

This is a guest post by Luigi Muzii on the importance and value of metadata. Many of us in the MT field are often astonished at how valuable language data resources are left in states of disuse, disrepair and complete lack of organization, thus often rendering these valuable resources useless or at best cumbersome to work with. With many translation agencies and even large enterprises, core language data assets lie in what many of us consider to be ruins.

As we now enter the AI driven phase of so many industrial processes, the value of metadata increases by the day, and enables much of the work of posterity to be useful for future projects and work. Data matters and clean organized data can become a foundation for competitive advantage, especially in the world of business translation where the value of work is still counted by the word.

This post urges that all of us in the language industry start thinking about and implementing metadata strategies. Though there are elements of this present in some TMS systems, from my viewpoint, they still focus largely on yesterday's problems and are ill-prepared to deal with emerging translation challenges which are much more tightly integrated with machine learning and AI processes.

I have added some graphics and a video snippet to Luigi's article and the emphasis is all mine.


Metadata is data that provides information about other data to describe a resource for discovery and identification or management.

Describing the contents and context of data increases its usefulness and improves the user experience over the useful life of the data.

Metadata also helps to organize, identify, and preserve resources.

Metadata can be created either by automated information processing or by manual work. Elementary metadata captured by computers can include information about when a resource was created, who created it and when it was last updated together with file size and file extension information.
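As a minimal sketch of the elementary, machine-captured metadata described above, the following Python snippet reads what the operating system already records about a file. The function name `basic_file_metadata` and the dictionary keys are illustrative assumptions, not part of any translation tool. Creation time and author are left out on purpose, because file systems do not expose them in a portable way.

```python
from datetime import datetime, timezone
from pathlib import Path


def basic_file_metadata(path):
    """Collect elementary metadata that the file system already tracks.

    Hypothetical helper for illustration only; the key names are assumptions.
    """
    p = Path(path)
    info = p.stat()
    return {
        "name": p.name,
        # Lower-cased so that ".DOCX" and ".docx" are treated alike.
        "extension": p.suffix.lower(),
        "size_bytes": info.st_size,
        # Last modification time as an ISO 8601 timestamp in UTC.
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
```

Anything beyond this, such as the client, the subject field or the intended audience, still has to come from a person or from a connected system.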

And yet, those who would actually benefit most from the existence of metadata oftentimes undervalue and disregard it.

The Semantic Web is a perfect example of this attitude. It is still largely unrealized, although widely imagined and projected as being of great value. It is supposed to thrive on machine-readable metadata about pages, which includes information on the content these pages hold and present, and the inter-relations between them and other pages.

The machine-readable descriptions enable automated agents to attribute meaning to content, and shape the knowledge about it, and exploit it accordingly. Many of the technologies for such agents already exist and metadata is the only part missing.

In a 2001 essay, Cory Doctorow illustrates the problems with metadata, especially its fragility. One of Doctorow’s issues merits special attention: People are lazy. Laziness is the major reason for not adding metadata to content. Nonetheless, metadata is elusive and perishable. In fact, it may become obsolete when related data becomes irrelevant in time and if it is not updated with new insights.

How, then, is metadata relevant and important in translation? When setting up a translation project, first prep operations should involve drawing a schematic through basic details. Alas, rarely are translation project managers taught to never skip this task, so often scarce and partial translation project data rapidly becomes totally irrelevant.

Translation project managers are not the only culprits of this contempt, though. Today, almost all of them heavily rely on TMSs, and it’s a pity that most TMSs do not provide any mechanism to have a translation project charter properly compiled with the relevant metadata. TMS vendors could help this with profiler tools using dynamic list properties to associate values with external database tables, possibly from connected CMSs, CRMs, ERP systems, etc.

Also, not only are translation project charters important for LSPs, their staff, and vendors, to properly run project-related tasks; the information they store is essential to read and understand the data produced along the job, including—and above all—language data. In fact, language data are nearly useless without relevant metadata.

Metadata allow language data to be collected, organized, cleaned and explored for potential re-use, especially with MT engine training.

At the same level, to produce truly useful stats and get practical KPIs, automatically-generated project data is insufficient for any business inference whatsoever. Processing any set of data, however vast, is not enough per se to qualify as “big data.” In fact, to be such, data sets must be so extensive in volume, speed, variety and complexity as to require specific technologies and analytical methods for value extraction. It is hardly possible for traditional data processing applications and techniques to deal with them adequately. This is why translation big data is crap; no matter how much data a TMS or an LSP processes, it is not organized enough or durable enough for a reliable outlook.

Anyway, along with allowing the collation of relevant data for effective KPIs, translation metadata may help LSPs and their people to better understand the language data they hold, consume, and produce. Indeed, translation tools all add different types of metadata to language data during processing. Translation memory management software adds metadata to every translation unit; the anthology of descriptive data in a terminological record is metadata on a term, along with the data that is automatically generated; and the annotations to a collection of reference material are metadata for knowledge representation.

For example, where the software can recognize the type of a source file, and automatically stamp an annotation, this may be useful in subsequent handling. The same goes for the author’s name or the last update date, the project code from the project charter, possibly generated automatically or semi-automatically based on the project manager’s input, possibly from a list. And think of adding metadata to segments indicating constants or untranslatable elements.
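The kind of automatic stamping described above can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Segment` class, the `annotate` function, the extension-to-type table and the placeholder pattern are assumptions, not the behaviour of any particular TMS or CAT tool.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical mapping from file extension to a human-readable source type.
FILE_TYPES = {".docx": "MS Word", ".xml": "XML", ".pdf": "PDF"}

# Assumed placeholder syntax: "{0}"-style and "%s"/"%d"-style tokens.
PLACEHOLDER = re.compile(r"\{\d+\}|%[sd]")


@dataclass
class Segment:
    text: str
    metadata: dict = field(default_factory=dict)


def annotate(text, source_file, project_code):
    """Stamp a segment with source type, project code and placeholder info."""
    suffix = Path(source_file).suffix.lower()
    placeholders = PLACEHOLDER.findall(text)
    # What is left once placeholders are removed; if it has no letters,
    # the segment is flagged as untranslatable (numbers, codes, symbols).
    remainder = PLACEHOLDER.sub("", text)
    return Segment(
        text,
        {
            "source_type": FILE_TYPES.get(suffix, "unknown"),
            "project_code": project_code,
            "placeholders": placeholders,
            "untranslatable": not any(ch.isalpha() for ch in remainder),
        },
    )
```

For instance, `annotate("{0} / 4711", "parts.xml", "P-001")` would be flagged as untranslatable, while `annotate("Part no. {0}", "parts.xml", "P-001")` would not, although both record the `{0}` placeholder.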

Metadata pertains to a field where standards typically apply and help. A lot. Alas, as we have learned from experience, especially in the last few years, standards are much talked about but little loved and poorly applied when translation software is involved. However sad, the reason is quite simple: being so vertical, the sector deeply reflects the state of the reference industry; it is highly fragmented with no company having the critical mass to rule the market. This translates into a superficial acceptance of standards but with different implementations to force customer lock-in.

In the immediate future, it will be possible to use a neural language model trained on the source language to automatically reckon how similar a text is to training data, and hence its suitability for MT. If we invest in enriching the input with profile metadata and linguistic annotation for NMT systems, the whole task may be even simpler.
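To make the idea of scoring a text against training data concrete, here is a deliberately crude stand-in: instead of a neural language model, it measures what share of a text's character trigrams also occur in the training material. This is not the approach described above, only a rough sketch of the same intuition, and the function names `char_ngrams` and `coverage` are assumptions made for this example.

```python
def char_ngrams(text, n=3):
    """Return the character n-grams of a text after normalising whitespace."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def coverage(text, training_texts, n=3):
    """Share of the text's character n-grams that also occur in training data.

    A value near 1.0 suggests the text resembles the training material;
    a value near 0.0 suggests it does not. Texts shorter than n yield 0.0.
    """
    seen = set()
    for sample in training_texts:
        seen.update(char_ngrams(sample, n))
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0
    return sum(g in seen for g in grams) / len(grams)
```

A real system would use model-based scores rather than surface overlap, but even this simple ratio shows how a text could be triaged before it is sent to an MT engine.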

So, let’s start standardizing, entering and using metadata. For good.


Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm . He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.

Friday, February 3, 2017

Most Popular Posts of 2016

This is a ranking of the most popular posts of 2016 on this blog according to Blogger, based on traffic and/or reader comment activity. Popular does not necessarily mean best: I have seen in the past that some posts that do not resonate at first have real staying power and continue to be read actively years after the original publishing date. We can see from these rankings that Neural MT certainly was an attention grabber in 2016 (even though I think that, for the business translation industry, Adaptive MT is a bigger game changer), and I look forward to seeing how NMT becomes better fitted to translation industry needs over the coming year.

I know with some certainty that the posts by Juan Rowda and Silvio Picinini will be read actively through the coming year and on, because they are not just current news that fades quickly like the Google NMT critique, but rather they are carefully gathered best practice knowledge that is useful as a reference over an extended period. These kinds of posts become long-term references that provide insight and guidance for others traversing a similar road or trying to build up task-specific expertise and wish to draw on best practices.

I have been much more active surveying the MT landscape since achieving my independent status, and I have a much better sense for the leading MT solutions now than I ever have. 

So here is the ranking of the most popular/active posts over the last 12 months.

The Google Neural Machine Translation Marketing Deception

This is a critique of the experimental process and related tremendous "success" reported by Google in making the somewhat outrageous claim that they had achieved "close to human translation" with their latest Neural MT technology. It is quite possible that the research team tried to rein in the hyperbole but were unsuccessful and the marketing team ruled on how this would be presented to the world.

A Deep Dive into SYSTRAN’s Neural Machine Translation (NMT) Technology

This is a report of a detailed interview with the SYSTRAN NMT team on their emergent neural MT technology. This was the first commercial vendor NMT solution available this last year and the continued progress looks very promising.

This was an annual wrap-up of the year in MT. I was surprised by how actively this was shared and distributed, and at the time of this post it is still the top post as per the Google ranking system. The information in the post was originally presented as a webinar with Tony O'Dowd of KantanMT. It was also interesting for me as I did some research on how much MT is being used and found that, on any given day, as many as 500+ billion words are being passed through public MT engines somewhere in the world.

This is a guest post by Silvio Picinini, a Machine Translation Language Specialist at eBay. The MTLS role is one that I think we will see more of within leading-edge LSPs as it simply makes sense when you are trying to solve large-scale translation challenges. The problems this eBay team solves have a much bigger impact on creating and driving positive outcomes for large-scale machine translation projects. The MTLS focus and approach is equivalent to taking post-editing to a strategic level of importance, i.e., understanding the data and solving 100,000 potential problems before an actual post-editor ever sees the MT output.

5 Tools to Build Your Basic Machine Translation Toolkit 

This is another post from the MT Language Specialist team at eBay, by Juan Martín Fernández Rowda. I expect that it will become a long-term reference article, actively read even a year from now, as it describes high-value tools that a linguist should consider when involved with large- or massive-scale translation projects where MT is the only viable option. His other contributions are also very useful references and worth close reading.

This is yet another guest post, this time jointly with Luigi Muzii, that rapidly rose and gained visibility, as it provided some deeper analysis, and hopefully a better understanding of why private equity firms have focused so hard on the professional translation industry. There is a superficial reaction by many in the industry that seems to interpret this investment interest by PE firms as being so bullish on "translation" that they are interested in funding expansion plans at lackluster LSP candidates. A deeper examination suggests that the investment clearly is not just to give money to the firms they invest in, but that many large LSPs are good "business turnaround and improve" candidates. This suggests that one of these "improved" PE LSP investments could become a real trailblazer in terms of re-defining the business translation value equation, and begin an evolutionary process whereby many marginal LSPs could be driven out of the market. However, we have yet to see even small signs of real success by any of the PE supervised firms thus far in changing and upgrading the market dynamics.


Comparing Neural MT, SMT and RBMT – The SYSTRAN Perspective

This is the result of an interesting interview with Jean Senellart, CEO and CTO at SYSTRAN. In my conversations with Jean, I realized that he is one of the few people in the "MT industry" who have deep knowledge and production-use experience with all three MT paradigms (RBMT, SMT, and NMT). There is a detailed article that describes the differences between these approaches on the SYSTRAN website for those who want more technical information.

Jean Senellart, CEO, SYSTRAN SA


This was a guest post by Vladimir “Vova” Zakharov, the Head of Community at SmartCAT. It examines some of the most widely held misconceptions about computer-assisted translation (CAT) technology. SmartCAT is a very interesting new translation process management tool, that is free to use, and takes collaboration to a much higher level than I have seen with most other TMS systems. And interestingly, this post was also very popular in Russia.

The single most frequently read post I have written thus far is one that focuses on post-editing compensation. It was written in early 2012, but to this day it still gets at least 1,000 views a month. This, I suppose, shows that the industry has not solved basic problems, and I notice that I am still talking about this issue in my outlook on MT in 2017. It remains an issue that many have said needs a better resolution. Let's hope that we can develop much more robust approaches to this problem this year. As I have stated before, there is an opportunity for industry collaboration to develop and share actual work-related data to develop more trusted measurements. If multiple agencies collaborate and share MT and PEMT experience data, we could get to a point where the measurements are much more meaningful and consistent across agencies.

Exploring Issues Related to Post-Editing MT Compensation