Friday, February 1, 2019

Understanding the Realities of Language Data

This is a guest post by Luigi Muzii that focuses mostly on the various questions that surround Language Data, which by most “big data” definitions and volumes is really not what most in the community would consider big data. As the world hurtles into the brave new world that is being created by a growing volume of machine learning and AI applications, the question of getting the data right is often brushed aside. Most think the data is a solved problem or presume that data is easily available. However, those of us who have been working at MT seriously over the last decade understand this is far from a solved problem. Machines learn from data, and smart engineers can find ways to leverage the patterns in data in innumerable ways. Properly used, data can make knowledge work easier or more efficient, e.g. in machine translation, recommendation, and personalization.

The value and quality of this pattern learning can only be as good as the data used, and however exciting all this technology seems, we need to understand that our comprehension of how the brain (much less the mind) works is still only in its infancy. “Machine learning” is a geeky way of saying “finding complicated patterns in data”. The comparative learning capacity and the neuroplasticity of any 2-year-old child will pretty much put most of these "amazing and astounding" new AI technologies to shame. Computers can process huge amounts of data in seconds, and sometimes they can do this in VERY useful ways, and most of what we call AI today is rarely if ever much more than this. If the data is not good, the patterns will be suspect and often wrong. And given what we know about how easy it is to screw data up, this will continue to be a challenge. In fact, DARPA and others are now discussing strategies for detecting and removing “poisoned data” on an increasingly regular basis. Much of the fear about rogue AI is based on this kind of adversarial machine learning, which can lead trusted production systems astray to make biased and even dangerous errors.

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities.

The data is your teacher. It's the data where the real value is.

Data is valuable when it is properly collected, understood, organized and categorized. Having rich metadata and taxonomy is especially valuable with linguistic data. The SMT experience has shown that much of the language industry TM was not very useful in building SMT engines without significant efforts in data cleaning. This is even more true with Neural MT. As Luigi points out, the MT engines that process 99% of the language translation being done on the planet today have used data that the language industry has had very little to do with. In fact, it is also my experience that many large-scale MT projects for enterprise use cases involve a data acquisition and data creation phase that produces the right kind of data to drive successful outcomes. While data from localization projects can be very useful at times, it is most often better to create and develop training data that is optimized for the business purpose. Thus a man-machine collaboration

Luigi has already written about metadata and you can find the older articles here and here.

This is also true for the content that drives digital experience in the modern digital marketplace. We are now beginning to understand that content is often the best salesperson in a digital marketplace and good content drives and can enhance a digital buyer and customer journey. And here too data quality and organization matters, in fact, it is a key to success in the dreams of digital transformation. Content needs to be intelligent.

 Ann Rockley said years ago:

Intelligent content is content that’s structurally rich and semantically categorized and therefore automatically discoverable, reusable, reconfigurable, and adaptable.

This is a simple and succinct description of what intelligent content is. For a more detailed look at what this is and what is needed to make it happen through the translation and globalization process, take a look at the SDL ebook on the Global Content Operating Model.


Let’s face it: Translation is widely considered a trivial job. Investigating the rationale of this view is pointless, so let’s take it as a matter of fact and focus on the ensuing behaviors, starting with the constantly increasing unwillingness to pay a fee, however honest and adequate, for a professional performance. The attitude of many industry players towards their customers does not help correct this cheapish view either.

Unfortunately, with the rise of the Internet, the idea has progressively taken hold that goods, especially virtual ones, and indeed services should be ever cheaper and better. The (in)famous zero marginal cost theory has heavily contributed to the vision of an upcoming era of nearly free goods and services, “precipitating the meteoric rise of a global Collaborative Commons and the eclipse of capitalism.” But who will pay for this? Are governments supposed to subsidize all infrastructural costs to let marginal cost pricing prevail? Well, no. But keep reading anyway.

So where is the Language Data?

Does any of this have anything to do with data? Data has entered the discussion because of big data. The huge amounts of data manipulated by the large tech corporations have led to the assumption that translation buyers, and possibly industry players too, could do the same with language data. This has also led to an expectation about the power of data that may be exaggerated or detached from reality.

Indeed, the fragmentation of the industry gives rise to a series of problems that have not yet been resolved. The main problem is that no single player is big and vertical enough to hold an amount of data, especially language data, that is large and up-to-date enough to be even remotely considered big data or to be used in any comparable way.

Also, more than 99.99 percent of translation today is performed through machine translation, and the vast majority of the training data of major online engines comes from sources other than the traditional industry ones. Accordingly, the data that industry players can offer, make available, and even use for their own business purposes is comparatively scarce and poor. In fact, the verticality of this data and the breadth of the relevant scope are totally insufficient to enable any player, including or maybe especially the largest ones, to impact the industry. A certain ‘data effect’ indeed exists only because online machine translation engines are trained with a huge amount of textual data available on the Internet regardless of the translation industry.

For these reasons, a marketplace of language data might be useless, if not altogether pointless. It might be viable, but the data available could hardly be the data needed.

For example, TAUS Matching Data is an elegant exercise, but its practicality and usefulness are yet to be proved. It is based on DatAptor, a research project pursued by the Institute for Logic, Language and Computation at the University of Amsterdam under the supervision of Professor Khalil Sima’an. DatAptor “aims at providing automatic methods for inducing dedicated translation aids from large translation data” by selecting datasets from existing repositories. Beyond the usual integrity, cleanliness, reliability, relevance, and prevalence issues, the traditional and unsolved issue of information asymmetry persists: Deep linguistic competence and subject-field expertise, as well as a fair proficiency in data management, are needed to be sure that the dataset is relevant, reliable and up-to-date. And while the first two might possibly be found in a user querying the database, they are harder to find in the organization collecting and supposedly validating the original data.

Also, several translation data repository platforms are available today, generally built by harvesting data through web crawling. The data used by the highest-resourced online machine translation engines comes from millions of websites or from the digitization of book libraries.

Interestingly, open-source or publicly funded projects like bicleaner, TMop, Zipporah, ParaCrawl or Okapi CheckMate are emerging to harvest, classify, systematize, and clean language data.
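
To give a rough idea of what "cleaning" language data can involve at its simplest, here is a minimal sketch of rule-based filtering of source/target segment pairs. It is not how bicleaner, TMop, Zipporah or any of the tools above actually work; the function name `clean_pairs`, the `max_ratio` parameter and its default value are all assumptions made up for illustration.

```python
def clean_pairs(pairs, max_ratio=3.0):
    """Drop empty, untranslated, length-mismatched, and duplicate segment pairs.

    `pairs` is an iterable of (source, target) strings.
    `max_ratio` is a hypothetical threshold on the character-length ratio
    between the longer and the shorter segment of a pair.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        # Skip pairs where either side is empty.
        if not source or not target:
            continue
        # Skip pairs where the "translation" is just a copy of the source.
        if source == target:
            continue
        # Skip pairs whose lengths differ suspiciously (likely misalignment).
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            continue
        # Skip case-insensitive duplicates of a pair already kept.
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, target))
    return kept
```

Heuristics like these only catch the most obvious noise. Judging whether a segment pair is actually a correct, relevant, and current translation still requires the linguistic and subject-field competence discussed above.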

The initiative of a self-organized group of ‘seasoned globalization professionals’ from some major translation buyers may be seen as part of this trend. This group has produced a list of best practices for translation memory management. Indeed, this effort proves that models and protocols are necessary for standardization, not applications.

TMs are not dead and are not going to die as long as CAT tools and TMSs remain the primary means in the hands of translation professionals and businesses to produce language data.

At this point, two questions arise: What about the chance of having different datasets from the same original repository available on the same marketplace? And what about synthetic data? So, the challenge of selecting and using the right data sources remains unsolved.

Finally, the coopetition paradox also applies to a hypothetical language data marketplace. Although many translation industry players may interact and even cooperate on a regular basis, most of them are unwilling to develop anything that would benefit the entire industry and keep struggling to achieve a competitive advantage.

Is Blockchain really the solution?

For all these reasons, blockchain is not the solution for a weak-willed, overambitious data marketplace.

As McKinsey’s partners Matt Higginson, Marie-Claude Nadeau, and Kausik Rajgopal wrote in a recent article, “Blockchain has yet to become the game-changer some expected. A key to finding the value is to apply the technology only when it is the simplest solution available.” In fact, despite the amount of money and time spent, little of substance has been achieved.

Leaving aside the far-from-trivial problem of the immaturity, instability, expensiveness, and complexity (if not obscurity) of the technology and the ensuing uncertainty, maybe blockchain can be successfully used in the future to secure agreements and their execution (possibly through smart contracts), though hardly for anything else in the translation business. Competing technologies are also emerging as less clunky alternatives. Therefore, it does not seem advisable to put your money in a belated and misfocused project based on a bulky, underachieving technology as a platform for exchanging data that will still be exposed to ownership, integrity, and reliability issues.

The importance of Metadata

Metadata is totally different: It can be extremely interesting even for a translation data marketplace.
The fact that big data is essentially metadata has possibly not been discussed enough. The information of interest for data-manipulating companies does not come from the actual content posted, but from the associated data vaguely describing user behaviors, preferences, reactions, trends, etc. Only in a few cases are text strings, voice data, and images mined, analyzed, and re-processed. Even in these cases, the outcome of this analysis is stored as descriptive data, i.e. metadata. The same applies to IoT data. Also, data is only as good as the use one is capable of making of it. In Barcelona, for example, within the scope of the Decode project, mayor Ada Colau is trying to use data on the movements of citizens generated by apps like Citymapper to inform and design a better system of public transport.

In translation, metadata might prove useful for quality assessment, process analysis and re-engineering, and market research, but it is much less considered than language data and even more neglected than elsewhere.

Language data is obsessively demanded but poorly curated. As usual, the reason is money: Language data is supposed to be immediately profitable, either by leveraging it through discount policies or by using it to train machine translation engines. In both cases, it is seen as a ready means to sustain the pressure on prices and reduce compensation to linguists. Unfortunately, the quality of language data is generally very poor, because curating it is costly.

Italians use the expression “fare le nozze coi fichi secchi” (make a wedding with dry figs) for an attempt to accomplish something without spending what is necessary, while Spaniards say “bueno y barato no caben en un zapato” (good and cheap don’t fit in a shoe). Both expressions recall the popular adage “There ain’t no such thing as a free lunch.”

This idea is common to virtually every culture, and yet translation industry players still have to learn it, and possibly not forget it.

We often read and hear that there’s a shortage of good talent in the market. On the other hand, many insist that there is plenty and that the only problem of this industry is its ‘bulk market’, whatever this means and however reliable those who claim this are, boast to be, or are wrongly presumed to be. Of course, if you target Translators Café, ProZ, Facebook or even LinkedIn to find matching teams, you most probably have a problem in knowing what talent is and which talents are needed today.

Let’s face it: The main reason for high-profile professionals (including linguists) being unwilling to work in the translation industry is remuneration. And this is also the main reason for the translation industry and the translation profession to be respectively considered a lesser industry and a trivial job, in an endless downward spiral.

Bad resources have been driving out the good ones for a long time now. And if this applies to linguists, who should be the ribs, nerves, and muscles of the industry, imagine what may happen with sci-tech specialists.

In 1933, in an interview for the June 18 issue of The Los Angeles Times, Henry Ford offered this advice to fellow business people, “Make the best quality of goods possible at the lowest cost possible, paying the highest wages possible.” Similarly, to summarize the difference between Apple’s long-prided quest for premium prices and Amazon’s low-price-low-margin strategy, on the assumption it would make money elsewhere, Jeff Bezos declared in 2012, “Your margin is my opportunity.”

Can you tell the difference between Ford, Amazon, and any ‘big’ translation industry player? Yes, you can.


Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm . He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.