I have always found the opinions of Luigi Muzii, aka Il barbaro (The Barbarian), on the "translation industry" interesting, and I have often felt that his blog never got
the attention it deserved. Possibly this is because he often writes with frequent references to Roman historical precedent, in what I have been told is an
Italian scholarly style, and perhaps because English is not his preferred language for communicating complex thoughts.
His willingness to state the common-sense obvious sometimes makes him unpopular, especially with the MT naysayers, e.g.:
"Moreover, translation data - i.e. project data - has a limited lifespan and, at some point in time, it becomes outdated, possibly inaccurate, and
definitely irrelevant."
I think that many in the "translation industry" still fail to realize that the bulk of the material they translate - the manuals, documentation and generic business content on websites focused on
the wonder and excellence of corporate goods and services - becomes less relevant and less
valuable with each passing day, no matter how well it has been translated. They clearly do not like to hear someone suggest that this might be so. And it
is only natural that buyers of business translation services might ask, "Is there a cheaper way to do this?", since they are acutely aware of
the rapidly declining relevance and low value of much of the corporate content they produce.
His willingness to raise fundamental questions makes his voice worthy of attention, for me at least. Anyway, I am glad that he sent me this
brief overview, which I am told will be expanded later, and I hope that he will return to this blog with other observations on the
business of translation, whether related to MT or not.
Judge a man by his questions, not his answers … Voltaire
"On his house" is the equivalent of "De domo sua" in English, which we largely use in Italy to label a cause one pleads for his own benefit. In this case,
"domus mea" (my home) is the rationale for hiring a consultant.
Implementing machine translation (MT) effectively is not an easy task for a translation agency, or even for a large translation buyer whose core business
is not translation. Still, with 99.9% of the translation done every day now performed by machines, it is
perfectly reasonable that businesses are eager to have more and more of their content translated quickly and cost-effectively using the technology that is
available.
Speed and cost-effectiveness are crucial to attaining competitive advantage in global markets, and we should understand that not all content is
equal in terms of how it should and can be properly translated. Nor does all content deserve the same TEP (Translate > Edit > Proof) consideration and
production process.
Most enterprises, especially SMEs seeking a foothold in international markets, see machine translation as a viable solution, but they do not have the
knowledge and skills needed to deal with the challenging effort of implementing MT properly. Logically, then, an SME would seek help from
translation agency vendors, assuming they would know how to use and implement MT technology.
Unfortunately, most translation agencies do not have these skills either. The translation technology market - which includes MT - is
noticeably smaller than the broader language services market and is somewhat under-served in terms of competent service providers, yet more
and more language service vendors have been adding MT consulting to their offerings simply as a means to generate new revenue. This
can only result in a situation somewhat like that shown below.
The following very brief tips are intended as basic practical guidelines for enterprises interested in taking advantage of MT.
Tip #1
Hire an independent consultant. If you are a language service vendor, being supported by an independent advisor will spare you the cost and the hassle of
an internal department while offering your customers the peace of mind of an unbiased opinion and professional help.
Tip #2
Either ask for or run a preliminary analysis. Through interviews with staff and examination of the current modus operandi (MO), an independent consultant
can help you objectively assess your processes and facilities in view of the potential implementation and integration of a machine translation platform,
and can also advise on any ancillary services that may be needed.
Tip #3
Either ask for or draft a report that provides the following:
A review of your goals,
The outcomes of the preliminary analysis,
An appraisal of the suitability of the current modus operandi (MO) for machine translation, and
An outline of an exploratory program.
Use flowcharts and/or sequence diagrams to map your modus operandi, starting with the future MO (FMO), then the present MO (PMO) and finally the transitional MO
(TMO). By overlaying the three diagrams you should be able to track the differences, make adjustments to streamline the transition
and ensure a smooth update of the existing processes.
Tip #4
Write down the requirements that have been collected during the interviews for the preliminary analysis. Prepare a grid with the technical specifications
(TS) and the statement of work (SoW) detailing what is to be done.
Tip #5
Write a clear template for a request for proposals (RFP) including the TS and the SoW and send it out to a group of selected MT vendor candidates, asking
them to fill in the grid.
Tip #6
When hiring the advisor, make sure he or she has comprehensive insight into, and understanding of, the general MT market and its players, in order to help
identify the best candidates.
Do not forget
If you are a translation agency, discuss the report and the proposals with the advisor and with your customer. Provide a program, a configuration plan and a
training/recruiting plan covering all tasks, from benchmarking to data qualification and preparation, from vendor selection to proposal evaluation,
from negotiation to implementation, and from testing to training.
Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002 in the translation and localization industry
through his firm. He focuses on helping customers choose and implement the best-suited technologies and redesign their business processes for the greatest effectiveness of
translation and localization related work.
This link provides access to his other blog posts.
This is an important subject, and one that needs ongoing examination and continuing study to get to real insight and greater
process efficiency. I hope that this post by Silvio Picinini will trigger
further discussion, and I also invite others who have information and opinions
to share on this subject to step forward. In my observation of MT output over
time, I have seen that MT produces a greater number of actual errors, but the types of errors most often generated by MT are easy to
spot and clean up, unlike the incorrect or inconsistent terminology and the occasional misunderstood source that may be hidden in the clear
grammar and flow of a human translation. These human errors are much harder to find, and
not as easy to correct without multiple stages of review by independent reviewers.
-----
It is common to see a focus on the errors that
machine translation makes. But maybe there are also strengths in MT that can
benefit translators and customers. So I wrote an article in Multilingual
magazine comparing errors made by MT and human translators. The article is
below, reproduced here with permission.
The original article was published in Multilingual magazine, July-August 2016, "Errors in MT and human translation," pp. 46-50, and is reproduced here with permission.
Post-editing could be defined as the correction of
a translation suggested by a machine. When a translator receives a suggested
translation from Google, Bing or another machine translation (MT) engine, the
translator is working as a post-editor of that suggestion.
If instead the translator receives a suggestion
from a translation memory (TM), the suggestion for that segment was created by a
human translator in the past. If there is no suggestion from a TM, the
translator creates the translation from scratch. If we now assume that a
human translator and an MT engine “think” differently about the translation of a
sentence, we can explore how a human translator makes different errors compared
to a statistical machine translation process. If we make translators and
post-editors more aware of the types of errors that they will find, they will be
able to improve their translation quality.
Inconsistency
A human translator will likely go through a text
and consistently translate the word “employee” with the same words in the target
language, keeping his or her favorite word in mind and using it all the time in
the translation. The MT engine, on the other hand, has no commitment to
consistency. Statistical machine translation (SMT) is a popularity contest. So,
in one sentence, the translation for “employee” may be the usual one. But the
corpus used to train the engine may have come from governments, or from the
European Parliament, for example. And governments may like to call employees
“public servants.” If that is popular enough, the MT may choose this translation
for some sentences. The translation will be inconsistent and could look
something like “Company employees may only park on the east parking lot. If that
is full, the public servant may park on the west parking lot.”
Thus, MT is more likely than humans to translate
inconsistently.
However, here are two more thoughts on this. First,
humans create inconsistencies also. If the TM contains translations made by
several translators with different preferences for words, or contains
translations created over a long period of time, chances are that there are
inconsistencies in the TMs. Those inconsistencies will be accepted by the human
translator.
Second, we are not weighing glossaries in favor of one side
or the other. Human translators could follow the glossary, or run glossary
consistency checks, and their translations would be consistent. But the same
applies to post-editing.
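To make the glossary-check idea concrete, here is a minimal Python sketch of such a consistency check. The glossary pair and the segments are invented purely for illustration; they are not from any real project.

# Minimal sketch of a glossary consistency check (illustrative only).
# The glossary and segments below are invented examples, not real project data.

glossary = {"employee": "funcionário"}  # expected EN -> PT-BR term pair

segments = [
    ("Company employees may only park on the east parking lot.",
     "Os funcionários da empresa só podem estacionar no estacionamento leste."),
    ("If that is full, the employee may park on the west parking lot.",
     "Se estiver cheio, o servidor público pode estacionar no estacionamento oeste."),
]

for source, target in segments:
    for src_term, tgt_term in glossary.items():
        # Flag segments where the source term appears but the agreed target term does not.
        if src_term in source.lower() and tgt_term.lower() not in target.lower():
            print(f"Possible inconsistency: '{src_term}' not rendered as '{tgt_term}':")
            print("  " + target)

The same kind of check works whether the target text came from a translator, a TM match or an MT engine, which is the point made above.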
Mistranslations
Many words have more than one meaning; such words are
called polysemous. The word "holder," for example, may mean a person who
owns a credit card (as in credit card holder), but it may also be a stand
designed to hold an object, such as a plate holder. If you are thinking of items
on eBay, you are most likely expecting the translation that means plate holder.
A human translator will easily know the correct translation. The machine,
however, may have a lot of training data related to finance or laws, and the
most popular translation there could be the person. The MT could choose this
most popular meaning and the result could be “fashion sponge holder with suction
cup” translated as “fashion sponge card holder with suction cup.”
Thus, MT is more likely than humans to make
mistranslation substitutions.
Words that are not to be translated
For a human translator, it is easy to know that
Guess, Coach and Old Navy are brands, and therefore should not be translated
into most target languages. As you know, SMT is a popularity contest, and the
very common word “guess” likely appears quite frequently in the corpus that
trained the SMT engine. The consequence — you guessed it — is that the MT is
likely to translate the word instead of leaving it untouched because it is a
brand. This may happen with product names as well. Air
Jordan sneakers could have the word Air translated. It could happen with the
brand Apple versus the fruit apple, but since it is a popularity contest, the
fruit may now be left untranslated instead of the iPhone having a fruit to go
with it.
Thus, MT is more likely than humans to translate
words that are not supposed to be translated.
Untranslated Words
The MT engine leaves out-of-vocabulary (OOV) words
untranslated, while humans will translate them. This favors humans, depending
on how many OOV words are present in the content to be translated. If the
content is specific and different from the corpus used to train the engine, it
is more likely that some words will not be known by the MT engine. But if the MT
engine is well trained with the same kind of subject that is being translated,
then the MT engine will minimize the number of untranslated words. On the other
hand, MT takes the collective opinion into account. It may not translate words
that are now commonly used untranslated, while a translator could be more
traditional or old-fashioned and would translate the word. Would you translate
“player” in “CD player” in your language? The word “player” used to be
translated a few decades ago, but the usage changed and the English “CD player”
is common now in many languages. The MT will learn from the corpus the most
current and frequent usage, and may do better than a human translator. Overall,
this issue still slightly favors the human side.
Thus, MT will leave more wrongly untranslated words than humans.
Gender and Number Agreement
The MT engine may learn from the corpus the correct
translation for “beautiful skirt,” and skirt is a feminine word in many
languages. However, the first time the source contains the combination
“beautiful barometer,” it will pick from what it knows and it may translate
beautiful as a feminine word. If barometer is masculine in the target language,
this creates an error of gender agreement. The MT is more likely to make this
error than a human translator, who intuitively knows the gender of objects. The
same applies to singular and plural. English uses invariant adjectives for both,
as in “beautiful skirt” and “beautiful skirts.” Thus, the MT engine may pick the
singular translation for the adjective next to a plural noun. The MT is more
likely to make a number agreement error than a human translator, who knows when
singular or plural is needed.
Thus, MT will make more grammar errors than humans.
So far we have seen several examples of situations
where humans translate better than MT engines. Now we will look at how a
“self-correcting” property of MT, created from the popularity of a translation,
can often do a better job than humans. A statistical MT engine can be seen as a
“popularity contest” where the translation that is suggested is the most popular
translation for a word or group of words present in the “knowledge” (corpus)
that trained the MT engine.
Spelling
There are two types of spelling errors: the ones
that create words that don’t exist (and can be caught by a spellchecker) and the
errors that turn a word into another existing word. You may have turned “from”
into “form” and “quite” into “quiet.” For MT to make the first type, a pure spelling error,
the corpus would have to contain more instances of the error than
of the correct form. Can you imagine a corpus that contains “porduct” 33 times and
“product” only 32 times? So MT almost never makes a spelling error of this kind.
For the second type, humans turn words into other
words, and the spellchecker will miss it. The MT engine will not make this error
because it is not likely that the corpus will contain the misspelled word more
frequently than the correct word for that context. This would require having “I
am traveling form San Francisco to Los Angeles” more frequently in your corpus
than you would have “I am traveling from San Francisco to Los Angeles” and which
one is more likely to be popular in a corpus? The correct one. This is why MT
will almost never make this kind of spelling error, while it is easy for a human
translator to do so.
Thus, humans are more likely than MT to make
spelling errors of any kind.
False Friends
False friends are words that look similar to a word
in a different language, but mean something different. One example is the word
“actual,” which means “real” in English. In Portuguese, the word atual
means current or as of this moment. So a presentation mentioning “actual
numbers” could be translated as “current numbers,” seriously changing the
meaning and causing confusion. A human translator may make this error, but the
MT would require the wrong translation for “actual” to be more popular in the
corpus than the correct translation. You would need “actual numbers” to be
translated more frequently as “current numbers” than as “real numbers.” Do you
think this would happen? No, and that is why MT almost never falls for a false
friend, while a human translator falls for it occasionally.
Thus, humans are more likely to make false friend
errors than MT.
Fuzzy Match
There are several errors that result from the use
of TMs. These memories offer the human translator translation suggestions that
are similar to the segment being translated. Similar does not mean equal, so
if the suggested translation is a fuzzy match, the human translator must make
changes. If they do not make any changes and accept the fuzzy match as it is, they
risk introducing errors. There are three sub-types of errors to mention here:
Different terms. Think of a medical
procedure where the next step is “Administer the saline solution to the
patient.” If a fuzzy match shows “Administer the plasma to the patient,” this
might risk a person’s life.
Opposite meaning. Think of “The temperature
of the solution administered to the patient must stay below XX degrees.” If a
fuzzy match shows “must stay above XX degrees,” this might risk a person’s life.
For an eCommerce environment, this type of error could be a major issue: “This
item will not be shipped to Brazil” versus “This item will be shipped to
Brazil.”
Numbers that don’t match. Fuzzy matches from
a year before may offer the translator a suggested translation of “iPhone 5”
because that was the model from a year ago. The new model is the iPhone 6. If a
fuzzy match is accepted with the wrong number, the translator is introducing an
old model.
Thus, humans are much more likely than MT to make errors
by accepting fuzzy matches.
Acronyms
MT may leave acronyms as they are, because they may
not be present in the corpus. The MT engine has the advantage of having the
corpus to clarify if an acronym should be translated as the same acronym as in
the original, if it should be translated as a translated acronym or if it should
be translated using the expanded words from the meaning of the acronym. Human
translators may make errors here. If they do research, and the research does not
clarify the meaning, the original acronym may be left in the translation. So
this is an issue that favors the MT over humans, although not heavily.
The best solution for both humans and MT is to try
to find the expanded form of the acronyms. This will help MT and humans produce
a great and clear translation.
Thus, humans are slightly more likely to make
errors translating acronyms than MT.
Terminology
MT may handle terminology remarkably better than a
human translator. If an engine is trained with content that is specific to the
subject being translated, and that has been validated by subject matter experts
and by feedback from the target audience that reads that content, the specific
terminology for that subject will be very accurate and in line with the usage.
Add to this the fact that multiple translators may have created those
translations that are in the corpus, and it becomes easy to see how an MT engine
can do a better job with terminology than a single human translator, who often
translates different subjects all the time and cannot be a subject matter expert
on every subject.
Consider the following example:
English: “In photography and cinematography, a
wide-angle lens refers to a lens whose focal length is substantially smaller
than the focal length of a normal lens for a given film plane. This type of lens
allows more of the scene to be included in the photograph.”
Portuguese machine translation: “Em fotografia e
cinematografia, uma lente grande angular refere-se a uma lente cuja distância
focal é substancialmente menor do que a distância focal de uma lente normal para
um determinado plano do filme. Este tipo de lente permite que mais da cena a ser
incluída na fotografia.”
In this example about photography, the MT already
proposed the translation of wide-angle as grande angular, which is
the term commonly used to refer to this type of lens. This translation means
approximately “large angular.” A human translator knows the translations for the
words wide and angle. The translator could then be tempted to
translate the expression wide-angle literally as “wide angle” lenses
(lente de ângulo amplo), missing the specific terminology used for the
term. The same could happen for focal length. Portuguese usually uses
distância focal, which means “focal distance." A human translator,
knowing the translation for focal and length, would be tempted to
translate this as comprimento focal and would potentially miss the
specific terminology distância focal.
The quality of the terminology is, of course, based
on the breadth and depth of the corpus for the specific subject. A generic
engine such as Google or Bing may not do as well as an MT engine custom-trained
for a subject. But overall, this is an issue that could favor the MT over
humans.
Thus, humans are more likely than MT to make errors of
inappropriate terminology.
Figure 1: MT and human translation
errors
Emerging technologies for post-editing
Now that we are aware of the issues, which are
summarized in Figure 1, we are in a better position to look at emerging
technologies related to post-editing. Post-editing work has one basic
requirement: that the translator is able to receive MT suggestions to correct,
if need be. This technology is now available, integrated into several
computer-aided translation (CAT) tools.
Considering the above, the next application of
technology is the use of quality assurance (QA) tools to find MT errors. The
technology itself is not new and has been available in CAT tools and in QA tools
such as Xbench or Okapi CheckMate. What is new is the nature of checks that must
be done with these tools. One example: in human translation you use a glossary
to ensure the consistent translation of a term. In MT, you could create a check
to find a certain polysemous word and the most likely wrong translation for it.
Case frequently has the meaning of an iPhone case, but it is often wrongly
translated with the meaning of a legal case. Your glossary entry for MT may say
something like “find case in the source and legal case in the target.” This
check is very different from a traditional check for human translation that
looks for the use of the correct translation instead of the use of the “probably
wrong” one.
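As an illustration of this "probably wrong" style of check, here is a small Python sketch for the "case" example above. The Brazilian Portuguese translations used ("capa" for a phone case, "processo" for a legal case) are assumptions chosen only to make the example concrete, not the actual checks used in any project.

import re

def flag_wrong_polysemy(source, target):
    """Flag segments where 'case' appears in the source but the target
    uses the legal-sense translation instead of the product-sense one."""
    has_case = re.search(r"\bcase\b", source, re.IGNORECASE)
    has_legal_sense = re.search(r"\bprocesso\b", target, re.IGNORECASE)
    return bool(has_case and has_legal_sense)

print(flag_wrong_polysemy("iPhone 6 case with card holder",
                          "Processo para iPhone 6 com porta-cartões"))  # True -> flag for review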
After doing post-editing and finding errors, the
last area of application of this technology is in the measurement of the
post-editing, since what makes post-editing most attractive is the promise of
increasing the efficiency of the translation process. We will briefly mention
some of the main technologies being used or researched:
Post-editing speed tracking. The time spent
post-editing can be tracked at a segment level. These numbers can then be
compiled for a file. Some examples of use of this technology include the MateCat
tool, the iOmegaT tool and the TAUS DQF platform.
MT confidence. Another technology worth
mentioning is MT confidence scores. Based on several factors, an MT engine can
express how confident it is on the translation of a certain word. If this
confidence can be expressed in terms of coloring the words in a segment, this
feature will help the post-editor focus on words with less confidence that are
therefore more likely to require a change. This feature appeared in the CASMACAT
project, as illustrated in Figure 2.
Figure 2: The CASMACAT project shows MT
color-coded according to what is most likely to need post-editing
Edit distance. A concept that is not new but
could be used more often is edit distance. It is defined as
the number of changes — additions, deletions or substitutions — made to a
segment of text to create its final translated form. Comparing the final form of
a post-edited segment to the original segment that came out of the MT engine
provides a significant indication of the amount of effort that went into the
post-editing task. The TAUS DQF platform uses edit distance scores.
We use the concept of edit distance in a broader
sense here, indicating the amount of changes. This includes the “raw” number of
changes made, but also includes normalized scores that divide the number of
changes by the length of the text being changed, either in words or characters.
The TER (Translation Edit Rate) score is used to measure the quality of MT
output, and is an example of a normalized score.
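To illustrate the normalized idea, here is a small Python sketch that computes a word-level edit distance between MT output and its post-edited form and divides it by the segment length. It is a deliberate simplification of the TER idea, not the official TER implementation, and the example sentences are invented.

def edit_distance(a, b):
    """Classic Levenshtein distance over word tokens
    (insertions, deletions and substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

mt_output = "This item will not be shipped to Brazil".split()
post_edit = "This item will be shipped to Brazil".split()

raw_changes = edit_distance(mt_output, post_edit)
normalized = raw_changes / len(post_edit)  # changes per post-edited word
print(raw_changes, round(normalized, 2))   # 1 change, ~0.14 per word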
The final quality that needs to be achieved through
post-editing defines levels of “light post-editing” and “full post-editing.”
There are ongoing discussions about how to define and measure these levels. The scores based on
edit distance may provide a metric that helps with this definition. Light post-editing is expected
to require fewer changes than full post-editing, so the edit-distance-based score for a light
post-editing should always be lower than the score for a full post-editing.
Figure 3 below shows a hypothetical example with numbers.
Figure 3: Edit distance shows how much editing is
required for a given machine-translated file.
Scores based on edit distance can be an important
number in the overall scenario of measuring post-editing, combined with the
measurements of speed. The TAUS DQF efficiency score proposed a combination of
these measurements.
Silvio Picinini has been a Machine Translation Language Specialist at eBay
since 2013. With over 20 years in localization, he has worked in quality-focused
positions for an LSP and as an in-house English into Brazilian Portuguese
translator for Oracle. A former quality engineer, Silvio holds degrees in
electrical and electronic/software engineering.
The issue of rapidly understanding MT output quality as precisely as possible BEFORE post-editing begins is an important one. Juan Rowda from eBay
provides an interesting way to make this assessment based on linguistic criteria, and offers a model that can be further refined and developed by
motivated users. Automated metrics like BLEU, used by MT system developers, provide very little of value for this PEMT effort assessment.
This approach is an example of the kinds of insightful tools that can only be developed by linguists who are engaged with a long-term MT project
and want to solve problems that can add real value to the MT development and PEMT process.
I think it is interesting that somebody outside of the "translation industry" came up with this kind of practical innovation, one that can facilitate and
greatly enhance efforts in a project involving the translation of several hundred million new words on a regular basis.
-------------------------
This article is based on a quality estimation method I developed and first presented formally at AMTA in 2015. The premise of the method is a
different approach to machine translation quality estimation (MTQE), created entirely from a linguist's perspective.
What is MTQE? Quality estimation is a method used to automatically provide a quality indication for machine translation output without depending on human reference
translations. In simpler terms, it is a way to find out how good or bad the translations produced by an MT system are, without human
intervention.
A good point to make before we go into more detail on QE is the difference between evaluation and estimation. There are two
main ways in which you can evaluate the quality of MT output: human evaluation (a person will check the translation and provide feedback)
and automatic evaluation (there are different methods that can provide a score on the translation quality without human intervention).
Traditionally, to automatically evaluate the quality of any given MT output, at least one reference translation created by a human translator is
required. The differences and similarities between the MT output and the reference translation can then be turned into a score to determine the quality
of said output. This is the approach followed by certain methods like BLEU or NIST.
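As an aside, reference-based scores of this kind are easy to compute with off-the-shelf tooling. A minimal sketch using the open-source sacrebleu package (assuming it is installed; the sentences are invented for illustration) might look like this:

# Reference-based evaluation sketch (pip install sacrebleu).
import sacrebleu

hypotheses = ["The employee may park on the west parking lot."]          # MT output
references = [["The employee may park in the west parking lot."]]        # one human reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # corpus-level BLEU on a 0-100 scale

Note that this only works because a human reference translation exists; quality estimation, discussed next, removes that requirement.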
The main differentiator of quality estimation is that it does not require a human reference translation.
QE is a prediction of the MT output quality based on certain features and attributes. These features can be, for example, the number of
prepositional or noun phrases in the source and target (and their difference), the number of named entities (names of places, people, companies, etc.),
and many more attributes. With these features, using techniques like machine learning, a QE model can be created to obtain a score that represents the
estimation of the translation quality.
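To make the "features plus machine learning" idea slightly more tangible, here is a deliberately toy sketch using scikit-learn. The feature values and quality labels are invented, and a real QE model would use far richer features and far more data.

# Toy sketch of feature-based quality estimation (not a real QE system).
from sklearn.linear_model import LinearRegression

# Features per segment: [source length, target/source length ratio, named-entity count]
X_train = [
    [12, 1.1, 0],
    [35, 0.7, 3],
    [8,  1.0, 1],
    [50, 1.4, 5],
]
y_train = [0.9, 0.4, 0.85, 0.3]  # human-assigned quality labels (0 = bad, 1 = good)

model = LinearRegression().fit(X_train, y_train)

new_segment_features = [[20, 1.05, 1]]
print(model.predict(new_segment_features))  # estimated quality, no reference translation needed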
At eBay, we use MT to translate search queries, item titles and item descriptions. To train our MT systems, we work with vendors that help us post-edit content. Due to the challenging nature of our content (user-generated, diverse
categories, millions of listings, etc.), a quick method to estimate the level of effort post-editing
will require definitely adds value to our process. QE can help you obtain important information on this level of difficulty in an automated manner.
For example, one can estimate how many segments have a very low-quality translation and could simply be discarded instead of post-edited.
What’s the purpose of MTQE?
MTQE can be used for several purposes. Its primary purpose is to estimate post-editing effort: how hard it will be to post-edit a text, and how long it might take. QE provides this information in an automated manner, at both the segment and the file level.
Segment-level QE scores can help you target post-editing efforts by focusing only on the segments it makes sense to post-edit, and identify segments with very
low-quality translations that should simply be discarded rather than post-edited. You can also estimate overall post-editing effort and time, since it is safe
to assume that segments with a low quality score take more time to post-edit. It is also possible
to compare MT systems based on QE scores and see which engine might perform better. This is especially helpful if you are trying to decide which engine
to use, or to determine whether a new version of an engine actually works better than its predecessor. QE can also answer a very common question: can I use MT for this translation project?
With Quality Estimation you can:
estimate the quality of a translation at the segment/file level,
target post-editing (choose sections, segments, or files to post-edit),
discard bad content that makes no sense to post-edit,
estimate post-editing effort/time,
compare MT systems to identify the best performing system for a given content,
monitor a system’s quality progress over time, and more.
Why a Linguist’s Approach?
Standard approaches to QE involve complex formulas and concepts that most linguists are not familiar with, such as Naive Bayes, Gaussian processes, neural
networks and decision trees, since so far QE has mostly been the domain of computational linguistics researchers. It is also true that traditional
QE models are technically hard to create and implement.
For this reason, I decided to try a different approach, one developed entirely from a linguist’s perspective. This implies that this method
may have certain advantages and disadvantages compared to other approaches, but coming from a linguistic background, my aim was to create a process
and methodology that translators and linguists in the traditional localization industry could actually use.
In the research described in Linguistic Indicators for Quality Estimation of Machine Translations, the authors show how linguistic and shallow features in the source text, the MT output and the target text can help estimate the quality of the
content. Drawing on that research, I developed a linguistic approach to rapidly determining QE.
In a nutshell, finding potential issues in the following three dimensions of the content can help us get an idea of the MT output quality. These
three dimensions are:
complexity (the source text: how complex it is, how difficult it will be for MT to translate),
adequacy (the translation itself: how accurate it is), and
fluency (the target text only: how well it reads).
The next step was then trying to identify specific features in these three dimensions, in my content, that would provide an accurate
estimation of the output quality. After some trial and error, I decided to use the following set of features:
Length: is a certain maximum length exceeded? Is there a significant difference between source and target? The idea here is that the
longer a sentence is, the harder it may be for the MT system to get it right.
Polysemy: words that can have multiple meanings (and therefore multiple translations). With millions of listings across several broad
categories, this is a big issue for eBay content. For example, if you search for lime on eBay.com, you will get results from Clothing categories (lime
color), from Home & Garden (lime seeds), from Health & Beauty (there is a men's fragrance called Lime), from Recorded Music
(there is a band called Lime), etc. The key here is that if a polysemous word is in the source, this is an indication of a potential issue. Another key: if a given translation for a source term occurs near certain words, that is a potential error too. To make that
clearer: "clutch" can be translated (a) as that pedal in your car or (b) as a small handbag; if translation (a) occurs in your target
next to words like bag, leather, purse, or Hermes, that is most likely a problem.
Terminology: basically checking that some terms are correctly translated. For eBay content, things like brands, typical e-commerce
acronyms, and company terminology are critical. Some brand names may be tricky to deal with, as some have common names, like Coach or Apple, as opposed
to exclusively proper names like Adidas or Nike.
Patterns: any set of words or characters that can be identified as an error. Patterns can be duplicate words, tripled letters, missing
punctuation signs, formal/informal style indicators, words that shouldn't occur in the same sentence, and more. The use of regular expressions
gives you a great deal of flexibility to look for these error patterns. For example, in Spanish, sentences don't typically end in prepositions,
so it's not hard to create a regular expression that finds Spanish prepositions at the end of a sentence: (prep1|prep2|prep3|etc)\.$ (see the sketch after this list).
Blacklists: terms that shouldn't occur in the target language. A typical example would be offensive words. In the case of
languages like Spanish, this is also useful to detect regionalisms.
Numbers: numbers occurring in the source should also appear in the target.
Spelling: Common misspellings.
Grammar: potential grammar errors, unlikely word combinations, like a preposition followed by a conjugated verb.
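To illustrate the Patterns and Numbers features above, here is a small Python sketch. The list of Spanish prepositions is only a partial example, and the test strings are invented for illustration.

import re

# Patterns: Spanish prepositions at the end of a sentence, following the
# (prep1|prep2|prep3|etc)\.$ idea from the Patterns item above (partial list).
ES_PREP_AT_END = re.compile(r"\b(de|en|con|por|para|sin|sobre)\.$", re.IGNORECASE)

# Numbers: every number in the source should also appear in the target.
def missing_numbers(source, target):
    return set(re.findall(r"\d+", source)) - set(re.findall(r"\d+", target))

print(bool(ES_PREP_AT_END.search("Este es el tema que hablamos de.")))              # True -> flag
print(missing_numbers("iPhone 6 case, 2 pack", "Capa para iPhone 5, 2 unidades"))   # {'6'}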
After some initial trial-and-error runs, I discarded ideas like named entity recognition and part-of-speech tagging. I couldn't get any reliable
information from them that would help with the estimation, but this doesn't mean these two should be completely discarded as features. They would, of course,
introduce a higher level of complexity to the method, but could yield positive results. This list of features is not final and can evolve.
All these features, with all of their checks, make up your QE model.
How do you use the model?
The idea is simple; let me break it down for you:
The goal is to get a score for each segment that can be used as an indication of the quality level.
The presence of any of the above-mentioned features indicates a potential error.
Each error can be assigned a number of points, a certain weight. (During my tests I assigned one point to each type of error, but this can be
customized for different purposes.)
The number of errors is divided by the number of words to obtain a score.
The ideal score, no potential errors detected, would be 0.
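Putting the pieces together, a minimal Python sketch of the scoring step might look like this, with one point per potential error as in the tests described below; the segment and the error count are illustrative only.

# Minimal sketch of the segment-level scoring described above:
# each potential error counts one point, divided by the word count.
def qe_score(num_potential_errors, num_words):
    """0 means no potential issues detected; higher means worse."""
    return num_potential_errors / num_words if num_words else 0.0

segment = "Capa para iPhone 5 con ventosa de."
errors_found = 2  # e.g. wrong model number + preposition at end of sentence
print(round(qe_score(errors_found, len(segment.split())), 2))  # 0.29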
Quality estimation must be automatic – it makes no sense to check manually for each of these features. A very easy and inexpensive way to find
potential issues is using Checkmate, which also
integrates LanguageTool, a spelling and grammar checker. Both are open source.
There is a way to account in Checkmate for each of the linguistic features mentioned above: terminology and blacklists can be set up in the Terminology tab,
spelling and grammar in the LanguageTool tab, patterns can be created in the Patterns tab, and so on. The set of checks you create can be saved as a profile and
reused. You just need to create a profile once, and you can update it when necessary.
Checkmate will verify one or more files at the same time, and display a report of all potential issues found. By knowing how many errors were detected
in a file, you can get a score at the document level.
Getting scores at the segment level involves an extra step. What we need at this point is to add up all the potential errors found for each segment
(every translation unit is assigned an ID by Checkmate, and that makes the task easier), count the number of words in each segment, and divide those
values to get scores. All the necessary data can be taken from Checkmate’s report, which is available in several formats.
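The exact layout of the report varies by export format, so the sketch below only assumes a flat list of (segment id, issue) rows and shows the aggregation step; the field names and values are assumptions for illustration, not the tool's actual schema.

from collections import Counter

# Hypothetical flat export of a QA report: (segment id, issue description).
report_rows = [
    (1, "number mismatch"),
    (1, "blacklisted term"),
    (3, "preposition at end of sentence"),
]
segment_word_counts = {1: 7, 2: 12, 3: 9}  # words per segment, from the bilingual file

issues_per_segment = Counter(seg_id for seg_id, _ in report_rows)
for seg_id, words in segment_word_counts.items():
    score = issues_per_segment.get(seg_id, 0) / words
    print(seg_id, round(score, 2))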
To be able to carry out this step of the process with minimal effort, I created an Excel template and put together a VBA macro that, after you copy and
paste the contents of the Checkmate report, gets the job done for you. The results should be similar to this, with the highest and lowest scores in red
and green:
This is the VBA code I used, commented and broken down into smaller bits (VBA experts, please note that I'm not an expert).
Testing
Several tests were run to check the effectiveness of this approach. We took content samples of roughly the same size with different levels of quality,
from perfect (good quality human translation) to very poor (MT output with extra errors injected). Each sample was post-edited by two post-editors,
recording the time required for post-editing each sample. Post-editors didn’t know that the samples had different levels of quality. At the same
time, we obtained the QE score of each sample.
First, we started with short Spanish samples of around 300 words. One of the samples was the golden standard, one of the samples was raw MT output with
errors injected, and the rest of the samples were in-between. Then we repeated the same steps with bigger samples of around 1,000 words. A third test was done using bigger files (around 50,000 words) in the following different stages of our training process:
MT output (raw MT)
review 1 (post-edited, reviewed, and sent back for further post-editing after not meeting quality standards)
Some of these tests were then extended to include Russian, Brazilian Portuguese, and Chinese. Only one post-editor worked on each of these three
languages.
Analyzing Results
Results showed that post-editing time and the QE scores were aligned and strongly correlated. This list shows two sets of samples (~300 words and
~1,000 words), their sizes, the number of potential issues found, and the score. The last two columns show the time taken by each post-editor. Samples in green are the golden standards for each set; in red, the worst quality sample in each set.
As you can see, QE scores and post-editing times are overall aligned. A score of 0 indicates that no potential issues were detected (which does not necessarily
mean that the file has no errors at all - it just means none were found). These initial tests were run at the document level.
This is a different representation of the results for the second set of samples (~1,000 words). Red bars represent the time taken by post-editor #1 to
post-edit each sample; green bars are for post-editor #2. The blue line represents the QE score obtained for each sample.
With the help of colleagues, similar tests were run for 3 additional languages (BPT, RU, and ZH) with similar results. The only language with
inconsistent results was Chinese. We later discovered that Checkmate had some issues with double-byte characters. Also, the set of features we had for
Chinese was rather small compared to other languages.
One thing that becomes obvious from these results is that a strong QE profile (i.e., the number of checks included for each feature, how good they are
at catching issues) has a key role in producing accurate results. As you can see above, the RU profile caught more potential errors than the BPT one.
In a third test, I estimated the quality of the same file after 3 stages of our training process (as described above in the Testing section). After
each step, the score improved. The presence of many potential errors in a file that was post-edited two times helped fine-tune some features in the
model. This also reinforced the idea that a one-size-fits-all model is not realistic. Models can and should be adapted to your type of content.
Let’s take eBay titles as an example: they have no syntactic structure (they are just a collection of words), so perhaps they don’t need
any grammar checks. Titles usually contain brand names, part numbers and model names, so perhaps spelling checks will not provide meaningful
information.
During this test, I also checked changes in edit distance. As the score improved, the edit distance grew closer to the average edit distance for this
type of content at that point in time, which was 72. By looking at the score and the edit distance, I could infer that there’s room to improve
the quality of this particular file. Some analysis at the segment level can help confirm these conclusions. Checking segments with the best and worst
scores helps determine how reliable your results are.
Challenges of using this model
A high number of false positives may occur based on the nature of the content. For example, some English brand names may be considered spelling errors
by certain spellcheckers in the target language. LanguageTool uses an ignore list to avoid incorrectly flagging any terms you add to it. Overall,
it’s virtually impossible to avoid false positives in quality checks in any language. Efforts should be made to minimize them as much as
possible.
Another challenge is trying to match a score with a post-editing effort measurement - it’s not easy to come up with a metric that accurately
predicts the number of words per second that can be post-edited given a certain score. I’m sure that it is not impossible, but a lot of data is
required for precise metrics.
The model is flexible enough to allow you to assign a certain weight to each feature. This can be challenging at first, but it is, in my opinion, a “good problem”. This allows users to adapt the score to their specific needs.
Conclusion
What motivated the development of this method was mainly the idea of giving translators, linguists, post-editors, translation companies, and
people working with MT in general a practical means to use quality estimation.
Regarding next steps for this method, I see some clear ones. It would be really interesting to be able to match QE scores with post-editing time. It doesn’t seem impossible and it’s
probably a matter of collecting enough data. Another interesting idea would be integrating QE in a CAT tool or any other post-editing environment,
and having segment-level QE scores displayed to post-editors. Comparing post-editing time and score in such a tool could also help fine-tune
the QE model and more accurately predict post-editing effort.
I see this just as a starting point. Personally, I like the idea of people taking this model, personalizing it and improving it, and of course
sharing their results. There is definitely room for improvement, and I’m sure that new features can be added to make results even better.
Perhaps in the future new QE models can combine statistical data and language data while keeping the process simple enough.
Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as
a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for
some time. He first started working with MT in 2006, and he has helped localize quite a few major video games as well.
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation.
This is the third post in a series on expert MT systems developers. Iconic fits my definition of expert much more closely (as does tayou), as they
do-it-for-you rather than give you a technology platform to do-it-yourself. (DIY does not work well for those who do not really know what they
are doing.) They bring deep expertise to bear on the very specific translation problem that you bring to them, and may create a unique MT solution for
each customer. In the case of Iconic, the components used to build an MT solution may also vary from customer to customer, as they assemble
sub-components to address unique customer problems, which can vary by content, language and quality expectation. In general, the interaction with an
expert is going to be much more consultative and customer-specific, and less off-the-shelf. As MT is an evolving technology, it is especially useful for an
expert developer to have close ties with an academic institution that is engaged in ongoing research in MT and related technology like machine
learning and artificial intelligence. This allows them to continue to enhance the technology, and most expert developers will have this kind of
relationship in place. This is especially critical at this juncture, as there is a lot of leverageable work being done in machine intelligence and
neural network based learning. This post developed from a conversation with John Tinsley. Most of the emphasis in the post below is mine.
-------------------------------------
From the lab to the market
The concept and technology behind Iconic originated from a large scale research and development project at Dublin City University (DCU). The
state-of-the-art MT technology developed by a number of MT PhDs, who are still with the company today, was adapted to meet the translation requirements
at the European Patent Office. On the back of this, the company spun out of the university, and our first commercial MT engine, for English to
Portuguese patent translation, went into production in July 2010. We haven't looked back since.
Our original focus was exclusively on patent translation and we worked with a number of LSPs and information providers who specialised in this field.
We’ve since evolved into a turnkey machine translation software and solutions provider that specializes in custom solutions tailored with subject
matter expertise for specific industry sectors and content types. These are areas where our sophisticated MT technology has significant value to add
over off-the-shelf solutions.
MT: a complex technology
There are two established paradigms of MT: rule-based and statistical (though Neural MT is clearly on the horizon; more on that later). Statistical MT
is by far the predominant approach, and extensions to it include the incorporation of rules or linguistic information, often referred to as
"hybrid" MT. Despite this, there is no single approach or configuration that works best across all language pairs, content types, and writing styles.
Our approach, from a technological perspective, tries to address the fact that there's no "one-size-fits-all" machine translation solution.
The best approach completely depends on the job at hand. We have developed an Ensemble Architecture™ in which hundreds of processes
across different paradigms can be combined to produce the best translation for a given task.
Some of these processes are specific to a particular language, to a certain content type (e.g. legal, financial), or to a particular writing style
(e.g. patents, contracts, annual reports). Depending on a given input document or client project, our platform takes an on-the-fly decision as to the
most appropriate combination of processes to use. In fact, in many cases, different sections of a single document can be translated using completely
different processes. The goal is to use the most effective set of tools at our disposal for any given task, so that our end users get the best possible
translation.
The Ensemble Architecture: a sample of just a few of the processes available to enhance translation quality
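Iconic's actual architecture is proprietary, so purely to illustrate the general idea of routing content to different processing pipelines, here is a hypothetical Python sketch; the pipeline names and selection rules are invented and do not describe the real Ensemble Architecture.

# Hypothetical illustration of routing content to different MT pipelines.
# Pipeline names and selection rules are invented for illustration only.
def select_pipeline(content_type, section):
    pipeline = ["normalize", "tokenize"]
    if content_type == "patent":
        # Treat different patent sections differently, as described above.
        pipeline.append("claims_segmenter" if section == "claims" else "title_recaser")
        pipeline.append("domain_term_handler")
    elif content_type == "user_generated":
        pipeline.append("spelling_normalizer")
    pipeline += ["smt_decode", "postprocess"]
    return pipeline

print(select_pipeline("patent", "claims"))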
From a client customisation perspective, this model is super extensible. It allows us to develop and add new processes to the ensemble architecture as
needed on a case by case basis, making it a continuously evolving architecture.
One of the biggest challenges in maintaining such a complex technology is the need for deep expertise in the underlying MT methodologies. This places a
lot of importance on the team, which is something we've put a lot of resources into building at Iconic. Having a mix of experts in the science
underpinning MT technology and language experts, combined with software development skill, allows us to transfer the subject matter
expertise in our team directly into our clients' applications.
This is crucial when it comes to ongoing engine development and customisation. When adding data to an engine, quality can hit a ceiling quite quickly.
End users will highlight critical issues in the output that cannot be resolved by simply “retraining” and adding more data. Even if we have valuable
post-edited data from translators, it is still often not enough.
Having the expertise in the team, in a) knowing where to look “under the hood” to identify the cause of an issue, and b) having the skill to
implement a fix, is crucial to success.
In fact, we recently ran a video series introducing some of the members of our team which you can watch here.
Applying MT successfully across industry sectors
Iconic’s first product was IPTranslator, a suite of MT engines adapted for patent and intellectual property related content. Patents are challenging
enough to read in your native language, never mind trying to develop MT software to handle them. The complexity of the documents was one of the key
motivators in developing our Ensemble Architecture - we needed the MT engine to be able to dynamically adapt to the content domain,
be it chemical, engineering, or electronics patents. Similarly, it needs to be able to treat the different patent sections in different ways, e.g. the
stunted telegraphic style of the title vs. the long-winded "legalese" of the claims.
Since then, this architecture has grown to inherently cover other areas and verticals that pose similar technological challenges for MT, such as eDiscovery, the financial and life sciences industries, and, in particular, user-generated content in the hospitality and e-commerce industries.
From the language perspective, like most approaches to MT, our technology is language independent and can be applied to any combination of languages.
However, given the extensibility and the ability to combine domain-adapted approaches with language-specific processes, we've been able to tackle
harder projects and have had great success with traditionally more challenging languages such as Chinese and German, to name just two.
These are areas we’ve chosen to focus on because we have significant value to add with our technology and expertise. MT has been rolled out generally
with relative success in industries such as IT and for languages like French, which tend to lend themselves better to language technology in general.
But when it comes to more difficult languages and content types, a smarter approach is needed and that’s often where we come into play.
MT for multiple buyers
We are the MT partner of choice for some of the world’s largest translation companies, information providers, and government and enterprise
organisations, each of whom can have very different requirements from an MT technology solution. We have worked very closely with Welocalize over a
number of years, in particular Park IP, to bring post-edited MT into their workflow for patent translation across a number of languages. Similarly, we
have worked extensively with RWS in the same field.
Over the past 18 months, in collaboration with the ADAPT research centre, we have been providing English to Irish MT to the Irish government, also for
post-editing. To the best of our knowledge, this is the first successful commercial deployment of MT for this language combination.
Outside of traditional language services use cases, Iconic’s MT is used in a number of areas where translation is just one part of a much bigger
picture. For instance, over the course of the last year we have been working on large scale digitization projects, taking millions of documents of
archive content stretching back over the past 150 years (in various challenging formats, as you can imagine!), converting them, cleaning them, machine
translating them, and then making them available in multilingual searchable databases. Billions of words of historical analog content have passed
through our engines and are now accessible to a global audience.
In an increasing number of these cases, MT is producing quality output that is fit for purpose as is, without the need for post-editing, and this
trend is only going to continue as the volume of content being produced grows further.
Business models for complex technology
In the same way that there's no single "best" approach to MT, it is also not an "out of the box" type of software. Yes, there are use cases and
instances where it can work in this way, but they are arguably the exception rather than the rule. It can work very well for anecdotal or spot
usage, but when it comes to complex real-world problems, a deeper level of expertise and experience is required.
This has probably been distorted by (fantastic, ground-breaking) tools like Moses, which have done a great job of lowering the barrier to entry to machine
translation from a software development perspective. But doing things in this way will only get you so far and,
ultimately, without understanding the underlying complexities of the software, of the languages being translated, and of how to complement and
extend Moses with additional natural language processing (NLP) techniques, performance will plateau.
It's possible to provide interfaces for users to upload their own training data, terms, post-editing rules, and so on, which has been done extensively
and often to good effect. Ultimately, however, that's really over-simplifying the technology and basically taking it out of expert hands.
Doing that, as a software provider you have no recourse when someone misuses your technology and it doesn't meet their needs.
5 steps to adopting machine translation
Software with a Service
To overcome this, we provide MT as a fully managed cloud-service with a different take on the standard business model.
We call this model Software with a Service (as opposed to Software as a Service) and what it does is combine technology automation (the software)
with specialized expert labor (the service) to deliver a complete solution to a business problem. It’s as much about expert people-powered customer
service as it is about code-powered efficiency.
The result is an awesome customer experience that can only be delivered with a human touch.
MT engine development and maintenance is costed on a professional services basis as required, which serves the purpose of ruling out unnecessary
“retraining” when it is not going to bring any value. Usage of the engines, cloud-based or otherwise, is costed as normal based on usage either on a
subscription basis or pay-as-you-go. This can change to a standard license model when the software needs to be installed on local servers, which is
typically done for security reasons.
Buyers of MT are smart, and they are increasingly aware that assessing MT based on output samples or bake-offs (comparisons) between providers doesn’t
make sense. MT is never going to be as good as it can be straight out-of-the-box. It takes time, and requires expert guidance to develop, incorporate,
and maintain. This is why we’ve had a good response to this model - it just makes sense.
When should MT be used in the first place?
As I mentioned, MT is not a one-size-fits-all solution and, in fact, in some cases it’s not suitable at all. That’s why the first step in all of our
engagements is to assess whether it’s feasible to use MT in the first place. Never mind other business models; the worst business model is accepting
projects that are destined to fail from the outset! The industry has been plagued by false dawns and lack of expectation management for a long time, so
MT providers need to become more transparent about when their solutions should and should not be used.
How do we assess feasibility? At Iconic, we look at 8 factors that can influence things in different ways when it comes to MT. In order for us to have
a clear picture in our minds as to the best way to proceed in each case, we need to understand these factors, interpret them, and then apply our
experience-driven expertise to communicate to the customer how they impact things. We do this even if it means turning down a project
because it wasn't suitable. We discussed these factors in a webinar recently, and we're
currently running a blog series on our website where
we’re diving even deeper on each factor.
The 8 factors that influence machine translation projects
The Next Frontier? The Neural Frontier?
So, where do we go from here? For all MT providers, there's no hiding from the fact that neural MT is not just a fad and is most likely here to
stay, and Iconic is no exception. That being said, there has been a lot of talk and hype about it, but relatively little action outside of academia
(yet), in part because there are still a few hurdles to overcome before it becomes a practicable commercial solution.
We pride ourselves on our deep MT expertise and our ability to develop complex language solutions, so we are very keen and excited to begin working
with neural MT. To that end, we’ve been working on an exciting project with ADAPT research centre in Dublin, who have been pioneers in MT research
over the last 15 years. We will be in a position to reveal more later this year, so stay tuned!