
Monday, August 1, 2016

Overview of Expert MT Systems - KantanMT

This is the second post in a series on expert MT system vendors, focusing on an interview with Tony O'Dowd of KantanMT. As some of you may notice, Kantan is much further along the marketing curve, and some may say the presentation is somewhat slick, but their client base, both enterprises and LSPs, speaks for itself; in my opinion, Kantan is a serious contender for expert MT services for both enterprises and LSPs. Kantan has made many of the SMT support tools look very pretty and easy to use, but I suspect that those who understand what they are actually doing with these tools are likely to be more successful. 
 
----------------------------

Can you provide a brief overview of your recent history so that we can better understand your company and technology? Have you seen growth in the use of MT during your history so far? 
 
KantanMT.com was founded out of an idea I had while preparing for my PhD at Dublin City University. I was very curious as to why so few enterprises and LSPs were using MT within their Localisation workflows. From market research I found that the main causes of this were cost, complexity, challenge and quality. 
  • Cost – Traditional MT systems were sold via Professional Services teams. This required expensive upfront development costs and long lead times to commission an engine. Of the 11 vendors identified in our market research, all of them sold their systems via this sales mechanism.
  • Complexity – The complexity of MT systems was identified as a significant barrier to usage and deployment. However, since the traditional MT market was sold via Professional Services-type sales engagements, there was no motivation or reason why complexity should reduce over time. Put simply, it was in the industry's favour to talk up complexity; however, this was driving down usage and overall market penetration.
  • Challenge – MT systems were challenging to manage, improve and deploy. Since little or no attempt was made to resolve the complexity issue (it wasn't in the industry's interest to do this…), the challenge of managing MT was out of the reach of most organisations. Additionally, if an organisation decided to go it alone and build its own solution, it would have to recruit PhD staffers, who were challenging to find and expensive to hire.
  • Quality – The real and perceived quality of MT systems was both an operational and psychological barrier to entry. Translators felt threatened by MT due to a lack of successful industry implementations, and this led to distrust and an elevated sense that MT is just not good enough. Meanwhile, an explosion in web-based content and volumes meant that the industry was looking for mechanisms that were good enough for purpose. Aligning these two polar views was a significant barrier to the widespread usage of MT.

The idea behind Kantan was driven by this analysis and the desire to solve these four fundamental challenges. At KantanMT.com we imagined a platform that would be easy to access, would shorten development time, could measure and predict translation quality, and would significantly address the cost, lead-time and quality challenges. 
The KantanMT.com platform does only three things: it helps our community members develop, improve and deploy SMT solutions within their organisations. It addresses the four challenges and helps enterprises embrace SMT as a productivity enabler for their globalisation strategy. 

Do you build the MT engines for your clients or do you let them do it for themselves?
The KantanMT platform is flexible to accommodate both approaches. The vast majority of LSPs will build their own engines as they view this as a necessary skill they need to embrace and understand within their organisations. For the ISV sector, we generally build the engines using our in-house professional services team and work with linguists to test the translation outputs prior to production release. 
 
If the clients build the engine themselves – do you have a team available to help and guide this?
Yes - within the KantanMT engineering team there is a group called Professional Services. This team comprises Solution Architects, Project Managers, Product Trainers and Engine Developers. Their primary role is to develop, improve and deploy engines for large enterprises. This team also has the support of our SRE (Site Reliability Engineering) team, whose main role is to ensure that KantanMT solutions stay running 24x7x365 on our cloud. Remember, the KantanMT cloud consists of over 700 servers, so this is a vital role in the management of large-scale MT solutions. 

We provide the same support for everyone, depending on their need. We also work with the world’s largest LSPs and build and manage KantanMT engines for them too. 

While it has gotten very easy to build a low quality MT engine with Moses, it is my experience that these DIY engines very rarely deliver any business value. What do you do to ensure that you are delivering value for your customers beyond making it easy to build a Moses based engine?
One of the biggest challenges in customizing Statistical Machine Translation systems is rapidly improving the engine after its initial training. While, for the most part, you can build a baseline engine using existing Translation Memory assets, the real challenge is how to go beyond this and achieve higher levels of quality. More importantly, how can you do this rapidly and with minimum cost and effort? 
At KantanMT we tackled this problem in several ways:-
  • KantanBuildAnalytics – This is an interactive development environment, designed for localisation engineers and engine developers, that is used to build and improve KantanMT engines. It uses a range of automated scoring methods (e.g. BLEU, F-Measure and TER) to assess translation quality, a training normalisation environment that helps improve training candidates, extensive 12-step data cleansers, automatic Gap Analysers and version control. Of course, at the core of this environment are the automated scores, which are comparative measures that can only meaningfully be used during engine development.

  • KantanAnalytics – This is a technology, jointly developed by the Centre for Next Generation Localisation and KantanLabs, which can predict the quality of translation outputs. Displayed as a percentage value, it provides users of KantanMT with guidance on the quality of generated outputs – the higher the score, the better the fluency and adequacy of the translation. This technology seamlessly integrates with the industry-standard Fuzzy match scoring mechanism so that it's easy for Translation Project Managers to identify the quality of MT outputs. 
  • KantanPEX – PEX stands for Post-Editing Automation. PEX is a series of rules that can be applied to an engine to dynamically modify translation outputs. The KantanMT community uses this to address inconsistencies within translations and rapidly ensure engines comply with their quality expectations.
  • KantanTotalRecall – This is a high-speed, low-latency cloud-based translation memory which is automatically built using the training data uploaded by our clients. KantanMT is a fusion of both TM (TotalRecall) and MT (KantanMT) technologies, seamlessly blending the best matches from TM with the best translations from MT.
  • KantanLQR – LQR stands for Language Quality Review and KantanLQR is an environment built into the heart of the KantanMT platform which provides a fully interactive workflow for Professional Translators to score the quality of translations. The workflow is fully distributed, highly customisable and Project Managers can determine translation quality in real-time using the industry standard Multidimensional Quality Metrics (MQM). More importantly, the feedback and post-edits from the Professional Translators can be used to fine-tune and improve the KantanMT translation outputs.
  • KantanNER – This is Named Entity Recognition and is built into every KantanMT engine. This is a highly customisable component that is used to ensure numerical data (such as dates, times, currencies, specification data and text entities) is handled outside of the decoding process. For example, we can detect imperial measurements such as feet, inches and miles and convert these measurements to metres, centimetres and kilometres (a generic sketch of this kind of conversion appears after this list). KantanNER is part of the GENTRY NLP layers developed at KantanLabs and is easy to customise and extend to embrace the precise requirements of the KantanMT community.
  • GENTRY - Gentry is the NLP programming kernel of each KantanMT engine. It's easy to extend and customise. For example, you can programme additional segmentation and tokenisation rules, extend the 12-step Kantan data cleansers, implement pre-ordering and re-ordering models and even create text pre-processors and post-processors to ensure each KantanMT engine is compliant with the quality expectations of the KantanMT community.
  • KantanFleet – Community members that wish to start translating immediately and avoid the build, test and deploy process can use KantanFleet. This is a large collection of pre-built and fully-tested engines in the Legal, Financial, Medical, IT and General domains. At present there are over 100 KantanFleet engines. Each KantanFleet engine can easily be extended, and customised engines can be built using it as a baseline. This significantly reduces the time to build, improve and deploy customised engines.
  • KantanLibrary – KantanLibrary is used by community members that may not have sufficient training data to customise their own engines. KantanLibrary is a collection of pre-cleansed, scored and publicly available training data sets. These have all been tested, cleansed and optimised for the Legal, Financial, Medical, IT, General and Conversational domains.
  • KantanTemplates – This provides an intuitive and powerful way to customise, improve and deploy multiple KantanMT engines that share common training data-sets. Using KantanTemplates™, shared data-sets of bilingual and terminology training files can be used across multiple KantanMT engines, which allows them to be easily modified and updated all at once. KantanTemplates helps you easily customise multiple Machine Translation engines and provides cutting edge analytics and reporting tools to track your progress.
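
To make the named-entity handling described above a little more concrete, here is a minimal, generic Python sketch of how numeric entities can be detected, converted and shielded from the decoder. This is purely illustrative and is not KantanNER's actual implementation; the placeholder format, unit table and function names are my own assumptions.

    import re

    # Illustrative only: detect imperial lengths, convert them to metric, and
    # replace them with placeholders so the decoder never sees the raw numbers.
    # The converted values are restored after translation.
    UNIT_TO_METRIC = {"ft": ("m", 0.3048), "in": ("cm", 2.54), "mi": ("km", 1.60934)}
    PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(ft|in|mi)\b")

    def protect_measurements(segment):
        """Replace imperial measurements with numbered placeholders."""
        entities = []
        def _convert(match):
            value, unit = float(match.group(1)), match.group(2)
            metric_unit, factor = UNIT_TO_METRIC[unit]
            entities.append(f"{value * factor:.2f} {metric_unit}")
            return f"__ENT{len(entities) - 1}__"
        return PATTERN.sub(_convert, segment), entities

    def restore_measurements(translation, entities):
        """Put the converted values back into the translated segment."""
        for i, text in enumerate(entities):
            translation = translation.replace(f"__ENT{i}__", text)
        return translation

    protected, ents = protect_measurements("The cable is 25 ft long.")
    # protected == "The cable is __ENT0__ long.", ents == ["7.62 m"]
    print(restore_measurements("Das Kabel ist __ENT0__ lang.", ents))

The same placeholder-and-restore pattern generalises to dates, times and currencies: anything the decoder tends to garble is swapped out before decoding and restored after translation.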

What are some key tools and technologies that you provide on your platform to support your customers and maximize their chances of success?
KantanMT is a complete platform for the development, improvement and deployment of SMT within small, medium and large enterprises. It consists of a large collection of technologies and innovations, all highly integrated with each other to help accelerate the deployment of high quality SMT. Some of these technologies are:- 

  • KantanWidgets – This is a Suite of Productivity Apps that can be used to integrate KantanMT engines into the heart of any localisation workflow. KantanTranslate™ is an App that can be used to provide real-time, on-demand translation of text snippets, KantanDesktopApp™ is an App that can be used to translate one or more documents directly from your desktop, and KantanPlugins™ are a collection of application plugins for MS Office and a range of browsers that provide real-time translation of content. All KantanWidgets are connected directly to the KantanMT engines developed by the KantanMT community.
  • KantanAPI – this is a RESTful interface into the complete KantanMT platform. It provides both synchronous and asynchronous functionality so that the KantanMT community can build applications exploiting their KantanMT engines.
  • KantanAutoScale – This is a fully distributed, cloud-based deployment technology that helps the KantanMT community release high-speed, high-capacity engines on the cloud. Using KantanAutoScale technology, KantanMT deployments will scale up and scale down based on inbound traffic. This provides the optimal speed and cost balance for clients that wish to translate at scale.
  • KantanSwift – This technology is applied to all engines that require super-fast launch times. The KantanMT community uses this technology in conjunction with KantanAutoScale to manage large deployed KantanMT engines hosted on hundreds of servers.
  • KantanTemplates – This provides an intuitive and powerful way to customise, improve and deploy multiple KantanMT engines that share common training data-sets. Using KantanTemplates™, shared data-sets of bilingual and terminology training files can be used across multiple KantanMT engines, which allows them to be easily modified and updated all at once. KantanTemplates helps you easily customise multiple Machine Translation engines and provides cutting-edge analytics and reporting tools to track your progress.
  • KantanLQR – LQR stands for Language Quality Review and KantanLQR is an environment built into the heart of the KantanMT platform which provides a fully interactive workflow for Professional Translators to score the quality of translations. The workflow is fully distributed, highly customisable and Project Managers can determine translation quality in real-time using the industry standard Multidimensional Quality Metrics (MQM). More importantly, the feedback and post-edits from the Professional Translators can be used to fine-tune and improve the KantanMT translation outputs.
  • KantanPEX – PEX stands for Post-Editing Automation. PEX is a series of rules that can be applied to an engine to dynamically modify translation outputs. The KantanMT community uses this to address inconsistencies within translations and rapidly ensure engines comply with their quality expectations.
  • KantanTotalRecall – This is a high-speed, low-latency cloud-based translation memory which is automatically built using the training data uploaded by our clients. KantanMT is a fusion of both TM (TotalRecall) and MT (KantanMT) technologies, seamlessly blending the best matches from TM with the best translations from MT.
  • KantanBuildAnalytics – This is an interactive development environment, designed for localisation engineers and engine developers, that is used to build and improve KantanMT engines. It uses a range of automated scoring methods (e.g. BLEU, F-Measure and TER) to assess translation quality, a training normalisation environment that helps improve training candidates, extensive 12-step data cleansers, automatic Gap Analysers and version control. Of course, at the core of this environment are the automated scores, which are comparative measures that can only meaningfully be used during engine development.
  • KantanAnalytics – This is a technology, jointly developed by the Centre for Next Generation Localisation and KantanLabs, which can predict the quality of translation outputs. Displayed as a percentage value, it provides users of KantanMT with guidance on the quality of generated outputs – the higher the score, the better the fluency and adequacy of the translation. This technology seamlessly integrates with the industry-standard Fuzzy match scoring mechanism so that it's easy for Translation Project Managers to identify the quality of MT outputs.
  • KantanNER – This is Named Entity Recognition and is built into every KantanMT engine. This is a highly customisable component that is used to ensure numerical data (such as dates, times, currencies, specification data and text entities) is handled outside of the decoding process. For example, we can detect imperial measurements such as feet, inches and miles and convert these measurements to metres, centimetres and kilometres. KantanNER is part of the GENTRY NLP layers developed at KantanLabs and is easy to customise and extend to embrace the precise requirements of the KantanMT community.
  • GENTRY - Gentry is the NLP programming kernel of each KantanMT engine. It's easy to extend and customise. For example, you can programme additional segmentation and tokenisation rules, extend the 12-step Kantan data cleansers, implement pre-ordering and re-ordering models and even create text pre-processors and post-processors to ensure each KantanMT engine is compliant with the quality expectations of the KantanMT community.
  • KantanFleet – Community members that wish to start translating immediately and avoid the build, test and deploy process can use KantanFleet. This is a large collection of pre-built and fully-tested engines in the Legal, Financial, Medical, IT and General domains. At present there are over 100 KantanFleet engines. Each KantanFleet engine can easily be extended, and customised engines can be built using it as a baseline. This significantly reduces the time to build, improve and deploy customised engines.
  • KantanLibrary – KantanLibrary is used by community members that may not have sufficient training data to customise their own engines. KantanLibrary is a collection of pre-cleansed, scored and publicly available training data sets. These data sets have all been tested, cleansed and optimised for the Legal, Financial, Medical, IT, General and Conversational domains.
Do you have any stock engines ready to run for those clients who need something quick & dirty and cannot use the public Google or Microsoft engines? How do they compare to the generic free engines?
The KantanMT platform comes pre-configured with collections of pre-built engines and pre-cleansed training catalogues. These have already been described in the previous answer:-
  • KantanFleet – Pre-built engines described above.
  • KantanLibrary – Cleaned and optimized base training data also described above.
Do you gather translator feedback to better understand their PEMT experience and do you have any plans to improve this feedback cycle and get translators more directly engaged?
Yes, we do. In fact, we built an environment called KantanLQR to focus on this one aspect of engine development. Put simply, to achieve the highest level of production translation quality, it's imperative that Professional Translators are involved in the development and improvement of MT engines. More importantly, a structured error typology (such as MQM, the one used in KantanLQR) is required to capture, organise and then analyse the feedback from the Professional Translators. This feedback can be harnessed to fine-tune vocabulary and terminology selection, improve consistency and improve the overall fluency and adequacy of translation outputs. 

What is your approach to pricing? (Please be as vague or as specific as you want to be.)
KantanMT operates a Pay-as-you-go model whereby the KantanMT community simply subscribe to a monthly plan. The monthly plans include access to all the platform features, technologies and applications. Included in each monthly subscription is a generous free-word allowance which ensures that our community can keep their costs low when embracing MT within their localisation workflow. 

What are you doing to ensure your technology stays current and relevant in future? As you may have heard, Facebook thinks that SMT is done and the future is all about Neural MT; do you have any plans in this area?
KantanLabs is the advanced research group within the KantanMT organisation. It is headed up by Dr Dimitar Shterionov. KantanLabs' Chief Scientific Advisor is Professor Andy Way from the ADAPT Centre at DCU, Ireland. KantanLabs' primary objective is to explore new ways and novel approaches to Statistical Machine Translation. At present we have three research projects up and running. These are:-
  • Optimised Training Methods and Adaptive MT – this is a joint project between ADAPT Centre and KantanMT.com which is focusing on ways of accelerating the training process and exploring adaptive MT technologies so that KantanMT engines can be retrained superfast with the latest translation suggestions from Professional Translators. The first deliverable from this project (which accelerates the training time for large engines by as much as 70%) will be launched very shortly.
  • Re-Ordering Models for Challenging and Complex Languages – this is a joint project with EAMT which is exploring interesting ways of re-ordering complex languages for the purposes of improving translation quality.
  • Exploiting Neural Networks in a Commercial Environment – we have just recently announced this project in conjunction with the Marie Curie Foundation. This will be a 2 year research project on how neural networks can be exploited in statistical machine translation systems. 
 
What are the most promising areas for MT in future in your opinion?
In the immediate timeframe we are targeting adaptive MT and interesting re-ordering models for complex languages. However, we cannot ignore the potential impact that Neural Networks may have on statistical methods. An area I'm particularly interested in is using a hybrid combination of neural and phrase-based SMT approaches, taking the best of both worlds, so to speak. Another area of significant importance for us is Named Entity recognition and support. This is especially key in the hospitality and eCommerce industries. 

Have you found the TDA data useful in any of your MT engine development?
We have never used the TAUS data for the purposes of building KantanMT engines, so I’m not in a position to comment on the data. However, I believe that TAUS is incredibly important for the industry as it has fostered a better understanding, higher level of engagement and seeded the industry with successful stories and implementations of MT in commercial contexts. 

Are there any LSPs (other than SDL) that you see as really understanding MT, and know how to develop high quality engines and use MT to solve big translation problems?
Yes, many of our Partners are now experts on developing, improving and deploying large scale MT systems to address very large translation challenges.
  1. For example, MATRIX in Germany has built a system to translate technical information for a market-leading, publicly traded engineering client. This system is built on the KantanMT platform and translates documentation into 12 languages. The source language for these engines is German.
  2. Another one of our partners (which is the largest privately owned LSP) translated the entire photograph catalogue of istock.com last year. This project was in 11 languages (the source was English) and resulted in over 750M source words being translated into those 11 languages.
  3. Another one of our LSP clients, Milengo, recently translated the entire beauty catalogue of the largest Nordic eCommerce platform. They achieved this feat in less than 3 weeks. They are now doing this again into one additional Nordic language.
The KantanMT partner network consists of LSPs, all of which are now in a position to implement MT within their localisation workflows. They can do this using the KantanMT platform, generally after completing our MT Orientation and Training Programme. 

Thursday, July 21, 2016

5 Tools to Build Your Basic Machine Translation Toolkit

This is a second post from the MT Language Specialist team at eBay, by Juan Rowda. I have often been asked by translators about what kinds of tools are useful when working with MT. There is a lot of corpus-level data analysis, preparation and editing going on around any competent MT project. While TM tools have some value, they tend to be segment-focused and do not scale; there are much better tools out there to do the corpus pattern analysis, editing and comparison work that is necessary to build the best systems. We are fortunate to have some high-value tools laid out very clearly for us here by Juan, who has extensive direct experience working with large volumes of data and can provide experience-based recommendations.
----------------------------------------------------------------------------------

If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches. It will help you work in an efficient manner, as well. 

As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read "The Next Big Thing You Missed: Why eBay, Not Google, Could Save Automated Translation". 

1. Advanced Text Editors
Notepad won't cut it, trust me. You need an advanced text editor that can, at least:
  • deal with different file encoding formats (UTF, ANSI, etc.)
  • open big files and/or with unusual formats/extensions
  • do global search and replace operations with regular expressions support
  • highlight syntax (display different programming, scripting or markup languages, such as XML and HTML, with color codes)
  • have multiple files open at the same time (tabs)
This is a list of my personal favorites, but there are a lot of good editors out there.
Notepad++: My editor of choice. You can open virtually any file with it, it's really fast, and it will keep your files in the editor even if you close it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It's really easy to convert from/to different file encodings and save all opened files at once. You can also download different plugins, like spellcheckers, comparators, etc. It's free and you can download it from here.


Sublime: This is another amazing editor, and a developers' favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, split a selection of words into different lines, etc. It supports regular expressions and tabs, as well. It has a distraction-free mode if you really need to focus. It's also free, and you can get it here.


EmEditor: Syntax highlighting, document comparison, regular expressions, handles huge files, encoding conversion… EmEditor is extremely complete. My favorite feature, however, is the scriptable macros. This means you can create, record, and run macros within EmEditor; you can use these macros to automate repetitive tasks, like making changes in several files and/or saving them with different extensions. You can download it from here.
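
To give a concrete flavor of the regular-expression work these editors make easy, here is a small Python sketch of typical clean-up operations on a text file. The file name and the specific expressions are just examples; in practice you would usually run equivalent search-and-replace operations interactively in Notepad++, Sublime or EmEditor.

    import re

    # Hypothetical input file with segments to clean up.
    with open("segments.txt", encoding="utf-8") as f:
        text = f.read()

    text = text.replace("\r\n", "\n")                     # normalize Windows line endings
    text = re.sub(r"[ \t]+", " ", text)                   # collapse runs of spaces and tabs
    text = re.sub(r" +$", "", text, flags=re.MULTILINE)   # strip trailing spaces on each line
    text = re.sub(r"\n{3,}", "\n\n", text)                # squeeze excessive blank lines

    with open("segments.clean.txt", "w", encoding="utf-8", newline="\n") as out:
        out.write(text)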

2. QA Tools
Quality Assurance Tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way: 1) you load files with your translated content (source + target); 2) you optionally load reference content, like glossaries, translation memories, previously translated files or blacklists; 3) the tool checks your content and provides a report listing potential errors. (A minimal sketch of a few such checks follows the list below.) Some of the errors you can find using a QA Tool are:
  • terminology: term A in the source is not translated as B in the target
  • blacklisted terms: terms you don't want to see in the target
  • inconsistencies: same source segment with different translations
  • differences in numbers: source and target numbers should match
  • capitalization
  • punctuation: missing or extra periods, duplicate commas, etc.
  • patterns: certain user-defined patterns of words, numbers and signs (which may contain regular expressions to make them more flexible) that are expected to occur in a file
  • grammar and spelling errors
  • duplicate words, tripled letters, and more.
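
To show how simple some of these checks are under the hood, here is a minimal Python sketch that runs a few of them over a list of (source, target) segment pairs. It only illustrates the idea; it is not how Xbench or Checkmate work internally, and the sample data and check names are made up.

    import re
    from collections import defaultdict

    # Hypothetical bilingual data: (source, target) segment pairs.
    segments = [
        ("The battery lasts 8 hours.", "La batería dura 8 horas."),
        ("Press the power button.", "Press the power button."),
        ("Weight: 2 kg", "Peso: 3 kg"),
        ("Click OK.", "Haga clic en en Aceptar."),
    ]

    def untranslated(src, tgt):
        return src.strip() == tgt.strip()

    def number_mismatch(src, tgt):
        nums = lambda s: sorted(re.findall(r"\d+(?:[.,]\d+)?", s))
        return nums(src) != nums(tgt)

    def repeated_word(tgt):
        return re.search(r"\b(\w+)\s+\1\b", tgt, flags=re.IGNORECASE) is not None

    # Inconsistency check needs the whole corpus: same source, different targets.
    targets_by_source = defaultdict(set)
    for src, tgt in segments:
        targets_by_source[src].add(tgt)

    for i, (src, tgt) in enumerate(segments, start=1):
        if untranslated(src, tgt):
            print(f"Segment {i}: possibly untranslated")
        if number_mismatch(src, tgt):
            print(f"Segment {i}: numbers differ between source and target")
        if repeated_word(tgt):
            print(f"Segment {i}: repeated word in target")
        if len(targets_by_source[src]) > 1:
            print(f"Segment {i}: inconsistent translations of the same source")

Real QA tools add many more checks (tags, terminology, capitalization, spelling), but the basic pattern is the same: load the bilingual content, run each check, and report the suspicious segments.
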
Some QA Tools you should try are:
Xbench allows you to run the following QA Checks: find untranslated segments, segments with the same source text and different target text, and segments with the same target text and different source text, find segments whose target text matches the source text (potentially untranslated text), tag mismatches, number mismatches, double blanks, repeated words, terminology mismatches against a list of key terms, and spell-check translations. Some linguists like to add all their reference materials in Xbench, like translation memories, glossaries, termbases and other reference files, as the tool allows you to find a term while working on any other running application with just a shortcut.
Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it; maybe I'll share that in the future. You can get Xbench here.

Checkmate is the QA Tool that is part of the Okapi Framework, which is an open-source suite of applications to support the localization process. That means the Framework includes some other tools, but Checkmate is the one you want to perform quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are: repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc. The patterns section is especially interesting; I will come back to it in the future. Checkmate produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open-source spelling and grammar checker. You can get Checkmate here.

3. Comparison Tools 

Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, e.g. which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion. 

You can also compare entire folders. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not. 


4. Corpus Analysis Tools

As defined by its website, AntConc is a freeware corpus analysis toolkit for concordancing and text analysis. This is, in my opinion, one of the most helpful tools you can find out there when you want to analyze your corpus or content, regardless of the language. AntConc will let you easily find n-grams and sort them by frequency of occurrence. It is a very practical way to identify the highest frequency n-grams in your corpus. Obviously, you want the most frequently used terms to be translated as accurately as possible. In most texts, words like prepositions or articles are the most common ones, so you can use a stop-word list to filter them out when they don't add any value to the task at hand.

AntConc is extremely helpful when it comes to finding patterns in your content. Remember - with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error. With AntConc you can select the minimum and maximum sizes of the n-grams you want to see, as well as the frequency. 

AntConc can create a list of each word occurring in your content, preceded by the number of hits. This can help you get a deeper insight into your corpus for terminology work, like which terms you should include in your glossary. These words can also tell you what your content is about - just by looking at the most frequent words, you can tell if the content is technical or not, if it belongs to any specific domain, and even which MT system you can use to translate it, assuming you have more than one customized system.
There are many things you can use this tool for and it deserves its own article.
Check AntConc out here
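
For readers curious about the mechanics, the core of what AntConc does with word and n-gram frequencies can be sketched in a few lines of Python. This is only an illustration of the concept (with a made-up sample text and stop-word list), not a replacement for the tool.

    import re
    from collections import Counter

    text = "case for iphone 6 case for iphone 6 plus new case for samsung galaxy"
    stop_words = {"for", "the", "a", "of"}  # tiny example stop-word list

    tokens = re.findall(r"\w+", text.lower())

    # Word frequency list, with stop words filtered out.
    word_counts = Counter(t for t in tokens if t not in stop_words)
    print(word_counts.most_common(5))  # [('case', 3), ('iphone', 2), ('6', 2), ...]

    # N-gram frequencies for n between 2 and 3, sorted by number of hits.
    def ngrams(tokens, n):
        return zip(*(tokens[i:] for i in range(n)))

    ngram_counts = Counter(
        " ".join(gram) for n in range(2, 4) for gram in ngrams(tokens, n)
    )
    for gram, hits in ngram_counts.most_common(5):
        print(hits, gram)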


5. CAT Tools

CAT Tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions. 
 
When post-editing with a CAT tool, there are usually two approaches: you can get MT matches from a TM (of course, they need to be added to it previously) or from a connected MT system, or you can work on bilingual, pre-translated files and store only the post-edited segments in your TM. 
 
If you have never tried it, I totally recommend Matecat. It's a free, open-source, web-based CAT tool, with a nice and simple editor that is easy to use. You don't have to install a single file. They claim you will always get up to 20% more matches than with any other CAT tool. Considering some tools out there cost around 800 dollars, what Matecat has to offer for free can't be ignored. It can process more than 50 file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them on the cloud, and download your work. Even if you have never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes. 
 



Another interesting free, open-source option is OmegaT. It is not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much all the same main features commercial CAT tools have, like fuzzy matching and propagation, it supports around 40 different file formats, and it boasts an interface to Google Translate. If you have never used it, you should give it a try. 


If you are looking into investing some money and getting a commercial tool, my personal favorite is MemoQ. It has tons of cool features and, overall, is a solid translation environment. It probably deserves a more detailed review, but that is outside of the scope of this post. You can learn more about MemoQ here.



Juan Rowda
Staff MT Language Specialist, eBay

Juan is a certified localization professional working in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as translator/editor for several years, managed and trained a team of +10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan helped to localize quite a few major videogames, as well. 
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation. 

Thursday, July 14, 2016

When MT does not take translators' jobs away - and may create more jobs

This is a guest post by Silvio Picinini, who works in a team at eBay that provides linguistic feedback and addresses linguistic issues, specifically to enhance the large-scale MT projects underway at eBay. To my mind this is an example of best practices in MT, where you have NLP and MT experts working together with linguists to solve large-scale translation problems in a collaborative way.  

The eBay linguistic team has actually been producing a number of articles that describe various kinds of linguistic tasks that are increasingly needed to add value and quality to large-scale MT efforts. I think these articles are worth greater attention, as they have a high SNR (signal-to-noise ratio). They are educating and informing readers of very specific things that IMO together add up to examples of best practice. I am hoping that Silvio and his colleagues become regular contributors to this blog so that more people get access to this valuable information.
------------------------------------------------------------------------------------

I was honored to be invited by Kirti to write for this blog. I hope to deserve it, by sharing my experiences as a translator working with machine translation. Recently I was really impressed by Kirti's post on how a lot of content is being translated outside of the translation services industry. I would like to add a few thoughts to that.
I work with User-Generated Content for eBay. Users all over the world describe what they are selling, creating titles and descriptions for their items. In the millions. We need to translate the information on these items so that users who speak other languages can buy them. So this is the job: translate millions of items quickly, almost instantly. A new initiative at eBay is structuring data in a different way and making it easier to create product reviews. In a short period, we accumulated millions of reviews. A review written in English about a digital camera (a product sold globally) is probably very useful for a buyer in Germany or in Mexico. So we need these reviews translated for these buyers. Could we do this by hiring human translators? No. It is easy to see that, given the volume, time and cost involved, human intervention is out of the question. Virtually anything that is open to users, allowing them to create their own content, will generate volumes that are not feasible to translate by hand. These are real scenarios from eBay, but Facebook also recently announced the translation of posts with their own MT engine, and Amazon is working on MT as well.

In addition to what is already happening, we live in a world where new forms of content created by users appear every day. This content is of interest to a lot of people, and that will require translation. So here are some types of User-Generated Content that, in my opinion, will be of interest beyond their original language. I am guessing that the companies behind them may be interested in translating this content in the (near) future:
  • Rental Homes reviews on Airbnb
  • TripAdvisor reviews of places to see, eat and stay
  • Netflix movie reviews
  • LinkedIn articles
  • Tweets
  • How-to guides
  • Knowledge bases
  • Even Yelp reviews that seem local can be of interest to visitors from other countries or speakers of a second language in the same country (French in Canada, Spanish in the US)
  • In e-commerce: Product titles and descriptions, product reviews, messaging and user searches.

So this is what I meant with the title of the post: Translators would never be offered User-Generated Content translations, so when these jobs go to machine translation engines, they are not really affected by it in any way. MT is not taking any translator's jobs if there was no job in the first place. But maybe translators would like to affect this enormous translation market. Kirti has been posting guidance on how translators can prepare to participate in this opportunity. 
From my experience at eBay, here are a few thoughts about the role that translators may play.
  • MT engines will need to be trained. The specific content needed for training may not be available to be harvested. Therefore, companies will need to create training data for their engines. This training data will be post-edited from the MT output, and this is a job that requires the human intervention of post-editors and reviewers. The quality of the MT output needs to be measured, and the measurement requires (in the case of BLEU) a human-translated reference. So there is also a role for translators, instead of post-editors, in creating references for MT measurements (a small scoring example follows this list).
  • The importance of the pattern over the individual error: the usual mindset for translators and reviewers is to focus on every error they see, correct it, and thus produce perfect quality. For MT, the mindset should focus on patterns of errors. Translators can make a bigger impact by finding patterns of errors whose correction improves quality on a larger scale, in every translation that the MT engine produces.
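
Since BLEU comes up here, a tiny example may help show why the human reference matters. The sketch below uses the sacrebleu Python package (installable with pip) to score an invented MT output against an invented reference translation; the sentences are made up for illustration.

    import sacrebleu

    # Hypothetical MT output and its human-translated reference.
    hypotheses = ["the camera take great photos in low light"]
    references = [["the camera takes great photos in low light"]]

    # corpus_bleu expects a list of hypotheses and a list of reference lists.
    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.1f}")

The score is only meaningful relative to that reference, which is why creating good references is a translation task in its own right.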

Translators have the linguistic ability to see these error patterns. In this paper at AMTA 2014, I presented a few patterns found in Brazilian Portuguese:
  • Diminutives are widely used by users in informal language, and are not commonly present in the training data, which is usually in a more formal language.
  • The lack of diacritical marks is common among users, both for accents and for marks that modify letters such as ç, ã and õ. The training data is usually written in more formal language and will contain all the diacritical marks. The MT will have to deal with these differences, such as "relogio" vs. "relógio" and "calca" vs. "calça" (a small normalization sketch follows this list).
  • Some words are intended for the target language but are also words in the source language, causing issues. "Costumes" is a word in English, but also in Portuguese.
  • Some words are misspelled because certain letters have the same sound, causing issues for MT. For example, "engraçados" spelled as "engrassados" (ç and ss have the same sound).
  • Some words are spelled as people pronounce them, and this is different from the correct written form. For example, "roupa" spelled as "ropa". MT needs to deal with that.
  • Some English words are spelled as they would be written with Portuguese language rules. So "Michael Jordan" would become "Maico Jordam". 
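
To illustrate the diacritics point above, here is a small Python sketch that strips diacritical marks so that accented and unaccented spellings can be grouped and counted together when looking for these patterns. It is an illustration only, not the actual process used at eBay.

    import unicodedata

    def strip_diacritics(text):
        """Remove combining marks: 'relógio' -> 'relogio', 'calça' -> 'calca'."""
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    pairs = [("relógio", "relogio"), ("calça", "calca"), ("engraçados", "engrassados")]
    for formal, user in pairs:
        same = strip_diacritics(formal) == strip_diacritics(user)
        print(f"{formal!r} vs {user!r}: match without diacritics = {same}")

Note that the last pair still does not match: phonetic misspellings like ç written as ss are a different pattern and need their own handling.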
 

There are MT companies, academic experts and customer engineering teams working with MT. It may be time for the language experts to play a role. 


Silvio Picinini is a Machine Translation Language Specialist at eBay since 2013. With over 20 years in Localization, he worked on quality-focused positions for an LSP, and as an in-house English into Brazilian Portuguese translator for Oracle. A former quality engineer, Silvio holds degrees in electrical and electronic/software engineering. 
 
LinkedIn: https://www.linkedin.com/in/silviopicinini