Thursday, July 21, 2016

5 Tools to Build Your Basic Machine Translation Toolkit


This is a second post from the MT Language Specialist team at eBay. I have often been asked by translators what kinds of tools are useful when working with MT. A lot of corpus-level data analysis, preparation, and editing goes on around any competent MT project. While TM tools have some value, they tend to be segment-focused and do not scale; there are much better tools out there for the corpus pattern analysis, editing, and comparison work needed to build the best systems. We are fortunate to have some high-value tools laid out very clearly for us here by Juan, who has extensive direct experience working with large volumes of data and can provide experience-based recommendations.
----------------------------------------------------------------------------------
If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches. It will help you work in an efficient manner, as well.
As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read “The Next Big Thing You Missed: Why eBay, Not Google, Could Save Automated Translation”. 

 1. Advanced Text Editors 
Notepad won’t cut it, trust me. You need an advanced text editor that can, at least:
  • deal with different file encoding formats (UTF, ANSI, etc.) 
  • open big files and/or with unusual formats/extensions
  • do global search and replace operations with regular expressions support
  • highlight syntax (display different programming, scripting or markup languages -XML, HTML, etc.- with color codes)
  • have multiple files open at the same time (tabs)
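
To make the encoding and regular-expression points concrete, here is a minimal Python sketch of the kind of operation these editors perform from their search-and-replace dialogs. The function names and the default encodings are assumptions chosen for illustration; they are not taken from any of the editors discussed here.

```python
import re


def normalize_spacing(text: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace on each line."""
    # Two or more spaces or tabs in a row become a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # With MULTILINE, "$" matches at the end of every line, not just the last one.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


def convert_encoding(data: bytes, source: str = "cp1252", target: str = "utf-8") -> bytes:
    """Re-encode raw file bytes from one encoding to another (defaults are assumptions)."""
    return data.decode(source).encode(target)


if __name__ == "__main__":
    print(repr(normalize_spacing("red  shoes \nsize\t\t10")))
    print(convert_encoding("café".encode("cp1252")))
```

An editor does the same thing interactively, which is usually faster for one-off fixes; a script like this only pays off when the same cleanup has to be repeated over many files.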

This is a list of my personal favorites, but there are a lot of good editors out there. 

Notepad++: My editor of choice. You can open virtually any file with it, it’s really fast, and it will keep your files in the editor even if you close it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It’s really easy to convert between file encodings and to save all open files at once. You can also download different plugins, like spellcheckers, comparators, etc. It’s free and you can download it from here.
 
Sublime: This is another amazing editor, and a developers’ favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, splitting a selection of words into different lines, etc. It supports regular expressions and tabs, as well. It has a distraction-free mode if you really need to focus. It can be downloaded and evaluated for free, and you can get it here.

EmEditor: Syntax highlighting, document comparison, regular expressions, huge-file handling, encoding conversion… EmEditor is extremely complete. My favorite feature, however, is its scriptable macros. This means you can create, record, and run macros within EmEditor, and use them to automate repetitive tasks, like making changes in several files and/or saving them with different extensions. You can download it from here.
 2. QA Tools 
Quality Assurance Tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way: 1) you load files with your translated content (source + target); 2) you optionally load reference content, like glossaries, translation memories, previously translated files or blacklists; 3) the tool checks your content and provides a report listing potential errors. Some of the errors you can find using a QA Tool are:
  • terminology: term A in the source is not translated as B in the target
  • blacklisted terms: terms you don’t want to see in the target
  • inconsistencies: same source segment with different translations
  • differences in numbers: source and target numbers should match
  • capitalization
  • punctuation: missing or extra periods, duplicate commas, etc.
  • patterns: certain user-defined patterns of words, numbers and signs, which may contain regular expressions to make them more flexible, expected to occur in a file.
  • grammar and spelling errors
  • duplicate words, tripled letters, and more.
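
A few of the checks listed above can be sketched in a handful of lines of Python. This is only an illustration of the idea, not how any particular QA tool implements it; the function names and regular expressions are assumptions, and the number check deliberately ignores locale differences such as 1.5 versus 1,5.

```python
import re

# A number is a run of digits, optionally with "." or "," separated groups.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# The same word twice in a row, separated only by whitespace.
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def numbers_match(source: str, target: str) -> bool:
    """True when source and target contain the same numbers (order ignored)."""
    return sorted(NUMBER.findall(source)) == sorted(NUMBER.findall(target))


def repeated_words(target: str) -> list[str]:
    """Return every word that appears twice in a row in the target."""
    return [match.group(1) for match in REPEATED_WORD.finditer(target)]


def blacklisted_terms(target: str, blacklist: set[str]) -> list[str]:
    """Return the blacklisted single-word terms found in the target."""
    words = set(re.findall(r"\w+", target.lower()))
    return sorted(words & {term.lower() for term in blacklist})


if __name__ == "__main__":
    print(numbers_match("Pack of 12 pens", "Paquete de 12 bolígrafos"))
    print(repeated_words("el el cable"))
    print(blacklisted_terms("Cheap replica watch", {"replica", "fake"}))
```

Dedicated QA tools add what this sketch leaves out: bilingual file parsing, tag handling, and a report you can hand back to a translator.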
Some QA Tools you should try are: 

Xbench allows you to run the following QA Checks: find untranslated segments, segments with the same source text and different target text, and segments with the same target text and different source text, find segments whose target text matches the source text (potentially untranslated text), tag mismatches, number mismatches, double blanks, repeated words, terminology mismatches against a list of key terms, and spell-check translations. Some linguists like to add all their reference materials in Xbench, like translation memories, glossaries, termbases and other reference files, as the tool allows you to find a term while working on any other running application with just a shortcut.
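
The inconsistency and untranslated-segment checks described above boil down to grouping segments by their source text. The following Python sketch shows the idea under the assumption that segments are available as simple (source, target) pairs; it is not Xbench's implementation, and the function names are hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(segments):
    """Map each source text that has more than one distinct target to its targets."""
    targets_by_source = defaultdict(set)
    for source, target in segments:
        targets_by_source[source].add(target)
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


def find_untranslated(segments):
    """Return sources whose target is identical to the source (possibly untranslated)."""
    return [source for source, target in segments if source.strip() == target.strip()]


if __name__ == "__main__":
    sample = [
        ("Free shipping", "Envío gratis"),
        ("Free shipping", "Envío gratuito"),
        ("New", "Nuevo"),
        ("iPhone case", "iPhone case"),
    ]
    print(find_inconsistencies(sample))
    print(find_untranslated(sample))
```

Swapping the roles of source and target in `find_inconsistencies` gives the reverse check (same target text, different source text).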

Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it; maybe I’ll share that in the future. You can get Xbench here.

Checkmate is the QA Tool part of the Okapi Framework, which is an open source suite of applications to support the localization process. That means the Framework includes some other tools, but Checkmate is the one you want for performing quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are: repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc. The patterns section is especially interesting; I will come back to it in the future. Checkmate produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open source spelling and grammar checker. You can get Checkmate here.
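
As an illustration of the length check mentioned above, here is a small Python sketch that flags segments whose target is much shorter or longer than the source. The thresholds (0.5 and 2.0) and the function name are assumptions picked for the example, not Checkmate defaults; sensible values depend on the language pair.

```python
def length_ratio_outliers(segments, low=0.5, high=2.0):
    """Flag (source, target, ratio) where target length / source length is out of range."""
    flagged = []
    for source, target in segments:
        if not source:
            continue  # An empty source has no meaningful ratio.
        ratio = len(target) / len(source)
        if ratio < low or ratio > high:
            flagged.append((source, target, round(ratio, 2)))
    return flagged


if __name__ == "__main__":
    sample = [
        ("Blue leather wallet", "Cartera"),
        ("Red shoes", "Zapatos rojos"),
    ]
    for source, target, ratio in length_ratio_outliers(sample):
        print(f"{ratio}\t{source}\t{target}")
```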


In the following part, we discuss Comparison, Corpus Analysis, and CAT Tools.

3. Comparison Tools


Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, e.g. which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion.

You can also compare entire folders. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not. 
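
For a sense of what a comparison tool computes, the sketch below uses Python's standard `difflib` module to list the word-level changes between two versions of a segment, for example before and after post-editing. It is a simplified illustration with a hypothetical function name, not a description of how Beyond Compare works.

```python
import difflib


def word_changes(before: str, after: str):
    """Return (operation, old_words, new_words) for each changed span of words."""
    old, new = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, " ".join(old[i1:i2]), " ".join(new[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # Keep only replacements, insertions, and deletions.
    ]


if __name__ == "__main__":
    raw_mt = "Brand new phone with box"
    post_edited = "Brand new telephone with original box"
    for operation, old_words, new_words in word_changes(raw_mt, post_edited):
        print(operation, repr(old_words), "->", repr(new_words))
```

For folder-level comparison, Python's standard `filecmp.dircmp` class reports files that exist on only one side and files whose contents differ, which mirrors the folder view described above.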

4. Corpus Analysis Tools

As defined by its website, AntConc is a freeware corpus analysis toolkit for concordancing and text analysis. This is, in my opinion, one of the most helpful tools you can find out there when you want to analyze your corpus or content, regardless of the language. AntConc will let you easily find n-grams and sort them by frequency of occurrence. It is a very practical way to identify the highest frequency n-grams in your corpus. Obviously, you want the most frequently used terms to be translated as accurately as possible. In most texts, words like prepositions or articles are the most common ones, so you can use a stop-word list to filter them out when they don't add any value to the task at hand. 
AntConc is extremely helpful when it comes to finding patterns in your content. Remember: with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error. With AntConc you can select the minimum and maximum sizes of the n-grams you want to see, as well as the frequency.
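
The n-gram idea is easy to see in code. This is a minimal Python sketch, assuming whitespace tokenization and a corpus held as a list of lines; the function and parameter names are hypothetical and do not correspond to AntConc settings, although they play the same role as its n-gram size and frequency options.

```python
from collections import Counter


def ngram_counts(lines, min_n=2, max_n=3, min_freq=2):
    """Count n-grams of min_n..max_n tokens and keep those seen at least min_freq times."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        for n in range(min_n, max_n + 1):
            # Slide a window of n tokens across the line.
            for start in range(len(tokens) - n + 1):
                counts[" ".join(tokens[start:start + n])] += 1
    # most_common() sorts by descending frequency.
    return [(gram, count) for gram, count in counts.most_common() if count >= min_freq]


if __name__ == "__main__":
    corpus = [
        "free shipping on all orders",
        "free shipping worldwide",
        "fast free shipping",
    ]
    for gram, count in ngram_counts(corpus):
        print(count, gram)
```

A frequent n-gram that the MT system gets wrong is exactly the kind of pattern worth fixing once rather than segment by segment.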
AntConc can create a list of each word occurring in your content, preceded by the number of hits. This can help you get a deeper insight into your corpus for terminology work, like which terms you should include in your glossary. These words can also tell you what your content is about: just by looking at the most frequent words, you can tell if the content is technical or not, if it belongs to any specific domain, and even which MT system you can use to translate it, assuming you have more than one customized system.
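
A word list of this kind, with a stop-word filter applied, can be approximated with a short Python sketch. The stop-word set below is a tiny illustrative assumption; a real list would be much longer and language-specific.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a complete resource).
STOP_WORDS = {"the", "a", "an", "of", "for", "with", "and", "in"}


def word_list(text, stop_words=STOP_WORDS):
    """Return (word, hits) pairs sorted by descending frequency, minus stop words."""
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in stop_words).most_common()


if __name__ == "__main__":
    sample = "Case for the iPhone. Leather case with a strap. Case and charger."
    for word, hits in word_list(sample):
        print(f"{hits}\t{word}")
```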

There are many things you can use this tool for and it deserves its own article.   
Check AntConc out here.

5. CAT Tools


CAT Tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions.
When post-editing with a CAT tool, there are usually 2 approaches: you can get MT matches from a TM (of course, they need to be added to it previously) or from a connected MT system, or you can work on bilingual, pre-translated files and store only post-edited segments in your TM.
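
To illustrate what a TM lookup does, here is a rough Python sketch that treats the TM as a dictionary of source/target pairs and scores candidates with the standard `difflib.SequenceMatcher`. Real CAT tools use their own matching algorithms and penalties, so the character-based similarity score, the 0.75 threshold, and the function name are all assumptions made for this example.

```python
import difflib


def best_tm_match(segment, tm, threshold=0.75):
    """Return (score, source, target) for the closest TM entry, or None below threshold."""
    best = None
    for source, target in tm.items():
        # ratio() is a 0..1 similarity score; 1.0 means the strings are identical.
        score = difflib.SequenceMatcher(
            None, segment.lower(), source.lower(), autojunk=False
        ).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    return best if best is not None and best[0] >= threshold else None


if __name__ == "__main__":
    tm = {
        "Red leather wallet": "Cartera de cuero roja",
        "Wireless mouse": "Ratón inalámbrico",
    }
    print(best_tm_match("Red leather wallets", tm))
    print(best_tm_match("USB cable", tm))
    # Storing a post-edited segment for future reuse is just another TM entry.
    tm["USB cable"] = "Cable USB"
```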
If you have never tried it, I totally recommend Matecat. It's a free, open source, web-based CAT tool, with a nice and simple editor that is easy to use. You don’t have to install a single file. They claim you will always get up to 20% more matches than with any other CAT tool. Considering some tools out there cost around 800 dollars, what Matecat has to offer for free can’t be ignored. It can process more than 50 file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them on the cloud, and download your work. Even if you have never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes.


Another interesting free, open-source option is OmegaT. It is not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much the same main features as commercial CAT tools, such as fuzzy matching and propagation; it supports around 40 different file formats; and it offers an interface to Google Translate. If you have never used it, you should give it a try.

If you are considering investing some money in a commercial tool, my personal favorite is memoQ. It has tons of cool features and, overall, is a solid translation environment. It probably deserves a more detailed review, but that is outside the scope of this post. You can learn more about memoQ here.
Juan Rowda
Staff MT Language Specialist, eBay

Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan has also helped localize quite a few major videogames.

He was also a professional CAT tool trainer and taught courses on localization.

Juan holds a BA in technical, scientific, legal, and literary translation.