eMpTy Pages: UTH - Another Chinese Translation Memory Data Utility

This is a guest post by Henry Wang of UTH. I include a brief interview I conducted before Henry wrote this post. I think this focus on developing a data marketplace is interesting, as I happen to believe that the data used to train machine learning systems is often more important than the algorithms themselves. The number of open source toolkits available for building Neural MT systems is now almost 10.

I do not have a sense of whether the quality of the UTH data is better than that of other data utilities that exist, and this post is not an endorsement of UTH by me. They do, however, appear to be investing much more effort in cleaning the data, but I feel that the metadata is still sorely lacking for real value to come from this data. And metadata is not just about domain classification. It will be interesting to see the quality of the MT systems that are built using this data, and that evidence will be the best indicator of the quality and value of this data to the MT community.

These data initiatives also reflect the building AI momentum in China. If you have the right data, you can learn to develop high-value, narrowly focused machine learning solutions.

What are the primary sources of your data?

Henry: The primary sources of our data include LSPs (language service providers), freelance translators, language service buyers, and several big data organizations.

Can you describe the metadata that you allow users to access to extract the most meaningful subsets for their purposes? Can you provide an overview of your detailed data taxonomy?

Henry: We created a three-tier pyramid structure of the data with 15 top-tier domains, 41 intermediate domains, and 178 bottom-level domains. Users can extract the subsets by choosing domain names (among the three tiers), language combinations, and other items that we provided and are going to provide on our product UIs.
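To make the subset extraction concrete, here is a minimal sketch of filtering translation units by language combination and by a position in a three-tier domain hierarchy. Everything in it is an assumption for illustration: the `TranslationUnit` record, the `select_units` helper, and the domain names are invented here and are not UTH's actual schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationUnit:
    """Hypothetical record for one translation unit (not UTH's real schema)."""
    source: str
    target: str
    lang_pair: tuple  # e.g. ("en", "zh")
    domain: tuple     # (tier 1, tier 2, tier 3); names below are invented


def select_units(units, lang_pair, domain_prefix):
    """Keep units for one language pair whose domain path starts with domain_prefix.

    A one-element prefix selects a whole top-tier domain, a two-element prefix
    an intermediate domain, and a three-element prefix a bottom-level domain.
    """
    prefix = tuple(domain_prefix)
    depth = len(prefix)
    return [
        u for u in units
        if u.lang_pair == lang_pair and u.domain[:depth] == prefix
    ]


units = [
    TranslationUnit("Take one tablet daily.", "每日服用一片。",
                    ("en", "zh"), ("Medicine", "Pharmacy", "Dosage")),
    TranslationUnit("The patient was discharged.", "患者已出院。",
                    ("en", "zh"), ("Medicine", "Clinical", "Records")),
    TranslationUnit("The contract is void.", "合同无效。",
                    ("en", "zh"), ("Law", "Contracts", "Validity")),
    TranslationUnit("Take one tablet daily.", "Prenez un comprimé par jour.",
                    ("en", "fr"), ("Medicine", "Pharmacy", "Dosage")),
]

medicine_en_zh = select_units(units, ("en", "zh"), ("Medicine",))
pharmacy_en_zh = select_units(units, ("en", "zh"), ("Medicine", "Pharmacy"))
```

Matching on a path prefix means one function covers all three tiers, so a user can start broad and narrow down without a separate query per level.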

Who are your primary customers?

Henry: MT companies/labs, LSPs, AI companies, e-commerce companies, and universities.

Do you price differently for LSPs, who might use less data, than for MT developers, who need much more data?

Henry: Yes

Do you plan to provide an English interface so that users across the world can also access your data?

Henry: Yes, we have launched several products with English UIs, including Sesame Search (www.zhimasousuo.com).

Do you have your own MT solution? How does it compare with Google for some key languages?

Henry: We are working on that. We also partner with Sogou and several MT labs in China for different language combinations. We believe we will do better than Google in China-related language pairs, and this will come true within 2 years.

Do you see an increasing interest in the use of this kind of language data? What other applications beyond translation?

Henry: Yes, an increasing number of leading AI, e-commerce, MT, and cross-border business companies are reaching out to us for cooperation. Also, we see a big potential in the education/e-learning field. Sesame Lingo is one of our innovative products for language teaching and training with the language data in the core database. Other applications include smart writing and pure data mining that might be applicable to many industries.

What are some of the most interesting research applications of your data from the academic sector?

Henry: Corpus-based studies, and a lot of others.

What are the most urgent data needs that you have by language where there is not enough data?

Henry: Southeast Asian languages, and South Asian languages.

Are you trying to create new combinations of parallel data from existing data? e.g. If there is English to Hindi and English to Chinese in the same domain and subject – could you align the data to create Hindi <> Chinese data?

Henry: Yes, we mastered that technology years ago, which is why we have an increasing number of language combinations and an increasing amount of data.
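The idea in the question can be sketched as a simple join on the shared English side. This is only an illustration under stated assumptions: the `pivot_pairs` function, the whitespace-and-case normalization, and the sample sentences are invented here, and nothing is known about how UTH actually performs this alignment. A production pipeline would likely need fuzzy matching and quality filtering on top of this.

```python
from collections import defaultdict


def pivot_pairs(en_hi, en_zh):
    """Derive Hindi-Chinese pairs from English-Hindi and English-Chinese pairs.

    Two units are joined when their English sides are identical after a light
    normalization (lowercasing and collapsing whitespace). Exact matching is an
    assumption made to keep the sketch small.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    # Index the Chinese translations by their normalized English source.
    zh_by_en = defaultdict(list)
    for en, zh in en_zh:
        zh_by_en[normalize(en)].append(zh)

    # For each English-Hindi unit, emit one pair per matching Chinese unit.
    pairs = []
    for en, hi in en_hi:
        for zh in zh_by_en.get(normalize(en), []):
            pairs.append((hi, zh))
    return pairs


en_hi = [("Thank you.", "धन्यवाद।"), ("Good morning.", "सुप्रभात।")]
en_zh = [("thank you.", "谢谢。"), ("Good night.", "晚安。")]

hi_zh = pivot_pairs(en_hi, en_zh)
```

In this toy input only "Thank you." appears on both English sides, so a single Hindi-Chinese pair is produced. Note that any error or looseness in either original translation carries over into the derived pair, which is one reason the quality of pivoted data needs separate checking.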

What is your feeling about the general usefulness of this kind of data in future?

Henry: With the development of data mining technologies, it will be applied to many more industries for sure. We are currently working very hard on in-context data and comparable data, which will be even more useful.

========

UTH, a Shanghai-based company, is a pioneer in the language service industry. UTH’s mission is to deliver innovative solutions to overcome challenges in language services with petabyte-scale translation data. Since it was founded in 2012, UTH has accumulated more than 15 billion translation units across over 220 languages, including Arabic, Bulgarian, Chinese-Simplified, Chinese-Traditional, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Japanese, Korean, Polish, Portuguese, Romanian, Russian, Slovenian, Spanish, Swedish, Thai, Lao, and Khmer. This coverage, which includes a majority of the languages used in participating countries, has enabled it to secure a strong foothold in China’s Belt and Road Initiative. It has also helped UTH win the support and cooperation of research institutions, language service buyers and providers, IT giants, e-commerce companies, and government agencies, as well as investment support from venture capitalists. Last year, it completed its Series B investment from Sogou, the second largest search engine by mobile queries in China. Sogou completed its own IPO last year and posted $908.36 million of revenue in FY17.

UTH complements its translation data business with a diverse set of tools for MT, language teaching and learning, and corpus research, which in turn sharpens its insight into the exploitation of big language data and artificial intelligence. Sesame Lingo is one of its products used for language teaching and training with parallel corpus data in the core database, and Sesame Search is an online corpus platform featuring multi-dimensional data classification, search, intuitive data presentation, and patented language processing technologies. Recently, UTH has completed several acquisitions to expand its business territory in e-learning, smart writing, data mining, and language services. With strong alliances with Sogou, 5 mid-sized LSPs, 2 AI companies, and many more in 2018, UTH has already established an initial ecosystem and become the largest translation database in China. UTH has seen an increasing number of leading AI, e-commerce, MT, and cross-border business companies worldwide reaching out to it for potential collaboration opportunities.

UTH has embarked on a pioneering path similar to that of TAUS, yet it has a distinct advantage. TDA from TAUS is based on a data-sharing mechanism, and the control of data quality is largely determined by data owners’ integrity and their internal quality control processes. When accumulating language data, TAUS exploits a data-breeding technology, in which it cross-selects translation units from different languages that share a common translation in a third language to form new pairs. At UTH, more than 50 in-house corpus linguists and engineers, supported by around 400 contracted linguists, work meticulously on language data sourcing, collection, alignment, and annotation, overseen by trained testers under rigorous internal quality rules. UTH has formulated a relatively complete set of language quality management practices with reference to LQA models and ISO standards, embedded in its in-house tools for higher efficiency.

UTH’s close cooperation with the academic sector gives the company a unique perspective on the potential of language data. Its data are classified into a three-tier pyramid (15 Level I domains, 41 Level II domains, and 178 Level III domains) for the purpose of mapping the requirements of LSPs to the academic disciplines in Chinese universities and making the data easily accessible to teachers and students on campus, an approach that has won wide acclaim from education experts. In addition, the company launched its education cooperation initiatives in 2017, building several internship bases and joint research programs with prestigious universities in China and overseas, including Southeast University, the University of International Business and Economics, and Nanyang Technological University.

In-domain and in-context data is currently UTH’s priority and its major differentiator. As the largest repository of parallel articles in China, UTH is cooperating with LSPs (language service providers), freelance translators, language service buyers, and several big data organizations, orchestrating the flow of high-quality data among these organizations and turning idle language data into flowing value. As a hub for data exchange, filtering, and processing, UTH has become an indispensable part of, and a booster for, this trade.

Nowadays, with the increasingly wide application of NMT on Twitter, Facebook, WeChat, QQ, and UGC platforms, as well as the industrial application of MT in interpretation that helps people connect with each other across language barriers, translation is growing into a crucial business energizer. However, the technological edge of forerunners such as Google is diminishing, resulting in a narrower gap in translation quality among NMT vendors, including Bing, SYSTRAN, SDL, and DeepL, as well as Baidu, Sogou, NetEase, and iFlytek in China. NMT is a data-hungry application, in which data is fed into neural networks to improve their intelligence. Therefore, good-quality, fine-tuned translation data will become a crucial part of this fierce competition.

As a trailblazer in China, UTH is now feeding its translation data to several MT companies and MT labs and working with them to improve the final products, in the hope that they will do better than Google in Chinese-related language pairs in the very near future.

Friday, April 6, 2018