This is a guest post by the principals of Tmxmall, a Chinese translation memory marketplace, i.e., a site that sells translation memory to interested parties (mostly translators and LSPs). This is one of two such initiatives coming from China. I will be featuring the other one shortly.
There are three core modules in TM ROBOT's working environment: the client terminals (TM ROBOTs) whose owners are willing to share TMs, the Tmxmall TM marketplace, and CAT tools that are integrated with the Open API of the Tmxmall TM marketplace. A client terminal (TM ROBOT) chooses the TMs that it allows to be shared and then submits random sentence pairs to a P2P platform. If the quality of the submitted random sample is approved, the source TMs on that client terminal are included in the P2P platform and can then be retrieved by all users of the platform. When users translate in CAT tools that are integrated with the Open API of the Tmxmall TM marketplace, the source sentences are sent to the TM marketplace. The marketplace then distributes the source sentences to all client terminals (TM ROBOTs) that are online and have shared TMs. When a client terminal receives a query request, it searches its local shared TMs and returns the matched results to Tmxmall. Tmxmall aggregates all the results and returns the optimized result to the CAT user. When the optimized result is returned, Tmxmall deducts the relevant fees from the CAT user's account in the Tmxmall platform system.
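The query path described above can be sketched in a few lines of Python. This is only an in-process illustration under stated assumptions: the `TmRobot` and `Marketplace` classes, the use of `difflib.SequenceMatcher` as a fuzzy-match score, the 75% cut-off, and the sample sentences are all hypothetical stand-ins, not Tmxmall's actual API or scoring. The per-1,000-word prices are the marketplace tiers quoted in this post, and the sample-approval step is left out.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Per-1,000-word prices by match band, as quoted for the TM marketplace in this post.
PRICE_TIERS = [(100, 1.50), (95, 1.24), (85, 0.78), (75, 0.45)]


def price_per_1000_words(match_percent: int) -> float:
    """Return the quoted price for a match percentage (0.0 below 75%)."""
    for floor, price in PRICE_TIERS:
        if match_percent >= floor:
            return price
    return 0.0


@dataclass
class TmRobot:
    """Hypothetical client terminal holding locally shared TM segments."""
    name: str
    shared_tm: dict[str, str]  # source sentence -> target sentence
    online: bool = True

    def search(self, query: str):
        """Return (match_percent, source, target, robot_name) for the best local match."""
        best = None
        for source, target in self.shared_tm.items():
            # Assumption: character-level similarity stands in for a real fuzzy-match score.
            ratio = SequenceMatcher(None, query.lower(), source.lower()).ratio()
            percent = int(ratio * 100)
            if best is None or percent > best[0]:
                best = (percent, source, target, self.name)
        return best


@dataclass
class Marketplace:
    """Hypothetical marketplace that fans queries out and bills the CAT user."""
    robots: list[TmRobot] = field(default_factory=list)
    balances: dict[str, float] = field(default_factory=dict)

    def query(self, user: str, sentence: str):
        """Send the sentence to every online robot, keep the best hit, deduct the fee."""
        hits = [robot.search(sentence) for robot in self.robots if robot.online]
        hits = [hit for hit in hits if hit is not None and hit[0] >= 75]
        if not hits:
            return None  # nothing usable, so nothing is charged
        best = max(hits, key=lambda hit: hit[0])
        words = len(sentence.split())
        fee = price_per_1000_words(best[0]) * words / 1000
        self.balances[user] = self.balances.get(user, 0.0) - fee
        return best


# Made-up data: one online robot, one offline robot, one CAT user with a $1.00 balance.
market = Marketplace(
    robots=[
        TmRobot("robot-a", {"Please restart the device.": "请重启设备。"}),
        TmRobot("robot-b", {"Save the file first.": "请先保存文件。"}, online=False),
    ],
    balances={"alice": 1.0},
)
hit = market.query("alice", "Please restart the device.")
```

In this sketch, a four-word sentence with a 100% match costs 1.50 × 4 / 1000 = $0.006, and a query that only an offline robot could answer returns nothing and costs nothing.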
Data is a critical requirement for any of the modern MT technology paradigms. While SMT was able to handle some level of noise in its training data, NMT appears to be more particular, and the "more data is better" principle does not apply as clearly, if it ever did. The quality of the training data matters in AI and machine learning. The pursuit of training data resources will always exist in the context of any machine learning technology. However, commodity data that you can easily obtain or buy tends to be low-quality data, not high-quality data, by most people's assessment.
This is what TAUS has to say about these new initiatives:
"The internet giants had a competitive edge in translation data, but they spoiled it by polluting their own fishing grounds with machine translations. Now, the hunt is open for new data marketplaces. The European Commission is investing in the Connecting European Facility. But watch out also for the greenfield translation data ventures in China, or perhaps closer to home: the TAUS Data Cloud."
To the best of my knowledge, data sharing initiatives have not been particularly successful with SMT. There has always been an issue of uneven quality when disparate data is pooled together. I am not sure this changes with these new TM marketplaces. I believe that rich metatags providing some meaningful, consistent, and objective indication of quality are likely needed to make such data exchanges viable. I recall from my experiments with TDA data that it is often wise to completely exclude certain lower-quality datasets and sources, but this understanding came only after much trial, error, and effort.
I find the vision of the cloud-based Online CAT much more compelling than desktop solutions, and I would not be surprised if these collaborative, big-data-based, multi-tool work environments do indeed become increasingly compelling, even to power users of yesterday's technology.
So here are some of my favorite quotes about data.
“Data-intensive projects have a single point of failure: data quality” - George Krasadakis, Data Quality in the era of AI
“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.” - Eliezer Yudkowsky
“We don’t have better algorithms. We just have more data.” - Peter Norvig
“The sad thing about artificial intelligence is that it lacks artifice and therefore intelligence.” - Jean Baudrillard
“We’re entering a new world in which data may be more important than software.” - Tim O’Reilly, Founder, O’Reilly Media
“Data is a precious thing and will last longer than the systems themselves.” - Tim Berners-Lee, father of the World Wide Web
Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is. K.V.😉
For those of you interested in eDiscovery and new applications for MT in Information Governance please check out this webinar that I just did together with Nuance.
====
Tmxmall is one of the leading providers specializing in language asset management and promoting global TM sharing, with its headquarters in Shanghai, China. We are a team of technology and language geeks who have created products around translation memories, helping translators and MT providers make better use of translation memory data.
Our Story
In 2014, Jing Zhang, the founder and CEO of Tmxmall, who holds a bachelor's degree in Computer Science from Northwestern Polytechnical University and a master's degree in Information Management from Tianjin University, left Baidu and started his own business with his classmate Jian Chen, who had worked for Huawei and Baidu and is now the CTO of Tmxmall.
Both Jing and Jian are fascinated by information retrieval and search engines. Motivated by this interest, they applied their expertise to the retrieval and leverage of translation memory, and are working on mining more valuable information from TM data and promoting data sharing and trading.
Jing Zhang (left) and Jian Chen (right)
The Status
Tmxmall pays close attention to data and works hard to capture value from the language data we hold. At present, we have nearly 7 billion thousand sentence pairs, classified into 34 language pairs covering languages such as English, Japanese, Russian, and German, and into more than 10 domains such as the economy, bioscience, law, and medicine. The data is mainly Chinese or English paired with other languages and comes from offline exchange, purchases from LSPs and freelancers, web crawling, and bilingual document alignment. Among these, human-translated zh-en/en-zh domain data and zh/en to Southeast Asian language data are the most popular in China. We have developed services and products that all revolve around the translation memory ecosystem to help users fully leverage our data and manage their own language assets. Recently, Tmxmall localized its official website into English so that users across the world can benefit from our services.
Our Products
Tmxmall Roadmap
Tmxmall Aligner
Tmxmall Aligner is an online tool used to create translation memories and align parallel texts. The Aligner supports 17 file formats and 19 languages, in modes including bilingual documents and a single document. It processes parallel texts with Tmxmall's self-developed algorithm, which works at the paragraph and sentence level and can automatically recognize three pairing situations: one source to multiple targets, multiple sources to one target, and multiple sources to multiple targets. It also supports de-duplication, filtering, and find and replace, making the process of creating translation memories more convenient and efficient.
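The post does not say how the Aligner's de-duplication and filtering work, so the following is only a minimal Python sketch of what such a clean-up pass over aligned sentence pairs might look like. The function names, the normalization rule, and the choice of filters (empty sides and untranslated copies) are assumptions made for illustration, and the sample pairs are made up.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_pairs(pairs):
    """Drop empty, untranslated, and duplicate sentence pairs, keeping the original order."""
    seen = set()
    kept = []
    for source, target in pairs:
        src, tgt = normalize(source), normalize(target)
        if not src or not tgt:
            continue  # filter: one side of the pair is empty
        if src == tgt:
            continue  # filter: the target is just a copy of the source
        if (src, tgt) in seen:
            continue  # de-duplication: this pair was already kept
        seen.add((src, tgt))
        kept.append((source.strip(), target.strip()))
    return kept


pairs = [
    ("Open the file.", "Öffnen Sie die Datei."),
    ("Open  the file. ", "Öffnen Sie die Datei."),  # duplicate apart from spacing
    ("Close the window.", ""),                       # missing target
    ("OK", "OK"),                                    # untranslated copy
    ("Save your changes.", "Speichern Sie Ihre Änderungen."),
]
cleaned = clean_pairs(pairs)
```

Comparing normalized text while storing the original strings keeps the translation memory readable and still catches copies that differ only in spacing or case.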
TM exchange platform
Tmxmall TM exchange platform is where users can retrieve translation units in return for the translation units they upload. Users can also upload their own translation memories for others to retrieve, download, and purchase.
TM SaaS management System
Tmxmall TM SaaS management System is designed for users to manage their TMs. It allows users to upload, share, retrieve, and delete TMs, and to translate collaboratively by referring to or updating TMs in real time. Users, including freelance translators and LSPs, can rent the system according to the capacity they need.
TM Marketplace
TM marketplace is the place for TM sharing and trading, and it supports TM files in 19 languages. Users can upload their own TMs, search for matches, and sell or purchase segments matched against the data stored on the platform. Because the marketplace is connected to the TM SaaS management System, every TM that users buy there can be managed in the TM SaaS management System. For now, data sold on TM marketplace costs $1.50 per 1,000 words with a 100% match; $1.24 with a 95-99% match; $0.78 with an 85-94% match; and $0.45 with a 75-84% match. The money goes to the data owner when a TM transaction is finished.
TM ROBOT
TM ROBOT is client software for managing and sharing local TM data. It is built on TM Marketplace for users who are hesitant to upload their TMs online. It is also designed to promote a knowledge-sharing economy by connecting global TM data, helping users earn lasting returns from sharing TMs while respecting their translation work, and making language assets reusable to improve production efficiency in the translation industry. Once TM ROBOT is installed, users can manage and share TMs and search for TM matches on TM marketplace from their own computer.
TM ROBOT Working Module
Tmxmall API
Tmxmall API is a plug-in that integrates all the data stored on the Tmxmall platform (TM exchange platform, TM SaaS system, TM marketplace, and TM ROBOT) into desktop CAT tools, including SDL Trados and memoQ, and online CAT tools such as Tmxmall Online CAT. With the Tmxmall API, language data on the Tmxmall platform can be searched while translating in CAT tools.
Online CAT
The Online CAT is developed for translators or small teams handling small translation projects. It seamlessly connects all language data stored on the Tmxmall platform and supports Google Translate and pre-translation. We are now working on a new version of the Online CAT, which will be released in the coming year. It will support large translation projects and enable translation workflows for freelance translators and LSPs. A variety of input formats, real-time supervision, machine translation, QA checks, and simultaneous translation and reviewing will be supported by then. In particular, the Online CAT will be integrated with the large TM data on Tmxmall's TM exchange platform, TM SaaS management System, TM marketplace, and local TM ROBOT.
Tmxmall Online CAT Features
Worried about data quality? So are we.
Our primary users are freelance translators, LSPs, teachers in universities' translation and interpretation programs, and MT providers. As a responsible enterprise, we put our users first, so we care deeply about data quality. Every TM uploaded to our platform is verified by staff at Tmxmall. Only human-translated TMs that are properly aligned are approved and published. Besides manual verification, we are developing a QA tool with built-in QA metrics to spot errors in punctuation, numbers, and omissions. Users who want to purchase data can view 30 random sentence pairs as a sample to get an overview of the quality.
Research on TMs
Since Tmxmall was established, we have never stopped our research on TMs. Over years of study, we have achieved several successes:
- Based on an automatic alignment algorithm from machine translation, bilingual documents can be aligned automatically with a 95% accuracy rate.
- With our natural language processing technology and thousands of high-quality TMs, lower-quality sentence pairs are spotted automatically by a TM assessment algorithm.
- By leveraging CNN classification technology, automatic classification of large-volume TMs with an accuracy of up to 97% is now possible.
- The response time for retrieval across a billion sentence pairs has been reduced to 200 ms after optimizing our distributed search engines.
- Tmxmall Machine Translation Plug-in is now available in SDL Trados. It supports machine translation tools including Google Translate, Baidu Translate, Sogou Translate, Youdao Translate, and Newtranx, allowing users to produce more translated material without increasing costs.
Our Ambition
Recently, we have seen increasing interest in MT data, especially from MT developers looking to train their MT engines, which suggests that the research and implementation of machine translation will boom in the near future. Since 2014, we have accumulated a large volume of language data, which allows us to dream big and step into the AI machine translation industry. With our self-developed algorithms, data mining technology, and language data, we are able to train our own domain MT engines on specific language data, so as to produce accurate, high-quality machine translation. As we move from language data research to building MT engines, Tmxmall always treats data as the guide for every aspect of our business, and we strongly believe this is the best long-term way for us to grow and thrive.
https://www.tmxmall.com/
https://www.tmxmall.com/home/about
Isn’t this “new” platform very similar to TM-Town?