This post highlights a Chinese MT vendor that I suspect is not currently well known in the US or Europe, but that I expect will become better known over the coming years. While the US giants (FAAMG) still dominate the MT landscape around the world today, I think it is increasingly possible that other players from around the world, especially from China, may become much more recognized in the future.
One indicator that has historically been reliable for forecasting emerging economic power is the volume of patent filings in a country. This has been true for Japan and Germany, where voluminous patent activity preceded the economic rise of these countries, and more recently this predictor has also aligned with the rise of S. Korea and China as economic powerhouses. However, the sheer volume of filings is not necessarily a leading indicator of true innovation, and some experts say that the volume of patents filed and granted abroad is a better indicator of innovation and patent quality. But today we see emerging giants from Asia in consumer electronics, automobiles, eCommerce, and internet services, and nobody questions the building innovation momentum happening in Asia today.
Artificial Intelligence (AI) is heralded by many as a key driver of wealth creation for the next 50 years. Building momentum with AI requires a combination of access to large volumes of "good" data, computing resources, and deep expertise in machine learning, NLP, and other closely related technologies. Today, the US and China look poised to be the dominant players in the wider application of AI and machine learning-based technologies, with a few others close behind. Here too, deep knowledge and clout are indicated by the volume of influential papers published and referenced by the global community. A recent analysis by the Allen Institute for Artificial Intelligence in Seattle, Washington, found that China has steadily increased its share of authorship of the top 10% most-cited papers. The researchers found that America’s share of the most-cited 10 percent of papers declined from a high of 47 percent in 1982 to a low of 29 percent in 2018. China’s share, meanwhile, has been “rising steeply,” reaching a high of 26.5 percent last year. Though the US still has significant advantages in the relative supply of expert manpower and dominance in the manufacture of AI semiconductor chip technology, this too is slowly changing, even though most experts expect the US to maintain leadership for other reasons.
Credit: Allen Institute for Artificial Intelligence
These trends also impact the translation industry, and they change the relative benefit and economic value of different languages. The global market is slowly changing from a FIGS-centric view of the world to one where both the most important source languages (ZH, KO, HI) and target languages are changing. The fastest-growing economies today are in Africa and Asia and are not likely to be well served by a FIGS-centric view, though it appears that English will remain a critical world language for knowledge sharing for at least another 25 years. These changes create an opportunity for agile and skillful Asian technology entrepreneurs like NiuTrans who are much more tuned in to this rapidly evolving world. I have noted that some of the most capable new MT initiatives I have seen in the last few years were based in China. India has lagged far behind with MT, even though the need there is much greater, because of the myth that English matters more, and possibly because of the lack of governmental support and sponsorship of NLP research.
The Chinese MT Market: A Quick Overview
I recently sat down with Chungliang Zhang from NiuTrans, an emerging enterprise MT vendor in China, to discuss the Chinese MT market and his company’s own MT offerings. He pointed out that China is the second-largest global economy today, and it is now increasingly commonplace for both Chinese individuals and enterprises to have active global interactions. The economic momentum naturally drives the demand for automated translation services.
Some examples, he pointed out:
In 2019, China’s outbound tourist traffic totaled 155M people, up 3.3% from the previous year. This massive volume of traveler traffic results in a concomitant demand for language translation. Chungliang pointed out that this travel momentum significantly drives the need for voice translation devices in the consumer market, like those produced by Sogou, iFlyTek, and others, which have been very much in demand in the last few years.
There is also a growing interest among Chinese enterprises, both state-owned and
privately owned, in building and expanding their business presence in global
markets. For example, Alibaba, China’s largest eCommerce company, is
listed on the NYSE and has established an international B2B portal
(Alibaba.com) where 20 million enterprises gather and work to “Buy
Global, Sell Global.” Currently, the Alibaba MT team builds the largest eCommerce MT systems
globally, often reaching volumes of 1.79 billion translation calls per
day, which is a larger transaction volume than either Google or Amazon.
“All in all, as we can see it, there is a clear trend that MT is
increasingly being used in more and more industries, such as language
service industries, intellectual property services, pharmaceutical
industries, and information analysis services.”
While it is clear that consumers and individuals worldwide are
regularly using MT, the primary enterprise users of MT in China are
government agencies and internet-based businesses like eCommerce. This
need for translation is now expanding to more enterprises who seek to
increase their international business presence and realize that MT can
enable and accelerate these initiatives.
The Chinese MT technology
leaders in terms of volume and regular user base are the internet
services giants (such as Baidu, Tencent, Alibaba, Sogou, Netease) or the
AI tech giants (such as iFlyTek). Google Translate and Microsoft Bing
Translator are also popular in China since they are free, but
they don’t have a large share of the total use if the focus is strictly
on MT technology.
When asked to comment on the characteristics and changes in the Chinese MT market, Chungliang said:
“In our understanding, Sogou and iFlytek's primary business focus is the B2C market, and thus both of them develop consumer hardware like personal voice translators. Sogou was recently (July 29, 2020) purchased by Tencent (a major social media player), so we don’t know what will happen next. iFlytek is famous for its Speech-To-Speech technology capabilities. Thus it is natural for them to develop MT, to get the two technologies integrated and grab a larger share of the market.
As for the other important MT players in China, Alibaba MT mainly serves its own globally focused eCommerce business, and Tencent Translate focuses on meeting the translation needs of its users in social networking use scenarios. Like Google Translate, Baidu Translate is a portal to attract individual users who might need translation during a search. It also serves to expand Baidu’s influence as a whole. Netease Youdao, meanwhile, focuses on the education industry, and the Youdao team integrates the Youdao online dictionary, direct MT, and human translation.
What are the main languages that people/customers translate? As far as we know, the most translated language is English, Japanese is second, followed by Arabic, Korean, Thai, Russian, German, and Spanish. Of course, this is all direct to and from Chinese.”
NiuTrans Focus: The Enterprise
The NiuTrans
team learned very early in their operational history and during their
startup phase that their business survival was linked to providing MT
services for the enterprise rather than for individual users and
consumers. The market for individuals is dominated by offerings like
Google Translate and Baidu Translate that offer virtually-free services.
In contrast, NiuTrans is focused on meeting the enterprise demands for
MT, which often means deploying on-premise MT engines and the
development of custom engines. These enterprises tend to be concentrated
around Intellectual Property and Patent services, Pharmaceuticals,
Vehicle Manufacturing, IT, Education, and AI companies. For example,
NiuTrans builds customized patent-domain MT engines for the China Patent
Information Center (CNPAT), a large-scale patent information service based in
Beijing and a branch of the China National Intellectual Property Administration.
CNPAT has the largest collections of multilingual parallel data for patents and serves ongoing and substantial demand for patent-related MT in various use scenarios, such as patent application filing and examination, patent-related transactions, and patent-based lawsuits.
Given the scale of the client’s needs, NiuTrans sends an R&D team on-site to work with CNPAT’s technical team for data processing and data cleaning. This data is then used in the NiuTrans.NMT
training module to develop patent-domain NMT engines on CNPAT’s on-premise servers. The on-site team also develops custom MT APIs on-demand to fit into CNPAT’s current workflow and customer servicing needs.
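The data processing and cleaning step mentioned above can be illustrated with a minimal sketch. This is not NiuTrans or CNPAT code: the function name `clean_parallel_pairs`, the thresholds, and the choice of filters are assumptions made for illustration, and the length-ratio limit in particular would need tuning per language pair (character counts for Chinese and English differ considerably).

```python
from typing import Iterable, Iterator, Tuple


def clean_parallel_pairs(
    pairs: Iterable[Tuple[str, str]],
    max_len_ratio: float = 6.0,   # hypothetical threshold, tune per language pair
    max_chars: int = 1000,        # hypothetical upper bound on segment length
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass a few simple sanity filters."""
    seen = set()
    for src, tgt in pairs:
        # Collapse runs of whitespace and trim both sides.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Drop overly long segments.
        if len(src) > max_chars or len(tgt) > max_chars:
            continue
        # Drop pairs whose character lengths are implausibly different.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_len_ratio:
            continue
        # Drop untranslated copies (source identical to target).
        if src == tgt:
            continue
        # Drop exact duplicates after normalization.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

In practice a production pipeline would likely add language identification, alignment scoring, and domain-specific rules on top of filters like these.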
Besides powering and enabling the specialized translation needs of
services like CNPAT, NiuTrans also provides back-end MT services for
industrial leaders, including iFlyTek (also an early investor in NiuTrans), JD.com (the No. 2 eCommerce business in China), Tencent
(the largest social networking company in China), Xiaomi (a leading
smart-device OEM in China), and Kingsoft (a leading office software
provider in China).
NiuTrans has an online cloud API that also attracts
100,000+ small and medium enterprises interested in expanding their
international operations and business presence. The pricing for these
smaller users is based on the volume of characters they translate
and is much lower than Google Translate and Baidu Translate prices.
NiuTrans’ Online Cloud User Locations
You can visit the NiuTrans Translate portal at https://niutrans.com
NiuTrans writes and maintains its own NMT code base for NiuTrans.NMT rather than using open-source options, and claims that it achieves quality comparable to, if not better than, that of its competitors. Its comparative performance at the WMT19 evaluations suggests that it actually does better than most of them. The company is not
dependent on TensorFlow, PyTorch, or OpenNMT to build its systems.
Today, NiuTrans is a key MT technology provider, especially for
enterprises in China.
NiuTrans.NMT is a lightweight and efficient Transformer-based neural machine translation system. Its main features are:
- Few dependencies. It is implemented with pure C++, and all dependencies are optional.
- Fast decoding. It supports various decoding acceleration strategies, such as batch pruning and dynamic batch size.
- Advanced NMT models, such as Deep Transformer.
- Flexible running modes. The system can be run on various systems and devices (Linux vs. Windows, CPUs vs. GPUs, FP32 vs. FP16, etc.).
- Framework agnostic. It supports various models trained with other tools, e.g., Fairseq models.
- The code is simple and friendly to beginners.
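To give a feel for what "dynamic batch size" can mean in decoding, here is a small sketch of one common approach: sort sentences by length and fill each batch up to a token budget, so that short sentences are decoded in large batches and long ones in small batches. This is an illustration of the general idea only, not the NiuTrans.NMT implementation (which is written in C++); the function name `dynamic_batches` and the `max_tokens` budget are assumptions.

```python
from typing import List


def dynamic_batches(sentences: List[List[str]], max_tokens: int = 64) -> List[List[int]]:
    """Group sentence indices into batches whose padded size fits a token budget.

    The padded size of a batch is (number of sentences) x (longest sentence),
    so sorting by length keeps padding waste low.
    """
    # Process sentences from shortest to longest (stable for equal lengths).
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        length = max(len(sentences[idx]), 1)
        # Because of the sort, the incoming sentence is the longest in the batch,
        # so the padded size after adding it is (len(current) + 1) * length.
        if current and (len(current) + 1) * length > max_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches
```

A sentence longer than the budget still gets a batch of its own, so nothing is dropped. The returned indices let the caller restore the original sentence order after decoding.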
When I probed into why NiuTrans had chosen to develop their own NMT
technology rather than use the widely accepted open-source solutions, I
was provided with a history of the company and its evolution through
various approaches to developing MT technology.
The NiuTrans team
originated in the NLP Lab at Northeastern University, China (NEUNLP
Lab), a machine translation research leader in the Chinese academic
world going as far back as 1980. Like many elsewhere in the world, the
team initially studied rule-based MT from 1980 to 2005. In 2006
Professor Jingbo Zhu (the current Chairman of NiuTrans) returned from a
year-long visit to ISI-USC and decided to switch to statistical MT
research working together with Tong Xiao, who was a fresh graduate
student at the time and is now the CEO of NiuTrans. They made rapid
strides in SMT research, releasing the first version of NiuTrans SMT
open source in 2011. At that time, Chinese academia primarily used Moses
to conduct MT-related research and develop MT engines. The development of the open-source NiuTrans.SMT proved that Chinese engineers could do the same as, or even better than, Moses, and also helped to showcase the strength and competence of the NiuTrans team. Thus, in 2012, confident in their MT technology and armed with a dream to expand the potential of this technology to connect the world with MT, the NiuTrans team decided to form an MT company, converting 30+ years of MT research work into MT software for industrial use.
Given their origins in
academia, they kept a close watch on MT research and breakthroughs
worldwide and noticed in 2014 that there was a growing base of research
being done with neural network-based deep learning models. Therefore, the NiuTrans team started studying deep learning technologies in 2015 and
released its first version of NiuTrans.NMT in December 2016, just three
months after Google announced the release of its first NMT engines.
NiuTrans prefers to avoid using open-source frameworks and toolkits like TensorFlow, PyTorch, or OpenNMT, as they have developed deep competence in MT technology gathered over 40 years of engagement. The leadership believes there are specific advantages to building the whole technology stack for MT and intends to continue with this basic development strategy. As an example, Chunliang pointed me to the release of NiuTensor, their own deep learning tool (https://github.com/NiuTrans/NiuTensor), and NiuTrans.NMT Open Source (https://github.com/NiuTrans/NiuTrans.NMT).
They are confident that they can keep pace with continuous improvements in open source with support from the NEUNLP Lab, which has eight permanent staff and 40+ Ph.D./MS students focusing on MT issues of relevance and interest for their overall mission. This group also allows
NiuTrans to stay abreast of the worldwide research being done elsewhere.
NiuTrans understands that a critical requirement for an enterprise user is to adapt and customize the MT system to enterprise-specific terminology or use. Thus, it provides both a user terminology module to introduce user terminology into the MT system and a
user translation memory module to introduce the users’ sentence pairs to tune the MT system. Another more sophisticated solution is incremental training. They incorporate user data to modify the NiuTrans model parameters to get the MT model better adjusted to user data features.
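The first two customization options can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not how NiuTrans implements its modules: the function `translate_with_user_resources`, the placeholder scheme, and the `mt_translate` callable are all hypothetical, and the sketch assumes the MT engine passes placeholder tokens through unchanged, which real NMT systems do not always guarantee. Incremental training is not shown.

```python
from typing import Callable, Dict


def translate_with_user_resources(
    source: str,
    translation_memory: Dict[str, str],
    terminology: Dict[str, str],
    mt_translate: Callable[[str], str],  # hypothetical stand-in for the MT engine
) -> str:
    """Apply a user TM and user terminology around a generic MT call."""
    # 1. An exact translation-memory hit overrides MT entirely.
    if source in translation_memory:
        return translation_memory[source]

    # 2. Replace known source terms with placeholders (longest terms first,
    #    so that a short term does not break up a longer one).
    protected = source
    slots: Dict[str, str] = {}
    for i, term in enumerate(sorted(terminology, key=len, reverse=True)):
        if term in protected:
            placeholder = f"__TERM{i}__"
            protected = protected.replace(term, placeholder)
            slots[placeholder] = terminology[term]

    # 3. Translate the protected sentence.
    output = mt_translate(protected)

    # 4. Put the user's preferred target terms back in.
    for placeholder, target_term in slots.items():
        output = output.replace(placeholder, target_term)
    return output
```

The design point is the order of precedence: user sentence pairs win first, user terminology is enforced next, and the general engine handles everything else.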
NiuTrans also gathers post-editing feedback on critical language pairs like ZH <> EN and ZH <> JP on an ongoing basis, and then analyzes error patterns to develop continuing engine performance improvements.
Quality Improvement, Data Security, and Deployment
NiuTrans evaluates MT system performance using BLEU and a human evaluation technique that ranks systems relative to one another. They prefer not to use the widely used 5-point scale to assign an absolute value to a translation. Thus, if they were comparing NiuTrans, Google, and DeepL, they would compute BLEU scores and have humans rank the three systems' outputs on the same blind test set.
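As a rough illustration of how relative rankings can be turned into a system-level comparison, the sketch below averages each system's rank position over a blind test set. The aggregation method (mean rank) and the function name `mean_ranks` are assumptions for illustration; the source does not say how NiuTrans aggregates its human rankings, and the system labels in the usage note are placeholders, not real results.

```python
from collections import defaultdict
from typing import Dict, List


def mean_ranks(rankings: List[List[str]]) -> Dict[str, float]:
    """Average rank per system; lower is better.

    Each inner list is one evaluator judgment for one test sentence,
    listing system labels from best (position 1) to worst.
    """
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        for position, system in enumerate(ranking, start=1):
            totals[system] += position
            counts[system] += 1
    return {system: totals[system] / counts[system] for system in totals}
```

For example, with three judgments `[["sys_a", "sys_b", "sys_c"], ["sys_b", "sys_a", "sys_c"], ["sys_a", "sys_c", "sys_b"]]`, `sys_b` averages rank 2.0 and `sys_a` comes out ahead of it. Because evaluators only compare outputs with each other, no absolute quality score is needed.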
NiuTrans also has an ongoing program to improve its MT engines continually. They do this in three different ways:
- Firstly, as the company has a strong research team that is continually experimenting and evaluating new research, the impact of this research is continuously tested to determine if it can be incorporated into the existing model framework. This kind of significant technical innovation is added into the model two or three times a year.
- Secondly, customer feedback, ongoing error analysis, or specialized human evaluation feedback also trigger regular updates to the most important MT systems (e.g. ZH<>EN) at least once a month.
- Thirdly, engines will be updated as new data is discovered, gathered, or provided by new clients. High-quality training data is always sought after and considered valuable to drive ongoing MT system improvements.
NiuTrans has performed well in comparative evaluations of their MT systems against other academic and large online MT solutions. Here is a summary of the results from WMT19. They report that their performance in WMT20 is also excellent,
but final results have not yet been published.
NiuTrans' training data comes mainly from two sources: data crawling and data purchased from reliable vendors.
NiuTrans uses crawlers to collect parallel texts from websites that do
not prohibit or prevent this, e.g., some Chinese government agencies’
websites that often provide data in several languages. They also buy
parallel sentences (TM) and dictionaries from specific data provider
companies, who might require signing an agreement, specifying that
the data provider retains the intellectual property rights of the data.
NiuTrans gets the bulk of its revenue from customers concerned about data security who deploy their MT systems on-premise. However, NiuTrans is also working on an Open Cloud offering (https://niutrans.com) that allows customers to access an online API and avoid installing the infrastructure needed to set up on-premise systems. The Open Cloud is a more cost-effective option for SMEs, and NiuTrans has seen rapid adoption of this new deployment option in specific market segments.
International customers, especially the larger ones, much prefer to deploy their NiuTrans MT systems on-premise. For
those international customers who cannot afford on-premise systems, the
NiuTrans Open Cloud solution is an option. This system is deployed on Alibaba Cloud, which is governed by Chinese internet security laws that require that user data be kept for six months before deletion. The company plans to build another cloud service on the Amazon Cloud for international customers who have data security concerns. This new capability will allow users to encrypt their data locally and transfer it securely to the Amazon Cloud. NiuTrans will then decrypt the source data on their servers, translate it, and finally delete all the user data and the corresponding translation results once the source data has been translated.
NiuTrans currently has 100+ employees, directed by Dr. Jingbo Zhu and Dr. Tong Xiao, two leading MT scientists in China. The company’s headquarters and R&D team are both based in Shenyang. Technical support and services are currently available in Beijing, Shanghai, Hangzhou, Chengdu, and Shenzhen, and the company is now exploring entering the Japanese market with the assistance of partners in Tokyo and Osaka. While NiuTrans is not a well-known name in the US/EU translation industry today, I suspect that they will become an increasingly better-known provider of enterprise MT technology in the future.