Monday, October 24, 2016

10 Myths About Computer-Assisted Translation

This is a guest post by "Vova" from SmartCAT. I connected with him on Twitter and learned about the SmartCAT product, which I would describe as a TMS + CAT tool with a much more substantial collaboration framework than most translation tools I know of. It is a next-generation translation management tool that enables multi-user collaboration at a large project level, but also allows individual freelancers to use it as a simple CAT tool. It has a new and non-standard approach to pricing which I am still trying to unravel. I have talked to one LSP customer, who was very enthusiastic about his user experience and stressed the QA and productivity benefits, especially for large projects. I am still researching the product and company (which has several ex-ABBYY people, but they seem eager to develop a separate identity) and will share more as I learn more. At first glance, this product looks very interesting and even compelling, even though they, like Lilt, have a hard time describing what they do quickly and clearly. Surely they should both be looking to hire a marketing consultant - I know one who comes to mind ;-). The most complete independent review of the SmartCAT product (requires a subscription) is from Jost Zetzsche, who likes it, which to my mind is a meaningful commendation for them, even though he likes other products too.


Many translators are wary of CAT tools. They feel that computer-aided translation commoditizes and takes the creativity out of the profession. In this article, we will try to clear up some common misconceptions that lead to these fears.


1 — Computer-aided translation is the same as machine translation

The naming of the term “computer-aided translation” often leads to its being confused with “machine translation.” The first thing some of our new users write to us is, “Why don’t I see the automatic translation in the right column?” For some reason, they expect it to be there (and perhaps to replace the translation effort altogether?).

In reality, machine translation is just a part — in most cases a small part — of what computer-aided translation is about. This part is usually called “PEMT” (post-editing of machine translation) and consists in correcting a translation done by one or another MT engine. We’re nowhere near replacing a human translator with a machine.

PEMT itself is like a red flag to a bull for many translators and deserves a separate article. Here we will just reiterate that equating CAT with machine translation is like equating aviating skills with using the autopilot.

   Here’s where you find it, just in case — but use cautiously.


2 — Computer-aided translation is all about handling repetitions

Another widespread misconception is that CAT tools are only used to handle repetitive translations. What does that mean? Say you have the same disclaimer printed at the beginning of each book from a given publisher. Someone would then need to translate it only once, and a CAT tool would automatically insert this translation into each new translated book from that publisher.

Here’s what a TM match looks like. (That’s an easy one.)

This “repetition handling” feature is commonly called Translation Memory (TM). Now, TM is a large part of what CAT tools do (and why they were created in the first place). But today it is just a feature, with many others supplementing it.
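To make the idea concrete, here is a minimal sketch of a translation memory lookup. It is not how SmartCAT or any particular CAT tool implements TM: the sample segments, the function name best_tm_match, and the 0.75 threshold are all assumptions for illustration, and similarity is approximated with Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored translation.
translation_memory = {
    "All rights reserved.": "Alle Rechte vorbehalten.",
    "No part of this book may be reproduced.":
        "Kein Teil dieses Buches darf vervielfältigt werden.",
}


def best_tm_match(segment, memory, threshold=0.75):
    """Return (stored_source, translation, score) for the closest match, or None.

    The threshold is an assumed cut-off below which a match is not offered.
    """
    best = None
    for stored_source, translation in memory.items():
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, segment.lower(), stored_source.lower()).ratio()
        if best is None or score > best[2]:
            best = (stored_source, translation, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


exact = best_tm_match("All rights reserved.", translation_memory)
fuzzy = best_tm_match("No part of this book can be reproduced.", translation_memory)
miss = best_tm_match("Chapter one", translation_memory)
```

An exact repetition comes back with a score of 1.0 and can be inserted as is. A near match comes back with a lower score, so the translator knows the stored translation needs checking. Unrelated text returns nothing.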

In SmartCAT, for instance, we have project management, terminology, quality assurance, collaboration, marketplace, and many other features. All these features are carefully integrated with each other to form a single whole that is SmartCAT and that distinguishes us from the competition. If it were all about translation memory, there would be nothing to compete about.

3 — Computer-aided translation doesn’t work for “serious” translations

Some believe that “serious” translators (whatever that means) do not use CAT tools. The truth is that “purists” do exist, just as they do in any other field, from religion to heavy metal. But not using a CAT tool as a translator is close to not using a cellphone as a CEO. According to a 2013 study by Proz, 88% of translators were using CAT tools, and we can only expect the numbers to have gone up since then.

So, why would you use a CAT tool in a “serious” translation? Here’s a very “serious” book on numismatics I translated some time ago. I did it all in SmartCAT. Why? Because I would never have managed to keep this amount of terminology in my head. Even if I had used Excel sheets to keep track of all the terms (ancient kings, regions, coin names, weight systems), it would’ve taken me dozens of hours of additional work. In SmartCAT, I had everything within arm’s reach in a glossary that was readily accessible and updateable using simple key combinations.

Inserting a glossary term in SmartCAT

Another reason CAT tools can be useful for “serious” translations is quality assurance. Okay, MS Word has a spellchecker. There is also third-party software that provides more sophisticated QA capabilities. And still, having it right at hand, with downloadable reports and translation-specific QA rules is something only a good CAT tool can boast (more on this later).


4 — CAT tools are for agencies only

Many translators receive orders from agencies dictating the use of one CAT tool or another. So they start thinking that “all these CATs” are an “agency thing” and are meant to take advantage of translators. We’ll leave that latter argument aside for a while and come back to it later.

For now, we’ll just say that there is no reason why a translator should not use a CAT tool for their own projects. If anything, it provides a distraction-free interface where one can concentrate on the work in question and not think about secondary things such as formatting, file handling, word counting, and so on.

Note the tags (orange pentagons): You don’t need to care what formatting there was in the original.


5 — CAT tools are hard to learn

Well, that’s not exactly a myth. I remember my first experience with a prominent CAT tool (it was some ten years ago). I cried for three days, considering myself a worthless piece of junk for not being able to learn something everyone around seemed to be using. When the tears dried out, I went for some googling and realized that I wasn’t the only one to struggle with the mind-boggling interfaces of the software that was en vogue back then.

Luckily, today users have plenty of options to choose from. And although the de rigueur names remain the same (so far), many modern CATs are as easy to learn as a text processor or a juicer (though some of those can be tricky, too). Here’s a video of going all the way from signing up to downloading the final document in SmartCAT in less than one minute. It’s silent and not subtitled, but sometimes looks are more telling than words:


6 — CAT tools are ridiculously expensive

Another myth that is partly true is that CAT tools cost a freaking lot. Some do. The cheapest version of the most popular desktop computer-aided translation software costs around $500. One of the most popular subscription-based solutions costs nearly $30 a month. It’s probably okay if you have a constant inflow of orders and some savings to afford the purchase (and perhaps a personal accountant). But what if you are just starting out? Or if you are an occasional semi-pro translator? Not that okay, then.

In any case, there are still options for you to go (and grow) even if you don’t want to spend on unpredictably profitable assets. SmartCAT is free for both freelancers and companies. The only thing you might opt to pay for is machine translation and image recognition. And, if you decide to market your services via the SmartCAT marketplace, a 10% commission (payable by the customer) will be added on top of your own rate. That’s it — no hidden fees involved.

7 — Computer-aided translation works for large projects only

If you think that CAT tools work best for huge projects, you might be right. If you think they don’t work for small projects at all, you are wrong.

Here’s an example. The last project I did in SmartCAT was a one-page financial document in Excel format. To translate it, I uploaded the file to SmartCAT and immediately had all the translation memories, terminology, word count, etc. ready. So I just did the translation, downloaded the result and sent it back with an invoice.

If I went the “simple” way, I would have spent some valuable minutes — which are the more valuable the smaller a project is — on organizational “overheads.” Putting the files in the right place in the file system. Looking for previous translations to align the terminology. Finally, doing the translation in Excel, which is a torture in itself.

In CAT tools, whether it is an Excel file, a PowerPoint presentation, or a scanned PDF (for CAT tools supporting OCR, e.g. SmartCAT), you get the same familiar two-column view. As already said, you concentrate on words, not formats.




8 — Computer-aided translation slows you down

Despite evidence to the contrary, some translators believe that using CAT tools will actually reduce their translation speed. The logic is that in a CAT tool you have to start a project, configure all its settings, find the TMs and terminology you need to reuse, and so on. In the end, they say, you spend more time doing this than you save as a result.

The reality is quite different. In SmartCAT, for instance, the configuration needed to start a project includes a minimum number of choices. Moreover, all the resources you need are added automatically according to the customer’s name. That saves time in addition to the streamlining of the very translation process.

 8 seconds to create a project with a translation memory and terminology glossary in place

9 — CAT tools worsen the quality of translation

Some believe that by not seeing the whole text, you lose its “flow.” This, they argue, leads to errors in the style and narrative of translation. While this is true in some cases (e.g. for literary translation), the fact is that the “flow” is anyway disturbed by your seeing the original text. It always makes sense to have at least one purely proofreading stage in the end, when you don’t see the original. Then you can judge the text solely on the basis of how good or bad it sounds in the target language.

That’s what I did for a children’s book I translated recently. I did the first several “runs” in SmartCAT. Then I downloaded the result and had it reviewed several more times (and once by a native speaker). When everything was ready, I put the whole thing back into SmartCAT. Why? Because I want to translate the next part of the book. I know I will have forgotten a lot by the time it comes, so having all the previous resources at hand will be very helpful for the quality.

Speaking of quality, modern CAT tools also allow a great degree of quality assurance, with some checking rules fine-tuned for translation tasks. Using those is much more convenient and practical than resorting to spellcheckers available in office software or externally.

 QA rules available in SmartCAT. Some are more paranoid than the others.
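As an illustration of what translation-specific QA rules can look like, here is a small sketch. These are not SmartCAT's actual rules: the function name qa_issues, the three checks, and the regular expressions are assumptions chosen for the example.

```python
import re
from collections import Counter

# Assumed patterns: digit groups with optional separators, and simple inline tags.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
TAG_RE = re.compile(r"</?\w+[^>]*>")


def qa_issues(source, target):
    """Return a list of issue labels found when comparing a source/target pair."""
    if not target.strip():
        return ["empty target"]
    issues = []
    # Every number in the source should appear in the target, and vice versa.
    if Counter(NUMBER_RE.findall(source)) != Counter(NUMBER_RE.findall(target)):
        issues.append("number mismatch")
    # Formatting tags must be carried over unchanged.
    if Counter(TAG_RE.findall(source)) != Counter(TAG_RE.findall(target)):
        issues.append("tag mismatch")
    if "  " in target:
        issues.append("double space")
    return issues


clean = qa_issues("Pay 250 USD by <b>May 5</b>.",
                  "Zahlen Sie 250 USD bis <b>5. Mai</b>.")
flagged = qa_issues("Pay 250 USD.", "Zahlen Sie 520  USD.")
```

A check this literal will raise false alarms, for example when the target locale formats numbers differently (1,000.5 versus 1.000,5). That is one reason real tools let you switch individual rules on and off.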

10 — Computer-aided translation is bad for translators

That’s the underlying cause for many of the above misconceptions. Some translators fear that computer-aided translation is bad for the profession as a whole. Here’s a very illustrative post by Steve Vitek, a long-time opponent of translation technology. (Interestingly, the post includes many of the views countered above. I’d love to see Steve’s comment on this article of mine. Can my arguments make him change his mind, I wonder?)

The argument is that translation technology deprives translators of their bread. And instead of being there for translators’ growth and profit, it grows and profits at their expense. Customers get pickier, rates get lower, translation becomes a commodity.

In my humble opinion, CAT tools are as bad for translators as hair-cutting shears are for hairdressers. Perhaps, doing a haircut with a butcher’s knife could be more fun. You could even charge more for providing such an exclusive service. But it has little to do with the profession of cutting hair (or translating). A professional strives to increase the efficiency of their work. Using cutting-edge tools is one way to do this. A very important one, that.

Yes, it can be argued that CAT tools bring down your average per-word rate. But as Gert van Assche aptly puts it, the time you spend on a job is the only thing you need to measure. I can’t speak for everyone, but my own hourly rates soar with the use of CAT tools. I know that I can provide the best quality in the shortest time possible. I also know that I don’t charge unnecessarily high rates to my long-time customers, whose attitude I care about a lot.

That’s it — I hope I did manage to clear away your fears about computer-aided translation.
Remember, if you’re not using CAT tools, you are falling behind your colleagues, who might be equally talented but just a bit tech-savvier.

A good CAT tool will aid your growth as a professional and a freelancer. After all, aiding translators is what the whole thing is about.

P.S. If you never tried CAT tools at all, or did but didn’t enjoy the experience, I suggest that you check out SmartCAT now — it’s simple, powerful, and free to use.


About the author


Vladimir “Vova” Zakharov is the Head of Community at SmartCAT.

"Translation is my profession and my passion, and I’m excited to be able to share it with the amazing SmartCAT community!"

Wednesday, October 19, 2016

SYSTRAN Releases Their Pure Neural MT Technology

SYSTRAN announced earlier this week that they are doing a “first release” of their Pure Neural™ MT technology for 30 language pairs. Given how good the Korean samples that I saw were, I am curious why Korean is not one of the languages that they chose to release.

"Let’s be clear, this innovative technology will not replace human translators. Nor does it produce translation which is almost indistinguishable from human translation"  ...  SYSTRAN BLOG

The language pairs being initially released are 18 in and out of English, specifically EN<>AR, PT-BR, NL, DE, FR, IT, RU, ZH, ES, and 12 in and out of French, FR<>AR, PT-BR, DE, IT, ES, NL. They claim these systems are the culmination of over 50,000 hours of GPU training but are very careful to say that they are still experimenting and tuning these systems and that they will adjust them as they find ways to make them better.

They have also enrolled ten major customers in a beta program to validate the technology at the customer level, and I think this is where the rubber will meet the road and we will find how it really works in practice.

The boys at Google (who should still be repeatedly watching that Pulp Fiction clip) should take note of SYSTRAN’s very pointed statement about this advance in the technology:

Let’s be clear, this innovative technology will not replace human translators. Nor does it produce translation which is almost indistinguishable from human translation – but we are convinced that the results we have seen so far mark the start of a new era in translation technologies, and that it will definitely contribute to facilitating communication between people.
Seriously, Mike (Schuster), that’s all that people expect: a statement that is somewhat close to the reality of what is actually true.

They have made a good effort at explaining how NMT works, and why they are excited, which they say repeatedly through their marketing materials. (I have noticed that many who work with Neural net based algorithms are still somewhat mystified by how it works.) They plan to try and explain NMT concepts in a series of forthcoming articles which some of us will find quite useful, and they also provide some output examples which are interesting to understand how the different MT methodologies approach language translation.

 CSA Briefing Overview

In a recent briefing with Common Sense Advisory they shared some interesting information about the company in general:
  • The Korean CSLi Co. acquisition has invigorated the technology development initiatives.
  • They have several large account wins including Continental, HP Europe, PwC and Xerox Litigation Services. These kinds of accounts are quite capable of translating millions of words a day as a normal part of their international operational needs.
  • Revenues are up more than 20% over 2015, and they have established a significant presence in the eDiscovery area, which now accounts for 25% of overall revenue.
  • NMT technology improvements will be assessed by an independent third party (CrossLang) with long term experience in MT evaluation, and who are not likely to say misleading things like "55% to 85% improvements in quality" like the boys at Google.
  • SYSTRAN is contributing to an open-source project on NMT with Harvard University and will share detailed information about their technology there. 

Detailed Technical Overview

They have also supplied a more detailed technical paper which I have yet to review carefully, but what struck me immediately on initial perusal was that the data volumes they are building their systems with are minuscule compared to what Google and Microsoft have available. However, the ZH > EN results did not seem substantially different from the amazing-NOT GNMT system. Some initially interesting observations are highlighted below, but you should go to the paper to see the details:

Domain adaptation is a key feature for our customers — it generally encompasses terminology, domain and style adaptation, but can also be seen as an extension of translation memory for human post-editing workflows. SYSTRAN engines integrate multiple techniques for domain adaptation, training full new in-domain engines, automatically post-editing an existing translation model using translation memories, extracting and re-using terminology. With Neural Machine Translation, a new notion of “specialization” comes close to the concept of incremental translation as developed for statistical machine translation like (Ortiz-Martínez et al., 2010)

What is encouraging is that adaptation or “specialization” is possible with very small volumes of data, and this can be run in a few seconds which suggests this has possibilities to be an Adaptive MT model equivalent.

Our preliminary results show that incremental adaptation is effective for even limited amounts of in-domain data (nearly 50k additional words). Constrained to use the original “generic” vocabulary, adaptation of the models can be run in a few seconds, showing clear quality improvements on in-domain test sets.

Of course, the huge processing requirements of NMT remain a significant challenge, and perhaps they are going to have to follow Google and Microsoft, who both have new hardware approaches to this new class of AI-based machine learning applications: Google with the TPU (Tensor Processing Unit) and Microsoft with the programmable FPGAs it recently announced.

For those who are interested,  I ran a paragraph from my favorite Chinese News site and compared the Google “nearly indistinguishable from human translation”  GNMT output with the SYSTRAN PNMT output and I really see no big differences in quality from my rigorous test, and clearly we can safely conclude that humanity is quite far from human range MT quality at this point in time.

 The Google GNMT Sample 


The SYSTRAN Pure NMT Sample

Where do we go from here?

I think the actual customer experience is what will determine the rate of adoption and uptake. Microsoft and a few others are well along the way with NMT too. I think SYSTRAN will provide valuable insights in December from the first beta users who actually try to use it in a commercial application. There is enough evidence now to suggest that if you want to be a long-term player in MT you had better have actual real experience with NMT and not just post how cool NMT is and use SEO words like machine learning and AI on your website.

The competent third-party evaluation SYSTRAN has planned is a critical proof statement that hopefully provides valuable insight on what works and what needs to be improved at the MT output level. It will also give us more meaningful comparative data than the garbage that Google has been feeding us. We should note that while BLEU score jumps are not huge, the human evaluations show that NMT output is often preferred by many who look at the output.

The ability of serious users to adapt and specialize the NMT engines for their specific in-domain needs I think is really a big deal – if this works as well as I am being told, I think it will quickly push PBSMT-based Adaptive MT (my current favorite) to the sidelines, but it is still too early to really say this with anything but Google MT Boys certainty.

But after a five-year lull in the MT development world and seemingly little to no progress, we finally have some excitement in the world of machine translation and NMT is still quite nascent. It will only get better and smarter.

Tuesday, October 11, 2016

The Importance & Difficulty of Measuring Translation Quality

This is another timely post by Luigi Muzii, describing the challenges of human quality assessment. As we saw from the recent deceptive Google NMT announcements, while there is a lot of focus on new machine learning approaches, we are still using the same quality assessment approach of yesteryear: BLEU. Not much has changed. It is well understood that this metric is flawed, but there seems to be no useful replacement coming forward. This means that some kind of human assessment also has to be made, and invariably this human review is also problematic. The best practices for these human assessments that I have seen are at Microsoft and eBay. The worst are at many LSPs and Google. The key to effective procedures seems to be the presence of invested and objective linguists on the team, and a culture that has integrity and rigor without the cumbersome and excessively detailed criteria that the "Translation Industry" seems to create (DQF & MQM, for example). Luigi offers some insight on this issue that I feel is worth noting, as we need to make much more progress on the human assessment of MT output as well. Not only to restrain Google from saying stupid shite like “Nearly Indistinguishable From Human Translation” but also to really understand if we are making progress and understand better what needs to be improved. MT systems can only improve if competent linguistic feedback is provided, as the algorithms will always need a "human" reference. The emphasis below is all mine and was not present in the original submission.


Dimensionally speaking, quality is a measurement, i.e. a figure obtained by measuring something.

Because of the intricacies related to the intrinsic nature of languages, objective measurement of translation quality has always been a much researched and debated topic that has borne very little fruit. The notion of understood quality level remains unsolved, together with any kind of generally accepted and clearly understood quality assessment and measurement.

Then along came machine translation and, since its inception, we have been facing the central issue of estimating the reliability and quality of MT engines. Quite obviously, this was done by comparing the quality of machine translated outputs to that of human reference data using statistical methods and models, or by having bilingual humans, usually linguists, evaluate the quality of machine translated output.

Ad hoc algorithms based on specific metrics, like BLEU, were developed to perform automatic evaluation and produce an estimate of the efficiency of the engine for tuning and evaluation purposes. The bias implicit in the selection of the reference model remains a major issue, though, as there is not only one single correct translation. There can be many correct translations.
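For readers who have not seen it written out, the standard definition of BLEU (from Papineni et al., 2002, not from the original post) is:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

Here p_n is the modified n-gram precision of the MT output against the reference translation(s), w_n are weights (typically 1/N with N = 4), c is the length of the MT output, and r is the effective reference length. The score lies between 0 and 1. Every term is computed against the chosen references, which is exactly where the bias described above comes in.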

Human evaluation of machine translation has always been done in the same way as for human translations, with the same inconsistencies, especially when results are examined over time and when the evaluations are done by different people. The typical error-catching approach of human evaluation is irredeemably biased as long as errors are not defined uniquely and unambiguously, and as long as no care is taken to limit the scope given to the evaluator’s subjective preferences.

The problem with human evaluation is bias. The red-pen syndrome.

Indeed, human evaluation of machine translation is known for being expensive, time-consuming and often skewed, and yet it is supposed to overcome the drawbacks introduced by the limited accuracy and approximation of automatic evaluation. However, the many new quality measurement metrics proposed over the years have not reduced the rough approximation we are still faced with. Instead, these new metrics, which are not well understood and introduce new kinds of bias, have added to the confusion. In fact, despite the many efforts made over the last few years, the overall approach has remained the same, with a disturbing inclination to move in the direction of too much detail rather than toward more streamlined approaches. For example, the new complexity arising from the integration of DQF and MQM has proven to be expensive, unreliable, and of limited value so far. Many know about the inefficiency and ineffectiveness of the SAE metrics once applied to the real world, with many new errors introduced by reviewers, together with many false positives. Indeed, translation quality metrics have become more and more complex and overly detailed, and always seem to be based on the error-catching approach that has proved costly and unreliable thus far. Automatic metrics can be biased too, especially when we assume that the human reference samples are human translation perfection, but at least they are fast, consistent, and convenient. And their shortcomings are widely known and understood.

People in this industry, and especially academics, seem to forget or ignore that every measurement must be of functional value to business, and that the plainer and simpler the measurement the better it is, enabling it to be easily grasped and easily used in a production mode.

On the other hand, just like human translation, machine translation is always of unknown quality, especially when rendered in a language unknown to the buyer, but it is intrinsically much more predictable and consistent, when compared to human translated projects with large batches of content, where many translators are possibly involved.

Effective upfront measurement helps to provide useful prior knowledge, thus reducing uncertainty, leading to well-informed decisions, and lessening the chance of deployment error. Ultimately, effective measurement helps to save money. Therefore, the availability of clear measures for rapid deployment is vital for any business using machine translation.

Also, any investment in machine translation is likely to be sizeable. Implementing a machine translation platform is a mid- to long-term effort requiring specialized resources and significant resilience to potentially frustrating outcomes in the interim. Effective measurements, including evaluation of outputs, provide a rational basis for selecting what improvements to make first.
In most cases, any measurement is only an estimate, a guess based on available information, made by approximation: it is almost correct and not intended to be exact.

In simple terms, the logic behind the evaluation of machine translation output is to get a few basic facts pinned down:

  1. The efficiency and effectiveness of the MT engines;

  2. The size of the effort required for further tuning the MT engine;

  3. The extent and nature of the PEMT effort.

Each measure is related to one or more strategic decisions. 

Automatic scores give at least some idea of the efficiency and effectiveness of engines. This is crucial to estimate the distance from the required and expected level of performance, and the time for filling the gap.

For example, if using BLEU as the automatic assessment reference, 0.5-0.8 could be considered acceptable for full post-editing, 0.8 or higher for light post-editing.
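The thresholds above can be written down as a simple triage rule. This is only a sketch of the rule of thumb in the text: the function name post_editing_level and the label for scores under 0.5 are assumptions, since the text does not say what to do below that range.

```python
def post_editing_level(bleu):
    """Map a BLEU score on the 0.0-1.0 scale to a suggested post-editing level."""
    if not 0.0 <= bleu <= 1.0:
        raise ValueError("BLEU must be between 0.0 and 1.0")
    if bleu >= 0.8:
        return "light post-editing"
    if bleu >= 0.5:
        return "full post-editing"
    # Assumed label: the text gives no recommendation below 0.5.
    return "below suggested range"


levels = [post_editing_level(score) for score in (0.42, 0.65, 0.8, 0.91)]
```

Note that the score passed in must be on the 0 to 1 scale. Many tools report BLEU on a 0 to 100 scale, so divide by 100 first.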

Full post-editing consists in fixing machine-induced meaning (semantic) distortion, making grammatical and syntactic adjustments, checking terminology for untranslated terms that could possibly be new terms, partially or completely rewriting sentences for target language fluency. It is reserved for publishing and providing high quality input for engine training.

Light post-editing consists in adjusting mechanical errors, mainly for capitalization and punctuation, replacing unknown words, possibly misspelled in the source text, removing redundant words or inserting missing ones, and ignoring all stylistic issues. It is generally used for content to be re-used in different contexts, possibly through further adaptation.

Detailed analytics can also offer an estimate of where improvements, edits, adds, replacements, etc. must be made and this in turn helps in assessing and determining the effort required.

After a comprehensive analysis of automatic evaluation scores has been accomplished, machine translation outputs can then undergo human evaluation.

When it comes to human evaluation, a major issue is sampling. In fact, to be affordable, human evaluation must be done on small portions of the output, which must be homogeneous and consistent with the automatic score.

Once consistent samples have been selected, human evaluation could start with fluency, which is affected by grammar, spelling, choice of words, and style. To prevent bias, evaluators must be given a predefined restricted set of criteria to comply with when voting/rating whether samples are fluent or not.

Fluency refers to the target only, without taking the source into account, and its evaluation does not always require evaluators to be bilingual; indeed, it is often better that they are not. However, always consider that monolingual evaluation of target text generally takes a relatively short time, and judgments are generally consistent across different people, but that the more instructions are provided to evaluators, the longer they take to complete their task, and the less consistent the results are. Then the same samples would be passed to bilingual evaluators for adequacy evaluation.

Adequacy is defined as the amount of source meaning preserved in translation. This necessarily requires a comparative analysis of source and target texts, as adequacy can be affected by completeness, accuracy, and cleanup of training data. Consider using a narrow continuous measurement scale.

A typical pitfall of statistical machine translation is terminology. Human evaluation is useful to detect terminology issues. However, that could mean that hard work is required to normalize training data, realign terminology in each segment, and analyze and amend translation tables.

Remember that the number and magnitude of defects (errors) are not the best or the only way to assess quality in a translation service product. Perception can be equally important. When working with MT, in particular, the type and frequency of errors are pivotal, even though not all of these errors can be resolved. Take the Six Sigma model: what could be a reasonably expected level for an MT platform? Now take terminology in SMT, and possibly, in the near future, NMT. Will it be practical to amend a very large training dataset so that the correct term(s) are always used? Implementing and running an MT platform is basically a cost-effectiveness problem. As we know, engines perform differently according to language pair, amount and quality of training data, etc. This means that a one-size-fits-all approach for TQA is unsuitable, and withdrawing an engine from production use might be better than insisting on trying to use or improve it, because the PEMT effort could be excessive. I don’t think that the existing models and metrics, including DQF, can be universally applied.

However, they could be helpful once automatic scores prove the engine could perform acceptably. In this case, defining specific categories for errors that emerge from testing and operating engines, and that could potentially occur repeatedly, is the right path to further engine tuning and development. And this cannot be done on the basis of abstract and often abstruse (at least to non-linguists) metrics.

Finally, to get a useful PEMT effort indicator that estimates the work an editor must do to bring the content above a predetermined acceptance quality level (AQL), a weighted combination of correlation and dependence, precision and recall, and edit distance scores can be computed. In any case, the definition of AQLs is crucial for the effective implementation of a PEMT effort indicator, together with a full grasp of analytics, which requires an extensive understanding of the machine translation platform and the training data.
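As a rough illustration of such an indicator, here is a minimal Python sketch that combines a normalized token-level edit distance with token precision and recall between raw MT output and its post-edited version. The function names, the equal weights, and the choice to fold precision and recall into an F1 score are all assumptions made for this example; the correlation and dependence component is left out, and real weights would have to be calibrated against the AQL in use.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def precision_recall(mt, pe):
    """Token-level precision and recall of MT output against the post-edit."""
    overlap = sum((Counter(mt) & Counter(pe)).values())
    precision = overlap / len(mt) if mt else 0.0
    recall = overlap / len(pe) if pe else 0.0
    return precision, recall


def pemt_effort(mt_text, pe_text, w_edit=0.5, w_f1=0.5):
    """Hypothetical effort score in [0, 1]; 0 means no editing was needed.

    The weights are illustrative assumptions, not calibrated values.
    """
    mt, pe = mt_text.split(), pe_text.split()
    if not mt and not pe:
        return 0.0
    norm_edit = edit_distance(mt, pe) / max(len(mt), len(pe))
    p, r = precision_recall(mt, pe)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return w_edit * norm_edit + w_f1 * (1.0 - f1)


print(pemt_effort("the cat sat", "the cat sat down"))
```

A segment whose score falls above a threshold derived from the AQL could then be flagged as too expensive to post-edit.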

Many of these aspects, from a project management perspective, are covered in more detail in the TAUS PE4PM course.

This course also covers another important element of a post-editing project: the editor’s involvement and remuneration. Especially in the case of full post-editing, post-editors could be asked to contribute to training an engine, and editors could prove extremely valuable on the path to achieving better performance.

Last but not least, the suitability of source text for machine translation and the tools to use in post-editing can make the difference between success and failure in the implementation of a machine translation initiative.

When a post-editing job comes to an LSP or a translator, nothing can at that point be done on the source text or the initial requirements. Any action that can be taken must be taken upstream, earlier in the process. In this respect, while predictive quality analysis at a translated file level has already been implemented, although not fully substantiated yet, predictive quality analysis at source text level is still to come. It would be of great help to translation buyers in general who could base their investment on reasonable measures, possibly in a standard business logic, and possibly improve their content for machine translation and translatability in general. NLP research is already evolving to provide feedback on a user’s writing, reconstruct story lines or classify content, and assess style.

In terms of activities going on in the post-editing side of the world, adaptive machine translation will be a giant leap forward when every user’s edits are made available to an entire community, by permanently incorporating each user’s evolving datasets into the master translation tables. Thus the system is continuously improving with ongoing use in a way that other MT systems do not. At the moment, Adaptive MT is restricted to Lilt and SDL (all talk so far) users. This means that it won’t be available in corporate settings where MT is more likely to be implemented unless SDL software is already in use and/or IP is not an issue. Also, being very clear and objective about the rationale for implementing MT is essential to avoid being misled when interpreting and using analytics. Unfortunately, in most situations, this is not the case. For instance, if my main goal is speed, I should look into analytics for something other than what I should look for if my goal is cutting translation costs or increasing consistency. Anyway, understanding the analytics is no laughing matter. But this is another kettle of fish.

Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm . He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts. 


Friday, October 7, 2016

Real and Honest Quality Evaluation Data on Neural Machine Translation

I just saw a Facebook discussion on the Google NMT announcements that explores some of the human evaluation issues, and thought I would add one more observation to this charade before I highlight a completely overlooked study that does provide some valuable insight into the possibilities of NMT (which I actually believe are real and substantial), even though it was done in a "small-scale" university setting.

Does anybody else think it is strange that none of the press and journalists who are gushing about the "indistinguishable from human translation" quality claimed by Google attempted to run even a single Chinese page through the new super duper GNMT Chinese engine?

Like this post for example where the author seems to have swallowed the Google story, hook, line, and sinker. There are of course 50 more like this. It took me translating just one page to realize that we are really knee deep in bullshit, as I had difficulty getting even a gist understanding with my random sample Chinese web page.

So, is there any honest, unmanipulated data out there, on what NMT can do?  

I have not seen all the details of the SYSTRAN effort but based on the sample output that I asked for, and the general restraint (in spite of their clear enthusiasm) they showed during my conversations, I tend to believe that they have made real progress and can offer better technology to their customers. But this study was just pointed out to me, and I took a look even though the research team has a disclaimer about this being on-going work where it is possible that some results and conclusions might change. I thought it deserved to be more visible and thus I wrote this.

So here we have a reasonable and believable answer to the question:

Is Neural Machine Translation Ready for Deployment? A Case Study on 30 Translation Directions

This was a study conducted at the University of Edinburgh, done entirely with the UN corpus. They had 10+ million sentences per language in common across the six United Nations core languages, which include Chinese. Now, this may sound like a lot of data to some, but the Google and Microsoft scale is probably larger by a factor of 10 or even 20. We should also understand that the computing resources available to this little research team are probably 1% to 5% of what Google can easily get access to. (More GPUs can sometimes mean you get an extra BLEU point or two.) So here we have a David and Goliath scenario in terms of resources, but interestingly, I think the inferences they draw are very similar to what SYSTRAN and Microsoft have also reported. The team reports:

"With the exception of fr-es and ru-en the neural system is always comparable or better than the phrase-based system. The differences where NMT is worse are very small. Especially in cases where Chinese is one of the languages in a language pair, the improvement of NMT over PB-SMT is dramatic with between 7 and 9 BLEU points....
We also see large improvements for translations out of and into Arabic. It is interesting to observe that improvements are present also in the case of the highest scoring translation directions, en-es and es-en."
The research team also admits that it is not clear what the implications might be for in-domain systems and I look forward to their hopefully less deceptive human evaluation:
"Although NMT systems are known to generalize better than phrase-based systems for out-of- domain data, it was unclear how they perform in a purely in-domain setting which is of interest for any larger organization with significant resources of their own data, such as the UN or other governmental bodies. This work currently lacks human evaluation which we would like to supply in future versions."
The comparative PBSMT vs NMT results are presented graphically below. The blue bar is SMT and the burgundy bar is NMT; I have highlighted the most significant improvements with arrows.

It is also interesting to note that when more processing power is applied to the problem, they do get some small improvement but it is clearly a case of diminishing returns. 

"Training the NMT system for another eight days always improves the performance of the NMT system, but gains are rather small between 0.4 and 0.7 BLEU. We did not observe any improvements beyond 2M iterations. It seems that stopping training after 8-10 days might be a viable heuristic with little loss in terms of BLEU. "
They additionally share some research on the NMT decoding throughput problem and resolution which some may find useful. Again, to be clear, (for the benefit of Mr. Mike) the scale described here is minuscule compared to the massive resources that Google, Microsoft, and Facebook probably use for deployment. But they show that NMT can be deployed without using GPUs or the Google TPUs for a fraction of the cost.  If this research team sends me a  translation of my test Chinese page I used on the GNMT,  I will share it with you so you can compare to GNMT.

We can all admit that Google is doing this on a bigger scale, but from my vantage point, it seems that they are really not getting that much better results. As University of Edinburgh’s Rico Sennrich said in his Slator interview: “Given the massive scale of the models, and the resulting computational cost, it is in fact, surprising that they do not outperform recently published work—unfortunately, they only provide a comparison on an older test set, and against relatively old and weak baselines.” He also adds that the Edinburgh system outperformed the Google system in the WMT16 evaluation (which shows how well NMT systems, and the University of Edinburgh in particular, have been doing in comparative evaluations).

So what does this mean?

NMT is definitely here for real and is likely to continue improving albeit incrementally.  If you are an enterprise concerned about large-scale Chinese, Japanese, Korean and Arabic translation you should be looking at NMT technology or talking to MT vendors who have real NMT products. This technology may be especially valuable for those interested in scientific and technical knowledge content like patent and scientific paper related information.

Hopefully, the improvement claims in future are more carefully measured and honest, so that we don't get translators all up in arms again after they see the actual quality that systems like the "near human quality" GNMT ZH-EN  produce. The new NMT systems that are emerging, however,  are already a definite improvement for the casual internet user who just wants better quality gisting. 

SYSTRAN will shortly start providing examples of "adapted" NMT systems, which are instantly tuned versions of a generic NMT engine. If the promise I saw in my investigation is anywhere close to some of the Adaptive MT capabilities, NMT is a real game changer for the professional translation industry as well.

Remember, the real goal is not better NMT systems; rather, it is better quality automated translation that both supports production business translation work and allows users to get an accurate sense of the meaning of foreign language content quickly.

For those who think that this study is not an industrial strength experiment, you may be interested to know that one of the researchers quietly published this which shows that their expertise is very much in play at the WIPO even though the training sets were very small. As he says:
"A few days after Google, WIPO (the World Intellectual Property Organization) just deployed its first two in-house neural machine translation (NMT) systems for Chinese-English and Japanese-English. Both systems are already available through the Patentscope Translation Assistant for Patent Texts."
Even at this initial stage, the NMT system BLEU scores show impressive gains, and these scores can only go up from here.

Japanese to English        SMT = 24.41      NMT = 35.99
Chinese to English          SMT = 28.59      NMT = 37.56
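To put these numbers in perspective, the short Python sketch below works out the absolute and relative BLEU gains from the four scores listed above. The function name `bleu_gain` is just an illustrative helper; only the scores themselves come from the WIPO announcement.

```python
# BLEU scores as reported above: (SMT, NMT) per translation direction.
scores = {
    "Japanese to English": (24.41, 35.99),
    "Chinese to English": (28.59, 37.56),
}


def bleu_gain(smt, nmt):
    """Return (absolute gain in BLEU points, relative gain in percent)."""
    absolute = nmt - smt
    relative = absolute / smt * 100
    return round(absolute, 2), round(relative, 1)


for pair, (smt, nmt) in scores.items():
    absolute, relative = bleu_gain(smt, nmt)
    print(f"{pair}: +{absolute} BLEU points ({relative}% over SMT)")
```

That is a gain of 11.58 BLEU points for Japanese to English and 8.97 points for Chinese to English.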

This system is live and is something you can try out right now at this link.

The research team whose work triggered this post and essentially wrote it includes Marcin Junczys-Dowmunt, Tomasz Dwojak, and Hieu Hoang.

P.S. I have some wise words coming up on the translation evaluation issue from a guest author next week. I would love to have more translators step forward with their comments on this issue or even volunteer to write a post on human evaluation of MT output.


Wednesday, October 5, 2016

Feedback on the Google Neural MT Deception Post

There was an interesting discussion thread on Reddit about the Google deception post with somebody using the alias oneasasum that I thought was worth highlighting here, since it was the most coherent criticism of my original post.

Google makes MASSIVE progress on Machine Translation -- "We show that our GNMT system approaches the accuracy achieved by average bilingual human translators on some of our test sets."

This is a slightly cleaned up version of just our banter from the whole thread that you can see at the link above which also has other fun comments:

KV: Seriously exaggerated -- take a look at this for a more accurate overview: The Google Neural Machine Translation Marketing Deception

HE: You should have also posted this article, as you did on another Reddit forum:

That's a much better take, in my opinion.
I saw the blog posting myself the other day. This isn't marketing deception, and most of what this guy covers in his piece, I also covered in mine -- with the exception of pointing out the "60%" and "87%" claims as not being meaningful. (My title may have given you a different impression, however.)

People in NLP are not impressed by the advances in theory or algorithm, as the results amount to a repackaging of methods developed over the past two years by the wider academic community; but are impressed by the scale of the effort, and by the results. See, for example, what Yoav Goldberg said on Twitter -- he said he's impressed by the results:

The GNMT results are cool. the BLEU not so much, only the human evals. But this is very hard to compare to other systems.
Another example is Kyunghyun Cho, known for his work on neural machine translation:

“I am extremely impressed by their effort and success in making the inference of neural machine translation fast enough for their production system by quantized inference and their TPU,” Cho says.
The second thing I would say is that the research article is written by researchers, not Google marketing people. The Google marketing people have no sway over how researchers pitch their results in research articles. 

My read of what these researchers have written (and also what a Google software engineer or two wrote on Twitter, before deleting their comments), is that they are very excited by their work, and feel they have made genuine progress. What you are seeing is not "hype", but "excitement". But there is always a price to pay for showing emotion -- somebody will always try to bring you back down to earth.

The third thing I would say is that this is the first example of a large deployment of neural machine translation, according to Cho again:

That, in and of itself, is praiseworthy.

But he confirmed that Google seems to be the first to publicly announce its use of neural machine translation in a translation product.
The fourth thing I would say is to take with a grain of salt comments by people from either a competing product or school of thought. Perhaps this doesn't apply here; but it's still good to keep it in mind. An example of this might be something like the following: say you have one group working on classical knowledge representation using small data. And then say a machine learning method with large amounts of data makes progress on a problem they care about. What are they going to say? Are they going to say, "That's really great that we are now making progress on this old, stubborn problem!"? No, more likely they'll say, "That's just empty hype. They're nowhere near to solving that problem, and if they really want to make progress they'll drop what they're doing and use some classical knowledge representation."

KV: While the sheer scale of the initiative, both in terms of training data volume and the ability to provide translations to millions of users at production scale, is impressive, the actual translation quality results are really not that impressive and certainly do not warrant claims such as “Nearly Indistinguishable From Human Translation” and “GNMT reduces translation errors by more than 55%-85% on several major language pairs”.

The translation improvement claims based on the human evaluation are where the problem lies. The validity of the human evaluation is the biggest question mark about the whole report. This is well known to people in the MT research community, so to make the claims they did is disingenuous and even deceptive.

I agree they are doing it on a massive scale but actually, it is surprising that they seem to have gotten so little benefit in translation quality improvement as Rico Sennrich at the University of Edinburgh says in this post:

HE: Well, I suppose they will work harder next time to find a better way to measure the quality of their system. Again, I don't think they were trying to deceive.
One thing I would say, however, is that BLEU scores have problems, too. One problem is that even human translators sometimes have low BLEU scores (I had a better reference for this, but lost it, so will give this one):

Recent experiments computed so-called human BLEU scores, where a human reference translation scored against other human reference translations. Such human BLEU scores are barely higher (if at all) than BLEU scores computed for machine translation output, even though the human translations are better.

KV: Absolutely, BLEU scores are deeply flawed, but they are UNDERSTOOD, and so they continue to be used, as all the other metrics are even worse. I have written about this on my blog.
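To make the metric under discussion concrete, here is a simplified, sentence-level BLEU sketch in Python: clipped n-gram precision up to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It is an illustration only, with no smoothing and naive whitespace tokenization, and it is not the scoring script used in any of the studies mentioned here. It does show why a perfectly good human paraphrase can score very low against a single reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Simplified sentence-level BLEU (no smoothing), in the range [0, 1]."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty n-gram level zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, using the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
paraphrase = "the cat is sitting on the mat"
print(bleu(reference, [reference]))            # identical text
print(bleu(paraphrase, [reference]))           # valid paraphrase, 4-gram BLEU
print(bleu(paraphrase, [reference], max_n=2))  # same pair, bigrams only
```

The paraphrase shares no 4-gram with the reference, so its unsmoothed score collapses to zero even though a human reader would accept it.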

So here is an example of the "nearly indistinguishable from human translation" GNMT output for a Chinese web page that I just ran through the new NMT engine, and that just happens to talk about work that Baidu, Alibaba and Microsoft are doing. It is definitely better than looking at a page of Chinese characters (for me anyway) but clearly a very long way from human translation.

HE: Yes, I saw those examples. This guy had posted a link to Twitter, before he deleted it:

He is a software engineer at Google, and was very excited by the results. But, yes, those particular examples weren't great. Not clear whether they were a random sample, or a sample showing the range of quality. 

Also another fun thing that he noticed from the press coverage:

Here's a Technology Review article about it: Google’s New Service Translates Languages Almost as Well as Humans Can   

This quote is priceless:

“It can be unsettling, but we've tested it in a lot of places and it just works,” he [Googler and co-author on the paper Quoc Le] says.


Take a look at that Chinese Newspaper sample above, which I ran today. Seriously what are these guys smoking and are they really so deluded? Yes, clearly they have done something that few can do in terms of using thousands of computers and solving a tough computing challenge. But of very little benefit for the guy who does not speak Chinese as the English they produce is STILL pretty hard to follow.  This is the source page. And this is the translation I got today from the super duper  human-like GNMT!  

Original Chinese Text:

 Google GNMT Translation:

Language service support "along the way" and the line and far