Wednesday, January 25, 2012

A Short Guide to Measuring and Comparing Machine Translation Engines

This is an article from the Asia Online November 2011 Newsletter that provides useful advice for meaningful comparisons of  MT engines and is authored by Dion Wiggins, CEO of Asia Online. So the next time somebody promises you a BLEU of 60, be skeptical, and make sure you get the proper context and assurances that it was properly done. And if they say they have a BLEU score of 90 you know that you are clearly in the bullshit zone.

“What is your BLEU score?” This is the single most irrelevant question relating to translation quality, yet one of the most frequently asked. BLEU scores and other translation quality metrics greatly depend on many factors that must be understood in order for a score to be meaningful. A BLEU score of 20 in some cases can be better than a BLEU score of 50 or vice versa. Without understanding how a test set was measured and other details such as language pair and domain complexity, a BLEU score without context is not much more than a meaningless number. (For a primer on BLEU look here.)

BLEU scores and other translation quality metrics will vary based upon:
  • The test set being measured: Different test sets will give very different scores. A test set that is out of domain will usually score lower than a test set that is in the domain of the translation engine being tested. The quality of the segments in the test set should be gold standard (i.e. validated as correct by humans). Lower quality test set data will give a less meaningful score.
  • How many human reference translations were used: If there is more than one human reference translation, the resulting BLEU score will be higher as there are more opportunities for the machine translation to match part of the reference.
  • The complexity of the language pair: Spanish is a simpler language in terms of grammar and structure than Finnish or Chinese relative to English. Typically if the source or target language is relatively more complex,  the BLEU score will be lower.
  • The complexity of the domain: A patent has far more complex text and structure than a children’s story book. Very different metric scores will be calculated based on the complexity of the domain. It is not practical to compare two different test sets and conclude that one translation engine is better than the other.
  • The capitalization of the segments being measured: When comparing metrics, the most common form of measurement is Case Insensitive. However when publishing, Case Sensitive is also important and may also be measured.
  • The measurement software: There are many measurement tools for translation quality. Each may vary slightly with respect to how a score is calculated, or the settings for the measure tools may not be set the same. The same measurement software should be used for all measurements. Asia Online provides Language Studio™ Pro free of charge and this software measures the scores, for a given test set, for a variety of quality metrics.
It is clear from the above list of variables that a BLEU score number by itself has no real meaning.
How BLEU scores and other translation metrics are measured
With BLEU scores, a higher score indicates higher quality. A BLEU score is not a linear metric. A 2 BLEU point increase from 20 to 22 will be considerably more noticeable than the same increase from 50 to 52. F-Measure and METEOR also work in this manner where a higher score is also better. For Translation Error Rate (TER), a lower score is a better score. Language Studio™ Pro supports all of these metrics and can be downloaded for free.
Basic Test Set Criteria Checklist
The criteria specified by this checklist are absolute. Not complying with any of the checklist items will result in a score that is unreliable and less meaningful.
  • Test Set Data should be very high quality: If the test set data are of low quality, then the measurement delivered will not be reliable.
  • Test set should be in domain: The test set should represent the type of information that you are going to translate. The domain, writing style and vocabulary should be representative of what you intend to translate. Testing on out-of-domain text will not result in a useful metric.
  • Test Set Data must not be included in the training Data: If you are creating an SMT engine, then you must make sure that the data you are testing with or very similar data are not in the data that the engine was trained with. If the test data are in the training data the scores will be artificially high and will not represent the level of quality that will be output when other "blind" data are translated.
  • Test Set Data should be data that can be translated: Test set segments should have a minimal amount of dates, times, numbers and names. While a valid part a segment, they are not parts of the segment that are translated; they are usually transformed or mapped. The focus for a test set should be on words that are to be translated.
  • Test Set Data should have segments that are at between 8 and 15 words in length: Short segments will artificially raise the quality scores as most metrics do not take into account segment length. Short segments are more likely to get a perfect match of the entire phrase, which is not a translation and is more like 100% match with a translation memory. The longer the segment, the more opportunity there is for variations on what is being translated. This will result in artificially lower scores, even if the translation is good. A small number of segments shorter than 8 words or longer than 15 words are acceptable, but these should be limited.
  • Test set should be at least 1,000 segments: While it is possible to get a metric from shorter test sets, a reasonable statistic representation of the metric can only be created when there are sufficient segments to build statistics from. When there are only a low number of segments, small anomalies in one or two segments can raise or reduce the test set score artificially.Be skeptical of scores from test sets that only contain a few hundred sentences.
Comparing Translation Engines - Initial Assessment Checklist
Language Studio™ can be used for calculating BLEU, TER, F-Measure and METEOR scores.
  • All conditions of the Basic Test Set Criteria must be met: If any condition is not met, then the results of the test could be flawed and not meaningful or reliable.
  • Test set must be consistent: The exact same test set must be used for comparison across all translation engines. Do not use different test sets for different engines.
  • Test sets should be “blind”: If the MT engine has seen the test set before or included the test set data in the training data, then the quality of the output will be artificially high and not represent the true quality of the system.
  • Tests must be carried out transparently: Where possible, submit the data yourself to the MT engine and get it back immediately. Do not rely on a third party to submit the data. If there are no tools or APIs for test set submission, the test set should be returned within 10 minutes of being submitted to the vendor via email. This removes any possibility of the MT vendor tampering with the output or fine tuning the engine based on the output.
  • Word Segmentation and Tokenization must be consistent: If Word Segmentation is required (i.e. for languages such as Chinese, Japanese and Thai) then the same word segmentation tool should be used on the reference translations and all the machine translation outputs. The same tokenization should also be used. Language Studio™ Pro provides a simple means to ensure all tokenization is consistent with its embedded tokenization technology.
Ability to Improve is More Important than Initial Translation Engine Quality
The initial scores of a machine translation engine, while indicative of initial quality, should be viewed as a starting point for rapid improvement which is measured by the test set and BLEU scores. Depending on the volume and quality of data provided to the SMT vendor for training, the quality may be lower or higher. Most often, more important than the initial quality is how quickly the translation engine quality improves

Frequently a new translation engine will have gaps in vocabulary and grammatical coverage. Other machine translation vendors’ engines do not improve at all or merely improve very little unless huge volumes of data are added to the initial training data. Most vendors recommend retraining once you have gathered a volume of additional data that is at least 20% of the size of the initial training data that the engine was trained on. Even when this volume of data is added, only a small improvement is achieved. As a result, very few translation engines evolve in quality much further than their initial quality.

In stark contrast, Language Studio™ translation engines are created with millions of sentences of data that Asia Online has prepared in addition to the data that the customer provides. The translation engines improve rapidly with a very small amount of feedback. It is not uncommon to get a 1-2 BLEU score improvement with as little as a few thousand post-edited sentences. Language Studio has a unique 4 step approach that leverages the benefits of Clean Data SMT and manufactures additional learning data by directly analyzing the edits made to the machine translated output.
Consequently, only a small amount of post-edited feedback can improve Language Studio™ translation engine quality quite considerably, and it can do so at speeds much faster and with far less effort than with other machine translation vendors. Asia Online provides complimentary Incremental Improvement Trainings to encourage rapid translation engine quality improvement with every full customization and also offers additional complimentary Incremental Improvement Trainings when word packages are purchased, greatly reducing Total Cost of Ownership (TCO). 
An investment in quality at the development stages of a translation engine impacts and reduces the cost of post editing directly, while increasing post editing productivity. While the development of some rules, normalization, glossary and non-translatable term work will assist in the rate of improvement, the fastest and most efficient way to improve Language Studio™ engines is to post edit the translations and feed them back into Language Studio™ for processing. The edits will be analyzed and new training data will be generated, directly addressing the primary cause of most errors. In other words, just post editing as part of a normal project will result in an immediate improvement. Little or no other extra effort is needed. By leveraging the standard post editing process, the effort and cost of improvement as well as the volume of data required in order to improve is greatly reduced. 

Depending on the initial training data provided by the client, a small number of Incremental Improvement Trainings are usually sufficient for most Language Studio™ translation engines to improve to a quality level approaching near-human quality. 

Other machine translation vendors are now also claiming to build systems based on Clean Data SMT. Closer investigation reveals that their definition of “cleaning” is not the same as Asia Online. Removing formatting tags is not cleaning data. Language Studio™ analyzes translation memories and other training data and ensures that only the highest quality in domain data from trusted sources is included in the creation of your custom engine. The result is that improvements are rapid. Even with just a few thousand segments edited, the improvements are notable. When combined with Language Studio™ hybrid rules and an SMT approach to machine translation the quality of the translation output can increase by as much as 10, 20 or even 30 BLEU points between versions.
Comparing Translation Engines – Translation Quality Improvement Assessment
  • Comparing Versions: When comparing improvements between versions of a translation engine from a single vendor, it is possible to work with just one test set, but the vendor must ensure that the test set remains “blind” and that the scores are not biased towards the test set. Only then can a meaningful representation of quality improvement be achieved.
  • Comparing Machine Translation Vendors: When comparing translation engine output from different vendors, a second “blind” test set is often needed to measure improvement. While you can use the first test set, it is often difficult to ensure that the vendor did not adapt its system to better suit and be biased towards the test set and in doing so delivering an artificially high score. It is also possible for the test set data to be added to engines training data which will also bias the score.
As a general rule, if you cannot be 100% certain that the vendor has not included the first test set data or adapted the engine to suit the test set, then a second “blind” test set is required. When a second test set is used, a measurement should be taken from the original translation engine and compared to the improved translation engine to give a meaningful result that can be trusted and relied upon.
Bringing It All Together
The table below shows a real world example of a version 1 translation engine from Asia Online and an improved version after feedback. Additional rules were added to the translation to meet specific client requirements, which resulted in considerable improvement in translation quality. This is part of Asia Online’s standard customization process. Language Studio™ puts a very high level of control in the customer’s hands where rules, runtime glossaries, non-translatable terms and other customization features ensure the quality of the output is as close to human quality and requires the least amount of editing possible. 

BLEU Score
Case Sensitive
Asia Online  

Google Bing Systran
Reference 1 36.05 45.96 56.59 30.58 29.64 21.01
Reference 2 35.80 39.31 48.85 32.05 29.94 22.56
Reference 3 38.65 52.31 65.03 35.51 33.17 24.68
Combined References 50.45 66.52 80.48 44.58 41.65 30.26
Case Insensitive            
Reference 1 41.30 52.65 59.25 32.18 31.49 22.49
Reference 2 41.01 45.32 51.24 33.67 31.64 23.88
Reference 3 43.99 58.97 67.49 37.15 35.01 25.92
Combined References 56.83 74.35 82.89 46.26 43.68 31.68
*Language Pair: English into French.     Domain: Information Technology.           

It can be seen clearly from the scores above that when all three human reference translations are combined the BLEU score is significantly higher and that the BLEU scores vary considerably between each of the human reference translations. The impact of the improvement and the application of client specific rules can also be seen, raising the case sensitive BLEU score from 50.45 to 80.48 (an increase of 30.03 in just one improvement iteration). One interesting side effect of having multiple human references is that it is often possible to judge the quality of the human reference also. In the example above, the machine translation output is much closer to human reference 3, indicating a higher quality reference. The client later confirmed that the editor who prepared the reference was a senior editor and more skilled than the other 2 editors who prepared human reference 1 and 2. 

A BLEU score, as with other translation metrics, is just a meaningless number unless it is established in a controlled environment. Asking “What is your BLEU score?” could result in any one of the above scores being given. When controls are applied, translation metrics can be used both to measure improvements in a translation engine and compare translation engines from different vendors. However, while automated metrics are useful, the ultimate measurement is still a human assessment. Language Studio™ Pro also provides tools to assist in delivering balanced, repeatable and meaningful metrics for human quality assessment.