Wednesday, January 18, 2017

Objective Assessment of Machine Translation Technologies

Here are some comments from John Tinsley, CEO of Iconic Translation Machines, in response to the Lilt evaluation, the CSA commentary on that evaluation, and my last post on the variety of problems with quick competitive quality evaluations.

He makes the point that the best MT engines are tuned very carefully and deliberately to a specific business purpose, e.g. the MT systems at eBay, or the IT-domain knowledge bases translated by MT. None of these would do well in an instant competitive evaluation like the one Lilt ran, but they are all very high-value systems at a business level. Conversely, I think it is likely that Lilt would not do well translating the type of MT use-case scenario that Iconic specializes in, since Lilt is optimized for other kinds of use cases where active, ongoing PEMT is involved (namely typical localization).

These comments describe yet another problem with a competitive evaluation of the kind done by LiltLabs. 

John explains this very clearly below, and his statements hold true for others who provide deep, expertise-based customization, like tauyou, SYSTRAN, and SDL. However, it is possible that the Lilt evaluation approach could be valid for instant Moses systems and for comparisons against raw generic systems. I thought these statements were interesting enough to warrant a separate post.

Emphasis below is all mine.



The initiative by Lilt, the post by CSA, and the response from Kirti all serve to shine further light on a challenge we have in the industry that, despite the best efforts of the best minds, is very difficult to overcome. Similar efforts were proposed in the past at a number of TAUS events, and benchmarking continues to be a goal of the DQF (though not just of MT).

The challenge is in making an apples-to-apples comparison. MT systems put forward for such comparative evaluations are generally trying to cover a very broad range of content (which is what the likes of Google and Microsoft excel at). While most MT providers have such systems, they rarely represent their best offering or full technical capability.

For instance, at Iconic, we have generic engines and domain-specific engines for various language combinations, and on any given test set they may or may not outperform another system. I certainly would not want our technology judged on this basis, though!

From our perspective, these engines are just foundations upon which we build production-quality engines.

We have a very clear picture internally of how our value-add is extracted when we customise engines for a specific client, use case, and/or content type. This is when MT technology, in general, is most effective. However, the only way these customisations actually get done is through client engagements, and the resulting systems are typically either proprietary or too specific to a particular purpose to be useful for anyone else.

Therefore, the best examples of exceptional technology performance we have are not ones we can put forward in the public domain for the purpose of openness and transparency, however desirable that may be.

I've been saying for a while now that providing MT is a mix of cutting-edge technology and the expertise and capability to enhance its performance. In an ideal world, we would automate that capability to enhance performance as much as possible (which is what Lilt are doing for the post-editing use case), but the reality is that, right now, comparative benchmarking is evaluating only the former and not the whole package.

This is why you won't see companies investing in MT technology on the basis of public comparisons just yet.


These comments are also available at:  

