Monday, January 23, 2017

Finding the Needle in the Digital Multilingual Haystack

There are some kinds of translation applications where MT just makes sense. Usually, this is because these applications have some combination of the following factors: 
  • Very large volume of source content that could NOT be translated without MT
  • Rapid turnaround requirement (days)
  • Tolerance for lower quality translations at least in early stages of information review
  • A need for triage: identifying the highest-priority content in a large mass of undifferentiated content
  • Prohibitive cost (usually related to volume)
This is a guest post by Pete Afrasiabi of iQwest Information Technologies. It goes into some detail on the strategies employed to use MT effectively in a business application area that is sometimes called eDiscovery (often litigation related), but which in a broader sense could be any application where it is useful to sort through a large amount of multilingual content to find high-value content. In today's world, we are seeing a lot more litigation involving large volumes of multilingual documents, especially in cases that involve patent infringement and product liability. MT serves a very valuable purpose in these scenarios: it enables some degree of information triage. When Apple sues Samsung for patent infringement, it is possible that tens of thousands of documents and emails are made available by Samsung (in Korean) for review by Apple attorneys. It is NOT POSSIBLE to translate them all through traditional means, so MT or some other volume reduction process must be used to identify the documents that matter. Because these use cases often arise in litigation, it is generally considered risky to use the public MT engines, and most prefer to work within a more controlled environment. I think this is an application area that the MT vendors could service much more effectively by working more closely with expert users like the guest author.


Whether you manage your organization’s eDiscovery needs, are a litigator working with multinational corporations, or are a compliance officer, you commonly work with multilingual document collections. If you are an executive who needs to know everything about your organization, you would have a triage strategy that helps you get the right information ASAP. If the document count is over 50-100k, you typically employ native-speaking reviewers to perform a linear, one-by-one review of documents, utilize various search mechanisms to help you in this endeavor, or both. What you may find is that most documents reviewed by these expensive reviewers are either irrelevant or require an expert to review. If the population includes documents in 3 or more languages, the task becomes even more difficult!

There is a better solution, one that, if used wisely, can benefit your organization and save time, money, and a huge amount of headache. I am proposing that in these document populations the first thing you need to do is eliminate non-relevant documents, and if they are in a foreign language you need to see an accurate translation of each document. In this article, you will learn in detail how to improve the quality of these translations using machines, at a cost hundreds of times lower than human translation and naturally much faster.

With the advent of new machine translation technologies comes the challenge of proving their efficacy in various industries. Historically, MT has been looked at not only as inferior but as something to avoid. Unfortunately, the stigma that comes with this technology is not necessarily far from the truth. Adding to that, the incorrect methods various vendors have used to present its capabilities have led to its demise as an actively used tool across most industries. The general feeling is “if we can human translate them, why should we use an inferior method?” and that is true for the most part, except that human translation is very expensive, especially when the subject matter is more than a few hundred documents. So is there really a compromise? Is there a point where we can rely on MT to complement existing human translations?

The goal of this article is to look under the hood of these technologies and provide a defensible argument for how MT can be supercharged with human translations. Human beings’ innate ability to analyze content provides an opportunity to aid some of these machine learning technologies. Transferring that human analytical information into a training model for these technologies can provide dramatically improved translation results.

Machine Translation technologies are based on dictionaries, translation memories and some rules-based grammar that differs from one software solution to another. Newer technologies that utilize statistical analysis and mathematical algorithms to construct these rules have been available for the past several years, but unfortunately individuals who have the core competencies to utilize them are few and far between. On top of that, these software solutions are not by themselves the whole solution; they are just one part of a larger process that entails understanding language translation and how to utilize various aspects of each language and the features of each software solution.

I have personally witnessed most if not all of the various technologies utilized in MT and, about 5 years ago, developed a methodology that has proven itself in real-life situations. Here is a link to a case study on a regulatory matter that I worked on.

If followed correctly, these instructions can turn machine translated documents into documents with minimal post-editing requirements, at a cost hundreds of times lower than human translation. They will also look more like their human translated counterparts, with proper sentence flow and grammatical accuracy far beyond that of raw machine translated documents. I refer to this methodology as “Enhanced Machine Translation”: still not a human translation, but much improved from where we have been until now.

Language Characteristics

To understand the nuances of language translation we first must standardize our understanding of the simplest components within most if not all languages. I have provided a summary of what this may look like below.
  • Definition
    • Standard/Expanded Dictionaries
  • Meaning
    • Dimensions of a word's definition in context
  • Attributes
    • Stereotypical description of characteristics
  • Relations
    • Between concepts, attributes and definitions
  • Linguistics
    • Part of Speech / Grammar Rules
  • Context
    • Common understanding based on existing document examples

Simply accepting that this base of understanding is common amongst most, if not all, languages is important, since the model we will build assumes that these building blocks provide a solid foundation for any solution that we propose.

Furthermore, familiarity with various classes of technologies available is also important, with a clear understanding of each technology solution’s pros and cons. I have included a basic summary below.
  • Basic (Linear) rule-based Tools
  • Online Tools (Google, Microsoft, etc.)
  • Statistical Tools
  • Tools combining the best of both worlds of rules-based and statistical analysis

Linear Dictionaries & Translation Memories

  • Ability to understand the form of word (noun, verb, etc.) in a dictionary
  • One to one relationship between words/phrases in translation memories
  • Fully customizable based on language
  • Inability to formulate correct sentence structure
  • Ambiguous results, often not understandable
  • Usually a waste of resources in most use cases if relied on exclusively
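
To illustrate the sentence-structure limitation in the list above, here is a minimal sketch of a purely linear lookup. The glossary entries and the romanized Japanese tokens are illustrative assumptions and do not come from any real engine or from the author's tooling.

```python
# Toy word-for-word lookup: a hypothetical glossary, not a real MT engine.
GLOSSARY = {
    "watashi": "I",
    "wa": "(topic)",
    "shiyousho": "specification",
    "o": "(object)",
    "teishutsu-shita": "submitted",
}


def linear_translate(tokens, glossary):
    """Replace each token with its glossary entry, keeping source word order."""
    # Unknown tokens are passed through in angle brackets so gaps stay visible.
    return [glossary.get(tok, f"<{tok}>") for tok in tokens]


source = ["watashi", "wa", "shiyousho", "o", "teishutsu-shita"]
print(" ".join(linear_translate(source, GLOSSARY)))
# prints: I (topic) specification (object) submitted
```

Every word is found in the glossary, yet the output keeps the source word order and reads nothing like "I submitted the specification". That is the gap a one-to-one dictionary or translation memory cannot close on its own.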

Statistical Machine Translation

  • Ability to understand co-occurrence of words and build an algorithm to use as reference
  • Capable of comparing sentence structures based on examples given and further building on the algorithm
  • Can be designed to be case-centric
  • Words are not numbers
  • No understanding of form of words
  • Results could be similar to some concept searching tools that often fall off the cliff if relied on too much
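
The co-occurrence idea in the list above can be sketched in a few lines. This is only a counting exercise under stated assumptions: the sentence pairs are hypothetical, the Japanese is romanized and already split into words, and real statistical engines go much further than raw counts.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(sentence_pairs):
    """Count how often each (source word, target word) pair shares a sentence pair."""
    counts = Counter()
    for src, tgt in sentence_pairs:
        # set() so a word repeated inside one sentence is counted once per pair
        counts.update(product(set(src.split()), set(tgt.lower().split())))
    return counts


# Hypothetical, pre-tokenized parallel sentences (romanized for readability).
pairs = [
    ("shin konpou houhou", "new packing method"),
    ("shin shiyousho", "new specification"),
    ("kyoutsuu shiyousho", "common specification"),
]
counts = cooccurrence_counts(pairs)
print(counts[("shin", "new")], counts[("shin", "method")])  # prints: 2 1
```

With more sentence pairs, "shin" keeps turning up next to "new" while its chance pairings fade, which is the signal a statistical engine builds on. It also shows the weakness noted above: the counts know nothing about the form of a word.
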
Now that we understand what is available, it is crucial to build a model and process that takes advantage of the benefits of the various technologies while minimizing their disadvantages. To enhance the capabilities of any of these solutions, it is important to understand that machines and machine learning cannot by themselves be the only mechanism we build our processes on. This is where human translations come into the picture. If there were some way to utilize the natural ability of human translators to analyze content and build out a foundation for our solutions, would we be able to improve the resulting translations? The answer is a resounding yes!

BabelQwest: A Combination of Tools Designed to Assist in Enhancing the Quality of MT

To understand how we would accomplish this, we first need to review some of the terminology of machine-based concept analysis. In a nutshell, these definitions are what we have actually based our solutions on. I have listed the most important of them below, and have expanded each with how we, as linguists and technologists, utilize it in building out the “Enhanced Machine Translation” (EMT for short) solutions.
  • Classification: Gather a select representative set of the documents from the existing document corpus that represent the majority of subject matters to be analyzed
  • Clustering: Build out documents selected in the classification stage to find similar documents that match the cluster definitions and algorithms of the representative documents
  • Summarization: Select key sections of these documents as keywords, phrases, and summaries
  • N-Grams: N-Grams are the basic co-occurrences of multiple words within any context. We will build these N-Grams from the output of the summarization stage and create a spreadsheet listing each N-Gram alongside its raw machine translated counterpart. The spreadsheet is built into a voting worksheet that allows human translators to analyze each line and provide feedback on the correct translation, and even on whether a captured N-Gram should be part of the final training seed data at all. This seed data will fine-tune the algorithms built out in the next stage, down to the context level and with human input. A basic depiction of this spreadsheet is shown below.

Voting Mechanism

iQwest Information Technologies Sample Translation Native Reviewer Suggestion Table

Japanese | English
アナログ・デバイス | Analog Devices
デバイスの種類によりスティック品 | Stick with the type of product devices
トレイ品として包装・ | as the product packaging tray
新たに納入仕様書 | the new technical specifications
 | Common Parameters
共通仕様書 | Common Specifications
で新梱包方法を提出してもらうことになった | have had to submit to the new packing method
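
The N-Gram step and voting worksheet described above could be prototyped roughly as follows. This is a sketch under assumptions, not the BabelQwest implementation: the column names, the `min_count` threshold and the `stub_mt` placeholder are all hypothetical, and the source text is assumed to be already segmented into words (which Japanese requires as a separate step).

```python
import csv
import io
from collections import Counter

FIELDS = ["source_ngram", "count", "raw_mt", "reviewer_translation", "keep_as_seed"]


def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_voting_rows(summaries, raw_mt, n=2, min_count=2):
    """Build worksheet rows for N-Grams seen at least min_count times.

    raw_mt is a placeholder callable standing in for whatever engine
    produces the raw machine translation.
    """
    counts = Counter()
    for tokens in summaries:
        counts.update(extract_ngrams(tokens, n))
    rows = []
    for ngram, count in counts.most_common():
        if count < min_count:
            continue
        source = " ".join(ngram)
        rows.append({
            "source_ngram": source,
            "count": count,
            "raw_mt": raw_mt(source),
            "reviewer_translation": "",  # filled in by the human translator
            "keep_as_seed": "",          # reviewer votes yes or no
        })
    return rows


def write_worksheet(rows, handle):
    """Write the voting worksheet as CSV so it opens in any spreadsheet tool."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def stub_mt(text):
    """Hypothetical stand-in for a raw MT call."""
    return f"<raw MT of: {text}>"


# Hypothetical summarized passages, romanized and pre-segmented.
summaries = [
    ["shin", "konpou", "houhou"],
    ["shin", "konpou", "shiyousho"],
]
rows = build_voting_rows(summaries, stub_mt)
buffer = io.StringIO()
write_worksheet(rows, buffer)
print(buffer.getvalue())
```

Only "shin konpou" appears twice, so it is the only N-Gram that reaches the worksheet. The two empty columns are where the native reviewer supplies a corrected translation and votes on whether the line belongs in the seed data.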

  • Simultaneously, human translate the source documents that generated these N-Grams. The human translation stage will build out a number of document pairs, with the original content in the original language in one document and the human translated English version in another. These will be imported into a statistical and analytical model to build the basic algorithms. By incorporating these human translated documents into the statistical translation engine training, the engine will discover word co-occurrences and their relations to the sentences they appear in, as well as variations of terms as they appear in different sentences. The algorithms will be further fine-tuned with the results of the N-Gram extraction and translation performed by human translators.
  • Define and/or extract the names and titles of key individuals. This stage is crucial, and this is usually the simplest information to gather, since most if not all parties involved already have references in email addresses, company org charts, etc. that can be gathered easily.
  • Start the training process of the translation engines from the results of the steps above (multilevel and conditioned on the volume and type of documents).
  • Once a basic training model has been built, we test by machine translating the original representative documents and comparing the output with their human translated counterparts. This stage can be accomplished with fewer than one hundred documents to prove the efficacy of the process, which is why we refer to it as the “Pilot” stage.
  • Repeat the same steps with a larger subset of documents to build a larger training model and to prove that the overall process is fruitful and can be utilized to machine translate the entire document corpus. We refer to this as the “Proof of Concept” stage, and it is the final validation stage. We then start staging the entirety of the documents subject to this process in a “Batch Process” stage.
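
The Pilot comparison between machine output and human translations can be made measurable with a simple score. The article does not name a metric, so the token-level similarity below (Python's standard `difflib`) is an assumption chosen for brevity, and the sentences are hypothetical examples rather than project data.

```python
from difflib import SequenceMatcher


def token_similarity(machine, human):
    """Similarity in [0, 1] between two translations, compared token by token."""
    a, b = machine.lower().split(), human.lower().split()
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pilot_report(pairs):
    """Average similarity over (machine, human) translation pairs."""
    scores = [token_similarity(machine, human) for machine, human in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical pilot documents: the same human reference compared against
# the output of an untrained engine and of an engine after training.
human = "they agreed to submit the new packing method"
before_training = [("have had to submit to the new packing method", human)]
after_training = [("they agreed to submit the new packing method", human)]

print(f"before: {pilot_report(before_training):.2f}")
print(f"after:  {pilot_report(after_training):.2f}")
```

Running the same report before and after each training round gives a rough trend line for the Pilot and Proof of Concept stages. A score like this only measures surface overlap, so it complements the native reviewers' judgment rather than replacing it.
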
In summary, we are building a foundation based on human intellect and analytical abilities to perform the final translations. To use the analogy of a large building, the representative documents and their human translated counterparts (pairs) serve as the concrete foundation and steel beams, the N-Grams serve as the building blocks in between the steel beams, and the names and titles of key individuals serve as the fascia of the building.

Naturally, we are not looking to replace human translation completely, and in cases where certified human translations are necessary (regulatory compliance, court-submitted documents, etc.) we will still rely heavily on it. Even so, the overall time and expense to complete a large-scale translation project are reduced by a factor of hundreds. The following chart depicts the ROI of a case on a time scale to help understand the impact such a process can have.

This process has additional benefits as well. Imagine for a moment a document production of over 2 million Korean-language documents that were produced over a long time span and from various locations across the world. Your organization has a choice: either review every single document and classify it into various categories utilizing native Korean reviewers, or utilize an Enhanced Machine Translation process so that a larger contingent of English-speaking reviewers can search, eliminate non-relevant documents, and classify the remainder.

One industry to which this solution offers immense benefits is the Electronic Discovery & Litigation Support industry, where the majority of attorneys who are experts in various fields are English-speaking. By utilizing these resources along with elaborate searching mechanisms (Boolean, Stemming, Concept Search, etc.) in English, they can quickly reduce the population of documents. On the other hand, if the law firm relied only on native-speaking human reviewers, a crew of 10 expert attorney reviewers, each reviewing 50 documents per hour (4,000 documents per day for the whole crew on an 8-hour shift), would take 500 working days to complete the review of 2 million documents, with each reviewer charging hourly rates that add up very quickly.
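
The review-time arithmetic for the 2 million document production described above can be checked with a short calculation. The crew size, review rate and shift length are the figures from the text; the 80% reduction in the second call is a purely hypothetical value used to show how the estimate responds once English-language searching removes documents from the population.

```python
import math


def review_days(total_docs, reviewers, docs_per_hour, hours_per_day):
    """Working days for a crew to review total_docs linearly, one by one."""
    docs_per_day = reviewers * docs_per_hour * hours_per_day
    return math.ceil(total_docs / docs_per_day)


# Figures from the text: 10 reviewers x 50 docs/hour x 8 hours = 4,000 docs/day.
print(review_days(2_000_000, 10, 50, 8))  # prints: 500

# Hypothetical: 80% of the population is eliminated before linear review.
remaining = 2_000_000 * 20 // 100
print(review_days(remaining, 10, 50, 8))  # prints: 100
```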

We have constructed a chart, from data gathered over the past 15 years of performing this type of work for some of the largest law firms around the world, that shows the impact a proper document reduction or classification strategy may have at every stage of their litigation. Please note that the bars run from bottom to top, with MT being the brown shaded area.

The difference is stark, and if proper care is not given to implementation, organizations are often prevented from knowing the content of documents within their control or supervision. This becomes a real issue for Compliance Officers, who must know about every communication that occurs or has occurred within their organization at any given time.


Mr. Pete Afrasiabi, the President of iQwest, has been a veteran of aggregating technology-assisted business processes into organizations for almost 3 decades and has worked in the litigation support industry for 18 years. He has been involved with projects involving MT (over 100 million documents processed), Managed Services and eDiscovery since the inception of the company, as well as the deployment of technology solutions (CRM, Email, Infrastructure, etc.) across large enterprises prior to that. He has a deep knowledge of business processes and project management, and extensive experience working with C-level executives.

Pete Afrasiabi
iQwest Information Technologies, Inc.
