Thursday, July 21, 2016

5 Tools to Build Your Basic Machine Translation Toolkit

This is a second post from the MT Language Specialist team at eBay, by Juan Rowda. I have often been asked by translators what kinds of tools are useful when working with MT. There is a lot of corpus-level data analysis, preparation and editing going on around any competent MT project. While TM tools have some value, they tend to be segment-focused and do not scale; there are much better tools out there for the corpus pattern analysis, editing and comparison work that is necessary to build the best systems. We are fortunate to have some high-value tools laid out very clearly for us here by Juan, who has extensive direct experience working with large volumes of data and can provide experience-based recommendations.

If you are a linguist working with Machine Translation (MT), your job will be a lot easier if you have the right tools at hand. Having a strong toolkit, and knowing how to use it, will save you loads of time and headaches. It will help you work in an efficient manner, as well. 

As a Machine Translation Language Specialist at eBay, I use these tools on a regular basis at work, and that is why I feel comfortable recommending them. At eBay, we use MT to translate search queries and listing titles and descriptions into several languages. If you want to learn more, I encourage you to read "The Next Big Thing You Missed: Why eBay, Not Google, Could Save Automated Translation".

1. Advanced Text Editors
Notepad won't cut it, trust me. You need an advanced text editor that can, at least:
  • deal with different file encoding formats (UTF, ANSI, etc.)
  • open big files and/or with unusual formats/extensions
  • do global search and replace operations with regular expressions support
  • highlight syntax (display different programming, scripting or markup languages -XML, HTML, etc.- with color codes)
  • have multiple files open at the same time (tabs)
This is a list of my personal favorites, but there are a lot of good editors out there.
Notepad++: My editor of choice. You can open virtually any file with it, it's really fast, and it will keep your files in the editor even if you close it. You can easily search and replace in a file or in all open files, using regular expressions or just extended characters (control characters like \n or \t). It's really easy to convert between file encodings and save all open files at once. You can also download different plugins, like spellcheckers, comparators, etc. It's free to download.

Sublime: This is another amazing editor, and a developers' favorite. Personally, I find it great for writing scripts. You can do many cool things with it, like using multiple selections to change several instances of a word at once, splitting a selection of words into different lines, etc. It supports regular expressions and tabs, as well. It has a distraction-free mode if you really need to focus. You can download and evaluate it for free, although a license is required for continued use.

EmEditor: Syntax highlighting, document comparison, regular expressions, handles huge files, encoding conversion… EmEditor is extremely complete. My favorite feature, however, is the scriptable macros. This means you can create, record, and run macros within EmEditor - you can use these macros to automate repetitive tasks, like making changes in several files and/or saving them with different extensions.
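If you ever need to run the same kind of clean-up outside an editor, the two operations mentioned most often above (regular-expression search and replace, and encoding conversion) can be sketched in a few lines of Python. This is only an illustration, not how any of these editors work internally; the function names and the sample text are made up for the example, and the source encoding (cp1252) is an assumption you would adjust to your files.

```python
import re


def normalize_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs into one space and trim line ends."""
    # Two or more consecutive spaces or tabs become a single space.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Remove whitespace left at the end of each line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


def convert_encoding(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    sample = "Brand  new\tcamera   \nFree   shipping"
    print(normalize_spaces(sample))
```

The advantage of an editor is that you see the result immediately; the advantage of a script is that you can apply exactly the same change to hundreds of files.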

2. QA Tools
Quality Assurance Tools assist you in automatically finding different types of errors in translated content. They all basically work in a similar way: 1) you load files with your translated content (source + target); 2) you optionally load reference content, like glossaries, translation memories, previously translated files or blacklists; 3) the tool checks your content and provides a report listing potential errors. Some of the errors you can find using a QA Tool are:
  • terminology: term A in the source is not translated as B in the target
  • blacklisted terms: terms you don't want to see in the target
  • inconsistencies: same source segment with different translations
  • differences in numbers: source and target numbers should match
  • capitalization
  • punctuation: missing or extra periods, duplicate commas, etc.
  • patterns: certain user-defined patterns of words, numbers and signs, which may contain regular expressions to make them more flexible, expected to occur in a file.
  • grammar and spelling errors
  • duplicate words, tripled letters, and more.
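To make a few of these checks more concrete, here is a minimal sketch of how number mismatches, repeated words, untranslated segments and inconsistencies could be detected in source/target pairs. It is not how Xbench or Checkmate implement their checks; the function names and sample segments are hypothetical, and the number check is deliberately naive (it would flag a legitimate "1.5" vs. "1,5" locale difference as a mismatch).

```python
import re
from collections import defaultdict

# Digits, optionally with . or , as separators (naive on purpose).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
# The same word twice in a row, e.g. "nueva nueva".
REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)


def check_segment(source: str, target: str) -> list:
    """Return a list of potential issues for one source/target pair."""
    issues = []
    if sorted(NUMBER.findall(source)) != sorted(NUMBER.findall(target)):
        issues.append("number mismatch")
    if REPEATED_WORD.search(target):
        issues.append("repeated word")
    if source.strip() == target.strip():
        issues.append("target identical to source")
    return issues


def find_inconsistencies(pairs) -> dict:
    """Map each source segment to its translations when there are several."""
    by_source = defaultdict(set)
    for source, target in pairs:
        by_source[source].add(target)
    return {s: sorted(t) for s, t in by_source.items() if len(t) > 1}


if __name__ == "__main__":
    print(check_segment("Pack of 12 batteries", "Paquete de 10 pilas"))
```

Like the real tools, this only reports potential errors; a linguist still has to decide which hits are actual problems.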
Some QA Tools you should try are:
Xbench allows you to run the following QA Checks: find untranslated segments, segments with the same source text and different target text, and segments with the same target text and different source text, find segments whose target text matches the source text (potentially untranslated text), tag mismatches, number mismatches, double blanks, repeated words, terminology mismatches against a list of key terms, and spell-check translations. Some linguists like to add all their reference materials in Xbench, like translation memories, glossaries, termbases and other reference files, as the tool allows you to find a term while working on any other running application with just a shortcut.
Xbench also has an Internet Search tab to run searches on Google. The list is pretty limited, but there are ways to expand it; maybe I'll share that in the future.

Checkmate is the QA Tool part of the Okapi Framework, which is an open source suite of applications to support the localization process. That means the Framework includes some other tools, but Checkmate is the one you want to perform quality checks on your files. It supports many bilingual file formats, like XLIFF, TTX, and TMX. Some of the checks you can run are: repeated words, corrupted characters, patterns, inline code differences, significant differences in length between source and target, missing translations, spaces, etc. The patterns section is especially interesting; I will come back to it in the future. Checkmate produces comprehensive error reports in different formats. It can also be integrated with LanguageTool, an open source spelling and grammar checker.

3. Comparison Tools 

Why do you need a comparison tool? Comparing files is a very practical way to see in detail what changes were introduced, e.g. which words were replaced, which segments contain changes, or whether there is any content added or missing. Comparing different versions of a file (for example, before and after post-editing) is essential for processes that involve multiple people or steps. Beyond Compare is, by far, the best and most complete comparison tool, in my opinion.

You can also compare entire folders. If you work with many files, comparing two folders is an effective way to determine if you are missing any files or if a file does not belong in a folder. You can also see if the contents of the files are different or not. 
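For a rough idea of what such a comparison does, the sketch below uses Python's standard difflib and filecmp modules to list word-level changes between two versions of a segment and to summarize the differences between two folders. It is not a substitute for a dedicated comparison tool; the function names are invented for this example, and the folder check relies on filecmp's default shallow comparison, which can miss a change when two files have the same size and timestamp.

```python
import difflib
import filecmp


def word_changes(before: str, after: str) -> list:
    """List (operation, old words, new words) for every changed span."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace/insert/delete spans
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def folder_report(left: str, right: str) -> dict:
    """Summarize files missing on either side and files that differ."""
    result = filecmp.dircmp(left, right)
    return {
        "only_left": sorted(result.left_only),
        "only_right": sorted(result.right_only),
        "different": sorted(result.diff_files),
    }


if __name__ == "__main__":
    raw_mt = "the red shoes for womans"
    post_edited = "the red shoes for women"
    print(word_changes(raw_mt, post_edited))
```

Running word_changes on raw MT output and its post-edited version is a quick way to collect the edits post-editors actually made.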

4. Corpus Analysis Tools

As defined by its website, AntConc is a freeware corpus analysis toolkit for concordancing and text analysis. This is, in my opinion, one of the most helpful tools you can find out there when you want to analyze your corpus or content, regardless of the language. AntConc will let you easily find n-grams and sort them by frequency of occurrence. It is a very practical way to identify the highest frequency n-grams in your corpus. Obviously, you want the most frequently used terms to be translated as accurately as possible. In most texts, words like prepositions or articles are the most common ones, so you can use a stop-word list to filter them out when they don't add any value to the task at hand.

AntConc is extremely helpful when it comes to finding patterns in your content. Remember - with MT, you want to fix patterns, not specific occurrences of errors. It may sound obvious, but finding and fixing patterns is a more efficient way to get rid of an issue than trying to fix each particular instance of an error. With AntConc you can select the minimum and maximum sizes of the n-grams you want to see, as well as the frequency.
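The idea behind n-gram counting is simple enough to sketch in Python. The snippet below is not AntConc's implementation, just an illustration of the same concept: it counts n-grams between a minimum and maximum size, ignores n-grams made up only of stop words, and keeps those that reach a minimum frequency. The stop-word list, the function name and the sample listing titles are all made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "for", "and", "in", "with"}


def ngram_counts(lines, min_n=1, max_n=3, min_freq=2, stop_words=STOP_WORDS):
    """Return (n-gram, count) pairs sorted by descending frequency."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                # Skip n-grams that consist only of stop words.
                if all(token in stop_words for token in gram):
                    continue
                counts[" ".join(gram)] += 1
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]


if __name__ == "__main__":
    titles = [
        "New leather case for iPhone",
        "Leather case for Samsung",
        "Case for the iPad",
    ]
    for gram, count in ngram_counts(titles, min_n=2, max_n=2):
        print(count, gram)
```

On a real corpus you would read the lines from a file; the most frequent n-grams at the top of the list are the ones worth checking first in the MT output.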

AntConc can create a list of each word occurring in your content, preceded by the number of hits. This can help you get a deeper insight into your corpus for terminology work, like which terms you should include in your glossary. These words can also tell you what your content is about - just by looking at the most frequent words, you can tell if the content is technical or not, if it belongs to any specific domain, and even which MT system you can use to translate it, assuming you have more than one customized system.
There are many things you can use this tool for and it deserves its own article.
Check AntConc out here

5. CAT Tools

CAT Tools make a great post-editing environment. Most modern tools can be connected to different machine translation systems, so you get suggestions both from a TM and from an MT system. And you can use the TM to save your post-edited segments and reuse them in the future. If you have to use glossaries or term bases, CAT tools are ideal, as they can also display terminology suggestions. 
When post-editing with a CAT tool, there are usually 2 approaches: you can get MT matches from a TM (of course, they need to be added to it previously) or a connected MT system, or you can work on bilingual, pre-translated files and store in your TM post-edited segments only. 
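As a simplified picture of the first approach, the sketch below looks a segment up in a small in-memory TM, accepts a fuzzy match above a threshold, and falls back to MT otherwise; the post-edited result is then stored for reuse. Everything here is hypothetical: the class and function names, the 0.75 threshold, and the use of difflib's character-based ratio as a stand-in for the proprietary fuzzy-match scoring of real CAT tools. The machine_translate argument is a placeholder for whatever MT system is connected.

```python
import difflib


class TranslationMemory:
    """A toy in-memory TM: source segment -> stored translation."""

    def __init__(self):
        self.segments = {}

    def store(self, source: str, target: str) -> None:
        self.segments[source] = target

    def lookup(self, source: str, threshold: float = 0.75):
        """Return (translation, score) for the best match, or None."""
        best_score, best_source = 0.0, None
        for known in self.segments:
            score = difflib.SequenceMatcher(None, source, known).ratio()
            if score > best_score:
                best_score, best_source = score, known
        if best_source is not None and best_score >= threshold:
            return self.segments[best_source], best_score
        return None


def suggest(source: str, tm: TranslationMemory, machine_translate):
    """Prefer a TM match; otherwise ask the MT system (a callable)."""
    match = tm.lookup(source)
    if match is not None:
        return ("TM", match[0])
    return ("MT", machine_translate(source))


if __name__ == "__main__":
    tm = TranslationMemory()
    tm.store("Free shipping worldwide", "Envío gratis a todo el mundo")
    print(suggest("Free shipping worldwide", tm, lambda s: "[MT] " + s))
```

After post-editing an MT suggestion, calling tm.store(source, post_edited) is what makes the segment available as a TM match next time.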
If you have never tried it, I totally recommend Matecat. It's a free, open source, web-based CAT tool, with a nice and simple editor that is easy to use. You don't have to install a single file. They claim you will always get up to 20% more matches than with any other CAT tool. Considering some tools out there cost around 800 dollars, what Matecat has to offer for free can't be ignored. It can process more than 50 file types; you can get statistics on your files (like word counts or even how much time you spent on each segment), split them, save them in the cloud, and download your work. Even if you have never used a CAT tool before, you will feel comfortable post-editing in Matecat in just a few minutes.

Another interesting free, open-source option is OmegaT. It is not as user-friendly as Matecat, so you will need some time to get used to it, even if you are an experienced TM user. It has pretty much all the main features commercial CAT tools have, like fuzzy matching and propagation; it supports around 40 different file formats, and it boasts an interface to Google Translate. If you have never used it, you should give it a try.

If you are looking into investing some money and getting a commercial tool, my personal favorite is MemoQ. It has tons of cool features and, overall, is a solid translation environment. It probably deserves a more detailed review, but that is outside of the scope of this post. You can learn more about MemoQ here.

Juan Rowda
Staff MT Language Specialist, eBay

Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan helped to localize quite a few major videogames, as well.
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation. 

Thursday, July 14, 2016

When MT does not take translators' jobs away - and may create more jobs

This is a guest post by Silvio Picinini who works in a team at eBay that provides linguistic feedback and addresses linguistic issues, specifically to enhance large scale MT projects underway at eBay. To my mind this is an example of best practices in MT, where you have NLP and MT experts working together with linguists to solve large scale translation problems in a collaborative way.  

The eBay linguistic team has actually been producing a number of articles that describe various kinds of linguistic tasks that are increasingly needed to add value and quality to large scale MT efforts. I think these articles are worth greater attention, as they have a high SNR (signal to noise ratio.) They are educating and informing readers of very specific things that IMO together add up to examples of best practice. I am hoping that Silvio and his colleagues become regular contributors to this blog so that more people get access to this valuable information.

I was honored to be invited by Kirti to write for this blog. I hope to deserve it, by sharing my experiences as a translator working with machine translation. Recently I was really impressed by Kirti's post on how a lot of content is being translated outside of the translation services industry. I would like to add a few thoughts to that.
I work with User-Generated Content for eBay. Users all over the world describe what they are selling, creating titles and descriptions for their items. In the millions. We need to translate the information on these items so that users who speak other languages can buy them. So this is the job: translate millions of items quickly, almost instantly. A new initiative at eBay is structuring data in a different way, and making it easier to create product reviews. In a short period, we accumulated millions of reviews. A review written in English about a digital camera (a product sold globally) is probably very useful for a buyer in Germany or in Mexico. So we need these reviews translated for these buyers. Could we do this by hiring human translators? No. It is easy to see that given the volume, time and cost involved, human intervention is out of the question. Virtually anything that is open to users, allowing them to create their own content, will generate volumes that are not feasible to translate by hand. These are real scenarios from eBay, but Facebook also recently announced the translation of posts with their own MT engine, and Amazon is working on MT.

In addition to what is already happening, we live in a world where new forms of content created by users appear every day. This content is of interest to a lot of people, and that will require translation. So here are some types of User-Generated Content that, in my opinion, will be of interest beyond their original language. I am guessing that their companies may be interested in translating this in the (near) future:
  • Rental Homes reviews on Airbnb
  • TripAdvisor reviews of places to see, eat and stay
  • Netflix movie reviews
  • LinkedIn articles
  • Tweets
  • How-to guides
  • Knowledge bases
  • Even Yelp reviews that seem local can be of interest to visitors from other countries or speakers of a second language in the same country (French in Canada, Spanish in the US)
  • In e-commerce: Product titles and descriptions, product reviews, messaging and user searches.

So this is what I meant with the title of the post: translators would never be offered User-Generated Content translations, so when these jobs go to machine translation engines, translators are not really affected in any way. MT is not taking any translator's job if there was no job in the first place. But maybe translators would like to affect this enormous translation market. Kirti has been posting guidance on how translators can prepare to participate in this opportunity.
From my experience at eBay, here are a few thoughts about the role that translators may play.
  • MT engines will need to be trained. The specific content needed for training may not be available to be harvested. Therefore, companies will need to create training data for their engines. This training data will be post-edited from the MT output, and this is a job that requires the human intervention of post-editors and reviewers. The quality of the MT output needs to be measured, and the measurement requires (in the case of BLEU) a human translated reference. So there is also a role for translators, instead of post-editors, in creating references for MT measurements.
  • The importance of the pattern over the individual error: the usual mindset for translators and reviewers is to focus on every error that they see, correct each one and produce perfect quality. For MT, the mindset should focus on patterns of errors. Translators will make a bigger impact by finding patterns of errors, because fixing a pattern improves quality on a larger scale: every translation the MT engine produces afterwards benefits from it.

Translators have the linguistic ability to see these patterns. In this paper at AMTA 2014, I presented a few patterns found in Brazilian Portuguese:
  • Diminutives are widely used by users in informal language, and are not commonly present in the training data, which is usually in a more formal language.
  • The lack of diacritical marks is common among users, both for accents and for marks that modify letters such as ç, ã and õ. The training data is usually written in a more formal language and will contain all the diacritical marks. The MT will have to deal with these differences, such as "relogio" vs. "relógio" and "calca" vs. "calça".
  • Some words are intended for the target language but are also words in the source language, causing issues. "Costumes" is a word in English, but also in Portuguese.
  • Some words are misspelled because certain letters have the same sound, causing issues for MT. For example, "engraçados" spelled as "engrassados" (ç and ss have the same sound).
  • Some words are spelled as people pronounce them, and this is different from the correct spelling. For example, "roupa" spelled as "ropa". MT needs to deal with that.
  • Some English words are spelled as they would be written with Portuguese language rules. So "Michael Jordan" would become "Maico Jordam". 
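To illustrate how one of these patterns could be handled in a pre-processing step, here is a minimal sketch for the missing-diacritics case. It is not the approach described in the AMTA paper or the one used at eBay; it simply assumes you have a word list with correct spellings (the tiny lexicon below is hypothetical) and restores the diacritics only when the match is unambiguous.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents and cedillas: 'relógio' -> 'relogio'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_restoration_map(lexicon) -> dict:
    """Map each unaccented form to the set of correct spellings."""
    restoration = {}
    for word in lexicon:
        key = strip_diacritics(word.lower())
        restoration.setdefault(key, set()).add(word.lower())
    return restoration


def restore(token: str, restoration: dict) -> str:
    """Restore diacritics only when exactly one spelling is possible."""
    candidates = restoration.get(strip_diacritics(token.lower()), set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return token  # unknown or ambiguous: leave the token untouched


if __name__ == "__main__":
    # Hypothetical mini-lexicon; "e" (and) and "é" (is) are both valid words.
    lexicon = ["relógio", "calça", "e", "é"]
    mapping = build_restoration_map(lexicon)
    print([restore(t, mapping) for t in "relogio e calca".split()]
    )
```

The ambiguous case is the interesting one: a word list alone cannot decide between "e" and "é", which is exactly why these patterns need linguists to look at them.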

There are MT companies, academic experts and customer engineering teams working with MT. It may be time for the language experts to play a role. 

Silvio Picinini has been a Machine Translation Language Specialist at eBay since 2013. With over 20 years in localization, he has worked in quality-focused positions for an LSP, and as an in-house English into Brazilian Portuguese translator for Oracle. A former quality engineer, Silvio holds degrees in electrical and electronic/software engineering.

Friday, July 8, 2016

Overview of Expert MT Systems -- tauyou

This is a guest post by Diego Bartolome, the CEO of tauyou, whom I regard as an expert MT developer with verifiable competence and a track record of success with MT. This post will be one in a series of upcoming posts to introduce readers of this blog to competent MT technology alternatives available in the market today. I am aware of several successful MT implementations they have had, from presentations made by their customers at industry conferences. Also, I have had several conversations with Diego in the past and felt it would be good to highlight his company and its capabilities, in his own words, as I think he offers solutions and services that are especially well suited for LSPs who realize that MT engine development is best left to experts.

tauyou <language technology> was created 10 years ago and initially had a completely different objective, which was to put machine translation in your pocket, in your mobile, everywhere. We won many prizes and we were in the media continuously thanks to our innovation, but the truth is that we lacked the most important thing in a company: recurrent revenue. It took us too long to realize this; it was not until late 2008 that we pivoted to machine translation solutions for the language industry in general, and Language Service Providers in particular. The pivoting period was tough, because we were running out of money even though our burn rate was extremely low, but we managed to survive, reached a state of sustainability, and have grown continuously since.
In 2009, selling MT to LSPs was tough. During many calls, I got everything from a simple NO without any further explanation to the typical excuses, which at that time were somewhat true: MT is the enemy, we will never use MT, it doesn't work for my use case, MT quality is too bad, etc. This has changed over the years, and since the beginning of 2012, we have been getting active requests from LSPs that want to integrate MT into their workflow and companies that need a specialized custom MT solution. We are experts in customizing MT and putting it to work in the least possible time. Also, in the past three years, we have seen an explosion of MT usage for post-editing, and also of raw MT usage, and we currently offer baseline engines companies can use if little or no data is available. Another key aspect of providing MT technology is the integration into existing production systems. Having APIs that allow clients to connect to our engines for real-time translation in an easy way has been a great asset to succeed in the MT landscape. No matter what CAT tool you use, either the tauyou product is already integrated into it, or it can be integrated in a short period of time.
Our process is extremely customized, and we adapt everything to the unique client use case. We have a deep knowledge of the technology and various support components, so we can customize our technology building blocks according to the very specific application that we are dealing with. The first engines we built took ages to be productive, but currently, some engines are ready to be used in production mode in just a matter of hours because of our continuously improving technology platform! The hardest part for our clients is still recruiting post-editors in some language combinations and verticals. Once the engines are built, they continuously improve thanks to the corrective linguistic feedback of translators that we rapidly incorporate back into the system, the automatic post-editing rules we extract, and frequent incremental retraining. I would say that integrating editor user feedback is key, and we help translation companies engage post-editors, either theirs or the more than 1,500 post-editors we have in our database. 

Engaging translators is key to the success of the MT initiative, and we regularly have calls together with our clients to improve the communication and really make it work for all parties. The key element for many translators might be fair compensation for MT-related work, but there are also many who see the possibility of learning a new skill and providing guidance that helps the MT engine become better. If the MT is better, it is also better for the translator! We have evolved and are now often considered the translator's best friend.

Thanks to our team of NLP engineers, we have developed many support modules to enhance the effectiveness of our MT systems. These include:
  • Named Entity Recognition,
  • Statistical analysis of the source content,
  • Glossaries,
  • Forbidden word lists,
  • Automatic post-editing rules,
  • Extraction of unknown words,
  • Perfect tag positioning,
  • Estimation of the MT Output Quality,
  • Summarization technology,
  • Classification of the source content to use the best MT,
  • Detection of events,
  • Spelling and Grammar checking, etc.
We can integrate any open-source tool that is available from the NLP (Natural Language Processing) and computational linguistics community at large, or develop custom applications for our customers, as we have done in the past. What is more important now, is our thorough knowledge of the process, and our extensive experience, which enriches the customer workflow to some extent. We now also try to make it easy for clients to embrace change, and achieve a significant Return on Investment on their MT technology investments. 

The best engines we have produced are related to the region of the world where we are based, i.e. language pairs including Spanish as source or target, such as bidirectional Spanish - Catalan, French, Portuguese, Italian, etc. Some of the engines that we are most proud of include generic bidirectional English - Danish, Swedish, and Norwegian that were developed for a major client with large TMs in the Nordic languages, on which we applied many NLP techniques and some innovative algorithms developed just for them to reach an impressive outcome. Other language pairs with English we can be proud of include Japanese, Korean, Hebrew, and Chinese. However, some clients also call us to develop engines not having English as source or target, where competitors do not perform so well, e.g. Danish into French, German into Swedish, or French into German, to name a few. If our clients have good data, the process becomes easier in any language.

There are several slide decks available from presentations we have made in the past to view here. This may be most interesting to somebody who is not familiar with us: The discreet charm of machine translation. Here is another link with several past presentations.

Recent applications of these MT technologies include chat translation for companies with internal employees that need to know a topic really well, and also real time social media translation. We can plug our MT into any need! In these cases, we use our stock/baseline engines, and they frequently outperform Google or Microsoft depending on the vertical and language pair. The advantage of our technology is that it can be installed within the client data center, thus securing the confidentiality of the data and giving the client full control. Thanks to our NLP knowledge and expertise, we now have a predictive typing technology to assist translators and content writers, a tool for Project Managers to automatically select the best translator for a given project based on their previous work history, and an automatic content generation tool, as well as others in development.

Our pricing is fairly simple: it's just a flat rate per month depending on the number of engines, where we include some development and expert consulting work. The translation volume is unlimited; clients can translate as many words as they want! Prices start as low as $1100 per month, and there are no set-up fees or upfront or hidden costs. Also, the model can be adjusted monthly based on the plan, without minimum commitment if the solution is installed online in a SaaS mode. It's better to try it and learn directly whether it will work for your application than to assume it won't work and possibly lose revenue and profit for your company! Either you succeed or you learn.

In the recent past, we have started to look into Neural Machine Translation together with our partner Prompsit Language Engineering. Even if results are promising, we think that the technology still needs time to evolve to be practical in a realistic business case for MT post-editing. However, there might be cases and languages such as Japanese or German, where Neural MT will outperform our hybrid machine translation in the near future. In any case, just as rule-based MT has not been replaced by Statistical Machine Translation (SMT), SMT still has some years of life left in it. Any company interested in being a leader in the MT space has to invest in R&D, and NMT is definitely the technology to research.
The translation future is MT-based :-) 

You can contact tauyou at the address below for more information. Please use the hashtag #emptypages in your communications to receive a special promotion. 


phone: +34 93 711 29 96
address: C/ Les Planes 39, 1o 2a
08201 Sabadell - Spain