Saturday, February 27, 2010

Rule-based MT vs. Statistical MT: Does it Matter?

One of the current debates in the MT community is RbMT vs. SMT. While I do have a clear bias that favors SMT, I have tried to be fair and have written many times on this subject. I agree that it cannot be said that one approach is definitely ALWAYS better than the other. There are many successful uses of both. In fact, at this point in time there may be more examples of RbMT successes since it has been around longer.

However, there is clear evidence that SMT continues to gain momentum and is increasingly the preferred approach. RbMT has been around for 50 years, and the MT engines we see around us are in many cases the result of decades of investment and research. SMT, by contrast, is barely five years old as a commercial technology; Kevin Knight began his research at USC in 2000, and commercial SMT offerings are only just beginning to reach the market.
The people best suited to answer the question of which approach is better are those who have explored both RbMT & SMT paradigms deeply, to solve the same problem. Unfortunately there are very few of these people around. The only ones I know for sure that have this knowledge are the Google Translate and Microsoft Live Translate teams and they have both voted in favor of SMT.
Today, RbMT still makes sense when you have very little data, or when you already have a good foundation rules engine in place that has been tested and is a good starting point for customization. Some say RbMT systems also perform better on language pairs with very large structural and morphological differences. Combinations like English <> Japanese, Russian, Hungarian or Korean still seem to often do better with RbMT. It is also claimed by some that RbMT systems are more stable and reliable than SMT systems. I think this is probably true of systems built from web-scraped or dirty data, but the story with clean data is quite different: SMT systems built with clean data are stable, reliable and much more responsive to small amounts of corrective feedback.

What most people still overlook is that the free online engines are not a good representation of the best output possible with MT today. The best systems come after focused customization efforts, and the best examples for both RbMT and SMT are carefully customized in domain systems that are built for very specific enterprise needs rather than for general web user translation.

It has also become very fashionable to use the word “hybrid” of late. For many this means using both RbMT and SMT at the same time. However, this is more easily said than done. From my viewpoint, characterizing the new Systran system as a hybrid engine is misleading. It is an RbMT engine that applies a statistical post-process to the RbMT output to improve fluency. Fluency has always been a problem for RbMT, and this post-process is an attempt to improve the quality of the raw RbMT output. Thus this approach is not a true hybrid from my point of view. In the same way, linguistics is being added to SMT engines in various ways to handle issues like word order and dramatically different morphology, which have been a problem for pure data-based SMT approaches. I think most of us agree that statistics, data and linguistics (rules and concepts) are all necessary to get better results, but there are no true hybrids out there today.
[RbMT vs. SMT comparison table]
I would also like to present my case for the emerging dominance of SMT with some data that I think we can mostly agree is factual and true, and not just a matter of my opinion.
Fact 1: Google used the Systran RbMT system as its translation engine for many years before switching to SMT. The Google engines are general-purpose baseline systems (i.e. not domain focused). Most people will agree that Google compares favorably with Babelfish, which is an RbMT engine. I am told they switched because they saw a better-quality future and continuing evolution with SMT, which CONTINUES TO IMPROVE as more data becomes available and corrective feedback is provided. Most people agree that the Google engines have continued to improve since the switch to SMT.
Fact 2: Most of the widely used RbMT systems have been developed over many years (decades in some cases), while none of the SMT systems are much over 5 years old; they are still in their infancy in 2010.
Fact 3: Microsoft also switched from a Systran RbMT engine to an SMT approach for all their public translation engines in the MSN Live portal, presumably for reasons similar to Google's. They also use a largely SMT-based approach to translate millions of words in their knowledge bases into 9 languages, which is perhaps the most widely used corporate MT application in the world today. The Microsoft quality also continues to improve.
Fact 4: Worldlingo switched from an RbMT foundation to SMT to get broader language coverage and attempt to reverse a loss of traffic (mostly to Google).
Fact 5: SMT providers have been able to easily outstrip RbMT providers in terms of language coverage, and we are only at the beginning of this trend. Google supported a base of 25 languages while they were RbMT-based; they now have over 45 languages, each of which can be translated into any of the others, yielding over 1,000 language combinations with their SMT engines.
Fact 6: The Moses Open Source SMT training system has been downloaded over 4,000 times in the last year. TAUS considers it “the most accessed MT system in the world today.” Many new initiatives are coming forth from this exploration of SMT by the open source community and we have not yet really seen the impact of this in the marketplace.

Google and Microsoft have placed their bets. Even IBM, which still has a legacy RbMT offering, has their Arabic and Chinese speech systems linked to an SMT engine that they have developed. So now, we have three of the largest IT companies in the world focused on SMT-based approaches. 

However, this is perhaps just relevant for the public online free engines. Many of us know that customized, in-domain systems are different, and for enterprise use they are the kind of system that matters most. How easy is it to customize an SMT vs. an RbMT engine?
Fact 7: Callison-Burch, Koehn et al. have published a paper (funded by Euromatrix) comparing engines for six European languages, both as baselines and after domain tuning (TM data for the SMT systems, dictionaries for the RbMT systems). They found that Czech, French, Spanish and German into English all had better in-domain results with SMT; only the English>German domain-focused systems had better results with RbMT. They did find that RbMT had better baselines in many cases, but the research teams do not have the data resources of Google or Microsoft, whose baseline SMT systems are much better.
Fact 8: Asia Online has been involved with patent-domain-focused systems in Chinese and Japanese. We have produced higher quality translations than RbMT systems that have been carefully developed with almost a decade of dictionary and rule tuning. The SMT systems were built over 3-6 months and will continue to improve. It should be noted that in both cases Asia Online is using linguistic rules in addition to raw data-based SMT engine development.
Fact 9: The intellectual investment from the computational linguistics and NLP community is heavily biased towards SMT, maybe by as much as a factor of 10X. This can be verified by looking at the focus of the major MT conferences in the recent past and in 2010. I suspect that this will mean continued advance and progress in the quality of SMT-based approaches.

Some of my personal bias and general opinion on this issue:
-- If you have a lot of bilingual matching phrase pairs (100K+) you should try SMT; in most cases you will get better results than with RbMT, especially if you spend some time providing corrective feedback in an environment like Asia Online. I think man-machine collaborations are much more easily engineered in SMT frameworks. Corrective feedback can be immediately useful and can raise the engine quality very quickly.
-- SMT systems will continue to improve as long as you have clean data foundations, continue to provide corrective feedback, and retrain these systems periodically after “teaching” them what they are getting wrong.
-- SMT will win the English to German quality game in the next 3 years or sooner.
-- SMT will become the preferred approach for most of the new high value markets like Brazilian Portuguese, Chinese, Indic Languages, Indonesian, Thai, Malaysian and major African markets.
-- SMT will continue to improve significantly in future because: Open Source + Academic Research + Growing Data on Web + Crowdsourcing Feedback are all at play with this technology

SMT systems will improve as more data becomes available, bad data is removed and as pre and post processing technologies around these systems improve. I also suspect that the future systems will be some variation of SMT + Linguistics (which includes rules) rather than data-only based approaches. I also see that humans will be essential to driving the technology forward and that some in the professional industry will be at the helm, as they do in fact understand how to manage large scale translation projects better than most.

I have also covered this in some detail in a white paper that can be found in the L10NCafe or on my LinkedIn profile, and there is much discussion of this subject in the Automated Language Translation group on LinkedIn, where you can also read the views of others with differing opinions. I recommend the entries from Jordi Carrera in particular, as he is an eloquent and articulate voice for RbMT technology. One of the best MT systems I know of, at PAHO, is an RbMT system with source analysis and cleanup and integrated, largely automated post-editing. The overall process flow is what makes it great, not that it is based on RbMT.
So does it matter what approach you use? If you have a satisfactory, working RbMT engine then there is probably no reason to change. I would suggest that SMT makes more sense for most long-term initiatives where you want to see the system continually improve. Remember, in the end, the real objective is to get high volumes of content translated faster and as accurately as possible, and both approaches can work with the right expertise, even though I do prefer SMT and believe that it will dominate in future.

Tuesday, February 23, 2010

Translation As a Force of Change

One of the things I have always found interesting about the world of translation is that, apart from facilitating global commerce, it also has the potential to break down walls between cultures and improve people's lives as information starts to flow more freely across languages. Poverty and lack of information are often very closely correlated. This is really powerful, but for the most part the professionals focus on documentation and content that is necessary but not considered especially high value. So who does the world-changing stuff?

I tend to think that translation, collaboration and automation are closely related and that great things are possible as these key elements line up.

I wanted to point out some examples of this power already at work. I noticed several articles yesterday on Meedan, an online meeting place for English and Arabic speakers to share viewpoints. A non-profit service that hopes to foster greater understanding and tolerance, Meedan translates content from the Arabic media into English and vice versa. The site uses machine translation and a community to help clean up MT output, and already makes 3 million words of translation memory available to enable continuing leverage and encourage new English <> Arabic translation efforts. I met George Weyman last year and I am very happy to see this initiative grow in strength. Apart from being a peacemaker and bridge-builder, George is also a fine tin flute player, as this video (starts at 2:28) of an impromptu music jam in an Irish bar shows. I joined them by drumming on the table. In time, I would not be surprised to find that Meedan becomes a model for building dialog elsewhere in the world.

This meeting happened at the AGIS conference, which was focused on building a community and collaboration platform to launch initiatives against information poverty and bring translation assistance to humanitarian causes. Like the Open Translation Tools conference I wrote about earlier, these are fledgling movements that are growing in strength. I would not be surprised to see initiatives like The Rosetta Foundation become a source of more compelling innovation in translation than companies like SDL and other professional industry “leaders”. Collaboration, automation, MT, community management and open source were the focus at AGIS. This is in contrast to the same localization themes we see repeated endlessly at the larger industry conferences. I would bet that revolution is more likely to come from the hungry, motivated, “world-changing” mindsets I saw at AGIS than from the professionals reeling under cost cutting from buyers that we usually see at the major localization conferences. My sense is that people who feel awe can make shit happen.

Recently we also saw the power of collaboration and focused community efforts in Haiti. The following are just a few examples:
Language Lifelines: describes a variety of language industry initiatives to support the relief effort.
GALA set up a site to coordinate language-related efforts, and Jeff Allen resurrected data that he had worked on at CMU to help Microsoft, Google and others develop MT solutions that might prove useful to the reconstruction effort.

I was also drawn into a vision that Dion Wiggins, CEO of Asia Online, had to translate mostly educational open source content into several South East Asian languages to address the information poverty in the region. Again, the foundation of the effort here is an automated translation platform together with community collaboration and high value content. While this project still has a long way to go, the initial efforts are proof that the concept can work. There is a growing belief that access to information and knowledge not only improves the lives of those who have access, but also creates commercial opportunity as more people come online.

We are also seeing that community members (the crowd) can step up to engage in translation projects, sometimes on a very large scale. While Facebook gets a lot of press, I think it is the least interesting of these initiatives, as it only focuses on L10N content which probably was best done by professionals anyway. They did prove, however, that using crowds is a good way to rapidly expand language coverage and your global customer base. If they actually extend this to the user content, I think Facebook could become a major force in translation. And again in this case, a management and collaboration infrastructure platform was necessary to enable and manage crowd contribution. I cannot see them extending the translation effort to the real user content without engaging machine translation in the process and flow. Many IT companies, including Adobe, EMC and Intel, have also started to explore crowdsourcing and will expand language coverage this way. The professional translation industry should take note that this makes sense for companies to do because “long-tail” languages are not easily done cost-effectively through standard channels.
While many in the professional industry comment disparagingly about quality in crowdsourced translation, there is evidence that it can work quite well. The three best examples I know of are the TED Open Translation Project, which has now translated almost 5,000 talks into 70 languages using a pool of over 2,000 volunteer translators; the Yeeyan project in China; and Global Voices.

The Yeeyan project takes interesting content in English and translates it into Chinese just to share interesting, compelling material. The community involves 8,000 volunteer translators, who have created 40,000 translations and collaborate with the Guardian, Time, the NY Times and others. This effort got them into some trouble with Chinese censorship regulations, but it has already evolved into a platform that employs “translators” and is self-funding.

Global Voices is translated into more than 15 languages by volunteer translators, who have formed the LinguaAdvocacy website and network to help people speak out online in places where their voices are censored. This is a truly virtual organization that allows us to hear real voices from around the world. Check out the recently translated articles. There are many more initiatives that I give a shout-out to in my Twitter stream.

I believe the professional industry is at a point where it needs to understand collaboration, crowdsourcing, automation, MT, and open source. This is both an opportunity and a threat, as those who resist these new forces will likely be marginalized. Microsoft changed the world when they introduced PCs and a much more open IT model, while IBM defended mainframes and became much less relevant. At the time, the management at IBM were not able to take a nerdy college dropout named Bill seriously. Maybe because he delivered his software on a single floppy, or maybe because he did not wear a tie. Microsoft in turn was caught completely off guard when Google introduced their much more open, free and cloud-based model, and became less relevant. This cycle will likely continue as innovation drives change, and I predict that Google too will become less dominant in the not so distant future because they have lost the original spirit.

The Economist is also regularly translated into Chinese by a group that calls themselves the Eco Team. The founder had this to say:
"Like the forum name says, producing a Chinese version of The Economist is our goal. But we're still young and immature; very amateur, not professional. So what? Because we are young, we have the fervor, the enthusiasm, the passion. Because we are amateurs, we'll double our efforts to do our best. As long as we wish, we can be successful and do a good job!"

Ethan Zuckerman summarizes the implications of this very nicely. Change is coming to the world of translation, with or without the support and guidance of the professionals.

We are in an age where information is a primary driver of wealth creation. While the initial wave has been focused around English and European languages, this will increasingly shift to languages like Chinese. Social Networks in China are already proving that they can be innovators and leaders in the new digital economy. The value of information and thus of translation will continue to increase, and the understanding that knowledge can bring prosperity will hopefully gain momentum all around the world. I hope that some of us will help make this happen.

Wednesday, February 17, 2010

The Global Customer Support Translation Opportunity

Recently I have written about why MT is important for LSPs. MT is a key enabling technology for making large volumes of dynamic, high-value business content multilingual. I have also pointed out the significant business value of making customer support content in particular more multilingual. I would like to go into more detail on the specific challenges one is likely to face in translating knowledge base and community content and how they could be addressed.

In most cases support content is likely to be 20X to 1000X the volume of even a large documentation project, so using MT technology will be a core requirement. It is also important that stakeholders understand that human quality is not achievable across the whole body of content and that it is important to define a “good enough” quality level early in the process.

Understanding the Corpus

The first step in developing a translation strategy for “massive” content is to profile the source corpus: understand its volatility, language style, terminology, high-frequency linguistic patterns and content creation process, and assess the existing linguistic resources available to build an MT engine. It is usually wise to do the following:
-- Gather existing translation memory (TM) and glossaries for training corpus
-- Identify sections that must be human translated (e.g. security, payment processing terms and conditions, legal content)
-- Analyze the source corpus, identify high-frequency phrase patterns, and ensure that they are translated and validated by human translators
-- Identify the most frequently used knowledge base and community content and ensure that these are translated by humans and used as training corpus.
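As a minimal illustration of the profiling steps above, the sketch below counts high-frequency word trigrams in a toy support corpus; these are the phrase patterns you would route to human translators first. The corpus and the choice of trigrams are purely illustrative, and real profiling would add proper tokenization, normalization and terminology extraction.

```python
# Minimal corpus-profiling sketch: find the highest-frequency word n-grams
# so they can be human-translated and validated first. This only shows the
# core counting step of a much richer profiling process.
from collections import Counter

def top_ngrams(sentences, n=3, k=5):
    """Return the k most frequent word n-grams across the corpus."""
    counts = Counter()
    for s in sentences:
        words = s.lower().split()
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)

# A toy stand-in for a multi-million-word knowledge base corpus
corpus = [
    "restart the device and try again",
    "restart the device before installing updates",
    "unplug the device and try again",
]
for ngram, freq in top_ngrams(corpus):
    print(" ".join(ngram), freq)
```

Phrases that recur this often across a real corpus are exactly the ones where a single validated human translation pays for itself many times over.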
Once this is done, an MT engine can be built and evaluated. While it is important to do linguistic evaluation, it is perhaps even more important to show samples of MT output to real customers and determine whether the output is useful.
KB Development Process
It is generally recommended that new knowledge base content be run through the initial engine and that the MT translations be analyzed and corrected by human post-editors and linguists until a target quality level is achieved. This process may involve several iterations to continually improve the quality. The whole knowledge base can be periodically retranslated as big improvements in MT engine quality are accomplished. It is important to understand that this is an ongoing and continuously evolving process and that overall quality will be strongly related to the amount of human corrective feedback that is provided.
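The translate / post-edit / retrain cycle described above can be modeled very crudely in code. Everything below is invented for illustration: the "engine" is just a lookup table that absorbs human corrections, standing in for a real SMT retraining run, and the human post-editing pass is simulated with a fixed table of corrections.

```python
# Toy model of the post-edit / retrain loop: the engine produces drafts,
# humans correct them, and the corrections are folded back in as training
# data. A real system retrains a statistical model; this lookup table just
# makes the control flow concrete.

class ToyEngine:
    def __init__(self):
        self.memory = {}                      # source -> best known translation
    def translate(self, src):
        return self.memory.get(src, "<raw MT guess>")
    def retrain(self, corrected_pairs):
        self.memory.update(corrected_pairs)   # fold corrective feedback back in

kb = ["how do i reset my password", "the device will not power on"]
# Stand-in for the human post-editing pass
human_fixes = {kb[0]: "comment réinitialiser mon mot de passe",
               kb[1]: "l'appareil ne s'allume pas"}

engine = ToyEngine()
for _ in range(2):                            # two post-edit / retrain rounds
    drafts = {s: engine.translate(s) for s in kb}
    corrected = {s: human_fixes[s] for s in drafts}
    engine.retrain(corrected)

print(engine.translate(kb[0]))  # now returns the post-edited translation
```

The point of the sketch is the feedback loop itself: quality improves only because corrections flow back into the engine, which is exactly why ongoing human involvement matters.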
Self-service KB
It is worth restating that there are significant benefits to doing this as the customer support environment evolves with the general momentum behind collaboration and community networks. The ROI in terms of call deflection savings and improved customer satisfaction is well documented and significant. But perhaps the greatest benefit is the expanded visibility for the global customer who cannot really use the English content in its original form.

Microsoft has clearly demonstrated the value of making their huge knowledge base multilingual. At a recent TAUS conference they reported that hundreds of millions of support queries are handled by raw MT and interestingly, surveys indicate that the successful resolution and customer satisfaction in many of these languages is actually higher than it is for English! Others are starting to follow suit and Intel and Cisco have also done similar things on a smaller scale. The CSI presentation by Greg Oxton at a recent TAUS meeting states it very simply:

Content is King -- Language is Critical

I saw recently that analysts in the content management community have identified the growing demand for multilingual content as one of the strongest trends of 2009 and see it growing further in 2010. The Gilbane Group has a big emphasis on content globalization in their upcoming conference this summer. I was involved with a webinar yesterday with Moravia that focused on the customer support content globalization issue. A replay of the webinar is available here.

The time is now to focus on and learn how to undertake content globalization projects that start at ten million words and can run into hundreds of millions of words. This is the future of professional translation, and I think that effective man-machine collaboration will be a key to success.

Monday, February 15, 2010

Learning to Share in Professional Forums: Collaboration

I read an article that I thought was striking and worth sharing. When I look back at how I started this blog, I recall the first few entries were about censorship and a discussion about how we as an industry produce too many similar conferences which dilutes the consolidated effort and marginalizes localization professionals in the general corporate landscape.

I think that perhaps one of the reasons the professional translation industry is so fragmented is that there are very low barriers to entry and competition is reduced to price in most cases. This creates an environment where the level of distrust is high among the various players in the supply chain and the level of collaboration is minimal and guarded, if it exists at all. This is quite visible in translation technology (too many trying to do exactly the same thing), in the relationship between freelancers and LSPs, and even in the general status of localization managers in global enterprises.

While I don’t really have any definitive answers, I do think it is worth asking some fundamental questions to see if there is a way to get the disparate elements working together. Why is the industry unable to build greater mass, visibility and momentum?

It is interesting to see that many of the most exciting things happening in the world of translation are happening outside the realm of control of the professional industry. Facebook, TED, dotSUB, Ushahidi, Global Voices, Meedan are all initiatives that have learned to harness motivated and willing crowds. These are exciting initiatives that are changing the world. Google Translate, a vibrant open source SMT movement (Moses) and upstarts like Asia Online and others are making the most waves in the translation automation sector. In contrast, most of the news about our industry trade associations (GALA, ELIA, TAUS, ProZ, ATA, LW, LISA etc..) has to do with continuing communication problems, fragmentation and difficulties in developing meaningful collaboration models.

We have seen massive change in the music, newspaper and customer support industries, driven by collaboration, open technology platforms and open knowledge sharing by motivated communities in social networks. It would not be surprising to see these same Web 2.0 dynamics and forces bring big changes to the world of professional translation. A study by Deloitte describes this "Big Shift", in which IT infrastructure development, free knowledge flows and public policy support together are fundamentally reshaping the economic playing field.
One of the keys to connecting to the energy that these new collaborative movements foster is learning to share openly, which brings me back to the article that triggered this entry. I think it starts with how we as individuals share what we know about our business and expertise. The industry needs to develop stronger collaboration models. From my vantage point, I see that there is some sharing going on between translators but very little between all the key players and levels in the professional translation supply chain. The first step in building strong peer-to-peer networks and a collaboration culture is learning how to share. The chart below shows how the Ogilvy PR group has mapped some of the drivers of influence and persuasion to a social media context, where sharing is a primary action and modus operandi.

The study referenced in the article talks about awe as a key ingredient driving sharing behavior. Apparently human beings like to share awe and humans who share awe can bring about change. They say:
“Awe-inducing experiences encourage people to look beyond themselves and deepen connections to the broader social world (Shiota, Keltner, and Mossman 2007). All of these factors suggest that awe should lead people to want to share."

I also saw another article that again made me think about the unrealized potential we as an industry have, if we learned to walk together. We need to evolve from standard command-and-control views to developing strong collaborative cultures.

I found a few more tidbits from the CSI site that attach to this thread and suggest a new model that we can adopt:
“If information is to function as a source of organizational vitality, we must abandon our dark cloaks of control and trust in its need for free movement, even in our own organizations. Information is necessary for new order, an order we do not impose, but order nonetheless. All of life uses information in this way.” – Margaret Wheatley, Leadership and the New Science

“The open society, the unrestricted access to knowledge, the unplanned and uninhibited association of men for its furtherance—these are what may make a vast, complex, ever-growing, ever-changing, evermore specialized and expert technological world, nevertheless a world of human community.” – J. Robert Oppenheimer, 1954

There are some signs that a collaboration culture is beginning to take root: the L10NCafe is one attempt I know of, and I also believe that the Open Translation Tools summit will lead to something of substance despite its humble beginnings. I am sure there must be others that I do not know of or have not mentioned.

I sense that the translation industry is poised for dramatic change, not all of it comfortable and welcome. However, I think that translation will increasingly be a  force driving change in the world, not just for developing new commercial markets, but also to raise the quality of life for millions. And for me that is a truly awesome and wonderful idea. I hope that you too can find things that are worth sharing.

Thursday, February 11, 2010

Translation Humor & Mocking Machine Translation

I often run into blogs by translators and LSPs, or just regular people, who suggest that machine translation is not quite ready. In fact some people actually, believe it or not, mock MT. So while I do believe that MT is going to be very much a part of the translation landscape in the very near future, I thought it would be fun to pick some of my favorite examples of MT gone awry.

While MT mishaps can be funny, I still think that humans, especially silly humans, can do better. My first example is by Ben, who translated this Bollywood song by just putting down what he thought he heard. I speak the language (Hindi) she is singing in, and I laughed till I cried. In fact I can’t stop smiling as I type this. For those who want to know, “meheboob mereh” actually means “my beloved”.

These are Ben’s own words on what he was trying to do:
My translation of an Indian music video. This is what I think the words sound like.

Translation Party is a popular site built on a familiar technique for making MT look bad: you keep translating the same phrase back and forth, perhaps even across several languages, until the output is thoroughly mangled. Interestingly, there are some “MT consultants” who also use this technique to test MT technology. It is a pointless exercise if you are serious, but it can be great fun if you are just playing. In my test, <practice makes perfect and there is no substitute for hard work> was translated as <after working really hard Substitute>. Interestingly, it was 100% accurate on <please do not poop on my knee> and gave me the same phrase back. I think that shows that when it really matters it can get things right.
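For the curious, the round-tripping effect is easy to reproduce with even a toy "translator". The word tables below are invented purely for illustration (a real experiment would call an actual MT service); the point is that every hop loses a little information, so repeated round trips drift away from the original.

```python
# Toy round-trip demo: a crude word-for-word "translator" out to a foreign
# language and back. The reverse table does not perfectly invert the forward
# table, and unknown words are simply dropped -- so each round trip degrades
# the phrase, which is exactly the effect Translation Party exploits.

EN_TO_XX = {"practice": "uebung", "makes": "macht", "perfect": "perfekt"}
XX_TO_EN = {"uebung": "exercise",   # lossy: comes back as a near-synonym
            "macht": "makes", "perfekt": "perfect"}

def round_trip(sentence):
    """Translate out and back, dropping words with no dictionary entry."""
    out = [EN_TO_XX[w] for w in sentence.split() if w in EN_TO_XX]
    return " ".join(XX_TO_EN[w] for w in out)

phrase = "practice makes perfect"
for _ in range(3):
    phrase = round_trip(phrase)
print(phrase)  # "makes perfect" -- "practice" drifted to "exercise", then vanished
```

Real MT systems are far better behaved than this caricature, but the mechanism of cumulative drift is the same, which is why round-tripping says little about one-way translation quality.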

Another personal favorite of mine is from Jill Sommer who had this little gem on her blog. Here is a tiny movie with a dialog developed completely from MT round tripping. As she describes it:
This fine little film by Matt Sloan capitalizes on Babelfish for its dialog. It translates to and from English, French and German. It was filmed on location in Trouville, France. Enjoy!

Mark Liberman in his Language Log blog shows this little furry iPod docking station gadget with the following description, which is suspected to be machine translation:
iMini is built in the rhythm decoding chip MJ1191 of the programming embedded system, and to integrate the HIPS skeleton; No matter you play any kind of music, MJ1191 always make your pet in dancing for you at once.

Another site that is always good for a laugh collects examples of mostly Chinese and Japanese attempts at translation into English. This restaurant sign is one I often use in my presentations to show what MT looks like without human translator involvement. If you have never looked at the site, it is quite funny. Here is one that is fun. I am told that there is a site in Japan with funny Japanese phrases from foreigners, and I am sure the Chinese are laughing at us too. Just take a look at some of the strange Chinese character tattoos.

Here is a blog that specializes in finding strange translation examples from across the world, and here are some examples we collected ourselves and put on our website (in the left column).

Anyway, while I do laugh at these examples, I do believe the technology is improving all the time, and as they say, he who laughs last laughs the loudest.

Let me know if you find other fun stuff, and if I like it I will add it to this entry or create another entry with the best examples that people find. Let's focus on really funny, not just wrong, since that would be like laughing at Sarah Palin.

P.S. The Huffington Post found some funny subtitles: Lost In Translation: When Subtitles Go Wrong

I also found another site of mostly human translation gaffes but I thought I would continue to add the best links I find over time to this entry.

Thanks to The Full Blog

And a few more from the Globalization Group and here is an explanation on why the translation industry is "hella lame".

And of course Monty Python with their Dirty Hungarian Phrasebook.

And for those of you who don't speak hip-hop, here is an excellent translation of the song My Hump.

This is a late insertion and shows you how human beings are always SOOOOO much funnier than anything that MT could dream up. 60 Unintentionally Offensive Business and Product Names - Anybody want to buy some Asshoe shoes, or try some of that tasty Fart juice that goes really well with JussiPussi rolls and Shitto sauce?

Wednesday, February 10, 2010

Making Customer Support Content Multilingual

One of the largest new opportunities for the professional translation industry is in the Customer Support departments of high technology or industrial engineering global companies. I have briefly described the reasons why, but I thought that it would be worth elaborating on this further. 

Many global companies, especially those that are members of the Consortium for Service Innovation, realize that a major new trend they face is the growing power of the community and self-service in the Web 2.0 age. Already, 98% of customer support interactions at an average global high-tech company happen in self-service and community forums. This is a major shift. However, the focus and much of the resource allocation in companies is still on the call center and the relatively static documentation and content that the professional translation industry works with. This makes less sense every day, as evidence suggests that the customer experience is often formed by how support problems are handled. As the CSI points out, this happens mostly through self-service content and the “community” outside of corporate control.

GCSE Non-Anglo

If English-speaking customers choose to solve their problems this way, it follows that most global customers will want to do the same. However, the content available to the global customer is often a fraction of what an Anglophone can access, and so non-English speakers are often left frustrated. There is now clear evidence of the following:

-- The customer support experience is increasingly formed outside the call center and customers strongly prefer self-service and community support.
-- The support experience is often critical in forming customer perceptions and developing brand loyalty.
-- Global customers do not have as much local language information access and thus probably have a less satisfying support experience.
-- Making much more knowledge and product support content available is a key to generating a better support experience.
-- Good self-service knowledge base content and greater visibility to high quality community content can greatly enhance the customer experience.
-- There is a clear relationship between customer loyalty, increased revenue, and probably repeat purchases.

This situation presents a significant opportunity for the professional translation industry. However, given the huge volumes of content that need to be made multilingual, it is necessary that automation be a key component of the multilingual content development strategy. Microsoft was a pioneer in doing this, showing that hundreds of millions of customers were willing to use machine-translated knowledge base content. Until recently they had all of their knowledge base content available in at least 9 languages, and they are expected to expand this to more languages in the future.

The benefits to a global enterprise of making large amounts of support content multilingual are significant, both financially and in terms of positive customer perception, as the following graphic shows. Not only is self-service content a HUGE cost saver, it can also create real positive brand perceptions.

Call Deflection Benefit
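The call-deflection saving is simple arithmetic: every support contact resolved by self-service content avoids the far higher cost of an agent-handled contact. The figures below are hypothetical placeholders for illustration, not numbers from the study or the graphic.

```python
# Back-of-the-envelope call-deflection arithmetic.
# All figures are assumed for illustration only.

assisted_cost_per_contact = 12.00   # agent-handled call or email ($, assumed)
self_service_cost = 0.25            # per self-service session ($, assumed)
monthly_contacts = 100_000          # total monthly support demand (assumed)
deflection_rate = 0.30              # share resolved via self-service content

deflected = monthly_contacts * deflection_rate
savings = deflected * (assisted_cost_per_contact - self_service_cost)

print(f"Contacts deflected: {deflected:,.0f}")
print(f"Monthly savings:    ${savings:,.2f}")
```

Even with modest assumptions like these, the per-contact cost gap is so large that a multilingual knowledge base pays for itself quickly; the same arithmetic applies per language once the content is available in it.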

The highest quality content production process will always need a high degree of human steering and expert linguistic guidance. Machine translation without humans may not provide the translation quality necessary to provide a positive support experience. The professional translation industry has a major opportunity ahead as major global corporations begin to act on this trend.

The role of customer support is shifting from answering questions and solving customer problems to facilitating a network of people and content. The Consortium’s research shows that the majority of the customer support experience is with content, not with people. The benefit of offering that content in the language of the customer is huge.

I will continue on this theme for at least another blog entry. I strongly recommend that you take a look at the CSI website as it is filled with great information and research on what is going on in the world of Customer Support.

Quote from the CSI web site:
Rather than continuing to invest in doing what we do faster, better, and cheaper, maybe we need to look at doing something altogether different... maybe there is a lesson for the localization industry in this.

Monday, February 1, 2010

The Impact of “Clean Data” on SMT

This is a summary of a study on translation memory consolidation that I was involved with, and a continued examination of the issue of “clean data,” which I believe is essential to long-term success with data-driven MT initiatives. The Asia Online rating system rates data that is deemed best suited for SMT training; it is not a judgment on TM quality for TM purposes. The study, conducted by Asia Online with the kind facilitation of TAUS and data provided by three members, took a relatively small set of TM data and attempted to answer three questions:
  1. Is there a benefit to sharing TM data for the purpose of building SMT engines?
  2. What are some practical guidelines to help enhance serious data-sharing attempts?
  3. What do best practices look like?
As many people continue to believe that sheer data volume alone is enough to solve many problems with SMT, I thought it would be useful to provide an overview of the Asia Online data consolidation study in this blog. Apart from the simple common sense of the "Garbage In, Garbage Out" principle, which is important in any data processing application and perhaps even more so in SMT, does it not make sense that if SMT engines learn from a parallel corpus, it would be wise and efficient to clean this corpus first?

The Google paper that has also been referenced by TAUS/TDA as foundational justification for “more data is always better” has been criticized by many. Jaap van der Meer suggests that Norvig said “forget trying to come up with elegant theories and embrace the unreasonable effectiveness of data.” A more careful examination of the paper reveals that many of its examples relate to graphical images, where erroneous pixels are much more tolerable. In fact, Norvig himself has stated in his own blog that his comments were misinterpreted:

To set the record straight: That's a silly statement, I didn't say it, and I disagree with it. … Peter Norvig

So I maintain that both data quality and algorithms matter, and unless we are talking about huge order-of-magnitude differences, clean data will produce better SMT engines that respond more easily to corrective feedback. I have seen this proven over and over again. We are all aware that TM tends to get messy over time and that it is wise to scrub and clean it periodically for best results in any translation automation endeavor.

Basically, the study found that some TM is better suited for SMT than others and that it is important to understand this BEFORE you consolidate data from multiple sources. The graphic below shows the details of the data in question. We also found that Datasets A and C were more consistent in their use of terminology.

The key findings from the study are as follows:
-- Data quality matters: not all translation memory data is equally good for SMT engine development.
-- Data quality assessment should be an important first step in any data consolidation exercise for SMT engine development purposes.
-- MORE DATA IS NOT ALWAYS BETTER: smaller amounts of high-quality data can produce better results than large amounts of dirty data.
-- Terminological consistency is an important driver of better quality and success with SMT. Efforts to standardize terminology across multiple TM datasets will likely yield significant improvements in SMT engine quality.
-- Introducing “known dirty data” into the system decreases the quality of the system and increases the unpredictability of the results.
-- Systems built with clean data and consistent terminology tend to perform better and improve faster.
-- Data cleaning, normalization, and terminology analysis and standardization are a critical first step to success with any project that combines TM for developing SMT engines.
In the noise about soft censorship, we have gotten distracted from two additional questions that are also worth our attention.

  1. What is the best way to store data so that it is useful for both TM and SMT leverage purposes?
  2. What are the best practices for consolidating TM, and what tools are necessary to maximize the benefits?

Common Sources of Data Problems in TM
-- Encoding problems and inconsistencies in the data.
-- Large volumes of formatting tags and other metadata that have no linguistic value but can impact system learning.
-- Punctuation and diacritics that are inconsistent across the TM.
-- English words appearing in the French translation. This may be valid in a translation memory, but it results in English being embedded in the French training data and can lead the SMT engine to think that English is French!
-- Excessive formatting tags and HTML (often representing images) embedded in segments.
-- French translations that frequently contain bracketed terms not present in the English source.
-- Capitalization that frequently does not match between source and target.
-- Inconsistent terminology, with different terms used throughout the data for the same concept or meaning.
-- Large numbers of variables, in many different forms and formats, embedded inconsistently in the text.
-- Multiple sentences in one segment. While this is valid, it makes word alignment more complex; higher-quality SMT word alignment can be achieved when these are broken out into individual segments.
-- Abbreviations in the French text where there should be a complete word.
-- Words missing on either side.
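Several of the problems above can be screened for automatically before training. The sketch below shows a few such checks on EN→FR segment pairs; the function name, thresholds, and heuristics are my own illustrative assumptions, not the tooling used by Asia Online.

```python
import re

# HTML/formatting tags and {0}-style variable placeholders: no linguistic value.
TAG_RE = re.compile(r"<[^>]+>|\{\d+\}")

def clean_pair(src, tgt, max_len_ratio=3.0):
    """Apply a few of the checks listed above to one EN->FR segment pair.
    Returns a cleaned (src, tgt) tuple, or None if the pair should be dropped.
    Thresholds are illustrative, not taken from the study."""
    # Strip formatting tags and variables before training.
    src = TAG_RE.sub(" ", src).strip()
    tgt = TAG_RE.sub(" ", tgt).strip()
    # Drop pairs where either side is missing.
    if not src or not tgt:
        return None
    # Drop pairs with wildly mismatched lengths (likely misalignment).
    if max(len(src), len(tgt)) / min(len(src), len(tgt)) > max_len_ratio:
        return None
    # Drop pairs where the "translation" equals the source: often
    # untranslated English embedded on the French side.
    if src.lower() == tgt.lower():
        return None
    return src, tgt

pairs = [
    ("<b>Click OK.</b>", "<b>Cliquez sur OK.</b>"),  # tags stripped, kept
    ("Press {0} to continue.", ""),                   # empty target, dropped
    ("Error 404", "Error 404"),                       # untranslated, dropped
]
cleaned = [p for p in (clean_pair(s, t) for s, t in pairs) if p]
print(cleaned)   # [('Click OK.', 'Cliquez sur OK.')]
```

Checks like these only catch mechanical dirt; as noted below, no automated filter can rescue pairs where the translation itself is of poor quality.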

These are early days and we are all still learning, but the tools are getting better, and the dialogue is getting more useful and pragmatic as we move away from naive views that any random pile of data is better than one that has been carefully considered and prepared.

Without understanding the relative cleanliness and quality of the data, data sharing is not necessarily beneficial.

While TM data may often be problematic for SMT in its raw state, some of what is considered “dirt” to SMT can be cleaned through automated tools used by Asia Online and others. However, these tools cannot correct situations when the translations themselves are of a lower quality. This issue has also been highlighted by Don DePalma in a recent article referring to this study where he said: “Our recent MT research contended that many organizations will find that their TMs are not up to snuff — these manually created memories often carve into stone the aggregated work of lots of people of random capabilities, passed back and forth among LSPs over the years with little oversight or management.”

Let's hope that the TDA, too, will let this taboo subject (data quality) out into the open. I am sure the community will come together and help develop strategies to cope with and overcome the initial problems that we identify when we try to share TM data resources.