
Google taps big data for a universal translator

 真友书屋 2014-06-28

Google Translate is best known today as a quick way to render web pages or short text snippets in another language. But according to the German magazine Der Spiegel, the next step for the core technology behind the service is something that amounts to the universal translator from "Star Trek."

Google isn't alone, either. Everyone from Facebook to Microsoft harbors the same ambition: to create services that finally eradicate language barriers as we know them. Is that goal realistic, and what would it take to get there?

Machine translation has been around in one form or another for decades, but it has always lagged far behind translation done by human hands. Much of the software written to perform machine translation tried to define each language's grammar and vocabulary, a process that is both difficult and inflexible.
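
To make the contrast concrete, here is a minimal sketch of that older dictionary-and-rules style in Python. The word list and the single reordering rule are invented for illustration; they stand in for the thousands of entries and hand-written rules a real system needed, which is exactly why the approach scaled so poorly.

```python
# A toy rule-based translator: a hand-written dictionary plus one
# hard-coded reordering rule. Every new language pair needs new
# entries and new rules written by hand.
EN_FR = {"the": "le", "cat": "chat", "black": "noir"}

def translate_rule_based(sentence: str) -> str:
    words = sentence.lower().split()
    out = [EN_FR.get(w, w) for w in words]  # word-for-word lookup
    # One grammar rule: French adjectives usually follow the noun,
    # so swap the one "ADJ NOUN" pair this toy dictionary knows about.
    for i in range(len(words) - 1):
        if words[i] == "black" and words[i + 1] == "cat":
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(out)

print(translate_rule_based("the black cat"))  # -> "le chat noir"
```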

Under the guidance of engineer Franz Och, Google's approach replaced all of that with a purely statistical one. Processing masses of parallel translated texts (for instance, publicly available English and French versions of the same documents) produced far better translations between the two languages than the old algorithm-driven methods. The bigger the corpus of parallel texts, the better the results. (The plummeting cost of storage and processing power over the past few decades has helped as well.)
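
The article doesn't say which statistical models Google actually used, but the classic textbook version of learning translations from parallel text is IBM Model 1, trained with expectation-maximization. Below is a minimal, self-contained sketch on an invented three-sentence corpus: the only signal is co-occurrence, yet the model learns that "maison" pairs with "house."

```python
from collections import defaultdict

# Invented toy parallel corpus (English, French sentence pairs).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(),  "le livre".split()),
    ("a house".split(),   "une maison".split()),
]

# t[f][e] approximates t(f|e): the probability that English word e
# translates to French word f. Start uniform over the English vocabulary.
en_vocab = {e for en, _ in corpus for e in en}
uniform = 1.0 / len(en_vocab)
t = defaultdict(lambda: defaultdict(lambda: uniform))

for _ in range(30):                              # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    for en, fr in corpus:
        for f in fr:                             # E-step: split each French
            z = sum(t[f][e] for e in en)         # word's count across the
            for e in en:                         # English words, weighted
                count[f][e] += t[f][e] / z       # by current beliefs
                total[e]    += t[f][e] / z
    for f in count:                              # M-step: re-normalize the
        for e in count[f]:                       # counts into probabilities
            t[f][e] = count[f][e] / total[e]

print(round(t["maison"]["house"], 2))            # converges toward 1.0
```

The same mechanism is what scales: with a large enough corpus, no grammar or dictionary has to be written by hand, which is precisely the shift described above.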

If Google's plan is to build its own technology from scratch, Facebook's strategy appears to be to import it. Back in August 2013, Facebook acquired the language translation software company Mobile Technologies, a purchase that Facebook's director of product management described as "an investment in our long-term product roadmap." Among Mobile Technologies' products is an app called Jibbigo, which performs speech translation.

Different as the two approaches are, they share a common element: the backing of a company with masses of real-world linguistic data at its disposal. Google and Microsoft both run search engines that harvest the web in real time, and Facebook has a billion-plus users chatting away. All of it constitutes a huge trove of data that can be mined to build a translation corpus.

The big unanswered question so far: if Google, Facebook, Microsoft, and the rest plan to use real-time conversations to generate a translation corpus, can any of that data be anonymized, and is that even possible? An opt-in program that lets people consent to having their conversations used in the corpus seems like the best approach. But judging by past behavior, these companies are more likely to simply roll such harvesting into a terms-of-service agreement.
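
To illustrate what the opt-in route could look like, here is a hedged sketch of a corpus-ingestion filter. The opted_in flag, the regular expressions, and the message format are all hypothetical, not any company's actual pipeline, and real anonymization is far harder than masking two patterns.

```python
import re

# Hypothetical pipeline step: keep only messages from users who
# explicitly opted in, and mask obvious identifiers before the text
# can enter a translation corpus. The patterns are illustrative,
# not an exhaustive (or real) anonymization scheme.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def corpus_candidates(messages):
    for msg in messages:
        if not msg.get("opted_in"):          # respect the opt-in choice
            continue
        text = EMAIL.sub("<email>", msg["text"])
        text = PHONE.sub("<phone>", text)
        yield text

msgs = [
    {"text": "mail me at ana@example.com", "opted_in": True},
    {"text": "my number is 555 0100 1234", "opted_in": True},
    {"text": "secret plans here", "opted_in": False},
]
print(list(corpus_candidates(msgs)))
# -> ['mail me at <email>', 'my number is <phone>']
```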




