Colloquium

"Syntactic Parsing and Machine Translation"

Chris Callison-Burch, Johns Hopkins University

Thursday, September 9, 2010 at 4:00 P.M.

Room 368 (CIT 3rd Floor)

Modern approaches to machine translation, like those used in Google's online translation system, are data-driven. Statistical translation models are trained using parallel text, which consists of sentences in one language paired with their translations into another language. Probabilistic word-for-word and phrase-for-phrase translation tables are extracted from the human-created parallel texts, and are then used as the basic building blocks of automatic translation systems.

Most current statistical translation models (like Google's) are completely devoid of linguistic information. They essentially *memorize* the translations of words and phrases from the training data, but are unable to *generalize*. As a result, they fail to learn a considerable amount from the training data: they fail to learn the translations of unseen words; they fail to learn simple linguistic facts, such as that a language's word order is subject-object-verb or that adjective-noun order alternates between languages; and they are often unable to generate grammatical output.

Although the translation-by-memorization strategy works reasonably well for languages that have large volumes of training data, it does poorly for languages that have only small amounts of training data. Large volumes of data exist for language pairs like Arabic-English and Chinese-English, where DARPA's investment has created parallel corpora with 200 million words for each language. However, the majority of the world's languages have far less data.

In this talk, I will describe how to improve translation when only a small amount of training data is available. In particular, I will focus on translation models that use syntactic parsing to learn better generalizations from the training data. I'll focus on the Urdu-English language pair, which is an interesting case for two reasons: it has less than 1% as much training data as Arabic-English or Chinese-English, and it requires a lot of reordering to produce grammatical English output, because verbs occur at the end of Urdu sentences.

I'll also briefly describe my strategies for dealing with unknown words through transliteration (learning letter-to-sound mappings), for increasing the amount of training data through active learning, and for employing non-expert translators through Amazon's Mechanical Turk to create crowdsourced training data.

Host: Eugene Charniak