\documentclass[11pt]{report}

\usepackage{epsf}
\usepackage{graphicx}
\usepackage{index}
\usepackage{varioref}
\usepackage{amsmath}
\usepackage{multirow}
\usepackage{theorem} % for examples
\usepackage{alltt}
\usepackage{ulem}
\usepackage{epic,eepic}
\usepackage{boxedminipage}
\usepackage{fancybox}
\usepackage[square]{natbib}
\usepackage{ps4pdf}
\usepackage{picins} % pictures next to paragraphs, Ondrej's part
\usepackage{avm}
\usepackage{epsfig}
\PSforPDF{
  \usepackage{dtree} % a dependency tree
  \usepackage{pstricks}
  \usepackage{pst-node}
  \usepackage{pst-plot}
}
\usepackage{subfig}

\oddsidemargin 0mm
\evensidemargin 5mm
\topmargin -20mm
\textheight 240mm
\textwidth 160mm

\newcommand{\bold}{\it}
\renewcommand{\emph}{\it}

\makeindex
\theoremstyle{plain}

\begin{document}

\title{\vspace{-15mm}\LARGE {\bf Final Report}\\[2mm]
of the\\[2mm]
2006 Language Engineering Workshop\\[15mm]
{\huge \bf Open Source Toolkit\\[2mm]
\bf for Statistical Machine Translation:\\[5mm]
Factored Translation Models\\[2mm]
and Confusion Network Decoding}\\[10mm]
{\tt \Large http://www.clsp.jhu.edu/ws2006/groups/ossmt/}\\[2mm]
{\tt \Large http://www.statmt.org/moses/}\\[15mm]
Johns Hopkins University\\[2mm]
Center for Speech and Language Processing}

\author{\large Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi,\\
\large Ond\v{r}ej Bojar, Chris Callison-Burch, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard Zens,\\
\large Alexandra Constantin, Christine Corbett Moran, Evan Herbst}

\normalsize
\maketitle

\section*{Abstract}

The 2006 Language Engineering Workshop {\bold Open Source Toolkit for Statistical Machine Translation} had the objective of advancing the state of the art in statistical machine translation through richer input and richer annotation of the training data. The workshop focused on three topics: factored translation models, confusion network decoding, and the development of an open source toolkit that incorporates these advancements.

This report describes the scientific goals, the novel methods, and the experimental results of the workshop. It also documents details of the implementation of the open source toolkit.

\phantom{.}
\newpage

\section*{Acknowledgments}

The participants at the workshop would like to thank everybody at Johns Hopkins University who made the summer workshop such a memorable --- and in our view very successful --- event. The JHU Summer Workshop is a great venue for bringing together researchers from various backgrounds and focusing their minds on a problem, leading to intense collaboration that would not have been possible otherwise.

We especially would like to thank Fred Jelinek for heading the Summer School effort and Laura Graham and Sue Porterfield for keeping us sane during the hot summer weeks in Baltimore.

Besides the funding acquired from JHU for this workshop from DARPA and NSF, participation in the workshop was also financially supported by funding from the GALE program of the Defense Advanced Research Projects Agency, Contract No.~HR0011-06-C-0022, and by the University of Maryland, the University of Edinburgh, and MIT Lincoln Labs\footnote{This work was sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002.
Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.}.

\phantom{.}
\newpage

\section*{Team Members}

\begin{itemize}
\item Philipp Koehn, Team Leader, University of Edinburgh
\item Marcello Federico, Senior Researcher, ITC-IRST
\item Wade Shen, Senior Researcher, Lincoln Labs
\item Nicola Bertoldi, Senior Researcher, ITC-IRST
\item Ond\v{r}ej Bojar, Graduate Student, Charles University
\item Chris Callison-Burch, \sout{Graduate Student, University of Edinburgh}\\ \phantom{.} \hspace{1.35in} Assistant Research Professor, JHU
\item Brooke Cowan, Graduate Student, MIT
\item Chris Dyer, Graduate Student, University of Maryland
\item Hieu Hoang, Graduate Student, University of Edinburgh
\item Richard Zens, Graduate Student, RWTH Aachen University
\item Alexandra Constantin, Undergraduate Student, Williams College
\item Evan Herbst, Undergraduate Student, Cornell
\item Christine Corbett Moran, Undergraduate Student, MIT
\end{itemize}

\tableofcontents

\chapter{Introduction}

Statistical machine translation has emerged as the dominant paradigm in machine translation research. It is built on the insight that many translation choices have to be weighed against each other --- whether it is different ways of translating an ambiguous word, or alternative ways of reordering the words in an input sentence to reflect the target language word order. In statistical machine translation, these choices are guided by probabilities, which are estimated from collections of translated texts, called parallel corpora.

While statistical machine translation research has gained much by building on the insight that probabilities may be used to make informed choices, current models are deficient because they lack crucial information. Much of the translation process is best explained with morphological, syntactic, semantic, or other information that is not typically contained in parallel corpora. We show that when such information is incorporated into the training data, we can build richer models of translation, which we call {\bf factored translation models}.

Since we tag our data automatically, there are often many ways of marking up an input sentence. This further increases the multitude of choices that our machine translation system must deal with, and requires an efficient method for handling potentially ambiguous input. We investigate {\bf confusion network decoding} as a way of addressing this challenge.

In addition to these scientific goals, we also address another pervasive problem for our field. Given that the methods and systems we develop are increasingly complex, simply catching up with the state of the art has become a major part of the work done by research groups. To reduce this tremendous duplication of effort, we have made our work available in an {\bf open source toolkit}. To this end, we merged the efforts of a number of research labs (University of Edinburgh, ITC-irst, MIT, University of Maryland, RWTH Aachen) into a common set of tools, which includes the core of a machine translation system: the decoder. This report documents this effort, which we have continued to pursue beyond the summer workshop.

\section{Factored Translation Models}

We propose a new approach that we call factored translation models, which extends traditional phrase-based statistical machine translation models to take advantage of additional annotation, especially linguistic markup.
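For a concrete picture of what such markup looks like, each surface word can be enriched with, for example, its lemma and part-of-speech tag. A sentence like {\it the houses are small} might then be represented with pipe-separated factors (the annotation shown here is illustrative):

\begin{verbatim}
the|the|DT  houses|house|NNS  are|be|VBP  small|small|JJ
\end{verbatim}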
Phrase-based statistical machine translation is a very strong baseline to improve upon: phrase-based systems have consistently outperformed other methods in recent competitions. Any improvement over this approach therefore implies an improvement in the state of the art.

The basic idea behind factored translation models is to represent phrases not simply as sequences of fully inflected words, but instead as sequences containing multiple levels of information. A word in our model is not a single token, but a vector of factors. This enables the straightforward integration of part-of-speech tags, morphological information, and even shallow syntax. Instead of dealing with linguistic markup in preprocessing or postprocessing steps (e.g., the re-ranking approaches of the 2003 JHU workshop), we build a system that integrates this information into the decoding process to better guide the search.

Our approach to factored translation models is described in detail in Chapter~\ref{chap:factored-models}. The results of the experiments that we conducted on factored translation between English and German, Spanish and Czech are given in Chapter~\ref{chap:factored-experiments}.

\section{Confusion Network Decoding}

With the move to factored translation models, there are now several reasons why we may have to deal with ambiguous input. One is that the tools that we use to annotate our data may not make deterministic decisions. Instead of relying only on the 1-best output of our tools, we accept ambiguous input in the form of confusion networks. This preserves ambiguity and defers firm decisions until later stages, which has been shown to be advantageous in previous research.

While confusion networks are a useful way of dealing with ambiguous factors, they are more commonly used to represent the output of automatic speech recognition when combining machine translation and speech recognition in speech translation systems. Our approach to confusion network decoding, and its application to speech translation, is described in detail in Chapter~\ref{chap:confusion-networks}. Chapter~\ref{confusion-net-experiments} presents experimental results using confusion networks.

\section{Open Source Toolkit}

There are several reasons to create an open research environment by opening up resources (tools and corpora) freely to the wider community. Since our research is largely publicly funded, it seems appropriate to return the products of this work to the public. Access to free resources enables other research groups to advance work that was started here, and provides them with baseline performance results for their own novel efforts.

While these are honorable goals, our motivation for creating this toolkit is also somewhat self-interested: building statistical machine translation systems has become a very complex task, and rapid progress in the field forces us to spend much time reimplementing other researchers' advances in our own systems. By bringing several research groups together to work on the same system, this duplication of effort is reduced, and we can spend more time on what we would really like to do: come up with new ideas and test them.

The starting point of the Moses system was the Pharaoh system of the University of Edinburgh \citep{koe:04}. It was re-engineered during the workshop, and several major new components were added. Moses is a full-fledged statistical machine translation system, including the training, tuning and decoding components.
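Once the models have been trained and tuned, translating new text reduces to a single decoder call along the lines of \verb|moses -f moses.ini < input > output| (the file names here are illustrative), where the configuration file points the decoder at the trained models and their weights.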
The system provides state-of-the-art performance out of the box, as has been shown at recent ACL-WMT \citep{callisonburch-EtAl:2007:WMT}, TC-STAR, and IWSLT \citep{ShenIWSLT} evaluation campaigns.

The implementation and usage of the toolkit is described in more detail in Chapter~\ref{toolkit}.

\chapter{Factored Translation Models}
\label{chap:factored-models}

The current state-of-the-art approach to statistical machine translation, the so-called phrase-based model, represents phrases as sequences of words without any explicit use of linguistic information, be it morphological, syntactic, or semantic. Such information has been shown to be valuable when it is integrated into pre-processing or post-processing steps. For instance, improvements in translation quality have been achieved by preprocessing Arabic morphology through stemming or splitting off affixes that typically translate into individual words in English \citep{Habash2005}. Other research shows the benefits of reordering words in German sentences prior to translation so that their word order is more similar to English word order \citep{Collins2005}.

However, a tighter integration of linguistic information into the translation model is desirable for two reasons:
\begin{itemize}
\item Translation models that operate on more general representations, such as lemmas instead of surface forms of words, can draw on richer statistics and overcome the data sparseness problems caused by limited training data.
\item Many aspects of translation can be best explained on a morphological, syntactic, or semantic level. Having such information available to the translation model allows the direct modeling of these aspects. For instance: reordering at the sentence level is mostly driven by general syntactic principles, local agreement constraints show up in morphology, etc.
\end{itemize}

Therefore, we developed a framework for statistical translation models that tightly integrates additional information. Our framework is an extension of phrase-based machine translation \citep{OchThesis}.

\section{Current Phrase-Based Models}

\begin{figure}
\begin{center}
\includegraphics[width=\linewidth]{images/word-aligned-parallel-corpus}
\end{center}
\caption{Word-level alignments are generated for sentence pairs using the IBM Models.}
\label{word-aligned-parallel-corpus}
\end{figure}

Current phrase-based models of statistical machine translation \citep{OchThesis, koe:03} are based on earlier word-based models \citep{Brown1988,Brown1993} that define the translation model probability $P(\mathbf{f} | \mathbf{e})$ in terms of word-level alignments $\mathbf{a}$:
\begin{equation}
P(\mathbf{f} | \mathbf{e}) = \sum_{\mathbf{a}}{P(\mathbf{a},\mathbf{f} | \mathbf{e})}
\label{conditional-probability-as-sum-over-alignments}
\end{equation}
\citet{Brown1993} introduced a series of models, referred to as the IBM Models, which defined the alignment probability $P(\mathbf{a},\mathbf{f} | \mathbf{e})$ so that its parameters could be estimated from a parallel corpus using expectation maximization. Phrase-based statistical machine translation uses the IBM Models to create high-probability word alignments, such as those shown in Figure~\ref{word-aligned-parallel-corpus}, for each sentence pair in a parallel corpus. All phrase-level alignments that are consistent with the word-level alignments are then enumerated using phrase-extraction techniques \citep{Marcu2002,koe:03,Tillmann2003,Venugopal2003}. This is illustrated in Figure~\ref{phrase-extraction}.
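The consistency criterion underlying this extraction step can be stated compactly: a phrase pair is kept only if no word inside it is aligned to a word outside it, and at least one alignment point falls inside it. A minimal sketch in Python (illustrative only, not the toolkit's implementation):

\begin{verbatim}
# Check whether a candidate phrase pair is consistent with a word
# alignment, given as a set of (f_pos, e_pos) alignment points.
def consistent(alignment, f_start, f_end, e_start, e_end):
    has_inside_point = False
    for (f, e) in alignment:
        f_in = f_start <= f <= f_end
        e_in = e_start <= e <= e_end
        if f_in != e_in:       # point crosses the phrase boundary
            return False
        if f_in and e_in:      # require at least one point inside
            has_inside_point = True
    return has_inside_point
\end{verbatim}

Enumerating all span pairs that pass this test yields the phrase pairs used below.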
The highlighted regions show how two French translations of the English phrase {\it Spain declined} can be extracted using the word alignment. Once they have been enumerated, these phrase-level alignments are used to estimate a {\it phrase translation probability}, $p(\bar{f} | \bar{e})$, between a foreign phrase $\bar{f}$ and an English phrase $\bar{e}$. This probability is generally estimated using maximum likelihood as
%
\begin{equation}
p(\bar{f} | \bar{e}) = \frac{count(\bar{f}, \bar{e})}{count(\bar{e})}
\label{phrase-translation-probability}
\end{equation}
%
\begin{figure}
\begin{center}
\includegraphics[scale=.55]{images/phrase-extraction}
\end{center}
\caption{Phrase-to-phrase correspondences are enumerated from word-level alignments.}
\label{phrase-extraction}
\end{figure}
%
The phrase translation probability is integrated into a log-linear formulation of translation \citep{Och2002}. The log-linear formulation of translation is given by
\begin{equation}
p(\mathbf{e} | \mathbf{f}) \propto \exp \sum_{i=1}^{n}{\lambda_i h_i(\mathbf{e},\mathbf{f})}
\end{equation}
where each $h_i$ is a feature function (such as the phrase translation probability) and $\lambda_i$ is its weight.
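As a worked illustration of the relative-frequency estimate above, the following Python sketch (hypothetical; not the toolkit's code) builds a small phrase table from extracted phrase pairs:

\begin{verbatim}
# Estimate p(f_bar | e_bar) = count(f_bar, e_bar) / count(e_bar)
# from a list of extracted (foreign_phrase, english_phrase) pairs.
from collections import Counter

def estimate_phrase_table(phrase_pairs):
    pair_counts = Counter(phrase_pairs)
    e_counts = Counter(e for (_, e) in phrase_pairs)
    return {(f, e): count / e_counts[e]
            for (f, e), count in pair_counts.items()}
\end{verbatim}

In the log-linear model, the logarithm of this probability typically serves as one of the feature functions $h_i$.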