\begin{equation}
p(\mathbf{e} | \mathbf{f}) = \exp \sum_{i=1}^{n}{\lambda_i h_i (\mathbf{e}, \mathbf{f})} \label{log-linear-formulation}
\end{equation}
%
where $h_i$ can be an arbitrary {\it feature function} that assigns a score to a translation.  Commonly used feature functions include the phrase translation probability, trigram language model probabilities, word translation probabilities, the phrase length penalty, and reordering costs.

\subsection{Problems with phrase-based models}

The limitations of current approaches to statistical machine translation stem from their formulation of phrases.  Because they treat phrases as sequences of fully inflected words and do not incorporate any additional linguistic information, they are limited in the following ways:
\begin{itemize}
\item They are unable to learn translations of words that do not occur in the data, because they are unable to generalize.  Current approaches know nothing of morphology, and fail to connect different word forms.  When a form of a word does not occur in the training data, current systems are unable to translate it.  This problem is severe for highly inflected languages, and in cases where only small amounts of training data are available.
\item They are unable to distinguish between different linguistic contexts.  When current models have learned multiple possible translations for a particular word or phrase, the choice of which translation to use is guided by frequency information rather than by linguistic information.  Often, linguistic factors like case, tense, or agreement are important determinants of which translation ought to be used in a particular context.  Because current phrase-based approaches lack linguistic information, they have no appropriate means of choosing between alternative translations.
\item They have limited capacity for learning linguistic facts.  Because current models do not use any level of abstraction above words, it is impossible to model simple linguistic facts.  Under current approaches it is impossible to learn, or to explicitly specify, that adjective-noun alternation occurs between two languages, that a language's word order is subject-object-verb, or similar linguistic facts.
\end{itemize}

\section{Factored Translation Models}

\begin{figure}
\begin{center}
\includegraphics[width=\linewidth]{images/word-aligned-parallel-corpus-plus-factors}
\end{center}
\caption{Factored Translation Models integrate multiple levels of information in the training data.}
\label{word-aligned-parallel-corpus-plus-factors}
\end{figure}

We propose Factored Translation Models to advance statistical machine translation through the incorporation of multiple levels of information.  These layers of information, or {\it factors}, are integrated into both the training data and the models.  The parallel corpora used to train Factored Translation Models are tagged with factors such as parts of speech and lemmas, as shown in Figure \ref{word-aligned-parallel-corpus-plus-factors}.  Instead of modeling translation between fully inflected words in the source and target, our models can incorporate more general mappings between factors in the source and target (and between factors within the target, as we shall shortly discuss).
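As a concrete illustration of such factored representations, each token in a tagged corpus can be encoded as a pipe-separated list of its factors.  The following minimal sketch (in Python) parses tokens in this form; the surface$|$lemma$|$POS encoding and the example sentence are assumptions made for illustration, not a prescribed format.

\begin{verbatim}
# A minimal sketch: parse a factored corpus token of the assumed
# form surface|lemma|POS into its component factors.
def parse_factored_token(token, factor_names=("surface", "lemma", "pos")):
    return dict(zip(factor_names, token.split("|")))

# A hypothetical factored English sentence:
sentence = "the|the|DT houses|house|NNS are|be|VBP small|small|JJ"
factored = [parse_factored_token(t) for t in sentence.split()]
print(factored[1]["lemma"])  # -> house
\end{verbatim}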
We can represent different models graphically by showing the mappings between the different factors, drawn as connecting lines in Figure \ref{graphical-model}.

\begin{figure}
\begin{center}
\includegraphics[scale=0.75]{factors.pdf}
\end{center}
\caption{The models specify a mapping between factors in the source and target languages.  In this report we represent different model configurations by showing which factors are connected using arrows.}
\label{graphical-model}
\end{figure}

The use of factors introduces several advantages over current phrase-based approaches:
\begin{itemize}
\item Morphology can be better handled by translating in multiple steps.
\item Linguistic context can facilitate better decisions when selecting among translations.
\item Linguistic markup of the training data allows for many new modeling possibilities.
\end{itemize}

\subsection{Better handling of morphology}

One example of the shortcomings of the traditional surface word approach in statistical machine translation is the poor handling of morphology. Each word form is treated as a token in itself. This means that the translation model treats, say, the word {\bold house} as completely independent of the word {\bold houses}. Any instance of {\bold house} in the training data does not add any knowledge to the translation of {\bold houses}. In the extreme case, while the translation of {\bold house} may be known to the model, the word {\bold houses} may be unknown and the system will not be able to translate it. While this problem does not show up as strongly in English --- due to its very limited morphological inflection --- it does constitute a significant problem for morphologically rich languages such as Arabic, German, and Czech.

Thus, it may be preferable to model translation between morphologically rich languages on the level of lemmas, thus pooling the evidence for different word forms that derive from a common lemma. In such a model, we would want to translate lemma and morphological information separately,\footnote{Note that while we illustrate the use of factored translation models on such a linguistically motivated example, our framework can be equally well applied to models that incorporate automatically defined word classes.} and combine this information on the target side to generate the ultimate output surface words.
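To illustrate the pooling of evidence at the lemma level, consider the following minimal sketch, in which lexical counts are collected over lemmas rather than surface forms.  The pipe-separated tokens and the word-aligned pairs are invented examples.

\begin{verbatim}
# A minimal sketch of pooling translation evidence by lemma rather
# than by surface form.  Tokens use an assumed surface|lemma format;
# the word-aligned (German, English) pairs are invented examples.
from collections import Counter

surface_counts, lemma_counts = Counter(), Counter()
aligned = [("haus|haus", "house|house"),      # singular instance
           ("haeuser|haus", "houses|house")]  # plural instance
for de, en in aligned:
    de_surface, de_lemma = de.split("|")
    en_surface, en_lemma = en.split("|")
    surface_counts[(de_surface, en_surface)] += 1
    lemma_counts[(de_lemma, en_lemma)] += 1

print(surface_counts[("haus", "house")])  # 1: surface forms stay separate
print(lemma_counts[("haus", "house")])    # 2: lemma evidence is pooled
\end{verbatim}

Both the singular and the plural instance contribute to the same lemma-level count, which is precisely the generalization unavailable to surface-word models.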
Such a model, which makes more efficient use of the translation lexicon, can be defined as a factored translation model, as illustrated in Figure \ref{mophology-example}.

\begin{figure}
\begin{center}
\includegraphics[scale=0.75]{factored-morphgen-symmetric.pdf}
\end{center}
\caption{A particular configuration of a factored translation model which employs {\it translation steps} between lemmas and POS+morphology, and a {\it generation step} from the POS+morphology and lemma to the fully inflected word.}
\label{mophology-example}
\end{figure}

\subsubsection*{Translation and Generation Steps}
\label{sec:factored-decomposition}

The translation of the factored representation of source words into the factored representation of target words is broken up into a sequence of {\bf mapping steps} that either {\bf translate} input factors into output factors, or {\bf generate} additional target factors from existing target factors.

The preceding example of a factored model which uses morphological analysis and generation breaks up the translation process into the following steps:
\vspace{-3pt}
{\begin{itemize}\itemsep=-3pt
\item Translating morphological and syntactic factors
\item Generating surface forms given the lemma and linguistic factors
\end{itemize}}

Factored translation models build on the phrase-based approach, which divides a sentence into small text chunks (so-called phrases) and translates those chunks. This model implicitly defines a segmentation of the input and output sentences into such phrases, for example:
\begin{center}
\includegraphics[scale=0.75]{phrase-model-houses.pdf}
\end{center}

Our current implementation of factored translation models strictly follows the phrase-based approach, with the additional decomposition of phrase translation into a sequence of mapping steps. Since all mapping steps operate on the same phrase segmentation of the input and output sentence into phrase pairs, we call these {\bf synchronous factored models}.

Let us now take a closer look at one example, the translation of the one-word phrase {\bold h{\"a}user} into English. The representation of {\bold h{\"a}user} in German is: surface form {\bold h{\"a}user}, lemma {\bold haus}, part of speech {\bold NN}, count {\bold plural}, case {\bold nominative}, gender {\bold neutral}. The three mapping steps in our morphological analysis and generation model may provide the following applicable mappings:
\begin{itemize}
\item {\bf Translation:} Mapping lemmas
\begin{itemize}
\item {\bold haus $\rightarrow$ house, home, building, shell}
\end{itemize}
\item {\bf Translation:} Mapping morphology
\begin{itemize}
\item {\bold NN$|$plural-nominative-neutral $\rightarrow$ NN$|$plural, NN$|$singular}
\end{itemize}
\item {\bf Generation:} Generating surface forms
\begin{itemize}
\item {\bold house$|$NN$|$plural $\rightarrow$ houses}
\item {\bold house$|$NN$|$singular $\rightarrow$ house}
\item {\bold home$|$NN$|$plural $\rightarrow$ homes}
\item {\bold ...}
\end{itemize}
\end{itemize}

The German {\bold haus$|$NN$|$plural$|$nominative$|$neutral} is expanded as follows:
\begin{itemize}
\item {\bf Translation:} Mapping lemmas\\
{\bold \{ ?$|$house$|$?$|$?,$\;\;$ ?$|$home$|$?$|$?,$\;\;$ ?$|$building$|$?$|$?,$\;\;$ ?$|$shell$|$?$|$? \}}
\item {\bf Translation:} Mapping morphology\\
{\bold \{ ?$|$house$|$NN$|$plural,$\;\;$ ?$|$home$|$NN$|$plural,$\;\;$ ?$|$building$|$NN$|$plural,$\;\;$ ?$|$shell$|$NN$|$plural,$\;\;$ ?$|$house$|$NN$|$singular,$\;\;$ ...~\}}
\item {\bf Generation:} Generating surface forms\\
{\bold \{ houses$|$house$|$NN$|$plural,$\;\;$ homes$|$home$|$NN$|$plural,$\;\;$ buildings$|$building$|$NN$|$plural,$\;\;$ shells$|$shell$|$NN$|$plural,$\;\;$ house$|$house$|$NN$|$singular,$\;\;$ ...~\}}
\end{itemize}
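This expansion can also be sketched procedurally.  In the following minimal sketch (in Python), the three mapping tables simply hold the toy entries from the example above; in a real system they would be learned from data.

\begin{verbatim}
# A minimal sketch of expanding a factored source phrase through the
# three mapping steps above, using the toy entries from the example.
lemma_table = {"haus": ["house", "home", "building", "shell"]}
morph_table = {"NN|plural-nominative-neutral": ["NN|plural", "NN|singular"]}
gen_table   = {("house", "NN|plural"):   "houses",
               ("house", "NN|singular"): "house",
               ("home",  "NN|plural"):   "homes"}

def expand(src_lemma, src_morph):
    """Apply both translation steps, then the generation step."""
    options = []
    for lemma in lemma_table.get(src_lemma, []):      # translate lemmas
        for morph in morph_table.get(src_morph, []):  # translate morphology
            surface = gen_table.get((lemma, morph))   # generate surface form
            if surface is not None:  # drop expansions lacking a surface form
                options.append((surface, lemma, morph))
    return options

print(expand("haus", "NN|plural-nominative-neutral"))
# [('houses', 'house', 'NN|plural'), ('house', 'house', 'NN|singular'),
#  ('homes', 'home', 'NN|plural')]
\end{verbatim}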
These steps are not limited to single words, but can be applied to sequences of factors.  Moreover, each of these steps has a probabilistic definition.  Just as phrase-based models calculate phrase translation probabilities $p(\bar{e}_{words} | \bar{f}_{words})$ over fully inflected words, factored translation models use probabilities over more abstract features, such as $p(\bar{e}_{lemma} | \bar{f}_{lemma})$ and $p(\bar{e}_{morph+pos} | \bar{f}_{morph+pos})$.  The generation steps can also be defined probabilistically, as $p(\bar{e}_{words} | \bar{e}_{lemma}, \bar{e}_{morph+pos})$.

As in phrase-based models, the different components of the model are combined in a log-linear model. In addition to the traditional components --- language model, reordering model, word and phrase count, etc. --- each translation and generation probability is represented by a feature in the log-linear model.

\subsection{Adding context to facilitate better decisions}
\label{additional-context}

\begin{figure}
\begin{center}
\includegraphics[scale=.55]{images/phrase-extraction-plus-factors}
\end{center}
\caption{Different factors can be combined.  This has the effect of giving different conditioning variables.}
\label{phrase-extraction-of-words-and-post-tags}
\end{figure}

If the only occurrences of {\it Spain declined} were in the sentence pair given in Figure \ref{word-aligned-parallel-corpus}, the phrase translation probabilities for the two French phrases under current phrase-based models would be
%
\begin{eqnarray*}
p(\textnormal{{\it l' Espagne a refus\'{e} de}} | \textnormal{{\it Spain declined}}) &=& 0.5 \\
p(\textnormal{{\it l' Espagne avait refus\'{e} d'}} | \textnormal{{\it Spain declined}}) &=& 0.5
\end{eqnarray*}
%
Under these circumstances the two forms of {\it avoir} would be equiprobable and the model would have no mechanism for choosing between them.
%
In Factored Translation Models, translation probabilities can be conditioned on more information than just words.  For instance, using the combination of factors given in Figure \ref{phrase-extraction-of-words-and-post-tags}, we can calculate translation probabilities that are conditioned on both words and parts of speech:
\begin{equation}
p(\bar{f}_{words} | \bar{e}_{words}, \bar{e}_{pos})  = \frac{count(\bar{f}_{words}, \bar{e}_{words}, \bar{e}_{pos})}{count(\bar{e}_{words}, \bar{e}_{pos})}
\label{multiple-conditioning-factors}
\end{equation}
Whereas in conventional phrase-based models the two French translations of {\it Spain declined} were equiprobable, we now have a way of distinguishing between them.
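Equation \ref{multiple-conditioning-factors} is relative frequency estimation over the combined conditioning factors.  The following minimal sketch makes this concrete; the extracted tuples are invented stand-ins for the {\it Spain declined} example (with accents dropped in the code strings).

\begin{verbatim}
# A minimal sketch of Equation (multiple-conditioning-factors):
# relative-frequency estimation conditioned jointly on words and POS.
# The extracted tuples are invented stand-ins for the example above.
from collections import Counter

joint, cond = Counter(), Counter()
extracted = [("l' Espagne avait refuse d'", "Spain declined", "NNP VBN"),
             ("l' Espagne a refuse de",     "Spain declined", "NNP VBD")]
for f_words, e_words, e_pos in extracted:
    joint[(f_words, e_words, e_pos)] += 1
    cond[(e_words, e_pos)] += 1

def p(f_words, e_words, e_pos):
    return joint[(f_words, e_words, e_pos)] / cond[(e_words, e_pos)]

# Conditioning on POS separates the two forms of avoir:
print(p("l' Espagne avait refuse d'", "Spain declined", "NNP VBN"))  # 1.0
print(p("l' Espagne a refuse de",     "Spain declined", "NNP VBN"))  # 0.0
\end{verbatim}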
We can now correctly choose which form of {\it avoir} to use if we know whether the English verb {\it declined} is a past-tense form (VBD) or a past participle (VBN):
%
\begin{eqnarray*}
p(\textnormal{{\it l' Espagne a refus\'{e} de}} | \textnormal{{\it Spain declined, NNP VBN}}) &=& 0 \\
p(\textnormal{{\it l' Espagne avait refus\'{e} d'}} | \textnormal{{\it Spain declined, NNP VBN}}) &=& 1 \\
& & \\
p(\textnormal{{\it l' Espagne a refus\'{e} de}} | \textnormal{{\it Spain declined, NNP VBD}}) &=& 1 \\
p(\textnormal{{\it l' Espagne avait refus\'{e} d'}} | \textnormal{{\it Spain declined, NNP VBD}}) &=& 0
\end{eqnarray*}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{New modeling possibilities}
\label{better-modeling}

The introduction of factors also allows us to model things we were unable to model in standard phrase-based approaches to translation.  For instance, we can now incorporate a translation model probability which operates over sequences of parts of speech, $p(\bar{f}_{pos} | \bar{e}_{pos})$.  We can estimate these probabilities straightforwardly using techniques similar to the ones used for phrase extraction in current approaches to statistical machine translation.  In addition to enumerating phrase-to-phrase correspondences using word alignments, we can also enumerate POS-to-POS correspondences, as illustrated in Figure \ref{phrase-extraction-of-pos-tags}.  After enumerating all POS-to-POS correspondences for every sentence pair in the corpus, we can calculate $p(\bar{f}_{pos} | \bar{e}_{pos})$ using maximum likelihood estimation:
%
\begin{equation}
p(\bar{f}_{pos} | \bar{e}_{pos})  = \frac{count(\bar{f}_{pos}, \bar{e}_{pos})}{count(\bar{e}_{pos})}
\end{equation}
%
This allows us to capture linguistic facts within our probabilistic framework.  For instance, the adjective-noun alternation that occurs between French and English would be captured because the model would assign probabilities such that
%
\[ p(\textnormal{NN ADJ} | \textnormal{JJ NN}) > p(\textnormal{ADJ NN} | \textnormal{JJ NN}) \]
%
Thus a simple linguistic generalization that current approaches cannot learn can be straightforwardly encoded in Factored Translation Models.

\begin{figure}
\begin{center}
\includegraphics[scale=.55]{images/phrase-extraction-pos-tags-2}
\end{center}
\caption{In factored models, correspondences between part-of-speech tag sequences are enumerated in a similar fashion to phrase-to-phrase correspondences in standard models.}
\label{phrase-extraction-of-pos-tags}
\end{figure}

Moreover, part-of-speech tag sequences are not only useful for calculating translation probabilities such as $p(\bar{f}_{pos} | \bar{e}_{pos})$.  They may also be used for calculating ``language model'' probabilities such as $p(\bar{f}_{pos})$.  The probability $p(\bar{f}_{pos})$ can be calculated similarly to the $n$-gram language model probability $p(\bar{f}_{words})$ used in current statistical machine translation systems.  Sequences of parts of speech have much richer counts than sequences of words, since the number of unique part-of-speech tags is much smaller than the number of unique words.  This allows higher-order $n$-gram models to be estimated from the data.  Practical constraints generally limit us to trigram language models over words, but we can accurately estimate 6- or 7-gram language models over parts of speech.
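The richer counts available over POS sequences can be seen in a short sketch of $n$-gram counting.  This is a minimal illustration with an invented POS-tagged corpus; it collects counts only and omits smoothing and probability normalization.

\begin{verbatim}
# A minimal sketch of collecting n-gram counts over POS sequences for
# a "language model" p(f_pos).  Because the POS vocabulary is tiny,
# even high-order n-grams recur often enough to be estimated reliably.
from collections import Counter

def ngram_counts(tags, n):
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# An invented POS-tagged corpus of French sentences:
corpus = [["DT", "NN", "ADJ", "VBZ", "ADJ"],
          ["DT", "NN", "ADJ", "VBZ", "DT", "NN", "ADJ"]]

counts = Counter()
for tags in corpus:
    counts.update(ngram_counts(tags, 3))  # trigrams here; 6- or 7-grams scale

print(counts[("DT", "NN", "ADJ")])  # 3: POS n-grams recur frequently
\end{verbatim}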
\section{Statistical Modeling}

Factored translation models closely follow the statistical modeling methods used in phrase-based models. Each of the mapping steps is modeled by a feature function. This function is learned from the training data, resulting in translation tables and generation tables.

Phrase-based statistical translation models are acquired from word-aligned parallel corpora by extracting all phrase pairs that are consistent with the word alignment. Given the set of extracted phrase pairs with counts, various scoring functions are estimated, such as conditional phrase translation probabilities based on relative frequency estimation.

Factored models are also acquired from word-aligned parallel corpora. The tables for translation steps are extracted in the same way as phrase translation tables. The tables for generation steps are estimated on the target side only (the word alignment plays no role here, and additional monolingual data may be used). Multiple scoring functions may be used for generation and translation steps; in our experiments we used
