\begin{itemize}
\item five scores for translation steps: conditional phrase translation probabilities in both directions (foreign to English and vice versa), lexical translation probabilities (foreign to English and vice versa), and phrase count;
\item two scores for generation steps: conditional generation probabilities in both directions (new target factors given existing target factors and vice versa).
\end{itemize}

The different components of the model are combined in a log-linear model. In addition to the traditional components --- language model, reordering model, word and phrase count, etc. --- each mapping step forms a component with five (translation) or two (generation) features. The feature weights in the log-linear model are determined using minimum error rate training \citep{Och2003c}.
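Spelled out, each candidate translation $\mathbf{e}$ of an input sentence $\mathbf{f}$ is scored by a weighted sum of feature functions $h_i$ (the language model, the reordering model, word and phrase counts, and the five translation or two generation scores of each mapping step), and the decoder searches for the highest-scoring translation
\begin{equation}
\hat{\mathbf{e}} = \arg\max_{\mathbf{e}} \sum_i \lambda_i \, h_i(\mathbf{e},\mathbf{f}),
\end{equation}
where the weights $\lambda_i$ are the parameters set by minimum error rate training.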
\section{Efficient Decoding}

Compared to phrase-based models, the decomposition of phrase translation into several mapping steps creates additional computational complexity. Instead of a simple table lookup to obtain the possible translations for an input phrase, a sequence of such tables has to be consulted and their content combined.

Since all translation steps operate on the same segmentation, the {\bf expansion} of these mapping steps can be efficiently pre-computed prior to the heuristic beam search and stored as translation options (recall the example in Section~\ref{sec:factored-decomposition}, where we carried out the expansion for one input phrase). This means that the fundamental search algorithm does not change; only the scoring of hypotheses becomes slightly more complex.

However, we need to be careful about a combinatorial explosion of the number of translation options given a sequence of mapping steps. If one or more mapping steps result in a vast increase of (intermediate) expansions, this may become unmanageable. We currently address this problem by early pruning of expansions, and by limiting the number of translation options per input phrase to a maximum number, by default 50.
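To make this pre-computation concrete, the sketch below (a simplified illustration, not the actual Moses implementation) expands one input phrase through a sequence of mapping-step tables, prunes low-scoring intermediate expansions early, and caps the number of surviving translation options. The tables, probabilities, and pruning beam are invented for illustration.
\begin{verbatim}
# Toy mapping-step tables: source phrase -> list of (target, score) pairs.
# Real tables carry the five translation scores listed above; a single
# probability per entry is a simplification.
TRANSLATE_WORD = {"haeuser": [("houses", 0.70), ("homes", 0.25),
                              ("house", 0.05)]}
GENERATE_POS   = {"houses": [("NNS", 0.9), ("VBZ", 0.1)],
                  "homes":  [("NNS", 0.95)],
                  "house":  [("NN", 0.8), ("VB", 0.2)]}

MAX_OPTIONS   = 50   # default cap on translation options per input phrase
BEAM_PER_STEP = 20   # early pruning of intermediate expansions (assumed value)

def expand(source_phrase, steps):
    """Expand one input phrase through a sequence of mapping steps."""
    options = [((), 1.0)]          # partial options: (output factors, score)
    for lookup_factor, table in steps:
        expanded = []
        for factors, score in options:
            # Translation steps look up the source phrase; generation steps
            # look up an already produced output factor.
            key = (source_phrase if lookup_factor is None
                   else factors[lookup_factor])
            for target, p in table.get(key, []):
                expanded.append((factors + (target,), score * p))
        # Early pruning: keep only the best intermediate expansions.
        options = sorted(expanded, key=lambda x: -x[1])[:BEAM_PER_STEP]
    return options[:MAX_OPTIONS]

# Step 1 translates the source word; step 2 generates POS from output factor 0.
for factors, score in expand("haeuser", [(None, TRANSLATE_WORD),
                                         (0, GENERATE_POS)]):
    print(factors, round(score, 3))
\end{verbatim}
Pruning after every step keeps the intermediate lists small, which is what makes the full expansion affordable before the beam search starts.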
\section{Current Shortcomings}

One significant limiting factor in the performance of multi-factored translation models is the present requirement that successive translation steps all translate identical source and target spans. If a compatible translation is not found for a secondary translation step (either because hypotheses with compatible factors were discarded earlier or because there is no possible translation in the phrase table for the secondary translation step), the hypothesis is abandoned. This has considerable benefit from a computational perspective, since it constrains the search space for potential targets when translating secondary factors. However, it causes a number of significant problems:
\begin{enumerate}
\item In models where a secondary factor is both generated from another target factor and translated from a source factor, any pruning before both steps have completed runs the risk of producing not just degraded output, but of failing to find any adequate translation.
\item Because a compatible translation must be found in secondary steps for a translation hypothesis to survive, it is difficult to filter secondary translation tables. This results in very large tables which are inefficient to load and have considerable memory overhead.
\item When secondary translation steps fail and hypotheses are abandoned, the model is forced to rely on shorter translation units for the primary translation step. This is in direct conflict with the potential benefits that can be gained from richer statistics.
\end{enumerate}

There are several possible ways that the exact-span match requirement might be addressed. One solution that is computationally tractable is to back off to shorter spans only in the event of a failure to find any possible translation candidates during subsequent translation steps. The problem that arises is how the established spans should be translated once multiple translation units can be used. Reordering within phrases is certainly quite common; such reorderings could be further constrained to match the alignments suggested by the initial span.

\chapter[Experiments with Factored Translation Models]{Experiments with \\ Factored Translation Models}
\label{chap:factored-experiments}

This chapter reviews the factored translation model experiments conducted at the summer workshop. After developing the Moses software during the workshop, we used it to create different configurations of factored translation models to address particular problematic cases when translating into different languages. The structure of this chapter is as follows:
\begin{itemize}
\item Section~\ref{english-german-experiments} presents our experiments for translation from English into German. We configured factored models to address German morphology through lemmas, and to integrate part-of-speech and agreement information to improve grammatical coherence.
\item Section~\ref{english-spanish-experiments} describes factored models for translation from English into Spanish, where Spanish subject-verb and adjective-noun-determiner agreement is explicitly modeled. These experiments further examine how factored models can be used to improve translation quality in small data scenarios.
\item Section~\ref{english-czech-experiments} compares the performance of three different English-to-Czech translation models which include lemma and morphological information as factors, and shows that these models result in better translation quality than the baseline phrase-based model.
\end{itemize}

\section{English-German}
\label{english-german-experiments}

German is an example of a language with a relatively rich morphology. Historically, most research in statistical machine translation has been carried out on language pairs with English as the target language. This leads to the question: Does rich morphology pose problems that have not been addressed so far, if it occurs on the target side? Previous research has shown that stemming morphologically rich input languages leads to better performance. However, this trick does not work when we have to {\it generate} rich morphology.

\subsection{Impact of morphological complexity}

To assess the impact of rich morphology, we carried out a study to see what performance gains could be achieved if we could generate German morphology perfectly. For this, we used a translation model trained on 700,000 sentences of the English--German Europarl corpus (a training corpus we will work with throughout this section), and the test sets taken from the 2006 ACL Workshop on Statistical Machine Translation. We trained a system with the standard settings of the Moses system (described in \ref{toolkit}).

English--German is a difficult language pair, which is also reflected in the BLEU scores for this task. For our setup, we achieved a score of 17.80 on the 2006 test set, whereas for other language pairs scores of over 30 BLEU can be achieved. How much of this is due to the morphological complexity of German? If we measure BLEU not on words (as is typically done), but on stems, we get some idea how to answer this question. As shown in Table~\ref{tab:german:stem-bleu}, the stem-BLEU score is 21.47, almost 4 points higher.

\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|} \hline
\bf Method & \bf devtest & \bf test \\ \hline
BLEU measured on words & 17.76 & 17.80 \\ \hline
BLEU measured on stems & 21.70 & 21.47 \\ \hline
\end{tabular}
\end{center}
\caption{Assessment of what could be gained with perfect morphology: BLEU scores measured on the word level and on stemmed system output and reference sets. The BLEU score decreases by about 4 points due to errors in the morphology.}
\label{tab:german:stem-bleu}
\end{table}
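The stem-level evaluation above is straightforward to reproduce in principle. The sketch below uses NLTK's corpus-level BLEU and the Snowball German stemmer as stand-ins for the stemmer and BLEU implementation actually used, so exact scores will differ; the two example sentences are invented.
\begin{verbatim}
from nltk.stem.snowball import GermanStemmer
from nltk.translate.bleu_score import corpus_bleu

stemmer = GermanStemmer()

def stem_tokens(sentence):
    """Lowercase, split on whitespace, and reduce each token to its stem."""
    return [stemmer.stem(tok) for tok in sentence.lower().split()]

def stem_bleu(hypotheses, references):
    """Corpus BLEU computed on stems instead of surface forms."""
    ref_lists = [[stem_tokens(ref)] for ref in references]  # one ref each
    hyp_lists = [stem_tokens(hyp) for hyp in hypotheses]
    return corpus_bleu(ref_lists, hyp_lists)

# Toy example: "wichtige" and "wichtigen" share the stem "wichtig", so the
# stem-level score forgives the agreement error in the hypothesis.
hyps = ["die wichtige vorschlaege wurden angenommen"]
refs = ["die wichtigen vorschlaege wurden angenommen"]
print(stem_bleu(hyps, refs))
\end{verbatim}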
One of the motivations for the introduction of factored translation models is the problem of rich morphology. Morphology increases the vocabulary size and leads to sparse data problems. We expect that backing off to word representations with richer statistics, such as stems or word classes, will allow us to deal with this problem. Also, morphology carries grammatical information such as case, gender, and number, and explicitly expressing this information in the form of factors will allow us to develop models that take grammatical constraints into account.

\subsection{Addressing data sparseness with lemmas}

The German language model may not be as effective in machine translation as language models are for English, since its rich morphology fragments the data. This raises the question whether this problem of data sparseness may be overcome by building a language model on lemmas instead of the surface forms of words.

\begin{figure}
\begin{center}
\begin{tabular}{cc}
\includegraphics[scale=1]{factored-lemma2.pdf} &
\includegraphics[scale=1]{factored-lemma1.pdf} \\
Lemma Model 1 & Lemma Model 2
\end{tabular}
\end{center}
\caption{Two models for including lemmas in factored translation models: Both models map words from input to output in a translation step and generate the lemma on the output side. Model 2 includes an additional step that maps input words to output lemmas.}
\label{fig:german:lemma-model}
\end{figure}

To test this hypothesis, we built two factored translation models, as illustrated in Figure~\ref{fig:german:lemma-model}. The models are based on traditional phrase-based statistical machine translation systems, but add information in the form of lemmas on the output side, which allows the integration of a language model trained on lemmas. Note that this goes beyond previous work in reranking, since the second language model trained on lemmas is integrated into the search.

In our experiments, we obtained higher translation performance when using the factored translation models that integrate a lemma language model (all language models are trigram models trained with the SRILM toolkit). See Table~\ref{tab:german:lemma-model} for details. On the two test sets we used, we gained 0.60 and 0.65 BLEU with Model 1 and 0.19 and 0.48 BLEU with Model 2, respectively. The additional translation step does not seem to be useful.

\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|} \hline
\bf Method & \bf devtest & \bf test \\ \hline
baseline & 18.22 & 18.04 \\ \hline
hidden lemma (gen only) & \bf 18.82 & \bf 18.69 \\ \hline
hidden lemma (gen and trans) & 18.41 & 18.52 \\ \hline
best published results & - & 18.15 \\ \hline
\end{tabular}
\end{center}
\caption{Results with the factored translation models integrating lemmas from Figure~\ref{fig:german:lemma-model}: language models over lemmas lead to better performance, beating the best published results. Note: the baseline presented here is higher than the one used in Table~\ref{tab:german:stem-bleu}, since we used a more mature version of our translation system.}
\label{tab:german:lemma-model}
\end{table}

\subsection{Overall grammatical coherence}

The previous experiment tried to take advantage of models trained with richer statistics over more general representations of words by focusing on the lexical level. Another aspect of words is their grammatical role in the sentence. A straightforward aspect to focus on is part-of-speech tags. The hope is that constraints on part-of-speech tags might ensure more grammatical output.

\begin{figure}
\begin{center}
\includegraphics[scale=1]{factored-simple-pos-lm.pdf}
\end{center}
\caption{Adding part-of-speech information to a statistical machine translation model: By generating POS tags on the target side, it is possible to use high-order language models over these tags that help ensure more grammatical output. In our experiment, we only obtained a minor gain (BLEU 18.25 vs. 18.22).}
\label{fig:german:pos-model}
\end{figure}

The factored translation model that integrates part-of-speech information is very similar to the lemma models from the previous section; see Figure~\ref{fig:german:pos-model} for an illustration. Again, the additional information on the target side is produced by a generation step, and a language model over this factor is employed.

Since there are only very few part-of-speech tags compared to surface forms of words, it is possible to build very high-order language models for them. In our experiments we used 5-gram and 7-gram models. However, the gains obtained by adding such a model were only minor: for instance, on the devtest set we improved BLEU to 18.25 from 18.22, while on the test set no difference in BLEU could be measured.

A closer look at the output of the systems suggests that local grammatical coherence is already fairly good, so that the POS sequence models are not necessary. On the other hand, for large-scale grammatical concerns, the added sequence models are not strong enough to support major restructuring.
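To make the role of these additional sequence models concrete, the sketch below shows how a decoder can combine log-probabilities from independent language models over different output factors, each weighted in the log-linear model. The model class, the weights, and the n-gram scores are invented for illustration and do not correspond to the Moses implementation.
\begin{verbatim}
class NGramLM:
    """Minimal stand-in for an n-gram language model over one output factor."""
    def __init__(self, order, logprobs):
        self.order = order
        self.logprobs = logprobs   # maps n-gram tuples to log probabilities

    def score(self, tokens):
        total = 0.0
        for i in range(len(tokens)):
            ngram = tuple(tokens[max(0, i - self.order + 1):i + 1])
            total += self.logprobs.get(ngram, -4.0)  # crude unseen-n-gram cost
        return total

# One language model per output factor, each with a log-linear weight.
# Orders follow the text: trigrams for words and lemmas, a 7-gram for POS.
FACTOR_LMS = {
    "word":  (0.5, NGramLM(3, {})),
    "lemma": (0.2, NGramLM(3, {})),
    "pos":   (0.3, NGramLM(7, {("ART",): -0.5,
                               ("ART", "ADJA"): -0.3,
                               ("ART", "ADJA", "ADJA"): -0.4,
                               ("ART", "ADJA", "ADJA", "NN"): -0.2})),
}

def lm_component_score(hypothesis):
    """Weighted sum of per-factor LM scores for one translation hypothesis."""
    return sum(weight * lm.score(hypothesis[factor])
               for factor, (weight, lm) in FACTOR_LMS.items())

hyp = {"word":  ["die", "zweite", "wichtige", "abstimmung"],
       "lemma": ["die", "zweit", "wichtig", "abstimmung"],
       "pos":   ["ART", "ADJA", "ADJA", "NN"]}
print(lm_component_score(hyp))
\end{verbatim}
Because the POS vocabulary is tiny, even the 7-gram model over tags stays small while still scoring long tag contexts.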
\subsection{Local agreement (esp. within noun phrases)}

The expectation with adding POS tags is to get a handle on relatively local grammatical coherence, i.e., word order, maybe even the insertion of the proper function words. Another aspect is morphological coherence. In languages such as German, not only nouns but also adjectives and determiners are inflected for number (singular versus plural), case, and grammatical gender. When translating from English, there is not sufficient indication from the translation model which inflectional form to choose, and the language model is the only means to ensure agreement.

By introducing morphological information as a factor into our model, we expect to be able to detect word sequences with agreement violations. Thus our model should be able to decide that
\begin{itemize}
\item {\bf DET-sgl NOUN-sgl} is a good sequence, but
\item {\bf DET-sgl NOUN-plural} is a bad sequence.
\end{itemize}

The model for integrating morphological factors is similar to the previous models; see Figure~\ref{fig:german:morphology} for an illustration. We generate a morphological tag in addition to the word and part-of-speech tag. This allows us to use a language model over the tags. Tags are generated with the LoPar parser.

\begin{figure}
\begin{center}
\includegraphics[scale=1]{factored-posmorph-lm.pdf}
\end{center}
\caption{Adding morphological information: This enables the incorporation of language models over morphological factors and helps ensure agreement, especially in local contexts such as noun phrases.}
\label{fig:german:morphology}
\end{figure}

When using a 7-gram POS model in addition to the language model, we see minor improvements in BLEU (+0.03 and +0.18 for the devtest and test set, respectively). But an analysis of agreement within noun phrases shows that we dramatically reduced the agreement error rate from 15\% to 4\% (a sketch of such an agreement check follows the examples below). See Table~\ref{tab:german:morphology} for a summary of the results.

\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|} \hline
\bf Method & \bf Agreement errors in NP & \bf devtest & \bf test \\ \hline
baseline & 15\% in NP $\ge$ 3 words & 18.22 BLEU & 18.04 BLEU \\ \hline
factored model & 4\% in NP $\ge$ 3 words & 18.25 BLEU & 18.22 BLEU \\ \hline
\end{tabular}
\end{center}
\caption{Results with the factored translation model integrating morphology from Figure~\ref{fig:german:morphology}. Besides a minor improvement in BLEU, we drastically reduced the number of agreement errors within noun phrases.}
\label{tab:german:morphology}
\end{table}

Here are two examples where the factored model outperformed the phrase-based baseline:
\begin{itemize}
\item Example 1: rare adjective in between preposition and noun
\begin{itemize}
\item baseline: {\bf ... \underline{zur} zwischenstaatlichen methoden ...}
\item factored model: {\bf ... zu zwischenstaatlichen methoden ...}
\end{itemize}
\item Example 2: too many words between determiner and noun
\begin{itemize}
\item baseline: {\bf ... \underline{das} zweite wichtige {\"a}nderung ...}
\item factored model: {\bf ... die zweite wichtige {\"a}nderung ...}
\end{itemize}
\end{itemize}
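The agreement error rates reported in Table~\ref{tab:german:morphology} come from checking agreement inside noun phrases of the system output. The sketch below illustrates one way such a check could be implemented on output annotated with morphological tags; the tag format, the noun-phrase spans, and the data are invented for illustration, and this is not the evaluation script actually used.
\begin{verbatim}
# Hypothetical tagged output: (word, morphological tag) pairs, where the tag
# encodes number.case.gender. Real tagsets (e.g. LoPar's) look different.
NOUN_PHRASES = [
    [("die", "sgl.nom.fem"), ("zweite", "sgl.nom.fem"),
     ("wichtige", "sgl.nom.fem"), ("abstimmung", "sgl.nom.fem")],
    [("das", "sgl.nom.neut"), ("zweite", "sgl.nom.fem"),
     ("wichtige", "sgl.nom.fem"), ("abstimmung", "sgl.nom.fem")],
]

def np_agreement_error_rate(noun_phrases, min_len=3):
    """Fraction of noun phrases (>= min_len words) whose tags do not all agree.

    A real evaluation would first extract NP spans, e.g. with a parser; here
    each listed phrase is already an NP, and agreement simply means that all
    tokens carry an identical morphological tag.
    """
    nps = [np for np in noun_phrases if len(np) >= min_len]
    errors = sum(1 for np in nps if any(tag != np[0][1] for _, tag in np))
    return errors / len(nps)

# The second toy noun phrase mixes neuter and feminine tags -> 50% error rate.
print(np_agreement_error_rate(NOUN_PHRASES))
\end{verbatim}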