\end{itemize}

In both cases, the language model over surface forms of words is not strong
enough.  Locally, on the bigram level, the word sequences are correct, owing to
the ambiguity in the morphology of German adjectives.  For instance, {\bold
zwischenstaatlichen} could be either singular feminine dative, as required by
the preposition {\bold zur}, or plural, as required by the noun {\bold
methoden}.  The agreement error is between preposition and noun, but the
language model has to overcome the context of the unusual adjective {\bold
zwischenstaatlichen}, which is not very frequent in the training corpus.  For
the morphological tags, however, we have very rich statistics that rule out the
erroneous word sequence.

\subsection{Subject-verb agreement}
\label{english-subject-verb-agreement-example}

Besides agreement errors within noun phrases, another source of disfluent
German output is agreement errors between subject and verb.  In German, subject
and verb are often next to each other (for instance, {\bold \underline{hans}
\underline{schwimmt}.}), but they may also be several words apart, which is
almost always the case in relative clauses ({\bold ... damit \underline{hans}
im see ... \underline{schwimmt}.}).

We can address this problem with factors and skip language models.  Consider
the following example of an English sentence that may be generated incorrectly
by a machine translation system:
{\bold
\begin{center}
\begin{tabular}{cccccccc}
\bf the & \bf paintings & \bf of & \bf the & \bf old & \bf man & \underline{\bf is} & \bf beautiful \\
\end{tabular}
\end{center}}
In this sentence, {\bold old man is} is a better trigram than {\bold old man
are}, so the language model will likely prefer the wrong translation.  The
subject-verb agreement is between the words {\bold paintings} and {\bold are},
which are several words apart.  Since this is out of the reach of traditional
language models, we would want to introduce tags for subject and verb to check
for this agreement.  For all the other words, the tag is empty.  See the
extended example below:
{\bold
\begin{center}
\begin{tabular}{cccccccc}
\bf the & \bf paintings & \bf of & \bf the & \bf old & \bf man & \bf are & \bf beautiful \\
- & SBJ-plural & - & - & - & - & V-plural & - \\
\end{tabular}
\end{center}}
Given these tags, we should prefer the correct morphological forms:
\begin{center}
{\bold p(-,SBJ-plural,-,-,-,-,V-plural,-) $>$ p(-,SBJ-plural,-,-,-,-,V-singular,-)}
\end{center}
We implemented a skip language model, so that the empty tags are ignored and
the language model decision is made simply on the basis of the subject and verb
tags:
\begin{center}
{\bold p(SBJ-plural,V-plural) $>$ p(SBJ-plural,V-singular)}
\end{center}
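To make the skip decision concrete, the following sketch (in Python, with
invented probabilities standing in for corpus estimates; it is an illustration
of the idea rather than the decoder's actual implementation) drops the empty
tags and scores the remaining subject and verb tags with a bigram model:

\begin{verbatim}
# Toy sketch of the skip language model decision: empty tags are ignored,
# and only the subject and verb tags are scored by a bigram model.
# The probabilities below are invented, not estimated from a corpus.

def skip_tags(tags):
    """Drop empty tags so the model sees only subject and verb markers."""
    return [t for t in tags if t != "-"]

BIGRAM = {
    ("SBJ-plural",   "V-plural"):   0.9,
    ("SBJ-plural",   "V-singular"): 0.1,
    ("SBJ-singular", "V-singular"): 0.9,
    ("SBJ-singular", "V-plural"):   0.1,
}

def skip_lm_score(tags):
    """Probability of the non-empty tag sequence under the toy bigram model."""
    seq = skip_tags(tags)
    score = 1.0
    for prev, cur in zip(seq, seq[1:]):
        score *= BIGRAM.get((prev, cur), 1e-6)
    return score

good = ["-", "SBJ-plural", "-", "-", "-", "-", "V-plural",   "-"]
bad  = ["-", "SBJ-plural", "-", "-", "-", "-", "V-singular", "-"]
assert skip_lm_score(good) > skip_lm_score(bad)
\end{verbatim}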
We explored this idea when translating into Spanish, as described in the next
section.

\section{English-Spanish}
\label{english-spanish-experiments}

In this section we describe a series of experiments we conducted using factored
models for translating European parliament proceedings from English to Spanish.
The motivation for these experiments was to assess the utility of factored
translation models for limited-resource translation tasks.  In the standard
phrase-based translation paradigm the translation process is modeled as
$p(\textbf{e}|\textbf{f})$, where the target language sentence $\textbf{e}$ is
generated from the source sentence $\textbf{f}$, and both source and target are
decomposed into fully inflected substrings, or phrases.  It is because
phrase-based systems directly model the translation of surface strings that
they require large amounts of parallel data to train.  Intuitively, these data
are helpful for modeling both general cooccurrence phenomena (such as local
agreement) within a language and phrases that translate non-compositionally
across languages.

In these experiments, we explore the use of factored models to reduce the data
requirements for statistical machine translation.  In these models we attempt
to improve on the performance of a standard phrase-based model either by
explicitly addressing local agreement or by modeling the translation process
through decomposition into different morphological factors.  Specifically, we
explored models that model lemmatized forms of the parallel training data
rather than fully inflected forms.  We also experimented with models that
attempt to explicitly model agreement through language models of agreement-like
features derived from morphological analysis.
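For concreteness, the kind of factored representation we have in mind attaches
a lemma and an agreement-oriented morphological tag to every surface word.  The
sketch below (in Python) builds such a representation by hand for a short
Spanish sentence; in a real system the analyses would come from a morphological
analyzer rather than a hand-written table, and the tag inventory shown here is
purely illustrative.  The pipe-separated token format follows the convention
Moses uses for factored training data:

\begin{verbatim}
# Hand-built illustration of a factored representation for the Spanish side
# of the training data: surface form | lemma | agreement-like morphology.
# Real analyses would come from a morphological analyzer, not this table.

ANALYSES = {
    "tú":      ("tú",      "PRON-2p-sg"),
    "quieres": ("querer",  "V-2p-sg"),
    "un":      ("un",      "DET-m-sg"),
    "billete": ("billete", "N-m-sg"),
}

def factorize(sentence):
    """Turn a tokenized sentence into pipe-separated factored tokens."""
    tokens = []
    for word in sentence.split():
        lemma, morph = ANALYSES[word.lower()]
        tokens.append("{}|{}|{}".format(word, lemma, morph))
    return " ".join(tokens)

print(factorize("tú quieres un billete"))
# tú|tú|PRON-2p-sg quieres|querer|V-2p-sg un|un|DET-m-sg billete|billete|N-m-sg
\end{verbatim}

Language models can then be trained over the lemma or morphology factor alone,
where the statistics are far denser than over fully inflected surface forms.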
To compare the performance of these models we created a small subset of
Europarl that we call {\tt EuroMini}.  This subset consisted of 40,000 sentence
pairs (approximately 5\% of the total Europarl corpus).
Table~\ref{tab:euromini-corpus-stats} shows the statistics for this corpus and
BLEU scores from a baseline phrase-based MT system.  For these tasks we report
results translating from English into Spanish.  We chose this task to test
whether our factored models could better model the explicit agreement phenomena
in Spanish.  We compared the results of systems trained on this reduced form of
Europarl with:
\begin{enumerate}
  \item standard models trained on {\tt EuroMini}
  \item standard models trained on {\tt EuroMini} with a language model built
        from the Spanish side of the full Europarl corpus
  \item standard models trained on the full Europarl corpus
\end{enumerate}

\begin{table}
  \begin{center}
    \begin{tabular}{|l|c|l|l|}
      \hline
      \bf Data Set & \bf Translation Direction & \bf Size & \bf Baseline BLEU (with different LMs) \\ \hline \hline
      Full Europarl & English $\rightarrow$ Spanish & 950k LM Train & 3-gram LM $\rightarrow$ 29.35 \\
                    &                               & 700k Bitext   & 4-gram LM $\rightarrow$ 29.57 \\
                    &                               &               & 5-gram LM $\rightarrow$ 29.54 \\ \hline
      \tt EuroMini  & English $\rightarrow$ Spanish & 60k LM Train  & 3-gram LM $\rightarrow$ 23.41 \\
                    &                               & 40k Bitext    & 3-gram LM $\rightarrow$ 25.10 (950k train) \\
      \hline
    \end{tabular}
  \end{center}
  \caption{{\tt EuroMini} and Europarl English-Spanish Corpus Description}
  \label{tab:euromini-corpus-stats}
\end{table}

\subsection{Sparse Data and Statistical MT for English-Spanish}

Like many languages, Spanish exhibits both subject-verb agreement and
noun-phrase-internal agreement.  Spanish verbs must agree with their subjects
in both number and person.  Spanish noun phrases force determiners, adjectives
and nouns to agree in both number and gender.  In both cases, the agreement is
explicitly marked in Spanish morphology.  Examples of these agreement phenomena
are shown in Table~\ref{tab:spanish-agr-examples}.

\begin{table}
  \begin{center}
    \begin{tabular}{|l|llll|}
      \hline
      \multicolumn{5}{|c|}{\it Subject Verb Agreement} \\ \hline \hline
      \bf Spanish & \bf T\'{u} & \bf quieres & un & billete \\
      \bf Gloss & you [2p, sing] & want [2p, sing] & a & ticket \\
      \hline
    \end{tabular}
    \begin{tabular}{|l|lllllllll|}
      \hline
      \multicolumn{10}{|c|}{\it Long Distance Subject Verb Agreement} \\ \hline \hline
      \bf Spanish & La & \bf creaci\'{o}n & de & un & grupo & de & alto & nivel & \bf es \\
      \bf Gloss & The & creation [3p, sing] & of & a & group & of & high & level/standing & is [3p, sing] \\
      \hline
    \end{tabular}
    \begin{tabular}{|l|lll|}
      \hline
      \multicolumn{4}{|c|}{\it Noun Phrase Agreement} \\ \hline \hline
      \bf Spanish & \bf esta & \bf cooperaci\'{o}n & \bf reforzada \\
      \bf Gloss & this [sing, f] & cooperation [sing, f] & reinforced/enhanced [sing, f] \\
      \hline
    \end{tabular}
  \end{center}
  \caption{Examples of Spanish Agreement Phenomena}
  \label{tab:spanish-agr-examples}
\end{table}

The inflectional morphology that marks these phenomena in Spanish presents a
unique set of problems for statistical language learning in general and MT in
particular.  First, methods based on counts of surface-form words suffer from
data fragmentation.  Compare, for instance, the English phrase ``saw the'' (as
in ``I/he/she/they [ saw the ] car/bag'') with its possible Spanish
translations (shown in Table~\ref{tab:example-frag}).

\begin{table}[h]
  \begin{center}
    \begin{tabular}{|l|l||l|l|}
      \hline
      vi al & vi a la & viste al & viste a la \\ \hline
      vio al & vio a la & vimos al & vimos a la \\ \hline
      vieron al & vieron a la & visteis al & visteis a la \\
      \hline
    \end{tabular}
  \end{center}
  \caption{Examples of Data Fragmentation Due to Poor Generalization}
  \label{tab:example-frag}
\end{table}

Each surface form shown here differs only in the person and number features on
the verb ``ver'' or the gender on the determiner ``el.''  Instead of learning a
relation between the underlying lemmatized form ``ver a el'' and the
corresponding English phrase, a standard MT system must learn each of the
variants shown above.  In situations where training data is abundant, this may
not cause great difficulty, but when parallel training resources are lacking,
the fragmentation shown here could cause poor estimation of translation
probabilities.  Furthermore, in these situations observation of each phrase
variant may not be possible, and a statistical model based on surface forms
alone would not be able to produce unseen variants.  For statistical MT
systems, this lack of ability to generalize could affect all stages of
training, including word alignment, phrase extraction and language model
training.
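The fragmentation in Table~\ref{tab:example-frag} can be made concrete with a
small sketch (again in Python, with a hand-built lemma dictionary standing in
for a morphological analyzer): all twelve surface phrases collapse to the
single lemmatized phrase ``ver a el'':

\begin{verbatim}
# The twelve surface variants of "saw the" from the fragmentation table all
# share one lemmatized form.  The lemma dictionary is hand-built for this
# example; a real system would use a morphological analyzer.

LEMMA = {
    "vi": "ver", "viste": "ver", "vio": "ver",
    "vimos": "ver", "visteis": "ver", "vieron": "ver",
    "al": "a el", "a": "a", "la": "el",
}

VARIANTS = [
    "vi al", "vi a la", "viste al", "viste a la",
    "vio al", "vio a la", "vimos al", "vimos a la",
    "vieron al", "vieron a la", "visteis al", "visteis a la",
]

def lemmatize(phrase):
    return " ".join(LEMMA[w] for w in phrase.split())

lemmas = {lemmatize(v) for v in VARIANTS}
print(len(VARIANTS), "surface phrases ->", len(lemmas), "lemmatized phrase")
# 12 surface phrases -> 1 lemmatized phrase ('ver a el')
\end{verbatim}

A lemma-level translation model therefore needs a single entry where a
surface-form model needs twelve, and inflected variants never observed in
training can still be produced if a generation step maps lemmas and
morphological features back to surface forms.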
A second set of problems in Spanish has to do with long-distance agreement and
is similar to the English example given in
Section~\ref{english-subject-verb-agreement-example}.  In this case,
inflectional morphology enforces an agreement relation between two surface
forms across a long span.  Phrase-based MT systems have difficulty modeling
this agreement: typically, neither language models nor phrase translation
models can be reliably estimated for dependencies longer than 3--4 words.  This
problem is exacerbated in sparse data conditions, and we hypothesize that in
these conditions the long-term coherence of phrase-based MT output could
suffer.

In the sections below we detail two factored translation models that attempt to
address these problems specifically.  Both models extend the standard
surface-form model by using morphological analysis and part-of-speech
information.  To address the agreement and long-span coherence problems, we
construct a model that {\it generates} agreement features and {\it checks}
these features using a language model.
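As a toy illustration of this generate-and-check idea (with a hand-built
feature table and invented scores, not the model we actually trained), each
word of a candidate output can be mapped to an agreement feature and the
resulting feature sequence scored, so that a gender-mismatched variant of the
noun-phrase example in Table~\ref{tab:spanish-agr-examples} receives the lower
score:

\begin{verbatim}
# Toy generate-and-check sketch: derive an agreement feature per word, then
# score the feature sequence with a bigram model.  The feature table and the
# scores are invented for illustration only.

FEATURES = {
    "esta": "sg-f", "cooperación": "sg-f",
    "reforzada": "sg-f", "reforzado": "sg-m",
}

FEATURE_BIGRAM = {("sg-f", "sg-f"): 0.8, ("sg-f", "sg-m"): 0.05}

def agreement_score(sentence):
    feats = [FEATURES[w] for w in sentence.split()]
    score = 1.0
    for prev, cur in zip(feats, feats[1:]):
        score *= FEATURE_BIGRAM.get((prev, cur), 1e-6)
    return score

assert agreement_score("esta cooperación reforzada") > \
       agreement_score("esta cooperación reforzado")
\end{verbatim}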
