back to their original form.

As a final example, a typical output transformation applied via
the edit script \texttt{global.ded} will convert all phones to
context-dependent form and append a short pause model \texttt{sp}
at the end of each pronunciation.  The following two commands will
do this
\begin{verbatim}
    TC
    AS sp
\end{verbatim}
For example, these commands would convert the dictionary entry
\begin{verbatim}
    BAT b ah t
\end{verbatim}
into
\begin{verbatim}
    BAT b+ah b-ah+t ah-t sp
\end{verbatim}

Finally, if the \texttt{-l} option is set, \htool{HDMan} will generate a
log file containing a summary of the pronunciations used from each source and
how many words, if any, are missing.  It is also possible to
give \htool{HDMan} a phone list using the \texttt{-n} option.
In this case, \htool{HDMan} will record how many times each phone
was used and also any phones that appeared in pronunciations but
are not in the phone list.  This is useful for detecting errors and
unexpected phone symbols in the source dictionary.

\mysect{Word Network Expansion}{netexpand}
\index{word network@expansion rules}

Now that word networks and dictionaries have been explained,
the conversion of word level networks
to model-based recognition networks will be described.  Referring
again to Fig.~\href{f:recsys}, this expansion
is performed automatically by the module \htool{HNet}.
By default,
\htool{HNet} attempts to infer the required expansion from the
contents of the dictionary and the associated list of HMMs.
However, five configuration parameters are supplied to apply
more precise control where required:
\texttt{ALLOWCXTEXP}\index{allowcxtexp@\texttt{ALLOWCXTEXP}},
\texttt{ALLOWXWRDEXP}\index{allowxwrdexp@\texttt{ALLOWXWRDEXP}},
\texttt{FORCECXTEXP}\index{forcecxtexp@\texttt{FORCECXTEXP}},
\texttt{FORCELEFTBI}\index{forceleftbi@\texttt{FORCELEFTBI}} and
\texttt{FORCERIGHTBI}\index{forcerightbi@\texttt{FORCERIGHTBI}}.

The expansion proceeds in four stages.
\begin{enumerate}
\item \textit{Context definition} \\
The first step is to determine how model
names are constructed from the dictionary entries and whether
cross-word context expansion should be performed.
The dictionary is scanned and each distinct phone is classified as either
\begin{enumerate}
\item \textit{Context Free} \\
  In this case, the phone is skipped when determining context.
  An example is a model (\texttt{sp}) for short pauses.  This
  will typically be inserted at the end of every word pronunciation,
  but since it tends to cover a very short segment of speech it
  should not block context-dependent effects in a cross-word
  triphone system.
\item \textit{Context Independent} \\
  The phone only exists in context-independent form.  A typical
  example would be a silence model (\texttt{sil}).  Note that the
  distinction that would be made by \htool{HNet} between \texttt{sil}
  and \texttt{sp} is that whilst both would only appear in the HMM set
  in context-independent form, \texttt{sil} would appear in the
  contexts of other phones whereas \texttt{sp} would not.
\item \textit{Context Dependent} \\
  This classification depends on whether a phone appears in the
  context part of the name and whether any context-dependent
  versions of the phone exist in the HMM set.
  Context-dependent phones will be subject to model name expansion.
\end{enumerate}
\item \textit{Determination of network type} \\
The default behaviour is to produce the simplest network
possible.  If the dictionary is closed (every phone name appears
in the HMM list), then no expansion of phone names is performed.
The resulting network is generated by straightforward
substitution of each dictionary pronunciation for each
word in the word network.  If the dictionary is not closed, then if
word-internal context expansion
would find each model in the HMM set, word-internal context
expansion is used.
Otherwise, full cross-word
context expansion is applied.

The determination of the network type\index{network type} can be modified by
using the configuration parameters mentioned earlier.  By default
\texttt{ALLOWCXTEXP} is set true.  If \texttt{ALLOWCXTEXP} is set false, then
no expansion of phone names is performed and each phone corresponds to the
model of the same name.  The default value of \texttt{ALLOWXWRDEXP} is false, thus
preventing context expansion across word boundaries.  This also limits the
expansion of the phone labels in the dictionary to word-internal contexts
only.  If \texttt{FORCECXTEXP} is set true, then context expansion will be
performed.  For example, if the HMM set contained all monophones, all biphones
and all triphones, then given a monophone dictionary, the default behaviour of
\htool{HNet} would be to generate a monophone recognition network since the
dictionary would be closed.  However, if \texttt{FORCECXTEXP} is set true and
\texttt{ALLOWXWRDEXP} is set false then word-internal context expansion will
be performed.
If \texttt{FORCECXTEXP} is set true and \texttt{ALLOWXWRDEXP} is
set true then full cross-word context expansion will be performed.
\item \textit{Network expansion} \\
Each word in the word network is transformed into a \textit{word-end}
node preceded by the sequence of model nodes corresponding to
the word's pronunciation.
For cross-word context expansion, the initial and final
context-dependent phones (and any preceding/following context-independent
ones) are duplicated as many times as is necessary
to cater for each different cross-word context.
Each duplicated word-final phone is followed by
a similarly duplicated word-end node.
Null words are simply transformed into word-end nodes with
no preceding model nodes.
\item \textit{Linking of models to network nodes} \\
Each model node is linked to the corresponding HMM definition.
In each case, the required HMM model name is determined from
the phone name and the surrounding
context names.  The algorithm used for this is
\begin{enumerate}
\item Construct the context-dependent name and see if the corresponding
  model exists.
\item Construct the context-independent name and see if the corresponding
  model exists.
\end{enumerate}
If the configuration variable \texttt{ALLOWCXTEXP} is false, step (a) is skipped,
and if the configuration variable \texttt{FORCECXTEXP} is true,
step (b) is skipped.  If no matching model is found, an error is
generated.  When the right context
is a boundary or \texttt{FORCELEFTBI} is true, then the
context-dependent name takes the form of a left biphone, that is,
the phone \texttt{p} with left context \texttt{l} becomes \texttt{l-p}.
When the left context
is a boundary or \texttt{FORCERIGHTBI} is true, then the
context-dependent name takes the form of a right biphone, that is,
the phone \texttt{p} with right context \texttt{r} becomes \texttt{p+r}.
Otherwise, the context-dependent name is a full triphone, that is,
\texttt{l-p+r}.
Context-free phones are skipped in this process so
\begin{verbatim}
    sil aa r sp y uw sp sil
\end{verbatim}
would be expanded as
\begin{verbatim}
    sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil
\end{verbatim}
assuming that \texttt{sil} is context-independent and \texttt{sp} is
context-free.
\index{cfwordboundary@\texttt{CFWORDBOUNDARY}}
For word-internal systems, the context expansion
can be further controlled via the configuration variable
\texttt{CFWORDBOUNDARY}.  When set true (default setting) context-free phones
will be treated as word boundaries so
\begin{verbatim}
    aa r sp y uw sp
\end{verbatim}
would be expanded to
\begin{verbatim}
    aa+r aa-r sp y+uw y-uw sp
\end{verbatim}
Setting \texttt{CFWORDBOUNDARY} false would produce
\begin{verbatim}
    aa+r aa-r+y sp r-y+uw y-uw sp
\end{verbatim}
\end{enumerate}

Note that in practice, stages (3) and (4) above actually proceed concurrently
so that for the first and last phone of context-dependent models, logical
models which have the same underlying physical model can be merged.

\centrefig{mononet}{100}{Monophone Expansion of Bit-But Network}

Having described the expansion process in some detail, some simple
examples will help clarify the process.  All of these are based
on the Bit-But word network illustrated in Fig.~\href{f:wdnet}.
Firstly, assume that the dictionary contains simple monophone
pronunciations, that is
\begin{verbatim}
    bit     b i t
    but     b u t
    start   sil
    end     sil
\end{verbatim}
and the HMM set consists of just monophones
\begin{verbatim}
    b i t u sil
\end{verbatim}
In this case, \htool{HNet} will find a closed dictionary.  There will
be no expansion and it will directly generate the network shown
in Fig.~\href{f:mononet}.
In this figure, the rounded boxes
represent model nodes and the square boxes represent word-end nodes.

Similarly, if the dictionary
contained word-internal triphone pronunciations such as
\begin{verbatim}
    bit     b+i b-i+t i-t
    but     b+u b-u+t u-t
    start   sil
    end     sil
\end{verbatim}
and the HMM set contains all the required models
\begin{verbatim}
    b+i b-i+t i-t b+u b-u+t u-t sil
\end{verbatim}
then again \htool{HNet} will find a closed dictionary
and the network shown in Fig.~\href{f:wintnet} would be generated.

\centrefig{wintnet}{100}{Word Internal Triphone Expansion of Bit-But Network}

If however the dictionary contained just the simple monophone pronunciations
as in the first case above, but the HMM set contained just triphones,
that is
\begin{verbatim}
    sil-b+i t-b+i b-i+t i-t+sil i-t+b
    sil-b+u t-b+u b-u+t u-t+sil u-t+b
    sil
\end{verbatim}
then \htool{HNet} would perform full cross-word expansion and
generate the network shown in Fig.~\href{f:xwrdnet}.

\centrefig{xwrdnet}{100}{Cross-Word Triphone Expansion of Bit-But Network}

Now suppose that, still using the simple monophone pronunciations,
the HMM set contained all monophones, biphones and triphones.  In this
case, the default would be to generate the monophone network of
Fig.~\href{f:mononet}.  If \texttt{FORCECXTEXP} is true but
\texttt{ALLOWXWRDEXP} is set false then the word-internal
network\index{word-internal network expansion}
of Fig.~\href{f:wintnet} would be generated.  Finally, if both
\texttt{FORCECXTEXP} and \texttt{ALLOWXWRDEXP} are set true then
the cross-word network\index{cross-word network expansion}
of Fig.~\href{f:xwrdnet} would be generated.
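
As a minimal sketch of the last two cases, and assuming that the parameters
are supplied in a configuration file loaded by the recognition tool (the
layout below is illustrative only), the word-internal network of
Fig.~\href{f:wintnet} would be requested by
\begin{verbatim}
    FORCECXTEXP  = T
    ALLOWXWRDEXP = F
\end{verbatim}
and the cross-word network of Fig.~\href{f:xwrdnet} by
\begin{verbatim}
    FORCECXTEXP  = T
    ALLOWXWRDEXP = T
\end{verbatim}
Leaving both parameters unset keeps the default behaviour which, for this
closed dictionary, gives the monophone network of Fig.~\href{f:mononet}.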

\mysect{Other Kinds of Recognition System}{othernets}

Although the recognition facilities of \HTK\ are aimed primarily
at sub-word based connected word recognition, it can nevertheless
support a variety of other types of recognition system.

To build a phoneme recogniser, a word-level network is defined using
an SLF file in the usual
way except that each ``word'' in the network represents a single phone.
The structure of the network will typically be a loop in which all
phones loop back to each other.\index{phone recognition}

The dictionary then contains an entry for each ``word'' such that the word and
the pronunciation are the same, for example, the dictionary might contain
\begin{verbatim}
    ih ih
    eh eh
    ah ah
    ... etc
\end{verbatim}

Phoneme recognisers often use biphones to provide some measure of
context-dependency.  Provided that the HMM set contains all the necessary
biphones, then \htool{HNet} will expand a simple phone loop into a
context-sensitive
biphone loop simply by setting the configuration variable
\texttt{FORCELEFTBI} or \texttt{FORCERIGHTBI} to true, as appropriate.

Whole word recognisers can be set up in a similar way.  The word network
is designed using the same considerations as for a sub-word based system
but the dictionary gives the name of the whole-word HMM in place of each
word pronunciation.\index{whole word recognition}

Finally, word spotting\index{word spotting} systems can be defined by
placing each keyword
in a word network in parallel with the appropriate filler models.
The keywords can be whole-word models or sub-word based.  Note in this
case that word transition penalties placed on the transitions can be
used to gain fine control over the false alarm rate.

%%% Local Variables:
%%% mode: plain-tex
%%% TeX-master: "htkbook"
%%% End: