\centrefig{step6}{85}{Step 6}

The \texttt{-t} option sets the pruning\index{pruning} thresholds to be used during
training.  Pruning limits the range of state alignments that the
forward-backward algorithm includes in its summation and it
can reduce the amount of computation required by an
order of magnitude.  For most training files, a very tight pruning threshold
can be set; however, some training files will provide poorer acoustic
matching and in consequence a wider pruning beam is needed.  \htool{HERest}
deals with this by having an auto-incrementing pruning threshold.  In the
above example, pruning is normally 250.0.  If re-estimation fails on any
particular file, the threshold is increased by 150.0 and the file is
reprocessed.  This is repeated until either the file is successfully
processed or the pruning limit of 1000.0 is exceeded.  At this point it is safe to assume that there
is a serious problem with the training file and hence the fault should be fixed
(typically it will be an incorrect transcription) or the training file should be discarded.

The process leading to the initial set of monophones in the directory
\texttt{hmm0} is illustrated in Fig.~\href{f:step6}.

Each time \htool{HERest} is run it performs a single re-estimation.  Each new
HMM set is stored in a new directory.  Execution of \htool{HERest} should be
repeated twice more, changing the name of the input and output directories (set
with the options \texttt{-H} and \texttt{-M}) each time, until the directory
\texttt{hmm3} contains the final set of initialised monophone HMMs.

\subsection{Step 7 - Fixing the Silence Models}

\sidefig{egsils}{55}{Silence Models}{-4}{The previous step has generated a 3 state left-to-right HMM for each
phone and also an HMM for the silence model\index{silence model} \texttt{sil}.
The next step is to add extra transitions from states 2 to 4 and from
states 4 to 2\index{transitions!adding them}
in the silence model.
The idea here is to make the model more robust
by allowing individual states to absorb the various
impulsive noises in the training data.  The backward skip allows this to happen
without committing the model to transit to the following word.
Also, at this point, a 1 state
short pause\index{short pause} \texttt{sp} model should be created.
This should be a so-called \textit{tee-model}\index{tee-models}
which has a direct transition from entry to exit node.
This \texttt{sp} model has its emitting state tied to the centre state of the silence model.
The required topology of the two silence models is shown in Fig.~\href{f:egsils}.}

These silence models can be created in two stages
\begin{itemize}
\item Use a text editor on the file \texttt{hmm3/hmmdefs} to copy the centre state of
the \texttt{sil} model to
make a new \texttt{sp} model and store the resulting MMF \texttt{hmmdefs}, which includes
the new \texttt{sp} model, in the new directory \texttt{hmm4}.
\item Run the HMM editor \htool{HHEd}\index{hhed@\htool{HHEd}} to add the extra transitions required
and tie the \texttt{sp} state to the centre \texttt{sil} state.
\end{itemize}

\htool{HHEd} works in a similar way to \htool{HLEd}.  It applies a set of commands in
a script to modify a set of HMMs.  In this case, it is executed as follows
\begin{verbatim}
    HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1
\end{verbatim}
where \texttt{sil.hed} contains the following commands
\begin{verbatim}
    AT 2 4 0.2 {sil.transP}
    AT 4 2 0.2 {sil.transP}
    AT 1 3 0.3 {sp.transP}
    TI silst {sil.state[3],sp.state[2]}
\end{verbatim}
The \texttt{AT}\index{at@\texttt{AT} command} commands add transitions to the
given transition matrices and the final \texttt{TI}\index{ti@\texttt{TI} command}
command creates a tied-state called \texttt{silst}.  The parameters of
this tied-state are stored in the \texttt{hmmdefs} file and within each silence
model, the original state parameters are replaced by the name of this
macro\index{macros}.
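As a rough sketch of the result (an illustration only, not output copied from the tutorial: the vector size of 39 is an assumption, and each line consisting of \texttt{...} stands for values omitted here), the edited \texttt{hmmdefs} would hold the shared state once as a macro and refer to it by name inside the \texttt{sp} model:
\begin{verbatim}
    ~s "silst"
    <MEAN> 39
     ...
    <VARIANCE> 39
     ...
    ~h "sp"
    <BEGINHMM>
    <NUMSTATES> 3
    <STATE> 2
    ~s "silst"
    <TRANSP> 3
     ...
    <ENDHMM>
\end{verbatim}
The \texttt{sil} model would refer to the same macro in place of the parameters of its state 3.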
Macros are described in more detail below. For now it is
sufficient to regard them simply as the mechanism by which
\HTK\ implements parameter sharing.
Note that the phone list used here has been changed, because the original list
\texttt{monophones0} has been extended by the new \texttt{sp} model.
The new file is called \texttt{monophones1} and has been used in the above \htool{HHEd}
command.

\centrefig{step7}{110}{Step 7}

Finally, another two passes of \htool{HERest} are applied using the phone
transcriptions with \texttt{sp} models between words.  This leaves the
set of monophone HMMs created so far in the directory \texttt{hmm7}.
This step is illustrated in Fig.~\href{f:step7}.

\subsection{Step 8 - Realigning the Training Data}

As noted earlier, the dictionary contains multiple pronunciations for some words,
particularly function words.  The phone models created so
far can be used to \textit{realign} the training data and create new
transcriptions.  This can be done with a single invocation of the
\index{realignment}\HTK\ recognition tool \htool{HVite}\index{hvite@\htool{HVite}}, viz
\begin{verbatim}
    HVite -l '*' -o SWT -b silence -C config -a -H hmm7/macros \
          -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab \
          -I words.mlf -S train.scp  dict monophones1
\end{verbatim}
This command uses the HMMs stored in \texttt{hmm7} to transform the input
word level transcription \texttt{words.mlf} to the new phone level transcription
\texttt{aligned.mlf} using the pronunciations stored in the dictionary
\texttt{dict} (see Fig.~\href{f:step8}).   The key difference between this
operation and the original word-to-phone mapping performed by \htool{HLEd}
in step 4 is that the recogniser considers all pronunciations for each
word and outputs the pronunciation that best matches the acoustic data.
\index{phone alignment}\index{phone mapping}

In the above, the \texttt{-b} option is used to insert a silence model\index{silence model}
at the start and end of each utterance.
The name \texttt{silence} is used
on the assumption that the dictionary contains an entry
\begin{verbatim}
    silence sil
\end{verbatim}
Note that the dictionary should be sorted firstly by case (upper case first)
and secondly alphabetically.  The \texttt{-t} option sets a pruning level of 250.0 and
the \texttt{-o} option is used to suppress the printing of scores, word names and time
boundaries in the output MLF.

\centrefig{step8}{85}{Step 8}

Once the new phone alignments have been created, another 2 passes
of \htool{HERest} can be applied to re-estimate the HMM set parameters
again.  Assuming that this is done, the final monophone HMM set will
be stored in directory \texttt{hmm9}.

\mysect{Creating Tied-State Triphones}{egcreattri}

Given a set of monophone HMMs, the final stage of model building is to create
context-dependent triphone\index{HMM!triphones} HMMs.  This is done in two steps.  Firstly, the
monophone transcriptions are converted to triphone transcriptions and a set
of triphone models is created by copying the monophones and re-estimating.
Secondly, similar acoustic states of these triphones are tied to ensure that
all state distributions can be robustly estimated.

\subsection{Step 9 - Making Triphones from Monophones}

Context-dependent triphones can be made by simply cloning\index{HMM!cloning}\index{cloning}
monophones and then
re-estimating using triphone transcriptions.
The latter should be created
first using \htool{HLEd}\index{hled@\htool{HLEd}} because a side-effect is to generate a list of all
the triphones for which there is at least one example in the training data.
That is, executing
\begin{verbatim}
    HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
\end{verbatim}
will convert the monophone transcriptions in \texttt{aligned.mlf} to
an equivalent set of triphone transcriptions in \texttt{wintri.mlf}.
At the same time, a list of triphones is written to the file \texttt{triphones1}.
The edit script \texttt{mktri.led} contains the commands
\begin{verbatim}
    WB sp
    WB sil
    TC
\end{verbatim}
The two \texttt{WB}\index{wb@\texttt{WB} command} commands define \texttt{sp} and \texttt{sil}
as \textit{word boundary symbols}.  These then block the addition of
context in the \texttt{TC} command of the above script, which converts all phones
(except word boundary symbols) to triphones\index{triphones!word internal}\index{triphones!from monophones}\index{triphones!by cloning}.
For example,
\begin{verbatim}
    sil th ih s sp m ae n sp ...
\end{verbatim}
becomes
\begin{verbatim}
    sil th+ih th-ih+s ih-s sp m+ae m-ae+n ae-n sp ...
\end{verbatim}
This style of triphone transcription is referred to as \textit{word internal}.\index{word internal}
Note that some biphones will also be generated, since contexts at word boundaries
will sometimes include only two phones.

The cloning of models can be done efficiently using the HMM editor \htool{HHEd}:
\begin{verbatim}
    HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 \
         mktri.hed monophones1
\end{verbatim}
where the edit script \texttt{mktri.hed}
contains a clone command \texttt{CL} followed by \texttt{TI} commands to tie all of
the transition matrices in each triphone\index{triphones!notation} set, that is:
\begin{verbatim}
    CL triphones1
    TI T_ah {(*-ah+*,ah+*,*-ah).transP}
    TI T_ax {(*-ax+*,ax+*,*-ax).transP}
    TI T_ey {(*-ey+*,ey+*,*-ey).transP}
    TI T_b {(*-b+*,b+*,*-b).transP}
    TI T_ay {(*-ay+*,ay+*,*-ay).transP}
    ...
\end{verbatim}
The file \texttt{mktri.hed} can be generated using the {\em Perl} script
\texttt{maketrihed} included in the \texttt{HTKTutorial} directory.
When running the \htool{HHEd}\index{hhed@\htool{HHEd}} command you
will get warnings about trying to tie transition matrices for the \texttt{sil}
and \texttt{sp} models. Since neither model is context-dependent there aren't
actually any matrices to tie.

The clone command \texttt{CL}\index{cl@\texttt{CL} command} takes as its
argument the name of the file containing the list of triphones (and
biphones)\index{cloning}\index{parameter tying}\index{item lists} generated
above.  For each model of the form \texttt{a-b+c} in this list, it looks for
the monophone \texttt{b} and makes a copy of it.\index{tying!transition matrices}
Each \texttt{TI} command takes as its argument the name of a macro
and a list of HMM components.
The latter uses a notation which attempts to
mimic the hierarchical structure of the HMM parameter set in which the
transition matrix \texttt{transP} can be regarded as a sub-component of each
HMM.  The items listed within brackets are patterns designed to match the set
of triphones, right biphones and left biphones for each phone.

\centrefig{egtranstie}{80}{Tying Transition Matrices}

Up to now macros and tying have only been mentioned in passing.  Although a
full explanation must wait until chapter~\ref{c:HMMDefs}, a brief explanation
is warranted here.  Tying means that one or more HMMs share the same set of
parameters.  On the left side of Fig.~\href{f:egtranstie}, two HMM definitions
are shown.  Each HMM has its own individual transition matrix.  On the right
side, the effect of the first \texttt{TI} command in the edit script
\texttt{mktri.hed} is shown.  The individual transition matrices have been
replaced by a reference to a \textit{macro} called \texttt{T\_ah} which
contains a matrix shared by both models.  When re-estimating tied parameters,
the data which would have been used for each of the original untied parameters
is pooled so that a much more reliable estimate can be obtained.

Of course, tying could affect performance if performed indiscriminately.
Hence, it is important to only tie parameters which have little effect on
discrimination.  This is the case here where the transition parameters do not
vary significantly with acoustic context but nevertheless need to be estimated
accurately.  Some triphones will occur only once or twice and so very poor
estimates would be obtained if tying was not done.  These problems of data
insufficiency will affect the output distributions too, but this will be dealt
with in the next step.

Hitherto, all HMMs have been stored in text format and could be inspected like
any text file.  Now however, the model files will be getting larger and space
and load/store times become an issue.
For increased efficiency,
\HTK\ can store and load MMFs in binary\index{HMM!binary storage}
format.  Setting the standard \texttt{-B} option causes this to happen.

\sidefig{step9}{55}{Step 9}{-4}{Once the context-dependent models have been cloned, the new triphone set can be
re-estimated using \htool{HERest}.  This is done as previously except that the
monophone model list is replaced by a triphone list and the triphone
transcriptions are used in place of the monophone transcriptions.
For the final pass of \htool{HERest}, the \texttt{-s} option should be used to
generate a file of state occupation statistics called \texttt{stats}.  In
combination with the means and variances, these enable likelihoods to be
calculated for clusters of states and are needed during the state-clustering
process \index{statistics!state occupation} described below.
Fig.~\href{f:step9} illustrates this step of the HMM construction
procedure. Re-estimation should again be done twice, so that the resultant
model sets will ultimately be saved in \texttt{hmm12}.  }
\begin{verbatim}
   HERest -B -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
    -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1
\end{verbatim}

\subsection{Step 10 - Making Tied-State Triphones}

The outcome of the previous stage is a set of triphone HMMs with all triphones
in a phone set sharing the same transition matrix.  When estimating these
models, many of the variances in the output distributions
will have been floored since there will be
\index{variance!flooring problems}\index{state tying}\index{tying!states}\index{data insufficiency}
insufficient data associated with many of the states.  The last step in
the model building process is to tie states within triphone sets
in order to share data and thus be able to make robust parameter estimates.

In the previous step, the \texttt{TI} command was used to
explicitly tie all members of a set of transition matrices together.
However,
the choice of which states to tie requires a bit more subtlety since
the performance of the recogniser depends crucially on how accurately
the state output distributions capture the statistics of the speech data.

\htool{HHEd} provides two mechanisms which allow states to be clustered and
\index{state clustering}
then each cluster tied.  The first is data-driven and uses a similarity
measure between states.  The second uses decision trees\index{decision trees}
and is based on asking questions about the left and right contexts of each
triphone.  The decision tree attempts to find those contexts which make the largest
difference to the acoustics and which should therefore distinguish clusters.

Decision tree state tying is performed by running \htool{HHEd} in the normal way, i.e.
\begin{verbatim}
   HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
        tree.hed triphones1 > log
\end{verbatim}
Notice that the output is saved in a log file.  This is important since
some tuning of thresholds is usually needed.

The edit script \texttt{tree.hed}, which contains the instructions regarding
which contexts to examine for possible clustering, can be rather long and
complex. A script for automatically generating this file, \texttt{mkclscript},
is found in the RM Demo. A version of the \texttt{tree.hed} script, which can
be used with this tutorial, is included in the \texttt{HTKTutorial} directory.
Note that this script is only capable of creating the \texttt{TB} commands
(decision tree clustering of states).  The questions (\texttt{QS}) still need defining by
the user.  There is, however, an example list of questions which may be
suitable to some tasks (or at least useful as an example) supplied with
the RM demo (\texttt{lib/quests.hed}).  The entire script appropriate for clustering
English phone models is too long to show here in the text; however, its main
components are given by the following fragments: