To train a set of HMMs, every file of training data must have an associated
phone level transcription. Since there is no hand labelled data to bootstrap a
set of models, a flat-start scheme will be used instead. To do this, two sets
of phone transcriptions will be needed. The set used initially will have no
short-pause (\texttt{sp}) models between words. Then once reasonable phone
models have been generated, an \texttt{sp} model will be inserted between words
to take care of any pauses introduced by the speaker.\index{flat start}

The starting point for both sets of phone transcription is an
orthographic\index{transcription!orthographic} transcription in \HTK\ label
format. This can be created fairly easily using a text editor or a scripting
language. An example of this is found in the RM Demo at point 0.4.
Alternatively, the script \texttt{prompts2mlf} has been provided in the
\texttt{HTKTutorial} directory. The effect should be to convert the prompt
utterances shown above into the following form:
\begin{verbatim}
    #!MLF!#
    "*/S0001.lab"
    ONE
    VALIDATED
    ACTS
    OF
    SCHOOL
    DISTRICTS
    .
    "*/S0002.lab"
    TWO
    OTHER
    CASES
    ALSO
    WERE
    UNDER
    ADVISEMENT
    .
    "*/S0003.lab"
    BOTH
    FIGURES
    (etc.)
\end{verbatim}
As can be seen, the prompt labels need to be converted into path names, each
word should be written on its own line and each utterance should be terminated
by a single period on its own. The first line of the file just identifies the
file as a \textit{Master Label File} (MLF). This is a single file containing a
complete set of transcriptions. \HTK\ allows each individual transcription to
be stored in its own file but it is more efficient to use an MLF.
\index{master label files}\index{MLF}

The form of the path name used in the MLF deserves some explanation since it is
really a \textit{pattern} and not a name.\index{master label files!patterns}
When \HTK\ processes speech files, it expects to find a transcription
(or {\it label file}) with the same name but a different extension.
Thus, if the file
\texttt{/root/sjy/data/S0001.wav} was being processed, \HTK\ would look for a
label file called \texttt{/root/sjy/data/S0001.lab}. When MLF files are used,
\HTK\ scans the file for a pattern which matches the required label file name.
However, an asterisk will match any character string and hence the pattern used
in the example is in effect path independent. It therefore allows the same
transcriptions to be used with different versions of the speech data stored in
different locations.

Once the word level MLF has been created, phone level MLFs can be generated
using the label editor \htool{HLEd}\index{hled@\htool{HLEd}}. For example,
assuming that the above word level MLF is stored in the file
\texttt{words.mlf}, the command
\begin{verbatim}
    HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf
\end{verbatim}
will generate a phone level transcription of the following form
where the \texttt{-l} option is needed to generate the path '\verb+*+' in the
output patterns.
\begin{verbatim}
    #!MLF!#
    "*/S0001.lab"
    sil
    w
    ah
    n
    v
    ae
    l
    ih
    d
    .. etc
\end{verbatim}
This process is illustrated in Fig.~\href{f:step4}.

The \htool{HLEd} edit script \texttt{mkphones0.led} contains the following
commands
\begin{verbatim}
    EX
    IS sil sil
    DE sp
\end{verbatim}
The expand \texttt{EX} command replaces each word in \texttt{words.mlf} by the
corresponding pronunciation in the dictionary file \texttt{dict}. The
\texttt{IS} command inserts a silence model \texttt{sil} at the start and end of
every utterance. Finally, the delete \texttt{DE} command deletes all
short-pause \texttt{sp} labels, which are not wanted in the transcription
labels at this point.

\centrefig{step4}{60}{Step 4}

\subsection{Step 5 - Coding the Data}

The final stage of data preparation is to parameterise the raw speech
waveforms into sequences of feature vectors. \HTK\ supports both
FFT-based\index{analysis!FFT-based} and
LPC-based\index{analysis!LPC-based} analysis.
Here Mel Frequency Cepstral Coefficients (MFCCs)\index{MFCC coefficients},
which are derived from FFT-based log spectra, will be used.

Coding can be performed using the tool
\htool{HCopy}\index{hcopy@\htool{HCopy}} configured to\index{coding}
automatically convert its input into MFCC vectors. To do this, a configuration
file (\texttt{config}) is needed which specifies all of the conversion
parameters\index{parameterisation}. Reasonable settings for these are as follows
\begin{verbatim}
    # Coding parameters
    TARGETKIND = MFCC_0
    TARGETRATE = 100000.0
    SAVECOMPRESSED = T
    SAVEWITHCRC = T
    WINDOWSIZE = 250000.0
    USEHAMMING = T
    PREEMCOEF = 0.97
    NUMCHANS = 26
    CEPLIFTER = 22
    NUMCEPS = 12
    ENORMALISE = F
\end{verbatim}
Some of these settings are in fact the default setting, but they
are given explicitly here for completeness. In brief, they specify
that the target parameters are to be MFCC using $C_0$ as the energy
component, the frame period is 10msec (\HTK\ uses units of 100ns),
the output should be saved in compressed format, and a CRC checksum should
be added. The FFT should use a Hamming window and the signal should
have first order preemphasis applied using a coefficient of 0.97.
The filterbank should have 26 channels and 12 MFCC coefficients should
be output. The variable \texttt{ENORMALISE} is by default true and performs
energy normalisation on recorded audio files. It cannot be used with live audio
and since the target system is for live audio, this variable should be set to
false.

Note that explicitly creating coded data files is not necessary, as coding can
be done ``on-the-fly'' from the original waveform files by specifying the
appropriate configuration file (as above) with the relevant \HTK\ tools.
However, creating these files reduces the amount of preprocessing required
during training, which itself can be a time-consuming process.

To run \htool{HCopy}, a list of
each source file and its corresponding output file is needed.
For example,
the first few lines might look like\index{extensions!mfc@\texttt{mfc}}
\begin{verbatim}
    /root/sjy/waves/S0001.wav  /root/sjy/train/S0001.mfc
    /root/sjy/waves/S0002.wav  /root/sjy/train/S0002.mfc
    /root/sjy/waves/S0003.wav  /root/sjy/train/S0003.mfc
    /root/sjy/waves/S0004.wav  /root/sjy/train/S0004.mfc
    (etc.)
\end{verbatim}
Files containing lists of files are referred to as script
files\footnote{Not to be confused with files containing \textit{edit} scripts}
and\index{extensions!scp@\texttt{scp}}
by convention are given the extension \texttt{scp} (although \HTK\ does not
demand this). Script files are specified using the standard
\texttt{-S} option and their contents are read simply as extensions
to the command line. Thus, they avoid the need for command lines with
several thousand arguments\footnote{
Most UNIX shells, especially the C shell, only allow a limited and
quite small number of arguments.}.
\index{command line!arguments}
\index{command line!script files}

\centrefig{step5}{100}{Step 5}

\noindent
Assuming that the above script is stored in the file \texttt{codetr.scp},
the training data would be coded by executing
\begin{verbatim}
    HCopy -T 1 -C config -S codetr.scp
\end{verbatim}
This is illustrated in Fig.~\href{f:step5}. A similar procedure is
used to code the test data (using \verb|TARGETKIND = MFCC_0_D_A| in
config) after which all of the pieces are in place to start training
the HMMs.

\mysect{Creating Monophone HMMs}{egcreatmono}

In this section, the creation of a well-trained set of single-Gaussian
monophone HMMs will be described. The starting point will be
a set of identical monophone HMMs in which every mean and variance is
identical. These are then retrained, short-pause models are
added and the silence model is extended slightly. The monophones
are then retrained.

Some of the dictionary entries have multiple pronunciations.
However,
when \htool{HLEd} was used to expand the word level MLF to create the
phone level MLFs, it arbitrarily selected the first pronunciation it found.
Once reasonable monophone HMMs have been created, the recogniser tool
\htool{HVite} can be used to perform a \textit{forced alignment} of
\index{forced alignment}the training data. By this means, a new phone level
MLF is created in which the choice of pronunciations depends on the acoustic
evidence. This new MLF can be used to perform a final re-estimation of the
monophone HMMs.\index{monophone HMM!construction of}

\subsection{Step 6 - Creating Flat Start Monophones}

The first step in HMM training is to define a prototype model. The
parameters of this model are not important; its purpose is to
define the model topology. For phone-based systems, a good
topology to use is 3-state left-right with no skips such as the following
\begin{verbatim}
    ~o <VecSize> 39 <MFCC_0_D_A>
    ~h "proto"
    <BeginHMM>
      <NumStates> 5
      <State> 2
        <Mean> 39
          0.0 0.0 0.0 ...
        <Variance> 39
          1.0 1.0 1.0 ...
      <State> 3
        <Mean> 39
          0.0 0.0 0.0 ...
        <Variance> 39
          1.0 1.0 1.0 ...
      <State> 4
        <Mean> 39
          0.0 0.0 0.0 ...
        <Variance> 39
          1.0 1.0 1.0 ...
      <TransP> 5
        0.0 1.0 0.0 0.0 0.0
        0.0 0.6 0.4 0.0 0.0
        0.0 0.0 0.6 0.4 0.0
        0.0 0.0 0.0 0.7 0.3
        0.0 0.0 0.0 0.0 0.0
    <EndHMM>
\end{verbatim}
where each elided vector is of length 39.
This number, 39, is computed from
the length of the parameterised static vector (\texttt{MFCC\_0} = 13) plus
the delta coefficients (+13) plus the acceleration coefficients (+13).

The \HTK\ tool \htool{HCompV}\index{hcompv@\htool{HCompV}} will scan a set of
data files, compute the global mean and variance and set all of the Gaussians
in a given HMM to have the same mean and variance.\index{flat start}
Hence, assuming that a list of all the training files is stored in
\texttt{train.scp}, the command
\begin{verbatim}
    HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto
\end{verbatim}
will create a new version of \texttt{proto} in the directory \texttt{hmm0}
in which the zero means and unit variances above have been replaced
by the global speech means and variances.
Note that the prototype HMM defines the parameter kind as
\texttt{MFCC\_0\_D\_A} (Note: 'zero' not 'oh').
This means that delta and acceleration coefficients are to be computed and
appended to the static MFCC coefficients computed and stored during the
coding process described above. To ensure that these are computed during
loading, the configuration file \texttt{config} should be modified
to change the target kind, i.e.\ the configuration file entry for
\texttt{TARGETKIND} should be changed to
\begin{verbatim}
    TARGETKIND = MFCC_0_D_A
\end{verbatim}

\htool{HCompV} has a number of options specified for it. The \texttt{-f}
option causes a variance floor macro\index{variance floor macros}
(called \texttt{vFloors}) to be generated which
is equal to 0.01 times the global variance. This is a vector
of values which will be used to set a floor on the variances estimated
in the subsequent steps. The \texttt{-m} option asks for means to be computed
as well as variances.
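
As a rough illustration of the arithmetic behind the \texttt{-f 0.01 -m}
options, the following Python sketch computes a global mean and variance over a
set of feature vectors and scales the variance to obtain floor values. It is
not \htool{HCompV} itself: the function name \texttt{global\_stats} and its
arguments are hypothetical, the frames are assumed to be already loaded as
lists of floats, and the variance is assumed to be normalised by the number of
frames, which the text above does not specify.

```python
def global_stats(frames, floor_scale=0.01):
    """Return (mean, variance, floors) computed over all frames.

    frames      -- list of equal-length feature vectors (lists of floats);
                   hypothetical in-memory stand-in for the coded data files
    floor_scale -- factor applied to the global variance (0.01 mirrors -f 0.01)
    """
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over every frame of every file.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Per-dimension variance; dividing by n is an assumption of this sketch.
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    # The floor vector is simply the scaled global variance.
    floors = [floor_scale * v for v in var]
    return mean, var, floors


if __name__ == "__main__":
    # Two toy 2-dimensional frames; real vectors would have 39 components.
    toy = [[0.0, 2.0], [2.0, 6.0]]
    mean, var, floors = global_stats(toy)
    print(mean, var, floors)
```

In a flat start every state of every model receives this same mean and
variance, so the initial models differ only in name.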
Given this
new prototype model stored in the directory
\texttt{hmm0}, a \textit{Master Macro File}\index{master macro files} (MMF)
called \texttt{hmmdefs} \index{MMF}containing a copy for each of the required
monophone HMMs is constructed by manually copying the prototype and relabeling
it for each required monophone (including ``sil''). The format of an MMF is
similar to that of an MLF and it serves a similar purpose in that it avoids
having a large number of individual HMM definition
files\index{HMM!definition files} (see Fig.~\href{f:MMFeg}).

\centrefig{MMFeg}{85}{Form of Master Macro Files}

The flat start monophones stored in the directory \texttt{hmm0} are
re-estimated using the embedded
re-estimation\index{embedded re-estimation} tool
\htool{HERest}\index{herest@\htool{HERest}} invoked as follows
\begin{verbatim}
    HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
        -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0
\end{verbatim}
The effect of this is to load all the models in \texttt{hmm0} which are
listed in the model list \texttt{monophones0} (\texttt{monophones1} less the
short pause (\texttt{sp}) model). These are then re-estimated using the data
listed in \texttt{train.scp} and the new model set is stored in the
directory \texttt{hmm1}.
Most of the files used in this invocation of \htool{HERest} have already been
described. The exception is the file \texttt{macros}.
This should contain a so-called \textit{global options} macro and
the variance floor macro \texttt{vFloors} generated earlier. The global
options macro simply defines the HMM parameter kind and the vector size i.e.
\begin{verbatim}
    ~o <MFCC_0_D_A> <VecSize> 39
\end{verbatim}
See Fig.~\href{f:MMFeg}. This can be combined with \texttt{vFloors} into a
text file called \texttt{macros}.
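
The manual copying described above can also be scripted. The following Python
sketch shows one possible way to build the contents of \texttt{hmmdefs} and
\texttt{macros} from the text of the re-estimated prototype. The function names
\texttt{build\_hmmdefs} and \texttt{build\_macros} are hypothetical, and the
sketch assumes that the prototype text consists of header lines (the
\verb|~o| global options) followed by a single \verb|~h "proto"| definition,
as in the example prototype shown earlier.

```python
def _split_proto(proto_text):
    """Split prototype text into (header_lines, body_lines) around the ~h line."""
    lines = proto_text.strip().splitlines()
    # Assumption: exactly one ~h line marks the start of the HMM definition.
    start = next(i for i, line in enumerate(lines)
                 if line.lstrip().startswith("~h"))
    return lines[:start], lines[start + 1:]


def build_hmmdefs(proto_text, phones):
    """Return MMF text holding one renamed copy of the prototype per phone."""
    _, body = _split_proto(proto_text)
    out = []
    for phone in phones:
        out.append('~h "%s"' % phone)  # relabel the copy
        out.extend(body)               # <BeginHMM> ... <EndHMM> unchanged
    return "\n".join(out) + "\n"


def build_macros(proto_text, vfloors_text):
    """Return macros text: the global options header followed by vFloors."""
    header, _ = _split_proto(proto_text)
    return "\n".join(header) + "\n" + vfloors_text.strip() + "\n"


if __name__ == "__main__":
    # File names follow the tutorial; reading them this way is an assumption.
    with open("hmm0/proto") as f:
        proto = f.read()
    with open("hmm0/vFloors") as f:
        vfloors = f.read()
    with open("monophones0") as f:
        phones = f.read().split()
    with open("hmm0/hmmdefs", "w") as f:
        f.write(build_hmmdefs(proto, phones))
    with open("hmm0/macros", "w") as f:
        f.write(build_macros(proto, vfloors))
```

Keeping the header out of \texttt{hmmdefs} and placing it in \texttt{macros}
matches the way the two files are loaded separately with \texttt{-H} in the
\htool{HERest} command above.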