   HInit -S trainlist -H globals -M dir1 -l ih -L labs proto
   mv dir1/proto dir1/ih
\end{verbatim}
where the option \texttt{-l} defines the name of the sub-word model, and the file
\texttt{trainlist} is assumed to hold
\begin{verbatim}
   data/tr1.mfc
   data/tr2.mfc
   data/tr3.mfc
   data/tr4.mfc
   data/tr5.mfc
   data/tr6.mfc
\end{verbatim}
In this case, \htool{HInit} will first try to find label\index{model training!sub-word initialisation}
files corresponding to each data file. In the example here, the standard
\texttt{-L}\index{standard options!aaal@\texttt{-L}} option indicates that they are
stored in a directory called \texttt{labs}. As an alternative, they
could be stored in a Master Label File\index{master label files} (MLF) and loaded
via the standard option \texttt{-I}.
Once the label files have been loaded, each data file is scanned and all segments
corresponding to the label \texttt{ih} are loaded. Figure~\href{f:hinitdp}
illustrates this process.

All \HTK\ tools support the \texttt{-T}\index{standard options!aaat@\texttt{-T}} trace
option and although the details of tracing vary from tool to tool, setting the least
significant bit (e.g.\ by \texttt{-T 1}) causes all tools to output top level
progress information. In the case
of \htool{HInit}, this information includes the log likelihood at each iteration and hence
it is very useful for monitoring convergence\index{monitoring convergence}. For example,
enabling top level tracing
in the previous example might result in the following being output
\begin{verbatim}
 Initialising HMM proto . . .

  States    : 2 3 4 (width)
  Mixes  s1 : 1 1 1 ( 26 )
  Num Using : 0 0 0
  Parm Kind : MFCC_E_D
  Number of owners = 1
  SegLab    : ih
  maxIter   : 20
  epsilon   : 0.000100
  minSeg    : 3
  Updating  : Means Variances MixWeights/DProbs TransProbs

 16 Observation Sequences Loaded

 Starting Estimation Process
  Iteration 1: Average LogP = -898.24976
  Iteration 2: Average LogP = -884.05402  Change = 14.19574
  Iteration 3: Average LogP = -883.22119  Change =  0.83282
  Iteration 4: Average LogP = -882.84381  Change =  0.37738
  Iteration 5: Average LogP = -882.76526  Change =  0.07855
  Iteration 6: Average LogP = -882.76526  Change =  0.00000
 Estimation converged at iteration 7
 Output written to directory :dir1:
\end{verbatim}
The first part summarises the structure of the HMM; in this case, the data is
single stream MFCC coefficients with energy and deltas appended. The HMM has
3 emitting states, each with a single Gaussian, and the stream width is 26. The current
option settings are then given, followed by the convergence information. In this
example, convergence was reached after 6 iterations; however, if the \texttt{maxIter}
limit were reached, the process would terminate regardless.

\htool{HInit} provides a variety of command line options for controlling its detailed
behaviour.\index{model training!update control}
The types of parameter estimated by \htool{HInit} can be controlled
using the \texttt{-u} option; for example, \texttt{-u mtw} would update the means,
transition matrices and
mixture component weights but would leave the variances untouched. A variance
floor\index{variance floors}
can be applied using the \texttt{-v} option to prevent any variance becoming too small. This
option applies the same variance floor to all speech vector elements. More precise
control can be obtained by specifying a variance macro (i.e.\ a \verb|~v| macro)
called \texttt{varFloor1}\index{varfloorn@\texttt{varFloorN}} for stream 1,
\texttt{varFloor2} for stream 2, etc. Each element of these variance vectors then
defines a floor for the corresponding HMM variance components.
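As an illustrative sketch only (the floor values shown are placeholders, and the vector
length of 26 simply matches the stream width of the example above), such a macro might
look like
\begin{verbatim}
   ~v "varFloor1"
   <Variance> 26
    1.0e-02 1.0e-02 1.0e-02 ...   (26 values, one floor per vector element)
\end{verbatim}
It would typically be stored in a macro file loaded via the \texttt{-H} option so that
\htool{HInit} can find it.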
The full list of options supported by \htool{HInit} is described in the \refpart.

\mysect{Flat Starting with \htool{HCompV}}{flatstart}

One limitation of using \htool{HInit} for the initialisation
of sub-word models is that it requires labelled training data.
For cases where this is not readily available, an alternative
initialisation strategy is to make all models equal initially and
move straight to embedded training using \htool{HERest}. The
idea behind this so-called \textit{flat start} training is similar to the
uniform segmentation strategy adopted by \htool{HInit} since, by making
all states of all models equal, the first iteration of embedded training
will effectively rely on a uniform segmentation of the data.

\centrefig{flatst}{90}{Flat Start Initialisation}

Flat start\index{flat start} initialisation is provided by the \HTK\ tool \htool{HCompV}
whose operation
is illustrated by Fig~\href{f:flatst}. The input/output of HMM definition files
and training files in \htool{HCompV}\index{hcompv@\htool{HCompV}} works in exactly the same
way as described above for
\htool{HInit}. It reads in a prototype HMM definition and some training data
and outputs a new definition in which every mean and covariance is equal to
the global speech mean and covariance. Thus, for example, the following
command would read a prototype definition called \texttt{proto}, read in all speech
vectors from \texttt{data1}, \texttt{data2}, \texttt{data3}, etc., compute the global
mean and covariance
and write out a new version of \texttt{proto} in \texttt{dir1} with this mean and
covariance.
\begin{verbatim}
   HCompV -m -H globals -M dir1 proto data1 data2 data3 ...
\end{verbatim}
The default operation of \htool{HCompV} is to update only the covariances of the HMM
and leave the means unchanged. The use of the \texttt{-m} option above causes the
means to be updated too. This apparently curious default behaviour arises because
\htool{HCompV} is also used to initialise the variances in so-called
\textit{Fixed-Variance} HMMs. These are HMMs initialised in the normal way except
that all covariances are set equal to the global speech covariance and never
subsequently changed.\index{fixed-variance}

Finally, it should be noted that \htool{HCompV} can also be used to generate
variance floor macros by using the \texttt{-f} option.\index{variance floor macros!generating}
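For example, a command of the following form might be used; here the floor factor
\texttt{0.01} is purely illustrative and the remaining arguments follow the flat start
example above. The generated macros are set to the given fraction of the global variance
and written to the output directory (typically to a file called \texttt{vFloors}).
\begin{verbatim}
   HCompV -f 0.01 -m -H globals -M dir1 proto data1 data2 data3 ...
\end{verbatim}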
\mysect{Isolated Unit Re-Estimation using \htool{HRest}}{resthmm}

\sidefig{restloop}{60}{\htool{HRest} Operation}{2}{
\htool{HRest} is the final tool in the set
designed to manipulate isolated unit HMMs. Its operation is very similar to
\htool{HInit} except that, as shown in Fig~\href{f:restloop}, it expects the input
HMM definition to have been initialised and it uses Baum-Welch
re-estimation\index{Baum-Welch re-estimation!isolated unit} in place
of Viterbi training. This involves finding the probability of being in each
state at each time frame using the \textit{Forward-Backward} algorithm.
This probability is then used to form weighted averages
for the HMM parameters. Thus, whereas Viterbi training makes a hard decision
as to which state each training vector was ``generated'' by, Baum-Welch takes a
soft decision. This can be helpful when estimating phone-based HMMs
since there are no hard boundaries between phones in real speech and using
a soft decision may give better results.\index{forward-backward!isolated unit}
The mathematical details of the Baum-Welch re-estimation
process are given below in section~\ref{s:bwformulae}.

\htool{HRest} is usually applied directly to the models generated by \htool{HInit}. Hence,
for example, the generation of a sub-word model for the phone \texttt{ih} begun
in section~\ref{s:inithmm}
would be continued by executing the following command}
\begin{verbatim}
   HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih
\end{verbatim}
This will load the HMM definition for \texttt{ih} from \texttt{dir1},
re-estimate the parameters using the speech segments labelled with \texttt{ih}
and write the new definition to directory \texttt{dir2}.

If \htool{HRest} is used to build models with a large number of mixture components per state,
a strategy must be chosen for dealing with \textit{defunct mixture components}.
These are mixture components which have very little associated training data and,
as a consequence, either the variances or the corresponding mixture weight becomes
very small. If either of these events happens, the mixture component is effectively
deleted and, provided that at least one component in that state is left, a warning
is issued. If this behaviour is not desired, then the variance can be floored as
described previously using the \texttt{-v} option (or a variance floor macro)
and/or the mixture weight can be floored using the \texttt{-w} option.
\index{defunct mixture components}

Finally, a problem which can arise when
using \htool{HRest} to initialise sub-word models is that of over-short training
segments\index{over-short training segments}.
By default, \htool{HRest} ignores all training examples which have fewer frames than
the model has emitting states. For example, suppose that a particular phone with
3 emitting states had only a few training
examples with more than 2 frames of data. In this case, there would be two
solutions. Firstly, the number of emitting states could be reduced. Since
\HTK\ does not require all models to have the same number of states,
this is perfectly feasible.
Alternatively, some skip transitions could be added and the default
reject mechanism disabled by setting the \texttt{-t} option.
Note here that \htool{HInit} has the same reject mechanism and suffers
from the same problems. \htool{HInit}, however, does not allow
the reject mechanism to be suppressed since the uniform segmentation
process would otherwise fail.
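Purely as an illustrative sketch combining these options (the specific floor values
are arbitrary), the earlier \htool{HRest} command could be extended to floor the variances
and mixture weights and to disable short segment rejection as follows
\begin{verbatim}
   HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs \
         -v 0.0001 -w 2.0 -t dir1/ih
\end{verbatim}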
\mysect{Embedded Training using \htool{HERest}}{eresthmm}

\index{model training!embedded}
Whereas isolated unit training is sufficient for building whole word
models and initialisation of models using hand-labelled \textit{bootstrap}
data, the main HMM training procedures for building sub-word systems
revolve around the concept of \textit{embedded training}. Unlike the
processes described so far, embedded training\index{embedded training} simultaneously
updates all
of the HMMs in a system using all of the training data. It is performed by
\htool{HERest}\index{herest@\htool{HERest}} which, unlike \htool{HRest}, performs just a
single iteration.
\index{Baum-Welch re-estimation!embedded unit}

In outline, \htool{HERest} works as follows. On startup, \htool{HERest} loads in a complete
set of HMM definitions. Every training file must have an associated
label file which gives a transcription for that file. Only the
sequence of labels is used by \htool{HERest}, however, and any boundary location
information is ignored. Thus, these transcriptions can be generated
automatically from the known orthography of what was said and a pronunciation
dictionary.
\htool{HERest} processes each training file in turn. After loading it into memory,
it uses the associated transcription to construct a composite HMM which spans the
whole utterance.
This composite HMM is made by concatenating instances of the phone HMMs
corresponding to each label in the transcription. The Forward-Backward
algorithm is then applied and the sums needed to form the weighted
averages are accumulated in the normal way. When all of the training
files have been processed, the new parameter estimates are formed
from the weighted sums and the updated HMM set is output.\index{forward-backward!embedded}
The mathematical details of embedded Baum-Welch re-estimation
are given below in section~\ref{s:bwformulae}.

In order to use \htool{HERest}, it is first necessary to construct a file containing a list
of all HMMs in the model set with each model name written on a separate line.
The names of the models in this\index{HMM lists}
list must correspond to the labels used in the transcriptions and there
must be a corresponding model for every distinct transcription label.
\htool{HERest} is typically invoked by a command line of the form
\begin{verbatim}
   HERest -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist
\end{verbatim}
where \texttt{hmmlist} contains the list of HMMs. On startup, \htool{HERest} will load
the HMM master macro file (MMF) \texttt{hmacs} (there may be
several of these). It then searches for a definition for each
HMM listed in \texttt{hmmlist}; if any HMM name is not found, it attempts
to open a file of the same name in the current directory
(or a directory designated by the \texttt{-d} option).
Usually in large sub-word systems, however, all of the HMM definitions
will be stored in MMFs. Similarly, all of the required transcriptions
will be stored in one or more Master Label Files\index{master label files} (MLFs), and in the
example, they are stored in the single MLF called \texttt{labs}.

\centrefig{herestdp}{90}{File Processing in \htool{HERest}}

Once all MMFs and MLFs have been loaded, \htool{HERest} processes each file in the
\texttt{trainlist}, and accumulates the required statistics as described
above. On completion, an updated MMF is output to the directory
\texttt{dir2}. If a second iteration is required, then \htool{HERest} is reinvoked
reading in the MMF from \texttt{dir2} and outputting a new one to \texttt{dir3}, and so on.
This process is illustrated by Fig~\href{f:herestdp}.
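Repeated re-estimation is therefore just a matter of reinvoking the tool with the
directories advanced each time. A minimal sketch of three such cycles (assuming the
output MMF retains the name \texttt{hmacs} in each new directory) would be
\begin{verbatim}
   HERest -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist
   HERest -S trainlist -I labs -H dir2/hmacs -M dir3 hmmlist
   HERest -S trainlist -I labs -H dir3/hmacs -M dir4 hmmlist
\end{verbatim}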
When performing embedded training, it is good practice to
monitor the performance of the models on unseen test data
and stop training when no further improvement is obtained. Enabling
top level tracing by setting \texttt{-T 1} will cause \htool{HERest} to
output the overall log likelihood per frame of the training data.
This measure could be used as a termination condition for
repeated application of \htool{HERest}. However, repeated re-estimation to
convergence\index{monitoring convergence} may take an impossibly long time.
Worse still, it can lead to over-training since the models can become too
closely matched to the training data and fail to generalise well on unseen
test data. Hence, in practice, around 2 to 5 cycles of embedded
re-estimation are normally sufficient when training phone models.

In order to get accurate acoustic models, a large amount of training
data is needed. Several hundred
utterances are needed for speaker dependent recognition and
several thousand are needed for
speaker independent recognition. In the latter case, a single
iteration of embedded training
might take several hours to compute. There are two mechanisms for speeding up
this computation. Firstly, \htool{HERest} has a pruning\index{model training!pruning}
mechanism
incorporated into its forward-backward computation. \htool{HERest} calculates
the backward probabilities $\beta_j(t)$ first and then the forward probabilities
$\alpha_j(t)$.
The full computation of these probabilities for all values of state $j$
and time $t$ is unnecessary since many of these combinations will be highly
improbable. On the forward pass, \htool{HERest} restricts the computation of
the $\alpha$ values to just those for which the total log likelihood, as determined
by the product $\alpha_j(t)\beta_j(t)$, is
within a fixed distance from the total likelihood $P(\bm{O}|M)$. This
pruning is always enabled since it is completely safe and causes no loss
of modelling accuracy. Pruning on the backward pass is also possible.
However, in this case, the likelihood product $\alpha_j(t)\beta_j(t)$
is unavailable since $\alpha_j(t)$ has yet to be computed, and hence a much
broader {\it beam} must be set to
avoid pruning errors. Pruning on the backward path is therefore under
user control. It is set using the \texttt{-t} option which has two forms.
In the simplest case, a fixed pruning beam is set. For example,
using \texttt{-t 250.0} would set a fixed beam of 250.0. This method
is adequate when there is sufficient compute time available to use a
generously wide beam. When a narrower beam is used, \htool{HERest} will reject
any utterance for which the beam proves to be too narrow.
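For example, adding a fixed beam to the earlier command line would give the following;
the value 250.0 is simply the beam width quoted above and would normally be chosen to
suit the available compute time.
\begin{verbatim}
   HERest -t 250.0 -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist
\end{verbatim}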