%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                    */
%/*          1995-2000 Redmond, Washington USA                  */
%/*                    http://www.microsoft.com                 */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **     This banner notice must not be removed      **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 1/12/97
%
\mychap{HMM Parameter Estimation}{Training}

\sidepic{Tool.train}{80}{
In chapter~\ref{c:HMMDefs} the various types of HMM were described
and the way in which they are represented within \HTK\ was explained.
Defining the structure and overall form of a set of HMMs is the first
step towards building a recogniser. The second step is to estimate
the parameters of the HMMs from examples of the data sequences that
they are intended to model. This process of parameter
estimation\index{parameter estimation} is usually called
\textit{training}.

\HTK\ supplies four basic tools for parameter estimation:
\htool{HCompV}, \htool{HInit}, \htool{HRest} and \htool{HERest}.
\htool{HCompV} and \htool{HInit} are used for initialisation.
\htool{HCompV} will set the mean and variance of every Gaussian
component in a HMM definition to be equal to the global mean and
variance of the speech training data. This is typically used as an
initialisation stage for \textit{flat-start} training.
Alternatively, a more detailed initialisation is possible using
\htool{HInit} which will compute the parameters of a new HMM using a
Viterbi style of estimation.
}

\htool{HRest} and \htool{HERest} are used to refine the parameters of
existing HMMs using Baum-Welch re-estimation. Like \htool{HInit},
\htool{HRest} performs \textit{isolated-unit} training whereas
\htool{HERest} operates on complete model sets and performs
\textit{embedded-unit} training. In general, whole word HMMs are
built using \htool{HInit} and \htool{HRest}, and continuous speech
sub-word based systems are built using \htool{HERest} initialised by
either \htool{HCompV} or \htool{HInit} and \htool{HRest}.

This chapter describes these training tools and their use for
estimating the parameters of plain (i.e.\ untied) continuous density
HMMs. The use of tying and special cases such as tied-mixture HMM
sets and discrete probability HMMs are dealt with in later chapters.
The first section of this chapter gives an overview of the various
training strategies possible with \HTK. This is then followed by
sections covering initialisation, isolated-unit training, and
embedded training. The chapter concludes with a section detailing
the various formulae used by the training tools.

\mysect{Training Strategies}{tstrats}

As indicated in the introduction above, the basic operation of the
\HTK\ training tools involves reading in a set of one or more HMM
definitions, and then using speech data to estimate the parameters of
these definitions. The speech data files are normally stored in
parameterised form such as \texttt{LPC} or \texttt{MFCC} parameters.
However, additional parameters such as delta coefficients are
normally computed \textit{on-the-fly} whilst loading each file.

\sidefig{isoword}{62}{Isolated Word Training}{-4}{
In fact, it is also possible to use waveform data directly by
performing the full parameter conversion \textit{on-the-fly}.
Which approach is preferred depends on the available computing
resources. The advantages of storing the data already encoded are
that the data is more compact in parameterised form and pre-encoding
avoids wasting compute time converting the data each time that it is
read in. However, if the training data is derived from CD-ROMs and
they can be accessed automatically on-line, then the extra compute
may be worth the saving in magnetic disk
storage.\index{isolated word training}

The methods for configuring speech data input to \HTK\ tools were
described in detail in chapter~\ref{c:speechio}. All of the various
input mechanisms are supported by the \HTK\ training tools except
direct audio input. The precise way in which the training tools are
used depends on the type of HMM system to be built and the form of
the available training data. Furthermore, \HTK\ tools are designed
to interface cleanly to each other, so a large number of
configurations are possible. In practice, however, HMM-based speech
recognisers are either whole-word or sub-word.}
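As noted above, waveform data can be coded on-the-fly. Doing so
simply requires the coding to be specified in a configuration file.
The following sketch shows the general idea using the standard
configuration variables of chapter~\ref{c:speechio}; the particular
parameter kind and values shown are illustrative only and would need
to match the front-end actually required:
\begin{verbatim}
 # code waveform files to MFCC parameters on-the-fly
 SOURCEKIND   = WAVEFORM
 SOURCEFORMAT = HTK
 TARGETKIND   = MFCC_E_D
 TARGETRATE   = 100000.0
 WINDOWSIZE   = 250000.0
\end{verbatim}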
As the name suggests, whole word
modelling\index{whole word modelling} refers to a technique whereby
each individual word in the system vocabulary is modelled by a single
HMM. As shown in Fig.~\href{f:isoword}, whole word HMMs are most
commonly trained on examples of each word spoken in isolation. If
these training examples, which are often called \textit{tokens}, have
had leading and trailing silence removed, then they can be input
directly into the training tools without the need for any label
information. The most common method of building whole word HMMs is
to firstly use \htool{HInit}\index{hinit@\htool{HInit}} to calculate
initial parameters for the model and then use
\htool{HRest}\index{hrest@\htool{HRest}} to refine the parameters
using Baum-Welch re-estimation. Where there is limited training data
and recognition in adverse noise environments is needed, so-called
{\it fixed variance} models can offer improved robustness. These are
models in which all the variances are set equal to the global speech
variance\index{global speech variance} and never subsequently
re-estimated. The tool
\htool{HCompV}\index{hcompv@\htool{HCompV}} can be used to compute
this global variance.\index{training!whole-word}

\centrefig{subword}{90}{Training Subword HMMs}

Although \HTK\ gives full support for building whole-word HMM
systems, the bulk of its facilities are focussed on building sub-word
systems in which the basic units are the individual sounds of the
language, called \textit{phones}. One HMM is constructed for each
such phone\index{phones} and continuous
speech\index{continuous speech recognition} is recognised by joining
the phones together to make any required vocabulary using a
pronunciation dictionary.\index{training!sub-word}

The basic procedures involved in training a set of subword models are
shown in Fig.~\href{f:subword}. The core process involves the
embedded training\index{embedded training} tool
\htool{HERest}\index{herest@\htool{HERest}}. \htool{HERest} uses
continuously spoken utterances as its source of training data and
simultaneously re-estimates the complete set of subword HMMs. For
each input utterance, \htool{HERest} needs a transcription, i.e.\ a
list of the phones in that utterance. \htool{HERest} then joins
together all of the subword HMMs corresponding to this phone list to
make a single composite HMM. This composite HMM is used to collect
the necessary statistics for the re-estimation. When all of the
training utterances have been processed, the total set of accumulated
statistics is used to re-estimate the parameters of all of the phone
HMMs.
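In its simplest form, a single pass of embedded re-estimation might
therefore be invoked by a command of the following general form,
where the names \texttt{trainlist} (the list of training files),
\texttt{labs.mlf} (the transcriptions), \texttt{phonelist} (the list
of subword models) and the directories \texttt{dir1} and
\texttt{dir2} are purely illustrative:
\begin{verbatim}
 HERest -S trainlist -I labs.mlf -H dir1/hmmdefs -M dir2 phonelist
\end{verbatim}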
It is important to emphasise that in the above process, the
transcriptions are only needed to identify the sequence of phones in
each utterance. No phone boundary information is needed.

The initialisation\index{phone model initialisation} of a set of
phone HMMs prior to embedded re-estimation using \htool{HERest} can
be achieved in two different ways. As shown on the left of
Fig.~\href{f:subword}, a small set of hand-labelled
\textit{bootstrap} training data can be used along
with\index{bootstrapping} the isolated training tools \htool{HInit}
and \htool{HRest} to initialise each phone HMM individually. When
used in this way, both \htool{HInit} and \htool{HRest} use the label
information to extract all the segments of speech corresponding to
the current phone HMM in order to perform isolated word training.

A simpler initialisation procedure uses \htool{HCompV} to assign the
global speech mean and variance to every Gaussian distribution in
every phone HMM. This so-called \textit{flat start} procedure
implies that during the first cycle of embedded re-estimation, each
training utterance will be uniformly segmented. The hope then is
that enough of the phone models align with actual realisations of
that phone so that on the second and subsequent iterations, the
models align as intended.\index{flat start}
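A flat start might be performed by a command of the following form,
where the file and directory names are again illustrative and the
\texttt{-m} option requests that the means are updated as well as the
variances:
\begin{verbatim}
 HCompV -m -S trainlist -M hmm0 proto
\end{verbatim}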
One of the major problems to be faced in building any HMM-based
system is that the amount of training data for each model will be
variable and is rarely sufficient. To overcome this, \HTK\ allows a
variety of sharing mechanisms to be implemented whereby HMM
parameters are tied together so that the training data is pooled and
more robust estimates result. These tyings, along with a variety of
other manipulations, are performed using the \HTK\ HMM editor
\htool{HHEd}. The use of \htool{HHEd}\index{hhed@\htool{HHEd}} is
described in a later chapter. Here it is sufficient to note that a
phone-based HMM set typically goes through several refinement cycles
of editing using \htool{HHEd} followed by parameter re-estimation
using \htool{HERest} before the final model set is obtained.

Having described in outline the main training strategies, each of the
above procedures will be described in more detail.

\mysect{Initialisation using \htool{HInit}}{inithmm}

In order to create a HMM definition, it is first necessary to produce
a prototype definition. As explained in Chapter~\ref{c:HMMDefs}, HMM
definitions can be stored as a text file and hence the simplest way
of creating a prototype is by using a text editor to manually produce
a definition of the form shown in Fig~\ref{f:hmm1def},
Fig~\ref{f:hmm2def} etc. The function of a prototype definition is
to describe the form and topology of the HMM; the actual numbers used
in the definition are not important. Hence, the vector size and
parameter kind should be specified and the number of states chosen.
The allowable transitions between states should be indicated by
putting non-zero values in the corresponding elements of the
transition matrix and zeros elsewhere. The rows of the transition
matrix must sum to one except for the final row which should be all
zero. Each state definition should show the required number of
streams and mixture components in each stream. All mean values can
be zero but diagonal variances should be positive and covariance
matrices should have positive diagonal elements. All state
definitions can be identical.
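For example, a minimal prototype for a 3-state left-right HMM with
single Gaussian, diagonal covariance output distributions might look
like the following. A 4-dimensional \texttt{MFCC} vector is assumed
purely to keep the example short; in practice the vector size would
match the chosen front-end:
\begin{verbatim}
 ~o <VecSize> 4 <MFCC>
 ~h "proto"
 <BeginHMM>
   <NumStates> 5
   <State> 2
     <Mean> 4
       0.0 0.0 0.0 0.0
     <Variance> 4
       1.0 1.0 1.0 1.0
   <State> 3
     <Mean> 4
       0.0 0.0 0.0 0.0
     <Variance> 4
       1.0 1.0 1.0 1.0
   <State> 4
     <Mean> 4
       0.0 0.0 0.0 0.0
     <Variance> 4
       1.0 1.0 1.0 1.0
   <TransP> 5
     0.0 1.0 0.0 0.0 0.0
     0.0 0.6 0.4 0.0 0.0
     0.0 0.0 0.6 0.4 0.0
     0.0 0.0 0.0 0.7 0.3
     0.0 0.0 0.0 0.0 0.0
 <EndHMM>
\end{verbatim}
Note that, as required, each row of the transition matrix sums to one
apart from the final row, and states 1 and 5 are the non-emitting
entry and exit states.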
\index{model training!initialisation}
Having set up an appropriate prototype, a HMM can be initialised
using the \HTK\ tool \htool{HInit}. The basic principle of
\htool{HInit} depends on the concept of a HMM as a generator of
speech vectors. Every training example can be viewed as the output
of the HMM whose parameters are to be estimated. Thus, if the state
that generated each vector in the training data was known, then the
unknown means and variances could be estimated by averaging all the
vectors associated with each state. Similarly, the transition matrix
could be estimated by simply counting the number of time slots that
each state was occupied. This process is described more formally in
section~\ref{s:bwformulae} below.

\sidefig{vitloop}{60}{\htool{HInit} Operation}{2}{
The above idea can be implemented by an iterative scheme as shown in
Fig~\href{f:vitloop}. Firstly, the Viterbi\index{Viterbi training}
algorithm is used to find the most likely state sequence
corresponding to each training example, then the HMM parameters are
estimated. As a side-effect of finding the Viterbi state alignment,
the log likelihood of the training data can be computed. Hence, the
whole estimation process can be repeated until no further increase in
likelihood is obtained.

This process requires some initial HMM parameters to get started. To
circumvent this problem, \htool{HInit} starts by uniformly segmenting
the data and associating each successive segment with successive
states. Of course, this only makes sense if the HMM is left-right.
If the HMM is ergodic, then the uniform
segmentation\index{uniform segmentation} can be disabled and some
other approach taken. For example, \htool{HCompV} can be used as
described below.

If any HMM state has multiple mixture components, then the training
vectors are associated with the mixture component with the highest
likelihood. The number of vectors associated with each component
within a state can then be used to estimate the mixture weights. In
the uniform segmentation stage, a K-means
clustering\index{K-means clustering} algorithm is used to cluster the
vectors within each state.\index{model training!mixture components}

Turning now to the practical use of \htool{HInit}, whole word models
can be initialised by typing a command of the form}
\begin{verbatim}
 HInit hmm data1 data2 data3 ...
\end{verbatim}
where \texttt{hmm} is the name of the file holding the prototype HMM
and \texttt{data1}, \texttt{data2}, etc.\ are the names of the speech
files holding the training examples, each file holding a single
example with no leading or trailing silence. The HMM definition can
be distributed across a number of macro files loaded using the
standard \texttt{-H} option. For example, in
\begin{verbatim}
 HInit -H mac1 -H mac2 hmm data1 data2 data3 ...
\end{verbatim}
the macro files \texttt{mac1} and \texttt{mac2} would be loaded
first. If these contained a definition for \texttt{hmm}, then no
further HMM definition input would be attempted. If, however, they
did not contain a definition for \texttt{hmm}, then \htool{HInit}
would attempt to open a file called \texttt{hmm} and would expect to
find a definition for \texttt{hmm} within it. \htool{HInit} can in
principle load a large set of HMM definitions, but it will only
update the parameters of the single named HMM.

On completion, \htool{HInit} will write out new versions of all HMM
definitions loaded on start-up. The default behaviour is to write
these to the current directory which has the usually undesirable
effect of overwriting the prototype definition. This can be
prevented by specifying a new directory for the output definitions
using the \texttt{-M} option. Thus, typical usage of \htool{HInit}
takes the form\index{model training!whole word}
\begin{verbatim}
 HInit -H globals -M dir1 proto data1 data2 data3 ...
 mv dir1/proto dir1/wordX
\end{verbatim}
Here \texttt{globals} is assumed to hold a global options
macro\index{global options macro} (and possibly others). The actual
HMM definition is loaded from the file \texttt{proto} in the current
directory and the newly initialised definition along with a copy of
\texttt{globals} will be written to \texttt{dir1}. Since the newly
created HMM will still be called \texttt{proto}, it is renamed as
appropriate.

For most real tasks, the number of data files required will exceed
the command line argument limit and a script
file\index{script files} is used instead. Hence, if the names of the
data files are stored in the file \texttt{trainlist} then typing
\begin{verbatim}
 HInit -S trainlist -H globals -M dir1 proto
\end{verbatim}
would have the same effect as previously.

\centrefig{hinitdp}{90}{File Processing in \htool{HInit}}

When building sub-word models, \htool{HInit} can be used in the same
manner as above to initialise each individual sub-word HMM. However,
in this case, the training data is typically continuous speech with
associated label files identifying the speech segments corresponding
to each sub-word. To illustrate this, the following command could be
used to initialise a sub-word HMM for the phone \texttt{ih}
\begin{verbatim}
 HInit -S trainlist -H globals -M dir1 -l ih -L labs proto
 mv dir1/proto dir1/ih
\end{verbatim}
Here the \texttt{-l} option names the labelled segments to be used
for training and the \texttt{-L} option gives the directory holding
the corresponding label files.
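Following initialisation, the same label information would typically
be used by \htool{HRest} to refine each bootstrapped model using
Baum-Welch re-estimation. A sketch of such a command, assuming the
same illustrative file names as above and an output directory
\texttt{dir2}, might be
\begin{verbatim}
 HRest -S trainlist -H globals -M dir2 -l ih -L labs dir1/ih
\end{verbatim}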