%/* ----------------------------------------------------------- */
%/*                                                              */
%/*                          ___                                 */
%/*                       |_| | |_/   SPEECH                     */
%/*                       | | | | \   RECOGNITION                */
%/*                       =========   SOFTWARE                   */
%/*                                                              */
%/*                                                              */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                     */
%/*          1995-2000 Redmond, Washington USA                   */
%/*                    http://www.microsoft.com                  */
%/*                                                              */
%/*   Use of this software is governed by a License Agreement    */
%/*    ** See the file License for the Conditions of Use  **     */
%/*    **     This banner notice must not be removed      **     */
%/*                                                              */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 1/12/97
%
\mychap{The Fundamentals of \HTK}{fundaments}

\sidepic{overview}{40}{}

\HTK\ is a toolkit for building Hidden Markov Models (HMMs). HMMs can be
used to model any time series and the core of \HTK\ is similarly
general-purpose. However, \HTK\ is primarily designed for building
HMM-based speech processing tools, in particular recognisers. Thus, much
of the infrastructure support in \HTK\ is dedicated to this task. As shown
in the picture above, there are two major processing stages involved.
Firstly, the \HTK\ training tools are used to estimate the parameters of a
set of HMMs using training utterances and their associated transcriptions.
Secondly, unknown utterances are transcribed using the \HTK\ recognition
tools.

The main body of this book is mostly concerned with the mechanics of these
two processes. However, before launching into detail it is necessary to
understand some of the basic principles of HMMs. It is also helpful to
have an overview of the toolkit and to have some appreciation of how
training and recognition in \HTK\ is organised.

This first part of the book attempts to provide this information. In this
chapter, the basic ideas of HMMs and their use in speech recognition are
introduced. The following chapter then presents a brief overview of \HTK\
and, for users of older versions, it highlights the main differences in
version 2.0 and later. Finally in this tutorial part of the book,
chapter 3 describes how a HMM-based speech recogniser can be built using
\HTK. It does this by describing the construction of a simple small
vocabulary continuous speech recogniser.

The second part of the book then revisits the topics skimmed over here
and discusses each in detail. This can be read in conjunction with the
third and final part of the book which provides a reference manual for
\HTK. This includes a description of each tool, summaries of the various
parameters used to configure \HTK\ and a list of the error messages that
it generates when things go wrong.

Finally, note that this book is concerned only with \HTK\ as a toolkit.
It does not provide information for using the \HTK\ libraries as a
programming environment.

\mysect{General Principles of HMMs}{genpHMM}

\sidefig{messencode}{50}{Message Encoding/Decoding}{-4}{}
Speech recognition systems generally assume that the speech signal is a
realisation of some message encoded as a sequence of one or more symbols
(see Fig.~\href{f:messencode}). To effect the reverse operation of
recognising the underlying symbol sequence given a spoken utterance, the
continuous speech waveform is first converted to a sequence of equally
spaced discrete parameter vectors. This sequence of parameter vectors is
assumed to form an exact representation of the speech waveform on the
basis that for the duration covered by a single vector (typically 10ms or
so), the speech waveform can be regarded as being stationary. Although
this is not strictly true, it is a reasonable approximation. Typical
parametric representations in common use are smoothed spectra or linear
prediction coefficients plus various other representations derived from
these.
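Purely as an illustration of this front-end step (a sketch only, not part
of \HTK\ itself), the following Python fragment slices a waveform into
equally spaced frames. The 10ms frame period and 25ms window length used
here are assumed typical values, and a real parameterisation would go on
to reduce each frame to a vector of, for example, spectral or linear
prediction coefficients.
\begin{verbatim}
import numpy as np

def frame_signal(waveform, sample_rate, frame_period=0.010, window_len=0.025):
    """Slice a 1-D waveform into equally spaced, overlapping frames.

    Each frame would normally be reduced to a parameter vector
    (e.g. smoothed spectrum or LPC coefficients); here the raw
    frames are returned simply to illustrate the segmentation.
    """
    step = int(round(frame_period * sample_rate))   # samples between frames
    size = int(round(window_len * sample_rate))     # samples per frame
    n_frames = max(0, 1 + (len(waveform) - size) // step)
    return np.stack([waveform[i*step : i*step + size] for i in range(n_frames)])

# Example: 1 second of random "speech" at 16 kHz -> roughly 100 frames
frames = frame_signal(np.random.randn(16000), 16000)
print(frames.shape)    # (98, 400)
\end{verbatim}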
The r\^{o}le of the recogniser is to effect a mapping between sequences
of speech vectors and the wanted underlying symbol sequences. Two
problems make this very difficult. Firstly, the mapping from symbols to
speech is not one-to-one since different underlying symbols can give rise
to similar speech sounds. Furthermore, there are large variations in the
realised speech waveform due to speaker variability, mood, environment,
etc. Secondly, the boundaries between symbols cannot be identified
explicitly from the speech waveform. Hence, it is not possible to treat
the speech waveform as a sequence of concatenated static patterns.

The second problem of not knowing the word boundary locations can be
avoided by restricting the task to isolated word recognition. As shown in
Fig.~\href{f:isoprob}, this implies that the speech waveform corresponds
to a single underlying symbol (e.g. word) chosen from a fixed vocabulary.
Despite the fact that this simpler problem is somewhat artificial, it
nevertheless has a wide range of practical applications. Furthermore, it
serves as a good basis for introducing the basic ideas of HMM-based
recognition before dealing with the more complex continuous speech case.
Hence, isolated word recognition using HMMs will be dealt with first.

\mysect{Isolated Word Recognition}{isowrdrec}

Let each spoken word be represented by a sequence of speech vectors or
{\it observations} $\bm{O}$, defined as
\begin{equation}
   \bm{O} = \bm{o}_1, \bm{o}_2, \ldots, \bm{o}_T
\end{equation}
where $\bm{o}_t$ is the speech vector observed at time $t$. The isolated
word recognition problem can then be regarded as that of computing
\begin{equation} \label{e:2}
   \arg\max_i \left\{ P(w_i | \bm{O}) \right\}
\end{equation}
where $w_i$ is the $i$'th vocabulary word. This probability is not
computable directly but using Bayes' Rule\index{Bayes' Rule} gives
\begin{equation} \label{e:3}
   P(w_i | \bm{O}) = \frac{P(\bm{O}|w_i) P(w_i)}{P(\bm{O})}
\end{equation}
Thus, for a given set of prior probabilities $P(w_i)$, the most probable
spoken word depends only on the likelihood $P(\bm{O}|w_i)$. Given the
dimensionality of the observation sequence $\bm{O}$, the direct
estimation of the joint conditional probability
$P(\bm{o}_1,\bm{o}_2,\ldots | w_i)$ from examples of spoken words is not
practicable. However, if a parametric model of word production such as a
Markov model is assumed, then estimation from data is possible since the
problem of estimating the class conditional observation densities
$P(\bm{O}|w_i)$ is replaced by the much simpler problem of estimating the
Markov model parameters.

\sidefig{isoprob}{50}{Isolated Word Problem}{-4}
In HMM based speech recognition, it is assumed that the sequence of
observed speech vectors corresponding to each word is generated by a
Markov model\index{HMM!definitions} as shown in Fig.~\href{f:markovgen}.
A Markov model is a finite state machine which changes state once every
time unit and each time $t$ that a state $j$ is entered, a speech vector
$\bm{o}_t$ is generated from the probability density $b_j(\bm{o}_t)$.
Furthermore, the transition from state $i$ to state $j$ is also
probabilistic and is governed by the discrete probability $a_{ij}$.
Fig.~\href{f:markovgen} shows an example of this process where the six
state model moves through the state sequence $X=1,2,2,3,4,4,5,6$ in
order to generate the sequence $\bm{o}_1$ to $\bm{o}_6$.
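As an informal illustration of this generation process (a Python sketch
only, not \HTK\ code), the fragment below samples a state sequence and an
observation sequence from a small left-to-right model. The transition
probabilities and the single-Gaussian output densities are hypothetical
values chosen for the example; states 1 and 6 emit nothing, mirroring the
example above in which eight states produce only six observations.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 6-state left-to-right model (1-based state indices).
N = 6                                    # number of states
A = np.zeros((N + 1, N + 1))             # A[i, j] = a_ij
A[1, 2] = 1.0                            # entry state always moves to state 2
for j in range(2, N):                    # emitting states 2 .. 5
    A[j, j], A[j, j + 1] = 0.6, 0.4      # self-loop or move right
means = {j: np.full(2, float(j)) for j in range(2, N)}   # toy 2-D Gaussians

def generate(A, means, rng):
    """Sample a state sequence X and observations o_t from the model."""
    X, O, state = [1], [], 1
    while state != N:
        state = rng.choice(N + 1, p=A[state])   # transition with prob a_ij
        X.append(int(state))
        if state != N:                          # emitting state: draw o_t
            O.append(rng.normal(means[state], 1.0))
    return X, np.array(O)

X, O = generate(A, means, rng)
print(X, O.shape)    # state sequence (entry ... exit) and (T, 2) observations
\end{verbatim}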
Notice that in \HTK, the entry and exit states of a HMM are non-emitting.
This is to facilitate the construction of composite models as explained
in more detail later.

The joint probability that $\bm{O}$ is generated by the model $M$ moving
through the state sequence $X$ is calculated simply as the product of the
transition probabilities and the output probabilities. So for the state
sequence $X$ in Fig.~\href{f:markovgen}
\begin{equation} \label{e:4}
   P(\bm{O},X|M) = a_{12} b_2(\bm{o}_1) a_{22} b_2(\bm{o}_2) 
                   a_{23} b_3(\bm{o}_3) \ldots
\end{equation}
However, in practice, only the observation sequence $\bm{O}$ is known and
the underlying state sequence $X$ is hidden. This is why it is called a
{\it Hidden Markov Model}. 

\centrefig{markovgen}{85}{The Markov Generation Model}
Given that $X$ is unknown, the required
likelihood\index{likelihood computation} is computed by summing over all
possible state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is
\begin{equation} \label{e:5}
   P(\bm{O}|M) = \sum_X a_{x(0)x(1)} 
                 \prod_{t=1}^T b_{x(t)}(\bm{o}_t) a_{x(t)x(t+1)} 
\end{equation}
where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is
constrained to be the model exit state.

As an alternative to equation~\ref{e:5}, the likelihood can be
approximated by only considering the most likely state sequence, that is
\begin{equation} \label{e:6}
   \hat{P}(\bm{O}|M) = \max_X \left\{ a_{x(0)x(1)} 
       \prod_{t=1}^T b_{x(t)}(\bm{o}_t) a_{x(t)x(t+1)} \right\}
\end{equation}
Although the direct computation of equations \ref{e:5} and \ref{e:6} is
not tractable, simple recursive procedures exist which allow both
quantities to be calculated very efficiently.

Before going any further, however, notice that if equation~\ref{e:2} is
computable then the recognition problem is solved. Given a set of models
$M_i$ corresponding to words $w_i$, equation~\ref{e:2} is solved by
using \ref{e:3} and assuming that
\begin{equation} \label{e:7}
   P(\bm{O}|w_i) = P(\bm{O}|M_i).
\end{equation}
All this, of course, assumes that the parameters $\{a_{ij}\}$ and
$\{b_{j}(\bm{o}_t)\}$ are known for each model $M_i$. Herein lies the
elegance and power of the HMM framework. Given a set of training examples
corresponding to a particular model, the parameters of that model can be
determined automatically by a robust and efficient re-estimation
procedure. Thus, provided that a sufficient number of representative
examples of each word can be collected then a HMM can be constructed
which implicitly models all of the many sources of variability inherent
in real speech. Fig.~\href{f:useforiso} summarises the use of HMMs for
isolated word recognition. Firstly, a HMM is trained for each vocabulary
word using a number of examples of that word. In this case, the
vocabulary consists of just three words: ``one'', ``two'' and ``three''.
Secondly, to recognise some unknown word, the likelihood of each model
generating that word is calculated and the most likely model identifies
the word.

\centrefig{useforiso}{84}{Using HMMs for Isolated Word Recognition}
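Purely to make equations \ref{e:6} and \ref{e:7} concrete (this is an
illustrative Python sketch, not the \HTK\ implementation), the fragment
below computes the Viterbi approximation $\hat{P}(\bm{O}|M)$ in the log
domain by dynamic programming and then recognises an unknown observation
sequence by picking the word model with the highest value. The per-state
log output probabilities are supplied by a hypothetical function
\texttt{log\_b}, and equal word priors $P(w_i)$ are assumed.
\begin{verbatim}
import numpy as np

def log_viterbi(logA, log_b, O):
    """Log of eq. 6: max over state sequences of the product of
    transition and output probabilities, computed recursively.
    logA[i, j] = log a_ij (1-based; state 1 = entry, state N = exit);
    log_b(j, o) = log b_j(o) for emitting states j = 2 .. N-1.
    """
    N = logA.shape[0] - 1
    emit = range(2, N)                                   # emitting states
    # phi[j] = best partial-path log score ending in state j at time t
    phi = {j: logA[1, j] + log_b(j, O[0]) for j in emit}
    for o in O[1:]:
        phi = {j: max(phi[i] + logA[i, j] for i in emit) + log_b(j, o)
               for j in emit}
    return max(phi[i] + logA[i, N] for i in emit)        # enter exit state

def recognise(models, O):
    """Pick the word whose model gives the highest approximate log P(O|M_i).
    Equal priors P(w_i) assumed; add log P(w_i) to include them (cf. eq. 3)."""
    return max(models, key=lambda w: log_viterbi(*models[w], O))

# models maps each word to a (logA, log_b) pair, e.g.
#   models = {"one": (logA_one, log_b_one), "two": (logA_two, log_b_two), ...}
#   print(recognise(models, O))
\end{verbatim}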
\mysect{Output Probability Specification}{outprobspec}

Before\index{output probability!continuous case} the problem of parameter
estimation can be discussed in more detail, the form of the output
distributions $\{b_{j}(\bm{o}_t)\}$ needs to be made explicit. \HTK\ is
designed primarily for modelling continuous parameters using continuous
density multivariate output distributions. It can also handle observation
sequences consisting of discrete symbols in which case, the output
distributions are discrete probabilities. For simplicity, however, the
presentation in this chapter will assume that continuous density
distributions are being used. The minor differences that the use of
discrete probabilities entail are noted in chapter~\ref{c:HMMDefs} and
discussed in more detail in chapter~\ref{c:discmods}.

In common with most other\index{Gaussian mixture}\index{streams}\index{codebooks}
continuous density HMM systems, \HTK\ represents output distributions by
Gaussian Mixture Densities. In \HTK, however, a further generalisation is
made. \HTK\ allows each observation vector at time $t$ to be split into a
number of $S$ independent data streams $\bm{o}_{st}$. The formula for
computing $b_{j}(\bm{o}_t)$ is then
\begin{equation} \label{e:8}
   b_{j}(\bm{o}_t) = \prod_{s=1}^S \left[ \sum_{m=1}^{M_s} c_{jsm} 
      {\cal N}(\bm{o}_{st}; \bm{\mu}_{jsm}, \bm{\Sigma}_{jsm}) 
   \right]^{\gamma_s}
\end{equation}
where $M_s$ is the number of mixture components in stream $s$, $c_{jsm}$
is the weight of the $m$'th component and 
${\cal N}(\cdot; \bm{\mu}, \bm{\Sigma})$ is a multivariate Gaussian with
mean vector $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$, that
is\index{stream weight}\index{codebook exponent}
\begin{equation}
   {\cal N}(\bm{o}; \bm{\mu}, \bm{\Sigma}) = 
      \frac{1}{\sqrt{(2 \pi)^n | \bm{\Sigma} |}} 
      e^{- \frac{1}{2}(\bm{o}-\bm{\mu})' \bm{\Sigma}^{-1}(\bm{o}-\bm{\mu})}
\end{equation}
where $n$ is the dimensionality of $\bm{o}$.

The exponent $\gamma_s$ is a stream weight\footnote{often referred to as a
codebook exponent.}. It can be used to give a particular stream more
emphasis, however, it can only be set manually. No current \HTK\ training
tools can estimate its value.
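To make equation \ref{e:8} concrete, the Python sketch below (an
illustration only, not \HTK\ code, using hypothetical parameter arrays)
evaluates the logarithm of $b_j(\bm{o}_t)$ for a single state. The
covariances are restricted to diagonal matrices for brevity, with one set
of mixture weights, means and variances per stream and the stream weight
$\gamma_s$ applied as the exponent.
\begin{verbatim}
import numpy as np

def log_gauss_diag(o, mu, var):
    """Log of a diagonal-covariance multivariate Gaussian N(o; mu, Sigma)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def log_b_j(streams, params, gammas):
    """Log of eq. 8 for one state j.

    streams : list of S vectors o_st (the split observation o_t)
    params  : per-stream (weights c_m, means mu_m, diagonal variances var_m)
    gammas  : per-stream weights gamma_s
    """
    log_b = 0.0
    for o_s, (c, mu, var), gamma in zip(streams, params, gammas):
        # log sum_m c_m N(o_s; mu_m, Sigma_m), accumulated in the log domain
        comp = [np.log(c[m]) + log_gauss_diag(o_s, mu[m], var[m])
                for m in range(len(c))]
        log_b += gamma * np.logaddexp.reduce(comp)
    return log_b

# Hypothetical single-stream example: 2 mixture components, 3-D observations
params = [(np.array([0.4, 0.6]),           # component weights c_jsm
           np.zeros((2, 3)),               # component means mu_jsm
           np.ones((2, 3)))]               # diagonal variances
print(log_b_j([np.zeros(3)], params, gammas=[1.0]))
\end{verbatim}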