equally on the mel-scale across the resulting pass-band such that the lower
cut-off of the first filter is at \texttt{LOFREQ} and the upper cut-off of the
last filter is at \texttt{HIFREQ}.
If mel-scale filterbank parameters are required directly, then the target kind
should be set to \texttt{MELSPEC}\index{melspec@\texttt{MELSPEC}}.
Alternatively, log filterbank parameters can be generated by setting the target
kind to \texttt{FBANK}.

\mysect{Vocal Tract Length Normalisation}{vtln}

A simple speaker normalisation technique can be implemented by
modifying the filterbank analysis described in the previous section.
Vocal tract length normalisation (VTLN) aims to compensate for the
fact that speakers have vocal tracts of different sizes. VTLN can be
implemented by warping the frequency axis in the filterbank analysis.
In HTK simple linear frequency warping is supported. The warping
factor~$\alpha$ is controlled by the configuration variable
\texttt{WARPFREQ}\index{warpfreq@\texttt{WARPFREQ}}. Here values of
$\alpha < 1.0$ correspond to a compression of the frequency axis. As
the warping would lead to some filters being placed outside the
analysis frequency range, the simple linear warping function is
modified at the upper and lower boundaries. The result is that the
lower boundary frequency of the analysis
(\texttt{LOFREQ}\index{lofreq@\texttt{LOFREQ}}) and the upper
boundary frequency (\texttt{HIFREQ}\index{hifreq@\texttt{HIFREQ}})
are always mapped to themselves.
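The boundary-preserving warp described above can be sketched as a piece-wise linear function. The following is an illustrative Python sketch, not HTK's actual code: the central region is scaled by $\alpha$, and the two boundary segments are chosen so that the lower and upper analysis frequencies map to themselves. The argument names are assumptions, not HTK identifiers.

```python
def warp_freq(f, alpha, lo_freq, hi_freq, lo_cut, hi_cut):
    """Piece-wise linear VTLN frequency warp (illustrative sketch).

    Between lo_cut and hi_cut the frequency axis is scaled linearly
    by alpha; below lo_cut and above hi_cut linear segments are used
    so that lo_freq and hi_freq map to themselves."""
    if lo_cut <= f <= hi_cut:
        return alpha * f
    if f < lo_cut:
        # Segment from (lo_freq, lo_freq) to (lo_cut, alpha * lo_cut).
        slope = (alpha * lo_cut - lo_freq) / (lo_cut - lo_freq)
        return lo_freq + slope * (f - lo_freq)
    # Segment from (hi_cut, alpha * hi_cut) to (hi_freq, hi_freq).
    slope = (hi_freq - alpha * hi_cut) / (hi_freq - hi_cut)
    return hi_freq + slope * (f - hi_freq)
```

Note that the function is continuous at the two cut-off frequencies and reduces to plain linear warping inside them.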
The regions in which the warping function deviates from the linear warping
with factor~$\alpha$ are controlled with the two configuration variables
(\texttt{WARPLCUTOFF}\index{warplcutoff@\texttt{WARPLCUTOFF}}) and
(\texttt{WARPUCUTOFF}\index{warpucutoff@\texttt{WARPUCUTOFF}}).
Figure~\ref{f:vtlnpiecewise} shows the overall shape of the resulting
piece-wise linear warping functions.

\centrefig{vtlnpiecewise}{60}{Frequency Warping}

The warping factor~$\alpha$ can, for example, be found using a search
procedure that compares likelihoods at different warping factors. A
typical procedure would involve recognising an utterance with
$\alpha=1.0$ and then performing forced alignment of the hypothesis
for all warping factors in the range $0.8 - 1.2$. The factor that
gives the highest likelihood is selected as the final warping factor.
Instead of estimating a separate warping factor for each utterance,
larger units can be used, for example by estimating only one~$\alpha$
per speaker. Vocal tract length normalisation can be applied in
testing as well as in training the acoustic models.

\mysect{Cepstral Features}{cepstrum}

Most often, however, cepstral parameters are required
and these are indicated by setting the target kind to \texttt{MFCC}, standing
for Mel-Frequency Cepstral Coefficients (MFCCs).  These are calculated from the
log filterbank amplitudes $\{m_j\}$ using the Discrete Cosine Transform
\hequation{
   c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^N m_j \cos
         \left( \frac{\pi i}{N}(j-0.5) \right)
}{dct}
where $N$ is the number of filterbank channels set by the configuration
parameter \texttt{NUMCHANS}\index{numchans@\texttt{NUMCHANS}}.  The required
number of cepstral coefficients is set by
\texttt{NUMCEPS}\index{numceps@\texttt{NUMCEPS}} as in the linear prediction
case.
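The DCT of equation~\ref{e:dct} can be written out directly. The following is a minimal sketch in plain Python (HTK itself implements this in C inside \htool{HPARM}); note that the inner index runs from $0$ to $N-1$, so $(j-0.5)$ in the equation becomes $(j+0.5)$ here.

```python
import math

def cepstra_from_fbank(log_amps, num_ceps):
    """Compute cepstral coefficients c_1..c_numceps from the log
    filterbank amplitudes {m_j} via the DCT of equation (dct).
    Illustrative sketch, not HTK's actual implementation."""
    N = len(log_amps)
    return [
        math.sqrt(2.0 / N)
        * sum(m * math.cos(math.pi * i / N * (j + 0.5))
              for j, m in enumerate(log_amps))
        for i in range(1, num_ceps + 1)
    ]
```

A quick sanity check: a flat (constant) log spectrum has no shape for the cosines to pick up, so all cepstra $c_i$, $i \ge 1$, come out as zero.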
Liftering can also be applied to MFCCs using the
\texttt{CEPLIFTER}\index{ceplifter@\texttt{CEPLIFTER}} configuration parameter
(see equation~\ref{e:ceplifter}).

MFCCs are the parameterisation of choice for many speech recognition
applications. They give good discrimination and lend themselves to a number of
manipulations. In particular, the effect of inserting a transmission channel
on the input speech is to multiply the speech spectrum by the channel transfer
function. In the log cepstral domain, this multiplication becomes a simple
addition which can be removed by subtracting the cepstral mean from all input
vectors. In practice, of course, the mean has to be estimated over a limited
amount of speech data so the subtraction will not be perfect.  Nevertheless,
this simple technique is very effective in practice where it compensates for
long-term spectral effects such as those caused by different microphones and
audio channels.  To perform this so-called \textit{Cepstral Mean
Normalisation} (CMN) in \HTK\, it is only necessary to add the
\texttt{\_Z}\index{qualifiers!aaaz@\texttt{\_Z}} qualifier to the target
parameter kind.  The mean is estimated by computing the average of each
cepstral parameter across each input speech file.  Since this cannot be done
with live audio, cepstral mean compensation is not supported for this case.
\index{cepstral mean normalisation}

In addition to the mean normalisation, the variance of the data can be
normalised. For improved robustness both the mean and the variance of the
data should be calculated on larger units (e.g.\ on all the data from a
speaker instead of just a single utterance). To use
speaker-/cluster-based normalisation the mean and variance estimates
are computed offline before the actual recognition and stored in
separate files (two files per cluster). The configuration variables
\texttt{CMEANDIR}\index{cmeandir@\texttt{CMEANDIR}} and
\texttt{VARSCALEDIR}\index{varscaledir@\texttt{VARSCALEDIR}} point to the
directories where these files are stored.
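The per-file mean subtraction performed by the \texttt{\_Z} qualifier is simple enough to sketch directly. This is an illustrative re-statement in Python, not HTK code; frames are lists of cepstral parameters.

```python
def cepstral_mean_normalise(frames):
    """Subtract the per-file cepstral mean from every frame, as done
    by the _Z qualifier (illustrative sketch)."""
    n = len(frames)
    dim = len(frames[0])
    # Average each cepstral parameter across the whole file.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Subtract the mean from every frame.
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

After normalisation each cepstral dimension has zero mean over the file, which removes the additive term a fixed channel contributes in the log cepstral domain.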
To find the actual filename a second set of variables
(\texttt{CMEANMASK}\index{cmeanmask@\texttt{CMEANMASK}} and
\texttt{VARSCALEMASK}\index{varscalemask@\texttt{VARSCALEMASK}}) has to be
specified. These masks are regular expressions in which you can use
the special characters \texttt{?}, \texttt{*} and \texttt{\%}. The
appropriate mask is matched against the filename of the file to be
recognised and the substring that was matched against the \texttt{\%}
characters is used as the filename of the normalisation file. An
example config setting is:
\begin{verbatim}
CMEANDIR     = /data/eval01/plp/cmn
CMEANMASK    = %%%%%%%%%%_*
VARSCALEDIR  = /data/eval01/plp/cvn
VARSCALEMASK = %%%%%%%%%%_*
VARSCALEFN   = /data/eval01/plp/globvar
\end{verbatim}
So, if the file \verb|sw1-4930-B_4930Bx-sw1_000126_000439.plp| is to be
recognised then the normalisation estimates would be loaded from the
following files:
\begin{verbatim}
/data/eval01/plp/cmn/sw1-4930-B
/data/eval01/plp/cvn/sw1-4930-B
\end{verbatim}
The file specified by
\texttt{VARSCALEFN}\index{varscalefn@\texttt{VARSCALEFN}} contains the
global target variance vector, i.e.\ the variance of the data is first
normalised to 1.0 based on the estimate in the appropriate file in
\texttt{VARSCALEDIR}\index{varscaledir@\texttt{VARSCALEDIR}} and then
scaled to the target variance given in
\texttt{VARSCALEFN}\index{varscalefn@\texttt{VARSCALEFN}}.
The format of the files is very simple and each of them just contains
one vector. Note that in the case of the cepstral mean only the static
coefficients will be normalised.
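The mask matching can be sketched by translating the three special characters into a standard regular expression. This is an assumed re-implementation for illustration only (it may differ from HTK's matcher in edge cases): \texttt{?} matches one character, \texttt{*} any run of characters, and each \texttt{\%} matches and captures one character; the captured characters concatenated give the normalisation filename.

```python
import re

def mask_match(mask, filename):
    """Match an HTK-style mask against a filename and return the
    substring matched by the '%' characters, or None on no match.
    Illustrative sketch of the documented behaviour."""
    pattern = ''
    for ch in mask:
        if ch == '%':
            pattern += '(.)'   # capture one character
        elif ch == '?':
            pattern += '.'     # any single character
        elif ch == '*':
            pattern += '.*'    # any run of characters
        else:
            pattern += re.escape(ch)
    m = re.fullmatch(pattern, filename)
    return ''.join(m.groups()) if m else None
```

Applied to the example above, the ten \texttt{\%} characters capture the first ten characters of the filename, giving \texttt{sw1-4930-B}.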
A cmn file could for example look like:
\begin{verbatim}
<CEPSNORM> <PLP_0>
<MEAN> 13
-10.285290 -9.484871 -6.454639 ...
\end{verbatim}
Cepstral variance normalisation always applies to the full
observation vector after all qualifiers like delta and acceleration
coefficients have been added, e.g.:
\begin{verbatim}
<CEPSNORM> <PLP_D_A_Z_0>
<VARIANCE> 39
33.543018 31.241779 36.076199 ...
\end{verbatim}
The global variance vector will always have the same number of
dimensions as the cvn vector, e.g.:
\begin{verbatim}
<VARSCALE> 39
2.974308e+01 4.143743e+01 3.819999e+01 ...
\end{verbatim}
These estimates can be generated using \htool{HCompV}. See the
reference section for details.

\mysect{Perceptual Linear Prediction}{plp}

An alternative to the Mel-Frequency Cepstral Coefficients is the use
of Perceptual Linear Prediction (PLP) coefficients.
As implemented in HTK the PLP feature extraction is based on the
standard mel-frequency filterbank (possibly warped). The mel
filterbank coefficients are weighted by an equal-loudness curve and
then compressed by taking the cubic root.\footnote{The degree of
  compression can be controlled by setting the configuration parameter
  \texttt{COMPRESSFACT}\index{compressfact@\texttt{COMPRESSFACT}} which
  is the power to which the amplitudes are raised and defaults to
  0.33.} From the resulting auditory spectrum LP coefficients are
estimated which are then converted to cepstral coefficients in the
normal way (see above).

\mysect{Energy Measures}{energy}
\index{speech input!energy measures}

To augment the spectral parameters derived from linear prediction or
mel-filterbank analysis, an energy term can be appended by including the
qualifier \texttt{\_E}\index{qualifiers!aaae@\texttt{\_E}} in the target kind.
The energy is computed as the log of the signal energy, that is, for speech
samples $\{s_n, n=1,N \}$
\hequation{
   E = \log \sum_{n=1}^N s_n^2
}{logenergy}
This log energy measure can be normalised to the range $-E_{min}..1.0$ by
setting the Boolean configuration
parameter \texttt{ENORMALISE}\index{enormalise@\texttt{ENORMALISE}} to true
(the default setting).  This normalisation is implemented by subtracting the
maximum value of $E$ in the utterance and adding $1.0$. Note that energy
normalisation is incompatible with live audio input and in such circumstances
the configuration variable \texttt{ENORMALISE} should be explicitly set
false. The lowest energy in the utterance can be clamped using the
configuration parameter \texttt{SILFLOOR}\index{silfloor@\texttt{SILFLOOR}}
which gives the ratio between the maximum and minimum energies in the
utterance in dB. Its default value is 50dB. Finally, the overall log energy
can be arbitrarily scaled by the value of the configuration parameter
\texttt{ESCALE}\index{escale@\texttt{ESCALE}} whose default is $0.1$.
\index{silence floor}

When calculating energy for LPC-derived parameterisations, the default is to
use the zero-th delay autocorrelation coefficient ($r_0$).  However, this
means that the energy is calculated after windowing and pre-emphasis.  If the
configuration parameter \texttt{RAWENERGY}\index{rawenergy@\texttt{RAWENERGY}}
is set true, however, then energy is calculated separately before any
windowing or pre-emphasis regardless of the requested
parameterisation\footnote{In any event, setting the compatibility variable
\texttt{V1COMPAT} to true in \htool{HPARM} will ensure that the calculation
of energy is compatible with that computed by the Version 1 tool
\htool{HCode}.}.

In addition to, or in place of, the log energy, the qualifier
\texttt{\_O}\index{qualifiers!aaao@\texttt{\_O}} can be added to a target kind
to indicate that the 0'th cepstral parameter $C_0$ is to be appended.  This
qualifier is only valid if the target kind is \texttt{MFCC}.  Unlike earlier
versions of \HTK\, scaling factors set by the configuration variable
\texttt{ESCALE} are not applied to $C_0$\footnote{Unless \texttt{V1COMPAT} is
set to true.
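The energy computation and the \texttt{ENORMALISE}/\texttt{SILFLOOR}/\texttt{ESCALE} processing described above can be sketched as follows. This is an illustrative reading of the documented behaviour, not HTK's code; in particular the conversion of the dB silence floor into the natural-log domain is an assumption.

```python
import math

def log_energy(samples):
    """E = log(sum s_n^2) for one frame of speech samples
    (equation logenergy)."""
    return math.log(sum(s * s for s in samples))

def normalise_energies(E, silfloor_db=50.0, escale=0.1):
    """Sketch of ENORMALISE-style processing: clamp each energy to a
    silence floor silfloor_db below the utterance maximum, subtract
    the maximum, add 1.0, then scale by escale."""
    e_max = max(E)
    # Assumed dB -> natural-log conversion for the silence floor.
    floor = e_max - (silfloor_db / 10.0) * math.log(10.0)
    return [escale * (max(e, floor) - e_max + 1.0) for e in E]
```

With the defaults the highest-energy frame always maps to $0.1 \times 1.0$, and frames more than 50dB below it are clamped to the floor.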
}.

\mysect{Delta, Acceleration and Third Differential Coefficients}{delta}
\index{speech input!dynamic coefficients}

The performance of a speech recognition system can be greatly enhanced by
adding time derivatives to the basic static parameters.  In \HTK, these are
indicated by attaching qualifiers to the basic parameter kind.  The qualifier
\texttt{\_D}\index{qualifiers!aaad@\texttt{\_D}} indicates that first order
regression coefficients (referred to as delta coefficients) are appended, the
qualifier \texttt{\_A}\index{qualifiers!aaaa@\texttt{\_A}} indicates that
second order regression coefficients (referred to as acceleration
coefficients) are appended, and the qualifier
\texttt{\_T}\index{qualifiers!aaat@\texttt{\_T}} indicates that third order
regression coefficients (referred to as third differential coefficients) are
appended. The \texttt{\_A} qualifier cannot be used without also using the
\texttt{\_D} qualifier. Similarly the \texttt{\_T} qualifier cannot be used
without also using the \texttt{\_D} and \texttt{\_A} qualifiers.

The delta coefficients\index{delta coefficients} are computed using the
following regression formula\index{regression formula}
\hequation{
   d_t = \frac{ \sum_{\theta =1}^\Theta \theta(c_{t+\theta} - c_{t-\theta}) }{
                2 \sum_{\theta = 1}^\Theta \theta^2 }
}{deltas}
where $d_t$ is a delta coefficient at time $t$ computed in terms of the
corresponding static coefficients $c_{t-\Theta}$ to $c_{t+\Theta}$.  The
value of $\Theta$ is set using the configuration parameter
\texttt{DELTAWINDOW}\index{deltawindow@\texttt{DELTAWINDOW}}.  The same
formula is applied to the delta coefficients to obtain acceleration
coefficients except that in this case the window size is set by
\texttt{ACCWINDOW}\index{accwindow@\texttt{ACCWINDOW}}. Similarly the third
differentials use \texttt{THIRDWINDOW}. Since equation~\ref{e:deltas} relies
on past and future speech parameter values, some modification is needed at
the beginning and end of the speech.
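The regression of equation~\ref{e:deltas} can be sketched directly. The Python below is an illustrative sketch rather than HTK's implementation; at the utterance boundaries it replicates the first or last frame, which is one common way of handling the missing context.

```python
def delta_coefficients(static, window):
    """Compute delta coefficients via the regression formula (deltas).
    static is a list of frames (lists of static parameters); window
    is the DELTAWINDOW value Theta.  Edge frames are replicated as
    padding (illustrative sketch)."""
    T, dim = len(static), len(static[0])
    denom = 2 * sum(theta * theta for theta in range(1, window + 1))
    out = []
    for t in range(T):
        row = []
        for d in range(dim):
            # Clamp indices so the first/last frame is replicated.
            num = sum(theta * (static[min(t + theta, T - 1)][d]
                               - static[max(t - theta, 0)][d])
                      for theta in range(1, window + 1))
            row.append(num / denom)
        out.append(row)
    return out
```

Applying the same function to its own output (with \texttt{ACCWINDOW} as the window) gives the acceleration coefficients; a third application gives the third differentials. On a linearly increasing static parameter, interior frames get a delta of exactly 1, as the formula intends.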
The default behaviour is to replicate the first or last vector as needed to fill
