% speechio.tex
Independent of what parameter kind is required, there are some simple pre-processing operations that can be applied prior to performing the actual signal analysis.\index{speech input!pre-processing} Firstly, the DC mean can be removed from the source waveform by setting the Boolean configuration parameter \texttt{ZMEANSOURCE}\index{zmeansource@\texttt{ZMEANSOURCE}} to true (i.e.\ \texttt{T}). This is useful when\index{speech input!DC offset} the original analogue-digital conversion has added a DC offset to the signal. It is applied to each window individually so that it can be used both when reading from a file and when using direct audio input\footnote{This method of applying a zero mean is different to HTK Version 1.5 where the mean was calculated and subtracted from the whole speech file in one operation. The configuration variable \texttt{V1COMPAT} can be set to revert to this older behaviour.}. Secondly, it is common practice to pre-emphasise the signal by applying the first order difference equation
\hequation{ {s^{\prime}}_n = s_n - k\,s_{n-1}}{preemp}
to the samples\index{speech input!pre-emphasis} $\{s_n, n=1,N \}$ in each window. Here $k$ is the pre-emphasis\index{pre-emphasis} coefficient which should be in the range $0 \leq k < 1$. It is specified using the configuration parameter \texttt{PREEMCOEF}\index{preemcoef@\texttt{PREEMCOEF}}. Finally, it is usually beneficial to taper the samples in each window so that discontinuities at the window edges are attenuated.
This is done by setting the Boolean configuration parameter \texttt{USEHAMMING}\index{usehamming@\texttt{USEHAMMING}} to true. This applies the following transformation to the samples $\{s_n, n=1,N\}$ in the window
\hequation{ {s^{\prime}}_n = \left\{ 0.54 - 0.46 \cos \left( \frac{2 \pi (n-1)}{N-1} \right) \right\} s_n}{ham}
When both pre-emphasis and Hamming windowing are enabled, pre-emphasis is performed first.\index{speech input!Hamming window function} \index{Hamming Window} In practice, all three of the above are usually applied. Hence, a configuration file will typically contain the following
\begin{verbatim}
   ZMEANSOURCE = T
   USEHAMMING = T
   PREEMCOEF = 0.97
\end{verbatim}
Certain types of artificially generated waveform data can cause numerical overflows with some coding schemes. In such cases adding a small amount of random noise to the waveform data solves the problem. The noise is added to the samples using
\hequation{ {s^{\prime}}_n = s_n + q RND()}{dither}
where $RND()$ is a uniformly distributed random value over the interval $[-1.0, +1.0)$ and $q$ is the scaling factor. The amount of noise added to the data ($q$) is set with the configuration parameter \index{adddither@\texttt{ADDDITHER}}\texttt{ADDDITHER} (default value $0.0$). A positive value causes the noise signal added to be the same every time (ensuring that the same file always gives exactly the same results). With a negative value the noise is random and the same file may produce slightly different results in different trials.

One problem that can arise when processing speech waveform files obtained from external sources, such as databases on CD-ROM, is that the byte-order\index{byte-order} may be different to that used by the machine on which \HTK\ is running. To deal with this problem, \htool{HWave} can perform automatic byte-swapping in order to preserve proper byte order.
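The per-window pre-processing described above can be sketched in Python. This is an illustrative re-implementation, not \HTK's own \htool{HSigP} code; in particular, the treatment of the first sample under pre-emphasis and the ordering of dithering and mean removal relative to pre-emphasis are assumed here.

```python
import numpy as np

def preprocess_window(s, k=0.97, zmean=True, use_hamming=True, q=0.0):
    """Sketch of HTK-style per-window pre-processing:
    dither, zero mean, pre-emphasis, then Hamming tapering."""
    s = np.asarray(s, dtype=float)
    N = len(s)
    if q != 0.0:                       # ADDDITHER: add scaled uniform noise
        s = s + abs(q) * np.random.uniform(-1.0, 1.0, N)
    if zmean:                          # ZMEANSOURCE = T
        s = s - s.mean()
    p = np.empty_like(s)               # PREEMCOEF = k: s'_n = s_n - k s_{n-1}
    p[1:] = s[1:] - k * s[:-1]
    p[0] = s[0] * (1.0 - k)            # assumed boundary choice for n = 1
    if use_hamming:                    # USEHAMMING = T
        n = np.arange(N)
        p *= 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))
    return p
```

Note that pre-emphasis acts on the raw (zero-meaned) samples and the Hamming taper is applied afterwards, matching the ordering stated above.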
\HTK\ assumes by default that speech waveform data is encoded as a sequence of 2-byte integers as is the case for most current speech databases\footnote{Many of the more recent speech databases use compression. In these cases, the data may be regarded as being logically encoded as a sequence of 2-byte integers even if the actual storage uses a variable length encoding scheme.}. If the source format is known, then \htool{HWave} will also make an assumption about the byte order used to create speech files in that format. It then checks the byte order of the machine that it is running on and automatically performs byte-swapping if the order is different. For unknown formats, proper byte order can be ensured by setting the configuration parameter \texttt{BYTEORDER}\index{byteorder@\texttt{BYTEORDER}} to \texttt{VAX} if the speech data was created on a little-endian machine such as a VAX or an IBM PC, and to anything else (e.g.\ \texttt{NONVAX}) if the speech data was created on a big-endian machine such as a SUN, HP or Macintosh machine. \index{speech input!byte order}

The reading/writing of \HTK\ format waveform files can be further controlled via the configuration parameters \texttt{NATURALREADORDER} and \texttt{NATURALWRITEORDER}. The effect and default settings of these parameters are described in section~\href{s:byteswap}.\index{byte swapping} Note that \texttt{BYTEORDER} should not be used when \texttt{NATURALREADORDER} is set to true. Finally, note that \HTK\ can also byte-swap parameterised files in a similar way provided that only the byte-order of each 4 byte float requires inversion.
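The effect of byte order on 2-byte samples can be illustrated with a short sketch (a hypothetical helper, not part of \HTK): the same raw bytes yield different sample values depending on the assumed byte order, which is why the source order must either be inferred from the format or declared with \texttt{BYTEORDER}.

```python
import struct

def decode_samples(raw, source_is_little_endian):
    """Interpret raw bytes as 16-bit signed samples, honouring the
    declared source byte order ("<" = little-endian, ">" = big-endian)."""
    fmt = ("<" if source_is_little_endian else ">") + "%dh" % (len(raw) // 2)
    return struct.unpack(fmt, raw)
```

For example, the byte pair \texttt{01 00} decodes to 1 under a little-endian (\texttt{VAX}) interpretation but to 256 under a big-endian one.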
\mysect{Linear Prediction Analysis}{lpcanal}

In linear prediction (LP) \index{linear prediction} analysis, the vocal tract transfer function is modelled by an all-pole filter\index{all-pole filter} with transfer function\footnote{Note that some textbooks define the denominator of equation~\ref{e:allpole} as $1 - \sum_{i=1}^p a_i z^{-i}$ so that the filter coefficients are the negatives of those computed by \HTK.}
\hequation{H(z) = \frac{1}{\sum_{i=0}^p a_i z^{-i}}}{allpole}
where $p$ is the number of poles and $a_0 \equiv 1$. The filter coefficients $\{a_i \}$ are chosen to minimise the mean square filter prediction error summed over the analysis window. The \HTK\ module \htool{HSigP} uses the \textit{autocorrelation method} to perform this optimisation as follows.

Given a window of speech samples $\{s_n, n=1,N \}$, the first $p+1$ terms of the autocorrelation sequence are calculated from
\hequation{r_i = \sum_{j=1}^{N-i} s_j s_{j+i}}{autoco}
where $i = 0,p$. The filter coefficients are then computed recursively using a set of auxiliary coefficients $\{k_i\}$ which can be interpreted as the reflection coefficients of an equivalent acoustic tube and the prediction error $E$ which is initially equal to $r_0$.
Let $\{k_j^{(i-1)} \}$ and $\{a_j^{(i-1)} \}$ be the reflection and filter coefficients for a filter of order $i-1$, then a filter of order $i$ can be calculated in three steps. Firstly, a new set of reflection coefficients\index{reflection coefficients} is calculated
\hequation{ k_j^{(i)} = k_j^{(i-1)}}{kupdate1}
for $j = 1,i-1$ and
\hequation{ k_i^{(i)} = \left\{ r_i + \sum_{j=1}^{i-1} a_j^{(i-1)} r_{i-j} \right\} / E^{(i-1)}}{kupdate2}
Secondly, the prediction energy is updated
\hequation{E^{(i)} = (1 - k_i^{(i)} k_i^{(i)} ) E^{(i-1)}}{Eupdate}
Finally, new filter coefficients are computed
\hequation{a_j^{(i)} = a_j^{(i-1)} - k_i^{(i)} a_{i-j}^{(i-1)}}{aupdate1}
for $j = 1,i-1$ and
\hequation{a_i^{(i)} = - k_i^{(i)} }{aupdate2}
This process is repeated from $i=1$ through to the required filter order $i=p$.

To effect the above transformation, the target parameter kind must be set to either \texttt{LPC}\index{lpc@\texttt{LPC}} to obtain the LP filter parameters $\{a_i\}$ or \texttt{LPREFC}\index{lprefc@\texttt{LPREFC}} to obtain the reflection coefficients $\{k_i \}$. The required filter order must also be set using the configuration parameter \texttt{LPCORDER}\index{lpcorder@\texttt{LPCORDER}}. Thus, for example, the following configuration settings would produce a target parameterisation consisting of 12 reflection coefficients per vector.
\begin{verbatim}
   TARGETKIND = LPREFC
   LPCORDER = 12
\end{verbatim}
An alternative LPC-based parameterisation is obtained by setting the target kind to \texttt{LPCEPSTRA}\index{lpcepstra@\texttt{LPCEPSTRA}} to generate linear prediction cepstra. The cepstrum of a signal is computed by taking a Fourier (or similar) transform of the log spectrum. In the case of linear prediction cepstra\index{linear prediction!cepstra}, the required spectrum is the linear prediction spectrum which can be obtained from the Fourier transform of the filter coefficients.
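The autocorrelation computation and the order recursion above can be written directly in code. The sketch below is illustrative Python, not the \htool{HSigP} source; it follows the \HTK\ sign convention in which the filter coefficients are the negatives of the usual textbook predictor coefficients.

```python
def autocorrelate(s, p):
    """r_i = sum_j s_j s_{j+i} for i = 0..p (0-based sample indexing)."""
    N = len(s)
    return [sum(s[j] * s[j + i] for j in range(N - i)) for i in range(p + 1)]

def lp_coefficients(r, p):
    """Order recursion for HTK-convention LP coefficients a_1..a_p,
    reflection coefficients k_1..k_p and final prediction error E."""
    a = [0.0] * (p + 1)            # a[0] is implicitly 1 and unused here
    k = [0.0] * (p + 1)
    E = r[0]                       # E^(0) = r_0
    for i in range(1, p + 1):
        k[i] = (r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / E
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] - k[i] * a[i - j]   # a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1)
        new_a[i] = -k[i]                        # a_i^(i) = -k_i
        a = new_a
        E *= 1.0 - k[i] * k[i]                  # E^(i) = (1 - k_i^2) E^(i-1)
    return a[1:], k[1:], E
```

For instance, with $r_0 = 1$, $r_1 = 0.5$ and $p = 1$, the recursion gives $k_1 = 0.5$, $a_1 = -0.5$ and $E = 0.75$.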
However, it can be shown that the required cepstra can be more efficiently computed using a simple recursion
\hequation{ c_n = -a_n - \frac{1}{n} \sum_{i=1}^{n-1} (n-i) a_i c_{n-i}}{lpcepstra}
The number of cepstra generated need not be the same as the number of filter coefficients, hence it is set by a separate configuration parameter called \texttt{NUMCEPS}\index{numceps@\texttt{NUMCEPS}}.

The principal advantage of cepstral coefficients is that they are generally decorrelated and this allows diagonal covariances to be used in the HMMs. However, one minor problem with them is that the higher order cepstra are numerically quite small and this results in a very wide range of variances when going from the low to high cepstral coefficients\index{cepstral coefficients!liftering}. \HTK\ does not have a problem with this but for pragmatic reasons such as displaying model parameters, flooring variances, etc., it is convenient to re-scale the cepstral coefficients to have similar magnitudes. This is done by setting the configuration parameter \texttt{CEPLIFTER}\index{ceplifter@\texttt{CEPLIFTER}} to some value $L$ to \textit{lifter} the cepstra according to the following formula
\hequation{ {c^{\prime}}_n = \left( 1 + \frac{L}{2} \sin \frac{\pi n}{L} \right) c_n}{ceplifter}
As an example, the following configuration parameters would use a 14th order linear prediction analysis to generate 12 liftered LP cepstra per target vector
\begin{verbatim}
   TARGETKIND = LPCEPSTRA
   LPCORDER = 14
   NUMCEPS = 12
   CEPLIFTER = 22
\end{verbatim}
These are typical of the values needed to generate a good front-end parameterisation for a speech recogniser based on linear prediction. \index{cepstral analysis!LPC based}\index{cepstral analysis!liftering coefficient}

Finally, note that the conversions supported by \HTK\ are not limited to the case where the source is a waveform.
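The cepstral recursion and liftering formula above can be sketched as follows (illustrative Python, assuming \HTK-convention LP coefficients $a_1,\ldots,a_p$ as input, with $a_n$ taken as zero for $n > p$):

```python
import math

def lp_cepstra(a, numceps):
    """c_n = -a_n - (1/n) sum_{i=1}^{n-1} (n-i) a_i c_{n-i};
    a is the list [a_1, ..., a_p]."""
    p = len(a)
    c = [0.0] * (numceps + 1)              # c[0] unused; indices are 1-based
    for n in range(1, numceps + 1):
        acc = -(a[n - 1] if n <= p else 0.0)
        acc -= sum((n - i) * a[i - 1] * c[n - i]
                   for i in range(1, min(n, p + 1))) / n
        c[n] = acc
    return c[1:]

def lifter(c, L=22):
    """c'_n = (1 + (L/2) sin(pi n / L)) c_n for n = 1..numceps."""
    return [(1.0 + (L / 2.0) * math.sin(math.pi * n / L)) * cn
            for n, cn in enumerate(c, start=1)]
```

For a first-order filter with $a_1 = -0.5$, the recursion gives $c_1 = 0.5$ and $c_2 = 0.125$; liftering then boosts the higher-order cepstra relative to the lower ones.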
\HTK\ can convert any LP-based parameter into any other LP-based parameter.

\mysect{Filterbank Analysis}{fbankanal}

The human ear resolves frequencies non-linearly across the audio spectrum and empirical evidence suggests that designing a front-end to operate in a similar non-linear manner improves recognition performance. A popular alternative to linear prediction based analysis is therefore filterbank analysis since this provides a much more straightforward route to obtaining the desired non-linear frequency resolution. However, filterbank amplitudes are highly correlated and hence, the use of a cepstral transformation in this case is virtually mandatory if the data is to be used in an HMM based recogniser with diagonal covariances. \index{cepstral analysis!filter bank} \index{speech input!filter bank}

\HTK\ provides a simple Fourier transform based filterbank designed to give approximately equal resolution on a mel-scale. Fig.~\href{f:melfbank} illustrates the general form of this filterbank. As can be seen, the filters used are triangular and they are equally spaced along the mel-scale which is defined by
\hequation{ \mbox{Mel}(f) = 2595 \log_{10}(1 + \frac{f}{700})}{melscale}
To implement this filterbank, the window of speech data is transformed\index{mel scale} using a Fourier transform and the magnitude is taken. The magnitude coefficients are then \textit{binned} by correlating them with each triangular filter. Here binning means that each FFT magnitude coefficient is multiplied by the corresponding filter gain and the results accumulated. Thus, each bin holds a weighted sum representing the spectral magnitude in that filterbank channel.\index{binning} As an alternative, the Boolean configuration parameter \texttt{USEPOWER}\index{usepower@\texttt{USEPOWER}} can be set true to use the power rather than the magnitude of the Fourier transform in the binning process.
\index{cepstral analysis!power vs magnitude}

\centrefig{melfbank}{110}{Mel-Scale Filter Bank}
\index{speech input!bandpass filtering}

Normally the triangular filters are spread over the whole frequency range from zero up to the Nyquist frequency. However, band-limiting is often useful to reject unwanted frequencies or avoid allocating filters to frequency regions in which there is no useful signal energy. For filterbank analysis only, lower and upper frequency cut-offs can be set using the configuration parameters \texttt{LOFREQ}\index{lofreq@\texttt{LOFREQ}} and \texttt{HIFREQ}\index{hifreq@\texttt{HIFREQ}}. For example,
\begin{verbatim}
   LOFREQ = 300
   HIFREQ = 3400
\end{verbatim}
might be used for processing telephone speech. When low and high pass cut-offs are set in this way, the specified number of filterbank channels are distributed