the regression window.

In version 1.5 of \HTK\ and earlier, this end-effect problem was solved
by using simple first order differences at the start and end of the speech, that is
\begin{equation}
   d_t = c_{t+1} - c_t,\;\;\; t<\Theta
\end{equation}
and
\begin{equation}
   d_t = c_t - c_{t-1}, \;\;\; t \geq T-\Theta
\end{equation}
where $T$ is the length of the data file.  If required, this older behaviour
can be restored by setting the configuration variable \texttt{V1COMPAT}\index{v1compat@\texttt{V1COMPAT}}
to true in \htool{HParm}.

For some purposes, it is useful to use simple differences throughout.  This
can be achieved by setting the configuration variable \texttt{SIMPLEDIFFS}\index{simplediffs@\texttt{SIMPLEDIFFS}}
to true in \htool{HParm}.  In this case, just the end-points of the delta window
are used, i.e.
\hequation{
   d_t = \frac{ (c_{t+\Theta} - c_{t-\Theta}) }{
                2 \Theta}
}{simdiffs}
\index{simple differences}

When delta and acceleration coefficients are requested, they are computed for
all static parameters including energy if present.  In some applications, the
absolute energy is not useful but time derivatives of the energy may be.  By
including the \texttt{\_E} qualifier together with the
\texttt{\_N}\index{qualifiers!aaan@\texttt{\_N}} qualifier, the absolute energy
is suppressed leaving just the delta and acceleration coefficients of the
energy.

\mysect{Storage of Parameter Files}{parmstore}

Whereas \HTK\ can handle waveform data in a variety of file formats,
all parameterised speech data is stored externally in either native
\HTK\ format data files or Entropic Esignal format files.
Entropic ESPS format is no longer supported directly, but input and output
filters can be used to convert ESPS to Esignal format on input and
Esignal to ESPS on output.

\subsection{\HTK\ Format Parameter Files}

\HTK\ format files consist of a contiguous sequence of \textit{samples}
preceded by a header.  Each sample is a vector of either 2-byte integers or
4-byte floats.
2-byte integers are used for compressed forms as described
below and for vector quantised data as described later in
section~\ref{s:vquant}.  \HTK\ format data files can also be used to store
speech waveforms as described in section~\ref{s:waveform}.  \index{file formats!HTK}

The \HTK\ file format header is 12 bytes long and contains the following data
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{nSamples}\>-- number of samples in file (4-byte integer) \\
\>\texttt{sampPeriod}\>-- sample period in 100ns units (4-byte integer) \\
\>\texttt{sampSize}\>-- number of bytes per sample (2-byte integer) \\
\>\texttt{parmKind}\>-- a code indicating the sample kind (2-byte integer)
\end{tabbing}
The parameter kind\index{parameter kind} consists of a 6 bit
code representing the basic parameter kind plus additional bits for
each of the possible qualifiers\index{qualifiers}.  The basic parameter kind codes are
\begin{tabbing}
++++\= +++ \= ++++++++ \=   \kill
\>0 \> \texttt{WAVEFORM} \> sampled waveform \\
\>1 \> \texttt{LPC} \> linear prediction filter coefficients \\
\>2 \> \texttt{LPREFC} \> linear prediction reflection coefficients \\
\>3 \> \texttt{LPCEPSTRA} \> LPC cepstral coefficients \\
\>4 \> \texttt{LPDELCEP}  \> LPC cepstra plus delta coefficients \\
\>5 \> \texttt{IREFC}       \> LPC reflection coef in 16 bit integer format  \\
\>6 \> \texttt{MFCC}      \> mel-frequency cepstral coefficients \\
\>7 \> \texttt{FBANK}     \> log mel-filter bank channel outputs \\
\>8 \> \texttt{MELSPEC}     \> linear mel-filter bank channel outputs \\
\>9 \> \texttt{USER}      \> user defined sample kind \\
\>10 \> \texttt{DISCRETE} \> vector quantised data \\
\end{tabbing}
and the bit-encoding for the qualifiers (in octal) is
\begin{tabbing}
++++\= +++ \= ++++++++ \=   \kill
\>\texttt{\_E} \> 000100 \> has energy \\
\>\texttt{\_N} \> 000200 \> absolute energy suppressed \\
\>\texttt{\_D} \> 000400 \> has delta coefficients \\
\>\texttt{\_A} \> 001000 \> has acceleration coefficients\\
\>\texttt{\_C} \> 002000
\> is compressed \\
\>\texttt{\_Z} \> 004000 \> has zero mean static coef. \\
\>\texttt{\_K} \> 010000 \> has CRC checksum \\
\>\texttt{\_O} \> 020000 \> has 0'th cepstral coef. \\
\end{tabbing}
\index{qualifiers!codes}
The \texttt{\_A} qualifier can only be specified when \texttt{\_D}
is also specified.
The \texttt{\_N} qualifier is only valid when both energy and delta
coefficients are present.

The sample kind \texttt{LPDELCEP} is identical to \texttt{LPCEPSTRA\_D}
and is retained for compatibility with older versions of \HTK.
The \texttt{\_C}\index{qualifiers!aaac@\texttt{\_C}} and \texttt{\_K}\index{qualifiers!aaak@\texttt{\_K}}
qualifiers only exist in external files.  Compressed
files are always decompressed on loading and any attached CRC is checked
and removed.  An external file can contain both an energy
term and a 0'th order cepstral coefficient.  These may be retained
on loading but normally one or the other is discarded\footnote{Some
applications may require the 0'th order cepstral coefficient
in order to recover the filterbank coefficients from the cepstral
coefficients.}.

\putfig{HTKFormat}{130}{Parameter Vector Layout in \HTK\ Format Files}

All parameterised forms of \HTK\ data files consist of a sequence of vectors.
Each vector is organised as shown by the examples in Fig~\href{f:HTKFormat}
where various different qualified forms are listed.  As can be seen, an energy
value if present immediately follows the base coefficients.  If delta
coefficients are added, these follow the base coefficients and energy value.
Note that the base form \texttt{LPC} is used in this figure only as an example;
the same layout applies to all base sample kinds.  If the 0'th order cepstral
coefficient is included as well as energy then it is inserted immediately
before the energy coefficient, otherwise it replaces it.

For external storage of speech parameter files, two compression methods are
provided.
For LP coding only, the \texttt{IREFC} parameter kind exploits the
fact that the reflection coefficients are bounded by $\pm 1$ and hence they can
be stored as scaled integers such that $+1.0$ is stored as $32767$ and $-1.0$
is stored as $-32767$.  For other types of parameterisation, a more general
compression facility indicated by the
\texttt{\_C}\index{qualifiers!aaac@\texttt{\_C}} qualifier is used.
\HTK\ compressed parameter files consist of a set of compressed parameter
vectors stored as shorts such that for parameter $x$
\begin{eqnarray}
x_{short} & = & A*x_{float}-B  \nonumber
\end{eqnarray}
The coefficients $A$ and $B$ are defined as
\begin{eqnarray}
A & = & 2*I/(x_{max}-x_{min}) \nonumber\\
B & = & (x_{max}+x_{min})*I/(x_{max}-x_{min}) \nonumber
\end{eqnarray}
where $x_{max}$ is the maximum value of parameter $x$ in the whole file and
$x_{min}$ is the corresponding minimum. $I$ is the maximum range of a 2-byte
integer i.e.\ 32767.  The values of $A$ and $B$ are stored as two floating
point vectors prepended to the start of the file immediately after the header.

When a \HTK\ tool writes out a speech file to external storage, no further
signal conversions are performed.  Thus, for most purposes, the target
parameter kind specifies both the required internal representation and the form
of the written output, if any.  However, there is a distinction in the way that
the external data is actually stored.  Firstly, it can be compressed as
described above by setting the configuration parameter \texttt{SAVECOMPRESSED}
to true.  If the target kind is \texttt{LPREFC} then this compression is
implemented by converting to \texttt{IREFC} otherwise the general compression
algorithm described above is used.  Secondly, in order to avoid data corruption
problems, externally stored \HTK\ parameter files can have a cyclic redundancy
checksum appended.
This is indicated by the qualifier
\texttt{\_K}\index{qualifiers!aaak@\texttt{\_K}} and it is generated by setting
the configuration parameter \texttt{SAVEWITHCRC} to true.  The principal tool
which uses these output conversions is \htool{HCopy} (see
section~\ref{s:UseHCopy}).

\subsection{Esignal Format Parameter Files}
\index{file formats!Esignal}

The default for parameter files is native \HTK\ format.  However, \HTK\ tools
also support the Entropic Esignal format for both input and output. Esignal
replaces the Entropic ESPS file format. To ensure compatibility, Entropic
provides conversion programs from ESPS to ESIG and vice versa.

To indicate that a source file is in Esignal format the configuration
variable  \texttt{SOURCEFORMAT}\index{sourceformat@\texttt{SOURCEFORMAT}}
should be set to \texttt{ESIG}.  Alternatively, \texttt{-F ESIG}\index{standard options!aaaf@\texttt{-F}}
can be specified as a command-line option.  To generate Esignal format
output files, the configuration variable
\texttt{TARGETFORMAT} should be set to \texttt{ESIG} or the command line option
\texttt{-O ESIG} should be set.

ESIG files consist of three parts: a preamble, a sequence of field
specifications called the field list and a sequence of records. The preamble
and the field list together constitute the header. The preamble is purely
ASCII. Currently it consists of 6 information items that are all terminated
by a new line.
The information in the preamble is the following:
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{line 1}\>-- identification of the file format \\
\>\texttt{line 2}\>-- version of the file format\\
\>\texttt{line 3}\>-- architecture (ASCII, EDR1, EDR2, machine name)\\
\>\texttt{line 4}\>-- preamble size (48 bytes)\\
\>\texttt{line 5}\>-- total header size\\
\>\texttt{line 6}\>-- record size\\
\end{tabbing}
All ESIG files that are output by \HTK\ programs contain the following
global fields:
\begin{description}
  \item[commandLine] the command-line used to generate the file;
  \item[recordFreq] a double value that indicates the sample frequency
        in Hertz;
  \item[startTime] a double value that indicates a time at which the first
        sample is presumed to be starting;
  \item[parmKind] a character string that indicates the full
        type of parameters in the file, e.g.\ \texttt{MFCC\_E\_D};
  \item[source\_1] if the input file was an ESIG file this field includes the
        header items in the input file.
\end{description}
After that there are field specifiers for the records. The first specifier is
for the basekind of the parameters, e.g.\ \texttt{MFCC}. Then for each
available qualifier there are additional specifiers. Possible specifiers are:
\begin{tabbing}
++++\=   \kill
\>\texttt{zeroc}  \\
\>\texttt{energy}\\
\>\texttt{delta}\\
\>\texttt{delta\_zeroc} \\
\>\texttt{delta\_energy}\\
\>\texttt{accs}\\
\>\texttt{accs\_zeroc} \\
\>\texttt{accs\_energy}\\
\end{tabbing}
\index{qualifiers!ESIG field specifiers}
The data segments of the ESIG files have exactly the same format as
the corresponding \HTK\ files. This format was described in the previous section.

\HTK\ can only input parameter files that have a valid parameter kind as value
of the header field \texttt{parmKind}. If this field does not exist or if the
value of this field does not contain a valid parameter kind, the file is rejected.
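
As an illustrative sketch only of the format settings described earlier in this
subsection, a configuration file requesting Esignal format for both input and
output might contain the lines below.  The layout assumes the usual \HTK\
configuration file syntax of one \texttt{NAME = VALUE} setting per line with
\texttt{\#} introducing a comment; the choice of settings is an example and
not a recommended default.
\begin{verbatim}
# assumed example: read source files in Esignal format
SOURCEFORMAT = ESIG
# assumed example: write output files in Esignal format
TARGETFORMAT = ESIG
\end{verbatim}
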
After the header has been read the file is treated as an \HTK\ file.

\mysect{Waveform File Formats}{waveform}

For reading waveform data files, \HTK\ can support a variety of different
formats and these are all briefly described in this section.  The default
speech file format is \HTK. If a different format is to be used, it can be
specified by setting the configuration parameter
\texttt{SOURCEFORMAT}\index{sourceformat@\texttt{SOURCEFORMAT}}.  However,
since file formats need to be changed often, they can also be set individually
via the \texttt{-F}\index{standard options!aaaf@\texttt{-F}} command-line
option.  This over-rides any setting of the \texttt{SOURCEFORMAT} configuration
parameter.

Similarly for the output of waveforms, the format can be set using either the
configuration parameter \texttt{TARGETFORMAT} or the \texttt{-O} command-line
option.  However, for output only native \HTK\ format (\texttt{HTK}), Esignal
format (\texttt{ESIG}) and headerless (\texttt{NOHEAD}) waveform files are
supported.

The following sub-sections give a brief description of each of the waveform
file formats supported by \HTK.

\subsection{HTK File Format}
\index{file formats!HTK}

The \HTK\ file format for waveforms is identical to that described in
section~\ref{s:parmstore} above.  It consists of a 12 byte header followed
by a sequence of 2 byte integer speech samples.  For waveforms, the
\texttt{sampSize} field will be 2 and the \texttt{parmKind} field will be 0.
The \texttt{sampPeriod} field gives the sample period in 100ns units, hence for
example, it will have the value 1000 for speech files sampled at 10kHz and 625
for speech files sampled at 16kHz.

\subsection{Esignal File Format}
\index{file formats!Esignal}

The Esignal file format for waveforms is similar to that described in
section~\ref{s:parmstore} above with the following exceptions. When reading an