% speechio.tex
ESIG waveform file the \HTK\ programs only check whether the record length equals 2 and whether the datatype of the only field in the data records is \texttt{SHORT}. The data field that is created on output of a waveform is called \texttt{WAVEFORM}.

\subsection{TIMIT File Format}
\index{file formats!TIMIT}

The TIMIT format has the same structure as the HTK format except that the 12-byte header contains the following
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{hdrSize}\>-- number of bytes in header ie 12 (2-byte integer) \\
\>\texttt{version}\>-- version number (2-byte integer) \\
\>\texttt{numChannels}\>-- number of channels (2-byte integer) \\
\>\texttt{sampRate}\>-- sample rate (2-byte integer) \\
\>\texttt{nSamples}\>-- number of samples in file (4-byte integer)
\end{tabbing}
TIMIT format data is used only on the prototype TIMIT CD ROM.

\subsection{NIST File Format}
\index{file formats!NIST}

The NIST file format is also referred to as the Sphere file format. A NIST header consists of ASCII text. It begins with a label of the form \texttt{NISTxx}, where \texttt{xx} is a version code, followed by the number of bytes in the header. The remainder of the header consists of name-value pairs, of which \HTK\ decodes the following
\begin{tabbing}
++ \= +++++++++++++ \= \kill
\>\texttt{sample\_rate} \>-- sample rate in Hz \\
\>\texttt{sample\_n\_bytes} \>-- number of bytes in each sample \\
\>\texttt{sample\_count} \>-- number of samples in file \\
\>\texttt{sample\_byte\_format} \>-- byte order \\
\>\texttt{sample\_coding} \>-- speech coding eg pcm, $\mu$law, shortpack \\
\>\texttt{channels\_interleaved} \>-- for 2-channel data only
\end{tabbing}
The current NIST Sphere data format\index{NIST Sphere data format} subsumes a variety of internal data organisations. \HTK\ currently supports the interleaved $\mu$law used in Switchboard, the Shortpack compression used in the original version of WSJ0, and standard 16-bit linear PCM as used in Resource Management, TIMIT, etc.
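To make the layout concrete, the following sketch (hypothetical illustration, not \HTK\ code) parses the ASCII name-value pairs of a NIST Sphere header; in a real file the header is padded out to the byte count given on its second line, and the sample data starts at that offset:

```python
import io

def read_nist_header(f):
    # Hypothetical illustration - not part of HTK.
    # Line 1: label, eg "NIST_1A"; line 2: total header size in bytes.
    label = f.readline().split()[0]
    hdr_size = int(f.readline())
    fields = {}
    for line in f:
        parts = line.split()
        if not parts or parts[0] == "end_head":
            break                       # "end_head" terminates the pairs
        name, ftype, value = parts[0], parts[1], " ".join(parts[2:])
        if ftype == "-i":               # integer-valued field
            value = int(value)
        fields[name] = value
    return label, hdr_size, fields

# A miniature example header (field values are illustrative only)
hdr = ("NIST_1A\n   1024\n"
       "sample_rate -i 16000\n"
       "sample_count -i 32000\n"
       "sample_coding -s3 pcm\n"
       "end_head\n")
label, size, fields = read_nist_header(io.StringIO(hdr))
```

The sketch reads from a text stream for simplicity; real Sphere files would be opened in binary mode and the bytes decoded as ASCII.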
It does not currently support the Shorten compression format as used in WSJ1 due to licensing restrictions. Hence, to read WSJ1, the files must first be converted to standard 16-bit linear PCM using the NIST-supplied decompression routines. This is most conveniently done under UNIX by using the decompression program as an input filter set via the environment variable \texttt{HWAVEFILTER}\index{hwavefilter@\texttt{HWAVEFILTER}} (see section~\ref{s:iopipes}).

For interleaved $\mu$law as used in Switchboard, the default is to add the two channels together. The left channel alone can be obtained by setting the environment variable \texttt{STEREOMODE} to \texttt{LEFT}, and the right channel alone by setting \texttt{STEREOMODE} to \texttt{RIGHT}.
\index{mu law encoded files}

\subsection{SCRIBE File Format}
\index{file formats!SCRIBE}

The SCRIBE format is a subset of the standard laid down by the European Esprit Programme SAM Project. SCRIBE data files are headerless and therefore consist of just a sequence of 16-bit sample values. \HTK\ assumes by default that the sample rate is 20kHz. The configuration parameter \texttt{SOURCERATE} should be set to override this. The byte ordering assumed for SCRIBE data files is \texttt{VAX} (little-endian).

\subsection{SDES1 File Format}
\index{file formats!Sound Designer(SDES1)}

The SDES1 format refers to the ``Sound Designer I'' format defined by Digidesign Inc in 1985 for multimedia and general audio applications. It is used for storing short monaural sound samples. The SDES1 header is complex (1336 bytes) since it allows associated display window information to be stored in it, as well as providing facilities for specifying repeat loops.
The \HTK\ input routine for this format just picks out the following information
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{headerSize} \>-- size of header ie 1336 (2-byte integer) \\
\>(182-byte filler) \\
\>\texttt{fileSize} \>-- number of bytes of sampled data (4-byte integer) \\
\>(832-byte filler) \\
\>\texttt{sampRate} \>-- sample rate in Hz (4-byte integer) \\
\>\texttt{sampPeriod} \>-- sample period in microseconds (4-byte integer) \\
\>\texttt{sampSize} \>-- number of bits per sample ie 16 (2-byte integer)
\end{tabbing}

\subsection{AIFF File Format}
\index{file formats!Audio Interchange (AIFF)}

The AIFF format was defined by Apple Computer for storing monaural and multichannel sampled sounds. An AIFF file consists of a number of {\it chunks}. A {\it Common} chunk contains the fundamental parameters of the sound (sample rate, number of channels, etc) and a {\it Sound Data} chunk contains the sampled audio data. \HTK\ only partially supports AIFF since some of the information in it is stored as floating-point numbers. In particular, the sample rate is stored in this form and, to avoid portability problems, \HTK\ ignores the given sample rate and assumes that it is 16kHz. If this default rate is incorrect, the true sample period should be specified by setting the \texttt{SOURCERATE} configuration parameter. Full details of the AIFF format are available from Apple Developer Technical Support.

\subsection{SUNAU8 File Format}
\index{file formats!Sun audio (SUNAU8)}

The SUNAU8 format defines a subset of the ``.au'' and ``.snd'' audio file format used by Sun and NeXT. A SUNAU8 speech data file consists of a header followed by 8-bit $\mu$law encoded speech samples.
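As a sketch of what $\mu$law expansion involves (\HTK\ performs this conversion internally when reading such files), one 8-bit code maps to a 16-bit linear sample via the standard G.711 expansion, shown here purely for illustration:

```python
def mulaw_decode(code):
    """Expand one 8-bit mu-law code to a 16-bit linear sample (G.711).
    Illustrative sketch only; HTK does this conversion internally."""
    code = ~code & 0xFF                 # stored bytes are complemented
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

# 0xFF encodes zero; 0x00 and 0x80 are the extreme negative/positive codes
samples = [mulaw_decode(b) for b in (0xFF, 0x00, 0x80)]
```

Note the logarithmic companding: an 8-bit code spans roughly the same dynamic range as 14 bits of linear PCM.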
The header is 28 bytes and contains the following fields, each of which is 4 bytes
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{magicNumber} \>-- magic number 0x2e736e64 \\
\>\texttt{dataLocation} \>-- offset to start of data \\
\>\texttt{dataSize} \>-- number of bytes of data \\
\>\texttt{dataFormat} \>-- data format code, which is 1 for 8-bit $\mu$law \\
\>\texttt{sampRate} \>-- a sample rate code, which is always 8012.821 Hz \\
\>\texttt{numChan} \>-- the number of channels \\
\>\texttt{info} \>-- arbitrary character string, minimum length 4 bytes
\end{tabbing}
No default byte ordering is assumed for this format. If the data source is known to be different from the machine being used, then the environment variable \texttt{BYTEORDER} must be set appropriately. Note that on Sun Sparc machines with a 16-bit audio device, the sampling rate of 8012.821Hz is not supported and playback will be performed at 8kHz.

\subsection{OGI File Format}
\index{file formats!OGI}

The OGI format is similar to TIMIT. The header contains the following
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{hdrSize}\>-- number of bytes in header \\
\>\texttt{version}\>-- version number (2-byte integer) \\
\>\texttt{numChannels}\>-- number of channels (2-byte integer) \\
\>\texttt{sampRate}\>-- sample rate (2-byte integer) \\
\>\texttt{nSamples}\>-- number of samples in file (4-byte integer) \\
\>\texttt{lendian}\>-- used to test for byte swapping (4-byte integer)
\end{tabbing}

\subsection{WAV File Format}
\index{file formats!WAV}

The WAV file format is a subset of Microsoft's RIFF specification for the storage of multimedia files. A RIFF file starts out with a file header followed by a sequence of data ``chunks''. A WAV file is often just a RIFF file with a single ``WAVE'' chunk which consists of two sub-chunks: a ``fmt'' chunk specifying the data format and a ``data'' chunk containing the actual sample data.
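Outside \HTK, this same chunk structure can be written and read back with Python's standard \texttt{wave} module; the following minimal illustration builds a mono 16-bit PCM WAV in memory and reads its format parameters back:

```python
import io
import struct
import wave

# Write a minimal mono 16-bit PCM WAV file into an in-memory buffer.
buf = io.BytesIO()
w = wave.open(buf, "wb")
w.setnchannels(1)        # mono
w.setsampwidth(2)        # 2 bytes per sample (16-bit)
w.setframerate(16000)    # 16kHz sample rate
w.writeframes(struct.pack("<4h", 0, 1000, -1000, 0))  # four little-endian samples
w.close()                # finalises the RIFF/fmt/data chunk headers

# Read the chunk headers back.
buf.seek(0)
r = wave.open(buf, "rb")
info = (r.getnchannels(), r.getsampwidth(), r.getframerate(), r.getnframes())
r.close()
```

Note that the module writes the little-endian (\texttt{VAX}) byte order that the RIFF specification mandates.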
The WAV file header contains the following
\begin{tabbing}
++ \= +++++++++ \= \kill
\>\texttt{'RIFF'}\>-- RIFF file identification (4 bytes) \\
\>\texttt{<length>}\>-- length field (4 bytes) \\
\>\texttt{'WAVE'}\>-- WAVE chunk identification (4 bytes) \\
\>\texttt{'fmt '}\>-- format sub-chunk identification (4 bytes) \\
\>\texttt{flength}\>-- length of format sub-chunk (4-byte integer) \\
\>\texttt{format}\>-- format specifier (2-byte integer) \\
\>\texttt{chans}\>-- number of channels (2-byte integer) \\
\>\texttt{sampsRate}\>-- sample rate in Hz (4-byte integer) \\
\>\texttt{bpsec}\>-- bytes per second (4-byte integer) \\
\>\texttt{bpsample}\>-- bytes per sample (2-byte integer) \\
\>\texttt{bpchan}\>-- bits per channel (2-byte integer) \\
\>\texttt{'data'}\>-- data sub-chunk identification (4 bytes) \\
\>\texttt{dlength}\>-- length of data sub-chunk (4-byte integer)
\end{tabbing}
Support is provided for 8-bit CCITT mu-law, 8-bit CCITT a-law, 8-bit PCM linear and 16-bit PCM linear, all in stereo or mono (the \texttt{STEREOMODE} parameter is used as for NIST files). The default byte ordering assumed for \texttt{WAV} data files is \texttt{VAX} (little-endian).

\subsection{ALIEN and NOHEAD File Formats}
\index{file formats!ALIEN}\index{file formats!NOHEAD}

\HTK\ tools can read speech waveform files with alien formats provided that their overall structure is that of a header followed by data. This is done by setting the format to \texttt{ALIEN} and setting the environment variable \texttt{HEADERSIZE} to the number of bytes in the header. \HTK\ will then attempt to infer the rest of the information it needs. However, if input is from a pipe, then the number of samples expected must be set using the environment variable \texttt{NSAMPLES}\index{nsamples@\texttt{NSAMPLES}}. The sample rate of the source file is defined by the configuration parameter \texttt{SOURCERATE}, as described in section~\ref{s:sigproc}.
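For example, a configuration for reading such an alien-format file might look like the following; all of the numeric values are purely illustrative and would be replaced by those of the actual file:

```
# Illustrative settings for an alien-format waveform file
SOURCEFORMAT = ALIEN      # header followed by raw sample data
HEADERSIZE   = 512        # header size in bytes (example value)
NSAMPLES     = 160000     # required only if input comes from a pipe
SOURCERATE   = 625.0      # sample period in 100ns units, ie 16kHz
```
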
If the file has no header then the format \texttt{NOHEAD} may be specified instead of \texttt{ALIEN}\index{alien@\texttt{ALIEN}}, in which case \texttt{HEADERSIZE}\index{headersize@\texttt{HEADERSIZE}} is assumed to be zero.

\mysect{Direct Audio Input/Output}{audioio}
\index{speech input!direct audio}

Many \HTK\ tools, particularly recognition tools, can input speech waveform data directly from an audio device. The basic mechanism for doing this is simply to specify the \texttt{SOURCEKIND} as being \texttt{HAUDIO}\index{haudio@\texttt{HAUDIO}}, following which speech samples will be read directly from the host computer's audio input device.

Note that for live audio input, the configuration variable \texttt{ENORMALISE} should be set to false both during training and recognition, since energy normalisation cannot be used with live audio input and the default setting for this variable is \texttt{TRUE}. If you have existing models trained with \texttt{ENORMALISE} set to true, you can retrain them using {\it single-pass retraining} (see section~\ref{s:singlepass}).

When using direct audio input\index{direct audio input}, the input sampling rate may be set explicitly using the configuration parameter \texttt{SOURCERATE},\index{sourcerate@\texttt{SOURCERATE}} otherwise \HTK\ will assume that it has been set by some external means such as an audio control panel. In the latter case, it must be possible for \htool{HAudio} to obtain the sample rate from the audio driver, otherwise an error message will be generated.

Although the detailed control of audio hardware is typically machine dependent, \HTK\ provides a number of Boolean configuration variables to request specific input and output sources.
These are indicated in the following table
\begin{center}
\index{audio source}\index{audio output}
\begin{tabular}{|c|l|} \hline
Variable & Source/Sink \\ \hline
\texttt{LINEIN} & line input \\
\texttt{MICIN} & microphone input \\
\texttt{LINEOUT} & line output \\
\texttt{PHONESOUT} & headphones output \\
\texttt{SPEAKEROUT} & speaker output \\ \hline
\end{tabular}
\end{center}
\index{linein@\texttt{LINEIN}}
\index{micin@\texttt{MICIN}}
\index{lineout@\texttt{LINEOUT}}
\index{phonesout@\texttt{PHONESOUT}}
\index{speakerout@\texttt{SPEAKEROUT}}
The major complication in using direct audio is in starting and stopping the input device. The simplest approach is for \HTK\ tools to take direct control and, for example, enable the audio input for a fixed period determined via a command line option. However, the \htool{HAudio}/\htool{HParm} modules provide two more powerful built-in facilities for audio input control.
\index{direct audio input!silence detector!speech detector}

The first method of audio input control involves the use of an automatic energy-based speech/silence detector, which is enabled by setting the configuration parameter \texttt{USESILDET}\index{usesildet@\texttt{USESILDET}} to true. Note that the speech/silence detector can also operate on waveform input files.

The automatic speech/silence detector uses a two-level algorithm which first classifies each frame of data as either speech or silence and then applies a heuristic to determine the start and end of each utterance.
\index{HParm!SILENERGY} \index{HParm!SPEECHTHRESH}
The detector classifies each frame as speech or silence based solely on the log energy of the signal. When the energy value exceeds a threshold the frame is marked as speech, otherwise as silence. The threshold is made up of two components, both of which can be set by configuration variables. The first component represents the mean energy level of silence and can be set explicitly via the configuration parameter \texttt{SILENERGY}.
However, it is more usual to take a measurement from the environment directly. Setting the configuration parameter \texttt{MEASURESIL} to true will cause the detector to calibrate its parameters from the current acoustic environment just prior to sampling. The second threshold component is the level above which frames are classified as speech (\texttt{SPEECHTHRESH}