%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         developed at:                                       */
%/*                                                             */
%/*      Speech Vision and Robotics group                       */
%/*      Cambridge University Engineering Department            */
%/*      http://svr-www.eng.cam.ac.uk/                          */
%/*                                                             */
%/*      Entropic Cambridge Research Laboratory                 */
%/*      (now part of Microsoft)                                */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*   Copyright: Microsoft Corporation                          */
%/*    1995-2000 Redmond, Washington USA                        */
%/*              http://www.microsoft.com                       */
%/*                                                             */
%/*         2001 Cambridge University                           */
%/*              Engineering Department                         */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **    This banner notice must not be removed       **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 1/12/97
%

\mychap{Speech Input/Output}{speechio}

Many tools need to input parameterised speech data and \HTK\ provides a
number of different methods for doing this:
\begin{itemize}
\item input from a previously encoded speech parameter file
\item input from a waveform file which is encoded as part of the input processing
\item input from an audio device which is encoded as part of the input processing.
\end{itemize}
For input from a waveform file, a large number of different file formats
are supported, including all of the commonly used CD-ROM formats.
Input/output for parameter files is limited to the standard \HTK\ file format
and the new Entropic Esignal format.

\sidepic{Tool.spio}{60}{}

All \HTK\ speech input\index{speech input} is controlled by configuration
parameters which give details of what processing operations to apply to each
input speech file or audio source. This chapter describes speech input/output
in \HTK. The general mechanisms are explained and the various configuration
parameters are defined. The facilities for signal pre-processing, linear
prediction-based processing, Fourier-based processing and vector quantisation
are presented and the supported file formats are given. Also described are the
facilities for augmenting the basic speech parameters with energy measures,
delta coefficients and acceleration (delta-delta) coefficients and for
splitting each parameter vector into multiple data streams to form
\textit{observations}. The chapter concludes with a brief description of the
tools \htool{HList} and \htool{HCopy} which are provided for viewing,
manipulating and encoding speech files.

\mysect{General Mechanism}{genio}

The facilities for speech input and output in \HTK\ are provided
by five distinct modules: \htool{HAudio}, \htool{HWave},
\htool{HParm}, \htool{HVQ} and \htool{HSigP}. The interconnections
between these modules are shown in Fig.~\href{f:Spmods}.
\index{speech input!general mechanism}
\sidefig{Spmods}{62}{Speech Input Subsystem}{2}{Waveforms
are read from files using \htool{HWave}, or are input direct from
an audio device using \htool{HAudio}. In a few rare cases, such as
in the display tool \htool{HSLab}, only the speech waveform is needed.
However, in most cases the waveform is wanted in parameterised form and
the required encoding is performed by \htool{HParm}
using the signal processing operations defined in \htool{HSigP}.
The parameter vectors are output by \htool{HParm}
in the form of observations which are the basic units of data processed
by the \HTK\ recognition and training tools.
An observation contains all
components of a raw parameter vector but it may possibly be split into
a number of independent parts. Each such part is regarded by a \HTK\ tool
as a statistically independent data stream. Also, an observation
may include VQ indices attached to each data stream. Alternatively,
VQ indices can be read directly from a parameter file in which case the
observation will contain only VQ indices.}

Usually a \HTK\ tool will require a number of speech data files to be
specified on the command line. In the majority of cases, these
files will be required in parameterised form. Thus, the following example
invokes the \HTK\ embedded training tool \htool{HERest}
to re-estimate a set of models using the speech data
files \texttt{s1}, \texttt{s2}, \texttt{s3}, \ldots . These are
input via the library module \htool{HParm} and they
must be in exactly the form needed by the models.
\begin{verbatim}
 HERest ... s1 s2 s3 s4 ...
\end{verbatim}
However, if the external form of the speech data files is not in the
required form, it will often be possible to convert them automatically during
the input process. To do this, configuration parameter values are specified
whose function is to define exactly how the conversion should be done.
The key idea is that there is a \textit{source parameter kind} and a
\textit{target parameter kind}. The source refers to the natural form of the
data in the external medium and the target refers to the form of the data
that is required internally by the \HTK\ tool. The principal function of the
speech input subsystem is to convert the source parameter kind into the
required target parameter kind.
\index{speech input!automatic conversion}

Parameter kinds consist of a base form to which one or more
qualifiers may be attached where each qualifier consists of
a single letter preceded by an underscore character.
\index{qualifiers}
Some examples of parameter kinds are
\begin{varlist}
 \fwitem{2cm}{WAVEFORM} simple waveform
 \fwitem{2cm}{LPC} linear prediction coefficients
 \fwitem{2cm}{LPC\_D\_E} LPC with energy and delta coefficients
 \fwitem{2cm}{MFCC\_C} compressed mel-cepstral coefficients
\end{varlist}
\index{speech input!target kind}
The required source and target parameter kinds are specified
using the configuration parameters \texttt{SOURCEKIND}\index{sourcekind@\texttt{SOURCEKIND}}
and \texttt{TARGETKIND}\index{targetkind@\texttt{TARGETKIND}}.
Thus, if the following configuration parameters were defined
\begin{verbatim}
 SOURCEKIND = WAVEFORM
 TARGETKIND = MFCC_E
\end{verbatim}
then the speech input subsystem would expect each input file to contain
a speech waveform and it would convert it to mel-frequency cepstral
coefficients with log energy appended.

The source need not be a waveform. For example, the configuration
parameters
\begin{verbatim}
 SOURCEKIND = LPC
 TARGETKIND = LPREFC
\end{verbatim}
would be used to read in files containing linear prediction coefficients
and convert them to reflection coefficients.

For convenience, a special parameter kind called
\texttt{ANON}\index{anon@\texttt{ANON}} is provided. When the source is
specified as \texttt{ANON} then the actual kind of the source is determined
from the input file. When \texttt{ANON} is used in the target kind, then it is
assumed to be identical to the source. For example, the effect of the
following configuration parameters
\begin{verbatim}
 SOURCEKIND = ANON
 TARGETKIND = ANON_D
\end{verbatim}
would simply be to add delta coefficients to whatever the source form
happened to be.
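
In practice, configuration parameters such as these are usually collected
in a configuration file and passed to the tool using the standard \HTK\
\texttt{-C} option. For example, if the required \texttt{SOURCEKIND} and
\texttt{TARGETKIND} settings were stored in a file called \texttt{config}
(the name is arbitrary), the \htool{HERest} invocation shown earlier would
become
\begin{verbatim}
 HERest -C config ... s1 s2 s3 s4 ...
\end{verbatim}
and each speech data file would then be converted to the target kind as it
is read in.
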
The source and target parameter kinds default to \texttt{ANON}
to indicate that by default no input conversions are performed.
Note, however, that where two or more files are listed on the command line,
the meaning of \texttt{ANON} will not be re-interpreted from one file to the
next. Thus, it is a general rule that any tool reading multiple source speech
files requires that all the files have the same parameter kind.

The conversions applied by \HTK's input subsystem can be complex and may
not always behave exactly as expected. There are two facilities that can
be used to help check and debug the set-up of the speech i/o
configuration parameters.

Firstly, the tool \htool{HList} simply displays speech data by listing it
on the terminal. However, since \htool{HList} uses the speech input subsystem
like all \HTK\ tools, if a value for \texttt{TARGETKIND} is set, then it will
display the target form rather than the source form. This is the simplest way
to check the form of the speech data that will actually be delivered to a
\HTK\ tool. \htool{HList} is described in more detail in
section~\ref{s:UseHList} below.

Secondly, trace output can be generated from the \htool{HParm} module
by setting the \texttt{TRACE} configuration file parameter. This is a
bit-string in which individual bits cover different parts of the
conversion processing. The details are given in the reference section.
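
As a simple illustration of these two facilities, suppose the following
configuration file, here called \texttt{config}, were used (the file name
and the particular trace value are purely illustrative; the meaning of each
trace bit is given in the reference section):
\begin{verbatim}
 SOURCEKIND   = WAVEFORM
 TARGETKIND   = MFCC_E
 HPARM: TRACE = 2
\end{verbatim}
Then the command
\begin{verbatim}
 HList -C config s1
\end{verbatim}
would list the file \texttt{s1} on the terminal in its converted
\texttt{MFCC\_E} form, with \htool{HParm} printing trace output describing
the conversion as it proceeds.
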
To summarise, speech input in \HTK\ is controlled by configuration
parameters. The key parameters are \texttt{SOURCEKIND} and
\texttt{TARGETKIND} which specify the source and target parameter kinds.
These determine the end-points of the required input conversion.
However, to properly specify the detailed steps in between, more
configuration parameters must be defined. These are described in
subsequent sections.

\mysect{Speech Signal Processing}{sigproc}

In this section, the basic mechanisms involved in transforming a
speech waveform into a sequence of parameter vectors will be
described. Throughout this section, it is assumed that the
\texttt{SOURCEKIND} is \texttt{WAVEFORM} and that data is being read from
a HTK format file via \htool{HWave}. Reading from different format
files is described below in section~\ref{s:waveform}. Much of the
material in this section also applies to data read direct from an audio
device; the additional features needed to deal with this latter case are
described later in section~\ref{s:audioio}.
\vspace{0.2cm}

\index{speech input!blocking}
The overall process is illustrated in Fig.~\href{f:Blocking}
which shows the sampled waveform being converted into a
sequence of parameter blocks. In general, \HTK\ regards
both waveform files and parameter files as being just
sample sequences, the only difference being that in the former
case the samples are 2-byte integers and in the latter they
are multi-component vectors. The sample rate of the input
waveform will normally be determined from the input file
itself. However, it can be set explicitly using the
configuration parameter \texttt{SOURCERATE}. The period
between each parameter vector determines the output sample
rate and it is set using the configuration parameter
\texttt{TARGETRATE}. The segment of waveform used to determine
each parameter vector is usually referred to as a window
and its size is set by the
configuration parameter \texttt{WINDOWSIZE}. Notice that the
window size and frame rate are independent. Normally,
the window size will be larger than the frame period so that
successive windows overlap as illustrated in Fig.~\href{f:Blocking}.
\index{sourcerate@\texttt{SOURCERATE}}
\index{targetrate@\texttt{TARGETRATE}}
\index{windowsize@\texttt{WINDOWSIZE}}

For example, a waveform sampled at 16kHz would be converted into
100 parameter vectors per second using a 25 msec window by setting
the following configuration parameters.
\begin{verbatim}
 SOURCERATE = 625
 TARGETRATE = 100000
 WINDOWSIZE = 250000
\end{verbatim}
Remember that all durations are specified in 100 nsec
units\footnote{The somewhat bizarre choice of 100nsec units originated in
Version 1 of \HTK\ when times were represented by integers and this unit
was the best compromise between precision and range. Times are now
represented by doubles and hence the constraints no longer apply. However,
the need for backwards compatibility means that 100nsec units have been
retained. The names \texttt{SOURCERATE} and \texttt{TARGETRATE} are also
non-ideal; \texttt{SOURCEPERIOD} and \texttt{TARGETPERIOD} would be
better.}.
\sidefig{Blocking}{50}{Speech Encoding Process}{2}{}
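
Since each of these values is a count of 100 nsec units, they translate
into conventional units as follows:
\begin{eqnarray*}
  \mathtt{SOURCERATE} = 625    & \Rightarrow & 62.5\,\mu\mbox{sec sample period, i.e.\ a 16kHz sampling rate} \\
  \mathtt{TARGETRATE} = 100000 & \Rightarrow & 10\,\mbox{msec frame period, i.e.\ 100 parameter vectors per second} \\
  \mathtt{WINDOWSIZE} = 250000 & \Rightarrow & 25\,\mbox{msec analysis window}
\end{eqnarray*}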