% decode.tex
at the start and end of the network. The decoder then finds the best
matching path through the network and constructs a lattice which
includes model alignment information. Finally, the lattice is converted
to a transcription and output to the label file \texttt{file.rec}.

As for testing on a database, alignments will normally be computed on
a large number of input files so in practice the input files would be listed
in a \texttt{.scp} file and the output transcriptions would be written
to an MLF using the \texttt{-i} option.

When the \texttt{-m} option is used, the transcriptions output by
\htool{HVite} would by default contain both the model level and word level
transcriptions\index{transcriptions!word level}\index{transcriptions!model
level}\index{transcriptions!phone level}.
For example, a typical fragment of the output might be
\begin{verbatim}
    7500000  8700000 f  -1081.604736 FOUR 30.000000
    8700000  9800000 ao  -903.821350
    9800000 10400000 r   -665.931641
   10400000 10400000 sp    -0.103585
   10400000 11700000 s  -1266.470093 SEVEN 22.860001
   11700000 12500000 eh  -765.568237
   12500000 13000000 v   -476.323334
   13000000 14400000 n  -1285.369629
   14400000 14400000 sp    -0.103585
\end{verbatim}
Here the score alongside each model name is the acoustic score for that
segment. The score alongside the word is just the language model score.

Although the above information can be useful for some purposes, for example
in bootstrap training, only the model names are required.
The formatting option \texttt{-o SWT} in the above suppresses all output
except the model names.
\index{decoder!output formatting}

\mysect{Decoding and Adaptation}{dec_adapt}

Speaker adaptation techniques allow speaker independent model sets to be
adapted to better fit the characteristics of individual speakers using a
small amount of adaptation data. Chapter~\ref{c:Adapt} described how the
\htool{HEAdapt} tool can be used to perform offline supervised adaptation
(using the true transcription of the data).
This section describes how adapted model sets are used in the recognition
process and also how \htool{HVite} can be used to perform unsupervised
adaptation on a model set (when no transcription is available).

\mysubsect{Recognition with Adapted HMMs}{rec_adapt}
\index{decoder!using adapted HMMs}

As described in section~\ref{s:tmfs},
\htool{HEAdapt}\index{headapt@\htool{HEAdapt}} can produce either an
MMF containing the newly adapted model set or a TMF containing just
the adaptation transform.
If a transformed MMF has been constructed, then
\htool{HVite}\index{hvite@\htool{HVite}} can be used in the usual way.
If a TMF has been produced, however, this needs to be passed to
\htool{HVite} (using the \texttt{-J} option) along with the model set
from which the transform was estimated.
\htool{HVite} then transforms the model set using the TMF and recognises the
input speech using the transformed model set.
Thus, a common form of invocation would be
\begin{verbatim}
   HVite -S test.scp -H hmmset -J trans.tmf -i results \
         -w wdnet dict hmmlist
\end{verbatim}

\mysubsect{Unsupervised Adaptation}{unsup_adapt}
\index{adaptation!unsupervised adaptation}

Unsupervised adaptation occurs when no transcription of the adaptation data
exists and one must be generated. In this case \htool{HVite} can be used
to create a transcription of the adaptation data and use this to estimate
a transformation using MLLR. The transformation can then be saved to a TMF
using the \texttt{-K} option.
Unsupervised adaptation is signalled by the use of the \texttt{-j} option
and this also controls the mode of adaptation by specifying the number of
utterances to be processed before a transform is estimated. Thus, the
adaptation can be varied between static adaptation (performed only after
recognition of all utterances) and incremental adaptation.
As soon as a transform has been estimated during incremental adaptation, it
is used to adapt the model set to improve performance for any subsequent
utterances.
Note, however, that only the final transformation is saved.
To use \htool{HVite} for this purpose it is invoked with a command line of
the form
\begin{verbatim}
   HVite -S adapt.scp -H hmmset -K trans.tmf -j 10 -i results \
         -w wdnet dict hmmlist
\end{verbatim}
where \texttt{adapt.scp} contains a list of coded adaptation sentences,
adaptation is performed incrementally every 10 utterances and the final
transform is stored in \texttt{trans.tmf}.

\mysect{Recognition using Direct Audio Input}{recaudio}
\index{decoder!live input}

In all of the preceding discussion, it has been assumed that input was
from speech files stored on disk. These files would normally have
been stored in parameterised form so that little or no conversion
of the source speech data was required. When \htool{HVite}
is invoked with no files listed on the command line, it assumes that
input is to be taken directly from the audio input. In this case,
configuration variables must be used to specify firstly how the
speech waveform is to be captured and secondly, how the captured
waveform is to be converted to parameterised form.

Dealing with waveform capture\index{waveform capture} first, as described in
section~\ref{s:audioio}, \HTK\ provides two main forms of control over speech
capture: signals/keypress and an automatic speech/silence
detector\index{speech/silence detector}. To use the speech/silence detector
alone, the configuration file would contain the following
\begin{verbatim}
   # Waveform capture
   SOURCERATE=625.0
   SOURCEKIND=HAUDIO
   SOURCEFORMAT=HTK
   USESILDET=T
   MEASURESIL=F
   OUTSILWARN=T
   ENORMALISE=F
\end{verbatim}
where the source sampling rate is being set to 16kHz. Notice that the
\texttt{SOURCEKIND}\index{sourcekind@\texttt{SOURCEKIND}} must be set to
\texttt{HAUDIO} and the \texttt{SOURCEFORMAT} must be set to \texttt{HTK}.
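To make the rate setting concrete, recall that \HTK\ expresses sample
periods such as \texttt{SOURCERATE} in units of 100\,ns, so the value
625.0 above corresponds to a sampling frequency of
\[
  \frac{1}{625 \times 100\,\mathrm{ns}}
  = \frac{1}{62.5\,\mu\mathrm{s}}
  = 16\,\mathrm{kHz}.
\]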
Setting the Boolean variable
\texttt{USESILDET}\index{usesildet@\texttt{USESILDET}} causes the
speech/silence detector to be used, and the
\texttt{MEASURESIL}\index{measuresil@\texttt{MEASURESIL}} and
\texttt{OUTSILWARN}\index{outsilwarn@\texttt{OUTSILWARN}} variables
result in a measurement being taken of the background silence level
prior to capturing the first utterance. To make sure that each input utterance
is being captured properly, the \htool{HVite} option \texttt{-g} can be set to
cause the captured wave to be output after each recognition attempt. Note that
for a live audio input system, the configuration variable
\texttt{ENORMALISE} should be explicitly set to \texttt{FALSE} both when
training models and when performing recognition. Energy normalisation cannot
be used with live audio input, and the default setting for this variable
is \texttt{TRUE}.

As an alternative to using the speech/silence detector, a
signal\index{signals!for recording control} can be used to start and stop
recording. For example,
\begin{verbatim}
   # Waveform capture
   SOURCERATE=625.0
   SOURCEKIND=HAUDIO
   SOURCEFORMAT=HTK
   AUDIOSIG=2
\end{verbatim}
would result in the Unix interrupt signal (usually the Control-C key) being
used as a start and stop control\footnote{The underlying signal number must be
given; \HTK\ cannot interpret the standard Unix signal names such as
\texttt{SIGINT}.}. Key-press control of the audio input can be obtained by
setting \texttt{AUDIOSIG} to a negative number.

Both of the above can be used together. In this case, audio capture is
disabled until the specified signal is received. From then on, control is in
the hands of the speech/silence detector.

The captured waveform must be converted to the required target parameter kind.
Thus, the configuration file must define
all of the parameters needed to control the
conversion of the waveform to the required target kind.
This process is described in detail in Chapter~\ref{c:speechio}.
As an example, the following parameters would allow conversion
to Mel-frequency cepstral coefficients with delta and acceleration
parameters.
\begin{verbatim}
   # Waveform to MFCC parameters
   TARGETKIND=MFCC_0_D_A
   TARGETRATE=100000.0
   WINDOWSIZE=250000.0
   ZMEANSOURCE=T
   USEHAMMING=T
   PREEMCOEF=0.97
   USEPOWER=T
   NUMCHANS=26
   CEPLIFTER=22
   NUMCEPS=12
\end{verbatim}
Many of these variable settings are the default settings
and could be omitted; they are included explicitly here as a reminder
of the main configuration options available.

When \htool{HVite} is executed in direct audio input mode,
it issues a prompt prior to each input and it is normal to enable
basic tracing so that the recognition results can be seen.
A typical terminal output might be
\begin{verbatim}
   READY[1]>
   Please speak sentence - measuring levels
   Level measurement completed
   DIAL ONE FOUR SEVEN
    == [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3)
   READY[2]>
   CALL NINE TWO EIGHT
    == [233 frames] -97.0850 [Ac=-22402.5 LM=-218.4] (Act=21.8)
   etc
\end{verbatim}
If required, a transcription of each spoken input can be output to a label
file or an MLF in the usual way by setting the \texttt{-e} option. However,
to do this a file name must be synthesised.
This is done by using a counter prefixed by the value of the
\htool{HVite} configuration variable
\texttt{RECOUTPREFIX}\index{recoutprefix@\texttt{RECOUTPREFIX}}
and suffixed by the value of
\texttt{RECOUTSUFFIX}\index{recoutsuffix@\texttt{RECOUTSUFFIX}}.
For example, with the settings
\begin{verbatim}
   RECOUTPREFIX = sjy
   RECOUTSUFFIX = .rec
\end{verbatim}
the output transcriptions would be stored as \texttt{sjy0001.rec},
\texttt{sjy0002.rec} etc.

\mysect{N-Best Lists and Lattices}{nbest}
\index{decoder!N-best}

As noted in section~\ref{s:decop}, \htool{HVite} can generate
lattices\index{lattice generation} and N-best\index{N-best} outputs.
To generate an N-best list, the \texttt{-n} option
is used to specify the number of N-best tokens to store per state and
the number of N-best hypotheses to generate. The result is that
for each input utterance, a multiple alternative
transcription\index{multiple alternative transcriptions} is generated.
For example, setting \texttt{-n 4 20} with a digit recogniser would
generate an output of the form
\begin{verbatim}
   "testf1.rec"
   FOUR SEVEN NINE OH
   ///
   FOUR SEVEN NINE OH OH
   ///
   etc
\end{verbatim}
The lattices from which the N-best lists are generated can be output by
setting the option \texttt{-z ext}. In this case, a lattice called
\texttt{testf.ext} will be generated for each input test file
\texttt{testf.xxx}. By default, these lattices will be stored in the same
directory as the test files, but they can be redirected to another directory
using the \texttt{-l} option.
\index{output lattice format}

The lattices generated by \htool{HVite} have the following general form
\begin{verbatim}
   VERSION=1.0
   UTTERANCE=testf1.mfc
   lmname=wdnet
   lmscale=20.00  wdpenalty=-30.00
   vocab=dict
   N=31   L=56
   I=0    t=0.00
   I=1    t=0.36
   I=2    t=0.75
   I=3    t=0.81
   ... etc
   I=30   t=2.48
   J=0    S=0   E=1    W=SILENCE   v=0  a=-3239.01  l=0.00
   J=1    S=1   E=2    W=FOUR      v=0  a=-3820.77  l=0.00
   ...
   etc
   J=55   S=29  E=30   W=SILENCE   v=0  a=-246.99   l=-1.20
\end{verbatim}
The first 5 lines comprise a header which records the names of the files used
to generate the lattice along with the settings of the language model scale
and penalty factors. Each node in the lattice represents a point in time
measured in seconds and each arc represents a word spanning the segment of
the input starting at the time of its start node and ending at the time of
its end node. For each such span, \texttt{v} gives the number of the
pronunciation used, \texttt{a} gives the acoustic score and \texttt{l} gives
the language model score.

The language model scores in output lattices do not include the scale factors
and penalties. These are removed so that the lattice can be used as a
constraint network for subsequent recogniser testing. When using \htool{HVite}
normally, the word level network file is specified using the \texttt{-w}
option. When the \texttt{-w} option is included but no file name is given,
\htool{HVite} constructs the name of a lattice file from the name of the test
file and inputs that. Hence, a new recognition network is created for each
input file and recognition is very fast. For example, this is an efficient way
of experimentally determining optimum values for the language model
scale\index{lattice!language model scale factor} and penalty factors.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "htkbook"
%%% End:
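As an illustration of this lattice-constrained rescoring, a run that tries
out new scale and penalty values might be invoked as follows (the
\texttt{-s} and \texttt{-p} values here are purely illustrative, and the
lattices are assumed to have been generated by a previous run using
\texttt{-z})
\begin{verbatim}
   HVite -S test.scp -H hmmset -w -s 15.0 -p -10.0 \
         -i results dict hmmlist
\end{verbatim}
Since \texttt{-w} is given with no file name, \htool{HVite} loads the
lattice corresponding to each test file and recognition is constrained to
the paths it contains, making each such tuning run fast.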