\htool{HVite} can be made to compute forced alignments by not specifying
a network with the \texttt{-w} option but by specifying the \texttt{-a}
option instead. In this mode, \htool{HVite} computes a new network for
each input utterance using the word level transcriptions and a
dictionary. By default, the output transcription will just contain the
words and their boundaries. One of the main uses of forced
alignment\index{forced alignment}, however, is to determine the actual
pronunciations used in the utterances used to train the HMM system. In
this case, the \texttt{-m} option can be used to generate model level
output transcriptions. This type of forced alignment is usually part of
a \textit{bootstrap} process: initially, models are trained on the basis
of one fixed pronunciation per
\index{hled@\htool{HLEd}}\index{ex@\texttt{EX} command}word\footnote{The
\htool{HLEd} \texttt{EX} command can be used to compute phone level
transcriptions when there is only one possible phone transcription per
word}. Then \htool{HVite} is used in forced alignment mode to select the
best matching pronunciations. The new phone level transcriptions can
then be used to retrain the HMMs. Since training data may have leading
and trailing silence, it is usually necessary to insert a silence model
at the start and end of the recognition network. The \texttt{-b} option
can be used to do this.

As an illustration, executing
\begin{verbatim}
   HVite -a -b sil -m -o SWT -I words.mlf \
         -H hmmset dict hmmlist file.mfc
\end{verbatim}
would result in the following sequence of events (see
Fig.~\ref{f:hvalign}). The input file name \texttt{file.mfc} would have
its extension replaced by \texttt{lab} and then a label file of this
name would be searched for. In this case, the MLF file
\texttt{words.mlf} has been loaded. Assuming that this file contains a
word level transcription called \texttt{file.lab}, this transcription
along with the dictionary \texttt{dict} will be used to construct a
network equivalent to \texttt{file.lab} but with alternative
pronunciations included in parallel. Since the \texttt{-b} option has
been set, the specified \texttt{sil} model will be inserted at the start
and end of the network. The decoder then finds the best matching path
through the network and constructs a lattice which includes model
alignment information. Finally, the lattice is converted to a
transcription and output to the label file \texttt{file.rec}.

As for testing on a database, alignments will normally be computed on a
large number of input files, so in practice the input files would be
listed in a \texttt{.scp} file and the output transcriptions would be
written to an MLF using the \texttt{-i} option.
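For example, a batch alignment run over a training set might look like
the following sketch, in which the script file \texttt{train.scp} and
the output MLF \texttt{aligned.mlf} are illustrative names only
\begin{verbatim}
   HVite -a -b sil -m -o SWT -I words.mlf -S train.scp \
         -i aligned.mlf -H hmmset dict hmmlist
\end{verbatim}
Here the \texttt{-S} option supplies the list of input files and the
\texttt{-i} option collects all of the resulting phone level
transcriptions in a single MLF ready for re-training.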
When the \texttt{-m} option is used, the transcriptions output by
\htool{HVite} would by default contain both the model level and word
level transcriptions\index{transcriptions!word level}.
\index{transcriptions!model level}\index{transcriptions!phone level}
For example, a typical fragment of the output might be
\begin{verbatim}
    7500000  8700000 f   -1081.604736 FOUR 30.000000
    8700000  9800000 ao   -903.821350
    9800000 10400000 r    -665.931641
   10400000 10400000 sp     -0.103585
   10400000 11700000 s   -1266.470093 SEVEN 22.860001
   11700000 12500000 eh   -765.568237
   12500000 13000000 v    -476.323334
   13000000 14400000 n   -1285.369629
   14400000 14400000 sp     -0.103585
\end{verbatim}
Here the score alongside each model name is the acoustic score for that
segment. The score alongside the word is just the language model score.

Although the above information can be useful for some purposes, for
example in bootstrap training, only the model names are required. The
formatting option \texttt{-o SWT} in the above suppresses all output
except the model names.
\index{decoder!output formatting}

\mysect{Recognition using Direct Audio Input}{recaudio}

\index{decoder!live input}
In all of the preceding discussion, it has been assumed that input was
from speech files stored on disk. These files would normally have been
stored in parameterised form so that little or no conversion of the
source speech data was required. When \htool{HVite} is invoked with no
files listed on the command line, it assumes that input is to be taken
directly from the audio input. In this case, configuration variables
must be used to specify firstly how the speech waveform is to be
captured and secondly, how the captured waveform is to be converted to
parameterised form.

Dealing with waveform capture\index{waveform capture} first, as
described in section~\ref{s:audioio}, \HTK\ provides two main forms of
control over speech capture: signals/keypress and an automatic
speech/silence detector\index{speech/silence detector}. To use the
speech/silence detector alone, the configuration file would contain the
following
\begin{verbatim}
   # Waveform capture
   SOURCERATE=625.0
   SOURCEKIND=HAUDIO
   SOURCEFORMAT=HTK
   USESILDET=T
   MEASURESIL=F
   OUTSILWARN=T
   ENORMALISE=F
\end{verbatim}
where the source sampling rate is being set to 16kHz. Notice that the
\texttt{SOURCEKIND}\index{sourcekind@\texttt{SOURCEKIND}} must be set to
\texttt{HAUDIO} and the \texttt{SOURCEFORMAT} must be set to
\texttt{HTK}. Setting the Boolean variable
\texttt{USESILDET}\index{usesildet@\texttt{USESILDET}} causes the
speech/silence detector to be used, and the
\texttt{MEASURESIL}\index{measuresil@\texttt{MEASURESIL}} and
\texttt{OUTSILWARN}\index{outsilwarn@\texttt{OUTSILWARN}} variables
result in a measurement being taken of the background silence level
prior to capturing the first utterance. To make sure that each input
utterance is being captured properly, the \htool{HVite} option
\texttt{-g} can be set to cause the captured wave to be output after
each recognition attempt. Note that for a live audio input system, the
configuration variable \texttt{ENORMALISE} should be explicitly set to
\texttt{FALSE} both when training models and when performing
recognition. Energy normalisation cannot be used with live audio input,
and the default setting for this variable is \texttt{TRUE}.

As an alternative to using the speech/silence detector, a
signal\index{signals!for recording control} can be used to start and
stop recording. For example,
\begin{verbatim}
   # Waveform capture
   SOURCERATE=625.0
   SOURCEKIND=HAUDIO
   SOURCEFORMAT=HTK
   AUDIOSIG=2
\end{verbatim}
would result in the Unix interrupt signal (usually the Control-C key)
being used as a start and stop control\footnote{The underlying signal
number must be given; \HTK\ cannot interpret the standard Unix signal
names such as \texttt{SIGINT}}. Key-press control of the audio input can
be obtained by setting \texttt{AUDIOSIG} to a negative number. Both of
the above can be used together; in this case, audio capture is disabled
until the specified signal is received. From then on, control is in the
hands of the speech/silence detector.

The captured waveform must be converted to the required target parameter
kind. Thus, the configuration file must define all of the parameters
needed to control the conversion of the waveform to the required target
kind. This process is described in detail in Chapter~\ref{c:speechio}.
As an example, the following parameters would allow conversion to
Mel-frequency cepstral coefficients with delta and acceleration
parameters.
\begin{verbatim}
   # Waveform to MFCC parameters
   TARGETKIND=MFCC_0_D_A
   TARGETRATE=100000.0
   WINDOWSIZE=250000.0
   ZMEANSOURCE=T
   USEHAMMING = T
   PREEMCOEF = 0.97
   USEPOWER = T
   NUMCHANS = 26
   CEPLIFTER = 22
   NUMCEPS = 12
\end{verbatim}
Many of these variable settings are the default settings and could be
omitted; they are included explicitly here as a reminder of the main
configuration options available.
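Given capture and conversion entries like those above collected in a
single configuration file, say \texttt{config} (an illustrative name),
a live decoding session might be started by invoking \htool{HVite} with
no input files listed, for example
\begin{verbatim}
   HVite -C config -H hmmset -w wdnet -g dict hmmlist
\end{verbatim}
where, as noted above, the \texttt{-g} option causes each captured
waveform to be output after the corresponding recognition attempt.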
When \htool{HVite} is executed in direct audio input mode, it issues a
prompt prior to each input and it is normal to enable basic tracing so
that the recognition results can be seen. A typical terminal output
might be
\begin{verbatim}
   READY[1]>
   Please speak sentence - measuring levels
   Level measurement completed
   DIAL ONE FOUR SEVEN
    ==  [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4]  (Act=22.3)
   READY[2]>
   CALL NINE TWO EIGHT
    ==  [233 frames] -97.0850 [Ac=-22402.5 LM=-218.4]  (Act=21.8)
   etc
\end{verbatim}

If required, a transcription of each spoken input can be output to a
label file or an MLF in the usual way by setting the \texttt{-e} option.
However, to do this a file name must be synthesised. This is done by
using a counter prefixed by the value of the \htool{HVite} configuration
variable \texttt{RECOUTPREFIX}\index{recoutprefix@\texttt{RECOUTPREFIX}}
and suffixed by the value of
\texttt{RECOUTSUFFIX}\index{recoutsuffix@\texttt{RECOUTSUFFIX}}. For
example, with the settings
\begin{verbatim}
   RECOUTPREFIX = sjy
   RECOUTSUFFIX = .rec
\end{verbatim}
the output transcriptions would be stored as \texttt{sjy0001.rec},
\texttt{sjy0002.rec} etc.

\mysect{N-Best Lists and Lattices}{nbest}

\index{decoder!N-best}
As noted in section~\ref{s:decop}, \htool{HVite} can generate
lattices\index{lattice generation} and N-best\index{N-best} outputs. To
generate an N-best list, the \texttt{-n} option is used to specify the
number of N-best tokens to store per state and the number of N-best
hypotheses to generate. The result is that for each input utterance, a
multiple alternative
transcription\index{multiple alternative transcriptions} is generated.
For example, setting \texttt{-n 4 20} with a digit recogniser would
generate an output of the form
\begin{verbatim}
   "testf1.rec"
   FOUR SEVEN NINE OH
   ///
   FOUR SEVEN NINE OH OH
   ///
   etc
\end{verbatim}
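A complete N-best run over a test set might therefore look like the
following sketch, in which \texttt{test.scp} and \texttt{nbest.mlf} are
illustrative names
\begin{verbatim}
   HVite -n 4 20 -w wdnet -H hmmset -S test.scp \
         -i nbest.mlf dict hmmlist
\end{verbatim}
where each utterance listed in \texttt{test.scp} yields up to 20
alternative transcriptions in the output MLF.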
The lattices from which the N-best lists are generated can be output by
setting the option \texttt{-z ext}. In this case, a lattice called
\texttt{testf.ext} will be generated for each input test file
\texttt{testf.xxx}. By default, these lattices will be stored in the
same directory as the test files, but they can be redirected to another
directory using the \texttt{-l} option.

\index{output lattice format}
The lattices generated by \htool{HVite} have the following general form
\begin{verbatim}
   VERSION=1.0
   UTTERANCE=testf1.mfc
   lmname=wdnet
   lmscale=20.00  wdpenalty=-30.00
   vocab=dict
   N=31   L=56
   I=0    t=0.00
   I=1    t=0.36
   I=2    t=0.75
   I=3    t=0.81
   ... etc
   I=30   t=2.48
   J=0   S=0   E=1   W=SILENCE   v=0  a=-3239.01  l=0.00
   J=1   S=1   E=2   W=FOUR      v=0  a=-3820.77  l=0.00
   ... etc
   J=55  S=29  E=30  W=SILENCE   v=0  a=-246.99   l=-1.20
\end{verbatim}
The first 5 lines comprise a header which records the names of the files
used to generate the lattice along with the settings of the language
model scale and penalty factors. Each node in the lattice represents a
point in time measured in seconds and each arc represents a word
spanning the segment of the input starting at the time of its start node
and ending at the time of its end node. For each such span, \texttt{v}
gives the number of the pronunciation used, \texttt{a} gives the
acoustic score and \texttt{l} gives the language model score.

The language model scores in output lattices do not include the scale
factors and penalties. These are removed so that the lattice can be used
as a constraint network for subsequent recogniser testing. When using
\htool{HVite} normally, the word level network file is specified using
the \texttt{-w} option. When the \texttt{-w} option is included but no
file name is given, \htool{HVite} constructs the name of a lattice file
from the name of the test file and inputs that. Hence, a new recognition
network is created for each input file and recognition is very fast. For
example, this is an efficient way of experimentally determining optimum
values for the language model
scale\index{lattice!language model scale factor} and penalty factors.
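As an illustrative sketch of such a tuning run, assuming that lattices
have previously been generated alongside each test file (for example via
the \texttt{-z} option as above), \htool{HVite} might be re-run over the
stored lattices with new values
\begin{verbatim}
   HVite -w -s 15.0 -p -20.0 -H hmmset -S test.scp \
         -i rescored.mlf dict hmmlist
\end{verbatim}
where \texttt{-s} and \texttt{-p} set the language model scale factor
and word insertion penalty respectively, the particular values shown are
arbitrary, and \texttt{test.scp} and \texttt{rescored.mlf} are
illustrative names.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "htkbook"
%%% End: 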