\begin{verbatim}
 RO 100.0 stats
 TR 0
 QS "L_Class-Stop" {p-*,b-*,t-*,d-*,k-*,g-*}
 QS "R_Class-Stop" {*+p,*+b,*+t,*+d,*+k,*+g}
 QS "L_Nasal" {m-*,n-*,ng-*}
 QS "R_Nasal" {*+m,*+n,*+ng}
 QS "L_Glide" {y-*,w-*}
 QS "R_Glide" {*+y,*+w}
 ....
 QS "L_w" {w-*}
 QS "R_w" {*+w}
 QS "L_y" {y-*}
 QS "R_y" {*+y}
 QS "L_z" {z-*}
 QS "R_z" {*+z}
 TR 2
 TB 350.0 "aa_s2" {(aa, *-aa, *-aa+*, aa+*).state[2]}
 TB 350.0 "ae_s2" {(ae, *-ae, *-ae+*, ae+*).state[2]}
 TB 350.0 "ah_s2" {(ah, *-ah, *-ah+*, ah+*).state[2]}
 TB 350.0 "uh_s2" {(uh, *-uh, *-uh+*, uh+*).state[2]}
 ....
 TB 350.0 "y_s4" {(y, *-y, *-y+*, y+*).state[4]}
 TB 350.0 "z_s4" {(z, *-z, *-z+*, z+*).state[4]}
 TB 350.0 "zh_s4" {(zh, *-zh, *-zh+*, zh+*).state[4]}
 TR 1
 AU "fulllist"
 CO "tiedlist"
 ST "trees"
\end{verbatim}
Firstly, the \texttt{RO}\index{ro@\texttt{RO} command} command is used to set
the outlier threshold\index{outlier threshold} to 100.0 and load the
statistics file\index{statistics file} generated at the end of the previous
step. The outlier threshold determines the minimum
occupancy\index{minimum occupancy} of any cluster and prevents a single
outlier state forming a singleton cluster just because it is acoustically very
different to all the other states. The \texttt{TR}\index{tr@\texttt{TR} command}
command sets the trace level to zero in preparation for loading in the
questions. Each \texttt{QS}\index{qs@\texttt{QS} command} command loads a
single question and each question is defined by a set of contexts. For
example, the first \texttt{QS} command defines a question called
\texttt{L\_Class-Stop} which is true if the left context is any of the stops
\texttt{p}, \texttt{b}, \texttt{t}, \texttt{d}, \texttt{k} or \texttt{g}.
\sidefig{step10}{50}{Step 10}{-4}{}
Notice that for a triphone system, it is necessary to include questions
referring to both the right and left contexts of a phone. The questions should
progress from wide, general classifications (such as consonant, vowel, nasal,
diphthong, etc.) to specific instances of each phone. Ideally, the full set of
questions loaded using the \texttt{QS} command would include every possible
context which can influence the acoustic realisation of a phone, and can
include any linguistic or phonetic classification which may be relevant. There
is no harm in creating extra unnecessary questions, because those which are
determined to be irrelevant to the data will be ignored.

The second \texttt{TR} command enables intermediate level progress reporting
so that each of the following \texttt{TB} commands\index{tb@\texttt{TB} command}
can\index{tree building} be monitored. Each of these \texttt{TB} commands
clusters one specific set of states. For example, the first \texttt{TB}
command applies to the first emitting state of all context-dependent models
for the phone \texttt{aa}.

Each \texttt{TB} command works as follows. Firstly, each set of states defined
by the final argument is pooled to form a single cluster. Each question in the
question set loaded by the \texttt{QS} commands is then used to split the pool
into two sets. Using two sets rather than one allows the log likelihood of the
training data to be increased, and the question which maximises this increase
is selected for the first branch of the tree. The process is then repeated
until the increase in log likelihood achievable by any question at any node is
less than the threshold specified by the first argument (350.0 in this case).

Note that the values given in the \texttt{RO} and \texttt{TB} commands affect
the degree of tying and therefore the number of states output in the clustered
system. The values should be varied according to the amount of training data
available.
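To make the splitting criterion concrete, the following is a sketch of the
standard single-Gaussian approximation used in tree-based state clustering;
the exact formulation used by \htool{HHEd} is given in the reference section
for that tool. If the states pooled at a node $S$ are treated as a single
Gaussian, the log likelihood of the training data assigned to $S$ is
approximately
\[
  L(S) \;=\; -\frac{1}{2}\Bigl(\log\bigl[(2\pi)^n\,|\Sigma(S)|\bigr] + n\Bigr)
  \sum_{s \in S}\sum_{f}\gamma_s(o_f)
\]
where $n$ is the feature dimensionality, $\Sigma(S)$ is the pooled covariance
of the cluster and $\gamma_s(o_f)$ is the occupation probability of state $s$
for frame $o_f$. A question $q$ partitions $S$ into $S_y(q)$ and $S_n(q)$,
giving an increase
\[
  \Delta L_q \;=\; L(S_y(q)) + L(S_n(q)) - L(S)
\]
and the node is split using the question which maximises $\Delta L_q$,
provided this maximum exceeds the \texttt{TB} threshold (350.0 here).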
As a final step to the clustering, any pair of clusters which can be
merged\index{cluster merging} such that the decrease in log likelihood is
below the threshold is merged. On completion, the states in each cluster $i$
are tied to form a single shared state with macro name \texttt{xxx\_i} where
\texttt{xxx} is the name given by the second argument of the \texttt{TB}
command.

The set of triphones used so far only includes those needed to cover the
training data. The \texttt{AU} command takes as its argument a new list of
triphones expanded to include all those needed for recognition. This list can
be generated, for example, by using \htool{HDMan} on the entire dictionary
(not just the training dictionary), converting it to triphones using the
command \texttt{TC} and outputting a list of the distinct triphones to a file
using the option \texttt{-n}
\begin{verbatim}
 HDMan -b sp -n fulllist -g global.ded -l flog beep-tri beep
\end{verbatim}
\noindent
The \texttt{-b sp} option specifies that the \texttt{sp} phone is used as a
word boundary, and so it is excluded from triphones. The effect of the
\texttt{AU} command is to use the decision trees to synthesise all of the new
previously unseen triphones in the new list.\index{au@\texttt{AU} command}

Once all state-tying has been completed and new models synthesised, some
models may share exactly the same 3 states and transition matrices and are
thus identical. The \texttt{CO}
command\index{co@\texttt{CO} command}\index{model compaction} is used to
compact the model set by finding all identical models and tying them
together\footnote{Note that if the transition matrices had not been tied, the
\texttt{CO} command would be ineffective since all models would be different
by virtue of their unique transition matrices.}, producing a new list of
models called \texttt{tiedlist}.

One of the advantages of using decision tree clustering is that it allows
previously unseen\index{unseen triphones} triphones to be synthesised. To do
this, the trees must be saved, and this is done by the \texttt{ST}
command\index{st@\texttt{ST} command}. Later, if new previously unseen
triphones are required, for example in the pronunciation of a new vocabulary
item, the existing model set can be reloaded into \htool{HHEd}, the trees
reloaded using the \texttt{LT} command\index{lt@\texttt{LT} command} and then
a new extended list of triphones created using the \texttt{AU}
command.\index{au@\texttt{AU} command}

After \htool{HHEd} has completed, the effect of tying can be studied and the
thresholds adjusted if necessary. The log file will include summary statistics
which give the total number of physical states remaining and the number of
models after compacting.

Finally, and for the last time, the models are re-estimated twice using
\htool{HERest}. Fig.~\ref{f:step10} illustrates this last step in the HMM
build process. The trained models are then contained in the file
\texttt{hmm15/hmmdefs}.
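For reference, a sketch of how this whole step might be run follows. The file
names here are assumptions carried over from the preceding steps: the edit
script above is assumed to be stored in \texttt{tree.hed}, the untied triphone
models in \texttt{hmm12}, the list of training triphones in
\texttt{triphones1} and the triphone transcriptions in \texttt{wintri.mlf}.
\begin{verbatim}
 HHEd -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
      tree.hed triphones1

 HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 \
      -S train.scp -H hmm13/macros -H hmm13/hmmdefs \
      -M hmm14 tiedlist

 HERest -C config -I wintri.mlf -t 250.0 150.0 1000.0 \
      -S train.scp -H hmm14/macros -H hmm14/hmmdefs \
      -M hmm15 tiedlist
\end{verbatim}
\noindent
Note that the two \htool{HERest} passes are run over the new compacted list
\texttt{tiedlist} produced by the \texttt{CO} command.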
\mysect{Recogniser Evaluation}{egrectest}

The recogniser is now complete and its performance can be evaluated. The
recognition network and dictionary have already been constructed, and test
data has been recorded. Thus, all that is necessary is to run the recogniser
and then evaluate the results using the \HTK\ analysis tool
\htool{HResults}.\index{recogniser evaluation}

\subsection{Step 11 - Recognising the Test Data}

Assuming that \texttt{test.scp} holds a list of the coded test files, each
test file will be recognised and its transcription output to an MLF called
\texttt{recout.mlf} by executing the following
\begin{verbatim}
 HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp \
       -l '*' -i recout.mlf -w wdnet \
       -p 0.0 -s 5.0 dict tiedlist
\end{verbatim}
The options \texttt{-p} and \texttt{-s} set the \textit{word insertion
penalty}\index{word insertion penalty} and the \textit{grammar scale
factor},\index{grammar scale factor} respectively. The word insertion penalty
is a fixed value added to each token when it transits from the end of one word
to the start of the next. The grammar scale factor is the amount by which the
language model probability is scaled before being added to each token as it
transits from the end of one word to the start of the next. These parameters
can have a significant effect on recognition performance and hence some tuning
on development test data is well worthwhile.

The dictionary contains monophone transcriptions whereas the supplied HMM list
contains word-internal triphones. \htool{HVite}\index{hvite@\htool{HVite}}
will make the necessary conversions when loading the word network
\texttt{wdnet}. However, if the HMM list contained both monophones and
context-dependent phones then \htool{HVite} would become confused. The
required form of word-internal network\index{networks!word-internal} expansion
can be forced by setting the configuration variable
\texttt{FORCECXTEXP}\index{forcecxtexp@\texttt{FORCECXTEXP}} to true and
\texttt{ALLOWXWRDEXP}\index{allowxwrdexp@\texttt{ALLOWXWRDEXP}} to false (see
chapter~\ref{c:netdict} for details).
\index{accuracy figure}
Assuming that the MLF \texttt{testref.mlf} contains word level transcriptions
for each test file\footnote{The \htool{HLEd} tool may have to be used to
insert silences at the start and end of each transcription, or alternatively
\htool{HResults} can be used to ignore silences (or any other symbols) using
the \texttt{-e} option.}, the actual performance can be determined by running
\htool{HResults} as follows
\begin{verbatim}
 HResults -I testref.mlf tiedlist recout.mlf
\end{verbatim}
The result would be a print-out of the form
\begin{verbatim}
 ====================== HTK Results Analysis ==============
   Date: Sun Oct 22 16:14:45 1995
   Ref : testref.mlf
   Rec : recout.mlf
 ------------------------ Overall Results -----------------
 SENT: %Correct=98.50 [H=197, S=3, N=200]
 WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]
 ==========================================================
\end{verbatim}
The line starting with \texttt{SENT:} indicates that of the 200 test
utterances, 197 (98.50\%) were correctly recognised. The following line
starting with \texttt{WORD:} gives the word level statistics and indicates
that of the 855 words in total, 853 (99.77\%) were recognised correctly. There
was 1 deletion error (\texttt{D}), 1
substitution\index{recognition!results analysis} error (\texttt{S}) and 1
insertion error (\texttt{I}). The accuracy figure (\texttt{Acc}) of 99.65\% is
lower than the percentage correct (\texttt{Corr}) because it takes account of
the insertion errors which the latter ignores.
\centrefig{step11}{120}{Step 11}
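These figures follow directly from the definitions used by \htool{HResults}:
if $H$ is the number of correct labels, $D$ the number of deletions, $S$ the
number of substitutions, $I$ the number of insertions and $N = H + D + S$ the
total number of labels in the reference transcriptions, then
\[
  \%\mathrm{Corr} = \frac{H}{N} \times 100, \qquad
  \%\mathrm{Acc} = \frac{H - I}{N} \times 100
\]
so in the example above $\%\mathrm{Corr} = 853/855 \times 100 = 99.77$ and
$\%\mathrm{Acc} = (853-1)/855 \times 100 = 99.65$.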
\mysect{Running the Recogniser Live}{egreclive}

The recogniser can also be run with live
input\index{live input}.\index{recognition!direct audio input} To do this it
is only necessary to set the configuration variables needed to convert the
input audio to the correct form of parameterisation. Specifically, the
following needs to be appended to the configuration file \texttt{config} to
create a new configuration file \texttt{config2}
\begin{verbatim}
 # Waveform capture
 SOURCERATE=625.0
 SOURCEKIND=HAUDIO
 SOURCEFORMAT=HTK
 ENORMALISE=F
 USESILDET=T
 MEASURESIL=F
 OUTSILWARN=T
\end{verbatim}
These indicate that the source is direct audio with sample period
62.5$\mu$secs (\texttt{SOURCERATE} is specified in units of 100ns, so
$625.0 \times 100\mbox{ns} = 62.5\mu$secs, corresponding to a 16kHz sampling
rate). The silence detector is enabled and a measurement of the background
speech/silence levels should be made at start-up. The final line makes sure
that a warning is printed when this silence measurement is being made.

Once the configuration file has been set up for direct audio input,
\htool{HVite} can be run as in the previous step except that no files need be
given as arguments
\begin{verbatim}
 HVite -H hmm15/macros -H hmm15/hmmdefs -C config2 \
       -w wdnet -p 0.0 -s 5.0 dict tiedlist
\end{verbatim}
On start-up, \htool{HVite} will prompt the user to speak an arbitrary sentence
(approx. 4 secs) in order to measure the speech and background silence levels.
It will then repeatedly recognise and, if trace level bit 1 is set, it will
output each utterance to the terminal. A typical session is as
follows\index{recognition!output}
\begin{verbatim}
 Read 1648 physical / 4131 logical HMMs
 Read lattice with 26 nodes / 52 arcs
 Created network with 123 nodes / 151 links
 READY[1]>
 Please speak sentence - measuring levels
 Level measurement completed
 DIAL FOUR SIX FOUR TWO FOUR OH
  == [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)
 READY[2]>
 DIAL ZERO EIGHT SIX TWO
  == [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)
 READY[3]>
 etc
\end{verbatim}
During loading, information will be printed out regarding the different
recogniser components. The physical models are the distinct HMMs used by the
system, while the logical models include all model names. The number of
logical models is higher than the number of physical models because many
logically distinct models have been determined to be physically identical and
have been merged during the previous model building steps. The lattice
information refers to the number of links and nodes in the recognition syntax.
The network information refers to the actual recognition network built by
expanding the lattice using the current HMM set, dictionary and any context
expansion rules specified.

After each utterance, the numerical information gives the total number of
frames, the average log likelihood per frame, the total acoustic score, the
total language model score and the average number of models active.

Note that if it were required to recognise a new name, then the following two
changes would be needed
\begin{enumerate}
\item the grammar would be altered to include the new name
\item a pronunciation for the new name would be added to the dictionary
\end{enumerate}
If the new name required triphones which did not exist, then they could be
created by loading the existing triphone set into
\htool{HHEd}\index{hhed@\htool{HHEd}}, loading the decision trees using the
\texttt{LT} command\index{lt@\texttt{LT} command} and then using the
\texttt{AU} command\index{au@\texttt{AU} command} to generate a new complete
triphone set, as sketched below.\index{triphones!synthesising unseen}
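A minimal sketch of such an extension follows; the edit script and output
names used here (\texttt{mkfull.hed}, \texttt{fulllist2}, \texttt{tiedlist2},
\texttt{hmm16}) are hypothetical, while \texttt{trees} is the file saved
earlier by the \texttt{ST} command. The script \texttt{mkfull.hed} would
contain
\begin{verbatim}
 LT "trees"
 AU "fulllist2"
 CO "tiedlist2"
\end{verbatim}
\noindent
and would be applied with
\begin{verbatim}
 HHEd -H hmm15/macros -H hmm15/hmmdefs -M hmm16 \
      mkfull.hed tiedlist
\end{verbatim}
\noindent
where \texttt{fulllist2} is the triphone list extended to cover the
pronunciation of the new name.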
\mysect{Summary}{exsyssum}

This chapter has described the construction of a tied-state phone-based
continuous speech recogniser and in so doing, it has touched on most of the
main areas addressed by \HTK: recording, data preparation, HMM definitions,
training tools, adaptation tools, networks, decoding and evaluating. The rest
of this book discusses each of these topics in detail.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "htkbook"
%%% End: