%/* ----------------------------------------------------------- */
%/*                                                             */
%/*                          ___                                */
%/*                       |_| | |_/   SPEECH                    */
%/*                       | | | | \   RECOGNITION               */
%/*                       =========   SOFTWARE                  */
%/*                                                             */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/* developed at:                                               */
%/*                                                             */
%/*      Speech Vision and Robotics group                       */
%/*      Cambridge University Engineering Department            */
%/*      http://svr-www.eng.cam.ac.uk/                          */
%/*                                                             */
%/*      Entropic Cambridge Research Laboratory                 */
%/*      (now part of Microsoft)                                */
%/*                                                             */
%/* ----------------------------------------------------------- */
%/*         Copyright: Microsoft Corporation                    */
%/*          1995-2000 Redmond, Washington USA                  */
%/*                    http://www.microsoft.com                 */
%/*                                                             */
%/*          2001-2002 Cambridge University                     */
%/*                    Engineering Department                   */
%/*                                                             */
%/*   Use of this software is governed by a License Agreement   */
%/*    ** See the file License for the Conditions of Use  **    */
%/*    **     This banner notice must not be removed      **    */
%/*                                                             */
%/* ----------------------------------------------------------- */
%
% HTKBook - Steve Young 24/11/97
%

\mychap{Networks, Dictionaries and Language Models}{netdict}

\sidepic{Tool.netdict}{80}{
The preceding chapters have described how to process speech data and how to
train various types of HMM. This and the following chapter are concerned
with building a speech recogniser using \HTK. This chapter focuses on the
use of networks\index{networks} and dictionaries\index{dictionaries}. A
network describes the sequence of words that can be recognised and, for the
case of sub-word systems, a dictionary describes the sequence of HMMs that
constitute each word. A word level network will typically represent either
a \textit{Task Grammar}, which defines all of the legal word sequences
explicitly, or a \textit{Word Loop}, which simply puts all words of the
vocabulary in a loop and therefore allows any word to follow any other
word. Word-loop networks are often augmented by a stochastic language
model.
Networks can also be used to define phone recognisers and various types of
word-spotting systems.}

Networks are specified using the \HTK\ \textit{Standard Lattice Format}
(SLF), which is described in detail in Chapter~\ref{c:htkslf}. This is a
general-purpose text format used for representing multiple hypotheses in a
recogniser's output as well as word networks. Since SLF\index{SLF} format
is text-based, it can be written directly using any text editor. However,
this can be rather tedious, so \HTK\ provides two tools which allow the
application designer to use a higher-level representation. Firstly, the
tool \htool{HParse} allows networks to be generated from a source text
containing extended BNF format grammar rules. This format was the only
grammar definition language provided in earlier versions of \HTK, and hence
\htool{HParse} also provides backwards compatibility.
\index{standard lattice format}

\htool{HParse} task grammars are very easy to write, but they do not allow
fine control over the actual network used by the recogniser. The tool
\htool{HBuild} works directly at the SLF level to provide this detailed
control. Its main function is to enable a large word network to be
decomposed into a set of small self-contained sub-networks using as input
an extended SLF format. This simplifies the design process and avoids
unnecessary repetition. \htool{HBuild} can also be used to perform a number
of special-purpose functions. Firstly, it can construct word-loop and
word-pair grammars automatically. Secondly, it can incorporate a
statistical bigram language model into a network. Bigram language models
can be generated from label transcriptions using \htool{HLStats}. However,
since \HTK\ supports the standard ARPA MIT-LL text format for backed-off
N-gram language models, they can also be imported from other sources.
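As a foretaste of the \htool{HParse} notation described later in this
chapter, the following sketch shows the general style of an extended BNF
grammar source file. The word names here are illustrative only:
\begin{verbatim}
   $word = bit | but;
   ( start <$word> end )
\end{verbatim}
The variable \texttt{\$word} names a sub-expression, the bar separates
alternatives, and the angle brackets denote one or more repetitions of the
enclosed expression.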
Whichever tool is used to generate a word network, it is important to
ensure that the generated network represents the intended grammar. It is
also helpful to have some measure of the difficulty of the recognition
task. To assist with this, the tool \htool{HSGen} is provided. This tool
will generate example word sequences from an SLF network using random
sampling. It will also estimate the perplexity of the network.

When a word network is loaded into a recogniser, a dictionary is consulted
to convert each word in the network into a sequence of phone HMMs. The
dictionary can have multiple pronunciations, in which case several
sequences may be joined in parallel to make a word. Options exist in this
process to automatically convert the dictionary entries to
context-dependent triphone models, either within a word or cross-word.

Pronouncing dictionaries are a vital resource in building speech
recognition systems and, in practice, word pronunciations can be derived
from many different sources. The \HTK\ tool \htool{HDMan} enables a
dictionary to be constructed automatically from different sources. Each
source can be individually edited and translated, and the results merged to
form a uniform \HTK\ format dictionary.

The various facilities for describing a word network and expanding it into
an HMM-level network suitable for building a recogniser are implemented by
the \HTK\ library module \htool{HNet}. The facilities for loading and
manipulating dictionaries are implemented by the \HTK\ library module
\htool{HDict}, and those for loading and manipulating language models by
\htool{HLM}. These facilities, and those provided by \htool{HParse},
\htool{HBuild}, \htool{HSGen}, \htool{HLStats} and \htool{HDMan}, are the
subject of this chapter.

\mysect{How Networks are Used}{netuse}

Before delving into the details of word networks\index{networks!in
recognition} and dictionaries, it will be helpful to understand their
r\^{o}le in building a speech recogniser using \HTK.
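The random sampling performed by \htool{HSGen} can be pictured with a short
sketch. This is a toy illustration only, not \htool{HSGen}'s actual
implementation, and the network is hard-coded to match the ``bit''/``but''
example introduced later in this chapter:

```python
import random

# Toy word network: node 0 is the start node, node 1 the end node.
words = {0: "start", 1: "end", 2: "bit", 3: "but"}
succ = {0: [2, 3], 2: [1, 2, 3], 3: [1, 2, 3], 1: []}

def sample_sentence(rng):
    """Random walk from the network start node to the end node,
    collecting the words visited along the way."""
    node, path = 0, []
    while node != 1:
        node = rng.choice(succ[node])
        if node != 1:
            path.append(words[node])
    return " ".join(path)

rng = random.Random(0)
sentences = [sample_sentence(rng) for _ in range(3)]
```

Repeating this walk many times and averaging the per-word branching gives
the kind of perplexity estimate that \htool{HSGen} reports.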
Fig~\href{f:recsys} illustrates the overall recognition process. A word
network is defined using the \HTK\ Standard Lattice Format (SLF). An SLF
word network is just a text file, and it can be written directly with a
text editor or built using a tool. \HTK\ provides two such tools,
\htool{HBuild} and \htool{HParse}. These both take as input a textual
description and output an SLF file.
% Another way to generate SLF files is to use Entropic's
% \textit{grapHvite} package, which includes a graphical tool that allows
% the required networks to be constructed on the screen.
Whatever method is chosen, word network SLF generation is done
\textit{off-line} and is part of the system build process.

An SLF file contains a list of nodes representing words and a list of arcs
representing the transitions between words. The transitions can have
probabilities attached to them, and these can be used to indicate
\textit{preferences} in a grammar network. They can also be used to
represent bigram probabilities in a back-off bigram network, and
\htool{HBuild} can generate such a bigram network automatically.

In addition to an SLF file, a \HTK\ recogniser requires a dictionary to
supply pronunciations for each word in the network and a set of acoustic
HMM phone models. Dictionaries are input via the \HTK\ interface module
\htool{HDict}.

The dictionary, HMM set and word network are input to the \HTK\ library
module \htool{HNet}, whose function is to generate an equivalent network of
HMMs. Each word in the dictionary may have several pronunciations, in which
case there will be one branch in the network corresponding to each
alternative pronunciation. Each pronunciation may consist either of a list
of phones or a list of HMM names. In the former case, \htool{HNet} can
optionally expand the HMM network to use either word-internal triphones or
cross-word triphones. Once the HMM network has been constructed, it can be
input to the decoder module \htool{HRec} and used to recognise speech
input.
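The word-internal triphone expansion mentioned above can be sketched as
follows. This is a simplified illustration of the \HTK\ \texttt{l-ph+r}
triphone naming convention; \htool{HNet}'s actual behaviour, including its
cross-word expansion, is more involved:

```python
def word_internal_triphones(phones):
    """Expand a monophone pronunciation into word-internal triphones
    using the HTK l-ph+r naming convention. Phones at the word
    boundaries keep only the context that lies inside the word."""
    if len(phones) == 1:
        return phones[:]  # single-phone word: no internal context
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] + "-" if i > 0 else ""
        right = "+" + phones[i + 1] if i < len(phones) - 1 else ""
        out.append(left + p + right)
    return out

# The pronunciation b i t becomes: b+i  b-i+t  i-t
triphones = word_internal_triphones(["b", "i", "t"])
```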
Note that HMM network construction is performed \textit{on-line} at
recognition time as part of the initialisation process.

\centrefig{recsys}{100}{Overview of the Recognition Process}

For convenience, \HTK\ provides a recognition\index{recognition!overall
process} tool called \htool{HVite} to allow the functions provided by
\htool{HNet} and \htool{HRec} to be invoked from the command line.
\htool{HVite} is particularly useful for running experimental evaluations
on test speech stored in disk files and for basic testing using live audio
input. However, application developers should note that \htool{HVite} is
just a shell containing calls to load the word network, dictionary and
models; generate the recognition network; and then repeatedly recognise
each input utterance. For embedded applications, it may well be appropriate
to dispense with \htool{HVite} and call the functions in \htool{HNet} and
\htool{HRec} directly from the application. The use of \htool{HVite} is
explained in the next chapter.

\mysect{Word Networks and Standard Lattice Format}{slfintro}
\index{standard lattice format}

This section provides a basic introduction to the \HTK\ Standard Lattice
Format (SLF). SLF files are used for a variety of functions, some of which
lie beyond the scope of the standard \HTK\ package. The description here is
limited to those features of SLF which are required to describe word
networks suitable for input to \htool{HNet}. The following chapter
describes the further features of SLF used for representing the output of a
recogniser. For reference, a full description of SLF is given in
Chapter~\ref{c:htkslf}.\index{SLF!format}

A word network in SLF\index{SLF} consists of a list of nodes and a list of
arcs.
The nodes represent words and the arcs represent the transitions between
words\footnote{More precisely, nodes represent the ends of words and arcs
represent the transitions between word ends. This distinction becomes
important when describing recognition output, since acoustic scores are
attached to arcs, not nodes.}. Each node and arc definition is written on a
single line and consists of a number of fields. Each field specification
consists of a ``name=value'' pair. Field names can be any length, but all
commonly used field names consist of a single letter. By convention, field
names starting with a capital letter are mandatory whereas field names
starting with a lower-case letter are optional. Any line beginning with a
\texttt{\#} is a comment and is ignored.

\centrefig{wdnet}{80}{A Simple Word Network}

The following example should illustrate the basic format\index{SLF!word
network} of an SLF word network file. It corresponds to the network
illustrated in Fig~\href{f:wdnet}, which represents all sequences
consisting of the words ``bit'' and ``but'' starting with the word
``start'' and ending with the word ``end''.
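The line-oriented ``name=value'' layout is simple to process, as the
following minimal sketch shows. Quoted values and the other details of the
full format covered in Chapter~\ref{c:htkslf} are ignored here:

```python
def parse_slf_line(line):
    """Parse one SLF node/arc definition line into a dict mapping
    field names to string values. Returns None for comment lines
    and blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = {}
    for item in line.split():
        name, _, value = item.partition("=")
        fields[name] = value
    return fields

# parse_slf_line("I=2   W=bit")  ->  {"I": "2", "W": "bit"}
```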
As will be seen later, the start and end words will be mapped to a silence
model, so this grammar allows speakers to say ``bit but but bit bit ...
etc''.
\begin{verbatim}
   # Define size of network: N=num nodes and L=num arcs
   N=4   L=8
   # List nodes: I=node-number, W=word
   I=0   W=start
   I=1   W=end
   I=2   W=bit
   I=3   W=but
   # List arcs: J=arc-number, S=start-node, E=end-node
   J=0   S=0   E=2
   J=1   S=0   E=3
   J=2   S=3   E=1
   J=3   S=2   E=1
   J=4   S=2   E=3
   J=5   S=3   E=3
   J=6   S=3   E=2
   J=7   S=2   E=2
\end{verbatim}
Notice that the first line, which defines the size of the network, must be
given before any node or arc definitions. A node is a \textit{network start
node} if it has no predecessors, and a node is a \textit{network end node}
if it has no successors. There must be one and only one network start node
and one network end node. In the above, node 0 is a network start node and
node 1 is a network end node. The choice of the names ``start'' and ``end''
for these nodes has no significance.

\centrefig{wdnet1}{80}{A Word Network Using Null Nodes}

A word network can have null nodes, indicated by the special predefined
word name \texttt{!NULL}. Null nodes are useful for reducing the number of
arcs required. For example, the \textit{Bit-But} network could be defined
as follows\index{SLF!null nodes}
\begin{verbatim}
   # Network using null nodes
   N=6   L=7
   I=0   W=start
   I=1   W=end
   I=2   W=bit
   I=3   W=but
   I=4   W=!NULL
   I=5   W=!NULL
   J=0   S=0   E=4
   J=1   S=4   E=2
   J=2   S=4   E=3
   J=3   S=2   E=5
   J=4   S=3   E=5
   J=5   S=5   E=4
   J=6   S=5   E=1
\end{verbatim}
In this case, there is no significant saving. However, if there were many
words in parallel, the total number of arcs would be much reduced by using
null nodes to form common start and end points for the loop-back
connections.

By default, all arcs are equally likely. However, the optional field
\texttt{l=x} can be used to attach the log transition probability
\texttt{x} to an arc.
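The saving from null nodes is easy to quantify with a back-of-envelope
count of the loop-back connectivity alone (entry and exit arcs are ignored
in this sketch):

```python
def loop_arcs_direct(v):
    """Arcs needed to let any of v parallel words follow any other,
    connecting every word end directly to every word start."""
    return v * v

def loop_arcs_null(v):
    """Same connectivity via two !NULL nodes: v arcs into the words,
    v arcs out of them, plus one loop-back arc between the nulls."""
    return 2 * v + 1
```

For the two-word \textit{Bit-But} loop the counts are 4 versus 5, matching
the observation above that there is no saving, but for a 1000-word loop the
direct form needs 1,000,000 arcs against 2,001 with null nodes.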
For example, if the word ``but'' was twice as likely as ``bit'', the arcs
numbered 1 and 2 in the last example could be changed to
\begin{verbatim}
   J=1   S=4   E=2   l=-1.1
   J=2   S=4   E=3   l=-0.4
\end{verbatim}
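The \texttt{l=} values above are natural logarithms of the desired
transition probabilities. Assuming ``but'' is twice as likely as ``bit'',
the two outgoing probabilities are 2/3 and 1/3, and taking logs recovers
the values used in the arcs above:

```python
import math

# Transition probabilities out of the branching null node:
# "but" is assumed twice as likely as "bit".
p = {"bit": 1.0 / 3.0, "but": 2.0 / 3.0}

# The SLF "l=" field holds the natural-log transition probability.
l = {word: math.log(prob) for word, prob in p.items()}
# l["bit"] is about -1.1 and l["but"] about -0.4
```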