accurate as the diphone method. However, if the TTS uses the synthesis method for
generating output, it is very easy to modify a few parameters and then create a new
"voice." </p>
<p>Synthesis-based TTS engines require fewer computational resources and less storage
capacity. Synthesis-based systems are a bit more difficult to understand at first, but
they usually let users adjust the tone, speed, and inflection of the voice rather
easily. </p>
<h3><a NAME="TTSDiphoneConcatenation"><b>TTS Diphone Concatenation</b></a> </h3>
<p>The diphone concatenation method of generating speech uses pairs of phonemes (<i>di</i>
meaning two) to produce each sound. These diphones represent the start and end of each
individual speech part. For example, the word <i>pig</i> contains the diphones <i>silence-p,
p-i, i-g, </i>and <i>g-silence.</i> Diphone TTS systems scan the word and then piece
together the correct phoneme pairs to pronounce the word. </p>
<p>These phoneme pairs are produced not by computer synthesis, but from actual recordings
of human voices that have been broken down to their smallest elements and categorized into
the various diphone pairs. Since TTS systems that use diphones are using elements of
actual human speech, they can produce much more human-like output. However, since diphone
pairs are very language-specific, diphone TTS systems are usually dedicated to producing a
single language. Because of this, diphone systems do not do well in environments where
numerous foreign words may be present, or where the TTS might be required to produce
output in more than one language. </p>
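<p>The decomposition described above is mechanical: pad the word's phoneme sequence with silence on both ends, then take each adjacent pair. A minimal sketch in Python (the phoneme spellings follow the <i>pig</i> example; a real TTS engine would first look the word up in a pronunciation dictionary):</p>

```python
def word_to_diphones(phonemes):
    """Split a word's phoneme sequence into diphone pairs,
    padding the sequence with silence at the start and end."""
    padded = ["silence"] + list(phonemes) + ["silence"]
    # Each diphone spans the end of one phoneme and the start of the next.
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

# The word "pig" is the phoneme sequence p, i, g.
print(word_to_diphones(["p", "i", "g"]))
# → ['silence-p', 'p-i', 'i-g', 'g-silence']
```

<p>A concatenative engine would then fetch the recorded waveform for each of these pairs and splice them together.</p>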
<h2><a NAME="GrammarRules"><b><font SIZE="5" COLOR="#FF0000">Grammar Rules</font></b></a></h2>
<p>The final elements of a speech engine are the grammar rules. Grammar rules are used by
speech recognition (SR) software to analyze human speech input and, in the process,
attempt to understand what a person is saying. Most of us suffered through a series of
lessons in grade school where our teachers attempted to show us just how grammar rules
affect our everyday speech patterns. And most of us probably don't remember a great deal
from those lessons, but we all use grammar rules every day without thinking about them, to
express ourselves and make sense of what others say to us. Without an understanding of and
appreciation for the importance of grammars, computer speech recognition systems would not
be possible. </p>
<p>There can be any number of grammars, each composed of a set of rules of speech. Just as
humans must learn to share a common grammar in order to be understood, computers must also
share a common grammar with the speaker in order to convert audio information into text. </p>
<p>Grammars can be divided into three types, each with its own strengths and weaknesses.
The types are:
<ul>
<li><font COLOR="#000000">Context-free grammars</font> </li>
<li><font COLOR="#000000">Dictation grammars</font> </li>
<li><font COLOR="#000000">Limited domain grammars</font> </li>
</ul>
<p>Context-free grammars offer the greatest degree of flexibility when interpreting human
speech. Dictation grammars offer the greatest degree of accuracy when converting spoken
words into printed text. Limited domain grammars offer a compromise between the highly
flexible context-free grammar and the restrictive dictation grammar. </p>
<p>The following sections discuss each grammar type in more detail. </p>
<h3><a NAME="ContextFreeGrammars"><b>Context-Free Grammars</b></a> </h3>
<p>Context-free grammars work on the principle of following established rules to determine
the most likely candidates for the next word in a sentence. Context-free grammars <i>do
not</i> work on the idea that each word should be understood within a context. Rather,
they evaluate the relationship of each word and word phrase to a known set of rules about
what words are possible at any given moment. </p>
<p>The main elements of a context-free grammar are:
<ul>
<li><i>Words</i>-A list of valid words to be spoken </li>
<li><i>Rules</i>-A set of speech structures in which words are used </li>
<li><i>Lists</i>-One or more word sets to be used within rules </li>
</ul>
<p>Context-free grammars are good for systems that have to deal with a wide variety of
input. Context-free systems are also able to handle variable vocabularies. This is because
most of the rule-building done for context-free grammars revolves around declaring lists
and groups of words that fit into common patterns or rules. Once the SR engine understands
the rules, it is very easy to expand the vocabulary by expanding the lists of possible
members of a group. </p>
<p>For example, rules in a context-free grammar might look something like this: </p>
<blockquote>
<tt><font FACE="Courier"><p>&lt;NameRule&gt;=ALT("Mike","Curt","Sharon","Angelique")
<br>
<br>
&lt;SendMailRule&gt;=("Send Email to", &lt;NameRule&gt;) <br>
</font></tt>In the example above, two rules have been established. The first rule, <tt><font
FACE="Courier">&lt;NameRule&gt;</font></tt>, creates a list of possible names. The second
rule, <tt><font FACE="Courier">&lt;SendMailRule&gt;</font></tt>, depends on <tt><font
FACE="Courier">&lt;NameRule&gt;</font></tt>. In this way,
context-free grammars allow you to build your own grammatical rules as a predictor of how
humans will interact with the system. </p>
</blockquote>
<p>Even more importantly, context-free grammars allow for easy expansion at run-time.
Since much of the way context-free grammars operate focuses on lists, it is easy to allow
users to add list members and, therefore, to improve the value of the SR system quickly.
This makes it easy to install a system with only basic components. The basic system can be
expanded to meet the needs of various users. In this way, context-free grammars offer a
high degree of flexibility with very little development cost or complication. </p>
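<p>The run-time expansion described above can be sketched in a few lines. The rule and list names below mirror the NameRule/SendMailRule example earlier in this section; the matching logic is a simplified illustration, not an actual SR engine API:</p>

```python
# The ALT(...) list from NameRule, held as an ordinary Python list.
names = ["Mike", "Curt", "Sharon", "Angelique"]

def matches_send_mail(phrase):
    """True if the phrase fits the rule: "Send Email to" + a name from the list."""
    prefix = "send email to "
    if not phrase.lower().startswith(prefix):
        return False
    # The remainder of the phrase must be a member of the name list.
    return phrase[len(prefix):] in names

print(matches_send_mail("Send Email to Sharon"))  # → True
names.append("Tomas")  # expand the vocabulary at run-time
print(matches_send_mail("Send Email to Tomas"))   # → True
```

<p>Appending to the list immediately widens what the rule accepts; no rule rewriting is required, which is the property that makes context-free grammars cheap to extend in the field.</p>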
<p>The construction of quality context-free grammars can be a challenge, however. Systems
that only need to do a few things (such as load and run programs, execute simple
directives, and so on) are easily expressed using context-free grammars. However, in order
to perform more complex tasks or a wider range of chores, additional rules are needed. As
the number of rules and the length of lists increases, the computational load rises
dramatically. Also, since context-free grammars base their predictions on predefined
rules, they are not good for tasks like dictation, where a large vocabulary is most
important. </p>
<h3><a NAME="DictationGrammars"><b>Dictation Grammars</b></a> </h3>
<p>Unlike context-free grammars that operate using rules, dictation grammars base their
evaluations on vocabulary. The primary function of a dictation grammar is to convert human
speech into text as accurately as possible. In order to do this, dictation grammars need
not only a rich vocabulary to work from, but also a sample output to use as a model when
analyzing speech input. Rules of speech are not important to a system that must simply
convert human input into printed text. </p>
<p>The elements of a dictation grammar are:
<ul>
<li><i>Topic</i>-Identifies the dictation topic (for example, medical or legal). </li>
<li><i>Common</i>-A set of words commonly used in the dictation. Usually the common group
contains technical or specialized words that are expected to appear during dictation, but
are not usually found in regular conversation. </li>
<li><i>Group</i>-A related set of words that can be expected, but that are not directly
related to the dictation topic. The group usually has a set of words that are expected to
occur frequently during dictation. The grammar model can contain more than one group. </li>
<li><i>Sample</i>-A sample of text that shows the writing style of the speaker or general
format of the dictation. This text is used to aid the SR engine in analyzing speech input.
</li>
</ul>
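<p>The four elements listed above can be pictured as a simple record. The field names below are illustrative only (they follow this chapter's terminology, not any real SR engine's API):</p>

```python
from dataclasses import dataclass, field

@dataclass
class DictationGrammar:
    """Hypothetical container for the elements of a dictation grammar."""
    topic: str                                   # e.g. "medical" or "legal"
    common: list                                 # specialized words expected in dictation
    groups: list = field(default_factory=list)   # related word sets, one list per group
    sample: str = ""                             # sample text modeling the writing style

grammar = DictationGrammar(
    topic="legal",
    common=["plaintiff", "defendant", "tort"],
    groups=[["motion", "brief", "filing"]],
    sample="The plaintiff filed a motion to dismiss the complaint.",
)
print(grammar.topic)  # → legal
```

<p>An engine would weight words from <tt><font FACE="Courier">common</font></tt> and <tt><font FACE="Courier">groups</font></tt> more heavily during recognition, and use <tt><font FACE="Courier">sample</font></tt> to model word order.</p>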
<p>The success of a dictation grammar depends on the quality of the vocabulary. The more
items on the list, the greater the chance of the SR engine mistaking one item for another.
However, the more limited the vocabulary, the greater the number of "unknown"
words that will occur during the course of the dictation. The most successful dictation
systems balance vocabulary depth and the uniqueness of the words in the database. For this
reason, dictation systems are usually tuned for one topic, such as legal or medical
dictation. By limiting the vocabulary to the words most likely to occur in the course of
dictation, translation accuracy is increased. </p>
<h3><a NAME="LimitedDomainGrammars"><b>Limited Domain Grammars</b></a> </h3>
<p>Limited domain grammars offer a compromise between the flexibility of context-free
grammars and the accuracy of dictation grammars. Limited domain grammars have the
following elements:
<ul>
<li><i>Words</i>-This is the list of specialized words that are likely to occur during a
session. </li>
<li><i>Group</i>-This is a set of related words that could occur during the session. The
grammar can contain multiple word groups. A single phrase would be expected to include one
of the words in the group. </li>
<li><i>Sample</i>-A sample of text that shows the writing style of the speaker or general
format of the dictation. This text is used to aid the SR engine in analyzing the speech
input. </li>
</ul>
<p>Limited domain grammars are useful in situations where the vocabulary of the system
need not be very large. Examples include systems that use natural language to accept
command statements, such as "How can I set the margins?" or "Replace all
instances of 'New York' with 'Los Angeles.'" Limited domain grammars also work well
for filling in forms or for simple text entry. </p>
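<p>One way to picture how a limited domain grammar narrows the search: score a candidate phrase by how many of its words appear in the grammar's word list and groups. This is a rough sketch under assumed data (the word sets below are invented for the example); a real SR engine's scoring is far more sophisticated:</p>

```python
# Specialized words likely to occur during the session.
WORDS = {"margins", "replace", "set"}
# One group of related words; a phrase is expected to include one of them.
GROUPS = [{"how", "what", "where"}]

def plausibility(phrase):
    """Count how many grammar elements the phrase touches."""
    tokens = set(phrase.lower().strip("?.").split())
    score = len(tokens & WORDS)                   # hits in the word list
    score += sum(1 for g in GROUPS if tokens & g) # one point per group matched
    return score

print(plausibility("How can I set the margins?"))  # → 3
print(plausibility("Play some music"))             # → 0
```

<p>A phrase that scores zero falls outside the domain, so the engine can reject it cheaply instead of attempting a full-vocabulary match.</p>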
<h2><a NAME="Summary"><b><font SIZE="5" COLOR="#FF0000">Summary</font></b></a> </h2>
<p>In this chapter you learned about the key factors behind creating and implementing a
complete speech system for PCs. You learned about the three major parts of speech systems:
<ul>
<li><i>Speech recognition</i>-Converts audio input into printed text or directly into
computer commands. </li>
<li><i>Text-to-speech</i>-Converts printed text into audible speech. </li>
<li><i>Grammar rules</i>-Used by speech recognition systems to analyze audio input and
convert it into commands or text. </li>
</ul>
<p>In the next chapter, you'll learn the specifics behind the Microsoft speech recognition
engine. </p>
<hr WIDTH="100%">
<p align="center"><a HREF="ch13.htm"><img SRC="pc.gif" BORDER="0" HEIGHT="88" WIDTH="140"></a><a
HREF="#CONTENTS"><img SRC="cc.gif" BORDER="0" HEIGHT="88" WIDTH="140"></a><a
HREF="index.htm"><img SRC="hb.gif" BORDER="0" HEIGHT="88" WIDTH="140"></a> <a
HREF="ch15.htm"><img SRC="nc.gif" BORDER="0" HEIGHT="88" WIDTH="140"></a></p>
<hr WIDTH="100%">
</body>
</html>