independence. This is especially true for systems that will have short encounters with
many different people, such as greeting kiosks at an airport. In such situations, training
is unlikely to occur, and a high degree of accuracy is expected right away. </p>
<p>For systems where multiple people will access the same workstation over a longer period
of time, the speaker-adaptive system will work fine. A good example would be a workstation
used by several employees to query information from a database. The initial investment
spent training the speech system will pay off over time as the same staff uses the system.
</p>
<h3><a NAME="WordMatching"><b>Word Matching</b></a></h3>
<p>Word matching is the process of performing look-ups into the speech database. As each
word is gathered (using the word separation techniques described earlier), it must be
matched against some item in the speech engine's database. It is the process of word
matching that connects the audio input signal to a meaningful item in the speech engine
database. </p>
<p>There are two primary methods of word matching:
<ul>
<li><font COLOR="#000000">Whole-word matching</font> </li>
<li><font COLOR="#000000">Phoneme matching</font> </li>
</ul>
<p>Under <i>whole-word matching</i>, the speech engine searches the database for a word
that matches the audio input. Whole-word matching requires less search processing than
phoneme matching, but it requires a greater amount of storage capacity. Under the
whole-word matching model, the system must store a word template that represents each
possible word that the engine can recognize. While quick retrieval makes whole-word
matching attractive, the fact that all words must be known ahead of time limits the
application of whole-word matching systems. </p>
<p><i>Phoneme matching systems</i> keep a dictionary of language phonemes. <i>Phonemes</i>
are the smallest distinct units of sound in a language, and there can be many of them. For
example, while the English language has only 26 letters, it has roughly 40 phonemes (the
exact count depends on the dialect). Also, phonemes are not restricted by spelling
conventions. </p>
<p>Consider the words <i>Philip</i> and <i>fill up</i>. These words have the same
phonemes: <i>f</i>, <i>eh</i>, <i>ul</i>, <i>ah</i>, and <i>pah</i>. However, they have
entirely different meanings. Under the whole-word matching model, these words could
represent multiple entries in the database. Under the phoneme matching model, the same
five phonemes can be used to represent both words. </p>
<p>As you may expect, phoneme matching systems require more computational resources, but
less storage space. </p>
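<p>The storage trade-off is easy to see in code. The following sketch is purely
illustrative; the data structures are assumptions made for this example, not part of any
speech engine's actual format. Whole-word matching must store one acoustic template for
every recognizable word, while phoneme matching stores only phoneme spellings that can be
shared across words. </p>
<pre>
// Illustrative sketch only; these structures are hypothetical and do not
// correspond to any real speech engine's internal format.
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Whole-word matching: one stored acoustic template per recognizable word.
// Storage grows with every word added to the engine.
std::map&lt;std::string, std::vector&lt;double&gt;&gt; wordTemplates = {
    { "philip", { /* acoustic samples for the whole word */ } },
    { "fill",   { /* acoustic samples */ } },
    { "up",     { /* acoustic samples */ } }
};

// Phoneme matching: each word is stored only as a sequence of phoneme
// symbols (the labels below follow the example in the text).
std::map&lt;std::string, std::vector&lt;std::string&gt;&gt; phonemeLexicon = {
    { "philip",  { "f", "eh", "ul", "ah", "pah" } },
    { "fill up", { "f", "eh", "ul", "ah", "pah" } }
};

int main() {
    // A whole-word engine compares incoming audio against every stored
    // template; a phoneme engine first reduces the audio to phoneme symbols
    // and then searches the much smaller lexicon: more computation, less storage.
    return 0;
}
</pre>
<p>Note that the two lexicon entries share the same five symbols; only the surrounding
context decides which word was actually spoken. </p>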
<h3><a NAME="Vocabulary"><b>Vocabulary</b></a></h3>
<p>The final element of a speech recognition system is the vocabulary. There are two
competing issues regarding vocabulary: size and accuracy. As the vocabulary size
increases, it becomes easier for the speech system to locate a word that matches the one
identified in the word separation phase. However, one
of the reasons it is easy to find a match is that more than one entry in the vocabulary
may match the given input. For example, the words <i>no</i> and <i>go</i> are very similar
to most speech engines. Therefore, as vocabulary size grows, the accuracy of speech
recognition can decrease. </p>
<p>Contrary to what you might assume, a speech engine's vocabulary does not represent the
total number of words it understands. Instead, the vocabulary of a speech engine
represents the number of words that it can recognize in a current state or moment in time.
In effect, this is the total number of "unidentified" words that the system can
resolve at any moment. </p>
<p>For example, let's assume you have registered the following word phrases with your
speech engine: "Start running Exchange" and "Start running Word."
Before you say anything, the current state of the speech engine has four words: <i>start</i>,
<i>running</i>, <i>Exchange</i>, and <i>Word</i>. Once you say "Start running"
there are only two words in the current state: <i>Exchange</i> and <i>Word</i>. The
system's ability to keep track of the possible next word is determined by the size of its
vocabulary. </p>
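<p>A small sketch makes this state-tracking idea concrete. The phrase table and helper
function below are hypothetical (no speech engine exposes its state exactly this way);
they simply show how the set of words the engine must be ready to resolve shrinks as a
phrase is spoken. </p>
<pre>
// Hypothetical sketch of the "current state" vocabulary for two registered
// command phrases; not an actual speech engine interface.
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;iostream&gt;
#include &lt;set&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

std::vector&lt;std::vector&lt;std::string&gt;&gt; phrases = {
    { "start", "running", "exchange" },
    { "start", "running", "word" }
};

// Return every word that still has to be resolved, given the words heard so far.
std::set&lt;std::string&gt; currentState(const std::vector&lt;std::string&gt;&amp; heard) {
    std::set&lt;std::string&gt; remaining;
    for (const auto&amp; phrase : phrases) {
        if (heard.size() &lt;= phrase.size() &amp;&amp;
            std::equal(heard.begin(), heard.end(), phrase.begin())) {
            for (std::size_t i = heard.size(); i &lt; phrase.size(); ++i)
                remaining.insert(phrase[i]);
        }
    }
    return remaining;
}

int main() {
    std::cout &lt;&lt; currentState({}).size() &lt;&lt; "\n";                      // 4: start, running, exchange, word
    std::cout &lt;&lt; currentState({ "start", "running" }).size() &lt;&lt; "\n";  // 2: exchange, word
    return 0;
}
</pre>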
<p>Small vocabulary systems (100 words or less) work well in situations where most of the
speech recognition is devoted to processing commands. However, you need a large vocabulary
to handle dictation systems. Dictation vocabularies can reach into tens of thousands of
words. This is one of the reasons that dictation systems are so difficult to implement.
Not only does the vocabulary need to be large, but word matches must also be resolved very
quickly.
</p>
<h3><a NAME="TexttoSpeech"><b>Text-to-Speech</b></a></h3>
<p>A second type of speech service provides the ability to convert written text into
spoken words. This is called <i>text-to-speech</i> (or <i>TTS</i>) technology. Just as
there are a number of factors to consider when developing speech recognition (SR) engines,
there are a few issues that must be addressed when creating and implementing rules for TTS
engines. </p>
<p>The four common issues that must be addressed when creating a TTS engine are as
follows:
<ul>
<li><font COLOR="#000000">Phonemes</font> </li>
<li><font COLOR="#000000">Voice quality</font> </li>
<li><font COLOR="#000000">TTS synthesis</font> </li>
<li><font COLOR="#000000">TTS diphone concatenation</font> </li>
</ul>
<p>The first two factors deal with the creation of audio tones that are recognizable as
human speech. The last two items are competing methods for interpreting text that is to be
converted into audio. </p>
<h3><a NAME="VoiceQuality"><b>Voice Quality</b></a></h3>
<p>The quality of a computerized voice is directly related to the sophistication of the
rules that identify and convert text into an audio signal. It is not too difficult to
build a TTS engine that can create recognizable speech. However, it is extremely difficult
to create a TTS engine that does not sound like a computer. Three factors in human speech
are very difficult to produce with computers:
<ul>
<li><font COLOR="#000000">Prosody</font> </li>
<li><font COLOR="#000000">Emotion</font> </li>
<li><font COLOR="#000000">Pronunciation anomalies</font> </li>
</ul>
<p>Human speech has a special rhythm, or <i>prosody</i>: a pattern of pauses, inflections,
and emphasis that is an integral part of the language. While computers can do a good job
of pronouncing individual words, it is difficult to get them to accurately mimic the tonal
and rhythmic inflections of human speech. For this reason, it is always quite easy to
differentiate computer-generated speech from a computer playing back a recording of a
human voice. </p>
<p>Another factor of human speech that computers have difficulty rendering is emotion.
While TTS engines are capable of distinguishing declarative statements from questions or
exclamations, computers are still not able to convey believable emotive qualities when
rendering text into speech. </p>
<p>Lastly, every language has its own pronunciation anomalies. These are words that do not
"play by the rules" when it comes to converting text into speech. Some common
examples in English are <i>dough</i> and <i>tough</i> or <i>comb</i> and <i>home</i>. More
troublesome are words such as <i>read</i>, which must be understood in context in order to
determine their exact pronunciation. For example, the pronunciation differs between
"He <i>read</i> the paper" and "She will now <i>read</i> to the class."
Even more likely to cause problems is the interjection of technobabble such as
"SQL," "MAPI," and "SAPI." All these factors make the
development of a truly human-sounding computer-generated voice extremely difficult. </p>
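<p>One way to picture the homograph problem is as a lookup keyed on context rather than on
spelling alone. The sketch below is a deliberately simplified assumption (real engines
apply much fuller grammatical analysis, and the phoneme labels are informal): the word <i>read</i>
receives one of two phoneme spellings depending on a single context cue. </p>
<pre>
// Simplified, hypothetical illustration of context-dependent pronunciation.
// Real TTS engines analyze the sentence grammatically rather than checking a
// single preceding word, and the phoneme labels here are informal.
#include &lt;string&gt;
#include &lt;vector&gt;

std::vector&lt;std::string&gt; pronounceRead(const std::string&amp; previousWord) {
    // "She will now read to the class."  : present tense, rhymes with "reed"
    if (previousWord == "will" || previousWord == "now" || previousWord == "to")
        return { "r", "iy", "d" };
    // "He read the paper."               : past tense, rhymes with "red"
    return { "r", "eh", "d" };
}

int main() {
    std::vector&lt;std::string&gt; present = pronounceRead("now");  // r iy d
    std::vector&lt;std::string&gt; past    = pronounceRead("He");   // r eh d
    (void)present; (void)past;
    return 0;
}
</pre>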
<p>Speech systems usually offer some way to correct for these types of problems. One
typical solution is to include the ability to enter the phonetic spelling of a word and
relate that spelling to the text version. Another common adjustment is to allow users to
enter control tags in the text to instruct the speech engine to add emphasis or
inflection, or alter the speed or pitch of the audio output. Much of this type of
adjustment information is based on phonemes, as described in the next section. </p>
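<p>A sketch of both corrections might look like the following. The exception dictionary,
the respellings, and the {emph}/{rate} tag notation are all invented for illustration; the
actual phonetic-spelling format and control tags are defined by the particular speech
engine you are using. </p>
<pre>
// Hypothetical illustration of two common correction mechanisms: a user
// exception dictionary and inline control tags.  The respellings and the
// {..} tag notation are invented; real engines define their own formats.
#include &lt;map&gt;
#include &lt;string&gt;

// Problem words mapped to rough phonetic respellings (illustrative only).
std::map&lt;std::string, std::string&gt; userDictionary = {
    { "SQL",  "see-kwel" },   // one common spoken form
    { "MAPI", "map-ee"    },
    { "SAPI", "sap-ee"    }
};

int main() {
    // Inline control tags asking the engine to add emphasis and slow down;
    // again, this tag syntax is purely illustrative.
    std::string taggedText =
        "Please {emph}do not{/emph} turn off the computer {rate -20}now.";
    (void)taggedText;
    return 0;
}
</pre>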
<h3><a NAME="Phonemes"><b>Phonemes</b></a></h3>
<p>As we've discussed, phonemes are the sound parts that make up words. Linguists use
phonemes to accurately record the vocal sounds uttered by humans when speaking. These same
phonemes also can be used to generate computerized speech. TTS engines use their knowledge
of grammar rules and phonemes to scan printed text and generate audio output. </p>
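<p>Conceptually, the front end of a TTS engine behaves something like the sketch below:
each word is looked up in a pronunciation lexicon first, and anything unknown falls back on
letter-to-sound rules. The lexicon entries and the crude fallback shown here are toy
assumptions, not the behavior of any particular engine. </p>
<pre>
// Toy sketch of a TTS front end: lexicon lookup with a crude fallback.
// Both the lexicon contents and the fallback rule are illustrative assumptions.
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

std::map&lt;std::string, std::vector&lt;std::string&gt;&gt; lexicon = {
    { "hello", { "hh", "ah", "l", "ow" } },
    { "world", { "w", "er", "l", "d" } }
};

std::vector&lt;std::string&gt; textToPhonemes(const std::string&amp; text) {
    std::vector&lt;std::string&gt; phonemes;
    std::istringstream words(text);
    std::string word;
    while (words &gt;&gt; word) {
        auto entry = lexicon.find(word);
        if (entry != lexicon.end()) {
            // Known word: copy its phoneme spelling from the lexicon.
            phonemes.insert(phonemes.end(),
                            entry-&gt;second.begin(), entry-&gt;second.end());
        } else {
            // Unknown word: a real engine applies letter-to-sound rules here;
            // this toy fallback just emits one symbol per letter.
            for (char letter : word)
                phonemes.push_back(std::string(1, letter));
        }
    }
    return phonemes;
}

int main() {
    std::vector&lt;std::string&gt; result = textToPhonemes("hello world");
    (void)result;
    return 0;
}
</pre>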
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Note</b></td>
</tr>
<tr>
<td><blockquote>
<p>If you are interested in learning more about phonemes and how they are used to analyze
speech, refer to the <i>Phonetic Symbol Guide</i> by Pullum and Ladusaw (University of
Chicago Press, 1996). </p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<p>The SAPI design model recognizes and allows for the incorporation of phonemes as a
method for creating speech output. Microsoft has developed an expression of the
International Phonetic Alphabet (IPA) in the form of Unicode strings. Programmers can use
these strings to improve the pronunciation skills of the TTS engine, or to add entirely
new words to the vocabulary. </p>
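<p>Because IPA symbols are ordinary Unicode characters, a pronunciation can be carried in a
wide-character string. The snippet below simply builds such a string for one common
pronunciation of "hello"; the call that actually hands the string to a TTS engine is not
shown, because it depends on the SAPI version and interface you are using. </p>
<pre>
// The IPA pronunciation of "hello" expressed as a Unicode wide string:
// h, schwa (U+0259), primary stress mark (U+02C8), l, o, small upsilon (U+028A).
// Only the string is shown; registering it with an engine is version-specific.
#include &lt;string&gt;

int main() {
    std::wstring helloIPA = L"h\u0259\u02C8lo\u028A";
    (void)helloIPA;
    return 0;
}
</pre>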
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Note</b></td>
</tr>
<tr>
<td><blockquote>
<p>If you wish to use IPA phonemes directly to alter the behavior of your TTS engine,
you'll have to program using Unicode. SAPI does not support the direct use of phonemes in
ANSI format.</p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<p>As mentioned in the previous section on voice quality, most TTS engines provide several
methods for improving the pronunciation of words. Unless you are involved in the
development of a text-to-speech engine, you probably will not use phonemes very often. </p>
<h3><a NAME="TTSSynthesis"><b>TTS Synthesis</b></a></h3>
<p>Once the TTS engine knows which phonemes to use to reproduce a word, there are two possible
methods for creating the audio output: <i>synthesis</i> or <i>diphone concatenation</i>. </p>
<p>The synthesis method uses calculations of a person's lip and tongue position, the force
of breath, and other factors to synthesize human speech. This method is usually not as
Ctrl + -