<p>Many vendors are now offering multifunction cards that provide speech, data, FAX, and
telephony services all in one card. You can usually purchase one of these cards for about
$250-$500. By installing one of these new cards, you can upgrade a workstation and reduce
the number of hardware slots in use at the same time. </p>
<p>A few speech-recognition engines still need a <i>DSP</i> (<i>digital signal processor</i>)
card. While it may be preferable to work with newer cards that do not require DSP
handling, there are advantages to using DSP technology. DSP cards handle some of the
computational work of interpreting speech input. This can actually reduce the resource
requirements for providing SR services. In systems where speech is a vital source of
process input, DSP cards can noticeably boost performance. </p>
<p>SR engines require the use of a microphone for audio input. This is usually handled by
a directional microphone mounted on the PC base. Other options include the use of a <i>lavaliere
microphone</i> draped around the neck, or a headset microphone that includes headphones.
Depending on the audio card installed, you may also be able to use a telephone handset for
input. </p>
<p>Most multimedia systems ship with a suitable microphone built into the PC or as an
external device that plugs into the sound card. It is also possible to purchase high-grade
unidirectional microphones from audio retailers. Depending on the microphone and the sound
card used, you may need an amplifier to boost the input to levels usable by the SR engine.
</p>
<p>The quality of the audio input is one of the most important factors in successful
implementation of speech services on a PC. If the system will be used in a noisy
environment, close-talk microphones should be used. This will reduce extraneous noise and
improve the recognition capabilities of the SR engine. </p>
<p>Speakers or headphones are needed to play back TTS output. In private office spaces,
free-standing speakers provide the best sound reproduction and the least risk of ear
damage from high playback levels. However, in larger offices, or in areas where the
playback can disturb others, headphones are preferred. </p>
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Tip</b></td>
</tr>
<tr>
<td><blockquote>
<p>As mentioned earlier in this chapter, some systems can also provide audio playback
through a telephone handset. Conversely, free-standing speakers and a
microphone can serve together as a speakerphone system.</p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<h2><a NAME="TechnologyIssues"><font SIZE="5" COLOR="#FF0000">Technology Issues</font></a></h2>
<p>As advanced as SR/TTS technology is, it still has its limits. This section covers the
general technology issues for SR and TTS engines along with a quick summary of some of the
limits of the process and how they can affect perceived performance and system design. </p>
<h3><a NAME="SRTechniques">SR Techniques</a></h3>
<p>Speech recognition technology can be measured by three factors:
<ul>
<li><font COLOR="#000000">Word selection</font> </li>
<li><font COLOR="#000000">Speaker dependence</font> </li>
<li><font COLOR="#000000">Word analysis</font> </li>
</ul>
<p><i>Word selection</i> deals with the process of actually perceiving "word
items" as input. Any speech engine must have some method for listening to the input
stream and deciding when a word item has been uttered. There are three different methods
for selecting words from the input stream. They are:
<ul>
<li>Discrete speech </li>
<li>Word spotting </li>
<li>Continuous speech </li>
</ul>
<p><i>Discrete speech</i> is the simplest form of word selection. Under discrete speech,
the engine requires a slight pause between each word. This pause marks the beginning and
end of each word item. Discrete speech requires the least amount of computational
resources. However, discrete speech is not very natural for users. With a discrete speech
system, users must speak in a halting voice. This may be adequate for short interactions
with the speech system, but it becomes annoying over extended periods. </p>
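<p>A minimal sketch of the idea behind discrete-speech word selection, assuming the audio
has already been reduced to per-frame energy values (the threshold and names below are
illustrative, not any particular engine's API):</p>
<pre>
# Sketch: discrete-speech word selection by detecting silence gaps.
# A run of quiet frames marks the pause between word items.

SILENCE = 0.02    # frame energy below this counts as silence (illustrative)
MIN_PAUSE = 5     # quiet frames needed to end a word item

def split_words(frame_energies):
    words, current, quiet = [], [], 0
    for i, energy in enumerate(frame_energies):
        if energy > SILENCE:
            current.append(i)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= MIN_PAUSE:            # pause long enough: word ended
                words.append((current[0], current[-1]))
                current, quiet = [], 0
    if current:                               # flush the final word item
        words.append((current[0], current[-1]))
    return words                              # (start_frame, end_frame) pairs
</pre>
<p>Because the engine only has to find the gaps between words, the computational cost
stays low, which is exactly the trade-off described above.</p>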
<p>A more natural method of handling speech input is word spotting. Under <i>word
spotting</i>, the speech engine listens for a list of key words in the input stream.
This method allows users to speak continuously. Since the system is
"listening" for key words, users do not need to use unnatural pauses while they
speak. The advantage of word spotting is that it gives users the perception that the
system is actually listening to every word while limiting the amount of resources required
by the engine itself. The disadvantage of word spotting is that the system can easily
misinterpret input. For example, if the engine recognizes the word <i>run</i>, it will
interpret the phrases "Run Excel" and "Run Access" as the same phrase.
For this reason, it is important to design vocabularies for word-spotting systems that
limit the possibility of confusion. </p>
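<p>The following sketch shows why vocabulary design matters in a word-spotting system.
Spotting the single key word <i>run</i> cannot tell "Run Excel" from "Run Access," so
the key list pairs each action word with its object (the phrases and action names are
hypothetical):</p>
<pre>
# Sketch: word spotting over a stream of recognized tokens.
KEY_PHRASES = {
    ("run", "excel"): "launch_excel",
    ("run", "access"): "launch_access",
}

def spot(tokens):
    hits = []
    for i in range(len(tokens) - 1):
        action = KEY_PHRASES.get((tokens[i], tokens[i + 1]))
        if action:
            hits.append(action)          # key phrase found in the stream
    return hits

print(spot("please run excel for me".split()))   # ['launch_excel']
</pre>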
<p>The most advanced form of word selection is the continuous speech method. Under <i>continuous
speech</i>, the SR engine attempts to recognize each word that is uttered in real time.
This is the most resource-intensive of the word selection methods. For this reason,
continuous speech is best reserved for dictation systems that require complete and
accurate perception of every word. </p>
<p>The process of word selection can be affected by the speaker. <i>Speaker dependence</i>
refers to the engine's ability to deal with different speakers. Systems can be speaker
dependent, speaker independent, or speaker adaptive. The disadvantage of speaker-dependent
systems is that they require extensive training by a single user before they become very
accurate. This training can last as much as one hour before the system has an accuracy
rate of over 90 percent. Another drawback to speaker-dependent systems is that each new
user must re-train the system to reduce confusion and improve performance. However,
speaker-dependent systems provide the greatest degree of accuracy while using the least
amount of computing resources. </p>
<p>Speaker-adaptive systems are designed to perform adequately without training, but they
improve with use. The advantage of speaker-adaptive systems is that users experience
success without tedious training. Disadvantages include additional computing resource
requirements and possible reduced performance on systems that must serve different people.
</p>
<p>Speaker-independent systems are designed to provide adequate accuracy for any speaker
without prior training. They are a must for installations where multiple
speakers need to use the same station. The drawback of speaker-independent systems is that
they require the greatest degree of computing resources. </p>
<p>Once a word item has been selected, it must be analyzed. <i>Word analysis</i>
techniques involve matching the word item to a list of known words in the engine's
vocabulary. There are two methods for handling word analysis: <i>whole-word matching</i>
or <i>sub-word matching</i>. Under whole-word matching, the SR engine matches the word
item against a vocabulary of complete word templates. The advantage of this method is that
the engine is able to make an accurate match very quickly, without the need for a great
deal of computing power. The disadvantage of whole-word matching is that it requires
extremely large vocabularies, running into the tens of thousands of entries. Also, these words must
be stored as spoken templates. Each word can require as much as 512 bytes of storage. </p>
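<p>A quick back-of-the-envelope calculation, using the 512-bytes-per-word figure above
(the vocabulary size is an assumed example):</p>
<pre>
# Storage cost of whole-word templates for a mid-sized vocabulary.
words = 30000                             # assumed vocabulary size
template_bytes = 512                      # per-word figure from the text
print(words * template_bytes / 2**20)     # about 14.6 MB of templates
</pre>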
<p>An alternate word-matching method involves the use of sub-words called <i>phonemes</i>.
Each language has a fixed set of phonemes that are used to build all words. Once the SR
engine knows the phonemes and their acoustic representations, it can recognize a much
wider range of words. Under sub-word matching, the engine does not require an extensive
vocabulary. An additional advantage of sub-word systems is that the pronunciation of a
word can be determined from printed text. Phoneme storage requires only 5 to 20 bytes per
phoneme. The disadvantage of sub-word matching is that it requires more processing
resources to analyze input. </p>
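<p>A minimal sketch of the sub-word approach, with ARPAbet-style phoneme symbols used
purely for illustration. The vocabulary stores short phoneme sequences instead of
512-byte audio templates, so adding a word from printed text is cheap:</p>
<pre>
# Sketch: sub-word (phoneme) matching against a small lexicon.
LEXICON = {
    ("r", "ah", "n"): "run",
    ("s", "t", "aa", "p"): "stop",
}

def match(phonemes):
    # The engine's acoustic stage yields a phoneme sequence; matching
    # is then a simple lexicon lookup.
    return LEXICON.get(tuple(phonemes))

print(match(["r", "ah", "n"]))    # 'run'
</pre>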
<h3><a NAME="SRLimits">SR Limits</a></h3>
<p>It is important to understand the limits of current SR technology and how these limits
affect system performance. Three of the most vital limitations of current SR technology
are:
<ul>
<li><font COLOR="#000000">Speaker identification</font> </li>
<li><font COLOR="#000000">Input recognition</font> </li>
<li><font COLOR="#000000">Recognition accuracy</font> </li>
</ul>
<p>The first hurdle for SR engines is determining when the speaker is addressing the
engine and when the words are directed to someone else in the room. This skill is beyond
the SR systems currently on the market. Your program must allow users to inform the
computer that they are addressing the engine. Also, SR engines cannot distinguish between
multiple speakers. With speaker-independent systems, this is not a big problem. However,
speaker-dependent systems do not cope well in situations where multiple users may be
addressing the same system. </p>
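<p>One common workaround is a push-to-talk key or an attention word that opens a short
command window. The sketch below uses an attention word; all names and the window length
are illustrative:</p>
<pre>
# Sketch: gating recognition so the engine only acts when addressed.
ATTENTION_WORD = "computer"
WINDOW = 3                  # tokens accepted after the attention word

def gate(tokens):
    accepted, open_for = [], 0
    for token in tokens:
        if token == ATTENTION_WORD:
            open_for = WINDOW        # the user addressed the engine
        elif open_for:
            accepted.append(token)   # treat as a command word
            open_for -= 1
    return accepted

print(gate("tell bob later computer run excel".split()))
# ['run', 'excel'], only the speech after the attention word
</pre>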
<p>Even speaker-independent systems can have a hard time when multiple speakers are
involved. For example, a dictation system designed to transcribe a meeting will not be
able to differentiate between speakers. Also, SR systems fail when two people are speaking
at the same time. </p>
<p>SR engines also have limits regarding the processing of identified words. First, SR
engines have no ability to process natural language. They can only recognize words in the
existing vocabulary and process them based on known grammar rules. Thus, despite any
perceived "friendliness" of speech-enabled systems, they do not really
understand the speaker at all. </p>
<p>SR engines are also unable to hear a new word and derive its meaning from previously
spoken words. The system is incapable of spelling or rendering words that are not already
in its vocabulary. </p>
<p>Finally, SR engines are not able to deal with wide variations in pronunciation of the
same word. For example, words such as either (<i>ee-ther</i> or <i>I-ther</i>) and potato
(<i>po-tay-toe</i> or <i>po-tah-toe</i>) can easily confuse the system. Wide variations in
pronunciation can greatly reduce the accuracy of SR systems. </p>
<p>Recognition accuracy can be affected by regional dialects, quality of the microphone,
and the ambient noise level during a speech session. Much like the problem with
pronunciation, dialect variations can hamper SR engine performance. If your software is
implemented in a location where the common speech contains local slang or other
region-specific words, these words may be misinterpreted or not recognized at all. </p>
<p>Poor microphones or noisy office spaces also affect accuracy. A system that works fine
in a quiet, well-equipped office may be unusable in a noisy facility. In a noisy
environment, the SR engine is more likely to confuse similar-sounding words such as <i>out</i>
and <i>pout</i>, or <i>in</i> and <i>when</i>. For this reason, it is important to
emphasize the value of a good microphone and a quiet environment when performing SR
activities. </p>
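<p>Vocabulary confusability can be checked before deployment. The sketch below uses plain
edit distance on spellings as a crude stand-in for acoustic similarity (a real check
would compare phoneme strings):</p>
<pre>
# Sketch: flag command words that are only one edit apart.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

vocab = ["in", "when", "out", "pout", "run"]
for i, w1 in enumerate(vocab):
    for w2 in vocab[i + 1:]:
        if edit_distance(w1, w2) == 1:   # one edit apart: risky pair
            print(w1, w2)                # prints: out pout
</pre>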
<h3><a NAME="TTSTechniques">TTS Techniques</a></h3>
<p>TTS engines use two different techniques for turning text input into audio
output: synthesis or diphone concatenation. <i>Synthesis</i> involves the creation of human
speech through the use of stored phonemes. This method results in audio output that is
understandable, but not very human-like. The advantage of synthesis systems is that they
do not require a great deal of storage space to implement and that they allow for the
modification of voice quality through the adjustment of only a few parameters. </p>
<p><i>Diphone</i>-based systems produce output that is much closer to human speech. This
is because the system stores snippets of actual recorded human speech (each diphone spans
the transition from one phoneme to the next) and plays them back. The
disadvantage of this method is that it requires more computing and storage capacity.
However, if your application is used to provide long sessions of audio output, diphone
systems produce speech that is much easier to understand. </p>
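<p>A toy sketch of diphone concatenation. Each stored unit spans the transition between
two phonemes, and playback is a simple join of the recorded snippets; the tiny numeric
lists stand in for real recorded waveforms:</p>
<pre>
# Sketch: assembling "hello" from stored diphone recordings.
DIPHONES = {
    ("sil", "h"): [0.0, 0.1], ("h", "eh"): [0.2, 0.3],
    ("eh", "l"): [0.3, 0.2], ("l", "ow"): [0.2, 0.1],
    ("ow", "sil"): [0.1, 0.0],
}

def synthesize(phonemes):
    units = ["sil"] + phonemes + ["sil"]   # pad with silence at both ends
    samples = []
    for pair in zip(units, units[1:]):     # walk phoneme-to-phoneme pairs
        samples.extend(DIPHONES[pair])     # append the recorded transition
    return samples

print(synthesize(["h", "eh", "l", "ow"]))
</pre>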
<h3><a NAME="TTSLimits">TTS Limits</a></h3>
<p>TTS engines are limited in their ability to re-create the details of spoken language,
including the rhythm, accent, and pitch inflection. This combination of properties is called
the <i>prosody</i> of speech. TTS engines are not very good at adding prosody. For this
reason, listening to TTS output can be difficult, especially for long periods of time.
Most TTS engines allow users to edit text files with embedded control information that
adds prosody to the ASCII text. This is useful for systems that are used to
"read" text that is edited and stored for later retrieval. </p>
<p>TTS systems have their limits when it comes to producing individualized voices.
Synthesis-based engines are relatively easy to modify to create new voice types. This
modification involves the adjustment of general pitch and speed to produce new vocal
personalities such as "old man," "child," "female,"
"male," and so on. However, these voices still use the same prosody and grammar
rules. </p>
<p>Creating new voices for diphone-based systems is much more costly than for
synthesis-based systems. Since each new vocal personality must be assembled from
pre-recorded human speech, it can take quite a bit of time and effort to alter an existing
voice set or to produce a new one. Diphone concatenation is costly for systems that must
support multiple languages or need to provide flexibility in voice personalities. </p>
<h2><a NAME="GeneralSRDesignIssues"><font SIZE="5" COLOR="#FF0000">General SR Design
Issues</font></a></h2>
<p>There are a number of general issues to keep in mind when designing SR interfaces to
your applications. </p>
<p>First, if you provide speech services within your application, you'll need to make sure
you let the user know the services are available. You can do this by adding a graphic
image to the display telling the user that the computer is "listening," or by adding
caption or status items that indicate the current state of the SR engine. </p>
<p>It is also a good idea to make speech services an optional feature whenever possible.
Some installations may not have the hardware or RAM required to implement speech services.
Even if the workstation has adequate resources, the user may experience performance
degradation with the speech services active. It is a good idea to have a menu option or
some other method that allows users to turn off speech services entirely. </p>
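<p>A minimal sketch of both ideas, with all names illustrative: a status caption reflects
the engine state, and a single flag, wired to a menu item, turns the services off
entirely:</p>
<pre>
# Sketch: making speech services visible and optional.
class SpeechServices:
    def __init__(self):
        self.enabled = False             # off by default; the user opts in

    def status_caption(self):
        return "Listening..." if self.enabled else "Speech off"

    def toggle(self):                    # wired to a menu option
        self.enabled = not self.enabled
        print("Caption:", self.status_caption())

svc = SpeechServices()
svc.toggle()    # Caption: Listening...
</pre>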