<html>
<head>
<title>Chapter 14 -- What Is SAPI</title>
<meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 3.0">
</head>
<body TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">
<h1><font COLOR="#FF0000">Chapter 14</font></h1>
<h1><b><font SIZE="5" COLOR="#FF0000">What Is SAPI</font></b> </h1>
<hr WIDTH="100%">
<h3 ALIGN="CENTER"><font SIZE="+2" COLOR="#000000">CONTENTS<a NAME="CONTENTS"></a> </font></h3>
<ul>
<li><a HREF="#SpeechRecognition">Speech Recognition</a> <ul>
<li><a HREF="#WordSeparation">Word Separation</a> </li>
<li><a HREF="#SpeakerDependence">Speaker Dependence</a> </li>
<li><a HREF="#WordMatching">Word Matching</a> </li>
<li><a HREF="#Vocabulary">Vocabulary</a> </li>
<li><a HREF="#TexttoSpeech">Text-to-Speech</a> </li>
<li><a HREF="#VoiceQuality">Voice Quality</a> </li>
<li><a HREF="#Phonemes">Phonemes</a> </li>
<li><a HREF="#TTSSynthesis">TTS Synthesis</a> </li>
<li><a HREF="#TTSDiphoneConcatenation">TTS Diphone Concatenation</a> </li>
</ul>
</li>
<li><a HREF="#GrammarRules">Grammar Rules</a> <ul>
<li><a HREF="#ContextFreeGrammars">Context-Free Grammars</a> </li>
<li><a HREF="#DictationGrammars">Dictation Grammars</a> </li>
<li><a HREF="#LimitedDomainGrammars">Limited Domain Grammars</a> </li>
</ul>
</li>
<li><a HREF="#Summary">Summary</a> </li>
</ul>
<hr>
<p>One of the newest extensions for the Windows 95 operating system is the <i>Speech
Application Programming Interface</i> (SAPI). This Windows extension gives workstations
the ability to recognize human speech as input, and create human-like audio output from
printed text. This ability adds a new dimension to human/PC interaction. Speech
recognition services can be used to extend the use of PCs to those who find typing too
difficult or too time-consuming. Text-to-speech services can be used to provide aural
representations of text documents to those who cannot see typical display screens because
of physical limitations or due to the nature of their work. </p>
<p>Like the other Windows services described in this book, SAPI is part of the Windows
Open Services Architecture (WOSA) model. Speech recognition (SR) and text-to-speech (TTS)
services are actually provided by separate modules called <i>engines</i>. Users can select
the speech engine they prefer to use as long as it conforms to the SAPI interface. </p>
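<p>To make the engine model concrete, the short sketch below shows an application asking a
speech engine to speak a string through COM. It uses the <i>ISpVoice</i> interface from the
later SAPI 5 release rather than the original interfaces this chapter describes, so treat it
as an illustration of the general shape of a SAPI call, not as the exact API covered here. </p>
<pre>
// Minimal text-to-speech sketch using the SAPI COM interfaces.
// Note: ISpVoice belongs to the later SAPI 5 release; the Windows 95-era
// interfaces described in this chapter differ. Link with sapi.lib and ole32.lib.
#include &lt;sapi.h&gt;

int main()
{
    if (FAILED(::CoInitialize(NULL)))
        return 1;

    ISpVoice *pVoice = NULL;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                    IID_ISpVoice, (void **)&amp;pVoice);
    if (SUCCEEDED(hr))
    {
        // Whichever SAPI-conforming TTS engine the user has selected
        // actually renders the audio.
        pVoice-&gt;Speak(L"Speech services are now available.", SPF_DEFAULT, NULL);
        pVoice-&gt;Release();
    }

    ::CoUninitialize();
    return 0;
}
</pre>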
<p>In this chapter you'll learn the basic concepts behind designing and implementing a
speech recognition and text-to-speech engine using the SAPI design model. You'll also
learn about creating grammar definitions for speech recognition. </p>
<h2><a NAME="SpeechRecognition"><b><font SIZE="5" COLOR="#FF0000">Speech Recognition</font></b></a></h2>
<p>Any speech system has, at its heart, a process for recognizing human speech and turning
it into something the computer understands. In effect, the computer needs a translator.
Research into effective speech recognition algorithms and processing models has been going
on almost since the computer was invented, and a great deal of mathematics and
linguistics goes into the design and implementation of a speech recognition system. A
detailed discussion of speech recognition algorithms is beyond the scope of this book, but
it is important to have a good idea of the commonly used techniques for turning human
speech into something a computer understands. </p>
<p>Every speech recognition system uses four key operations to listen to and understand
human speech. They are:
<ul>
<li><i>Word separation</i>-This is the process of creating discrete portions of human
speech. Each portion can be as large as a phrase or as small as a single syllable or word
part. </li>
<li><i>Vocabulary</i>-This is the list of speech items that the speech engine can identify. </li>
<li><i>Word matching</i>-This is the method that the speech system uses to look up a speech
part in the system's vocabulary-the <i>search engine</i> portion of the system. </li>
<li><i>Speaker dependence</i>-This is the degree to which the speech engine is dependent on
the vocal tones and speaking patterns of individuals. </li>
</ul>
<p>These four aspects of the speech system are closely interrelated. If you want to
develop a speech system with a rich vocabulary, you'll need a sophisticated word matching
system to quickly search the vocabulary. Also, as the vocabulary gets larger, more items
in the list could sound similar (for example, <i>yes</i> and <i>yet</i>). In order to
successfully identify these speech parts, the word separation portion of the system must
be able to determine smaller and smaller differences between speech items. </p>
<p>Finally, the speech engine must balance all of these factors against the aspect of
speaker dependence. As the speech system learns smaller and smaller differences between
words, the system becomes more and more dependent on the speaking habits of a single user.
Individual accents and speech patterns can confuse speech engines. In other words, as the
system becomes more responsive to a single user, that same system becomes less able to
translate the speech of other users. </p>
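<p>A toy sketch can make the matching problem visible. Real engines compare acoustic feature
vectors, not spellings, so the edit-distance measure below is only a stand-in (and the
vocabulary is invented), but it shows the <i>search engine</i> role: find the nearest item in
the vocabulary, and notice how little separates <i>yes</i> from <i>yet</i>. </p>
<pre>
// Toy word-matching sketch: nearest-neighbor lookup in a small vocabulary.
// Edit distance over letters stands in for a real acoustic distance; it is
// enough to show why near neighbors such as "yes" and "yet" are hard to
// keep apart as the vocabulary grows.
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

static int min3(int a, int b, int c)
{
    int m = (a &lt; b) ? a : b;
    return (m &lt; c) ? m : c;
}

// Classic Levenshtein edit distance, sized for vocabulary words.
static int editDistance(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[16][16];
    for (int i = 0; i &lt;= la; i++) d[i][0] = i;
    for (int j = 0; j &lt;= lb; j++) d[0][j] = j;
    for (int i = 1; i &lt;= la; i++)
        for (int j = 1; j &lt;= lb; j++)
            d[i][j] = min3(d[i - 1][j] + 1, d[i][j - 1] + 1,
                           d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[la][lb];
}

int main()
{
    const char *vocabulary[] = { "yes", "yet", "no", "help", "quit" };
    const char *heard = "yes"; // pretend this token came from the audio front end
    int best = 0, bestScore = 999;

    // The "search engine" portion of the system: nearest-neighbor lookup.
    for (int i = 0; i &lt; 5; i++)
    {
        int dist = editDistance(heard, vocabulary[i]);
        if (dist &lt; bestScore) { bestScore = dist; best = i; }
    }
    printf("matched: %s (distance %d)\n", vocabulary[best], bestScore);
    return 0;
}
</pre>
<p>With <i>heard</i> set to <i>yes</i>, the entry <i>yet</i> scores a distance of only 1.
That is the whole problem in miniature: a richer vocabulary means more of these one-step
neighbors for the matcher to separate. </p>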
<p>The next few sections describe each of the four aspects of a speech engine in a bit
more detail. </p>
<h3><a NAME="WordSeparation"><b>Word Separation</b></a></h3>
<p>The first task of the speech engine is to accept words as input. Speech engines use a
process called <i>word separation</i> to gather human speech. Just as the keyboard is used
as an input device to accept physical keystrokes for translation into readable characters,
the process of word separation accepts the sound of human speech for translation by the
computer. </p>
<p>There are three basic methods of word separation. In ascending order of complexity they
are:
<ul>
<li><font COLOR="#000000">Discrete speech </font></li>
<li><font COLOR="#000000">Word spotting </font></li>
<li><font COLOR="#000000">Continuous speech</font> </li>
</ul>
<p>Systems that use the <i>discrete speech</i> method of word separation require the user
to place a short pause between each spoken word. This slight bit of silence allows the
speech system to recognize the beginning and ending of each word. The silences separate
the words much like the space bar does when you type. The advantage of the discrete speech
method is that it requires the least amount of computational resources. The disadvantage
of this method is that it is not very user-friendly. Discrete speech systems can easily
become confused if a person does not pause between words. </p>
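<p>The sketch below shows, under invented thresholds and frame sizes, the kind of silence
detection a discrete speech system relies on: compute the short-term energy of each frame
of audio and mark a word boundary wherever the energy drops back to silence. </p>
<pre>
// Silence-based word separation sketch for a discrete speech system.
// FRAME and SILENCE_LEVEL are illustrative assumptions, not values from
// any particular engine.
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define FRAME 160         /* 20 ms of 8 kHz, 16-bit mono audio */
#define SILENCE_LEVEL 500 /* assumed average-magnitude threshold */

int main()
{
    /* In a real system these samples arrive from the sound card. */
    static short samples[8000] = { 0 }; /* one second of audio */

    /* Fake a quarter-second burst of sound so the demo has one "word". */
    for (int i = 2000; i &lt; 4000; i++)
        samples[i] = 2000;

    int nFrames = (int)(sizeof(samples) / sizeof(samples[0])) / FRAME;
    int inWord = 0;

    for (int f = 0; f &lt; nFrames; f++)
    {
        long energy = 0;
        for (int i = 0; i &lt; FRAME; i++)
            energy += labs((long)samples[f * FRAME + i]);
        energy /= FRAME; /* average magnitude of this frame */

        if (!inWord &amp;&amp; energy &gt; SILENCE_LEVEL)
        {
            inWord = 1; /* the pause ended; a word is starting */
            printf("word starts at frame %d\n", f);
        }
        else if (inWord &amp;&amp; energy &lt;= SILENCE_LEVEL)
        {
            inWord = 0; /* back in the pause between words */
            printf("word ends at frame %d\n", f);
        }
    }
    return 0;
}
</pre>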
<p>Systems that use <i>word spotting</i> avoid the need for users to pause in between each
word by listening only for key words or phrases. Word spotting systems, in effect, ignore
the items they do not know or care about and act only on the words they can match in their
vocabulary. For example, suppose the speech system can recognize the word <i>help</i>, and
knows to load the Windows Help engine whenever it hears the word. Under word spotting, the
following phrases will all result in the speech engine invoking Windows Help: </p>
<blockquote>
<p><i>Please load Help.<br>
Can you help me, please?<br>
These definitions are no help at all!</i> </p>
</blockquote>
<p>As you can see, one of the disadvantages of word spotting is that the system can easily
misinterpret the user's meaning. However, word spotting also has several key advantages.
Word spotting allows users to speak normally, without employing pauses. Also, since word
spotting systems simply ignore words they don't know and act only on key words, these
systems can give the appearance of being more sophisticated than they really are. Word
spotting requires more computing resources than discrete speech, but not as much as the
last method of word separation-continuous speech. </p>
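<p>A word spotting dispatcher can be surprisingly small, which is part of its appeal. The
sketch below scans recognized text for key words and ignores everything else; all three of
the <i>help</i> phrases quoted above trigger the same action, which is both the strength and
the weakness just described. The keyword table is a made-up example. </p>
<pre>
// Word-spotting sketch: act only on vocabulary words, ignore the rest.
// The input string stands in for whatever the recognizer heard.
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ctype.h&gt;

static void onHelp(void) { printf("-&gt; launching Windows Help\n"); }
static void onQuit(void) { printf("-&gt; exiting the application\n"); }

int main()
{
    const char *utterance = "These definitions are no help at all!";
    char word[64];
    int n = 0;

    for (const char *p = utterance; ; p++)
    {
        if (*p != '\0' &amp;&amp; isalpha((unsigned char)*p))
        {
            if (n &lt; 63) word[n++] = (char)tolower((unsigned char)*p);
            continue;
        }
        word[n] = '\0'; /* a word just ended; check the vocabulary */
        if (strcmp(word, "help") == 0) onHelp();
        else if (strcmp(word, "quit") == 0) onQuit();
        n = 0;
        if (*p == '\0') break;
    }
    return 0;
}
</pre>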
<p><i>Continuous speech systems</i> recognize and process every word spoken. This gives
the greatest degree of accuracy when attempting to understand a speaker's request.
However, it also requires the greatest amount of computing power. First, the speech system
must determine the start and end of each word without the use of silence. This is much
like readingtextthathasnospacesinit (see!). Once the words have been separated, the system
must look them up in the vocabulary and identify them. This, too, can take precious
computing time. The primary advantage of continuous speech systems is that they offer the
greatest level of sophistication in recognizing human speech. The primary disadvantage is
the amount of computing resources they require. </p>
<h3><a NAME="SpeakerDependence"><b>Speaker Dependence</b></a> </h3>
<p>Speaker dependence is a key factor in the design and implementation of a speech
recognition system. In theory, you would like a system that has very little speaker
dependence. This would mean that the same workstation could be spoken to by several people
with the same positive results. People often speak quite differently from one another,
however, and this can cause problems. </p>
<p>First, there is the case of accents. Just using the United States as an example, you
can identify several regional sounds. Add to these the possibility that speakers may also
have accents that come from outside the U.S. due to the influence of other languages
(Spanish, German, Japanese), and you have a wide range of pronunciation for even the
simplest of sentences. Speaker speed and pitch inflection can also vary widely, which can
pose problems for speech systems that need to determine whether a spoken phrase is a
statement or a question. </p>
<p>Speech systems fall into three categories in terms of their speaker dependence. They
can be:
<ul>
<li><font COLOR="#000000">Speaker independent</font> </li>
<li><font COLOR="#000000">Speaker dependent</font> </li>
<li><font COLOR="#000000">Speaker adaptive</font> </li>
</ul>
<p><i>Speaker-independent</i> systems require the most resources. They must be able to
accurately translate human speech across as many dialects and accents as possible. <i>Speaker-dependent</i>
systems require the least amount of computing resources. These systems require that the
user "train" the system before it is able to accurately convert human speech. A
compromise between the two approaches is the speaker-adaptive method. <i>Speaker-adaptive</i>
systems are prepared to work without training, but increase their accuracy after working
with the same speaker for a period of time. </p>
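<p>The speaker-adaptive idea can be sketched in a few lines: start from a generic model and
nudge it toward each confirmed utterance from the current user, so accuracy improves with
use. The single-number "template" and the blend rate below are illustrative assumptions
only; real engines adapt far richer acoustic models. </p>
<pre>
// Speaker-adaptive sketch: blend each confirmed observation into a stored
// template so the model gradually tracks the current speaker.
#include &lt;stdio.h&gt;

#define ADAPT_RATE 0.1f /* assumed: how quickly the model tracks this speaker */

struct Template
{
    const char *word;
    float feature; /* stand-in for a real acoustic model */
};

/* Move the template a fraction of the way toward what was actually heard. */
static void adapt(struct Template *t, float observed)
{
    t-&gt;feature += ADAPT_RATE * (observed - t-&gt;feature);
}

int main()
{
    struct Template yes = { "yes", 0.50f }; /* generic, speaker-independent start */
    float thisSpeaker[] = { 0.62f, 0.65f, 0.61f, 0.64f };

    for (int i = 0; i &lt; 4; i++)
    {
        adapt(&amp;yes, thisSpeaker[i]);
        printf("after utterance %d: template for \"%s\" = %.3f\n",
               i + 1, yes.word, yes.feature);
    }
    return 0;
}
</pre>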
<p>The additional training required by speaker-dependent systems can be frustrating to
users. Training can take several hours, although some systems reach 90 percent accuracy
or better after just five minutes of training. Users with physical disabilities, or those
who find typing highly inefficient, are the most likely to accept speaker-dependent
systems. </p>
<p>Systems that will be used by many different people need the power of speaker
independence. </p>
Ctrl + -