<html>
<head>
<title>Chapter 14 -- What Is SAPI</title>
<meta NAME="GENERATOR" CONTENT="Microsoft FrontPage 3.0">
</head>
<body TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#0000EE" VLINK="#551A8B" ALINK="#CE2910">
<h1><font COLOR="#FF0000">Chapter 14</font></h1>
<h1><b><font SIZE="5" COLOR="#FF0000">What Is SAPI</font></b> </h1>
<hr WIDTH="100%">
<h3 ALIGN="CENTER"><font SIZE="+2" COLOR="#000000">CONTENTS<a NAME="CONTENTS"></a> </font></h3>
<ul>
<li><a HREF="#SpeechRecognition">Speech Recognition</a> <ul>
<li><a HREF="#WordSeparation">Word Separation</a> </li>
<li><a HREF="#SpeakerDependence">Speaker Dependence</a> </li>
<li><a HREF="#WordMatching">Word Matching</a> </li>
<li><a HREF="#Vocabulary">Vocabulary</a> </li>
<li><a HREF="#TexttoSpeech">Text-to-Speech</a> </li>
<li><a HREF="#VoiceQuality">Voice Quality</a> </li>
<li><a HREF="#Phonemes">Phonemes</a> </li>
<li><a HREF="#TTSSynthesis">TTS Synthesis</a> </li>
<li><a HREF="#TTSDiphoneConcatenation">TTS Diphone Concatenation</a> </li>
</ul>
</li>
<li><a HREF="#GrammarRules">Grammar Rules</a> <ul>
<li><a HREF="#ContextFreeGrammars">Context-Free Grammars</a> </li>
<li><a HREF="#DictationGrammars">Dictation Grammars</a> </li>
<li><a HREF="#LimitedDomainGrammars">Limited Domain Grammars</a> </li>
</ul>
</li>
<li><a HREF="#Summary">Summary</a> </li>
</ul>
<hr>
<p>One of the newest extensions for the Windows 95 operating system is the <i>Speech
Application Programming Interface</i> (SAPI). This Windows extension gives workstations
the ability to recognize human speech as input, and create human-like audio output from
printed text. This ability adds a new dimension to human/PC interaction. Speech
recognition services can be used to extend the use of PCs to those who find typing too
difficult or too time-consuming. Text-to-speech services can be used to provide aural
representations of text documents to those who cannot see typical display screens because
of physical limitations or due to the nature of their work. </p>
<p>Like the other Windows services described in this book, SAPI is part of the Windows
Open Services Architecture (WOSA) model. Speech recognition (SR) and text-to-speech (TTS)
services are actually provided by separate modules called <i>engines</i>. Users can select
the speech engine they prefer to use as long as it conforms to the SAPI interface. </p>
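<p>To make the engine model concrete, the short sketch below shows an application asking a
speech engine to speak a string through COM. It uses the <i>ISpVoice</i> interface from the
later SAPI 5 release rather than the original interfaces this chapter describes, so treat it
as an illustration of the general shape of a SAPI call, not as the exact API covered here. </p>
<pre>
// Minimal text-to-speech sketch using the SAPI COM interfaces.
// Note: ISpVoice belongs to the later SAPI 5 release; the Windows 95-era
// interfaces described in this chapter differ. Link with sapi.lib and ole32.lib.
#include &lt;sapi.h&gt;

int main()
{
    if (FAILED(::CoInitialize(NULL)))
        return 1;

    ISpVoice *pVoice = NULL;
    HRESULT hr = ::CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
                                    IID_ISpVoice, (void **)&amp;pVoice);
    if (SUCCEEDED(hr))
    {
        // Whichever SAPI-conforming TTS engine the user has selected
        // actually renders the audio.
        pVoice-&gt;Speak(L"Speech services are now available.", SPF_DEFAULT, NULL);
        pVoice-&gt;Release();
    }

    ::CoUninitialize();
    return 0;
}
</pre>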
<p>In this chapter you'll learn the basic concepts behind designing and implementing a
speech recognition and text-to-speech engine using the SAPI design model. You'll also
learn about creating grammar definitions for speech recognition. </p>
<h2><a NAME="SpeechRecognition"><b><font SIZE="5" COLOR="#FF0000">Speech Recognition</font></b></a></h2>
<p>Any speech system has, at its heart, a process for recognizing human speech and turning
it into something the computer understands. In effect, the computer needs a translator.
Research into effective speech recognition algorithms and processing models has been going
on almost since the computer was invented, and a great deal of mathematics and
linguistics goes into the design and implementation of a speech recognition system. A
detailed discussion of speech recognition algorithms is beyond the scope of this book, but
it is important to have a good idea of the commonly used techniques for turning human
speech into something a computer understands. </p>
<p>Every speech recognition system uses four key operations to listen to and understand
human speech. They are:
<ul>
<li><i>Word separation</i>-This is the process of creating discrete portions of human
speech. Each portion can be as large as a phrase or as small as a single syllable or word
part. </li>
<li><i>Vocabulary</i>-This is the list of speech items that the speech engine can identify. </li>
<li><i>Word matching</i>-This is the method that the speech system uses to look up a speech
part in the system's vocabulary-the <i>search engine</i> portion of the system. </li>
<li><i>Speaker dependence</i>-This is the degree to which the speech engine is dependent on
the vocal tones and speaking patterns of individuals. </li>
</ul>
<p>These four aspects of the speech system are closely interrelated. If you want to
develop a speech system with a rich vocabulary, you'll need a sophisticated word matching
system to quickly search the vocabulary. Also, as the vocabulary gets larger, more items
in the list could sound similar (for example, <i>yes</i> and <i>yet</i>). In order to
successfully identify these speech parts, the word separation portion of the system must
be able to determine smaller and smaller differences between speech items. </p>
<p>Finally, the speech engine must balance all of these factors against the aspect of
speaker dependence. As the speech system learns smaller and smaller differences between
words, the system becomes more and more dependent on the speaking habits of a single user.
Individual accents and speech patterns can confuse speech engines. In other words, as the
system becomes more responsive to a single user, that same system becomes less able to
translate the speech of other users. </p>
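<p>A toy sketch can make the matching problem visible. Real engines compare acoustic feature
vectors, not spellings, so the edit-distance measure below is only a stand-in (and the
vocabulary is invented), but it shows the <i>search engine</i> role: find the nearest item in
the vocabulary, and notice how little separates <i>yes</i> from <i>yet</i>. </p>
<pre>
// Toy word-matching sketch: nearest-neighbor lookup in a small vocabulary.
// Edit distance over letters stands in for a real acoustic distance; it is
// enough to show why near neighbors such as "yes" and "yet" are hard to
// keep apart as the vocabulary grows.
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

static int min3(int a, int b, int c)
{
    int m = (a &lt; b) ? a : b;
    return (m &lt; c) ? m : c;
}

// Classic Levenshtein edit distance, sized for vocabulary words.
static int editDistance(const char *a, const char *b)
{
    int la = (int)strlen(a), lb = (int)strlen(b);
    int d[16][16];
    for (int i = 0; i &lt;= la; i++) d[i][0] = i;
    for (int j = 0; j &lt;= lb; j++) d[0][j] = j;
    for (int i = 1; i &lt;= la; i++)
        for (int j = 1; j &lt;= lb; j++)
            d[i][j] = min3(d[i - 1][j] + 1, d[i][j - 1] + 1,
                           d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1));
    return d[la][lb];
}

int main()
{
    const char *vocabulary[] = { "yes", "yet", "no", "help", "quit" };
    const char *heard = "yes"; // pretend this token came from the audio front end
    int best = 0, bestScore = 999;

    // The "search engine" portion of the system: nearest-neighbor lookup.
    for (int i = 0; i &lt; 5; i++)
    {
        int dist = editDistance(heard, vocabulary[i]);
        if (dist &lt; bestScore) { bestScore = dist; best = i; }
    }
    printf("matched: %s (distance %d)\n", vocabulary[best], bestScore);
    return 0;
}
</pre>
<p>With <i>heard</i> set to <i>yes</i>, the entry <i>yet</i> scores a distance of only 1.
That is the whole problem in miniature: a richer vocabulary means more of these one-step
neighbors for the matcher to separate. </p>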
<p>The next few sections describe each of the four aspects of a speech engine in a bit
more detail. </p>
<h3><a NAME="WordSeparation"><b>Word Separation</b></a></h3>
<p>The first task of the speech engine is to accept words as input. Speech engines use a
process called <i>word separation</i> to gather human speech. Just as the keyboard is used
as an input device to accept physical keystrokes for translation into readable characters,
the process of word separation accepts the sound of human speech for translation by the
computer. </p>
<p>There are three basic methods of word separation. In ascending order of complexity they
are:
<ul>
<li><font COLOR="#000000">Discrete speech </font></li>
<li><font COLOR="#000000">Word spotting </font></li>
<li><font COLOR="#000000">Continuous speech</font> </li>
</ul>
<p>Systems that use the <i>discrete speech</i> method of word separation require the user
to place a short pause between each spoken word. This slight bit of silence allows the
speech system to recognize the beginning and ending of each word. The silences separate
the words much like the space bar does when you type. The advantage of the discrete speech
method is that it requires the least amount of computational resources. The disadvantage
of this method is that it is not very user-friendly. Discrete speech systems can easily
become confused if a person does not pause between words. </p>
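<p>The sketch below shows, under invented thresholds and frame sizes, the kind of silence
detection a discrete speech system relies on: compute the short-term energy of each frame
of audio and mark a word boundary wherever the energy drops back to silence. </p>
<pre>
// Silence-based word separation sketch for a discrete speech system.
// FRAME and SILENCE_LEVEL are illustrative assumptions, not values from
// any particular engine.
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define FRAME 160         /* 20 ms of 8 kHz, 16-bit mono audio */
#define SILENCE_LEVEL 500 /* assumed average-magnitude threshold */

int main()
{
    /* In a real system these samples arrive from the sound card. */
    static short samples[8000] = { 0 }; /* one second of audio */

    /* Fake a quarter-second burst of sound so the demo has one "word". */
    for (int i = 2000; i &lt; 4000; i++)
        samples[i] = 2000;

    int nFrames = (int)(sizeof(samples) / sizeof(samples[0])) / FRAME;
    int inWord = 0;

    for (int f = 0; f &lt; nFrames; f++)
    {
        long energy = 0;
        for (int i = 0; i &lt; FRAME; i++)
            energy += labs((long)samples[f * FRAME + i]);
        energy /= FRAME; /* average magnitude of this frame */

        if (!inWord &amp;&amp; energy &gt; SILENCE_LEVEL)
        {
            inWord = 1; /* the pause ended; a word is starting */
            printf("word starts at frame %d\n", f);
        }
        else if (inWord &amp;&amp; energy &lt;= SILENCE_LEVEL)
        {
            inWord = 0; /* back in the pause between words */
            printf("word ends at frame %d\n", f);
        }
    }
    return 0;
}
</pre>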
<p>Systems that use <i>word spotting</i> avoid the need for users to pause in between each
word by listening only for key words or phrases. Word spotting systems, in effect, ignore
the items they do not know or care about and act only on the words they can match in their
vocabulary. For example, suppose the speech system can recognize the word <i>help</i>, and
knows to load the Windows Help engine whenever it hears the word. Under word spotting, the
following phrases will all result in the speech engine invoking Windows Help: </p>
<blockquote>
<p><i>Please load Help.<br>
Can you help me, please?<br>
These definitions are no help at all!</i> </p>
</blockquote>
<p>As you can see, one of the disadvantages of word spotting is that the system can easily
misinterpret the user's meaning. However, word spotting also has several key advantages.
Word spotting allows users to speak normally, without employing pauses. Also, since word
spotting systems simply ignore words they don't know and act only on key words, these
systems can give the appearance of being more sophisticated than they really are. Word
spotting requires more computing resources than discrete speech, but not as much as the
last method of word separation-continuous speech. </p>
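<p>A word spotting dispatcher can be surprisingly small, which is part of its appeal. The
sketch below scans recognized text for key words and ignores everything else; all three of
the <i>help</i> phrases quoted above trigger the same action, which is both the strength and
the weakness just described. The keyword table is a made-up example. </p>
<pre>
// Word-spotting sketch: act only on vocabulary words, ignore the rest.
// The input string stands in for whatever the recognizer heard.
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ctype.h&gt;

static void onHelp(void) { printf("-&gt; launching Windows Help\n"); }
static void onQuit(void) { printf("-&gt; exiting the application\n"); }

int main()
{
    const char *utterance = "These definitions are no help at all!";
    char word[64];
    int n = 0;

    for (const char *p = utterance; ; p++)
    {
        if (*p != '\0' &amp;&amp; isalpha((unsigned char)*p))
        {
            if (n &lt; 63) word[n++] = (char)tolower((unsigned char)*p);
            continue;
        }
        word[n] = '\0'; /* a word just ended; check the vocabulary */
        if (strcmp(word, "help") == 0) onHelp();
        else if (strcmp(word, "quit") == 0) onQuit();
        n = 0;
        if (*p == '\0') break;
    }
    return 0;
}
</pre>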
<p><i>Continuous speech systems</i> recognize and process every word spoken. This gives
the greatest degree of accuracy when attempting to understand a speaker's request.
However, it also requires the greatest amount of computing power. First, the speech system
must determine the start and end of each word without the use of silence. This is much
like readingtextthathasnospacesinit (see!). Once the words have been separated, the system
must look them up in the vocabulary and identify them. This, too, can take precious
computing time. The primary advantage of continuous speech systems is that they offer the
greatest level of sophistication in recognizing human speech. The primary disadvantage is
the amount of computing resources they require. </p>
<h3><a NAME="SpeakerDependence"><b>Speaker Dependence</b></a> </h3>
<p>Speaker dependence is a key factor in the design and implementation of a speech
recognition system. In theory, you would like a system that has very little speaker
dependence. This would mean that the same workstation could be spoken to by several people
with the same positive results. People often speak quite differently from one another,
however, and this can cause problems. </p>
<p>First, there is the case of accents. Just using the United States as an example, you
can identify several regional sounds. Add to these the possibility that speakers may also
have accents that come from outside the U.S. due to the influence of other languages
(Spanish, German, Japanese), and you have a wide range of pronunciation for even the
simplest of sentences. Speaker speed and pitch inflection can also vary widely, which can
pose problems for speech systems that need to determine whether a spoken phrase is a
statement or a question. </p>
<p>Speech systems fall into three categories in terms of their speaker dependence. They
can be:
<ul>
<li><font COLOR="#000000">Speaker independent</font> </li>
<li><font COLOR="#000000">Speaker dependent</font> </li>
<li><font COLOR="#000000">Speaker adaptive</font> </li>
</ul>
<p><i>Speaker-independent</i> systems require the most resources. They must be able to
accurately translate human speech across as many dialects and accents as possible. <i>Speaker-dependent</i>
systems require the least amount of computing resources. These systems require that the
user "train" the system before it is able to accurately convert human speech. A
compromise between the two approaches is the speaker-adaptive method. <i>Speaker-adaptive</i>
systems are prepared to work without training, but increase their accuracy after working
with the same speaker for a period of time. </p>
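<p>The speaker-adaptive idea can be sketched in a few lines: start from a generic model and
nudge it toward each confirmed utterance from the current user, so accuracy improves with
use. The single-number "template" and the blend rate below are illustrative assumptions
only; real engines adapt far richer acoustic models. </p>
<pre>
// Speaker-adaptive sketch: blend each confirmed observation into a stored
// template so the model gradually tracks the current speaker.
#include &lt;stdio.h&gt;

#define ADAPT_RATE 0.1f /* assumed: how quickly the model tracks this speaker */

struct Template
{
    const char *word;
    float feature; /* stand-in for a real acoustic model */
};

/* Move the template a fraction of the way toward what was actually heard. */
static void adapt(struct Template *t, float observed)
{
    t-&gt;feature += ADAPT_RATE * (observed - t-&gt;feature);
}

int main()
{
    struct Template yes = { "yes", 0.50f }; /* generic, speaker-independent start */
    float thisSpeaker[] = { 0.62f, 0.65f, 0.61f, 0.64f };

    for (int i = 0; i &lt; 4; i++)
    {
        adapt(&amp;yes, thisSpeaker[i]);
        printf("after utterance %d: template for \"%s\" = %.3f\n",
               i + 1, yes.word, yes.feature);
    }
    return 0;
}
</pre>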
<p>The additional training required by speaker-dependent systems can be frustrating to
users. Training can take several hours, although some systems reach 90 percent accuracy
or better after just five minutes of training. Users with physical disabilities, or those
who find typing highly inefficient, are the most likely to accept speaker-dependent
systems. </p>
<p>Systems that will be used by many different people need the power of speaker
independence. </p>
Ctrl + -