<p>Many vendors are now offering multifunction cards that provide speech, data, FAX, and
telephony services all in one card. You can usually purchase one of these cards for about
$250-$500. By installing one of these new cards, you can upgrade a workstation and reduce
the number of hardware slots in use at the same time. </p>
<p>A few speech-recognition engines still need a <i>DSP</i> (<i>digital signal processor</i>)
card. While it may be preferable to work with newer cards that do not require DSP
handling, there are advantages to using DSP technology. DSP cards handle some of the
computational work of interpreting speech input. This can actually reduce the resource
requirements for providing SR services. In systems where speech is a vital source of
process input, DSP cards can noticeably boost performance. </p>
<p>SR engines require the use of a microphone for audio input. This is usually handled by
a directional microphone mounted on the PC base. Other options include the use of a <i>lavaliere
microphone</i> draped around the neck, or a headset microphone that includes headphones.
Depending on the audio card installed, you may also be able to use a telephone handset for
input. </p>
<p>Most multimedia systems ship with a suitable microphone built into the PC or as an
external device that plugs into the sound card. It is also possible to purchase high-grade
unidirectional microphones from audio retailers. Depending on the microphone and the sound
card used, you may need an amplifier to boost the input to levels usable by the SR engine.
</p>
<p>The quality of the audio input is one of the most important factors in successful
implementation of speech services on a PC. If the system will be used in a noisy
environment, close-talk microphones should be used. This will reduce extraneous noise and
improve the recognition capabilities of the SR engine. </p>
<p>Speakers or headphones are needed to play back TTS output. In private office spaces,
free-standing speakers provide the best sound reproduction and the least risk of ear
damage from high playback levels. However, in larger offices, or in areas where the
playback can disturb others, headphones are preferred. </p>
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Tip</b></td>
</tr>
<tr>
<td><blockquote>
<p>As mentioned earlier in this chapter, some systems can also provide audio playback
through a telephone handset. Conversely, free-standing speakers and a
microphone can serve together as a speakerphone system.</p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<h2><a NAME="TechnologyIssues"><font SIZE="5" COLOR="#FF0000">Technology Issues</font></a></h2>
<p>As advanced as SR/TTS technology is, it still has its limits. This section covers the
general technology issues for SR and TTS engines along with a quick summary of some of the
limits of the process and how they can affect perceived performance and system design. </p>
<h3><a NAME="SRTechniques">SR Techniques</a></h3>
<p>Speech recognition technology can be measured by three factors:
<ul>
<li><font COLOR="#000000">Word selection</font> </li>
<li><font COLOR="#000000">Speaker dependence</font> </li>
<li><font COLOR="#000000">Word analysis</font> </li>
</ul>
<p><i>Word selection</i> deals with the process of actually perceiving "word
items" as input. Any speech engine must have some method for listening to the input
stream and deciding when a word item has been uttered. There are three different methods
for selecting words from the input stream. They are:
<ul>
<li>Discrete speech </li>
<li>Word spotting </li>
<li>Continuous speech </li>
</ul>
<p><i>Discrete speech</i> is the simplest form of word selection. Under discrete speech,
the engine requires a slight pause between each word. This pause marks the beginning and
end of each word item. Discrete speech requires the least amount of computational
resources. However, discrete speech is not very natural for users. With a discrete speech
system, users must speak in a halting voice. This may be adequate for short interactions
with the speech system, but it becomes annoying over extended periods. </p>
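<p>A minimal sketch of the idea behind discrete-speech word selection, assuming the audio
has already been reduced to per-frame energy values (the threshold and names below are
illustrative, not any particular engine's API):</p>
<pre>
# Sketch: discrete-speech word selection by detecting silence gaps.
# A run of quiet frames marks the pause between word items.

SILENCE = 0.02    # frame energy below this counts as silence (illustrative)
MIN_PAUSE = 5     # quiet frames needed to end a word item

def split_words(frame_energies):
    words, current, quiet = [], [], 0
    for i, energy in enumerate(frame_energies):
        if energy > SILENCE:
            current.append(i)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= MIN_PAUSE:            # pause long enough: word ended
                words.append((current[0], current[-1]))
                current, quiet = [], 0
    if current:                               # flush the final word item
        words.append((current[0], current[-1]))
    return words                              # (start_frame, end_frame) pairs
</pre>
<p>Because the engine only has to find the gaps between words, the computational cost
stays low, which is exactly the trade-off described above.</p>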
<p>A more natural method of handling speech input is word spotting. Under <i>word
spotting</i>, the speech engine listens for a list of key words in the input stream.
This method allows users to speak continuously. Since the system is
"listening" for key words, users do not need to use unnatural pauses while they
speak. The advantage of word spotting is that it gives users the perception that the
system is actually listening to every word while limiting the amount of resources required
by the engine itself. The disadvantage of word spotting is that the system can easily
misinterpret input. For example, if the engine recognizes the word <i>run</i>, it will
interpret the phrases "Run Excel" and "Run Access" as the same phrase.
For this reason, it is important to design vocabularies for word-spotting systems that
limit the possibility of confusion. </p>
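<p>The following sketch shows why vocabulary design matters in a word-spotting system.
Spotting the single key word <i>run</i> cannot tell "Run Excel" from "Run Access," so
the key list pairs each action word with its object (the phrases and action names are
hypothetical):</p>
<pre>
# Sketch: word spotting over a stream of recognized tokens.
KEY_PHRASES = {
    ("run", "excel"): "launch_excel",
    ("run", "access"): "launch_access",
}

def spot(tokens):
    hits = []
    for i in range(len(tokens) - 1):
        action = KEY_PHRASES.get((tokens[i], tokens[i + 1]))
        if action:
            hits.append(action)          # key phrase found in the stream
    return hits

print(spot("please run excel for me".split()))   # ['launch_excel']
</pre>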
<p>The most advanced form of word selection is the continuous speech method. Under <i>continuous
speech</i>, the SR engine attempts to recognize each word that is uttered in real time.
This is the most resource-intensive of the word selection methods. For this reason,
continuous speech is best reserved for dictation systems that require complete and
accurate perception of every word. </p>
<p>The process of word selection can be affected by the speaker. <i>Speaker dependence</i>
refers to the engine's ability to deal with different speakers. Systems can be speaker
dependent, speaker independent, or speaker adaptive. The disadvantage of speaker-dependent
systems is that they require extensive training by a single user before they become very
accurate. This training can last as much as one hour before the system has an accuracy
rate of over 90 percent. Another drawback to speaker-dependent systems is that each new
user must re-train the system to reduce confusion and improve performance. However,
speaker-dependent systems provide the greatest degree of accuracy while using the least
amount of computing resources. </p>
<p>Speaker-adaptive systems are designed to perform adequately without training, but they
improve with use. The advantage of speaker-adaptive systems is that users experience
success without tedious training. Disadvantages include additional computing resource
requirements and possible reduced performance on systems that must serve different people.
</p>
<p>Speaker-independent systems are designed to provide adequate accuracy for any speaker
without prior training. They are a must for installations where multiple
speakers need to use the same station. The drawback of speaker-independent systems is that
they require the greatest degree of computing resources. </p>
<p>Once a word item has been selected, it must be analyzed. <i>Word analysis</i>
techniques involve matching the word item to a list of known words in the engine's
vocabulary. There are two methods for handling word analysis: <i>whole-word matching</i>
or <i>sub-word matching</i>. Under whole-word matching, the SR engine matches the word
item against a vocabulary of complete word templates. The advantage of this method is that
the engine is able to make an accurate match very quickly, without the need for a great
deal of computing power. The disadvantage of whole-word matching is that it requires
extremely large vocabularies, running into the tens of thousands of entries. Also, these words must
be stored as spoken templates. Each word can require as much as 512 bytes of storage. </p>
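<p>A quick back-of-the-envelope calculation, using the 512-bytes-per-word figure above
(the vocabulary size is an assumed example):</p>
<pre>
# Storage cost of whole-word templates for a mid-sized vocabulary.
words = 30000                             # assumed vocabulary size
template_bytes = 512                      # per-word figure from the text
print(words * template_bytes / 2**20)     # about 14.6 MB of templates
</pre>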
<p>An alternate word-matching method involves the use of sub-words called <i>phonemes</i>.
Each language has a fixed set of phonemes that are used to build all words. Once the SR
engine knows the phonemes and their acoustic representations, it can recognize a much
wider range of words. Under sub-word matching, the engine does not require an extensive
vocabulary. An additional advantage of sub-word systems is that the pronunciation of a
word can be determined from printed text. Phoneme storage requires only 5 to 20 bytes per
phoneme. The disadvantage of sub-word matching is that it requires more processing
resources to analyze input. </p>
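<p>A minimal sketch of the sub-word approach, with ARPAbet-style phoneme symbols used
purely for illustration. The vocabulary stores short phoneme sequences instead of
512-byte audio templates, so adding a word from printed text is cheap:</p>
<pre>
# Sketch: sub-word (phoneme) matching against a small lexicon.
LEXICON = {
    ("r", "ah", "n"): "run",
    ("s", "t", "aa", "p"): "stop",
}

def match(phonemes):
    # The engine's acoustic stage yields a phoneme sequence; matching
    # is then a simple lexicon lookup.
    return LEXICON.get(tuple(phonemes))

print(match(["r", "ah", "n"]))    # 'run'
</pre>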
<h3><a NAME="SRLimits">SR Limits</a></h3>
<p>It is important to understand the limits of current SR technology and how these limits
affect system performance. Three of the most vital limitations of current SR technology
are:
<ul>
<li><font COLOR="#000000">Speaker identification</font> </li>
<li><font COLOR="#000000">Input recognition</font> </li>
<li><font COLOR="#000000">Recognition accuracy</font> </li>
</ul>
<p>The first hurdle for SR engines is determining when the speaker is addressing the
engine and when the words are directed to someone else in the room. This skill is beyond
the SR systems currently on the market. Your program must allow users to inform the
computer that they are addressing the engine. Also, SR engines cannot distinguish between
multiple speakers. With speaker-independent systems, this is not a big problem. However,
speaker-dependent systems do not cope well in situations where multiple users may be
addressing the same system. </p>
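<p>One common workaround is a push-to-talk key or an attention word that opens a short
command window. The sketch below uses an attention word; all names and the window length
are illustrative:</p>
<pre>
# Sketch: gating recognition so the engine only acts when addressed.
ATTENTION_WORD = "computer"
WINDOW = 3                  # tokens accepted after the attention word

def gate(tokens):
    accepted, open_for = [], 0
    for token in tokens:
        if token == ATTENTION_WORD:
            open_for = WINDOW        # the user addressed the engine
        elif open_for:
            accepted.append(token)   # treat as a command word
            open_for -= 1
    return accepted

print(gate("tell bob later computer run excel".split()))
# ['run', 'excel'], only the speech after the attention word
</pre>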
<p>Even speaker-independent systems can have a hard time when multiple speakers are
involved. For example, a dictation system designed to transcribe a meeting will not be
able to differentiate between speakers. Also, SR systems fail when two people are speaking
at the same time. </p>
<p>SR engines also have limits regarding the processing of identified words. First, SR
engines have no ability to process natural language. They can only recognize words in the
existing vocabulary and process them based on known grammar rules. Thus, despite any
perceived "friendliness" of speech-enabled systems, they do not really
understand the speaker at all. </p>
<p>SR engines are also unable to hear a new word and derive its meaning from previously
spoken words. The system is incapable of spelling or rendering words that are not already
in its vocabulary. </p>
<p>Finally, SR engines are not able to deal with wide variations in pronunciation of the
same word. For example, words such as either (<i>ee-ther</i> or <i>I-ther</i>) and potato
(<i>po-tay-toe</i> or <i>po-tah-toe</i>) can easily confuse the system. Wide variations in
pronunciation can greatly reduce the accuracy of SR systems. </p>
<p>Recognition accuracy can be affected by regional dialects, quality of the microphone,
and the ambient noise level during a speech session. Much like the problem with
pronunciation, dialect variations can hamper SR engine performance. If your software is
implemented in a location where the common speech contains local slang or other
region-specific words, these words may be misinterpreted or not recognized at all. </p>
<p>Poor microphones or noisy office spaces also affect accuracy. A system that works fine
in a quiet, well-equipped office may be unusable in a noisy facility. In a noisy
environment, the SR engine is more likely to confuse similar-sounding words such as <i>out</i>
and <i>pout</i>, or <i>in</i> and <i>when</i>. For this reason, it is important to
emphasize the value of a good microphone and a quiet environment when performing SR
activities. </p>
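<p>Vocabulary confusability can be checked before deployment. The sketch below uses plain
edit distance on spellings as a crude stand-in for acoustic similarity (a real check
would compare phoneme strings):</p>
<pre>
# Sketch: flag command words that are only one edit apart.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

vocab = ["in", "when", "out", "pout", "run"]
for i, w1 in enumerate(vocab):
    for w2 in vocab[i + 1:]:
        if edit_distance(w1, w2) == 1:   # one edit apart: risky pair
            print(w1, w2)                # prints: out pout
</pre>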
<h3><a NAME="TTSTechniques">TTS Techniques</a></h3>
<p>TTS engines use two different techniques for turning text input into audio
output: synthesis or diphone concatenation. <i>Synthesis</i> involves the creation of human
speech through the use of stored phonemes. This method results in audio output that is
understandable, but not very human-like. The advantage of synthesis systems is that they
do not require a great deal of storage space to implement and that they allow for the
modification of voice quality through the adjustment of only a few parameters. </p>
<p><i>Diphone</i>-based systems produce output that is much closer to human speech. This
is because the system stores snippets of actual recorded human speech (each diphone spans
the transition from one phoneme to the next) and plays them back. The
disadvantage of this method is that it requires more computing and storage capacity.
However, if your application is used to provide long sessions of audio output, diphone
systems produce speech that is much easier to understand. </p>
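<p>A toy sketch of diphone concatenation. Each stored unit spans the transition between
two phonemes, and playback is a simple join of the recorded snippets; the tiny numeric
lists stand in for real recorded waveforms:</p>
<pre>
# Sketch: assembling "hello" from stored diphone recordings.
DIPHONES = {
    ("sil", "h"): [0.0, 0.1], ("h", "eh"): [0.2, 0.3],
    ("eh", "l"): [0.3, 0.2], ("l", "ow"): [0.2, 0.1],
    ("ow", "sil"): [0.1, 0.0],
}

def synthesize(phonemes):
    units = ["sil"] + phonemes + ["sil"]   # pad with silence at both ends
    samples = []
    for pair in zip(units, units[1:]):     # walk phoneme-to-phoneme pairs
        samples.extend(DIPHONES[pair])     # append the recorded transition
    return samples

print(synthesize(["h", "eh", "l", "ow"]))
</pre>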
<h3><a NAME="TTSLimits">TTS Limits</a></h3>
<p>TTS engines are limited in their ability to re-create the details of spoken language,
including the rhythm, accent, and pitch inflection. This combination of properties is called
the <i>prosody</i> of speech. TTS engines are not very good at adding prosody. For this
reason, listening to TTS output can be difficult, especially for long periods of time.
Most TTS engines allow users to edit text files with embedded control information that
adds prosody to the ASCII text. This is useful for systems that are used to
"read" text that is edited and stored for later retrieval. </p>
<p>TTS systems have their limits when it comes to producing individualized voices.
Synthesis-based engines are relatively easy to modify to create new voice types. This
modification involves the adjustment of general pitch and speed to produce new vocal
personalities such as "old man," "child," "female,"
"male," and so on. However, these voices still use the same prosody and grammar
rules. </p>
<p>Creating new voices for diphone-based systems is much more costly than for
synthesis-based systems. Since each new vocal personality must be assembled from
pre-recorded human speech, it can take quite a bit of time and effort to alter an existing
voice set or to produce a new one. Diphone concatenation is costly for systems that must
support multiple languages or need to provide flexibility in voice personalities. </p>
<h2><a NAME="GeneralSRDesignIssues"><font SIZE="5" COLOR="#FF0000">General SR Design
Issues</font></a></h2>
<p>There are a number of general issues to keep in mind when designing SR interfaces to
your applications. </p>
<p>First, if you provide speech services within your application, you'll need to make sure
you let the user know the services are available. You can do this by adding a graphic
image to the display telling the user that the computer is "listening," or by adding
caption or status items that indicate the current state of the SR engine. </p>
<p>It is also a good idea to make speech services an optional feature whenever possible.
Some installations may not have the hardware or RAM required to implement speech services.
Even if the workstation has adequate resources, the user may experience performance
degradation with the speech services active. It is a good idea to have a menu option or
some other method that allows users to turn off speech services entirely. </p>
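<p>A minimal sketch of both ideas, with all names illustrative: a status caption reflects
the engine state, and a single flag, wired to a menu item, turns the services off
entirely:</p>
<pre>
# Sketch: making speech services visible and optional.
class SpeechServices:
    def __init__(self):
        self.enabled = False             # off by default; the user opts in

    def status_caption(self):
        return "Listening..." if self.enabled else "Speech off"

    def toggle(self):                    # wired to a menu option
        self.enabled = not self.enabled
        print("Caption:", self.status_caption())

svc = SpeechServices()
svc.toggle()    # Caption: Listening...
</pre>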