independence. This is especially true for systems that will have short encounters with
many different people, such as greeting kiosks at an airport. In such situations, training
is unlikely to occur, and a high degree of accuracy is expected right away. </p>
<p>For systems where multiple people will access the same workstation over a longer period
of time, the speaker-adaptive system will work fine. A good example would be a workstation
used by several employees to query information from a database. The initial investment
spent training the speech system will pay off over time as the same staff uses the system.
</p>
<h3><a NAME="WordMatching"><b>Word Matching</b></a></h3>
<p>Word matching is the process of performing look-ups into the speech database. As each
word is gathered (using the word separation techniques described earlier), it must be
matched against some item in the speech engine's database. It is the process of word
matching that connects the audio input signal to a meaningful item in the speech engine
database. </p>
<p>There are two primary methods of word matching:
<ul>
<li><font COLOR="#000000">Whole-word matching</font> </li>
<li><font COLOR="#000000">Phoneme matching</font> </li>
</ul>
<p>Under <i>whole-word matching</i>, the speech engine searches the database for a word
that matches the audio input. Whole-word matching requires less search processing than
phoneme matching, but it requires a greater amount of storage capacity. Under the
whole-word matching model, the system must store a word template that represents each
possible word that the engine can recognize. While quick retrieval makes whole-word
matching attractive, the fact that all words must be known ahead of time limits the
application of whole-word matching systems. </p>
<p><i>Phoneme matching systems</i> keep a dictionary of language phonemes. <i>Phonemes</i>
are the smallest distinct units of sound in a language, and there can be many of them. For
example, while the English language has only 26 letters, it has roughly 40 phonemes (the
exact count depends on the dialect). Also, phonemes are not restricted by spelling
conventions. </p>
<p>Consider the words <i>Philip</i> and <i>fill up</i>. These words have the same
phonemes: <i>f</i>, <i>eh</i>, <i>ul</i>, <i>ah</i>, and <i>pah</i>. However, they have
entirely different meanings. Under the whole-word matching model, these words could
represent multiple entries in the database. Under the phoneme matching model, the same
five phonemes can be used to represent both words. </p>
<p>As you may expect, phoneme matching systems require more computational resources, but
less storage space. </p>
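<p>The storage trade-off is easy to see in code. The following sketch is purely
illustrative; the data structures are assumptions made for this example, not part of any
speech engine's actual format. Whole-word matching must store one acoustic template for
every recognizable word, while phoneme matching stores only phoneme spellings that can be
shared across words. </p>
<pre>
// Illustrative sketch only; these structures are hypothetical and do not
// correspond to any real speech engine's internal format.
#include &lt;map&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

// Whole-word matching: one stored acoustic template per recognizable word.
// Storage grows with every word added to the engine.
std::map&lt;std::string, std::vector&lt;double&gt;&gt; wordTemplates = {
    { "philip", { /* acoustic samples for the whole word */ } },
    { "fill",   { /* acoustic samples */ } },
    { "up",     { /* acoustic samples */ } }
};

// Phoneme matching: each word is stored only as a sequence of phoneme
// symbols (the labels below follow the example in the text).
std::map&lt;std::string, std::vector&lt;std::string&gt;&gt; phonemeLexicon = {
    { "philip",  { "f", "eh", "ul", "ah", "pah" } },
    { "fill up", { "f", "eh", "ul", "ah", "pah" } }
};

int main() {
    // A whole-word engine compares incoming audio against every stored
    // template; a phoneme engine first reduces the audio to phoneme symbols
    // and then searches the much smaller lexicon: more computation, less storage.
    return 0;
}
</pre>
<p>Note that the two lexicon entries share the same five symbols; only the surrounding
context decides which word was actually spoken. </p>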
<h3><a NAME="Vocabulary"><b>Vocabulary</b></a></h3>
<p>The final element of a speech recognition system is the vocabulary. There are two
competing issues regarding vocabulary: size and accuracy. As the vocabulary size
increases, it becomes easier for the speech system to locate a word that matches the one
identified in the word separation phase. However, one
of the reasons it is easy to find a match is that more than one entry in the vocabulary
may match the given input. For example, the words <i>no</i> and <i>go</i> are very similar
to most speech engines. Therefore, as vocabulary size grows, the accuracy of speech
recognition can decrease. </p>
<p>Contrary to what you might assume, a speech engine's vocabulary does not represent the
total number of words it understands. Instead, the vocabulary of a speech engine
represents the number of words that it can recognize in a current state or moment in time.
In effect, this is the total number of "unidentified" words that the system can
resolve at any moment. </p>
<p>For example, let's assume you have registered the following word phrases with your
speech engine: "Start running Exchange" and "Start running Word."
Before you say anything, the current state of the speech engine has four words: <i>start</i>,
<i>running</i>, <i>Exchange</i>, and <i>Word</i>. Once you say "Start running"
there are only two words in the current state: <i>Exchange</i> and <i>Word</i>. The
system's ability to keep track of the possible next word is determined by the size of its
vocabulary. </p>
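<p>A small sketch makes this state-tracking idea concrete. The phrase table and helper
function below are hypothetical (no speech engine exposes its state exactly this way);
they simply show how the set of words the engine must be ready to resolve shrinks as a
phrase is spoken. </p>
<pre>
// Hypothetical sketch of the "current state" vocabulary for two registered
// command phrases; not an actual speech engine interface.
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;iostream&gt;
#include &lt;set&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

std::vector&lt;std::vector&lt;std::string&gt;&gt; phrases = {
    { "start", "running", "exchange" },
    { "start", "running", "word" }
};

// Return every word that still has to be resolved, given the words heard so far.
std::set&lt;std::string&gt; currentState(const std::vector&lt;std::string&gt;&amp; heard) {
    std::set&lt;std::string&gt; remaining;
    for (const auto&amp; phrase : phrases) {
        if (heard.size() &lt;= phrase.size() &amp;&amp;
            std::equal(heard.begin(), heard.end(), phrase.begin())) {
            for (std::size_t i = heard.size(); i &lt; phrase.size(); ++i)
                remaining.insert(phrase[i]);
        }
    }
    return remaining;
}

int main() {
    std::cout &lt;&lt; currentState({}).size() &lt;&lt; "\n";                      // 4: start, running, exchange, word
    std::cout &lt;&lt; currentState({ "start", "running" }).size() &lt;&lt; "\n";  // 2: exchange, word
    return 0;
}
</pre>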
<p>Small vocabulary systems (100 words or less) work well in situations where most of the
speech recognition is devoted to processing commands. However, you need a large vocabulary
to handle dictation systems. Dictation vocabularies can reach into tens of thousands of
words. This is one of the reasons that dictation systems are so difficult to implement.
Not only does the vocabulary need to be large, but word matches must also be resolved very
quickly.
</p>
<h3><a NAME="TexttoSpeech"><b>Text-to-Speech</b></a></h3>
<p>A second type of speech service provides the ability to convert written text into
spoken words. This is called <i>text-to-speech</i> (or <i>TTS</i>) technology. Just as
there are a number of factors to consider when developing speech recognition (SR) engines,
there are a few issues that must be addressed when creating and implementing rules for TTS
engines. </p>
<p>The four common issues that must be addressed when creating a TTS engine are as
follows:
<ul>
<li><font COLOR="#000000">Phonemes</font> </li>
<li><font COLOR="#000000">Voice quality</font> </li>
<li><font COLOR="#000000">TTS synthesis</font> </li>
<li><font COLOR="#000000">TTS diphone concatenation</font> </li>
</ul>
<p>The first two factors deal with the creation of audio tones that are recognizable as
human speech. The last two items are competing methods for interpreting text that is to be
converted into audio. </p>
<h3><a NAME="VoiceQuality"><b>Voice Quality</b></a></h3>
<p>The quality of a computerized voice is directly related to the sophistication of the
rules that identify and convert text into an audio signal. It is not too difficult to
build a TTS engine that can create recognizable speech. However, it is extremely difficult
to create a TTS engine that does not sound like a computer. Three factors in human speech
are very difficult to produce with computers:
<ul>
<li><font COLOR="#000000">Prosody</font> </li>
<li><font COLOR="#000000">Emotion</font> </li>
<li><font COLOR="#000000">Pronunciation anomalies</font> </li>
</ul>
<p>Human speech has a special rhythm, or <i>prosody</i>: a pattern of pauses, inflections,
and emphasis that is an integral part of the language. While computers can do a good job
of pronouncing individual words, it is difficult to get them to accurately mimic the tonal
and rhythmic inflections of human speech. For this reason, it is always quite easy to
differentiate computer-generated speech from a computer playing back a recording of a
human voice. </p>
<p>Another factor of human speech that computers have difficulty rendering is emotion.
While TTS engines are capable of distinguishing declarative statements from questions or
exclamations, computers are still not able to convey believable emotive qualities when
rendering text into speech. </p>
<p>Lastly, every language has its own pronunciation anomalies. These are words that do not
"play by the rules" when it comes to converting text into speech. Some common
examples in English are <i>dough</i> and <i>tough</i> or <i>comb</i> and <i>home</i>. More
troublesome are words such as <i>read</i>, which must be understood in context in order to
determine their exact pronunciation. For example, the pronunciation differs between
"He <i>read</i> the paper" and "She will now <i>read</i> to the class."
Even more likely to cause problems is the interjection of technobabble such as
"SQL," "MAPI," and "SAPI." All these factors make the
development of a truly human-sounding computer-generated voice extremely difficult. </p>
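<p>One way to picture the homograph problem is as a lookup keyed on context rather than on
spelling alone. The sketch below is a deliberately simplified assumption (real engines
apply much fuller grammatical analysis, and the phoneme labels are informal): the word <i>read</i>
receives one of two phoneme spellings depending on a single context cue. </p>
<pre>
// Simplified, hypothetical illustration of context-dependent pronunciation.
// Real TTS engines analyze the sentence grammatically rather than checking a
// single preceding word, and the phoneme labels here are informal.
#include &lt;string&gt;
#include &lt;vector&gt;

std::vector&lt;std::string&gt; pronounceRead(const std::string&amp; previousWord) {
    // "She will now read to the class."  : present tense, rhymes with "reed"
    if (previousWord == "will" || previousWord == "now" || previousWord == "to")
        return { "r", "iy", "d" };
    // "He read the paper."               : past tense, rhymes with "red"
    return { "r", "eh", "d" };
}

int main() {
    std::vector&lt;std::string&gt; present = pronounceRead("now");  // r iy d
    std::vector&lt;std::string&gt; past    = pronounceRead("He");   // r eh d
    (void)present; (void)past;
    return 0;
}
</pre>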
<p>Speech systems usually offer some way to correct for these types of problems. One
typical solution is to include the ability to enter the phonetic spelling of a word and
relate that spelling to the text version. Another common adjustment is to allow users to
enter control tags in the text to instruct the speech engine to add emphasis or
inflection, or alter the speed or pitch of the audio output. Much of this type of
adjustment information is based on phonemes, as described in the next section. </p>
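<p>A sketch of both corrections might look like the following. The exception dictionary,
the respellings, and the {emph}/{rate} tag notation are all invented for illustration; the
actual phonetic-spelling format and control tags are defined by the particular speech
engine you are using. </p>
<pre>
// Hypothetical illustration of two common correction mechanisms: a user
// exception dictionary and inline control tags.  The respellings and the
// {..} tag notation are invented; real engines define their own formats.
#include &lt;map&gt;
#include &lt;string&gt;

// Problem words mapped to rough phonetic respellings (illustrative only).
std::map&lt;std::string, std::string&gt; userDictionary = {
    { "SQL",  "see-kwel" },   // one common spoken form
    { "MAPI", "map-ee"    },
    { "SAPI", "sap-ee"    }
};

int main() {
    // Inline control tags asking the engine to add emphasis and slow down;
    // again, this tag syntax is purely illustrative.
    std::string taggedText =
        "Please {emph}do not{/emph} turn off the computer {rate -20}now.";
    (void)taggedText;
    return 0;
}
</pre>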
<h3><a NAME="Phonemes"><b>Phonemes</b></a></h3>
<p>As we've discussed, phonemes are the sound parts that make up words. Linguists use
phonemes to accurately record the vocal sounds uttered by humans when speaking. These same
phonemes also can be used to generate computerized speech. TTS engines use their knowledge
of grammar rules and phonemes to scan printed text and generate audio output. </p>
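<p>Conceptually, the front end of a TTS engine behaves something like the sketch below:
each word is looked up in a pronunciation lexicon first, and anything unknown falls back on
letter-to-sound rules. The lexicon entries and the crude fallback shown here are toy
assumptions, not the behavior of any particular engine. </p>
<pre>
// Toy sketch of a TTS front end: lexicon lookup with a crude fallback.
// Both the lexicon contents and the fallback rule are illustrative assumptions.
#include &lt;map&gt;
#include &lt;sstream&gt;
#include &lt;string&gt;
#include &lt;vector&gt;

std::map&lt;std::string, std::vector&lt;std::string&gt;&gt; lexicon = {
    { "hello", { "hh", "ah", "l", "ow" } },
    { "world", { "w", "er", "l", "d" } }
};

std::vector&lt;std::string&gt; textToPhonemes(const std::string&amp; text) {
    std::vector&lt;std::string&gt; phonemes;
    std::istringstream words(text);
    std::string word;
    while (words &gt;&gt; word) {
        auto entry = lexicon.find(word);
        if (entry != lexicon.end()) {
            // Known word: copy its phoneme spelling from the lexicon.
            phonemes.insert(phonemes.end(),
                            entry-&gt;second.begin(), entry-&gt;second.end());
        } else {
            // Unknown word: a real engine applies letter-to-sound rules here;
            // this toy fallback just emits one symbol per letter.
            for (char letter : word)
                phonemes.push_back(std::string(1, letter));
        }
    }
    return phonemes;
}

int main() {
    std::vector&lt;std::string&gt; result = textToPhonemes("hello world");
    (void)result;
    return 0;
}
</pre>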
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Note</b></td>
</tr>
<tr>
<td><blockquote>
<p>If you are interested in learning more about phonemes and how they are used to analyze
speech, refer to the <i>Phonetic Symbol Guide</i> by Pullum and Ladusaw (University of
Chicago Press, 1996). </p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<p>The SAPI design model recognizes and allows for the incorporation of phonemes as a
method for creating speech output. Microsoft has developed an expression of the
International Phonetic Alphabet (IPA) in the form of Unicode strings. Programmers can use
these strings to improve the pronunciation skills of the TTS engine, or to add entirely
new words to the vocabulary. </p>
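<p>Because IPA symbols are ordinary Unicode characters, a pronunciation can be carried in a
wide-character string. The snippet below simply builds such a string for one common
pronunciation of "hello"; the call that actually hands the string to a TTS engine is not
shown, because it depends on the SAPI version and interface you are using. </p>
<pre>
// The IPA pronunciation of "hello" expressed as a Unicode wide string:
// h, schwa (U+0259), primary stress mark (U+02C8), l, o, small upsilon (U+028A).
// Only the string is shown; registering it with an engine is version-specific.
#include &lt;string&gt;

int main() {
    std::wstring helloIPA = L"h\u0259\u02C8lo\u028A";
    (void)helloIPA;
    return 0;
}
</pre>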
<div align="center"><center>
<table BORDERCOLOR="#000000" BORDER="1" WIDTH="80%">
<tr>
<td><b>Note</b></td>
</tr>
<tr>
<td><blockquote>
<p>If you wish to use IPA phonemes directly to alter the behavior of your TTS engine,
you'll have to program using Unicode. SAPI does not support the direct use of phonemes in
ANSI format.</p>
</blockquote>
</td>
</tr>
</table>
</center></div>
<p>As mentioned in the previous section on voice quality, most TTS engines provide several
methods for improving the pronunciation of words. Unless you are involved in the
development of a text-to-speech engine, you probably will not use phonemes very often. </p>
<h3><a NAME="TTSSynthesis"><b>TTS Synthesis</b></a></h3>
<p>Once the TTS engine knows which phonemes to use to reproduce a word, there are two possible
methods for creating the audio output: <i>synthesis</i> or <i>diphone concatenation</i>. </p>
<p>The synthesis method uses calculations of a person's lip and tongue position, the force
of breath, and other factors to synthesize human speech. This method is usually not as
Ctrl + -