📄 speech synthesis & speech recognition using sapi 5_1.htm

<P>The SAPI 5.1 SDK comes with a C++ example called TTSApp, which displays an 
animated cartoon microphone whose mouth is drawn to represent each viseme. The 
microphone is made up from a number of separate images that can all be loaded 
into an image list. The additional <A 
href="http://www.blong.com/Conferences/DCon2002/Speech/SAPI51/SAPI51.zip">demo 
program</A> TextToSpeechAnimated.dpr makes use of these images to show how the 
effect can be achieved.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>const</B>
  Visemes: <B>array</B>[0..21] <B>of</B> Byte = (
    0,  <FONT color=#003399><I>// SP_VISEME_0 = 0,    // Silence</I></FONT>
    11, <FONT color=#003399><I>// SP_VISEME_1,        // AE, AX, AH</I></FONT>
    11, <FONT color=#003399><I>// SP_VISEME_2,        // AA</I></FONT>
    11, <FONT color=#003399><I>// SP_VISEME_3,        // AO</I></FONT>
    10, <FONT color=#003399><I>// SP_VISEME_4,        // EY, EH, UH</I></FONT>
    11, <FONT color=#003399><I>// SP_VISEME_5,        // ER</I></FONT>
    9,  <FONT color=#003399><I>// SP_VISEME_6,        // y, IY, IH, IX</I></FONT>
    2,  <FONT color=#003399><I>// SP_VISEME_7,        // w, UW</I></FONT>
    13, <FONT color=#003399><I>// SP_VISEME_8,        // OW</I></FONT>
    9,  <FONT color=#003399><I>// SP_VISEME_9,        // AW</I></FONT>
    12, <FONT color=#003399><I>// SP_VISEME_10,       // OY</I></FONT>
    11, <FONT color=#003399><I>// SP_VISEME_11,       // AY</I></FONT>
    9,  <FONT color=#003399><I>// SP_VISEME_12,       // h</I></FONT>
    3,  <FONT color=#003399><I>// SP_VISEME_13,       // r</I></FONT>
    6,  <FONT color=#003399><I>// SP_VISEME_14,       // l</I></FONT>
    7,  <FONT color=#003399><I>// SP_VISEME_15,       // s, z</I></FONT>
    8,  <FONT color=#003399><I>// SP_VISEME_16,       // SH, CH, JH, ZH</I></FONT>
    5,  <FONT color=#003399><I>// SP_VISEME_17,       // TH, DH</I></FONT>
    4,  <FONT color=#003399><I>// SP_VISEME_18,       // f, v</I></FONT>
    7,  <FONT color=#003399><I>// SP_VISEME_19,       // d, t, n</I></FONT>
    9,  <FONT color=#003399><I>// SP_VISEME_20,       // k, g, NG</I></FONT>
    1   <FONT color=#003399><I>// SP_VISEME_21,       // p, b, m</I></FONT>
  );

<B>procedure</B> TfrmTextToSpeech.SpVoiceViseme(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
  NextVisemeId, Feature, CurrentVisemeId: TOleEnum);
<B>const</B>
  EyesNarrow = 14;
  EyesClosed = 15;
<B>begin</B>
  imgsMic.Draw(pbMic.Canvas, 0, 0, Visemes[CurrentVisemeId]);
  <B>if</B> Visemes[CurrentVisemeId] <B>mod</B> 6 = 2 <B>then</B>
    imgsMic.Draw(pbMic.Canvas, 0, 0, EyesNarrow)
  <B>else</B>
    <B>if</B> Visemes[CurrentVisemeId] <B>mod</B> 6 = 5 <B>then</B>
      imgsMic.Draw(pbMic.Canvas, 0, 0, EyesClosed);
<B>end</B>;

<B>procedure</B> TfrmTextToSpeech.pbMicPaint(Sender: TObject);
<B>begin</B>
  imgsMic.Draw(pbMic.Canvas, 0, 0, 0);
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>The <FONT face="Courier New, Courier, mono">OnViseme</FONT> event gets the 
image list to draw on a paint box component and the image to draw is identified 
from a simple lookup table. There are 22 different visemes, but only 13 images 
(as in the Disney approach). Occasionally the code also draws narrowed or closed 
eyes, but whenever the silence viseme is received (at the start and end of each 
sentence) the default microphone (the first image in the image list) is 
drawn.</P>
<P align=center><IMG 
src="Speech Synthesis &amp; Speech Recognition Using SAPI 5_1.files/TextToSpeechAnimated.png"></P>
<P>You can take this idea further if you need to, by using images of a person's 
face saying each of the 22 visemes (for real people it seems to work best if you 
use 22 images, rather than 13). This way you can animate a real person's face in 
sync with the spoken text quite trivially.</P>
<P align=center><IMG 
src="Speech Synthesis &amp; Speech Recognition Using SAPI 5_1.files/TextToSpeechAnimatedReal.png"></P>
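<P>As a rough sketch of that variation, the handler below assumes an image list 
called <FONT face="Courier New, Courier, mono">imgsFace</FONT> that holds 22 face 
images stored in viseme order (image 0 for SP_VISEME_0 and so on), and a paint box 
called <FONT face="Courier New, Courier, mono">pbFace</FONT>. Both names are 
hypothetical and are not part of the demo program. With one image per viseme the 
lookup table is no longer needed, as the viseme ID can serve directly as the 
image index.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SpVoiceViseme(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant; Duration: Integer;
  NextVisemeId, Feature, CurrentVisemeId: TOleEnum);
<B>begin</B>
  <FONT color=#003399><I>//assumption: imgsFace holds 22 images ordered by viseme ID</I></FONT>
  <B>if</B> Integer(CurrentVisemeId) &lt; imgsFace.Count <B>then</B>
    imgsFace.Draw(pbFace.Canvas, 0, 0, Integer(CurrentVisemeId));
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>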
<H3><A name=KeepingTrack>Keeping Track Of Spoken Text</A></H3>
<P>We can use <FONT face="Courier New, Courier, mono">OnWord</FONT> and <FONT 
face="Courier New, Courier, mono">OnSentence</FONT> to highlight the currently 
spoken word or sentence, as the events provide the character offset and length 
of the pertinent characters in the text. So when a sentence is started, the 
<FONT face="Courier New, Courier, mono">OnSentence</FONT> event tells you which 
character in the text is the start of the sentence, and also how long the 
sentence is.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SetTextHilite(FirstChar, Len: Integer);
<B>begin</B>
  reText.SelStart := FirstChar; <FONT color=#003399><I>//highlight word</I></FONT>
  reText.SelLength := Len;
<B>end</B>;

<B>procedure</B> TfrmTextToSpeech.SetTextStyle(FirstChar, Len: Integer; Styles: TFontStyles);
<B>begin</B>
  <B>with</B> reText <B>do</B>
  <B>begin</B>
    Lines.BeginUpdate;
    <B>try</B>
      SelStart := FirstChar; <FONT color=#003399><I>//highlight word</I></FONT>
      SelLength := Len;
      SelAttributes.Style := Styles; <FONT color=#003399><I>//apply requested style</I></FONT>
      SelLength := 0; <FONT color=#003399><I>//unhighlight word</I></FONT>
    <B>finally</B>
      Lines.EndUpdate
    <B>end</B>
  <B>end</B>
<B>end</B>;

<B>procedure</B> TfrmTextToSpeech.SpVoiceSentence(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
  Length: Integer);
<B>begin</B>
  Log(<I>'OnSentence: stream %d, position: %s, char. pos. %d, length %d'</I>,
    [StreamNumber, <B>String</B>(StreamPosition), CharacterPosition, Length]);
  SetTextStyle(OldSentencePos, OldSentenceLen, []);
  <B>if</B> Length &gt; 0 <B>then</B>
  <B>begin</B>
    SetTextStyle(CharacterPosition, Length, [fsItalic]);
    OldSentencePos := CharacterPosition;
    OldSentenceLen := Length;
  <B>end</B>;
  <B>if</B> <B>not</B> StreamJustStarted <B>then</B>
    memEnginePhonemes.Text := memEnginePhonemes.Text + #13#10;
  StreamJustStarted := False;
<B>end</B>;

<B>procedure</B> TfrmTextToSpeech.SpVoiceWord(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant; CharacterPosition,
  Length: Integer);
<B>begin</B>
  Log(<I>'OnWord: stream %d, position: %s, char. pos. %d, length %d'</I>,
    [StreamNumber, <B>String</B>(StreamPosition), CharacterPosition, Length]);
  SetTextHilite(CharacterPosition, Length);
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
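<P>Each sentence that gets spoken is italicised through the <FONT 
face="Courier New, Courier, mono">SetTextStyle</FONT> helper routine. The <FONT 
face="Courier New, Courier, mono">OnSentence</FONT> handler records the position 
details in <FONT face="Courier New, Courier, mono">OldSentencePos</FONT> and <FONT 
face="Courier New, Courier, mono">OldSentenceLen</FONT> so the sentence can be set 
back to non-italic when the next sentence starts. Similarly, each spoken word is 
highlighted using the <FONT 
face="Courier New, Courier, mono">SetTextHilite</FONT> helper routine.</P>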
<P><U><B>Note:</B></U> the <FONT 
face="Courier New, Courier, mono">Length &gt; 0</FONT> check in the <FONT 
face="Courier New, Courier, mono">OnSentence</FONT> event handler is there 
because the last <FONT face="Courier New, Courier, mono">OnSentence</FONT> event 
for some text has the character position set to the last character and the 
length set to the negative equivalent. This gives an opportunity to reset all 
the text formatting back to the default styles. However, this only holds if the 
text ends with a full stop; if it does not, you can use the <FONT 
face="Courier New, Courier, mono">OnEndStream</FONT> event for tidying up.</P>
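<P>A tidy-up handler of that kind might look like the sketch below. The handler 
name <FONT face="Courier New, Courier, mono">SpVoiceEndStream</FONT> is an 
assumption, as is its parameter list, which is expected to match the <FONT 
face="Courier New, Courier, mono">OnEndStream</FONT> event as imported from the 
SAPI type library. It reuses the two helper routines and the two position fields 
from the listing above.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>procedure</B> TfrmTextToSpeech.SpVoiceEndStream(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant);
<B>begin</B>
  <FONT color=#003399><I>//remove any italics left on the last sentence</I></FONT>
  SetTextStyle(0, reText.GetTextLen, []);
  <FONT color=#003399><I>//remove the highlight from the last spoken word</I></FONT>
  SetTextHilite(0, 0);
  <FONT color=#003399><I>//forget the old sentence so the next stream starts clean</I></FONT>
  OldSentencePos := 0;
  OldSentenceLen := 0;
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>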
<H3><A name=SpeakingDialogs>Speaking Dialogs</A></H3>
<P>As an example of using speech synthesis you can make all your VCL dialogs 
talk to you using this small piece of code.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>uses</B>
  ComObj;

<B>var</B>
  Voice: Variant;

<B>procedure</B> TForm1.FormCreate(Sender: TObject);
<B>begin</B>
  Screen.OnActiveFormChange := ScreenFormChange;
<B>end</B>;

<B>procedure</B> TForm1.ReadVCLDialog(Form: TCustomForm);
<B>var</B>
  I: Integer;
  ButtonCaptions, LabelCaption, DialogText: <B>string</B>;
<B>const</B>
  SVSFlagsAsync = 1;
<B>begin</B>
  <B>try</B>
    <B>if</B> VarType(Voice) &lt;&gt; varDispatch <B>then</B>
      Voice := CreateOleObject(<I>'SAPI.SpVoice'</I>);
    <B>for</B> I := 0 <B>to</B> Form.ComponentCount - 1 <B>do</B>
      <B>if</B> Form.Components[I] <B>is</B> TLabel <B>then</B>
        LabelCaption := TLabel(Form.Components[I]).Caption
      <B>else</B>
        <B>if</B> Form.Components[I] <B>is</B> TButton <B>then</B>
          ButtonCaptions := Format(<I>'%s%s, '</I>,
            [ButtonCaptions, TButton(Form.Components[I]).Caption]);
    ButtonCaptions := StringReplace(ButtonCaptions,<I>'&amp;'</I>,<I>''</I>, [rfReplaceAll]);
    DialogText := Format(<I>'%s.%s%s.%s%s'</I>,
      [Form.Caption, sLineBreak, LabelCaption, sLineBreak, ButtonCaptions]);
    Memo1.Text := DialogText;
    Voice.Speak(DialogText, SVSFlagsAsync)
  <B>except</B>
    <FONT color=#003399><I>//pretend everything is okay</I></FONT>
  <B>end</B>
<B>end</B>;

<B>procedure</B> TForm1.ScreenFormChange(Sender: TObject);
<B>begin</B>
  <B>if</B> Assigned(Screen.ActiveForm) <B>and</B>
     (Screen.ActiveForm.ClassName = <I>'TMessageForm'</I>) <B>then</B>
    ReadVCLDialog(Screen.ActiveForm)
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>The form's <FONT face="Courier New, Courier, mono">OnCreate</FONT> event 
handler sets up an <FONT 
face="Courier New, Courier, mono">OnActiveFormChange</FONT> event handler for 
the screen object. This is triggered each time a different form becomes active, which 
includes VCL dialogs. Any call to <FONT 
face="Courier New, Courier, mono">ShowMessage</FONT>, <FONT 
face="Courier New, Courier, mono">MessageDlg</FONT> or related routines causes a 
<FONT face="Courier New, Courier, mono">TMessageForm</FONT> to be displayed so 
the code checks for this. If the form type is found, a textual version of what's 
on the dialog is built up and then spoken through the SAPI Automation 
component.</P>
<P>A statement such as:</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
MessageDlg(<I>'Save changes?'</I>, mtConfirmation, mbYesNoCancel, 0)
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>causes the <FONT face="Courier New, Courier, mono">ReadVCLDialog</FONT> 
routine to build up and say this text:</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
Confirm.
Save changes?.
Yes, No, Cancel,
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>Notice the full stops added after the dialog caption and the message text, 
which briefly pause the speech engine at those points before it moves on.</P>
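<P>The way the spoken text is assembled can also be tried outside a form. The 
console program below is a minimal sketch and is not part of the original demo: 
<FONT face="Courier New, Courier, mono">BuildDialogText</FONT> is a hypothetical 
helper that repeats the string handling from <FONT 
face="Courier New, Courier, mono">ReadVCLDialog</FONT>, with the captions passed 
in as plain strings instead of being read from a <FONT 
face="Courier New, Courier, mono">TMessageForm</FONT>.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>program</B> DialogTextDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

<B>uses</B>
  SysUtils;

<FONT color=#003399><I>//hypothetical helper: same string handling as ReadVCLDialog</I></FONT>
<B>function</B> BuildDialogText(<B>const</B> Caption, LabelCaption: <B>string</B>;
  <B>const</B> Buttons: <B>array</B> <B>of</B> <B>string</B>): <B>string</B>;
<B>var</B>
  I: Integer;
  ButtonCaptions: <B>string</B>;
<B>begin</B>
  ButtonCaptions := <I>''</I>;
  <FONT color=#003399><I>//join the button captions, each followed by a comma and a space</I></FONT>
  <B>for</B> I := Low(Buttons) <B>to</B> High(Buttons) <B>do</B>
    ButtonCaptions := Format(<I>'%s%s, '</I>, [ButtonCaptions, Buttons[I]]);
  <FONT color=#003399><I>//strip the accelerator markers so they are not read out</I></FONT>
  ButtonCaptions := StringReplace(ButtonCaptions, <I>'&amp;'</I>, <I>''</I>, [rfReplaceAll]);
  <FONT color=#003399><I>//full stops after the caption and the message add short pauses</I></FONT>
  Result := Format(<I>'%s.%s%s.%s%s'</I>,
    [Caption, sLineBreak, LabelCaption, sLineBreak, ButtonCaptions]);
<B>end</B>;

<B>begin</B>
  WriteLn(BuildDialogText(<I>'Confirm'</I>, <I>'Save changes?'</I>,
    [<I>'&amp;Yes'</I>, <I>'&amp;No'</I>, <I>'Cancel'</I>]));
<B>end</B>.
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>
<P>With the captions used here, the program writes the same three lines of text 
as shown in the listing above.</P>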
<H2><A name=SR>Speech Recognition</A></H2>
<P>Continuous dictation is easy to set up as no specific grammar is required, 
but Command and Control recognition will need a grammar to educate the 
recogniser as to the permissible commands.</P>
<P>When you need SR you can either use a shared recogniser (<FONT 
face="Courier New, Courier, mono">TSpSharedRecognizer</FONT>) or an in-process 
recogniser (<FONT face="Courier New, Courier, mono">TSpInprocRecognizer</FONT>). 
The in-process recogniser is more efficient (it resides in your process address 
space) but means that no other SR applications can receive input from the 
microphone until it is closed down. On the other hand the shared recogniser can 
be used by multiple applications, and each one can access the microphone. It is 
more common to use the shared recogniser in typical SAPI applications.</P>
<P>The recogniser uses the notion of a <I>recognition context</I> to identify 
when it will be active (not to be confused with the use of context in a 
context-free grammar or CFG). A context is represented by the <FONT 
face="Courier New, Courier, mono">TSpInprocRecoContext</FONT> or <FONT 
face="Courier New, Courier, mono">TSpSharedRecoContext</FONT> interfaces. An 
application may use one context for each form that will use SR, or several 
contexts for different application modes (Office XP has a dictation mode for 
adding text to a document and a control mode for executing menu commands).</P>
<P>Recognition contexts enable you to start and stop recognition, set up the 
grammar and receive important recognition notifications.</P>
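<P>The role of a context in starting and stopping recognition can be sketched 
with late-bound Automation, in the same style as the talking dialogs example. 
This is an illustration under assumptions rather than the approach the demo 
programs take: the <FONT face="Courier New, Courier, mono">StartDictation</FONT> 
and <FONT face="Courier New, Courier, mono">StopDictation</FONT> method names are 
hypothetical, and the constants are declared locally with the values they are 
assumed to have in the SAPI type library. A late-bound Variant does not surface 
the recognition notifications, so handling those still calls for an 
event-capable wrapper such as <FONT 
face="Courier New, Courier, mono">TSpSharedRecoContext</FONT>.</P>
<TABLE bgColor=white border=1>
  <TBODY>
  <TR>
    <TD><PRE><CODE><FONT color=black size=2>
<B>uses</B>
  ComObj;

<B>var</B>
  RecoContext, Grammar: Variant;

<B>procedure</B> TForm1.StartDictation;
<B>const</B>
  SLOStatic = 0;  <FONT color=#003399><I>//assumed SpeechLoadOption value</I></FONT>
  SGDSActive = 1; <FONT color=#003399><I>//assumed SpeechRuleState value</I></FONT>
<B>begin</B>
  <FONT color=#003399><I>//a context attached to the shared recogniser</I></FONT>
  RecoContext := CreateOleObject(<I>'SAPI.SpSharedRecoContext'</I>);
  <FONT color=#003399><I>//dictation needs no rules, so an empty grammar object will do</I></FONT>
  Grammar := RecoContext.CreateGrammar(0);
  Grammar.DictationLoad(<I>''</I>, SLOStatic);
  <FONT color=#003399><I>//start recognition</I></FONT>
  Grammar.DictationSetState(SGDSActive);
<B>end</B>;

<B>procedure</B> TForm1.StopDictation;
<B>const</B>
  SGDSInactive = 0; <FONT color=#003399><I>//assumed SpeechRuleState value</I></FONT>
<B>begin</B>
  <FONT color=#003399><I>//stop recognition, but only if dictation was ever started</I></FONT>
  <B>if</B> VarType(Grammar) = varDispatch <B>then</B>
    Grammar.DictationSetState(SGDSInactive);
<B>end</B>;
</FONT></CODE></PRE></TD></TR></TBODY></TABLE>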
<H3><A name=Grammars>Grammars</A></H3>