
http://www.cs.rutgers.edu/hpcd/area_ii.3/index.html

This data set contains WWW pages collected from the computer science departments of various universities.
Date: Tue, 14 Jan 1997 22:09:19 GMT
Server: NCSA/1.5.2
Last-modified: Fri, 25 Oct 1996 17:21:26 GMT
Content-type: text/html
Content-length: 12609

<title>Area II.3. Adaptive Voice Mimic</title>

<h3>Cluster II. Hypercomputing in Design Tasks Supported by Computational Fluid Dynamics (CFD)</h3>

<h4>Area II.3 Design of 'voice mimic' speech generation systems</h4>

Area Coordinator:
<ul>
<li>James Flanagan (Rutgers)
</ul>

<CENTER><IMG SRC="http://www.cs.rutgers.edu/hpcd/Area_II.3/Images/flow_chart.gif" ALT="The Adaptive Voice Mimic System" ALIGN="middle"></CENTER>

<P>

<H4>I.  Introduction</H4>

The research on the adaptive voice mimic aims to advance fundamental
understanding of human speech generation and coalesces the problems of
speech synthesis, speech recognition, and low bit-rate speech coding
into a compact parametric framework.  At its core, the mimic system
utilizes optimization techniques and a computationally intensive model
of speech generation to provide a high-quality estimate, moment by
moment, of articulatory parameters from an acoustic speech signal.
The estimation of articulatory parameters is accomplished through a
two-step process: an open-loop (table look-up based) initial
estimation followed by a closed-loop optimization refinement.

<H4>II.  Articulatory Shape Estimation</H4>

Starting from an acoustic input, the open-loop (i.e., with no
optimization) estimate of the articulatory parameters is obtained via
a table look-up of precomputed synthetic speech representations.  Each
element in the table is stored with the articulatory parameters from
which it was produced.  The input speech is compared with the
synthetic speech in the table via a spectral representation, and the
articulatory shape corresponding to the "closest" synthetic speech is
selected.  Once initial articulatory estimates are found for a series
of speech segments, a dynamic programming module provides smooth
articulatory trajectories by imposing articulatory constraints.  This
concludes the open-loop process.

<P>

The open-loop estimates initialize a closed-loop optimization by
suggesting a starting position that is likely in the vicinity of the
(global) optimal solution.  Effective open-loop estimates reduce the
computation required by the computationally costly optimization loop.
Within the closed-loop optimization, synthetic speech is generated
from a compact set of articulatory parameters and compared with the
input speech using a perceptually weighted distance metric.  The
articulatory parameters are iteratively adjusted based on the result
of the comparison, so that the weighted spectral distance between the
arbitrary speech input and the synthetic speech is driven below a
preset threshold.
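<P>

To make this two-step estimation concrete, the Python sketch below
pairs a codebook look-up with a simple coordinate-wise refinement
loop.  It is a minimal sketch only: the <CODE>synthesize</CODE> and
<CODE>spectral_distance</CODE> callables, the step size, and the
coordinate-descent search are illustrative assumptions rather than
details of the HPCD implementation, and the dynamic-programming
smoothing across frames is omitted.

<PRE>
import numpy as np

def estimate_articulatory_params(input_frame, codebook, synthesize,
                                 spectral_distance, threshold=1e-3,
                                 max_iters=100, step=0.05):
    """Two-step acoustic-to-articulatory mapping (illustrative sketch).

    codebook           -- (articulatory_params, spectral_template) pairs
                          precomputed from synthetic speech
    synthesize         -- maps articulatory parameters to a synthetic frame
    spectral_distance  -- perceptually weighted distance between frames
    """
    # Step 1 (open loop): table look-up of the codebook entry whose
    # precomputed synthetic spectrum is closest to the input frame.
    best = min(codebook, key=lambda e: spectral_distance(input_frame, e[1]))
    params = np.array(best[0], dtype=float)

    # Step 2 (closed loop): perturb each articulatory parameter in turn,
    # keeping any change that reduces the perceptually weighted spectral
    # distance, until that distance is driven below the preset threshold.
    dist = spectral_distance(input_frame, synthesize(params))
    for _ in range(max_iters):
        if threshold > dist:          # converged: distance below threshold
            break
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                trial = params.copy()
                trial[i] += delta
                d = spectral_distance(input_frame, synthesize(trial))
                if dist > d:          # keep the perturbation if it helps
                    params, dist = trial, d
                    improved = True
        if not improved:              # local minimum above the threshold
            break
    return params, dist
</PRE>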
<H4>III.  Articulatory Speech Synthesis</H4>

As part of this research, two methods of high-quality speech synthesis
from articulatory parameters are studied.  The first method is based
on linear acoustic theory and models of speech production, and the
second method is based on a fluid-dynamic formulation.  The techniques
for the first method are relatively well established, but the method
assumes plane-wave propagation inside the vocal tract and also
neglects most of the non-linear terms.  The second, newer method
attempts to capture more accurately the physics behind human speech
production.  This is done by formulating the speech production process
as a fluid-dynamic phenomenon.  The approach uses a form of the
Reynolds-Averaged Navier-Stokes (RANS) equations describing fluid
motion to numerically solve for low Mach number, compressible flow in
vocal tract geometries.  Physical experiments, from which real flow
quantities are acquired, support the computational approach by
validating the numerical results.

<P>

Both linear acoustic and fluid-dynamic synthesis use vocal tract
shapes defined by means of articulatory models.  Two models have been
used: Tracttalk (Lin, 1990) and the Flanagan-Ishizaka model
(Ishizaka, 1976).  Both models provide stylized vocal tract shapes
defined by a compact set of parameters.  These parameters quantify the
position and shape of the articulators.  For example, the parameters
primarily used in this study specify the location and size of the main
constriction in the vocal tract, the mouth aperture, and the
cross-sectional area of the front cavity.  These parameters are shown
in the schematic below.

<P>

<CENTER>
<B>The Flanagan-Ishizaka Vocal Tract Model</B>
<P>
<IMG SRC="http://www.cs.rutgers.edu/hpcd/Area_II.3/Images/fi_vtmodel.gif">
</CENTER>

<H4>IV.  Achievements</H4>

<H5>IV.1. Vowel Recognition</H5>

Using a spectral representation based on linear-predictive poles and
a reduced number of articulatory parameters, a vowel recognition
system based on an articulatory representation of speech signals has
been designed.  In contrast to the articulatory-based approach,
traditional speech recognition systems have relied on spectral and/or
cepstral features.  Despite considerable efforts seeking more
accurate, compact, and reliable features for robust speech
recognition, the articulatory representation of speech has not been
exploited, owing to the difficulty and computational intensity of
estimating articulatory parameters from speech waveforms.  Adaptive
voice mimic, with optimized open-loop steering and efficient
closed-loop control, provides a promising solution to this challenge.

<P>

A nearly real-time laboratory prototype of the articulatory-based
recognition system has been implemented and demonstrated.  The system
can recognize both isolated vowels and vowel strings.  A recognition
accuracy of more than 97% is obtained.  During the recognition
computation, dynamically changing sagittal profiles of the vocal tract
(corresponding to the input speech) are displayed.  The figure below
shows the main displays of the recognition prototype.

<P>

<CENTER>
<B>The Voice Mimic Articulatory-Based Vowel Recognition System</B>
<P>
<IMG SRC="http://www.cs.rutgers.edu/hpcd/Area_II.3/Images/mimic_screen.gif">
</CENTER>
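<P>

To illustrate the linear-predictive pole representation used by the
recognizer above, the sketch below computes LP coefficients for one
speech frame by the standard autocorrelation method (Levinson-Durbin
recursion) and takes the roots of the prediction polynomial.  The
model order, sampling rate, and Hamming windowing are generic textbook
choices, not parameters reported for the HPCD recognizer.

<PRE>
import numpy as np

def lpc(frame, order=12):
    """Linear-prediction coefficients of one speech frame, computed by
    the autocorrelation method with a Levinson-Durbin recursion."""
    frame = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-12                            # guard for silent frames
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err   # reflection coeff.
        a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]   # order-i update
        err *= 1.0 - k * k
    return a

def lp_pole_frequencies(frame, fs=8000, order=12):
    """Pole frequencies (Hz) of the all-pole model -- a compact spectral
    feature; for steady vowels the lowest poles roughly track formants."""
    roots = np.roots(lpc(frame, order))
    roots = roots[np.imag(roots) > 0.0]           # one of each conjugate pair
    return np.sort(np.angle(roots) * fs / (2.0 * np.pi))
</PRE>

A vowel frame can then be matched against codebook entries by
comparing such pole vectors, in the spirit of the spectral comparison
the page describes.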
<H5>IV.2. Speech Coding</H5>

The articulatory representation is one of the most promising
techniques for high-quality, very low bit-rate speech coding.  It is
thought that such a representation can produce speech coders operating
below 1 kbit per second.  Thus, the importance of
acoustic-to-articulatory mapping for the purpose of coding is
apparent.

<P>

The adaptive voice mimic research has also produced a coding system
for vowels and fricative consonants (such as the /s/ in "sea" and the
/f/ in "fire").  It was found that spectral comparison based on the
poles of linear prediction, which works excellently for vowels, does
not work equally well for fricatives.  The major reason is that
fricatives exhibit a number of bound pole/zero pairs; as a result,
linear prediction fails to provide accurate estimates of these
singularities.  Therefore, other feature representations have been
explored.  The cepstrum representation was chosen since it is
relatively compact and produces positive results.

<P>

To complete the extension of the voice mimic system to fricative
sounds, an improved initial estimation of source parameters has been
designed to include an efficient voiced/unvoiced decision.  Evident
discrepancies exist in the frequency content between sounds produced
by a source at the glottis (vibration of the vocal cords) and sounds
produced by a noise source at a constriction in the vocal tract (as is
the case for fricatives).  These discrepancies make the use of
multiple codebooks necessary.  The appropriate codebook is selected
based on the voiced/unvoiced decision.  The estimation of articulatory
parameters is then completed by the open-loop steering followed by the
closed-loop analysis.

<P>

This system has produced vowel/consonant/vowel utterances and short
sentences of very encouraging quality.  Below are some coding examples
from the voice mimic in which the articulatory parameters estimated
from the input speech have been used to re-synthesize the speech.

<P>

<CENTER>
<B>Speech Coding Examples Using the Adaptive Voice Mimic</B><BR>
<EM>(Sun Audio, 32 kHz, 16-bit, linear)</EM>
<P>
<TABLE BORDER>
<TR ALIGN=CENTER> <TH> Natural Input Speech <TH> Voice Mimic
<TR ALIGN=CENTER>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/oussou32_3.au">/usu/</A>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/oussou32syn_3.au">/usu/</A>
<TR ALIGN=CENTER>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/oushou32_3.au">/ushu/</A>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/oushou32syn_3.au">/ushu/</A>
<TR ALIGN=CENTER>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/ouffou32_3.au">/ufu/</A>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/ouffou32syn_3.au">/ufu/</A>
<TR ALIGN=CENTER>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/she1_3.au">she saw a fire</A>
  <TD> <A HREF="http://www.cs.rutgers.edu/hpcd/Area_II.3/Audio/she1syn_3.au">she saw a fire</A>
</TABLE>
</CENTER>

<H5>IV.3. Speaker Identification</H5>

Physiological information about a particular speaker's vocal tract is
"hidden" in their speech signal.  Acoustic-to-articulatory mapping
provides a means to extract this information and use it to
differentiate speakers.  In particular, vocal tract parameters can be
used to supplement traditional speaker identification methods.  The
advantage of vocal tract parameters is that they are not affected by
emotion or sickness, and they cannot be easily altered for the purpose
of impersonation.

<P>

Preliminary experiments have been conducted towards estimating the
vocal tract length from the acoustic signal.  This is a critical
parameter for differentiating talkers in speaker identification or
verification tasks.  The estimation is performed using the voice mimic
system and a two-step strategy.  First, the shape of the vocal tract
is determined using a codebook built on a fixed vocal tract length.
Then, the vocal tract length is estimated using a detailed codebook
comprising variations of the same shape with its length stretched and
compressed.  Although such an approach requires advance knowledge of
which sound is produced, this limitation will be overcome in the
future by replacing the second codebook with an optimization loop.
Initial results have been obtained using a database that associates
X-ray images of the vocal tract with the corresponding speech signal
produced.  It is shown that the vocal tract length estimated by the
voice mimic system agrees well with the measured value.
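<P>

The two-step length-estimation strategy just described can be
sketched as a nested codebook search.  In the sketch below, the
<CODE>synthesize</CODE> and <CODE>spectral_distance</CODE> callables,
the search grid, and the 17 cm nominal tract length are illustrative
assumptions, not details of the actual mimic system.

<PRE>
import numpy as np

def estimate_tract_length(input_frame, shape_codebook, synthesize,
                          spectral_distance, nominal_length=17.0,
                          length_factors=np.linspace(0.8, 1.2, 21)):
    """Two-step vocal tract length estimation (illustrative sketch).

    shape_codebook     -- candidate articulatory shapes, all built for
                          one fixed (nominal) tract length
    synthesize         -- maps (shape, tract_length) to a synthetic frame
    spectral_distance  -- perceptually weighted distance between frames
    """
    # Step 1: determine the tract *shape*, with the length held fixed
    # at the nominal value used to build the codebook.
    shape = min(shape_codebook,
                key=lambda s: spectral_distance(
                    input_frame, synthesize(s, nominal_length)))

    # Step 2: hold the shape fixed and search a fine grid of stretched
    # and compressed versions of that same shape for the best length.
    factor = min(length_factors,
                 key=lambda f: spectral_distance(
                     input_frame, synthesize(shape, nominal_length * f)))
    return shape, nominal_length * factor
</PRE>

Replacing the second, gridded search with a continuous optimization
loop corresponds to the future refinement mentioned above.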
<H4>V.  Publications</H4>

<UL>
<LI>G. Richard, M. Goirand, D. Sinder, J. Flanagan, "Simulation and
Visualization of Articulatory Trajectories Estimated from Speech
Signals," <I>submitted for presentation at the International Symposium
on Simulation, Visualization and Auralization for Acoustic Research
and Education (ASVA97)</I>, April 1997, Tokyo, Japan.
<LI>G. Richard, Q. Lin, F. Zussa, D. Sinder, C. Che, and J. Flanagan,
"Vowel recognition using an articulatory representation,"
<I>JASA</I>, Vol. 98, No. 5, Pt. 2, November 1995, p. 2965.
<LI>F. Zussa, Q. Lin, G. Richard, D. Sinder, and J. Flanagan,
"Open-loop acoustic-to-articulatory mapping," <I>JASA</I>, Vol. 98,
No. 5, Pt. 2, November 1995, p. 2931.
<LI>Q. Lin, G. Richard, J. Zou, D. Sinder, J. Flanagan, "Use of
TRACTTALK for adaptive voice mimic," <I>JASA</I>, Vol. 97, No. 5,
Pt. 2, May 1995, p. 3247.
</UL>

<H5>Related Publications</H5>

<UL>
<LI>D. Sinder, G. Richard, H. Duncan, J. Flanagan, S. Slimon,
D. Davis, M. Krane, S. Levinson, "Flow Visualization in Stylized Vocal
Tracts," <I>submitted for presentation at the International Symposium
on Simulation, Visualization and Auralization for Acoustic Research
and Education (ASVA97)</I>, April 1997, Tokyo, Japan.
<LI>S. Slimon, D. Davis, S. Levinson, M. Krane, G. Richard, D. Sinder,
H. Duncan, Q. Lin, J. Flanagan, "Low Mach Number Flow Through a
Constricted, Stylized Vocal Tract," <I>American Institute of
Aeronautics and Astronautics Conference (AIAA96)</I>, Penn State
Univ., PA, May 1996.
<LI>D. Sinder, G. Richard, H. Duncan, Q. Lin, J. Flanagan,
S. Levinson, D. Davis, and S. Slimon, "A fluid flow approach to speech
generation," <I>First ESCA Tutorial and Research Workshop on Speech
Production Modeling: From Control Strategies to Acoustics</I>, Autrans,
France, May 21-24, 1996.
<LI>G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin, J. Flanagan,
S. Levinson, D. Davis, and S. Slimon, "Numerical simulations of fluid
flow in the vocal tract," <I>Proc. of 1995 Eurospeech</I>,
pp. 1297-1300, Madrid, Spain, September 18-21, 1995.
<LI>G. Richard, M. Liu, D. Sinder, H. Duncan, Q. Lin, J. Flanagan,
S. Levinson, D. Davis, S. Slimon, "Vocal tract simulations based on
fluid dynamic analysis," <I>JASA</I>, Vol. 97, No. 5, Pt. 2, May 1995,
p. 3245.
</UL>

<hr>

Visit <a href="http://www.caip.rutgers.edu/multimedia/">CAIP's
Multimedia Lab <IMG SRC="http://www.cs.rutgers.edu/hpcd/Area_II.3/Images/multimedia_button.gif" ALIGN=middle></a>

<p>

<a href="http://www.cs.rutgers.edu/hpcd"><img src="http://www.cs.rutgers.edu/hpcd/goback.gif" alt="Return to HPCD Home Page" align=middle>Return to HPCD Home Page</a>
