Synthetic and SNHC Audio in MPEG-4

Eric D. Scheirer
Machine Listening Group, MIT Media Laboratory
E15-401D, Cambridge MA 02143-4307 USA
Tel: +1 617 253 0112  Fax: +1 617 258 6264
eds@media.mit.edu

Youngjik Lee and Jae-Woo Yang
Switching and Transmission Technology Laboratories, ETRI
Abstract

In addition to its sophisticated audio-compression capabilities, MPEG-4 contains extensive functions supporting synthetic sound and the synthetic/natural hybrid coding of sound. We present an overview of the Structured Audio format, which allows efficient transmission and client-side synthesis of music and sound effects. We also provide an overview of the Text-to-Speech Interface, which standardizes a single format for communication with speech synthesizers. Finally, we present an overview of the AudioBIFS portion of the Binary Format for Scene Description, which allows the description of hybrid soundtracks, 3-D audio environments, and interactive audio programming. The tools provided for advanced audio functionality in MPEG-4 are a new and important addition to the world of audio standards.
Introduction
This article describes the parts of MPEG-4 that govern the compression, representation, and transmission of synthetic sound and the combination of synthetic and natural sound into hybrid soundtracks. Through these tools, MPEG-4 provides advanced capabilities for ultra-low-bitrate sound transmission, interactive sound scenes, and flexible, repurposable delivery of sound content.

We will discuss three MPEG-4 audio tools. The first, MPEG-4 Structured Audio, standardizes precise, efficient delivery of synthetic music and sound effects. The second, the MPEG-4 Text-to-Speech Interface, standardizes a representation protocol for synthesized speech, an interface to text-to-speech synthesizers, and the automatic synchronization of synthetic speech with "talking head" animated face graphics [24]. The third, MPEG-4 AudioBIFS (part of the main BIFS framework), standardizes terminal-side mixing and post-production of audio soundtracks [22]. AudioBIFS enables interactive soundtracks and 3-D sound presentation for virtual-reality applications. In MPEG-4, the capability to mix and synchronize real sound with synthetic sound is termed Synthetic/Natural Hybrid Coding of Audio, or SNHC Audio.

The paper is organized as follows. First, we provide a general overview of the objectives for synthetic and SNHC audio in MPEG-4. This section also introduces concepts from speech and music synthesis to readers whose primary expertise may not be in the field of audio. Next, we give a detailed description of the synthetic-audio codecs in MPEG-4. Finally, we describe AudioBIFS and its use in the creation of SNHC audio soundtracks.
Synthetic Audio in MPEG-4: Concepts and Requirements
In this section, we introduce speech synthesis and music synthesis. Then we discuss the inclusion of these technologies in MPEG-4, focusing on the capabilities provided by synthetic audio and the types of applications that are better addressed with synthetic audio coding than with natural audio coding.

Relationship between natural and synthetic coding
Modern standards for natural audio coding [1, 2] use perceptual models to compress natural sound. In coding synthetic sound, perceptual models are not used; rather, very specific parametric models are used to transmit sound descriptions. The descriptions are received at the decoding terminal and converted into sound through real-time sound synthesis. The parametric model for the Text-to-Speech Interface is fixed in the standard; in the Structured Audio toolset, the model itself is transmitted as part of the bitstream and interpreted by a reconfigurable decoder.
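To make the distinction concrete, the following is a minimal sketch of description-based transmission. It is not part of any MPEG-4 normative specification; the ToneDescription class and its parameter fields are hypothetical, chosen only to show a parametric description being synthesized at the terminal rather than decompressed from transmitted samples.

# A minimal sketch of description-based transmission: instead of sending
# compressed samples, the encoder sends a tiny parametric description and
# the decoding terminal synthesizes the sound in real time.
# The ToneDescription fields are hypothetical, for illustration only.
from dataclasses import dataclass
import numpy as np

SAMPLE_RATE = 44100

@dataclass
class ToneDescription:
    frequency_hz: float   # pitch of the tone
    duration_s: float     # how long to play it
    amplitude: float      # linear gain, 0..1

def synthesize(desc: ToneDescription) -> np.ndarray:
    """Client-side 'decoder': turn the description into samples."""
    t = np.arange(int(desc.duration_s * SAMPLE_RATE)) / SAMPLE_RATE
    return desc.amplitude * np.sin(2 * np.pi * desc.frequency_hz * t)

desc = ToneDescription(frequency_hz=440.0, duration_s=2.0, amplitude=0.5)
samples = synthesize(desc)
# Three floats describe two seconds of audio that would otherwise require
# 88200 transmitted samples; this is the source of the ultra-low bitrates.
print(f"description: 3 parameters; raw audio: {samples.size} samples")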
Natural and synthetic audio are not unrelated methods for transmitting sound. Especially as sound models in perceptual coding grow more sophisticated, the boundary between "decompression" and "synthesis" becomes somewhat blurred. Vercoe, Gardner, and Scheirer [28] have discussed the relationships among parametric models of sound, digital sound creation and transmission, perceptual coding, parametric compression, and various techniques for algorithmic synthesis.
Concepts in speech synthesis
Text-to-speech (TTS) systems generate speech sound from given text. This technology enables the translation of text information into speech so that the text can be transferred through speech channels such as telephone lines. Today, TTS systems are used in many applications, including automatic voice-response systems (the "telephone menu" systems that have become popular recently), e-mail reading, and information services for the visually handicapped [9, 10].

TTS systems typically consist of multiple processing modules, as shown in Figure 1. Such a system accepts text as input and generates a corresponding phoneme sequence. Phonemes are the smallest units of human language; each phoneme corresponds to one sound used in speech. A surprisingly small set of phonemes, about 120, is sufficient to describe all human languages.
Figure 1: Block diagram of a text-to-speech system, showing the interaction between text-to-phoneme conversion, text understanding, and prosody generation and application
The phoneme sequence is used in turn to generate a basic speech sequence without prosody, that is, without pitch, duration, and amplitude variations. In parallel, a text-understanding module analyzes the input for phrase structure and inflections. Using the result of this processing, a prosody-generation module creates the proper prosody for the text. Finally, a prosody-control module changes the prosody parameters of the basic speech sequence according to the results of the text-understanding module, yielding synthesized speech.
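As an illustration of this module structure, here is a minimal, schematic Python sketch of the pipeline in Figure 1. All of the names and the toy phoneme and prosody rules are hypothetical placeholders; a real system replaces each function with a substantial component.

# Schematic sketch of the Figure 1 pipeline: text -> phonemes -> prosody
# applied to a basic speech sequence. Every rule here is a toy placeholder.
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str
    duration_ms: float = 80.0   # default duration, no prosody yet
    pitch_hz: float = 120.0     # flat pitch, no prosody yet
    amplitude: float = 1.0

def text_to_phonemes(text: str) -> list[Phoneme]:
    # Placeholder letter-to-sound rule; real systems use pronunciation
    # dictionaries and trained letter-to-sound models.
    return [Phoneme(ch) for ch in text.lower() if ch.isalpha()]

def understand_text(text: str) -> dict:
    # Placeholder "text understanding": detect a question from punctuation.
    return {"is_question": text.strip().endswith("?")}

def apply_prosody(phonemes: list[Phoneme], analysis: dict) -> list[Phoneme]:
    # Placeholder prosody: questions get a rising pitch contour.
    if analysis["is_question"]:
        for i, p in enumerate(phonemes):
            p.pitch_hz += 40.0 * i / max(len(phonemes) - 1, 1)
    return phonemes

text = "Is it ready?"
speech = apply_prosody(text_to_phonemes(text), understand_text(text))
print([(p.symbol, round(p.pitch_hz)) for p in speech])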
One of the first successful TTS systems was the DecTalk English speech synthesizer, developed in 1983 [11]. This system produces very intelligible speech and supports eight different speaking voices. However, developing a speech synthesizer of this sort is difficult, since all of the acoustic parameters for synthesis must be extracted; analyzing enough data to accumulate the parameters needed for every kind of speech is a painstaking process.
In 1992, CNET in France developed the pitch-synchronous overlap-and-add (PSOLA) method for controlling the pitch and phoneme durations of synthesized speech [25]. This technique makes it easy to control the prosody of synthesized speech, so speech synthesized with PSOLA sounds more natural. PSOLA can also use human speech as a guide to control the prosody of the synthesis, in an analysis-synthesis process that can modify tone and duration as well. However, if the tone is changed too much, the resulting speech is easily recognized as artificial.
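The following is a highly simplified sketch of the overlap-and-add idea behind PSOLA, assuming a known, constant pitch period and omitting the pitch-mark detection a real implementation must perform. It raises the pitch by re-spacing windowed grains more closely.

# Simplified overlap-and-add pitch modification in the spirit of PSOLA.
# Assumes a known, constant pitch period; real PSOLA works from detected
# pitch marks and varies the period pitch-synchronously.
import numpy as np

def ola_pitch_shift(signal: np.ndarray, period: int, factor: float) -> np.ndarray:
    """Shift pitch by `factor` (>1 raises pitch) by re-spacing grains."""
    grain_len = 2 * period                     # two periods per grain
    window = np.hanning(grain_len)             # taper to avoid clicks
    out_hop = int(round(period / factor))      # new spacing between grains
    n_grains = (len(signal) - grain_len) // period
    out = np.zeros(n_grains * out_hop + grain_len)
    for i in range(n_grains):
        grain = signal[i * period : i * period + grain_len] * window
        start = i * out_hop
        out[start : start + grain_len] += grain  # overlap-and-add
    return out

# Toy input: a 100 Hz voiced-like waveform at 8 kHz (period = 80 samples).
sr, period = 8000, 80
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 100 * t) ** 5           # peaky, pulse-like signal
y = ola_pitch_shift(x, period, factor=1.25)    # raise pitch by ~25%
print(len(x), len(y))  # output is shorter: same grains, closer spacing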
In 1996, ATR in Japan developed the CHATR speech synthesizer [10]. This method relies on short samples of human speech, used without modification of their characteristics; it locates and sequences phonemes, words, or phrases from a database. A large database of human speech is necessary to develop a TTS system using this method. Automatic tools may be used to label each phoneme of the human speech to reduce development time; typically, hidden Markov models (HMMs) are used to align the best phoneme candidates to the target speech. The synthesized speech is very intelligible and natural; however, this method of TTS requires large amounts of memory and processing power.
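A minimal sketch of the database-lookup step in a unit-selection system of this kind might look as follows. The unit records, cost function, and greedy search are illustrative simplifications, not taken from CHATR; real systems use much richer features and a dynamic-programming search over candidate units.

# Toy unit selection: pick recorded units matching a target phoneme string,
# preferring units whose neighbors in the source recording match the
# target context. Database entries and costs are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Unit:
    phoneme: str
    prev: str      # phoneme preceding this unit in its source recording
    next: str      # phoneme following it
    clip_id: int   # index of the stored waveform snippet

DATABASE = [
    Unit("h", "#", "e", 0), Unit("e", "h", "l", 1),
    Unit("l", "e", "o", 2), Unit("o", "l", "#", 3),
    Unit("e", "b", "d", 4), Unit("l", "a", "a", 5),
]

def select_units(target: list[str]) -> list[Unit]:
    chosen = []
    for i, ph in enumerate(target):
        prev_ph = target[i - 1] if i > 0 else "#"
        next_ph = target[i + 1] if i < len(target) - 1 else "#"
        candidates = [u for u in DATABASE if u.phoneme == ph]
        # Concatenation cost: penalize context mismatches, which would
        # produce audible joins between consecutive snippets.
        cost = lambda u: (u.prev != prev_ph) + (u.next != next_ph)
        chosen.append(min(candidates, key=cost))
    return chosen

print([u.clip_id for u in select_units(list("helo"))])  # -> [0, 1, 2, 3]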
The applications of TTS are expanding in telecommunications, personal computing, and the Internet. Current research in TTS includes voice conversion (synthesizing the sound of a particular speaker's voice), multi-language TTS, and enhancing the naturalness of speech through more sophisticated voice models and prosody generators.
Applications for speech synthesis in MPEG-4
The synthetic speech system in MPEG-4 was designed to support interactive applications using text as the basic content type. These applications include on-demand storytelling, motion-picture dubbing, and "talking head" synthetic videoconferencing.

In the story-telling-on-demand (STOD) application, the user selects a story from a large database stored on fixed media. The STOD system reads the story aloud, using the MPEG-4 Text-to-Speech Interface (henceforth, TTSI) together with the MPEG-4 facial-animation tool or with appropriately selected images. The user can stop and resume the reading at any moment through the user interface of the local machine (for example, mouse or keyboard), and can also select the gender, age, and speech rate of the electronic storyteller.
In a motion-picture-dubbing application, synchronization between the MPEG-4 TTSI decoder and the encoded moving picture is the essential feature. The architecture of the MPEG-4 TTSI decoder provides several levels of synchronization granularity. Coarse synchronization can be achieved easily by aligning the composition time of each sentence. For more finely tuned synchronization, information about the speaker's lip shape can be used. The finest granularity of synchronization is achieved by using detailed prosody transmission and video-related information such as sentence duration and offset time within the sentence. With this synchronization capability, the MPEG-4 TTSI can be used for motion-picture dubbing by following the lip shape and the corresponding time within the sentence.
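To illustrate the coarse, sentence-level case, here is a hypothetical sketch of aligning synthesized sentences to video composition times. The field names and scheduling loop are invented for illustration and do not reflect the normative MPEG-4 bitstream syntax.

# Hypothetical sentence-level synchronization: each sentence carries a
# composition time, and playback of the synthesized audio is scheduled
# to start when the video clock reaches that time. Field names are
# illustrative only, not MPEG-4 bitstream syntax.
from dataclasses import dataclass

@dataclass
class TimedSentence:
    composition_time_s: float  # when the sentence should begin (video clock)
    duration_s: float          # expected spoken duration
    text: str

def schedule(sentences: list[TimedSentence]) -> list[tuple[float, str]]:
    """Return (start_time, text) pairs, delaying sentences that overlap."""
    plan = []
    prev_end = 0.0
    for s in sorted(sentences, key=lambda s: s.composition_time_s):
        start = max(s.composition_time_s, prev_end)  # never talk over
        if start > s.composition_time_s:             # the previous line
            print(f"warning: '{s.text}' delayed {start - s.composition_time_s:.2f}s")
        plan.append((start, s.text))
        prev_end = start + s.duration_s
    return plan

clips = [
    TimedSentence(0.0, 2.5, "Where have you been?"),
    TimedSentence(2.0, 1.5, "Nowhere special."),
]
print(schedule(clips))  # the second sentence is delayed to 2.5 s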
To enable synthetic video-teleconferencing, the TTSI decoder can drive the facial-animation decoder in synchronization. Bookmarks in the TTSI bitstream control an animated face through facial animation parameters (FAPs); in addition, the animation of the mouth can be derived directly from the speech phonemes. Other applications of the MPEG-4 TTSI include speech synthesis for avatars in virtual-reality (VR) applications, voice newspapers, and low-bitrate Internet voice tools.
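As a rough illustration of deriving mouth animation from phonemes, the sketch below maps each phoneme to a mouth shape. The phoneme classes and shape names are hypothetical; a real MPEG-4 terminal expresses such shapes through the standardized facial animation parameters.

# Hypothetical phoneme-to-mouth-shape mapping for "talking head" animation.
# The classes and shape names are invented; an MPEG-4 terminal would emit
# the corresponding standardized facial animation parameters instead.
MOUTH_SHAPES = {
    "closed": "pbm",     # lips together
    "rounded": "ouw",    # lips rounded
    "open": "a",         # jaw open
    "spread": "ei",      # lips spread
}

# Invert the table: one entry per phoneme character.
PHONEME_TO_SHAPE = {
    ph: shape for shape, phs in MOUTH_SHAPES.items() for ph in phs
}

def mouth_track(phonemes: list[str]) -> list[str]:
    """One mouth shape per phoneme; unknown phonemes get a neutral shape."""
    return [PHONEME_TO_SHAPE.get(ph, "neutral") for ph in phonemes]

print(mouth_track(list("meow")))  # ['closed', 'spread', 'rounded', 'rounded']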
Concepts in music synthesis
The field of music synthesis is too large and varied to give a complete overview here. An artistic history by Chadabe [4] and a technical overview by Roads [16] provide more background on the concepts developed over the last 35 years.

The techniques used in MPEG-4 for synthetic music transmission were originally developed by Mathews [13, 14], who demonstrated the first digital synthesis programs. The so-called unit-generator model of synthesis he developed has proven to be a robust and practical tool for musicians interested in the precise control of sound. This paradigm has been refined by many others, particularly Vercoe [26], whose language Csound is very popular with composers today.
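To give a flavor of the unit-generator idea, here is a minimal sketch in which small generator objects are patched together to produce sound. The class design is illustrative only; it is not the syntax of SAOL, Csound, or any Music-N language.

# Minimal unit-generator sketch: small signal-processing objects are
# patched together into a synthesis network. Illustrative design only.
import numpy as np

SR = 44100  # sample rate

class Oscillator:
    """Sine-wave unit generator."""
    def __init__(self, freq_hz: float):
        self.freq_hz = freq_hz
    def render(self, n: int) -> np.ndarray:
        t = np.arange(n) / SR
        return np.sin(2 * np.pi * self.freq_hz * t)

class Envelope:
    """Linear decay applied to another unit generator's output."""
    def __init__(self, source):
        self.source = source
    def render(self, n: int) -> np.ndarray:
        return self.source.render(n) * np.linspace(1.0, 0.0, n)

# Patch: oscillator -> envelope, composed in the classic Music-N style.
note = Envelope(Oscillator(freq_hz=440.0))
samples = note.render(SR)  # one second of a decaying 440 Hz tone
print(samples[:3], samples[-3:])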
In the unit-generator model (also called the Music-N model after Mathews