Synthetic and SNHC Audio in MPEG-4

Eric D. Scheirer

Machine Listening Group, MIT Media Laboratory
E15-401D, Cambridge MA 02143-4307 USA
Tel: +1 617 253 0112  Fax: +1 617 258 6264
eds@media.mit.edu

Youngjik Lee and Jae-Woo Yang

Switching and Transmission Technology Laboratories, ETRI
Abstract

In addition to its sophisticated audio-compression capabilities, MPEG-4 contains extensive functions supporting synthetic sound and the synthetic/natural hybrid coding of sound. We present an overview of the Structured Audio format, which allows efficient transmission and client-side synthesis of music and sound effects. We also describe the Text-to-Speech Interface, which standardizes a single format for communication with speech synthesizers. Finally, we discuss the AudioBIFS portion of the Binary Format for Scene Description, which allows the description of hybrid soundtracks, 3-D audio environments, and interactive audio programming. The tools provided for advanced audio functionality in MPEG-4 are a new and important addition to the world of audio standards.
Introduction
This article describes the parts of MPEG-4 that govern the compression, representation, and transmission of synthetic sound and the combination of synthetic and natural sound into hybrid soundtracks. Through these tools, MPEG-4 provides advanced capabilities for ultra-low-bitrate sound transmission, interactive sound scenes, and flexible, repurposable delivery of sound content.
We will discuss three MPEG-4 audio tools. The first, MPEG-4 Structured Audio, standardizes precise, efficient delivery of synthetic music and sound effects. The second, the MPEG-4 Text-to-Speech Interface, standardizes a representation protocol for synthesized speech, an interface to text-to-speech synthesizers, and the automatic synchronization of synthetic speech and "talking head" animated face graphics [24]. The third, MPEG-4 AudioBIFS, part of the main BIFS framework, standardizes terminal-side mixing and post-production of audio soundtracks [22]. AudioBIFS enables interactive soundtracks and 3-D sound presentation for virtual-reality applications. In MPEG-4, the capability to mix and synchronize real sound with synthetic sound is termed Synthetic/Natural Hybrid Coding of Audio, or SNHC Audio.
The paper is organized as follows. First, we provide a general overview of the objectives for synthetic and SNHC audio in MPEG-4; this section also introduces concepts from speech and music synthesis for readers whose primary expertise may not be in audio. Next, we give a detailed description of the synthetic-audio codecs in MPEG-4. Finally, we describe AudioBIFS and its use in the creation of SNHC audio soundtracks.
Synthetic Audio in MPEG-4: Concepts and Requirements
In this section, we introduce speech synthesis and music synthesis. Then we discuss the inclusion of these technologies in MPEG-4, focusing on the capabilities provided by synthetic audio and the types of applications that are better addressed with synthetic audio coding than with natural audio coding.

Relationship between natural and synthetic coding
Modern standards for natural audio coding [1, 2] use perceptual models to compress natural sound. In coding synthetic sound, perceptual models are not used; rather, very specific parametric models are used to transmit sound descriptions. The descriptions are received at the decoding terminal and converted into sound through real-time sound synthesis. The parametric model for the Text-to-Speech Interface is fixed in the standard; in the Structured Audio toolset, the model itself is transmitted as part of the bitstream and interpreted by a reconfigurable decoder.
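To make this distinction concrete, the following Python sketch renders a tiny parametric description into sound at a hypothetical decoding terminal. The description format, field names, and synthesis method are purely illustrative assumptions, not MPEG-4 syntax; in Structured Audio the bitstream would also carry the synthesis model itself rather than using a fixed one.

import numpy as np

SR = 16000  # sample rate of the decoding terminal (Hz); an assumption

# The "transmitted" description: a few numbers instead of thousands of samples.
description = {"freq_hz": 440.0, "dur_s": 0.5, "attack_s": 0.05}

def synthesize(desc, sr=SR):
    """Render a parametric sound description into audio samples."""
    n = int(desc["dur_s"] * sr)
    t = np.arange(n) / sr
    tone = np.sin(2 * np.pi * desc["freq_hz"] * t)
    # Linear attack and release so the note starts and ends cleanly.
    env = np.minimum(1.0, t / desc["attack_s"])
    env = env * np.minimum(1.0, (desc["dur_s"] - t) / desc["attack_s"])
    return tone * env

samples = synthesize(description)  # at the terminal, decoding is synthesis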
Natural and synthetic audio are not unrelated methods for transmitting sound. Especially as sound models in perceptual coding grow more sophisticated, the boundary between "decompression" and "synthesis" becomes somewhat blurred. Vercoe, Gardner, and Scheirer [28] have discussed the relationships among parametric models of sound, digital sound creation and transmission, perceptual coding, parametric compression, and various techniques for algorithmic synthesis.

Concepts in speech synthesis
Text-to-speech (TTS) systems generate speech from written text. This technology enables the translation of text information into speech so that the text can be delivered through speech channels such as telephone lines. Today, TTS systems are used in many applications, including automatic voice-response systems (the "telephone menu" systems that have recently become widespread), e-mail reading, and information services for the visually handicapped [9, 10].
TTS systems typically consist of multiple processing modules, as shown in Figure 1. Such a system accepts text as input and generates a corresponding phoneme sequence. Phonemes are the smallest sound units of human language; each phoneme corresponds to one sound used in speech. A surprisingly small set of phonemes, about 120, is sufficient to describe all human languages.
Figure 1: Block diagram of a text-to-speech system, showing the interaction between text-to-phoneme conversion, text understanding, and prosody generation and application.
The phoneme sequence is used in turn to generate a basic speech sequence without prosody, that is, without pitch, duration, and amplitude variations. In parallel, a text-understanding module analyzes the input for phrase structure and inflections. Using the result of this processing, a prosody generation module creates the proper prosody for the text. Finally, a prosody control module changes the prosody parameters of the basic speech sequence according to the results of the text-understanding module, yielding synthesized speech.
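As a rough illustration of this module structure, the sketch below wires toy stand-ins for the stages of Figure 1 together. Every function and data structure here is an invented simplification; real text-to-phoneme conversion, text understanding, and prosody generation are far more elaborate.

from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str
    pitch_hz: float = 120.0    # prosody fields start at neutral defaults
    duration_ms: float = 80.0
    amplitude: float = 1.0

def text_to_phonemes(text):
    # Toy letter-to-phoneme mapping; real systems use dictionaries and rules.
    return [Phoneme(ch) for ch in text.lower() if ch.isalpha()]

def generate_prosody(text):
    # Toy "text understanding": a question gets a rising final pitch.
    return {"final_rise": text.strip().endswith("?")}

def apply_prosody(phonemes, prosody):
    # The prosody-control step: modify the basic sequence's parameters.
    if prosody["final_rise"] and phonemes:
        phonemes[-1].pitch_hz *= 1.3  # raise pitch on the last phoneme
    return phonemes

text = "Is it synthetic?"
speech_sequence = apply_prosody(text_to_phonemes(text), generate_prosody(text))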
One of the first successful TTS systems was the DECtalk English speech synthesizer, developed in 1983 [11]. This system produces very intelligible speech and supports eight different speaking voices. However, developing speech synthesizers of this sort is difficult, since it is necessary to extract all the acoustic parameters used in synthesis; analyzing enough data to accumulate parameters for all kinds of speech is a painstaking process.
In 1992, CNET in France developed the pitch-synchronous overlap-and-add (PSOLA) method for controlling the pitch and phoneme durations of synthesized speech [25]. With this technique, the prosody of synthesized speech is easy to control, so PSOLA-based synthesis sounds more natural. PSOLA can also use human speech as a guide for the prosody of the synthesis, in an analysis-synthesis process that can modify tone and duration as well. However, if the tone is changed too much, the resulting speech is easily recognized as artificial.
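The sketch below conveys the core overlap-add idea under heavy simplifying assumptions: a synthetic test tone stands in for voiced speech, pitch marks are uniformly spaced, and only pitch (not duration) is modified. It is a toy illustration of the principle, not the PSOLA algorithm of [25].

import numpy as np

def psola(signal, pitch_marks, pitch_scale):
    # Windowed two-period segments centered on pitch marks are re-spaced
    # and overlap-added; closer spacing raises the pitch.
    period = int(np.mean(np.diff(pitch_marks)))   # original pitch period
    new_period = int(period / pitch_scale)        # output mark spacing
    out = np.zeros(len(signal) + period)
    win = np.hanning(2 * period)                  # two-period analysis window
    t_out = pitch_marks[0]
    for m in pitch_marks:
        seg = signal[max(0, m - period): m + period]
        if len(seg) < 2 * period:
            continue  # skip segments clipped at the signal edges
        out[t_out - period: t_out + period] += seg * win
        t_out += new_period
    return out

sr = 16000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 100 * t)              # 100 Hz "voiced" test tone
marks = np.arange(160, len(voiced) - 160, 160)    # one mark per 10 ms period
higher = psola(voiced, marks, pitch_scale=1.2)    # raise pitch about 20%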
In 1996, ATR in Japan developed the CHATR speech synthesizer [10]. This method relies on short samples of human speech, used without modification of their characteristics; it locates and sequences phonemes, words, or phrases from a database. A large database of human speech is necessary to develop a TTS system using this method. Automatic tools may be used to label each phoneme of the human speech to reduce development time; typically, hidden Markov models (HMMs) are used to align the best phoneme candidates to the target speech. The synthesized speech is very intelligible and natural; however, this method of TTS requires large amounts of memory and processing power.
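A toy sketch of the selection step in this style of synthesis appears below. The unit database, cost terms, and greedy search are all invented for illustration; real unit-selection systems search with dynamic programming over much richer features.

database = {  # phoneme -> candidate units recorded at different pitches (Hz)
    "a": [{"pitch": 110}, {"pitch": 130}],
    "t": [{"pitch": 100}, {"pitch": 125}],
}

def select_units(targets):
    """targets: list of (phoneme, desired_pitch) pairs."""
    chosen, prev = [], None
    for phon, want in targets:
        def cost(u):
            # Target cost: mismatch with the desired prosody.
            target_cost = abs(u["pitch"] - want)
            # Concatenation cost: audible jump from the previous unit.
            concat_cost = abs(u["pitch"] - prev["pitch"]) if prev else 0.0
            return target_cost + concat_cost
        best = min(database[phon], key=cost)
        chosen.append(best)
        prev = best
    return chosen

units = select_units([("t", 120), ("a", 115)])  # sequence the best samples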
The applications of TTS are expanding in telecommunications, personal computing, and the Internet. Current research in TTS includes voice conversion (synthesizing the sound of a particular speaker's voice), multi-language TTS, and enhancing the naturalness of speech through more sophisticated voice models and prosody generators.

Applications for speech synthesis in MPEG-4
The synthetic speech system in MPEG-4 was designed to support interactive applications using text as the basic content type. Some of these applications include on-demand storytelling, motion-picture dubbing, and "talking head" synthetic videoconferencing.
In the story-telling-on-demand (STOD) application, the user selects a story from a huge database stored on fixed media. The STOD system reads the story aloud, using the MPEG-4 Text-to-Speech Interface (henceforth, TTSI) with the MPEG-4 facial-animation tool or with appropriately selected images. The user can pause and resume the reading at any moment through the user interface of the local machine (for example, mouse or keyboard), and can also select the gender, age, and speech rate of the electronic storyteller.
In a motion-picture-dubbing application, synchronization between the MPEG-4 TTSI decoder and the encoded moving picture is the essential feature. The architecture of the MPEG-4 TTSI decoder provides several levels of synchronization granularity. Coarse synchronization is easily achieved by aligning the composition time of each sentence. For more finely tuned synchronization, information about the speaker's lip shape can be used. The finest granularity is achieved through detailed prosody transmission and video-related information such as sentence duration and offset time within the sentence. With this synchronization capability, the MPEG-4 TTSI can be used for motion-picture dubbing by following the lip shape and the corresponding time in the sentence.
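To illustrate these granularity levels, the sketch below computes absolute playout times for phonemes from a sentence-level composition time plus per-phoneme offsets. The data layout and field names are invented for the example and do not reflect the actual MPEG-4 TTSI bitstream syntax.

sentence = {
    "composition_time_ms": 12000,   # coarse: when the sentence starts
    "duration_ms": 1800,            # finer: how long it should last
    "phonemes": [
        {"symbol": "h", "offset_ms": 0},
        {"symbol": "i", "offset_ms": 90},  # finest: offsets in the sentence
    ],
}

def phoneme_schedule(sent):
    """Absolute playout time of each phoneme, for lip-sync with video."""
    start = sent["composition_time_ms"]
    return [(p["symbol"], start + p["offset_ms"]) for p in sent["phonemes"]]

print(phoneme_schedule(sentence))  # [('h', 12000), ('i', 12090)]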
To enable synthetic videoconferencing, the TTSI decoder can drive the facial-animation decoder in synchronization. Bookmarks in the TTSI bitstream control an animated face through facial animation parameters (FAPs); in addition, the animation of the mouth can be derived directly from the speech phonemes. Other applications of the MPEG-4 TTSI include speech synthesis for avatars in virtual-reality (VR) applications, voice newspapers, and low-bitrate Internet voice tools.

Concepts in music synthesis
The field of music synthesis is too large and varied to give a complete overview here. An artistic history by Chadabe [4] and a technical overview by Roads [16] provide more background on the concepts developed over the last 35 years.
The techniques used in MPEG-4 for synthetic music transmission were originally developed by Mathews [13, 14], who demonstrated the first digital synthesis programs. The so-called unit-generator model of synthesis he developed has proven to be a robust and practical tool for musicians interested in the precise control of sound. This paradigm has been refined by many others, particularly Vercoe [26], whose language Csound is very popular with composers today.
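To give a feel for the paradigm before describing it further, here is a minimal Python sketch of two unit generators wired into a tiny instrument. The primitives are loosely modeled on Music-N-style oscillators and envelopes; the names and signatures are illustrative, not those of any Music-N language or of MPEG-4 Structured Audio.

import numpy as np

SR = 16000  # rendering sample rate (Hz); an assumption

def oscil(freq, dur):
    """Sine-oscillator unit generator."""
    t = np.arange(int(dur * SR)) / SR
    return np.sin(2 * np.pi * freq * t)

def linen(sig, attack, release):
    """Linear attack/release envelope unit generator."""
    n = len(sig)
    env = np.ones(n)
    a, r = int(attack * SR), int(release * SR)
    env[:a] = np.linspace(0.0, 1.0, a)
    env[n - r:] = np.linspace(1.0, 0.0, r)
    return sig * env

# An "instrument" is a network of unit generators; a score event plays it.
note = linen(oscil(freq=440, dur=1.0), attack=0.05, release=0.2)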
In the unit-generator model (also called the Music-N model after Mathews