<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Synthetic Visual Object Coding in MPEG-4</TITLE>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
</HEAD>
<BODY><B><FONT size=5>
<P align=center><A name=_Ref433880384></A>Face and 2-D Mesh Animation in
MPEG-4</P></FONT>
<OL type=A><FONT size=5></FONT><FONT size=4>
<LI>Murat Tekalp and J&ouml;rn Ostermann </FONT></LI></OL></B>
<P align=justify><FONT size=2><B>Keywords</B></FONT><FONT size=2>: MPEG-4, face
animation, computer graphics, deformation, VRML, speech synthesizer, electronic
commerce</FONT></P><FONT size=4><B>
<P align=center>Abstract</P></B></FONT><FONT size=2>
<P align=justify>This paper presents an overview of some of the synthetic visual
objects supported by MPEG-4 version-1, namely animated faces and animated
arbitrary 2-D uniform and Delaunay meshes. We discuss both specification and
compression of face animation and 2D-mesh animation in MPEG-4. Face animation
allows the animation of either a proprietary face model or a face model downloaded to the
decoder. We also address integration of the face animation tool with the
text-to-speech interface (TTSI), so that face animation can be driven by text
input.</P></FONT>
<OL><FONT size=2></FONT><FONT size=4><B>
<LI>Introduction </B></FONT><FONT size=2>
<P align=justify>MPEG-4 is an object-based multimedia compression standard,
which allows for encoding of different audio-visual objects (AVO) in the scene
independently. The visual objects may have natural or synthetic content,
including arbitrary shape <I>video objects</I>, special synthetic objects such
as human face and body, and generic 2-D/3-D objects composed of primitives
like rectangles, spheres, or indexed face sets, which define an object surface
by means of vertices and surface patches. The synthetic visual objects are
animated by transforms and special purpose animation techniques, such as
face/body animation and 2D-mesh animation. MPEG-4 also provides synthetic
audio tools such as structured audio tools and a text-to-speech interface
(TTSI). This paper presents a detailed overview of synthetic visual objects
supported by MPEG-4 version-1, namely animated faces and animated arbitrary
2-D uniform and Delaunay meshes. We also address integration of the face
animation tool with the TTSI, so that face animation can be driven by text
input. Body animation and 3-D mesh compression and animation will be supported
in MPEG-4 version-2, and hence are not covered in this article.</P>
<P align=justify>The representation of synthetic visual objects in MPEG-4 is
based on the prior VRML standard [13][12][11] using nodes such as
<I>Transform</I>, which defines rotation, scale or translation of an object,
and <I>IndexedFaceSet</I> describing 3-D shape of an object by an indexed face
set. However, MPEG-4 is the first international standard that specifies a
compressed binary representation of animated synthetic audio-visual objects.
It is important to note that MPEG-4 only specifies the decoding of compliant
bit streams in an MPEG-4 terminal. The encoders do enjoy a large degree of
freedom in how to generate MPEG-4 compliant bit streams. Decoded audio-visual
objects can be composed into 2D and 3D scenes using the Binary Format for
Scenes (BIFS) [13], which also allows implementation of animation of objects
and their properties using the BIFS-Anim node. We refer readers
to the accompanying article on BIFS for details of the implementation of
BIFS-Anim. Compression of still textures (images) for mapping onto 2D or 3D
meshes is also covered in another accompanying article. In the following, we
cover the specification and compression of face animation and 2D-mesh
animation in Sections 2 and 3, respectively.</P></FONT><FONT size=4><B>
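</B></FONT><FONT size=2>
<P align=justify>To make these node types concrete, the following Python
sketch (our own illustration, not VRML or BIFS syntax) shows how an indexed
face set defines a surface from a shared vertex list, and how a
Transform-style rotation and translation moves that surface without touching
its topology.</P><PRE>
import math

# A minimal indexed face set: a shared vertex list plus faces that
# index into it (a tetrahedron, purely for illustration).
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
            (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]

def transform(v, angle_y=0.0, translation=(0.0, 0.0, 0.0)):
    """Rotate a vertex about the Y axis, then translate it, in the
    spirit of the rotation/scale/translation of a Transform node."""
    x, y, z = v
    c, s = math.cos(angle_y), math.sin(angle_y)
    x, z = c * x + s * z, -s * x + c * z
    tx, ty, tz = translation
    return (x + tx, y + ty, z + tz)

# Only vertex positions change; 'faces' still indexes the same
# vertices, so the surface description is reused unchanged.
moved = [transform(v, angle_y=math.pi / 4, translation=(0.0, 0.0, 5.0))
         for v in vertices]
</PRE></FONT><FONT size=4><B>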
<LI>Face Animation </B></FONT><FONT size=2>
<P align=justify>MPEG-4 foresees that talking heads will play an important
role in future customer service applications. For example, a customized agent
model can be defined for games or web-based customer service applications. To
this effect, MPEG-4 enables integration of face animation with multimedia
communications and presentations and allows face animation over low bit rate
communication channels, for point to point as well as multi-point connections
with low delay. With AT&amp;T's implementation of an MPEG-4 face animation
system, we can animate face models at data rates of 300-2000 bit/s. In
many applications like Electronic Commerce, the integration of face animation
and a text-to-speech (TTS) synthesizer is of special interest. MPEG-4 defines an
application program interface for the TTS synthesizer. Using this interface, the
synthesizer can be used to provide phonemes and related timing information to
the face model. The phonemes are converted into corresponding mouth shapes
enabling simple talking head applications. Adding facial expressions to the
talking head is achieved using bookmarks in the text. This integration allows
for animated talking heads driven just by one text stream at a data rate of
less than 200 bits/s [22]. Subjective tests reported in [26] show that an
Electronic Commerce web site with talking faces gets higher ratings than the
same web site without talking faces. In an amendment to the standard foreseen
in 2000, MPEG-4 will add body animation to its tool set thus allowing the
standardized animation of complete human bodies.</P>
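<P align=justify>To put these rates in perspective, the short sketch below
works out the per-frame bit budget they imply; the 25 frames/s update rate is
an assumption made for illustration, not a figure from the standard.</P><PRE>
# Per-frame bit budget for the FAP and text-driven rates quoted above.
FRAME_RATE = 25.0  # assumed animation update rate (frames/s)

for bits_per_second in (200.0, 300.0, 2000.0):
    bits_per_frame = bits_per_second / FRAME_RATE
    print("%6.0f bit/s -> %4.0f bits per frame"
          % (bits_per_second, bits_per_frame))

# 300-2000 bit/s leaves roughly 12-80 bits per frame for the active
# FAPs; a purely text-driven talking head needs only the text stream.
</PRE>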
<P align=justify>In the following sections, we describe how to specify and
animate 3D face models, compress facial animation parameters, and integrate
face animation with TTS in MPEG-4. The MPEG-4 standard allows using
proprietary 3D face models that are resident at the decoder as well as
transmission of face models such that the encoder can predict the quality of
the presentation at the decoder. In Section 2.1, we explain how MPEG-4
specifies a 3D face model and its animation using face definition parameters
(FDP) and facial animation parameters (FAP), respectively. Section 2.2
provides details on how to efficiently encode FAPs. The integration of face
animation into an MPEG-4 terminal with text-to-speech capabilities is shown in
Section 2.3. In Section 2.4, we describe briefly the integration of face
animation with MPEG-4 systems. MPEG-4 profiles with respect to face animation
are explained in Section 2.5.</P></FONT>
<OL><FONT size=2></FONT><B>
<LI>Specification <A name=_Ref416252863>and Animation</A> of Faces </B><FONT
size=2>
<P align=justify>MPEG-4 specifies a face model in its neutral state, a
number of feature points on this neutral face as reference points, and a set
of FAPs, each corresponding to a particular facial action deforming a face
model in its neutral state. Deforming a neutral face model according to some
specified FAP values at each time instant generates a facial animation
sequence. The FAP value for a particular FAP indicates the magnitude of the
corresponding action, e.g., a big versus a small smile or deformation of a
mouth corner. For an MPEG-4 terminal to interpret the FAP values using its
face model, it has to have predefined model specific animation rules to
produce the facial action corresponding to each FAP. The terminal can either
use its own animation rules or download a face model and the associated face
animation tables (FAT) to have a customized animation behavior. Since the
FAPs are required to animate faces of different sizes and proportions, the
FAP values are defined in face animation parameter units (FAPU). The FAPU
are computed from spatial distances between major facial features on the
model in its neutral state.</P>
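<P align=justify>The deformation rule can be pictured as follows: each FAP
moves a feature point (and, in a full model, the mesh vertices attached to
it) by the received FAP value, scaled by the appropriate FAPU, along the
direction defined for that FAP. The Python sketch below is a simplified
reading of this mechanism; the feature-point coordinates, the FAPU value, and
the single-point rule table are illustrative assumptions, and a face
animation table would replace the one direction vector with per-vertex
displacements.</P><PRE>
# Minimal sketch of FAP-driven deformation, assuming per-FAP rules of
# the form "move this feature point along a fixed direction, scaled
# by a FAPU".  All numeric values are illustrative.
neutral = {"bottom_of_chin": (0.0, -60.0, 10.0)}

# Hypothetical rule table: FAP number -> (feature point, FAPU value,
# unit direction of positive motion).  FAP 3 is open_jaw.
rules = {3: ("bottom_of_chin", 0.3, (0.0, -1.0, 0.0))}

def apply_fap(fap_number, fap_value, points):
    point, fapu, direction = rules[fap_number]
    x, y, z = points[point]
    dx, dy, dz = (fap_value * fapu * d for d in direction)
    return {**points, point: (x + dx, y + dy, z + dz)}

# A positive FAP 3 value opens the jaw by moving the chin downward.
deformed = apply_fap(3, fap_value=200, points=neutral)
</PRE>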
<P align=justify>In the following, we first describe what MPEG-4 considers
to be a generic face model in its neutral state and the associated feature
points. Then, we explain the facial animation parameters for this generic
model. Finally, we show how to define MPEG-4 compliant face models that can
be transmitted from the encoder to the decoder for animation.</P>
<OL>
<LI><B><A name=_Ref447559909>MPEG-4 Face Model in Neutral State</A></B>
</LI></OL></FONT></LI></OL></LI></OL><FONT size=2>
<P align=justify>As the first step, MPEG-4 defines a generic face model in its
neutral state by the following properties (see Figure 1):</P>
<UL>
<LI>gaze is in the direction of the Z axis, </LI>
<LI>all face muscles are relaxed, </LI>
<LI>eyelids are tangent to the iris, </LI>
<LI>the pupil is one third of the diameter of the iris, </LI>
<LI>lips are in contact; the line of the lips is horizontal and at the same
height as the lip corners, </LI>
<LI>the mouth is closed and the upper teeth touch the lower ones, </LI>
<LI>the tongue is flat and horizontal, with the tip of the tongue touching the
boundary between upper and lower teeth. </LI></UL>
<P align=justify>The FAPUs and the feature points used to derive them are
defined next with respect to the face in its neutral state. </P>
<P align=center><IMG height=243
src="(7)Face and 2-D Mesh Animation in MPEG-4.files/Image19.gif" width=181></P>
<DIR><B>
<P align=justify><A name=_Ref416425989>Figure 1</A>: A face model in its neutral
state and the feature points used to define FAP units (FAPU). Fractions of
distances between the marked key features are used to define FAPU (from
[14]).</P></B></DIR><B>
<OL>
<OL>
<LI><A name=_Ref416255308></A>Face Animation Parameter Units
<P align=justify>In order to define face animation parameters for arbitrary
face models, MPEG-4 defines FAPUs that serve to scale facial animation
parameters for any face model. FAPUs are defined as fractions of distances
between key facial features (see Figure 1). These features, such as eye
separation, are defined on a face model in its neutral state. The FAPUs allow
interpretation of the FAPs on any facial model in a consistent way, producing
reasonable results in terms of expression and speech pronunciation. The
measurement units are shown in Table 1.</P>
<P align=justify><A name=_Ref416295987>Table 1</A>: Facial Animation
Parameter Units and their definitions.</P>
<TABLE cellSpacing=1 cellPadding=7 width=590 border=1>
<TBODY>
<TR>
<TD vAlign=top width="18%">
<P align=justify>IRISD0</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Iris diameter (by definition it is equal to the
distance between the upper and lower eyelid) in neutral face</P></TD>
<TD vAlign=top width="25%">
<P align=justify>IRISD = IRISD0 / 1024</P></TD></TR>
<TR>
<TD vAlign=top width="18%">
<P align=justify>ES0</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Eye separation</P></TD>
<TD vAlign=top width="25%">
<P align=justify>ES = ES0 / 1024</P></TD></TR>
<TR>
<TD vAlign=top width="18%">
<P align=justify>ENS0</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Eye - nose separation</P></TD>
<TD vAlign=top width="25%">
<P align=justify>ENS = ENS0 / 1024</P></TD></TR>
<TR>
<TD vAlign=top width="18%">
<P align=justify>MNS0</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Mouth - nose separation</P></TD>
<TD vAlign=top width="25%">
<P align=justify>MNS = MNS0 / 1024</P></TD></TR>
<TR>
<TD vAlign=top width="18%">
<P align=justify>MW0</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Mouth width</P></TD>
<TD vAlign=top width="25%">
<P align=justify>MW=MW0 / 1024</P></TD></TR>
<TR>
<TD vAlign=top width="18%">
<P align=justify>AU</P></TD>
<TD vAlign=top width="57%">
<P align=justify>Angle unit</P></TD>
<TD vAlign=top width="25%">
<P align=justify>10<SUP>-5</SUP> rad</P></TD></TR></TBODY></TABLE>
<LI><A name=_Ref416507376>Feature Points</A> </LI></OL></OL></B>
<P align=justify>MPEG-4 specifies 84 feature points on the neutral face (see
Figure 2). The main purpose of these feature points is to provide spatial
references for defining FAPs. Some feature points such as the ones along the
hairline are not affected by FAPs. However, they are required for defining the
shape of a proprietary face model using feature points (Section 2.1.3). Feature
points are arranged in groups like cheeks, eyes, and mouth. The location of
these feature points has to be known for any MPEG-4 compliant face model. The
feature points on the model should be located according to Figure 2 and the
hints given in Table 6 in the Annex of this paper.</P>
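<P align=justify>Once the feature points of a model are placed, the FAPUs of
Table 1 follow directly from distances between them, as the sketch below
shows. The feature-point names and coordinates are illustrative assumptions;
only the division by 1024 is taken from Table 1.</P><PRE>
import math

def distance(p, q):
    """Euclidean distance between two 3-D feature points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical feature-point coordinates on a model in its neutral
# state; descriptive names stand in for MPEG-4 feature-point indices.
fp = {
    "left_eye_center":  (-33.0,  30.0,  0.0),
    "right_eye_center": ( 33.0,  30.0,  0.0),
    "nose_bottom":      (  0.0,   5.0, 18.0),
    "mouth_middle":     (  0.0, -20.0, 12.0),
    "mouth_left":       (-25.0, -20.0,  8.0),
    "mouth_right":      ( 25.0, -20.0,  8.0),
}

# FAPUs per Table 1: fractions (1/1024) of neutral-face distances.
ES  = distance(fp["left_eye_center"], fp["right_eye_center"]) / 1024
ENS = distance(fp["left_eye_center"], fp["nose_bottom"]) / 1024
MNS = distance(fp["nose_bottom"], fp["mouth_middle"]) / 1024
MW  = distance(fp["mouth_left"], fp["mouth_right"]) / 1024
AU  = 1e-5  # angle unit, in radians
</PRE>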
<P align=center><IMG height=805
src="(7)Face and 2-D Mesh Animation in MPEG-4.files/Image20.gif"
width=488></P><B>
<P align=center><A name=_Ref416508759>Figure 2</A>: Feature points may be used
to define the shape of a proprietary face model. The facial animation parameters
are defined by motion of some of these feature points (from
[14]).</P></B>
<P><B><A name=_Ref435873549>Face Animation Parameters</A></B></P>
<P align=justify>The FAPs are based on the study of minimal perceptible actions
and are closely related to muscle actions [2][4][9][10]. The 68 parameters are
categorized into 10 groups related to parts of the face (Table 2). FAPs
represent a complete set of basic facial actions including head motion, tongue,
eye, and mouth control. They allow representation of natural facial expressions
(see Table 7 in the Annex). For each FAP, the standard defines the appropriate
FAPU, FAP group, direction of positive motion and whether the motion of the
feature point is unidirectional (see FAP 3, open jaw) or bi-directional (see FAP
48, head pitch). FAPs can also be used to define facial action units [19].
Exaggerated amplitudes permit the definition of actions that are normally not
possible for humans, but are desirable for cartoon-like characters.</P>
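<P align=justify>These per-FAP attributes can be pictured as rows of a lookup
table. The sketch below encodes the two FAPs cited above; the attribute
values reflect our reading of the FAP table in [14] and are shown for
illustration, not as a normative extract.</P><PRE>
from dataclasses import dataclass

@dataclass
class FapDefinition:
    """One row of the FAP table: number, name, FAP group,
    measurement unit, and the direction conventions."""
    number: int
    name: str
    group: int           # 1..10, see Table 2
    unit: str            # a FAPU from Table 1
    bidirectional: bool  # False: unidirectional motion only
    positive_motion: str

# The two examples cited in the text (values illustrative).
FAPS = [
    FapDefinition(3, "open_jaw", group=2, unit="MNS",
                  bidirectional=False, positive_motion="down"),
    FapDefinition(48, "head_pitch", group=7, unit="AU",
                  bidirectional=True, positive_motion="down"),
]
</PRE>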
<P align=justify>The FAP set contains two high-level parameters, visemes and
expressions (FAP group 1). A viseme (FAP 1) is a visual correlate to a phoneme.
Only 14 clearly distinguishable static visemes are included in the standard
set (Table 3). Due to the coarticulation of speech and mouth movement [5],
the shape of the mouth of a speaking human is influenced not only by the
current phoneme, but also by the previous and the next phoneme. To account
for this, MPEG-4 defines the transition from one viseme to the next by
blending the two visemes with a weighting factor. So far, it is not clear
whether this is sufficient for high-quality visual speech animation.</P>
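<P align=justify>A viseme transition under this scheme reduces to a weighted
average of two mouth shapes. In the sketch below each viseme is represented
as a set of feature-point displacements with a normalized blend weight; both
the displacement values and this representation are illustrative assumptions,
and the standard quantizes its blend factor differently.</P><PRE>
# Blend two visemes, each modeled here as displacements of mouth
# feature points from the neutral face.  Values are illustrative.
viseme_p = {"mouth_middle": (0.0, -2.0, 0.0), "mouth_left": (1.0, 0.0, 0.0)}
viseme_a = {"mouth_middle": (0.0, -8.0, 0.0), "mouth_left": (3.0, 0.0, 0.0)}

def blend_visemes(v1, v2, weight):
    """Weighted average of two viseme displacement sets;
    weight in [0, 1], where 1.0 selects v1 and 0.0 selects v2."""
    out = {}
    for point in v1:
        out[point] = tuple(weight * a + (1.0 - weight) * b
                           for a, b in zip(v1[point], v2[point]))
    return out

# Halfway through a transition from viseme "p" to viseme "a":
mouth = blend_visemes(viseme_p, viseme_a, weight=0.5)
</PRE>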
<P align=justify>The expression parameter FAP 2 defines the six primary facial
expressions (Table 4, Figure 3). In contrast to visemes, facial expressions are
animated by a value defining the excitation of the expression. Two facial
expressions can be animated simultaneously, with an amplitude in the range
[0, 63] defined for each expression. The facial expression parameter values are
defined by textual descriptions. The expression parameters provide an efficient
means of animating faces: they are high-level animation parameters that a face
model designer creates for each face model. Since they are designed as complete
expressions, they allow animating unknown models with high subjective
quality [21][22].</P>
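<P align=justify>Animating two simultaneous expressions can likewise be read
as a superposition, with each amplitude normalized by its maximum of 63. The
sketch below is our interpretation of that combination, again with
illustrative displacement values.</P><PRE>
# Superpose two simultaneously active expressions, each scaled by
# its amplitude (0..63).  Like the visemes above, expressions are
# modeled as feature-point displacements designed by the model author.
joy      = {"mouth_left": ( 2.0,  1.5, 0.0), "mouth_right": (-2.0,  1.5, 0.0)}
surprise = {"mouth_left": ( 0.0, -1.0, 0.0), "mouth_right": ( 0.0, -1.0, 0.0)}

def mix_expressions(e1, amp1, e2, amp2):
    """Add the two expressions, each weighted by amplitude/63
    (the normalization by 63 is our assumption)."""
    out = {}
    for point in e1:
        out[point] = tuple((amp1 / 63.0) * a + (amp2 / 63.0) * b
                           for a, b in zip(e1[point], e2[point]))
    return out

face = mix_expressions(joy, 40, surprise, 20)
</PRE>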
<P align=justify>Using FAP 1 and FAP 2 together with the low-level FAPs 3-68
that affect the same areas may result in unexpected visual representations of
the face. Generally, the low-level FAPs have priority over deformations caused
by FAP 1 or 2. When specifying an expression with FAP 2, the encoder may send
an init_face bit that deforms the neutral face of the model.</P>