<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD><TITLE>Synthetic Visual Object Coding in MPEG-4</TITLE>
<META http-equiv=Content-Type content="text/html; charset=windows-1252"></HEAD>
<BODY><B><FONT size=5>
<P align=center><A name=_Ref433880384></A>Face and 2-D Mesh Animation in 
MPEG-4</P></FONT>
<OL type=A><FONT size=4>
  <LI>Murat Tekalp and J&ouml;rn Ostermann </FONT></LI></OL></B>
<P align=justify><FONT size=2><B>Keywords</B></FONT><FONT size=2>: MPEG-4, face 
animation, computer graphics, deformation, VRML, speech synthesizer, electronic 
commerce</FONT></P><FONT size=4><B>
<P align=center>Abstract</P></B></FONT><FONT size=2>
<P align=justify>This paper presents an overview of some of the synthetic visual 
objects supported by MPEG-4 version-1, namely animated faces and animated 
arbitrary 2-D uniform and Delaunay meshes. We discuss both specification and 
compression of face animation and 2D-mesh animation in MPEG-4. Face animation 
allows a terminal to animate either a proprietary face model or a face model 
downloaded to the decoder. We also address integration of the face animation tool with the 
text-to-speech interface (TTSI), so that face animation can be driven by text 
input.</P></FONT>
<OL><FONT size=2></FONT><FONT size=4><B>
  <LI>Introduction </B></FONT><FONT size=2>
  <P align=justify>MPEG-4 is an object-based multimedia compression standard, 
  which allows for encoding of different audio-visual objects (AVO) in the scene 
  independently. The visual objects may have natural or synthetic content, 
  including arbitrary shape <I>video objects</I>, special synthetic objects such 
  as human face and body, and generic 2-D/3-D objects composed of primitives 
  like rectangles, spheres, or indexed face sets, which define an object surface 
  by means of vertices and surface patches. The synthetic visual objects are 
  animated by transforms and special purpose animation techniques, such as 
  face/body animation and 2D-mesh animation. MPEG-4 also provides synthetic 
  audio tools such as structured audio tools and a text-to-speech interface 
  (TTSI). This paper presents a detailed overview of synthetic visual objects 
  supported by MPEG-4 version-1, namely animated faces and animated arbitrary 
  2-D uniform and Delaunay meshes. We also address integration of the face 
  animation tool with the TTSI, so that face animation can be driven by text 
  input. Body animation and 3-D mesh compression and animation will be supported 
  in MPEG-4 version-2, and hence are not covered in this article.</P>
  <P align=justify>The representation of synthetic visual objects in MPEG-4 is 
  based on the prior VRML standard [13][12][11] using nodes such as 
  <I>Transform</I>, which defines rotation, scale or translation of an object, 
  and <I>IndexedFaceSet</I>, which describes the 3-D shape of an object by an 
  indexed face set. However, MPEG-4 is the first international standard that 
  specifies a compressed binary representation of animated synthetic 
  audio-visual objects. It is important to note that MPEG-4 specifies only the 
  decoding of compliant bit streams in an MPEG-4 terminal; encoders enjoy a 
  large degree of freedom in how they generate MPEG-4 compliant bit streams. 
  Decoded audio-visual objects can be composed into 2D and 3D scenes using the 
  Binary Format for Scenes (BIFS) [13], which also allows animation of objects 
  and their properties using the BIFS-Anim node. We refer readers to an 
  accompanying article on BIFS for the details of BIFS-Anim. Compression of 
  still textures (images) for mapping onto 2D or 3D 
  meshes is also covered in another accompanying article. In the following, we 
  cover the specification and compression of face animation and 2D-mesh 
  animation in Sections 2 and 3, respectively.</P></FONT><FONT size=4><B>
  <LI>Face Animation </B></FONT><FONT size=2>
  <P align=justify>MPEG-4 foresees that talking heads will play an important 
  role in future customer service applications. For example, a customized agent 
  model can be defined for games or web-based customer service applications. To 
  this effect, MPEG-4 enables integration of face animation with multimedia 
  communications and presentations and allows face animation over low bit rate 
  communication channels, for point-to-point as well as multi-point connections 
  with low delay. With AT&amp;T's implementation of an MPEG-4 face animation 
  system, we can animate face models at data rates of 300-2000 bits/s. In 
  many applications such as Electronic Commerce, the integration of face 
  animation and a text-to-speech (TTS) synthesizer is of special interest. 
  MPEG-4 defines an application program interface for the TTS synthesizer. 
  Using this interface, the 
  synthesizer can be used to provide phonemes and related timing information to 
  the face model. The phonemes are converted into corresponding mouth shapes 
  enabling simple talking head applications. Adding facial expressions to the 
  talking head is achieved using bookmarks in the text. This integration allows 
  for animated talking heads driven just by one text stream at a data rate of 
  less than 200 bits/s [22]. Subjective tests reported in [26] show that an 
  Electronic Commerce web site with talking faces gets higher ratings than the 
  same web site without talking faces. In an amendment to the standard foreseen 
  in 2000, MPEG-4 will add body animation to its tool set, thus allowing the 
  standardized animation of complete human bodies.</P>
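  <P align=justify><FONT size=2>To make the bookmark mechanism concrete, the 
  following sketch separates a single text stream into text for the speech 
  synthesizer and expression events for the face renderer. The bookmark syntax 
  and all names here are illustrative assumptions, not the escape sequences 
  that MPEG-4 itself defines for embedding bookmarks in TTS text.</FONT></P>
  <PRE>
import re

# Hypothetical bookmark syntax "{expression joy 40}"; the real MPEG-4
# TTS interface defines its own escape sequences for this purpose.
BOOKMARK = re.compile(r"\{expression (\w+) (\d+)\}")

def split_text_stream(text):
    """Separate spoken text from expression events.

    Returns the cleaned text for the TTS synthesizer plus a list of
    (char_position, expression, intensity) events that the face
    renderer applies when speech reaches that position.
    """
    events, clean = [], []
    pos = 0   # position within the cleaned (spoken) text
    last = 0
    for m in BOOKMARK.finditer(text):
        clean.append(text[last:m.start()])
        pos += m.start() - last
        events.append((pos, m.group(1), int(m.group(2))))
        last = m.end()
    clean.append(text[last:])
    return "".join(clean), events

# Example: one low-rate text stream drives both speech and expression.
text, events = split_text_stream(
    "Welcome to our shop! {expression joy 40}We have great offers.")
  </PRE>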
  <P align=justify>In the following sections, we describe how to specify and 
  animate 3D face models, compress facial animation parameters, and integrate 
  face animation with TTS in MPEG-4. The MPEG-4 standard allows the use of 
  proprietary 3D face models resident at the decoder, as well as the 
  transmission of face models, so that the encoder can predict the quality of 
  the presentation at the decoder. In Section 2.1, we explain how MPEG-4 
  specifies a 3D face model and its animation using face definition parameters 
  (FDP) and facial animation parameters (FAP), respectively. Section 2.2 
  provides details on how to efficiently encode FAPs. The integration of face 
  animation into an MPEG-4 terminal with text-to-speech capabilities is shown in 
  Section 2.3. In Section 2.4, we describe briefly the integration of face 
  animation with MPEG-4 systems. MPEG-4 profiles with respect to face animation 
  are explained in Section 2.5.</P></FONT>
  <OL><FONT size=2></FONT><B>
    <LI>Specification <A name=_Ref416252863>and Animation</A> of Faces </B><FONT 
    size=2>
    <P align=justify>MPEG-4 specifies a face model in its neutral state, a 
    number of feature points on this neutral face as reference points, and a set 
    of FAPs, each corresponding to a particular facial action deforming a face 
    model in its neutral state. Deforming a neutral face model according to some 
    specified FAP values at each time instant generates a facial animation 
    sequence. The FAP value for a particular FAP indicates the magnitude of the 
    corresponding action, e.g., a big versus a small smile or deformation of a 
    mouth corner. For an MPEG-4 terminal to interpret the FAP values using its 
    face model, it must have predefined, model-specific animation rules to 
    produce the facial action corresponding to each FAP. The terminal can either 
    use its own animation rules or download a face model and the associated face 
    animation tables (FAT) to have a customized animation behavior. Since the 
    FAPs are required to animate faces of different sizes and proportions, the 
    FAP values are defined in face animation parameter units (FAPU). The FAPU 
    are computed from spatial distances between major facial features on the 
    model in its neutral state.</P>
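    <P align=justify><FONT size=2>As a minimal sketch of this mechanism, assume 
    a simplified animation rule in which each FAP moves a set of model vertices 
    along fixed directions, scaled by the transmitted FAP value and the model's 
    own FAPU. The table format and names below are illustrative, not the face 
    animation table (FAT) syntax of the standard.</FONT></P>
    <PRE>
def apply_fap(vertices, rule, fap_value, fapu):
    """Deform a neutral-state model in place for one FAP.

    vertices  -- dict: vertex id -> [x, y, z] neutral position
    rule      -- dict: vertex id -> (dx, dy, dz) unit displacement
    fap_value -- transmitted FAP amplitude, expressed in FAPUs
    fapu      -- the FAPU this FAP is defined in, e.g. MNS0 / 1024
    """
    scale = fap_value * fapu
    for vid, (dx, dy, dz) in rule.items():
        v = vertices[vid]
        v[0] += dx * scale
        v[1] += dy * scale
        v[2] += dz * scale
    </PRE>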
    <P align=justify>In the following, we first describe what MPEG-4 considers 
    to be a generic face model in its neutral state and the associated feature 
    points. Then, we explain the facial animation parameters for this generic 
    model. Finally, we show how to define MPEG-4 compliant face models that can 
    be transmitted from the encoder to the decoder for animation.</P>
    <OL>
      <LI><B><A name=_Ref447559909>MPEG-4 Face Model in Neutral State</A></B> 
      </LI></OL></FONT></LI></OL></LI></OL><FONT size=2>
<P align=justify>As the first step, MPEG-4 defines a generic face model in its 
neutral state by the following properties (see Figure 1):</P>
<UL>
  <LI>gaze is in the direction of the Z axis, </LI>
  <LI>all face muscles are relaxed, </LI>
  <LI>eyelids are tangent to the iris, </LI>
  <LI>the pupil is one third of the diameter of the iris, </LI>
  <LI>lips are in contact; the line of the lips is horizontal and at the same 
  height as the lip corners, </LI>
  <LI>the mouth is closed and the upper teeth touch the lower ones, </LI>
  <LI>the tongue is flat and horizontal, with the tip of the tongue touching the 
  boundary between the upper and lower teeth. </LI></UL>
<P align=justify>A FAPU and the feature points used to derive the FAPU are 
defined next with respect to the face in its neutral state. </P>
<P align=center><IMG height=243 
src="&#65288;7&#65289;Face and 2-D Mesh Animation in MPEG-4.files/Image19.gif" width=181></P>
<DIR><B>
<P align=justify><A name=_Ref416425989>Figure 1</A>: A face model in its neutral 
state and the feature points used to define FAP units (FAPU). Fractions of 
distances between the marked key features are used to define FAPU (from 
[14]).</P></B></DIR><B>
<OL>
  <OL>
    <LI><A name=_Ref416255308></A>Face Animation Parameter Units 
    <P align=justify>In order to define face animation parameters for arbitrary 
    face models, MPEG-4 defines FAPUs that serve to scale facial animation 
    parameters for any face model. FAPUs are defined as fractions of distances 
    between key facial features (see Figure 1). These features, such as eye 
    separation, are defined on a face model that is in the neutral state. The 
    FAPU allow interpretation of the FAPs on any facial model in a consistent 
    way, producing reasonable results in terms of expression and speech 
    pronunciation. The measurement units are shown in Table 1; a sketch of 
    their computation follows the table.</P>
    <P align=justify><A name=_Ref416295987>Table 1</A>: Facial Animation 
    Parameter Units and their definitions.</P>
    <TABLE cellSpacing=1 cellPadding=7 width=590 border=1>
      <TBODY>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify><B>FAPU</B></P></TD>
        <TD vAlign=top width="57%">
          <P align=justify><B>Description</B></P></TD>
        <TD vAlign=top width="25%">
          <P align=justify><B>Definition</B></P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>IRISD0</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Iris diameter (by definition equal to the distance 
          between the upper and lower eyelids) in the neutral face</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>IRISD = IRISD0 / 1024</P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>ES0</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Eye separation</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>ES = ES0 / 1024</P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>ENS0</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Eye - nose separation</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>ENS = ENS0 / 1024</P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>MNS0</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Mouth - nose separation</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>MNS = MNS0 / 1024</P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>MW0</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Mouth width</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>MW=MW0 / 1024</P></TD></TR>
      <TR>
        <TD vAlign=top width="18%">
          <P align=justify>AU</P></TD>
        <TD vAlign=top width="57%">
          <P align=justify>Angle unit</P></TD>
        <TD vAlign=top width="25%">
          <P align=justify>10E-5 rad</P></TD></TR></TBODY></TABLE>
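    <P align=justify><FONT size=2>The FAPUs of Table 1 can be computed once 
    from the feature point positions of the model in its neutral state, as the 
    sketch below shows. The descriptive feature point names are hypothetical 
    placeholders for the standard's actual feature points.</FONT></P>
    <PRE>
import math

def dist(p, q):
    """Euclidean distance between two 3-D feature points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def compute_fapus(fp):
    """Compute the FAPUs of Table 1 from neutral-face feature points.

    fp maps descriptive names (illustrative, not standard labels) to
    (x, y, z) positions on the model in its neutral state.
    """
    return {
        "IRISD": dist(fp["upper_eyelid"], fp["lower_eyelid"]) / 1024,
        "ES":    dist(fp["left_pupil"], fp["right_pupil"]) / 1024,
        "ENS":   dist(fp["eye_midpoint"], fp["nose_tip"]) / 1024,
        "MNS":   dist(fp["nose_tip"], fp["mouth_midpoint"]) / 1024,
        "MW":    dist(fp["left_mouth_corner"],
                      fp["right_mouth_corner"]) / 1024,
        "AU":    1e-5,  # angle unit: 10^-5 rad, independent of the model
    }
    </PRE>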
    <LI><A name=_Ref416507376>Feature Points</A> </LI></OL></OL></B>
<P align=justify>MPEG-4 specifies 84 feature points on the neutral face (see 
Figure 2). The main purpose of these feature points is to provide spatial 
references for defining FAPs. Some feature points such as the ones along the 
hairline are not affected by FAPs. However, they are required for defining the 
shape of a proprietary face model using feature points (Section 2.1.3). Feature 
points are arranged in groups such as cheeks, eyes, and mouth; a simple 
registry is sketched below. The location of 
these feature points has to be known for any MPEG-4 compliant face model. The 
feature points on the model should be located according to Figure 2 and the 
hints given in Table 6 in the Annex of this paper.</P>
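<P align=justify><FONT size=2>A compliant model therefore needs a lookup from 
feature point labels to vertex locations. MPEG-4 labels feature points by 
group and index (e.g., 3.1); the group assignments and sample coordinates in 
this minimal registry are illustrative only.</FONT></P>
<PRE>
# Minimal feature-point registry for a face model; labels follow the
# "group.index" convention, coordinates are made-up sample values.
feature_points = {
    "3.1": {"group": "eyes",   "pos": (0.031, 0.042, 0.015)},
    "5.2": {"group": "cheeks", "pos": (-0.052, -0.010, 0.008)},
    "8.1": {"group": "mouth",  "pos": (0.000, -0.047, 0.021)},
}

def points_in_group(fps, group):
    """All feature point labels belonging to one group, e.g. 'eyes'."""
    return sorted(k for k, v in fps.items() if v["group"] == group)
</PRE>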
<P align=center><IMG height=805 
src="&#65288;7&#65289;Face and 2-D Mesh Animation in MPEG-4.files/Image20.gif" 
width=488></P><B>
<P align=center><A name=_Ref416508759>Figure 2</A>: Feature points may be used 
to define the shape of a proprietary face model. The facial animation parameters 
are defined by motion of some of these feature points (from 
[14]).</P></B><B>
<OL>
  <OL>
    <LI><A name=_Ref435873549>Face Animation Parameters</A> </LI></OL></OL></B>
<P align=justify>The FAPs are based on the study of minimal perceptible actions 
and are closely related to muscle actions [2][4][9][10]. The 68 parameters are 
categorized into 10 groups related to parts of the face (Table 2). FAPs 
represent a complete set of basic facial actions including head motion, tongue, 
eye, and mouth control. They allow representation of natural facial expressions 
(see Table 7 in the Annex). For each FAP, the standard defines the appropriate 
FAPU, FAP group, direction of positive motion and whether the motion of the 
feature point is unidirectional (see FAP 3, open jaw) or bi-directional (see FAP 
48, head pitch). FAPs can also be used to define facial action units [19]. 
Exaggerated amplitudes permit the definition of actions that are normally not 
possible for humans, but are desirable for cartoon-like characters.</P>
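<P align=justify><FONT size=2>The per-FAP definition data described above can 
be held in a small record per parameter, as sketched below. The two sample 
entries follow the standard's FAP list (FAP 3, open jaw; FAP 48, head pitch), 
while the field names are our own.</FONT></P>
<PRE>
from dataclasses import dataclass

@dataclass
class FapDef:
    number: int
    name: str
    fapu: str             # FAPU that scales the amplitude (Table 1)
    group: int            # FAP group (1-10, Table 2)
    bidirectional: bool   # True if negative amplitudes are allowed
    positive_motion: str  # direction of positive motion

FAP_DEFS = {
    3:  FapDef(3, "open_jaw", "MNS", 2, False, "down"),
    48: FapDef(48, "head_pitch", "AU", 7, True, "down"),
}
</PRE>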
<P align=justify>The FAP set contains two high-level parameters, visemes and 
expressions (FAP group 1). A viseme (FAP 1) is a visual correlate to a phoneme. 
Only 14 static visemes that are clearly distinguished are included in the 
standard set (Table 3).</P>
<P align=justify>Due to coarticulation of speech and mouth movement [5], the 
shape of the mouth of a speaking human is influenced not only by the current 
phoneme, but also by the previous and the next one. In MPEG-4, transitions from 
one viseme to the next are defined by blending only two visemes with a weighting 
factor. So far, it is not clear how this can be used for high quality visual 
speech animation.</P>
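<P align=justify><FONT size=2>A minimal sketch of this two-viseme blend 
follows, assuming mouth shapes stored as per-vertex displacement tuples and a 
blend factor in [0, 63]; the exact normalization of the weighting factor is 
our assumption.</FONT></P>
<PRE>
def blend_visemes(shape1, shape2, viseme_blend):
    """Blend two viseme mouth shapes with a 0-63 weighting factor."""
    w = viseme_blend / 63.0
    return {vid: tuple(w * a + (1.0 - w) * b
                       for a, b in zip(shape1[vid], shape2[vid]))
            for vid in shape1}

# Transition out of viseme "p" into viseme "a":
# blend_visemes(shapes["p"], shapes["a"], 21) weights "p" by one third.
</PRE>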
<P align=justify>The expression parameter FAP 2 defines the 6 primary facial 
expressions (Table 4, Figure 3). In contrast to visemes, facial expressions are 
animated by a value defining the excitation of the expression. Two facial 
expressions can be animated simultaneously, with an amplitude in the range 
[0-63] defined for each expression. The facial expression parameter values are 
defined by textual descriptions. The expression parameter provides an efficient 
means of animating faces: expressions are high-level animation parameters that 
a face model designer creates for each face model. Since each is designed as a 
complete expression, they allow unknown models to be animated with high 
subjective quality [21][22].</P>
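<P align=justify><FONT size=2>The sketch below superimposes two simultaneous 
expressions on the neutral face, each excited by an intensity in [0, 63] as 
described above. Expression shapes are modeled as per-vertex displacements 
defined by the face model designer; linear superposition is our simplifying 
assumption.</FONT></P>
<PRE>
def apply_expressions(neutral, expr1, amp1, expr2, amp2):
    """Superimpose two expressions on the neutral face.

    expr1/expr2 -- per-vertex displacements at full excitation
    amp1/amp2   -- excitation values in [0, 63]
    """
    out = {}
    for vid, base in neutral.items():
        d1 = expr1.get(vid, (0.0, 0.0, 0.0))
        d2 = expr2.get(vid, (0.0, 0.0, 0.0))
        out[vid] = tuple(b + (amp1 / 63.0) * a + (amp2 / 63.0) * c
                         for b, a, c in zip(base, d1, d2))
    return out
</PRE>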
<P align=justify>Using FAP 1 and FAP 2 together with low-level FAPs 3-68 that 
affect the same areas as FAPs 1 and 2 may result in unexpected visual 
representations of the face. Generally, the lower-level FAPs have priority over 
deformations caused by FAP 1 or 2. When specifying an expression with FAP 2, the 
encoder may send an init_face bit that deforms the neutral face of the model