Paper: Translation and Commentary on "First Order Motion Model for Image Animation"



1 Introduction  

2 Related work  

3 Method

3.1 Local Affine Transformations for Approximate Motion Description  

3.2 Occlusion-aware Image Generation 

3.3 Training Losses

3.4 Testing Stage: Relative Motion Transfer  

4 Experiments

5 Conclusions  





First Order Motion Model for Image Animation

Aliaksandr Siarohin (DISI, University of Trento); Stéphane Lathuilière (DISI, University of Trento; LTCI, Télécom Paris, Institut polytechnique de Paris); Sergey Tulyakov (Snap Inc.); Elisa Ricci (DISI, University of Trento; Fondazione Bruno Kessler); Nicu Sebe (DISI, University of Trento; Huawei Technologies Ireland)




Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.


1 Introduction  

Generating videos by animating objects in still images has countless applications across areas of  interest including movie production, photography, and e-commerce. More precisely, image animation  refers to the task of automatically synthesizing videos by combining the appearance extracted from  a source image with motion patterns derived from a driving video. For instance, a face image of a  certain person can be animated following the facial expressions of another individual (see Fig. 1). In  the literature, most methods tackle this problem by assuming strong priors on the object representation  (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can  be referred to as object-specific methods, as they assume knowledge about the model of the specific  object to animate.  




Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address these issues, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, since it assumes a zeroth-order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue, we make the following contributions:

  • First, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model.
  • Second, we introduce an occlusion-aware generator, which adopts an automatically estimated occlusion mask to indicate object parts that are not visible in the source image and that should be inferred from the context. This is especially needed when the driving video contains large motion patterns and occlusions are typical.
  • Third, we extend the equivariance loss commonly used for keypoint detector training [18, 44] to improve the estimation of local affine transformations.
  • Fourth, we experimentally show that our method significantly outperforms state-of-the-art image animation methods and can handle high-resolution datasets where other approaches generally fail.
  • Finally, we release a new high-resolution dataset, Tai-Chi-HD, which we believe could become a reference benchmark for evaluating frameworks for image animation and video generation.





2 Related work  

Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural  networks could render video frames from noise vectors [36, 26]. More recently, several approaches  tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a  recurrent neural network with a VAE in order to generate face videos. Considering a wider range  of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially  trained in order to synthesize videos from noise, categorical labels or static images. Another typical  case of conditional generation is the problem of future frame prediction, in which the generated video  is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can  be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However,  in the case of image animation, the applied spatial deformations are not predicted but given by the  driving video.



Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31] and required a strong prior of the animated object. For example, in face animation, the method of Zollhöfer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. To animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we rely neither on labels, nor on prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms).


Several approaches were proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similar to our approach, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.



3 Method

We are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video 𝒟. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired by Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time we apply our model to pairs composed of the source image and of each frame of the driving video and perform image animation of the source object.
An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ ℝ^{3×H×W} of dimension H × W of the driving video 𝒟 to the source frame S ∈ ℝ^{3×H×W} (𝒟 denotes the whole video, D a single frame). The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function T_{S←D}: ℝ² → ℝ² that maps each pixel location in D to its corresponding location in S. T_{S←D} is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R. We independently estimate two transformations: from R to S (T_{S←R}) and from R to D (T_{D←R}). Note that unlike X2Face [40] the reference frame is an abstract concept that cancels out in our derivations later. Therefore it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting T_{D←R} and T_{S←R}, the motion estimator module proceeds in two steps.
In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent T_{D←R} by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.
During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field T̂_{S←D}. Furthermore, in addition to the dense motion field, this network outputs an occlusion mask Ô_{S←D} that indicates which image parts of D can be reconstructed by warping of the source image and which parts should be inpainted, i.e. inferred from the context.
Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to T̂_{S←D} and inpaints the image parts that are occluded in the source image. In the following sections we detail each of these steps and the training procedure.
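The back-warping with bilinear sampling mentioned in the overview can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names, the nested-list image format, and the border clamping are assumptions made for illustration. For every output pixel, `flow` plays the role of T_{S←D} and returns the source location to read from.

```python
import math

def bilinear_sample(image, x, y):
    """Sample a 2D grid (list of rows) at a real-valued (x, y) location.

    Coordinates are clamped to the image border (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom

def backward_warp(source, flow):
    """Build the output image by reading `source` at flow(x, y) for each output pixel."""
    h, w = len(source), len(source[0])
    return [[bilinear_sample(source, *flow(x, y)) for x in range(w)]
            for y in range(h)]
```

Because every output pixel is defined by a read, not a write, the result has no holes; this is the practical reason for preferring backward over forward flow.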


3.1 Local Affine Transformations for Approximate Motion Description

The motion estimation module estimates the backward optical flow T_{S←D} from a driving frame D to the source frame S. As discussed above, we propose to approximate T_{S←D} by its first-order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice, and detail the proposed approximation of T_{S←D}.
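As a rough numerical illustration of this first-order approximation, the sketch below estimates T_{S←D}(z) near one keypoint from the keypoint's location in S and D and the two 2×2 Jacobians at that keypoint. The helper names and the tuple/list data layout are hypothetical, and the sketch assumes the driving Jacobian is invertible.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def local_flow(z, kp_source, kp_driving, jac_source, jac_driving):
    """First-order estimate of T_{S<-D}(z) around one keypoint.

    kp_source  = T_{S<-R}(p_k), kp_driving = T_{D<-R}(p_k);
    jac_source and jac_driving are the Jacobians of those maps at p_k.
    """
    j_k = matmul2(jac_source, inv2(jac_driving))
    dx, dy = z[0] - kp_driving[0], z[1] - kp_driving[1]
    return (kp_source[0] + j_k[0][0] * dx + j_k[0][1] * dy,
            kp_source[1] + j_k[1][0] * dx + j_k[1][1] * dy)
```

When both maps are exactly affine the estimate is exact, which gives a convenient sanity check; for general smooth motion it is only accurate close to the keypoint.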
We assume there exists an abstract reference frame R. Therefore, estimating T_{S←D} consists of estimating T_{S←R} and T_{R←D}. Furthermore, given a frame X, we estimate each transformation T_{X←R} in the neighbourhood of the learned keypoints. Formally, given a transformation T_{X←R}, we consider its first-order Taylor expansions in K keypoints p_1, …, p_K. Here, p_1, …, p_K denote the coordinates of the keypoints in the reference frame R. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by p while the point locations in the X, S or D pose spaces are denoted by z. We obtain:
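The equations that belong here are missing from this copy. The following is a reconstruction from the definitions above, offered as a sketch and not as the authors' exact typesetting. It assumes that T_{D←R} is locally invertible near each keypoint, so that T_{R←D} = T_{D←R}^{-1}, and the equation numbers are chosen to agree with the references to Eq. (4) made elsewhere in the text.

```latex
\begin{align}
T_{X\leftarrow R}(p) &= T_{X\leftarrow R}(p_k)
  + \left(\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k}\right)(p - p_k)
  + o\bigl(\lVert p - p_k\rVert\bigr) \tag{1}\\
T_{S\leftarrow D}(z) &= T_{S\leftarrow R}\circ T_{D\leftarrow R}^{-1}(z)
  \approx T_{S\leftarrow R}(p_k) + J_k\,\bigl(z - T_{D\leftarrow R}(p_k)\bigr) \tag{4}\\
J_k &= \left(\frac{d}{dp}T_{S\leftarrow R}(p)\Big|_{p=p_k}\right)
       \left(\frac{d}{dp}T_{D\leftarrow R}(p)\Big|_{p=p_k}\right)^{-1} \tag{5}
\end{align}
```

Setting J_k to the identity leaves a pure translation of each keypoint neighbourhood, which corresponds to the Baseline variant of the ablation study.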




Combining Local Motions. We employ a convolutional network P to estimate T̂_{S←D} from the set of Taylor approximations of T_{S←D}(z) in the keypoints and the original source frame S. Importantly, since T̂_{S←D} maps each pixel location in D to its corresponding location in S, the local patterns in T̂_{S←D}, such as edges or texture, are pixel-to-pixel aligned with D but not with S. This misalignment makes it harder for the network to predict T̂_{S←D} from S. In order to provide inputs already roughly aligned with T̂_{S←D}, we warp the source frame S according to the local transformations estimated in Eq. (4). Thus, we obtain K transformed images S^1, …, S^K that are each aligned with T̂_{S←D} in the neighbourhood of a keypoint. Importantly, we also consider an additional image S^0 = S for the background.
For each keypoint p_k we additionally compute heatmaps H_k indicating to the dense motion network where each transformation happens. Each H_k(z) is implemented as the difference of two heatmaps centered in T_{D←R}(p_k) and T_{S←R}(p_k):
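The heatmap equation is missing from this copy. A minimal sketch of such a difference of two heatmaps is given below, assuming Gaussian bumps, pixel coordinates normalised to [-1, 1], and a variance of 0.01; all three are assumptions of this sketch, as are the function names.

```python
import math

def gaussian(center, z, variance):
    """Unnormalised isotropic Gaussian bump centred at `center`, evaluated at z."""
    d2 = (center[0] - z[0]) ** 2 + (center[1] - z[1]) ** 2
    return math.exp(-d2 / variance)

def keypoint_heatmap(kp_driving, kp_source, size, variance=0.01):
    """H_k on a size x size grid: bump at the driving keypoint minus bump at the source keypoint."""
    coords = [-1.0 + 2.0 * i / (size - 1) for i in range(size)]
    return [[gaussian(kp_driving, (x, y), variance)
             - gaussian(kp_source, (x, y), variance)
             for x in coords]
            for y in coords]
```

The map is positive where the keypoint sits in the driving frame and negative where it sits in the source frame, so a single channel tells the network both ends of the displacement.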




3.2 Occlusion-aware Image Generation

As mentioned in Sec. 3, the source image S is not pixel-to-pixel aligned with the image to be generated D̂. In order to handle this misalignment, we use a feature warping strategy similar to [29, 28, 15]. More precisely, after two down-sampling convolutional blocks, we obtain a feature map ξ ∈ ℝ^{H′×W′} of dimension H′ × W′. We then warp ξ according to T̂_{S←D}. In the presence of occlusions in S, optical flow may not be sufficient to generate D̂. Indeed, the occluded parts in S cannot be recovered by image-warping and thus should be inpainted. Consequently, we introduce an occlusion map Ô_{S←D} ∈ [0, 1]^{H′×W′} to mask out the feature map regions that should be inpainted. Thus, the occlusion mask diminishes the impact of the features corresponding to the occluded parts. The transformed feature map is written as:
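The equation is missing from this copy. A form consistent with the description above is sketched here; the symbols f_w (the back-warping operation) and ⊙ (the element-wise product) are notation introduced for this reconstruction, and the number (8) follows the reference made in the ablation study.

```latex
\begin{equation}
\xi' = \hat{O}_{S\leftarrow D} \odot f_w\bigl(\xi,\ \hat{T}_{S\leftarrow D}\bigr) \tag{8}
\end{equation}
```

With the mask fixed to 1 everywhere, this reduces to plain feature warping, which matches the "O_{S←D} = 1" setting of the Baseline model.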





3.3 Training Losses

We train our system in an end-to-end fashion combining several losses. First, we use a reconstruction loss based on the perceptual loss of Johnson et al. [19], computed with the pre-trained VGG-19 network, as our main driving loss. The loss is based on the implementation of Wang et al. [37]. With the input driving frame D and the corresponding reconstructed frame D̂, the reconstruction loss is written as:
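The loss equation is missing from this copy. A plausible form, given that it is a perceptual loss on VGG-19 features, is sketched below. Here N_i(·) stands for the i-th feature channel extracted from a chosen VGG-19 layer and I for the number of channels; both symbols are assumptions of this reconstruction.

```latex
\begin{equation}
L_{rec}(\hat{D}, D) = \sum_{i=1}^{I} \bigl|\, N_i(\hat{D}) - N_i(D) \,\bigr| \tag{9}
\end{equation}
```

The "pyramid loss" variant mentioned in the ablation study would apply the same expression at several image resolutions and sum the results.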


Imposing Equivariance Constraint. Our keypoint predictor does not require any keypoint annotations during training. This may lead to unstable performance. The equivariance constraint is one of the most important factors driving the discovery of unsupervised keypoints [18, 43]. It forces the model to predict consistent keypoints with respect to known geometric transformations. We use thin-plate spline deformations, as they were previously used in unsupervised keypoint detection [18, 43] and are similar to natural image deformations. Since our motion estimator predicts not only the keypoints but also the Jacobians, we extend the well-known equivariance loss to additionally include constraints on the Jacobians.
We assume that an image X undergoes a known spatial deformation T_{X←Y}. In this case T_{X←Y} can be an affine transformation or a thin-plate spline deformation. After this deformation we obtain a new image Y. Now, by applying our extended motion estimator to both images, we obtain a set of local approximations for T_{X←R} and T_{Y←R}. The standard equivariance constraint is written as:
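The equations are missing from this copy. The reconstruction below states the constraint as a composition of maps and then takes the zeroth- and first-order terms of both sides at each keypoint (the chain rule gives the second line). The numbering is chosen to agree with the references to Eq. (11) and Eq. (12) in the text; treat it as a sketch.

```latex
\begin{align}
T_{X\leftarrow R} &\equiv T_{X\leftarrow Y}\circ T_{Y\leftarrow R} \tag{10}\\
T_{X\leftarrow R}(p_k) &\equiv T_{X\leftarrow Y}\circ T_{Y\leftarrow R}(p_k) \tag{11}\\
\left(\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k}\right) &\equiv
\left(\frac{d}{dp}T_{X\leftarrow Y}(p)\Big|_{p=T_{Y\leftarrow R}(p_k)}\right)
\left(\frac{d}{dp}T_{Y\leftarrow R}(p)\Big|_{p=p_k}\right) \tag{12}
\end{align}
```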


Note that constraint (11) is exactly the standard equivariance constraint for the keypoints [18, 43]. During training, we constrain every keypoint location using a simple L1 loss between the two sides of Eq. (11). However, implementing the second constraint, Eq. (12), with an L1 loss would force the magnitude of the Jacobians to zero and would lead to numerical problems. We therefore reformulate this constraint in the following way:
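The reformulated equation is missing from this copy. One form that avoids the collapse described above is obtained by multiplying the Jacobian constraint on the left by the inverse of the Jacobian of T_{X←R}, so that the product is compared with the 2×2 identity matrix instead of with a quantity the network can shrink to zero. This is a reconstruction under that assumption:

```latex
\begin{equation}
\mathbb{1} \equiv
\left(\frac{d}{dp}T_{X\leftarrow R}(p)\Big|_{p=p_k}\right)^{-1}
\left(\frac{d}{dp}T_{X\leftarrow Y}(p)\Big|_{p=T_{Y\leftarrow R}(p_k)}\right)
\left(\frac{d}{dp}T_{Y\leftarrow R}(p)\Big|_{p=p_k}\right) \tag{13}
\end{equation}
```

An L1 penalty between the two sides then has its minimum at consistent Jacobians and no longer at vanishing ones.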




3.4 Testing Stage: Relative Motion Transfer

At this stage our goal is to animate an object in a source frame S_1 using the driving video D_1, …, D_T. Each frame D_t is independently processed to obtain S_t. Rather than transferring the motion encoded in T_{S_1←D_t}(p_k) to S, we transfer the relative motion between D_1 and D_t to S_1. In other words, we apply a transformation T_{D_t←D_1}(p) to the neighbourhood of each keypoint p_k:
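The equation is missing from this copy. A reconstruction consistent with the description, in which the keypoint displacement between D_1 and D_t is added to the source keypoint and the local Jacobian is taken between the two driving frames, is sketched below. The number (14) follows the later reference in the text; the exact form should be checked against the original paper.

```latex
\begin{align}
T_{S_1\leftarrow S_t}(z) &\approx T_{S_1\leftarrow R}(p_k)
  + J_k\,\bigl(z - T_{S_1\leftarrow R}(p_k) + T_{D_1\leftarrow R}(p_k) - T_{D_t\leftarrow R}(p_k)\bigr) \tag{14}\\
J_k &= \left(\frac{d}{dp}T_{D_1\leftarrow R}(p)\Big|_{p=p_k}\right)
       \left(\frac{d}{dp}T_{D_t\leftarrow R}(p)\Big|_{p=p_k}\right)^{-1}
\end{align}
```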



Detailed mathematical derivations are provided in Sup. Mat. Intuitively, we transform the neighbourhood of each keypoint p_k in S_1 according to its local deformation in the driving video. Indeed, transferring relative motion instead of absolute coordinates allows us to transfer only relevant motion patterns, while preserving global object geometry. Conversely, when transferring absolute coordinates, as in X2Face [40], the generated frame inherits the object proportions of the driving video. It is important to note that one limitation of transferring relative motion is that we need to assume that the objects in S_1 and D_1 have similar poses (see [28]). Without initial rough alignment, Eq. (14) may lead to absolute keypoint locations that are physically impossible for the object of interest.
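One way to realise this relative transfer is to build, for each driving frame, a "virtual" keypoint that sits at the source keypoint plus the displacement observed in the driving video, and whose Jacobian is the source Jacobian composed with the change of the driving Jacobian since the first frame. The sketch below shows that bookkeeping for a single keypoint; the function names and data layout are hypothetical, and the driving Jacobian of the first frame is assumed invertible.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a non-zero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def transfer_relative_motion(kp_source, jac_source,
                             kp_driving_first, jac_driving_first,
                             kp_driving, jac_driving):
    """Return the keypoint location and Jacobian used to animate the source.

    Location: source keypoint shifted by the driving displacement D_1 -> D_t.
    Jacobian: change of the driving Jacobian since D_1, applied to the source Jacobian.
    """
    value = (kp_source[0] + kp_driving[0] - kp_driving_first[0],
             kp_source[1] + kp_driving[1] - kp_driving_first[1])
    jac_change = matmul2(jac_driving, inv2(jac_driving_first))
    return value, matmul2(jac_change, jac_source)
```

When the driving frame equals the first driving frame, the function returns the source keypoint and Jacobian unchanged, so the first generated frame reproduces the source pose.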


4 Experiments

Datasets. We train and test our method on four different datasets containing various objects. Our model is capable of rendering videos of much higher resolution compared to [28] in all our experiments.

  • The VoxCeleb dataset [22] is a face dataset of 22496 videos, extracted from YouTube videos. For  pre-processing, we extract an initial bounding box in the first video frame. We track this face until  it is too far away from the initial position. Then, we crop the video frames using the smallest crop  containing all the bounding boxes. The process is repeated until the end of the sequence. We filter  out sequences that have resolution lower than 256 × 256 and the remaining videos are resized to  256 × 256 preserving the aspect ratio. It’s important to note that compared to X2Face [40], we obtain  more natural videos where faces move freely within the bounding box. Overall, we obtain 12331  training videos and 444 test videos, with lengths varying from 64 to 1024 frames.  
  • The UvA-Nemo dataset [9] is a facial analysis dataset that consists of 1240 videos. We apply the  exact same pre-processing as for VoxCeleb. Each video starts with a neutral expression. Similar to  Wang et al. [38], we use 1116 videos for training and 124 for evaluation.  
  • The BAIR robot pushing dataset [10] contains videos collected by a Sawyer robotic arm pushing diverse objects over a table. It consists of 42880 training and 128 test videos. Each video is 30 frames long and has a 256 × 256 resolution.
  • Following Tulyakov et al. [34], we collected 280 tai-chi videos from YouTube. We use 252 videos for training and 28 for testing. Each video is split into short clips as described in the pre-processing of the VoxCeleb dataset. We retain only high-quality videos and resize all the clips to 256 × 256 pixels (instead of 64 × 64 pixels in [34]). Finally, we obtain 3049 and 285 video chunks for training and testing respectively, with video length varying from 128 to 1024 frames. This dataset is referred to as the Tai-Chi-HD dataset. The dataset will be made publicly available.
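The cropping procedure described for VoxCeleb (track the box until it drifts too far from its initial position, then crop with the smallest rectangle containing all boxes, and repeat) can be sketched as follows. The box format (left, top, right, bottom), the centre-distance test and the `max_shift` threshold are assumptions for illustration; the text does not specify how "too far" is measured.

```python
import math

def box_center(box):
    """Centre of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def union_box(boxes):
    """Smallest (left, top, right, bottom) rectangle containing all boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))

def split_into_crops(boxes, max_shift):
    """Group per-frame boxes into chunks of (first_frame, last_frame, crop).

    A new chunk starts when the box centre moves more than `max_shift`
    pixels away from the centre of the chunk's first box.
    """
    chunks, start = [], 0
    for i in range(1, len(boxes) + 1):
        ended = i == len(boxes)
        if not ended:
            (cx0, cy0), (cx, cy) = box_center(boxes[start]), box_center(boxes[i])
            ended = math.hypot(cx - cx0, cy - cy0) > max_shift
        if ended:
            chunks.append((start, i - 1, union_box(boxes[start:i])))
            start = i
    return chunks
```

Using one fixed crop per chunk, instead of re-centring every frame, is what lets the object move freely inside the crop.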


Evaluation Protocol. Evaluating the quality of image animation is not obvious, since ground truth  animations are not available. We follow the evaluation protocol of Monkey-Net [28]. First, we quantitatively evaluate each method on the "proxy" task of video reconstruction. This task consists of  reconstructing the input video from a representation in which appearance and motion are decoupled.  In our case, we reconstruct the input video by combining the sparse motion representation in (2) of  each frame and the first video frame. Second, we evaluate our model on image animation according  to a user-study. In all experiments we use K=10 as in [28]. Other implementation details are given in  Sup. Mat.


Metrics. To evaluate video reconstruction, we adopt the metrics proposed in Monkey-Net [28]:
  • L1. We report the average L1 distance between the generated and the ground-truth videos.
  • Average Keypoint Distance (AKD). For the Tai-Chi-HD, VoxCeleb and Nemo datasets, we use  3rd-party pre-trained keypoint detectors in order to evaluate whether the motion of the input video  is preserved. For the VoxCeleb and Nemo datasets we use the facial landmark detector of Bulat et  al. [5]. For the Tai-Chi-HD dataset, we employ the human-pose estimator of Cao et al. [7]. These  keypoints are independently computed for each frame. AKD is obtained by computing the average  distance between the detected keypoints of the ground truth and of the generated video.  
  • Missing Keypoint Rate (MKR). In the case of Tai-Chi-HD, the human-pose estimator returns an  additional binary label for each keypoint indicating whether or not the keypoints were successfully  detected. Therefore, we also report the MKR defined as the percentage of keypoints that are detected  in the ground truth frame but not in the generated one. This metric assesses the appearance quality of  each generated frame.  
  • Average Euclidean Distance (AED). Considering an externally trained image representation, we  report the average euclidean distance between the ground truth and generated frame representation,  similarly to Esser et al. [11]. We employ the feature embedding used in Monkey-Net [28].
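The first three metrics above can be sketched in a few lines. This is an illustrative reading of the definitions and not the evaluation code used in the paper: images are assumed to be flat lists of pixel values, keypoints lists of (x, y) pairs, and MKR is normalised here by the number of keypoints detected in the ground truth, which is one possible reading of "percentage".

```python
import math

def l1_distance(generated, target):
    """Mean absolute difference between two equally sized flat pixel lists."""
    return sum(abs(g - t) for g, t in zip(generated, target)) / len(target)

def average_keypoint_distance(kp_generated, kp_target):
    """Mean Euclidean distance between corresponding detected keypoints."""
    dists = [math.hypot(gx - tx, gy - ty)
             for (gx, gy), (tx, ty) in zip(kp_generated, kp_target)]
    return sum(dists) / len(dists)

def missing_keypoint_rate(found_generated, found_target):
    """Share of keypoints detected in the ground truth but not in the generated frame."""
    in_target = sum(1 for t in found_target if t)
    missing = sum(1 for g, t in zip(found_generated, found_target) if t and not g)
    return missing / in_target if in_target else 0.0
```

AED follows the same pattern as AKD, with feature vectors from an externally trained embedding in place of keypoint coordinates.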


Ablation Study. We compare the following variants of our model. Baseline: the simplest model, trained without using the occlusion mask (O_{S←D} = 1 in Eq. (8)) and without Jacobians (J_k = 𝟙 in Eq. (4)), and supervised with L_rec at the highest resolution only; Pyr.: the pyramid loss is added to Baseline; Pyr.+O_{S←D}: with respect to Pyr., we replace the generator network with the occlusion-aware network; Jac. w/o Eq. (12): our model with local affine transformations but without the equivariance constraint on the Jacobians, Eq. (12); Full: the full model including the local affine transformations described in Sec. 3.1.
In Tab. 1, we report the quantitative ablation. First, the pyramid loss leads to better results according to all the metrics except AKD. Second, adding O_{S←D} to the model consistently improves all the metrics with respect to Pyr. This illustrates the benefit of explicitly modeling occlusions. We found that without the equivariance constraint over the Jacobians, J_k becomes unstable, which leads to poor motion estimations. Finally, our Full model further improves all the metrics. In particular, we note that, with respect to the Baseline model, the MKR of the Full model is smaller by a factor of 2.75. It shows that our rich motion representation helps generate more realistic images. These results are confirmed by our qualitative evaluation in Fig. 3, where we compare the Baseline and the Full models. In these experiments, each frame D of the input video is reconstructed from its first frame (first column) and the estimated keypoint trajectories. We note that the Baseline model does not locate any keypoints in the arms area. Consequently, when the pose difference with the initial pose increases, the model cannot reconstruct the video (columns 3, 4 and 5). In contrast, the Full model learns to detect a keypoint on each arm and, therefore, to more accurately reconstruct the input video even in the case of complex motion.



Comparison with State of the Art. We now compare our method with the state of the art for the video reconstruction task as in [28]. To the best of our knowledge, X2Face [40] and Monkey-Net [28] are the only previous approaches for model-free image animation. Quantitative results are reported in Tab. 3. We observe that our approach consistently improves every single metric for each of the four different datasets. Even on the two face datasets, VoxCeleb and Nemo, our approach clearly outperforms X2Face, which was originally proposed for face generation. The better performance of our approach compared to X2Face is especially impressive since X2Face exploits a larger motion embedding (128 floats) than our approach (60 = K*(2+4) floats). Compared to Monkey-Net, which uses a motion representation with a similar dimension (50 = K*(2+3)), the advantages of our approach are clearly visible on the Tai-Chi-HD dataset, which contains highly non-rigid objects (i.e., the human body).
We now report a qualitative comparison for image animation. Generated sequences are reported in Fig. 4. The results are well in line with the quantitative evaluation in Tab. 3. Indeed, in both examples, X2Face and Monkey-Net are not able to correctly transfer the body motion in the driving video, instead warping the human body in the source image as a blob. Conversely, our approach is able to generate significantly better-looking videos in which each body part is independently animated. This qualitative evaluation illustrates the potential of our rich motion description. We complete our evaluation with a user study. We ask users to select the most realistic image animation. Each question consists of the source image, the driving video, and the corresponding results of our method and a competitive method. We require each question to be answered by 10 AMT workers. This evaluation is repeated on 50 different input pairs. Results are reported in Tab. 2. We observe that our method is clearly preferred over the competitor methods. Interestingly, the largest difference with the state of the art is obtained on Tai-Chi-HD, the most challenging dataset in our evaluation due to its rich motions.
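The embedding sizes quoted above follow directly from K = 10 keypoints. A small check, where the per-keypoint breakdown for our method (2 coordinates plus the 4 entries of a 2×2 Jacobian) follows from the method description and the "2+3" split for Monkey-Net is taken from the text as given; the function name is hypothetical:

```python
def motion_embedding_size(num_keypoints, floats_per_keypoint):
    """Number of floats in a sparse keypoint-based motion representation."""
    return num_keypoints * floats_per_keypoint

K = 10
ours = motion_embedding_size(K, 2 + 4)        # keypoint location + 2x2 Jacobian
monkey_net = motion_embedding_size(K, 2 + 3)  # split as quoted in the text
x2face = 128                                  # embedding size quoted in the text
```

So the proposed representation uses less than half as many floats as the X2Face embedding while adding only one float per keypoint over Monkey-Net.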

5 Conclusions  

We presented a novel approach for image animation based on keypoints and local affine transformations. Our novel mathematical formulation describes the motion field between two frames and is efficiently computed by deriving a first-order Taylor expansion approximation. In this way, motion is described as a set of keypoint displacements and local affine transformations. A generator network combines the appearance of the source image and the motion representation of the driving video. In addition, we proposed to explicitly model occlusions in order to indicate to the generator network which image parts should be inpainted. We evaluated the proposed method both quantitatively and qualitatively and showed that our approach clearly outperforms the state of the art on all the benchmarks.










