CN113160034B - Method for realizing complex motion migration based on multiple affine transformation representations
- Publication number
- CN113160034B (application CN202110395430.8A)
- Authority
- CN
- China
- Prior art keywords
- image
- video
- motion
- affine transformation
- model
- Prior art date: 2021-04-13
- Legal status: Active
Classifications
- G06T3/02
- G06N3/04—Neural networks; Architecture, e.g. interconnection topology
- G06N3/084—Learning methods; Backpropagation, e.g. using gradient descent
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- Y02T10/40—Engine management systems
Abstract
The invention discloses a complex motion migration method based on a multiple-affine-transformation representation. The motion change between two motion states of the same object is represented as multiple image affine transformations that are independent of the object's appearance, so the motion change can be migrated between different objects. On the one hand, affine transformations are a compact representation that is cheap to compute, which increases the inference speed of the model. On the other hand, representing motion with multiple affine transformations preserves most of the motion information and improves the quality of the generated images.
Description
Technical Field
The invention relates to the field of image generation in computer vision, and provides a method for realizing complex motion migration based on multi-affine transformation representation.
Background
Motion migration aims to transfer the motion information in a video to a static image, generating a new video that has the appearance of the object in the static image and the motion of the object in the input video. The technique is widely used in film and television production, face swapping, video conferencing, e-commerce, and other fields. Moreover, the rapid growth of short-form video has created demand for video special effects, so motion migration methods are attracting increasing attention.
To achieve good image quality, most current motion migration methods adopt supervised deep learning. Supervised methods require expensive manually annotated data, which hinders the wide application of the technique. Unsupervised motion migration research, in turn, mainly faces the challenges of low generation quality and slow inference speed.
Chan et al., in the article "Everybody Dance Now", propose a method for generating realistic human body images from human keypoint information: by feeding in different keypoint information, a person in the corresponding posture is generated, thereby realizing motion migration. However, this method not only relies on a detection model trained with supervision, but also needs to train a new motion migration model for every different person, and this huge cost limits its practicality.
Siarohin et al. propose a self-supervised motion migration method in "First Order Motion Model for Image Animation"; however, the method is computationally complex, the model is bulky, and two computation passes are required to represent the motion information. These shortcomings hurt the real-time performance of motion migration.
Disclosure of Invention
The invention aims to provide a method for realizing complex motion migration based on a multi-affine-transformation representation that avoids the need for expensive annotated data, reduces the computation of the motion migration model, increases the model's inference speed, and improves the accuracy of the generated images.
The technical solution realizing the purpose of the invention is as follows. A method for realizing complex motion migration based on a multiple-affine-transformation representation comprises the following steps:
Step A: for the same type of target object, collect a number of video sequences to form a data set for training the motion migration model; go to step B.
Step B: load the first video sequence in the data set; go to step C.
Step C: randomly select two frames from the loaded video sequence, called the source image and the target image respectively, to form a training data pair; estimate n sets of affine transformation matrices from the two frames with the regression module of the motion migration model, where n is greater than or equal to 1; go to step D.
Step D: take the source image and all the affine transformation matrices as the input of the mask generator; the mask generator produces a corresponding mask indicating where each simple affine transformation occurs; combine all the affine transformation matrices with their corresponding masks to generate the global sampling grid; go to step E.
Step E: the generation module of the motion migration model reconstructs a fake target image from the global sampling grid and the source image; compute the loss of the motion migration model and train the model once with the back-propagation algorithm; go to step F.
Step F: load the next video sequence and return to step C until the motion migration model converges and generates well; this yields the trained motion migration model; go to step G.
Step G: for the same type of target object, acquire a video P and a static image S, and use the trained motion migration model to migrate the motion in video P onto the static image S, generating a new video with the appearance of the object in S and the motion contained in P, thereby realizing complex motion migration of the target object.
Compared with the prior art, the invention has the advantages that:
(1) The invention is based on self-supervised deep learning: the model is trained on cheap, readily available video data and needs no expensive manually annotated data, which greatly reduces deployment cost.
(2) The invention uses multiple affine transformations to characterize appearance-independent motion information in an image. The motion information can therefore be migrated to any object of the same class without training a separate motion migration model for each person, which saves model deployment cost and lets the model be applied in real scenarios.
(3) The multiple affine transformation matrices are regressed directly from the two input images, so the encoded motion information is more accurate and the encoding is faster, which increases the model's computation speed and improves the motion migration effect.
Drawings
FIG. 1 is a flow chart of a method for implementing complex motion migration based on multiple affine transformation characterizations according to the present invention.
Detailed Description
The motion migration model designed by the invention comprises a regression module E_regress, a mask generator G_grid, and a generation module G. The purpose of the regression module is to estimate multiple sets of affine transformation matrices from two images, decomposing a complex motion change into multiple simple image affine transformations. The purpose of the mask generator is to produce masks indicating where each affine transformation occurs and to generate the final global sampling grid describing the object's motion change. The role of the generation module is to apply the encoded global sampling grid to a given static image, changing its motion and producing the fake target image.
The invention is described in further detail below with reference to the accompanying drawings:
With reference to FIG. 1, a complex motion migration method based on multiple affine transformation representations comprises the following steps:
and step A, aiming at the same type of target object, collecting a plurality of video sequences to form a data set for training an action migration model.
The same type of target object means that the acquired objects belong to the same attribute range. For example, if a human face movement migration task is performed, the acquired videos should be all videos recording changes of facial movements of a human, and videos recording changes of body movements of a human should not be introduced, and even videos of other types other than a human should not be introduced. The collected target objects of the same class are not limited to collecting video data of the same person, and in order to enable the trained model to have generalization, video data of multiple persons should be collected as much as possible.
Step B: load the first video sequence in the data set.
Step C: randomly select two frames from the loaded video sequence, called the source image and the target image respectively, to form a training data pair; estimate n sets of affine transformation matrices (n ≥ 1) from the two frames with the regression module of the motion migration model. The specific steps are as follows:
Step C01: randomly select two frames from a video sequence V, record them as the source image I and the target image T, and form a training data pair.
Step C02: concatenate the source image I and the target image T along the channel dimension to obtain an input tensor, which serves as the input of the regression module E_regress of the motion migration model; the regression module outputs n sets of affine transformation matrices:

[A_1, A_2, ..., A_i, ..., A_n] = E_regress(concat(I, T))

where concat(·) denotes the channel-dimension concatenation operation, A_i denotes the i-th affine transformation matrix, and A_n the n-th affine transformation matrix.
An affine transformation matrix describes an affine transformation of an image, i.e. a combination of flipping, scaling, shearing, and rotating the image. We regard a complex motion change of an object as a combination of simple affine transformations that occur locally; a complex motion change can therefore be represented by multiple image affine transformations.
The two frames taken from the video sequence, namely the source image and the target image, are concatenated along the channel dimension before being fed to the regression module. This makes the model focus on the change in pixel positions between the two frames and pay less attention to changes in pixel values, so the regression module predicts more accurate affine transformation matrices.
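To make the regression step concrete, below is a minimal PyTorch-style sketch of how a regression module of this kind could look. The layer sizes, the choice of n_transforms, and the identity-biased initialization are illustrative assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class RegressionModule(nn.Module):
    """Predicts n affine matrices (2x3 each) from a source/target pair
    concatenated along the channel dimension, as in step C02."""
    def __init__(self, n_transforms=10, in_channels=6):
        super().__init__()
        self.n = n_transforms
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(256, n_transforms * 6)
        # Assumption: start every predicted transform near the identity,
        # a common trick for spatial-transformer-style regressors.
        nn.init.zeros_(self.fc.weight)
        self.fc.bias = nn.Parameter(
            torch.tensor([1., 0., 0., 0., 1., 0.]).repeat(n_transforms))

    def forward(self, source, target):
        x = torch.cat([source, target], dim=1)       # concat(I, T)
        feat = self.encoder(x).flatten(1)
        return self.fc(feat).view(-1, self.n, 2, 3)  # [A_1, ..., A_n]
```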
Step D: take the source image and all the affine transformation matrices as the input of the mask generator; the mask generator produces a corresponding mask indicating where each simple affine transformation occurs; then combine all the affine transformation matrices with their corresponding masks to generate the global sampling grid. The specific steps are as follows:
step D01.n groups of affine transformation matrixes respectively and correspondingly construct n local sampling grids, and identity transformation (identity transformation) is addedAnd obtaining n +1 local sampling grids in total according to the corresponding local sampling grids.
A sampling grid is a tensor that tells the sampling function how to sample the input image. The image produced by sampling with a constructed local sampling grid is the input image transformed by the image affine transformation that the corresponding affine transformation matrix represents. The image is transformed via a constructed sampling grid so that this step remains differentiable during back-propagation training.
Because the sampling grid constructed here describes an affine transformation of a local position of the source image, we call it a local sampling grid.
The sampling grid constructed from the affine transformation A_i can be expressed as G_i = A_i · G, where G denotes the identity-transformation sampling grid; the expression means that the affine transformation matrix A_i is applied to the identity sampling grid, yielding the sampling grid that performs the corresponding image affine transformation on the input image.
In this step, n local sampling grids are computed; together with the identity-transformation sampling grid G, there are n+1 local sampling grids.
Step D02: each of the n+1 local sampling grids samples the source image I, giving n+1 source images under the simple affine transformations:

I_i = Sampler(I, G_i), i = 0, 1, ..., n

where Sampler(·) is the function that samples an image according to a sampling grid, and its output I_i is the sampled image.
Step D03: concatenate the n+1 affine-transformed source images along the channel dimension and feed the resulting tensor to the mask generator G_grid, which generates n+1 masks whose element values lie in [0, 1]:

[M_0, M_1, ..., M_n] = G_grid(concat(I_0, I_1, ..., I_n))

where M_i is the mask corresponding to the local sampling grid G_i. In particular, M_0 is the mask corresponding to the identity-transformation sampling grid G; it indicates the regions of the source image I whose pixel positions are unchanged relative to the target image T.
Step D01 produced n+1 local sampling grids in total. We know that each of these image transformations occurs somewhere in the source image, but not exactly where. To combine the local sampling grids into a global sampling grid describing the complex motion change, we need to know, for each local sampling grid, which position of the image its transformation describes. This is exactly what the masks provide.
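A minimal sketch of a mask generator consistent with step D03 follows. Applying a softmax across the n+1 channels is an assumption (it keeps every element in [0, 1] and makes the masked sum of grids a convex combination); the patent only states that the element values range from 0 to 1.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Produces n+1 soft masks, one per local sampling grid, from the
    channel-wise concatenation of the n+1 warped source images."""
    def __init__(self, n_grids, img_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_grids * img_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_grids, 3, padding=1),
        )

    def forward(self, warped):               # warped: (B, n+1, 3, H, W)
        B, k, C, H, W = warped.shape
        x = warped.reshape(B, k * C, H, W)   # concat(I_0, ..., I_n)
        return torch.softmax(self.net(x), dim=1)  # (B, n+1, H, W) masks M_i
```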
Step D04: compute the global sampling grid of the target object under the complex motion change from the n+1 local sampling grids and the n+1 masks.
Mask out the effective area of each local sampling grid with its corresponding mask, then sum all the masked local sampling grids to obtain the global sampling grid:

G_global = Σ_{i=0}^{n} M_i ⊙ G_i

where ⊙ denotes element-wise multiplication and G_global is the global sampling grid representing the motion change from the source image to the target image.
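Under the same shape conventions as the sketches above, the masked fusion itself is a one-liner:

```python
def combine_grids(grids, masks):
    """grids: (B, n+1, H, W, 2); masks: (B, n+1, H, W).
    Implements G_global = sum_i M_i * G_i (element-wise product)."""
    return (grids * masks.unsqueeze(-1)).sum(dim=1)   # (B, H, W, 2)
```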
Step E: the generation module of the motion migration model reconstructs a fake target image from the global sampling grid and the source image; the loss of the motion migration model is computed, and the model is trained once with the back-propagation algorithm. The specific steps are as follows:
step E01, a generation module G in the action migration model takes a source image I and a global sampling grid as input, and the generation module generates a reconstructed false target image
Inside the generation module G, the generator first down-samples the input image to obtain low-resolution image features that are easy to operate on. The global sampling grid is then applied to sample these image features, approximately aligning them with the target image. Finally, the aligned image features are up-sampled and the fake target image T̂ is output.
Sampling at low resolution reduces the computation of the model; moreover, the down-sampling part of the generation module can optimize the low-resolution image features to some extent, making the generated image better.
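Below is a hedged sketch of a generation module with the down-sample, warp-at-low-resolution, up-sample structure described above; the exact layer counts and channel widths are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Encode the source image, warp the low-resolution features with the
    global sampling grid, then decode to the fake target image."""
    def __init__(self):
        super().__init__()
        self.down = nn.Sequential(   # to H/4 x W/4, cheap to warp
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(     # back to H x W
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, source, global_grid):          # grid: (B, H, W, 2)
        feat = self.down(source)
        # Resize the grid to the feature resolution before sampling.
        grid = F.interpolate(global_grid.permute(0, 3, 1, 2),
                             size=feat.shape[2:], mode='bilinear',
                             align_corners=False).permute(0, 2, 3, 1)
        aligned = F.grid_sample(feat, grid, align_corners=False)
        return self.up(aligned)                      # fake target image
```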
Step E02: compute the total loss of the motion migration model from the target image T and the fake target image, and optimize the model once with the back-propagation algorithm.
In the loss computation stage of the model, the existing perceptual loss is adopted as the main driving loss: the perceptual loss acts as a reconstruction loss that drives the model to produce the same result as the target image, while an adversarial loss acts as an auxiliary loss that drives the motion migration model to produce more vivid and sharper results.
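A common way to realize the perceptual reconstruction loss is an L1 distance between VGG feature maps of the fake and real target images. The sketch below assumes VGG-19 with an arbitrary choice of layers, since the patent does not fix these details, and omits the adversarial term.

```python
import torch
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    """L1 distance between VGG-19 feature maps of two images."""
    def __init__(self, layers=(3, 8, 15)):
        super().__init__()
        self.vgg = models.vgg19(
            weights=models.VGG19_Weights.DEFAULT).features.eval()
        self.layers = set(layers)                 # assumed layer indices
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, fake, target):
        loss, x, y = 0.0, fake, target
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layers:
                loss = loss + torch.nn.functional.l1_loss(x, y)
        return loss
```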
Step F: load the next video sequence and return to step C until the motion migration model converges and generates well, yielding the trained motion migration model.
Step G: for the same type of target object, acquire a video P and a static image S; migrate the motion in video P onto the static image S with the trained motion migration model, generating a new video with the appearance of the object in S and the motion contained in P, thereby realizing complex motion migration of the target object. The specific steps are as follows.
Step G01: for the same type of target object, acquire a video P and a static image S.
Suppose the goal is face motion migration and a face motion migration model has been obtained by the steps above. The acquired video P contains the same kind of face data as the static image S: video P records the motion changes of a face, but the person in video P looks different from the person in the static image S.
Step G02: search the video P for a similar frame P_f whose motion is similar to that of the image S.
A motion-similar frame means that, although the person in image S and the person in the similar frame P_f differ in appearance and may even differ in gender, the two are similar in terms of facial motion: for example, both show the same side of the face, both have the mouth closed and the eyes open, and so on.
Step G03: from the video P and the similar frame P_f, use the regression module of the motion migration model to compute the n sets of affine transformation matrices between each frame P_i of video P and the similar frame P_f.
Step G04: from the n sets of affine transformation matrices and the image S, obtain the global sampling grid G_{S→S_i} through the mask generator of the motion migration model, where S_i denotes the i-th frame of the result video P_S to be generated.
We apply the global sampling grid estimated from the video frame P_i and the similar frame P_f to the image S. Although the face in image S is not the same person as the face in video P, this is feasible because the motion information we characterize with multiple affine transformations, namely the global sampling grid, is appearance-independent. In other words, our model accomplishes the migration of the motion information.
Step G05: from the global sampling grid G_{S→S_i} and the static image S, the image generator of the motion migration model generates a fake target image frame S_i.
Step G06: traverse every frame of the video P, repeating steps G03 to G05, and connect the fake target image frames S_i generated for the frames of video P in sequence to obtain the motion-migrated result video P_S.
The final result video P_S is exactly the expected result: the face in the video is the same as the face in image S, and it performs the same motions as the face in video P.
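Steps G03 to G06 amount to the loop sketched below; regress, global_grid, and generate stand in for the three trained modules, and their exact interfaces are assumptions for illustration.

```python
import torch

@torch.no_grad()
def transfer_motion(model, video_frames, still_image, similar_idx):
    """video_frames: driving frames P_i of video P; still_image: S;
    similar_idx: index of the motion-similar frame P_f found in step G02."""
    p_f = video_frames[similar_idx]
    result = []
    for p_i in video_frames:                        # every frame of video P
        affines = model.regress(p_f, p_i)           # n affine matrices (G03)
        grid = model.global_grid(still_image, affines)    # masked fusion (G04)
        result.append(model.generate(still_image, grid))  # fake frame S_i (G05)
    return torch.stack(result)                      # result video P_S (G06)
```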
Claims (4)
1. A method for realizing complex motion migration based on multi-affine transformation representation is characterized by comprising the following steps:
step A, aiming at the same type of target object, collecting a plurality of video sequences to form a data set for training a motion migration model, and turning to step B;
step B, loading the first video sequence in the data set, and turning to step C;
step C, randomly selecting two frames of images from the loaded video sequence, wherein the two frames of images are respectively called a source image and a target image to form a training data pair; estimating n groups of affine transformation matrixes from the two frames of images through a regression module in the motion migration model, wherein n is more than or equal to 1, and the method specifically comprises the following steps:
step C01, randomly selecting two frames of images from a video sequence V, recording the two frames as a source image I and a target image T to form a training data pair;
and C02, connecting the source image I with the target image T in the channel dimension of the image to obtain an input tensor, and taking the input tensor as a regression module E in the action migration model regress The regression module outputs n sets of affine transformation matrices:
[A 1 ,A 2 ,...A i ...,A n ]=E regress (concat(I,T))
where concat () represents a channel dimension join operation, A i Representing the ith affine transformation matrix, A n Representing an nth affine transformation matrix;
turning to the step D;
step D, taking the source image and all the affine transformation matrixes as input of a mask generator, generating an equivalent mask indicating the occurrence position of each simple affine transformation by the mask generator, and generating a global sampling grid by combining all the affine transformation matrixes and the corresponding masks thereof, wherein the method specifically comprises the following steps:
step D01, constructing n corresponding local sampling grids from the n sets of affine transformation matrices, and then adding the local sampling grid corresponding to the identity transformation, obtaining n+1 local sampling grids in total;
step D02, the n +1 local sampling grids respectively sample the source image I to obtain n +1 simple affine transformed source images;
step D03, connecting the n +1 source images subjected to the simple affine transformation in a channel dimension, taking the tensor obtained after the connection as the input of a mask generator, and generating n +1 masks with the element value range of 0-1 by the mask generator;
step D04, calculating a global sampling grid of the target object with complex motion changes according to the n +1 local sampling grids and the n +1 masks:
covering the effective area of each local sampling grid by using a corresponding mask, and finally summing all the covered local sampling grids to obtain a global sampling grid;
turning to step E;
step E, a generation module in the motion migration model reconstructs a fake target image through the global sampling grid and the source image, the loss of the motion migration model is calculated, the model is trained once in combination with the back-propagation algorithm, and the process goes to step F;
step F, reloading the next video sequence, and returning to step C until the motion migration model converges and has a good generation effect, obtaining a trained motion migration model, and turning to step G;
step G, aiming at the same type of target object, acquiring a video P and a static image S, and migrating the motion in the video P to the static image S by using the trained motion migration model to generate a new video with the appearance of the object in the static image S and the motion contained in the video P, so as to realize the complex motion migration of the target object.
2. The method for implementing complex motion migration based on the multi-affine transformation characterization according to claim 1, wherein in step E, the generation module in the motion migration model reconstructs a fake target image through the global sampling grid and the source image, calculates the loss of the motion migration model, and trains the model once in combination with the back-propagation algorithm, specifically comprising the following steps:
step E01, the generation module in the motion migration model takes the source image I and the global sampling grid as the input of an image generator, and the image generator generates a reconstructed fake target image;
step E02, calculating the total loss of the motion migration model according to the target image T and the fake target image, and optimizing the model once in combination with the back-propagation algorithm.
3. The method for implementing complex motion migration based on multi-affine transformation characterization according to claim 1, wherein: step G, aiming at the same type of target object, acquiring a section of video P and a static image S, migrating the motion in the video P to the static image S by using a trained motion migration model, and generating a new video with the appearance of the object in the static image S and the motion contained in the video P so as to realize the complex motion migration of the target object, which specifically comprises the following steps:
step G01, aiming at the same type of target object, acquiring a section of video P and a static image S;
step G02. searching a similar frame P similar to the image S in motion in the video P f ;
Step G03, according to the video P and the similar frame P f Calculating a certain frame P in the video P by using a regression module in the motion migration model i With similar frame P f N sets of affine transformation matrices;
step G04, obtaining a global sampling grid through a mask generator in the motion migration model according to the n groups of affine transformation matrixes and the image SWhereinS i Refers to the video P desired to be generated S The ith frame in (1);
step G05, according to the global sampling gridAnd a static image S, generating a false target image frame S by an image generator of the motion migration model i ;
Step G06, traversing each frame in the video P, repeating G03-G05, and generating a false target image frame S corresponding to each frame in the video P i Connected in sequence to obtain a result video P after action migration S 。
4. The method for implementing complex motion migration based on the multi-affine transformation characterization according to claim 1, wherein the motion migration model comprises a regression module, a generation module, and a mask generator.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110395430.8A | 2021-04-13 | 2021-04-13 | Method for realizing complex motion migration based on multiple affine transformation representations |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN113160034A | 2021-07-23 |
| CN113160034B | 2022-09-20 |
Family
ID=76890304
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110395430.8A | CN113160034B (active) | 2021-04-13 | 2021-04-13 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN113160034B |
Families Citing this family (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113870315B | 2021-10-18 | 2023-08-25 | 南京硅基智能科技有限公司 | Multi-algorithm integration-based action migration model training method and action migration method |
Citations (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105069746A | 2015-08-23 | 2015-11-18 | 杭州欣禾圣世科技有限公司 | Video real-time human face substitution method and system based on partial affine and color transfer technology |
Family Cites Families (1)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8311306B2 | 2008-04-30 | 2012-11-13 | Otismed Corporation | System and method for image segmentation in generating computer models of a joint to undergo arthroplasty |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |