CN110197167B - Video motion migration method - Google Patents

Video motion migration method

Info

Publication number
CN110197167B
CN110197167B (application CN201910485182.9A)
Authority
CN
China
Prior art keywords
video
foreground
background
target
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910485182.9A
Other languages
Chinese (zh)
Other versions
CN110197167A (en)
Inventor
袁春
成昆
黄浩智
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University
Priority to CN201910485182.9A
Publication of CN110197167A
Application granted
Publication of CN110197167B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a video motion migration method, which comprises the following steps: extracting the motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively; receiving an image input of the source video; performing preliminary feature extraction of the foreground and the background; fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; and adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, the content loss function comprises a pixel-level error loss and a perceptual error loss, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss. An overall pipeline model with universality and flexibility is thereby constructed.

Description

Video motion migration method
Technical Field
The invention relates to the technical field of computer vision, in particular to a video motion migration method.
Background
Portrait video generation is a frontier topic with a large number of application scenarios. It can be used to generate training data for higher-level vision tasks, such as human pose estimation, object detection and grouping, person re-identification, and so on. It also helps in developing more powerful directional video editing tools. Existing portrait video generation approaches fall mainly into three categories: unconditional video generation, video frame prediction, and video motion migration.
Unconditional video generation focuses on mapping one-dimensional latent vectors to portrait video; the latent vector must simultaneously encode the appearance and the motion information of the video. After training is completed, different generated videos can be obtained by randomly sampling the latent vector. However, this approach cannot flexibly control the motion and appearance of the generated video.
Video frame prediction work aims to predict future frames from previous frames. This can be seen as a two-stage problem: first predict the motion of future frames from past frames, then predict the complete frames from that motion. The second stage is similar to video motion migration, but existing video frame prediction methods focus on the first stage and give little consideration to how the second stage maintains appearance details and temporal continuity.
The present application focuses on the video motion migration problem and aims to migrate the human motion in a target video onto the human body in a source video while maintaining the appearance of the source person. In this way, the motion of the generated video can be controlled exactly, as long as a target video containing the desired motion sequence is provided. Although many methods attempt to solve the motion migration problem for single-frame images, directly applying them to continuous video gives unsatisfactory results: where the video motion is complex and hard to predict, single-frame motion migration methods introduce severe blurring, aliasing, and other visually unnatural artifacts.
Other recent work narrows the general motion migration problem to migrating arbitrary motions onto a fixed character and scene. Because the problem complexity is reduced, such methods often yield very attractive results; strictly speaking, however, they no longer solve a migration problem: since the target character and scene are fixed, the appearance and background of the generated video need not be transferred from a source video at all, but can be memorized in the network parameters, turning generation into a direct transformation from a motion latent vector to video. Such methods therefore require a separate model to be trained for each source subject, and the foreground character is bound to the background scene, which runs contrary to our goal of flexibility and universality.
The prior art therefore lacks an effective method for extending image motion migration to video.
Disclosure of Invention
The invention provides a video motion migration method for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A video motion migration method comprises the following steps: S1: extracting the motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively; S2: receiving an image input of the source video; S3: performing preliminary feature extraction of the foreground and the background; S4: fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; S5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss.
In one embodiment of the invention, a 2D pose detection model is adopted to extract the motion sequences of the source video and the target motion video.
In one embodiment of the invention, the image input of the source video comprises inputting K frames of images, the value of K being 4.
In an embodiment of the present invention, in step S3, a single frame migration method is used to select the penultimate layer features of the foreground and background branches for subsequent fusion.
In one embodiment of the invention, the preliminary features of the background and the foreground are respectively fused by a spatio-temporal attention mechanism in step S4; the spatio-temporal attention mechanism comprises: an RB6 structure, in which the backbone network consists of 6 residual modules and the preliminary features are weighted and fused by a channel-dimension SOFTMAX; an SA3D+RB6 structure, in which a three-dimensional self-attention module is added before the RB6 structure to enhance the features; and an RB6+SA2D structure, in which a two-dimensional self-attention module is added after the RB6 structure to enhance the features.
In an embodiment of the present invention, in step S4 the frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
where F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
In one embodiment of the invention, the content loss function is defined as:
L_MSE = || O_t - Õ_t ||_2^2
where L_MSE is the mean square error function, O_t is the frame model of the target video at time t, and Õ_t is the real frame of the target video at time t. The content loss function further includes a perceptual loss defined as:
L_VGG = || φ(O_t) - φ(Õ_t) ||
where φ represents features extracted from the pre-trained VGG19 model.
In one embodiment of the invention, the spatial adversarial loss is defined as:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
where D_I is a single-frame image discrimination network and p_t^T represents the target pose of the target video at time t.
the multi-scale temporal confrontation loss is defined as:
Figure GDA0002112767570000035
wherein the content of the first and second substances,WTis an optical flow sequence calculated by FlowNet2, and comprises optical flow information between each pair of continuous frames; vTIs a target action video; voIs a target video;
Figure GDA0002112767570000036
the device is a time domain discriminator which receives n frames of images and optical flow information thereof as input and learns and discriminates the generated continuous n frames and real n frames.
In one embodiment of the invention, the loss function is defined as: L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V, where λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss. The invention also provides a computer-readable storage medium in which a computer program is stored which, when executed by a processor, carries out the steps of any one of the above methods.
The beneficial effects of the invention are as follows. The video motion migration method provides appearance information through multi-frame input, is guided by a spatio-temporal attention mechanism, and adopts multi-time-scale discriminators for adversarial supervision, forming a universal video motion migration scheme. The pipeline is flexible: elements such as the foreground, the background and the motion are parsed from different videos, and by changing which input video supplies each element, combined videos such as a given person performing an action in another scene can be realized. A new content fusion mechanism is provided, which generates more real and natural foreground and background images based on the spatio-temporal attention mechanism. An end-to-end trained multi-time-scale discriminator is presented to encourage the generator to produce temporally smoother continuous video.
Drawings
Fig. 1 is a schematic diagram of a video motion migration method according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
The problem addressed by the present application is human motion migration in video. V = {I_1, I_2, ..., I_N} denotes an N-frame video in which a single person performs a general movement, such as dancing. Even under the simplifying assumption that the viewpoint (camera) and the background are static, this remains an unsolved and challenging problem. Given a source video V_S and a target motion video V_T, the goal of motion migration is to transfer the motion of V_T onto V_S while maintaining the appearance characteristics of V_S. In this way the generated target video V_O exhibits simultaneous control of motion and appearance. A pre-trained 2D pose detection model is applied to extract the motion sequences of the source video and the target motion video, P = {p_1, p_2, ..., p_N}. Each p_t represents the pose of the t-th frame and is implemented as an M-channel heat map, where M = 14 is the number of keypoints. The source and target pose sequences are denoted P_S and P_T, respectively. It will be appreciated that a more advanced pose extractor may also be employed to improve accuracy and performance, which is not limited herein.
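As an illustration of this pose representation, the following sketch (not taken from the patent; the function name, image size and Gaussian radius sigma are assumptions) rasterises M = 14 detected keypoints into an M-channel heat map:

import numpy as np

def pose_to_heatmap(keypoints, height, width, sigma=6.0):
    """keypoints: array of shape (M, 2) holding (x, y) pixel coordinates.
    Returns an (M, height, width) heat map, one Gaussian channel per keypoint."""
    num_keypoints = keypoints.shape[0]
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((num_keypoints, height, width), dtype=np.float32)
    for m, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # keypoint not detected in this frame
            continue
        heatmap[m] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap

# Example: a 14-keypoint pose (M = 14) rasterised onto a 256 x 256 grid.
p_t = pose_to_heatmap(np.random.rand(14, 2) * 256, 256, 256)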
Unlike single-frame motion migration, the method accepts K source frames, their corresponding pose information, and the target pose; in one specific embodiment K = 4. The frame model of the target video may be roughly expressed as:
O_t = G( {I_k^S, p_k^S}_{k=1..K}, p_t^T )
where G denotes the generation pipeline described below.
as shown in fig. 1, a video motion migration method includes the following steps:
s1: extracting action sequences of the source video and the target action video and respectively generating a source gesture and a target gesture;
s2: receiving an image input of the source video;
s3: performing preliminary feature extraction of the foreground and the background; i.e. extracting preliminary features of the foreground and background from the source pose, the target pose and the image input of the source video.
S4: fusing the preliminary features of the background and the foreground respectively to generate a fused feature of the background and a fused feature of the foreground; synthesizing a fusion feature synthesis background through the fusion feature synthesis of the background; synthesizing a fusion characteristic synthesis foreground and a foreground mask through the fusion characteristic synthesis foreground, and further obtaining a frame model of the target video after action migration at the time t;
s5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and a countermeasure loss function, and the countermeasure loss function comprises a spatial countermeasure loss and a multi-scale time domain countermeasure loss.
The overall framework of the method mainly comprises a single-frame migration feature extraction module, a foreground and background feature fusion module, and a final prediction and synthesis module. The foreground and background are separated and re-merged by a predicted mask.
In step S3, the preliminary feature extraction module uses an existing single-frame migration method, and the penultimate-layer features of the foreground and background branches are selected for subsequent processing. Compared with directly fusing the generated foreground and background images, the penultimate-layer features contain richer information, which benefits the training of the fusion module; compared with higher-level features, layers near the output can easily yield an output image through a single final layer. Balancing content richness against ease of use, we therefore select the penultimate-layer features of the foreground and background branches and reserve them for subsequent fusion, enhancement and adversarial training.
In step S4, note that in single-frame pose migration the quality of the synthesized foreground depends heavily on the choice of source video frame. For example, if the source frame is a back view and a front-view pose is to be generated, the result is blurred. In addition, the incompleteness of single-image information causes instability in the synthesis results and aggravates temporal discontinuity in the generated video. This application therefore provides a multi-frame fusion module for refining foreground (or background) synthesis, which fuses the preliminary features of the K source frames to produce higher-quality pre-synthesis features. For each time step t, the preliminary features of the K frames are fed into the fusion module to generate the fused features. On this basis, the prediction module synthesizes the background from the fused background features, and synthesizes the foreground and the foreground mask from the fused foreground features. The network structure of the prediction module is a single-layer 3x3 convolution; the activation function for predicting the foreground and background images is Tanh, and the activation function for the foreground mask is Sigmoid.
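A minimal sketch of this prediction module follows; only the single-layer 3x3 convolutions and the Tanh/Sigmoid activations come from the text, while the class name and the channel count of the fused features (64) are assumptions:

import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Single-layer 3x3 convolution heads applied to the fused features."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.fg_head = nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)
        self.bg_head = nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, fused_fg, fused_bg):
        fg = torch.tanh(self.fg_head(fused_fg))          # synthesized foreground (Tanh)
        bg = torch.tanh(self.bg_head(fused_bg))          # synthesized background (Tanh)
        mask = torch.sigmoid(self.mask_head(fused_fg))   # foreground mask (Sigmoid)
        return fg, bg, mask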
Among the possible feature-fusion approaches, the simplest and most intuitive is channel-dimension MAX-POOLING or AVERAGE-POOLING. To exploit the multi-frame information further, this application proposes three variants of a spatio-temporal attention mechanism:
RB6 structure: the backbone network consists of 6 residual modules, and the preliminary features are weighted and fused by a channel-dimension SOFTMAX;
SA3D+RB6 structure: a three-dimensional self-attention module is added before the RB6 structure to enhance the features;
RB6+SA2D structure: a two-dimensional self-attention module is added after the RB6 structure to enhance the features.
Their inputs are the K sets of preliminary features together with the source-pose and target-pose information. The most basic variant, "RB6", consists of 6 residual blocks and computes a K x H x W spatio-temporal attention map. The K groups of preliminary features are then weighted by the attention map to obtain the fused foreground features:
F_fused = Σ_{k=1..K} A_k ⊙ F_k
where F and A denote the preliminary features and the attention map respectively, and ⊙ is element-by-element multiplication.
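The weighted fusion in the equation above can be sketched as follows; the attention backbone itself (the six residual blocks of "RB6") is omitted, and the tensor shapes are assumptions:

import torch

def fuse_features(prelim_feats, attention_logits):
    """prelim_feats: (B, K, C, H, W) preliminary features of the K source frames.
    attention_logits: (B, K, H, W) raw scores produced by the fusion backbone."""
    attn = torch.softmax(attention_logits, dim=1)   # normalise across the K frames
    attn = attn.unsqueeze(2)                        # (B, K, 1, H, W), broadcast over channels
    return (attn * prelim_feats).sum(dim=1)         # element-wise weighting, summed over K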
The drawback of "RB6" is that, although the attention is computed from spatio-temporal information, the final processing is only a spatially local temporal weighting. To alleviate this, two more complex variants, "SA3D+RB6" and "RB6+SA2D", are proposed. Experiments show that the two variants perform similarly, but "RB6+SA2D" runs more efficiently.
The frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
where F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
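A minimal sketch of this compositing step (tensor shapes are assumptions):

import torch

def composite_frame(fg, bg, mask):
    """fg, bg: (B, 3, H, W) synthesized foreground/background; mask: (B, 1, H, W) in [0, 1]."""
    return mask * fg + (1.0 - mask) * bg   # O_t = M_t * F_t + (1 - M_t) * B_t, element-wise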
The loss functions as a whole can be divided into two broad categories: content losses and adversarial losses.
Content loss: in order to realize supervised training, different frames of the same video are used as the source character frames and the target motion frames during the training stage, ensuring that the source frames and the target motion frames do not overlap. After training, for an arbitrary source video an arbitrary target motion video can be selected to provide the target motion sequence. Under supervised training, the generated frame O_t should be as close as possible to the real target frame Õ_t. The simplest and most direct loss function is then the mean square error (MSE loss):
L_MSE = || O_t - Õ_t ||_2^2
where O_t is the frame model of the target video at time t and Õ_t is the real frame of the target video at time t.
However, such a loss function tends to produce blurred results: the generator learns to cover as many possibilities as possible and eventually converges to an averaged, i.e. blurred, solution. To add more detail, a perceptual loss is also used:
L_VGG = || φ(O_t) - φ(Õ_t) ||
where φ denotes features extracted by a pre-trained VGG19 model. In practice we choose the features of the layers {conv1_1, conv2_1, conv3_1, conv4_1}. L_VGG constrains the generated frame and the real frame to be as similar as possible in the feature domain of the pre-trained VGG network, thereby enhancing perceptual similarity.
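The content loss can be sketched as below, assuming a torchvision VGG19 backbone; the layer indices taken to correspond to conv1_1/conv2_1/conv3_1/conv4_1 and the L1 distance on the VGG features are illustrative assumptions:

import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LAYER_IDS = [0, 5, 10, 19]  # assumed indices of conv1_1, conv2_1, conv3_1, conv4_1

def vgg_features(x):
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in LAYER_IDS:
            feats.append(h)
    return feats

def content_loss(generated, target):
    l_mse = F.mse_loss(generated, target)   # pixel-level error
    l_vgg = sum(F.l1_loss(a, b) for a, b in zip(vgg_features(generated),
                                                vgg_features(target)))  # perceptual error
    return l_mse, l_vgg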
Spatial adversarial loss: to encourage each generated frame to contain more realistic detail, a spatial adversarial loss is introduced. A single-frame conditional discriminator is trained to distinguish generated frames from real frames. LSGAN and PatchGAN are used to ensure training stability:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
where D_I is the single-frame image discrimination network and p_t^T represents the target pose of the target video at time t.
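A sketch of this LSGAN-style spatial adversarial loss; D_I is assumed to be a PatchGAN-style conditional discriminator, and feeding it the frame concatenated with the target pose heat map along the channel dimension is an assumption about the conditioning layout:

import torch

def d_image_loss(D_I, real_frame, fake_frame, target_pose):
    """LSGAN discriminator loss: real patches pushed towards 1, generated towards 0."""
    real_score = D_I(torch.cat([real_frame, target_pose], dim=1))
    fake_score = D_I(torch.cat([fake_frame.detach(), target_pose], dim=1))
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()

def g_image_loss(D_I, fake_frame, target_pose):
    """Generator side: make D_I score the generated frame as real (towards 1)."""
    fake_score = D_I(torch.cat([fake_frame, target_pose], dim=1))
    return ((fake_score - 1) ** 2).mean()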
Multi-scale temporal adversarial loss: in addition to the spatial adversarial loss, a multi-scale temporal adversarial loss is introduced to encourage the generated video to be as close as possible to real video in its temporal dynamics. Unlike a time-domain discriminator with a single fixed range, multiple time-domain discriminators are trained to evaluate temporal continuity at different time scales. The multi-scale temporal adversarial loss is defined as:
L_GAN,V = Σ_n ( E[ (D_V^n(V_T, W_T) - 1)^2 ] + E[ D_V^n(V_O, W_O)^2 ] )
where W_T is the optical-flow sequence calculated by FlowNet2, containing the optical flow between each pair of consecutive frames, and W_O is the corresponding optical-flow sequence of the generated video; V_T is the target motion video; V_O is the generated target video; and D_V^n is a time-domain discriminator that receives n frames and their optical-flow information as input and learns to discriminate between the generated consecutive n frames and real n frames.
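A sketch of the discriminator-side part of this loss; the set of time scales, the dictionary of per-scale discriminators and the channel-concatenation of frames with their FlowNet2 flows are assumptions for illustration:

import torch

def temporal_d_loss(video_discriminators, real_clip, fake_clip, real_flow, fake_flow):
    """real_clip/fake_clip: (B, T, 3, H, W) frames; *_flow: (B, T-1, 2, H, W) optical flows.
    video_discriminators: e.g. {3: D_V3, 5: D_V5, 7: D_V7}, one per time scale n."""
    loss = 0.0
    for n, D_Vn in video_discriminators.items():
        real = torch.cat([real_clip[:, :n].flatten(1, 2),
                          real_flow[:, :n - 1].flatten(1, 2)], dim=1)
        fake = torch.cat([fake_clip[:, :n].flatten(1, 2),
                          fake_flow[:, :n - 1].flatten(1, 2)], dim=1)
        # LSGAN: real n-frame clips pushed towards 1, generated n-frame clips towards 0
        loss = loss + ((D_Vn(real) - 1) ** 2).mean() + (D_Vn(fake.detach()) ** 2).mean()
    return loss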
Total loss function: the overall loss is a weighted sum of the above terms:
L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V
where λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss.
The objective problem of the present application can thus be expressed as:
min_G max_{D_I, D_V} L_total
where D_V denotes the set of all video discriminators at different time scales, D_V = {D_V^n}. This objective function is optimized by alternately updating the generator G and the discriminators D.
Example 2
The present application uses PSNR and VFID as evaluation metrics. To compute the VFID, video features are first extracted using a pre-trained video classification model (I3D), and then the mean μ and covariance matrix Σ of these features are computed over all videos in the dataset. Finally the VFID is computed by the formula:
VFID = || μ_1 - μ_2 ||_2^2 + Tr( Σ_1 + Σ_2 - 2(Σ_1 Σ_2)^{1/2} )
where the subscripts 1 and 2 denote the real and generated video sets. VFID measures both visual quality and temporal continuity.
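The VFID computation can be sketched as follows, assuming the I3D features of the real and generated video sets have already been extracted (the feature-extraction step is omitted):

import numpy as np
from scipy import linalg

def vfid(real_feats, fake_feats):
    """real_feats, fake_feats: (num_videos, feature_dim) arrays of I3D features."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real      # matrix square root of (Sigma_1 Sigma_2)
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))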
For migration within the same video, the ground-truth video is available as the target, so PSNR and VFID can be computed directly. For cross-video migration, PSNR cannot be computed since there are no corresponding real frames; meanwhile the reference value of VFID is greatly reduced, because appearance and background strongly influence the features extracted by the I3D network. Quantitative results are therefore provided only for within-video motion migration.
TABLE 1 quantitative results
(Table 1, listing the PSNR and VFID scores of each method on the within-video test set, is not reproduced here.)
The table reports PSNR and VFID scores of the different methods on the within-video test set. Higher PSNR is better, indicating that single-frame quality is closer to the real frames; lower VFID is better, indicating that the overall video quality is closer to the real video. The best two results for each metric are highlighted in bold in the table.
Comparing the first two rows, MSE and MSE+VGG, shows that for the single-frame base model, introducing the VGG loss alongside the MSE loss to measure content loss improves both single-frame quality and video-level temporal continuity.
Comparing "MSE+VGG" with "MSE+VGG+Fusion" shows that the VFID score improves significantly after adding multi-frame fusion, indicating that multi-frame fusion greatly benefits overall video quality.
Comparing "RB6" and "RB6+Dv" shows that both metrics improve to different degrees after introducing the multi-scale time-domain discriminators.
Comparing "RB6+Dv3" with "RB6+Dv" shows that although the PSNR of Dv3 is somewhat better than that of the full version Dv357, it sacrifices overall perceptual quality at the video level.
Comparing the different fusion modes in the next four rows, "Max" shows the best VFID score but the worst PSNR score, meaning that the single-frame quality of max fusion is poor while the VFID criterion is somehow fooled. (The subsequent qualitative experiments confirm that the results of max fusion are not of good quality: it enhances temporal continuity by introducing meaningless details, and the results do not look real.) Of the last two rows, "SA3D+RB6" shows the best PSNR score, and "RB6+SA2D" performs outstandingly on both metrics.
A human scoring test was also performed to compare the "RB6+SA2D" configuration with the single-frame base model. For each configuration, each user watched 5 groups of cross-video motion migration results and 5 groups of within-video migration results. The order of the different experiments was randomly shuffled to ensure fair judgement. For each comparison we asked the user two questions: one on overall video quality and realism ("which video looks more realistic"), and one on temporal continuity ("which video flickers less"). Twenty people aged between 20 and 30 were tested. The average human-evaluation scores are shown in Table 2; our method is significantly better than the current best single-frame model.
Table 2 human score test results
(Table 2, listing the average human-evaluation scores for each configuration, is not reproduced here.)
Example 3
Qualitative experiments were also performed. Two scenarios, motion migration within the same video and motion migration across videos, were tested, corresponding to two different test subsets: i) the cross-video test set, in which the source character/background frames and the target motion video come from different video sequences; ii) the within-video test set, in which the source character/background frames and the target motion video come from the same video sequence. For each set, 50 video pairs were randomly selected from the test set and fixed as the test subset. Note that for the within-video subset it is ensured that the source and target sequences do not intersect or overlap.
In the results generated by the single frame base model, significant blurring and unnaturalness can be observed.
The results of the max-pooling fusion method tend to contain strange colors and shadows in the foreground and background, presumably due to the persistence effect of max fusion. This corroborates the conclusion of the quantitative experiments above: although max fusion improves temporal continuity, it loses the original content and realism of the video.
"RB6+SA2D" and "SA3D+RB6" show the best overall quality. Through multi-frame fusion and enhancement based on the spatio-temporal attention mechanism, the background completion is more accurate while more detail is retained in the foreground.
To explore the multi-frame fusion mechanism more deeply, intermediate results of some "RB6+SA2D" fusion modules were visualized, showing the attention allocated to the different frames, i.e. the output of the "RB6" module. In the single-frame prediction results, clearly incongruous regions can be seen, such as blurred railings in the background; but the blurred regions differ from frame to frame. Our method locates the "comfort zone" of each source frame through attention allocation and guides the synthesis of the foreground and background with more precise details.
All or part of the flow of the method of the embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a processor, to instruct related hardware to implement the steps of the embodiments of the methods. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications having the same properties or uses are considered to be within the scope of the invention.

Claims (9)

1. A video motion migration method is characterized by comprising the following steps:
S1: extracting motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively;
S2: receiving an image input of the source video;
S3: performing preliminary feature extraction of the foreground and the background;
S4: fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; in step S4, the preliminary features of the background and the foreground are respectively fused by a spatio-temporal attention mechanism; the spatio-temporal attention mechanism comprises:
an RB6 structure: the backbone network consists of 6 residual modules, and the preliminary features are weighted and fused by a channel-dimension SOFTMAX;
an SA3D+RB6 structure: a three-dimensional self-attention module is added before the RB6 structure to enhance the features;
an RB6+SA2D structure: a two-dimensional self-attention module is added after the RB6 structure to enhance the features;
S5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss.
2. The video motion migration method according to claim 1, wherein a 2D pose detection model is used to extract the motion sequences of the source video and the target motion video.
3. The video motion migration method according to claim 1, wherein the image input of the source video comprises inputting K frames of images, and the value of K is 4.
4. The video motion migration method according to claim 1, wherein in step S3, a single frame migration method is used to select the penultimate features of the foreground and background branches for subsequent fusion.
5. The video motion migration method according to claim 1, wherein in step S4 the frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
wherein F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
6. The video motion migration method according to claim 1, wherein the content loss function is defined as:
L_MSE = || O_t - Õ_t ||_2^2
wherein L_MSE is the mean square error function, O_t is the frame model of the target video at time t, and Õ_t is the real frame of the target video at time t;
the content loss function further comprises a perceptual loss defined as:
L_VGG = || φ(O_t) - φ(Õ_t) ||
wherein φ represents features extracted from the pre-trained VGG19 model.
7. The video motion migration method according to claim 6, wherein the spatial adversarial loss is defined as:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
wherein D_I is a single-frame image discrimination network and p_t^T represents the target pose of the target video at time t;
the multi-scale temporal adversarial loss is defined as:
L_GAN,V = Σ_n ( E[ (D_V^n(V_T, W_T) - 1)^2 ] + E[ D_V^n(V_O, W_O)^2 ] )
wherein W_T is an optical-flow sequence calculated by FlowNet2, comprising the optical flow between each pair of consecutive frames, and W_O is the corresponding optical-flow sequence of the generated video; V_T is the target motion video; V_O is the generated target video; and D_V^n is a time-domain discriminator that receives n frames and their optical-flow information as input and learns to discriminate between the generated consecutive n frames and real n frames.
8. The video motion migration method according to claim 7, wherein the loss function is defined as:
L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V
wherein λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910485182.9A 2019-06-05 2019-06-05 Video motion migration method Active CN110197167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485182.9A CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910485182.9A CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Publications (2)

Publication Number Publication Date
CN110197167A CN110197167A (en) 2019-09-03
CN110197167B true CN110197167B (en) 2021-03-26

Family

ID=67753996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485182.9A Active CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Country Status (1)

Country Link
CN (1) CN110197167B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210386B (en) * 2019-05-31 2022-03-04 北京市商汤科技开发有限公司 Video generation method for action migration and neural network training method and device
CN111489304B (en) * 2020-03-27 2022-04-26 天津大学 Image deblurring method based on attention mechanism
CN111462209B (en) * 2020-03-31 2022-05-24 北京市商汤科技开发有限公司 Action migration method, device, equipment and storage medium
CN111539262B (en) * 2020-04-02 2023-04-18 中山大学 Motion transfer method and system based on single picture
CN112508830B (en) * 2020-11-30 2023-10-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of image processing model
CN112633158A (en) * 2020-12-22 2021-04-09 广东电网有限责任公司电力科学研究院 Power transmission line corridor vehicle identification method, device, equipment and storage medium
CN114760497A (en) * 2021-01-08 2022-07-15 阿里巴巴集团控股有限公司 Video generation method, nonvolatile storage medium, and electronic device
CN113706577A (en) * 2021-04-08 2021-11-26 腾讯科技(深圳)有限公司 Image processing method and device and computer readable storage medium
CN113870314B (en) * 2021-10-18 2023-09-19 南京硅基智能科技有限公司 Training method of action migration model and action migration method
CN113870315B (en) * 2021-10-18 2023-08-25 南京硅基智能科技有限公司 Multi-algorithm integration-based action migration model training method and action migration method
CN115713680B (en) * 2022-11-18 2023-07-25 山东省人工智能研究院 Semantic guidance-based face image identity synthesis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3966392B2 (en) * 1997-09-30 2007-08-29 シャープ株式会社 Image composition communication device
CN108363973B (en) * 2018-02-07 2022-03-25 电子科技大学 Unconstrained 3D expression migration method
CN109951654B (en) * 2019-03-06 2022-02-15 腾讯科技(深圳)有限公司 Video synthesis method, model training method and related device

Also Published As

Publication number Publication date
CN110197167A (en) 2019-09-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant