CN110197167B - Video motion migration method - Google Patents

Video motion migration method

Info

Publication number
CN110197167B
CN110197167B (application CN201910485182.9A)
Authority
CN
China
Prior art keywords
video
foreground
background
target
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910485182.9A
Other languages
Chinese (zh)
Other versions
CN110197167A (en)
Inventor
袁春
成昆
黄浩智
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Tsinghua University
Original Assignee
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Tsinghua University
Priority to CN201910485182.9A
Publication of CN110197167A
Application granted
Publication of CN110197167B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides a video motion migration method, which comprises the following steps: extracting the motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively; receiving an image input of the source video; performing preliminary feature extraction of the foreground and the background; fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; and adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, the content loss function comprises a pixel-level error loss and a perceptual error loss, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss. An overall pipeline model with universality and flexibility is thereby constructed.

Description

Video motion migration method
Technical Field
The invention relates to the technical field of computer vision, in particular to a video motion migration method.
Background
Portrait video generation is a frontier topic with a large number of application scenarios. It can be used to generate training data for higher-level vision tasks, such as human pose estimation, object detection and grouping, person re-identification, and so on. It also helps in developing more powerful directional video editing tools. Existing portrait video generation approaches fall mainly into three categories: unconditional video generation, video frame prediction, and video motion migration.
Unconditional video generation focuses on mapping one-dimensional latent vectors to portrait video; the latent vector must simultaneously encode the appearance and the motion information of the video. After training is completed, different generated videos can be obtained by randomly sampling the latent vector. However, this approach cannot flexibly control the motion and appearance of the generated video.
Video frame prediction work aims to predict future frames from previous frames. This can be seen as a two-stage problem: first predict the motion of future frames from past frames, then predict the complete frames from that motion. The second stage is similar to video motion migration, but existing video frame prediction methods focus on the first stage and give little consideration to how the second stage maintains appearance details and temporal continuity.
The present application focuses on the video motion migration problem and aims to migrate the human motion in a target video onto the human body in a source video while maintaining the appearance of the source person. In this way, the motion of the generated video can be controlled exactly, as long as a target video containing the desired motion sequence is provided. Although many methods attempt to solve the motion migration problem for single-frame images, directly applying them to continuous video gives unsatisfactory results: where the video motion is complex and hard to predict, single-frame motion migration methods introduce severe blurring, aliasing, and other visually unnatural artifacts.
Other recent work narrows the general motion migration problem to migrating arbitrary motions onto a fixed character and scene. Because the problem complexity is reduced, such methods often yield very attractive results; strictly speaking, however, they no longer solve a migration problem: since the target character and scene are fixed, the appearance and background of the generated video need not be transferred from a source video at all, but can be memorized in the network parameters, turning generation into a direct transformation from a motion latent vector to video. Such methods therefore require a separate model to be trained for each source subject, and the foreground character is bound to the background scene, which runs contrary to our goal of flexibility and universality.
The prior art therefore lacks an effective method for extending image motion migration to video.
Disclosure of Invention
The invention provides a video motion migration method for solving the existing problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
A video motion migration method comprises the following steps: S1: extracting the motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively; S2: receiving an image input of the source video; S3: performing preliminary feature extraction of the foreground and the background; S4: fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; S5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss.
In one embodiment of the invention, a 2D pose detection model is adopted to extract the motion sequences of the source video and the target motion video.
In one embodiment of the invention, the image input of the source video comprises inputting K frames of images, the value of K being 4.
In an embodiment of the present invention, in step S3, a single frame migration method is used to select the penultimate layer features of the foreground and background branches for subsequent fusion.
In one embodiment of the invention, the preliminary features of the background and the foreground are respectively fused by a spatio-temporal attention mechanism in step S4; the spatio-temporal attention mechanism comprises: an RB6 structure, in which the backbone network consists of 6 residual modules and the preliminary features are weighted and fused by a channel-dimension SOFTMAX; an SA3D+RB6 structure, in which a three-dimensional self-attention module is added before the RB6 structure to enhance the features; and an RB6+SA2D structure, in which a two-dimensional self-attention module is added after the RB6 structure to enhance the features.
In an embodiment of the present invention, in step S4 the frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
where F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
In one embodiment of the invention, the content loss function is defined as:
L_MSE = || O_t - Õ_t ||_2^2
where L_MSE is the mean square error function, O_t is the frame model of the target video at time t, and Õ_t is the real frame of the target video at time t. The content loss function further includes a perceptual loss defined as:
L_VGG = || φ(O_t) - φ(Õ_t) ||
where φ represents features extracted from the pre-trained VGG19 model.
In one embodiment of the invention, the spatial adversarial loss is defined as:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
where D_I is a single-frame image discrimination network and p_t^T represents the target pose of the target video at time t.
the multi-scale temporal confrontation loss is defined as:
Figure GDA0002112767570000035
wherein the content of the first and second substances,WTis an optical flow sequence calculated by FlowNet2, and comprises optical flow information between each pair of continuous frames; vTIs a target action video; voIs a target video;
Figure GDA0002112767570000036
the device is a time domain discriminator which receives n frames of images and optical flow information thereof as input and learns and discriminates the generated continuous n frames and real n frames.
In one embodiment of the invention, the loss function is defined as: L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V, where λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss. The invention also provides a computer-readable storage medium in which a computer program is stored which, when executed by a processor, carries out the steps of any one of the above methods.
The beneficial effects of the invention are as follows. The video motion migration method provides appearance information through multi-frame input, is guided by a spatio-temporal attention mechanism, and adopts multi-time-scale discriminators for adversarial supervision, forming a universal video motion migration scheme. The pipeline is flexible: elements such as the foreground, the background and the motion are parsed from different videos, and by changing which input video supplies each element, combined videos such as a given person performing an action in another scene can be realized. A new content fusion mechanism is provided, which generates more real and natural foreground and background images based on the spatio-temporal attention mechanism. An end-to-end trained multi-time-scale discriminator is presented to encourage the generator to produce temporally smoother continuous video.
Drawings
Fig. 1 is a schematic diagram of a video motion migration method according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. The connection may be for fixation or for circuit connection.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in an orientation or positional relationship indicated in the drawings for convenience in describing the embodiments of the present invention and to simplify the description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed in a particular orientation, and be in any way limiting of the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Example 1
The problem addressed by the present application is human motion migration in video. V = {I_1, I_2, ..., I_N} denotes an N-frame video in which a single person performs a general movement, such as dancing. Even under the simplifying assumption that the viewpoint (camera) and the background are static, this remains an unsolved and challenging problem. Given a source video V_S and a target motion video V_T, the goal of motion migration is to transfer the motion of V_T onto V_S while maintaining the appearance characteristics of V_S. In this way the generated target video V_O exhibits simultaneous control of motion and appearance. A pre-trained 2D pose detection model is applied to extract the motion sequences of the source video and the target motion video, P = {p_1, p_2, ..., p_N}. Each p_t represents the pose of the t-th frame and is implemented as an M-channel heat map, where M = 14 is the number of keypoints. The source and target pose sequences are denoted P_S and P_T, respectively. It will be appreciated that a more advanced pose extractor may also be employed to improve accuracy and performance, which is not limited herein.
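As an illustration of this pose representation, the following sketch (not taken from the patent; the function name, image size and Gaussian radius sigma are assumptions) rasterises M = 14 detected keypoints into an M-channel heat map:

import numpy as np

def pose_to_heatmap(keypoints, height, width, sigma=6.0):
    """keypoints: array of shape (M, 2) holding (x, y) pixel coordinates.
    Returns an (M, height, width) heat map, one Gaussian channel per keypoint."""
    num_keypoints = keypoints.shape[0]
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((num_keypoints, height, width), dtype=np.float32)
    for m, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:  # keypoint not detected in this frame
            continue
        heatmap[m] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap

# Example: a 14-keypoint pose (M = 14) rasterised onto a 256 x 256 grid.
p_t = pose_to_heatmap(np.random.rand(14, 2) * 256, 256, 256)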
Unlike single-frame motion migration, the method accepts K source frames, their corresponding pose information, and the target pose; in one specific embodiment K = 4. The frame model of the target video may be roughly expressed as:
O_t = G( {I_k^S, p_k^S}_{k=1..K}, p_t^T )
where G denotes the generation pipeline described below.
as shown in fig. 1, a video motion migration method includes the following steps:
s1: extracting action sequences of the source video and the target action video and respectively generating a source gesture and a target gesture;
s2: receiving an image input of the source video;
s3: performing preliminary feature extraction of the foreground and the background; i.e. extracting preliminary features of the foreground and background from the source pose, the target pose and the image input of the source video.
S4: fusing the preliminary features of the background and the foreground respectively to generate a fused feature of the background and a fused feature of the foreground; synthesizing a fusion feature synthesis background through the fusion feature synthesis of the background; synthesizing a fusion characteristic synthesis foreground and a foreground mask through the fusion characteristic synthesis foreground, and further obtaining a frame model of the target video after action migration at the time t;
s5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and a countermeasure loss function, and the countermeasure loss function comprises a spatial countermeasure loss and a multi-scale time domain countermeasure loss.
The overall framework of the method mainly comprises a single-frame migration feature extraction module, a foreground and background feature fusion module, and a final prediction and synthesis module. The foreground and background are separated and re-merged by a predicted mask.
In step S3, the preliminary feature extraction module uses an existing single-frame migration method, and the penultimate-layer features of the foreground and background branches are selected for subsequent processing. Compared with directly fusing the generated foreground and background images, the penultimate-layer features contain richer information, which benefits the training of the fusion module; compared with higher-level features, layers near the output can easily yield an output image through a single final layer. Balancing content richness against ease of use, we therefore select the penultimate-layer features of the foreground and background branches and reserve them for subsequent fusion, enhancement and adversarial training.
In step S4, note that in single-frame pose migration the quality of the synthesized foreground depends heavily on the choice of source video frame. For example, if the source frame is a back view and a front-view pose is to be generated, the result is blurred. In addition, the incompleteness of single-image information causes instability in the synthesis results and aggravates temporal discontinuity in the generated video. This application therefore provides a multi-frame fusion module for refining foreground (or background) synthesis, which fuses the preliminary features of the K source frames to produce higher-quality pre-synthesis features. For each time step t, the preliminary features of the K frames are fed into the fusion module to generate the fused features. On this basis, the prediction module synthesizes the background from the fused background features, and synthesizes the foreground and the foreground mask from the fused foreground features. The network structure of the prediction module is a single-layer 3x3 convolution; the activation function for predicting the foreground and background images is Tanh, and the activation function for the foreground mask is Sigmoid.
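A minimal sketch of this prediction module follows; only the single-layer 3x3 convolutions and the Tanh/Sigmoid activations come from the text, while the class name and the channel count of the fused features (64) are assumptions:

import torch
import torch.nn as nn

class PredictionHeads(nn.Module):
    """Single-layer 3x3 convolution heads applied to the fused features."""
    def __init__(self, in_channels=64):
        super().__init__()
        self.fg_head = nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)
        self.bg_head = nn.Conv2d(in_channels, 3, kernel_size=3, padding=1)
        self.mask_head = nn.Conv2d(in_channels, 1, kernel_size=3, padding=1)

    def forward(self, fused_fg, fused_bg):
        fg = torch.tanh(self.fg_head(fused_fg))          # synthesized foreground (Tanh)
        bg = torch.tanh(self.bg_head(fused_bg))          # synthesized background (Tanh)
        mask = torch.sigmoid(self.mask_head(fused_fg))   # foreground mask (Sigmoid)
        return fg, bg, mask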
Among the possible feature-fusion approaches, the simplest and most intuitive is channel-dimension MAX-POOLING or AVERAGE-POOLING. To exploit the multi-frame information further, this application proposes three variants of a spatio-temporal attention mechanism:
RB6 structure: the backbone network consists of 6 residual modules, and the preliminary features are weighted and fused by a channel-dimension SOFTMAX;
SA3D+RB6 structure: a three-dimensional self-attention module is added before the RB6 structure to enhance the features;
RB6+SA2D structure: a two-dimensional self-attention module is added after the RB6 structure to enhance the features.
Their inputs are the K sets of preliminary features together with the source-pose and target-pose information. The most basic variant, "RB6", consists of 6 residual blocks and computes a K x H x W spatio-temporal attention map. The K groups of preliminary features are then weighted by the attention map to obtain the fused foreground features:
F_fused = Σ_{k=1..K} A_k ⊙ F_k
where F and A denote the preliminary features and the attention map respectively, and ⊙ is element-by-element multiplication.
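The weighted fusion in the equation above can be sketched as follows; the attention backbone itself (the six residual blocks of "RB6") is omitted, and the tensor shapes are assumptions:

import torch

def fuse_features(prelim_feats, attention_logits):
    """prelim_feats: (B, K, C, H, W) preliminary features of the K source frames.
    attention_logits: (B, K, H, W) raw scores produced by the fusion backbone."""
    attn = torch.softmax(attention_logits, dim=1)   # normalise across the K frames
    attn = attn.unsqueeze(2)                        # (B, K, 1, H, W), broadcast over channels
    return (attn * prelim_feats).sum(dim=1)         # element-wise weighting, summed over K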
The drawback of "RB6" is that, although the attention is computed from spatio-temporal information, the final processing is only a spatially local temporal weighting. To alleviate this, two more complex variants, "SA3D+RB6" and "RB6+SA2D", are proposed. Experiments show that the two variants perform similarly, but "RB6+SA2D" runs more efficiently.
The frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
where F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
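A minimal sketch of this compositing step (tensor shapes are assumptions):

import torch

def composite_frame(fg, bg, mask):
    """fg, bg: (B, 3, H, W) synthesized foreground/background; mask: (B, 1, H, W) in [0, 1]."""
    return mask * fg + (1.0 - mask) * bg   # O_t = M_t * F_t + (1 - M_t) * B_t, element-wise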
The loss functions as a whole can be divided into two broad categories: content losses and adversarial losses.
Content loss: in order to realize supervised training, different frames of the same video are used as the source character frames and the target motion frames during the training stage, ensuring that the source frames and the target motion frames do not overlap. After training, for an arbitrary source video an arbitrary target motion video can be selected to provide the target motion sequence. Under supervised training, the generated frame O_t should be as close as possible to the real target frame Õ_t. The simplest and most direct loss function is then the mean square error (MSE loss):
L_MSE = || O_t - Õ_t ||_2^2
where O_t is the frame model of the target video at time t and Õ_t is the real frame of the target video at time t.
However, such a loss function tends to produce blurred results: the generator learns to cover as many possibilities as possible and eventually converges to an averaged, i.e. blurred, solution. To add more detail, a perceptual loss is also used:
L_VGG = || φ(O_t) - φ(Õ_t) ||
where φ denotes features extracted by a pre-trained VGG19 model. In practice we choose the features of the layers {conv1_1, conv2_1, conv3_1, conv4_1}. L_VGG constrains the generated frame and the real frame to be as similar as possible in the feature domain of the pre-trained VGG network, thereby enhancing perceptual similarity.
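The content loss can be sketched as below, assuming a torchvision VGG19 backbone; the layer indices taken to correspond to conv1_1/conv2_1/conv3_1/conv4_1 and the L1 distance on the VGG features are illustrative assumptions:

import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

LAYER_IDS = [0, 5, 10, 19]  # assumed indices of conv1_1, conv2_1, conv3_1, conv4_1

def vgg_features(x):
    feats, h = [], x
    for i, layer in enumerate(vgg):
        h = layer(h)
        if i in LAYER_IDS:
            feats.append(h)
    return feats

def content_loss(generated, target):
    l_mse = F.mse_loss(generated, target)   # pixel-level error
    l_vgg = sum(F.l1_loss(a, b) for a, b in zip(vgg_features(generated),
                                                vgg_features(target)))  # perceptual error
    return l_mse, l_vgg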
Spatial adversarial loss: to encourage each generated frame to contain more realistic detail, a spatial adversarial loss is introduced. A single-frame conditional discriminator is trained to distinguish generated frames from real frames. LSGAN and PatchGAN are used to ensure training stability:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
where D_I is the single-frame image discrimination network and p_t^T represents the target pose of the target video at time t.
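A sketch of this LSGAN-style spatial adversarial loss; D_I is assumed to be a PatchGAN-style conditional discriminator, and feeding it the frame concatenated with the target pose heat map along the channel dimension is an assumption about the conditioning layout:

import torch

def d_image_loss(D_I, real_frame, fake_frame, target_pose):
    """LSGAN discriminator loss: real patches pushed towards 1, generated towards 0."""
    real_score = D_I(torch.cat([real_frame, target_pose], dim=1))
    fake_score = D_I(torch.cat([fake_frame.detach(), target_pose], dim=1))
    return ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()

def g_image_loss(D_I, fake_frame, target_pose):
    """Generator side: make D_I score the generated frame as real (towards 1)."""
    fake_score = D_I(torch.cat([fake_frame, target_pose], dim=1))
    return ((fake_score - 1) ** 2).mean()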
Multi-scale temporal adversarial loss: in addition to the spatial adversarial loss, a multi-scale temporal adversarial loss is introduced to encourage the generated video to be as close as possible to real video in its temporal dynamics. Unlike a time-domain discriminator with a single fixed range, multiple time-domain discriminators are trained to evaluate temporal continuity at different time scales. The multi-scale temporal adversarial loss is defined as:
L_GAN,V = Σ_n ( E[ (D_V^n(V_T, W_T) - 1)^2 ] + E[ D_V^n(V_O, W_O)^2 ] )
where W_T is the optical-flow sequence calculated by FlowNet2, containing the optical flow between each pair of consecutive frames, and W_O is the corresponding optical-flow sequence of the generated video; V_T is the target motion video; V_O is the generated target video; and D_V^n is a time-domain discriminator that receives n frames and their optical-flow information as input and learns to discriminate between the generated consecutive n frames and real n frames.
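A sketch of the discriminator-side part of this loss; the set of time scales, the dictionary of per-scale discriminators and the channel-concatenation of frames with their FlowNet2 flows are assumptions for illustration:

import torch

def temporal_d_loss(video_discriminators, real_clip, fake_clip, real_flow, fake_flow):
    """real_clip/fake_clip: (B, T, 3, H, W) frames; *_flow: (B, T-1, 2, H, W) optical flows.
    video_discriminators: e.g. {3: D_V3, 5: D_V5, 7: D_V7}, one per time scale n."""
    loss = 0.0
    for n, D_Vn in video_discriminators.items():
        real = torch.cat([real_clip[:, :n].flatten(1, 2),
                          real_flow[:, :n - 1].flatten(1, 2)], dim=1)
        fake = torch.cat([fake_clip[:, :n].flatten(1, 2),
                          fake_flow[:, :n - 1].flatten(1, 2)], dim=1)
        # LSGAN: real n-frame clips pushed towards 1, generated n-frame clips towards 0
        loss = loss + ((D_Vn(real) - 1) ** 2).mean() + (D_Vn(fake.detach()) ** 2).mean()
    return loss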
Total loss function: the overall loss is a weighted sum of the above terms:
L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V
where λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss.
The objective problem of the present application can thus be expressed as:
min_G max_{D_I, D_V} L_total
where D_V denotes the set of all video discriminators at different time scales, D_V = {D_V^n}. This objective function is optimized by alternately updating the generator G and the discriminators D.
Example 2
The present application uses PSNR and VFID as evaluation metrics. To compute the VFID, video features are first extracted using a pre-trained video classification model (I3D), and then the mean μ and covariance matrix Σ of these features are computed over all videos in the dataset. Finally the VFID is computed by the formula:
VFID = || μ_1 - μ_2 ||_2^2 + Tr( Σ_1 + Σ_2 - 2(Σ_1 Σ_2)^{1/2} )
where the subscripts 1 and 2 denote the real and generated video sets. VFID measures both visual quality and temporal continuity.
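The VFID computation can be sketched as follows, assuming the I3D features of the real and generated video sets have already been extracted (the feature-extraction step is omitted):

import numpy as np
from scipy import linalg

def vfid(real_feats, fake_feats):
    """real_feats, fake_feats: (num_videos, feature_dim) arrays of I3D features."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real      # matrix square root of (Sigma_1 Sigma_2)
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))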
For migration within the same video, the ground-truth video is available as the target, so PSNR and VFID can be computed directly. For cross-video migration, PSNR cannot be computed since there are no corresponding real frames; meanwhile the reference value of VFID is greatly reduced, because appearance and background strongly influence the features extracted by the I3D network. Quantitative results are therefore provided only for within-video motion migration.
TABLE 1 quantitative results
(Table 1, listing the PSNR and VFID scores of each method on the within-video test set, is not reproduced here.)
The table reports PSNR and VFID scores of the different methods on the within-video test set. Higher PSNR is better, indicating that single-frame quality is closer to the real frames; lower VFID is better, indicating that the overall video quality is closer to the real video. The best two results for each metric are highlighted in bold in the table.
Comparing the first two rows, MSE and MSE+VGG, shows that for the single-frame base model, introducing the VGG loss alongside the MSE loss to measure content loss improves both single-frame quality and video-level temporal continuity.
Comparing "MSE+VGG" with "MSE+VGG+Fusion" shows that the VFID score improves significantly after adding multi-frame fusion, indicating that multi-frame fusion greatly benefits overall video quality.
Comparing "RB6" and "RB6+Dv" shows that both metrics improve to different degrees after introducing the multi-scale time-domain discriminators.
Comparing "RB6+Dv3" with "RB6+Dv" shows that although the PSNR of Dv3 is somewhat better than that of the full version Dv357, it sacrifices overall perceptual quality at the video level.
Comparing the different fusion modes in the next four rows, "Max" shows the best VFID score but the worst PSNR score, meaning that the single-frame quality of max fusion is poor while the VFID criterion is somehow fooled. (The subsequent qualitative experiments confirm that the results of max fusion are not of good quality: it enhances temporal continuity by introducing meaningless details, and the results do not look real.) Of the last two rows, "SA3D+RB6" shows the best PSNR score, and "RB6+SA2D" performs outstandingly on both metrics.
A human scoring test was also performed to compare the "RB6+SA2D" configuration with the single-frame base model. For each configuration, each user watched 5 groups of cross-video motion migration results and 5 groups of within-video migration results. The order of the different experiments was randomly shuffled to ensure fair judgement. For each comparison we asked the user two questions: one on overall video quality and realism ("which video looks more realistic"), and one on temporal continuity ("which video flickers less"). Twenty people aged between 20 and 30 were tested. The average human-evaluation scores are shown in Table 2; our method is significantly better than the current best single-frame model.
Table 2 human score test results
(Table 2, listing the average human-evaluation scores for each configuration, is not reproduced here.)
Example 3
Qualitative experiments were also performed. Two scenarios, motion migration within the same video and motion migration across videos, were tested, corresponding to two different test subsets: i) the cross-video test set, in which the source character/background frames and the target motion video come from different video sequences; ii) the within-video test set, in which the source character/background frames and the target motion video come from the same video sequence. For each set, 50 video pairs were randomly selected from the test set and fixed as the test subset. Note that for the within-video subset it is ensured that the source and target sequences do not intersect or overlap.
In the results generated by the single frame base model, significant blurring and unnaturalness can be observed.
The results of the max-pooling fusion method tend to contain strange colors and shadows in the foreground and background, presumably due to the persistence effect of max fusion. This corroborates the conclusion of the quantitative experiments above: although max fusion improves temporal continuity, it loses the original content and realism of the video.
"RB6+SA2D" and "SA3D+RB6" show the best overall quality. Through multi-frame fusion and enhancement based on the spatio-temporal attention mechanism, the background completion is more accurate while more detail is retained in the foreground.
To explore the multi-frame fusion mechanism more deeply, intermediate results of some "RB6+SA2D" fusion modules were visualized, showing the attention allocated to the different frames, i.e. the output of the "RB6" module. In the single-frame prediction results, clearly incongruous regions can be seen, such as blurred railings in the background; but the blurred regions differ from frame to frame. Our method locates the "comfort zone" of each source frame through attention allocation and guides the synthesis of the foreground and background with more precise details.
All or part of the flow of the method of the embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a processor, to instruct related hardware to implement the steps of the embodiments of the methods. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be considered limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such substitutions and modifications having the same properties or uses are considered to be within the scope of the invention.

Claims (9)

1. A video motion migration method is characterized by comprising the following steps:
S1: extracting motion sequences of a source video and a target motion video and generating a source pose and a target pose respectively;
S2: receiving an image input of the source video;
S3: performing preliminary feature extraction of the foreground and the background;
S4: fusing the preliminary features of the background and the foreground respectively to generate a fused background feature and a fused foreground feature; synthesizing the background from the fused background feature; synthesizing the foreground and a foreground mask from the fused foreground feature, thereby obtaining a frame model of the target video after motion migration at time t; in step S4, the preliminary features of the background and the foreground are respectively fused by a spatio-temporal attention mechanism; the spatio-temporal attention mechanism comprises:
an RB6 structure: the backbone network consists of 6 residual modules, and the preliminary features are weighted and fused by a channel-dimension SOFTMAX;
an SA3D+RB6 structure: a three-dimensional self-attention module is added before the RB6 structure to enhance the features;
an RB6+SA2D structure: a two-dimensional self-attention module is added after the RB6 structure to enhance the features;
S5: adding a loss function to the frame model, wherein the loss function comprises a content loss function and an adversarial loss function, and the adversarial loss function comprises a spatial adversarial loss and a multi-scale temporal adversarial loss.
2. The video motion migration method according to claim 1, wherein a 2D pose detection model is used to extract the motion sequences of the source video and the target motion video.
3. The video motion migration method according to claim 1, wherein the image input of the source video comprises inputting K frames of images, and the value of K is 4.
4. The video motion migration method according to claim 1, wherein in step S3, a single frame migration method is used to select the penultimate features of the foreground and background branches for subsequent fusion.
5. The video motion migration method according to claim 1, wherein in step S4 the frame model of the target video at time t, obtained from the synthesized foreground, the synthesized background and the foreground mask, is:
O_t = M_t ⊙ F_t + (1 - M_t) ⊙ B_t
wherein F_t is the foreground synthesized from the fused features, B_t is the background synthesized from the fused features, M_t is the foreground mask, and ⊙ denotes element-by-element multiplication.
6. The video motion migration method according to claim 1, wherein the content loss function is defined as:
L_MSE = || O_t - Õ_t ||_2^2
wherein L_MSE is the mean square error function, O_t is the frame model of the target video at time t, and Õ_t is the real frame of the target video at time t;
the content loss function further comprises a perceptual loss defined as:
L_VGG = || φ(O_t) - φ(Õ_t) ||
wherein φ represents features extracted from the pre-trained VGG19 model.
7. The video motion migration method according to claim 6, wherein the spatial adversarial loss is defined as:
L_GAN,I = E[ (D_I(Õ_t, p_t^T) - 1)^2 ] + E[ D_I(O_t, p_t^T)^2 ]
wherein D_I is a single-frame image discrimination network and p_t^T represents the target pose of the target video at time t;
the multi-scale temporal adversarial loss is defined as:
L_GAN,V = Σ_n ( E[ (D_V^n(V_T, W_T) - 1)^2 ] + E[ D_V^n(V_O, W_O)^2 ] )
wherein W_T is an optical-flow sequence calculated by FlowNet2, comprising the optical flow between each pair of consecutive frames, and W_O is the corresponding optical-flow sequence of the generated video; V_T is the target motion video; V_O is the generated target video; and D_V^n is a time-domain discriminator that receives n frames and their optical-flow information as input and learns to discriminate between the generated consecutive n frames and real n frames.
8. The video motion migration method according to claim 7, wherein the loss function is defined as:
L_total = L_MSE + λ_VGG·L_VGG + λ_GI·L_GAN,I + λ_GV·L_GAN,V
wherein λ_VGG, λ_GI and λ_GV are the weight coefficients corresponding respectively to the perceptual loss, the spatial adversarial loss and the multi-scale temporal adversarial loss.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN201910485182.9A 2019-06-05 2019-06-05 Video motion migration method Active CN110197167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485182.9A CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910485182.9A CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Publications (2)

Publication Number Publication Date
CN110197167A CN110197167A (en) 2019-09-03
CN110197167B true CN110197167B (en) 2021-03-26

Family

ID=67753996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485182.9A Active CN110197167B (en) 2019-06-05 2019-06-05 Video motion migration method

Country Status (1)

Country Link
CN (1) CN110197167B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210386B (en) * 2019-05-31 2022-03-04 北京市商汤科技开发有限公司 Video generation method for action migration and neural network training method and device
CN111489304B (en) * 2020-03-27 2022-04-26 天津大学 Image deblurring method based on attention mechanism
CN111462209B (en) * 2020-03-31 2022-05-24 北京市商汤科技开发有限公司 Action migration method, device, equipment and storage medium
CN111539262B (en) * 2020-04-02 2023-04-18 中山大学 Motion transfer method and system based on single picture
CN112508830B (en) * 2020-11-30 2023-10-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of image processing model
CN112633158A (en) * 2020-12-22 2021-04-09 广东电网有限责任公司电力科学研究院 Power transmission line corridor vehicle identification method, device, equipment and storage medium
CN114760497A (en) * 2021-01-08 2022-07-15 阿里巴巴集团控股有限公司 Video generation method, nonvolatile storage medium, and electronic device
CN113706577A (en) * 2021-04-08 2021-11-26 腾讯科技(深圳)有限公司 Image processing method and device and computer readable storage medium
CN113870314B (en) * 2021-10-18 2023-09-19 南京硅基智能科技有限公司 Training method of action migration model and action migration method
CN113870315B (en) * 2021-10-18 2023-08-25 南京硅基智能科技有限公司 Multi-algorithm integration-based action migration model training method and action migration method
CN115713680B (en) * 2022-11-18 2023-07-25 山东省人工智能研究院 Semantic guidance-based face image identity synthesis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3966392B2 (en) * 1997-09-30 2007-08-29 シャープ株式会社 Image composition communication device
CN108363973B (en) * 2018-02-07 2022-03-25 电子科技大学 Unconstrained 3D expression migration method
CN109951654B (en) * 2019-03-06 2022-02-15 腾讯科技(深圳)有限公司 Video synthesis method, model training method and related device

Also Published As

Publication number Publication date
CN110197167A (en) 2019-09-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant