CN113269700A - Video generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113269700A
Authority
CN
China
Prior art keywords
sequence
expression
image
target
video
Prior art date
Legal status
Granted
Application number
CN202110472857.3A
Other languages
Chinese (zh)
Other versions
CN113269700B (en)
Inventor
饶强
黄旭为
张国鑫
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110472857.3A
Publication of CN113269700A
Application granted
Publication of CN113269700B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure relates to a video generation method, apparatus, electronic device, and storage medium, the method comprising: acquiring a front-face expressionless image containing a first object and a driving video containing a second object; determining an expression parameter sequence of the second object contained in the driving video relative to the second object contained in a reference video frame of the driving video; determining a foreground image sequence of the first object according to the front-face expressionless image and a target expression action parameter sequence determined from the expression parameter sequence and the initial expression action parameters extracted from the front-face expressionless image; deforming the foreground mask image of the front-face expressionless image based on the target expression action parameter sequence to obtain a target mask image sequence; and fusing the foreground image sequence and the background image based on the target mask image sequence to obtain a target action video. The method and the device can reduce the consumption of computing resources in the video generation process, improve video generation efficiency, and ensure that the generated video has better coherence.

Description

Video generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video generation method and apparatus, an electronic device, and a storage medium.
Background
In the related art, when performing foreground fusion on a face in a video, a mask image (mask) of each frame in the video is generally acquired, and then an instance segmentation algorithm (BlendMask) is used to fuse the foreground image and the background image. BlendMask is a network structure composed of a detection network and a mask branch.
However, calculating the mask of each frame in the video consumes considerable system computing resources and takes a long time, which greatly reduces the efficiency of fusing the foreground image with the background image and thereby reduces the video generation efficiency. In addition, calculating the mask of each frame independently may make the foreground regions incoherent across the sequence, so that the fused video lacks coherence.
Disclosure of Invention
The present disclosure provides a video generation method, an apparatus, an electronic device, and a storage medium, so as to at least solve the problems in the related art that calculating the mask of each frame consumes substantial system computing resources, lowers the video generation efficiency, and yields poor coherence in the fused video. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video generation method, including:
acquiring a front face non-expression image of a first object and a driving video comprising a second object;
determining a reference video frame from the driving video, wherein the difference between the facial expression state of the second object contained in the reference video frame and a front-face non-expression state of the second object is the smallest;
determining an expression parameter sequence of the second object contained in the video frames of the driving video relative to the second object contained in the reference video frame, wherein the expression parameter sequence represents the change of the expression action of the second object contained in the video frames relative to the second object contained in the reference video frame;
determining a target expression and action parameter sequence of the first object based on the expression parameter sequence and initial expression and action parameters extracted from the front face non-expression image;
determining a foreground image sequence of the first object according to the front face non-expression image and the target expression action parameter sequence;
based on the target expression action parameter sequence, deforming the foreground mask image of the front face non-expression image to obtain a target mask image sequence;
and fusing the foreground image sequence and the background image in the front face non-expression image based on the target mask image sequence to obtain a target action video.
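Taken together, the steps of the first aspect form a single per-frame pipeline. The following Python sketch is purely illustrative: every callable passed in (extract_params, generate_foreground, deform_mask, blend) is a hypothetical placeholder standing for a trained first-order motion model, a foreground segmentation model and a fusion step, not an interface defined by this disclosure, and the parameters are assumed to be NumPy-style arrays.

```python
def generate_target_video(neutral_image, driving_frames, reference_frame,
                          foreground_mask, extract_params,
                          generate_foreground, deform_mask, blend, s=1.0):
    """Illustrative flow only; see the detailed description for each step."""
    kp_ref, jac_ref = extract_params(reference_frame)   # parameters of the second object, reference frame
    kp_q, jac_q = extract_params(neutral_image)         # initial parameters of the first object

    fused_frames = []
    for frame in driving_frames:                         # frames kept in time order
        kp_p, jac_p = extract_params(frame)
        delta_kp = kp_p - kp_ref                         # expression parameter: key point change
        delta_jac = jac_p / jac_ref                      # expression parameter: motion state change (ratio)
        kp_t = kp_q + s * delta_kp                       # target expression action parameters
        jac_t = delta_jac * jac_q
        fg = generate_foreground(neutral_image, kp_t, jac_t)   # foreground image of the first object
        mask = deform_mask(foreground_mask, kp_t, jac_t)       # deformed target mask image
        fused_frames.append(blend(fg, neutral_image, mask))    # fuse with the background
    return fused_frames                                  # to be spliced into the target action video
```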
In an exemplary embodiment, the deforming the foreground mask image of the front-face blankness image based on the target expression and motion parameter sequence to obtain a target mask image sequence includes:
inputting the target expression and action parameter sequence and the foreground mask image into a first-order motion model, and performing matrix conversion processing on each target expression and action parameter in the target expression and action parameter sequence to obtain a grid matrix sequence corresponding to the target expression and action parameter sequence, wherein each grid matrix in the grid matrix sequence is used for representing the position information of a key point in the corresponding target expression and action parameter;
according to each grid matrix in the grid matrix sequence, moving the position information of each pixel point in the foreground mask image to obtain the position information sequence of each pixel point after moving;
and obtaining the target mask image sequence according to the moved position information sequence of each pixel point.
In an exemplary embodiment, the determining the sequence of expression parameters of the second object included in the video frame of the driving video relative to the sequence of expression parameters of the second object included in the reference video frame includes:
inputting the reference video frame into a first-order motion model to extract expression and action parameters, and obtaining first key point position information and first-order motion information of a second object contained in the reference video frame;
inputting each video frame in the driving video into the first-order motion model to extract expression action parameters, and obtaining a second key point position information sequence and a second first-order motion information sequence of a second object contained in the driving video;
calculating the difference between each second key point position information in the second key point position information sequence and the first key point position information to obtain a key point position change information sequence of a second object contained in the driving video;
calculating the ratio between each piece of second first-order motion information in the second first-order motion information sequence and the first first-order motion information to obtain a motion state change information sequence of a second object contained in the driving video;
and obtaining the expression parameter sequence according to the key point position change information sequence and the motion state change information sequence.
In an exemplary embodiment, the determining the target expression parameter sequence of the first object based on the expression parameter sequence and the initial expression parameters extracted from the front face blankness image includes:
obtaining a target key point position information sequence of the first object according to the initial key point position information and the key point position change information sequence;
obtaining a target first-order motion information sequence of the first object according to the initial first-order motion information and the motion state change information sequence;
and obtaining the target expression action parameter sequence according to the target key point position information sequence and the target first-order motion information sequence.
In an exemplary embodiment, the determining a foreground image sequence of the first object based on the front face blankness image and the target expressive action parameter sequence includes:
inputting the front face non-expression image and the target expression action parameter sequence into a first-order motion model, and transferring each target expression action parameter in the target expression action parameter sequence to the front face non-expression image to obtain an expression transferred image sequence;
and performing foreground segmentation processing on each image in the image sequence after the expression migration to obtain a foreground image sequence of the first object.
In an exemplary embodiment, the fusing the foreground image sequence and the background image in the front-face expressionless image based on the target mask image sequence to obtain a target motion video includes:
taking the pixel value of a pixel point in each target mask image in the target mask image sequence as the fusion weight of the corresponding foreground image in the foreground image sequence;
obtaining a background image fusion weight corresponding to the fusion weight of each foreground image according to the fusion weight of each foreground image;
fusing each foreground image with the background image based on the fusion weight of each foreground image and the background image fusion weight corresponding to the fusion weight of each foreground image to obtain a target action video frame corresponding to each foreground image;
and splicing the target action video frames corresponding to each foreground image to obtain the target action video.
According to a second aspect of the embodiments of the present disclosure, there is provided a video generating apparatus including:
an image video acquisition module configured to perform acquisition of a front-face blankness image of a first object and a drive video including a second object;
a reference video frame determination module configured to perform determination of a reference video frame from the driving video, where a difference between a facial expression state of a second object contained in the reference video frame and a front-face non-expression state of the second object is minimal;
a first sequence determination module configured to perform determining an expression parameter sequence of the second object contained in the video frames of the driving video relative to the second object contained in the reference video frame, wherein the expression parameter sequence represents the change of the expression action of the second object contained in the video frames relative to the second object contained in the reference video frame;
a second sequence determination module configured to perform determining a target expression and action parameter sequence of the first object based on the expression parameter sequence and the initial expression and action parameters extracted from the front face non-expression image;
a third sequence determination module configured to perform determining a foreground image sequence of the first object according to the front face non-expressive image and the target expression action parameter sequence;
the deformation module is configured to perform deformation on a foreground mask image of the front face non-expression image based on the target expression action parameter sequence to obtain a target mask image sequence;
and the fusion module is configured to perform fusion on the foreground image sequence and the background image in the front face expressionless image based on the target mask image sequence to obtain a target action video.
In an exemplary embodiment, the deformation module includes:
the matrix conversion unit is configured to input the target expression and action parameter sequence and the foreground mask image into a first-order motion model, perform matrix conversion processing on each target expression and action parameter in the target expression and action parameter sequence to obtain a grid matrix sequence corresponding to the target expression and action parameter sequence, wherein each grid matrix in the grid matrix sequence is used for representing the position information of a key point in the corresponding target expression and action parameter;
the moving unit is configured to move the position information of each pixel point in the foreground mask image according to each grid matrix in the grid matrix sequence to obtain a moved position information sequence of each pixel point;
and the target mask image sequence determining unit is configured to obtain the target mask image sequence according to the moved position information sequence of each pixel point.
In an exemplary embodiment, the first sequence determining module includes:
a first parameter extraction unit, configured to perform expression and motion parameter extraction by inputting the reference video frame into a first-order motion model, so as to obtain first key point position information and first-order motion information of a second object included in the reference video frame;
the second parameter extraction unit is configured to input each video frame in the driving video into the first-order motion model to extract expression and action parameters, so that a second key point position information sequence and a second first-order motion information sequence of a second object contained in the driving video are obtained;
a difference value calculating unit configured to perform calculation of a difference value between each piece of second keypoint position information in the second keypoint position information sequence and the first keypoint position information to obtain a keypoint position change information sequence of a second object included in the driving video;
a ratio calculation unit configured to perform calculation of a ratio between each piece of second first-order motion information in the second first-order motion information sequence and the first first-order motion information to obtain a motion state change information sequence of a second object included in the driving video;
and the expression parameter sequence acquisition unit is configured to obtain the expression parameter sequence according to the key point position change information sequence and the motion state change information sequence.
In an exemplary embodiment, the initial expression action parameters include initial key point position information and initial first-order motion information, and the second sequence determination module includes:
a key point position information sequence obtaining unit configured to obtain a target key point position information sequence of the first object according to the initial key point position information and the key point position change information sequence;
a first-order motion information sequence obtaining unit configured to obtain a target first-order motion information sequence of the first object according to the initial first-order motion information and the motion state change information sequence;
and the target expression action parameter sequence acquisition unit is configured to obtain the target expression action parameter sequence according to the target key point position information sequence and the target first-order motion information sequence.
In an exemplary embodiment, the third sequence determining module includes:
the migration unit is configured to input the front face expressionless image and the target expression action parameter sequence into a first-order motion model, and migrate each target expression action parameter in the target expression action parameter sequence to the front face expressionless image to obtain an expression-migrated image sequence;
and the foreground segmentation unit is configured to perform foreground segmentation processing on each image in the image sequence after the expression migration to obtain a foreground image sequence of the first object.
In an exemplary embodiment, the fusion module includes:
a first weight determination unit configured to perform, as a fusion weight of a corresponding foreground image in the foreground image sequence, a pixel value of a pixel point in each target mask image in the target mask image sequence;
the second weight determining unit is configured to obtain, according to the fusion weight of each foreground image, a background image fusion weight corresponding to the fusion weight of each foreground image;
the fusion unit is configured to fuse each foreground image with the background image based on the fusion weight of each foreground image and the corresponding background image fusion weight, so as to obtain a target action video frame corresponding to each foreground image;
and the splicing unit is configured to splice the target action video frames corresponding to each foreground image to obtain the target action video.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video generation method according to any of the above embodiments.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, wherein instructions, when executed by a processor of an electronic device, cause the electronic device to perform the video generation method according to any one of the above embodiments.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program, which when executed by a processor implements the video generation method of any of the above embodiments.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
the method comprises the steps of determining a target expression and action parameter sequence of a first object according to a second object contained in a drive video, relative to an expression parameter sequence of the second object contained in a reference video frame and an initial expression and action parameter extracted from a front-face expressionless image, deforming a foreground mask image extracted from the front-face expressionless image according to the target expression and action parameter sequence to obtain a target mask image sequence, and finally fusing the foreground image sequence determined by the front-face expressionless image and the target expression and action parameter sequence into a background image of the front-face expressionless image according to the target mask image sequence to obtain a target action video of the first object simulating the expression and action of the second object. The foreground mask image can be directly deformed according to the target expression and action parameter sequence determined by the expression parameter sequence to obtain a target mask image sequence fused with the subsequent action video, namely only the foreground mask image of one frame (namely the front face non-expression image) needs to be calculated, and the subsequent foreground mask image can be obtained by directly deforming the foreground mask image of the front face non-expression image through the target expression and action parameter sequence, so that the consumption of a video generation process on system computing resources is reduced, the computing time consumption is short, the fusion efficiency of the foreground image and the background image can be improved, and the generation efficiency of the target action video is improved. In addition, the continuity of each target mask image in the target mask image sequence obtained through deformation is good, and the fusion is facilitated to obtain a more coherent target action video.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a diagram illustrating an application environment for a video generation method according to an exemplary embodiment.
Fig. 2 is a flow diagram illustrating a video generation method according to an example embodiment.
Fig. 3 is a schematic flow diagram illustrating a process for determining a sequence of expression parameters according to an exemplary embodiment.
Fig. 4 is a schematic flow diagram illustrating a process for determining a target expression action parameter sequence of a first object according to an exemplary embodiment.
Fig. 5 is a schematic flow chart illustrating a process of determining a foreground image sequence of a first object according to an exemplary embodiment.
Fig. 6 is a flowchart illustrating a process of transforming a foreground mask image of a blankness image of a face to obtain a sequence of target mask images according to an exemplary embodiment.
Fig. 7 is a schematic flowchart illustrating a process of fusing a foreground image sequence and a background image to obtain a target motion video according to an exemplary embodiment.
Fig. 8 is a block diagram illustrating a video generation apparatus according to an example embodiment.
FIG. 9 is a block diagram illustrating an electronic device for video generation in accordance with an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Referring to fig. 1, fig. 1 is a diagram illustrating an application environment of a video generation method according to an exemplary embodiment, where the application environment may include a terminal 01 and a server 02.
The terminal 01 may be configured to collect a front-face blankness image including a first object and a driving video including a second object, and send the collected front-face blankness image and the driving video to the server 02. Optionally, the terminal may include terminal devices such as a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, and a smart wearable device, and may also include an independently operating server, or a distributed server, or a server cluster including a plurality of servers. Illustratively, the terminal 01 may take a single picture of the first subject, select a front-face blankness picture as a front-face blankness image, and simultaneously capture a driving video including the second subject from a preset driving video. Then, the terminal 01 sends the front face non-expression image and the driving video containing the second object to the server 02, and obtains and displays the target action video, returned by the server 02, of the first object simulating the expression action of the second object.
The server 02 may be used to provide background services for the terminal 01. Illustratively, the server 02 may process the front-face blankness image and the driving video including the second object transmitted by the terminal 01 to obtain a target motion video of the first object simulating the expressive motion of the second object, and return the target motion video to the terminal 01. Optionally, the server 02 may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 2 is a flowchart illustrating a video generation method according to an exemplary embodiment, which is illustrated in fig. 2 and used in the server 02 illustrated in fig. 1, and includes the following steps.
In step S11, a front-face blankness image including a first object and a drive video including a second object are acquired.
In an optional embodiment, the front-face blankness image including the first object may be selected from a preset front-face blankness image material library, or may be an image captured by a user corresponding to the client.
Illustratively, the first object may be a face of a person q, and the front-face blankness image may be a front-face blankness image of the person q.
In an alternative embodiment, the driving video containing the second object may be selected from preset driving videos.
For example, the second object may be a face of a person p, and the driving video may be a video including the face of the person p.
In step S12, a reference video frame is determined from the driving video, and the difference between the facial expression state of the second object contained in the reference video frame and the front-face non-expression state of the second object is minimized.
In the embodiment of the disclosure, one frame may be selected from the video frames of the driving video as the reference video frame.
Optionally, if the facial expression state of the second object included in a certain video frame in the driving video is a front-face non-expression state of the second object, the certain video frame is taken as the reference video frame. And if the facial expression state of the second object contained in any video frame in the driving video is not the front-face non-expression state of the second object, taking the video frame with the minimum difference between the contained facial expression state of the second object and the front-face non-expression state of the second object as the reference video frame.
Illustratively, the second object is a face of a person p, and the reference video frame may be a frontal expressionless image of the person p. When the front-face blankness image of p is not included in the drive video, the video frame in which the difference between the facial expression state and the front-face blankness state of p is smallest may be taken as the reference video frame.
In the embodiment of the disclosure, the front-face non-expression image of the first object is selected, and the reference video frame is selected from the video frames of the driving video, so that the accuracy of the subsequent determination of the expression parameter sequence and the target expression and action parameter sequence can be improved, the accuracy of the subsequent fusion of the foreground image sequence and the background image can be improved, and the accuracy of the subsequently generated target action video of the first object simulating the expression and action of the second object can be improved.
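The disclosure does not fix a particular distance measure for "smallest difference". As one hedged possibility, the reference frame can be chosen by comparing the facial key points of every driving frame against a neutral key point layout, as in the NumPy sketch below; the neutral layout and the key point detector that produces frame_keypoints are assumed inputs, not components defined by this disclosure.

```python
import numpy as np

def pick_reference_frame(frame_keypoints, neutral_keypoints):
    """Pick the driving-video frame whose expression state is closest to neutral.

    frame_keypoints:   (N, K, 2) facial key points of the second object, one row per frame
    neutral_keypoints: (K, 2)    key points of a front-face expressionless layout
    """
    diffs = frame_keypoints - neutral_keypoints[None, :, :]
    per_frame_distance = np.linalg.norm(diffs, axis=-1).mean(axis=1)  # mean key point distance per frame
    return int(np.argmin(per_frame_distance))                         # index of the reference frame
```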
In step S13, an expression parameter sequence of the second object contained in the video frames of the driving video relative to the second object contained in the reference video frame is determined, where the expression parameter sequence represents the change of the expression action of the second object contained in the video frames relative to the second object contained in the reference video frame.
In an alternative embodiment, as shown in fig. 3, fig. 3 is a flowchart illustrating a process of determining an expression parameter sequence of a second object included in the driving video relative to an expression parameter sequence of a second object included in the reference video frame according to an exemplary embodiment. Accordingly, the following steps may be included:
in step S131, the reference video frame is input into a first-order motion model to extract expression and motion parameters, so as to obtain first key point position information and first-order motion information of the second object included in the reference video frame.
In step S132, each video frame in the driving video is input into the first-order motion model to extract expression and motion parameters, so as to obtain a second key point position information sequence and a second first-order motion information sequence of a second object included in the driving video.
In step S133, a difference between each second keypoint location information in the second keypoint location information sequence and the first keypoint location information is calculated to obtain a keypoint location change information sequence of a second object included in the driving video.
In step S134, a ratio between each piece of second first-order motion information in the second first-order motion information sequence and the first first-order motion information is calculated to obtain a motion state change information sequence of the second object included in the driving video.
In step S135, the expression parameter sequence is obtained according to the key point position change information sequence and the motion state change information sequence.
Illustratively, the second object is the face of a person p. The above-mentioned "expression action change" may be a change in the position of the facial features of the person p.
In an alternative embodiment, in the above steps S131 and S132, the first key point position information kp_p^(ref) and the first first-order motion information jacob_p^(ref) of the second object contained in the reference video frame, as well as the second key point position information kp_p^(i) and the second first-order motion information jacob_p^(i) of the second object contained in each video frame i, may be extracted through a First Order Motion Model (FOMM). Meanwhile, the second key point position information kp_p^(i) of the second object contained in each video frame is sorted according to the time sequence of the video frames in the driving video to obtain the second key point position information sequence, and the second first-order motion information jacob_p^(i) of the second object contained in each video frame is sorted according to the same time sequence to obtain the second first-order motion information sequence.
Optionally, the FOMM may be a facial expression simulation model trained based on face videos and a preset FOMM algorithm. Accordingly, the first key point position information kp_p^(ref) and the second key point position information kp_p^(i) are coordinate points detected on the human face (which may include the left eyebrow, right eyebrow, left eye, right eye, nose and mouth), and the first first-order motion information jacob_p^(ref) and the second first-order motion information jacob_p^(i) may be Jacobian matrices based on the key points on the face image and the variation of the motion around them, i.e., matrices describing the gradient of coordinate variation.
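As a sketch of the data produced by steps S131 and S132, the snippet below assumes a hypothetical wrapper fomm_extract(image) around a trained first-order motion model that returns K key point coordinates and one 2x2 Jacobian per key point; the wrapper name and array shapes are illustrative assumptions, not the model's actual interface.

```python
import numpy as np

def extract_driving_parameters(reference_frame, driving_frames, fomm_extract):
    """Collect kp_p^(ref), jacob_p^(ref) and the time-ordered sequences kp_p^(i), jacob_p^(i).

    fomm_extract(image) -> (kp, jac) with kp of shape (K, 2) and jac of shape (K, 2, 2).
    """
    kp_ref, jac_ref = fomm_extract(reference_frame)
    kp_seq, jac_seq = [], []
    for frame in driving_frames:              # driving frames already in time order
        kp_i, jac_i = fomm_extract(frame)
        kp_seq.append(kp_i)
        jac_seq.append(jac_i)
    return kp_ref, jac_ref, np.stack(kp_seq), np.stack(jac_seq)
```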
In an alternative embodiment, in the above step S133, the difference between each piece of second key point position information kp_p^(i) and the first key point position information kp_p^(ref) may be calculated by the following formula to obtain the key point position change information Δkp_p^(i) of the second object contained in each video frame:

Δkp_p^(i) = kp_p^(i) − kp_p^(ref)
After obtaining the key point position change information of the second object included in each video frame, the key point position change information of the second object included in each video frame may be sorted according to the time sequence, so as to obtain the key point position change information sequence of the second object included in the driving video.
In an alternative embodiment, in the above step S134, the ratio between each piece of second first-order motion information jacob_p^(i) and the first first-order motion information jacob_p^(ref) may be calculated by the following formula to obtain the motion state change information Δjacob_p^(i) of the second object contained in each video frame:

Δjacob_p^(i) = jacob_p^(i) / jacob_p^(ref)
After obtaining the motion state change information of the second object included in each video frame, the motion state change information of the second object included in each video frame may be sorted according to the time sequence, so as to obtain the motion state change information sequence of the second object included in the driving video.
In an optional embodiment, in step S135, each piece of keypoint location change information in the sequence of keypoint location change information may be combined with corresponding motion state change information in the sequence of motion state change information, so as to obtain the expression parameter sequence, where each expression parameter in the sequence of expression parameters includes the keypoint location change information and the corresponding motion state change information.
Assuming that the driving video includes 10 video frames, the Δkp_p and Δjacob_p of the second object contained in each of the 10 video frames can be calculated according to the above formulas, and the Δkp_p and Δjacob_p of the second object contained in the 10 video frames are then sorted according to the time sequence of the 10 video frames in the driving video, so as to obtain the expression parameter sequence.
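A minimal NumPy sketch of steps S133 to S135, assuming the arrays produced above; the disclosure does not state whether the "ratio" of Jacobians is element-wise or a matrix operation, so an element-wise ratio is assumed here.

```python
import numpy as np

def expression_parameter_sequence(kp_seq, jac_seq, kp_ref, jac_ref):
    """Expression parameters of the second object relative to the reference frame.

    kp_seq:  (N, K, 2)    kp_p^(i)       jac_seq: (N, K, 2, 2) jacob_p^(i)
    kp_ref:  (K, 2)       kp_p^(ref)     jac_ref: (K, 2, 2)    jacob_p^(ref)
    """
    delta_kp = kp_seq - kp_ref[None]       # Δkp_p^(i)    = kp_p^(i) - kp_p^(ref)
    delta_jac = jac_seq / jac_ref[None]    # Δjacob_p^(i) = jacob_p^(i) / jacob_p^(ref), element-wise (assumed)
    # Each expression parameter pairs the two change terms, kept in time order.
    return list(zip(delta_kp, delta_jac))
```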
In the embodiment of the disclosure, because the first-order motion model is a model simulated by facial expressions obtained based on a facial video and a preset FOMM algorithm, the first-order motion model outputs the first key point position information and the first-order motion information of the second object contained in the reference video frame, and the second key point position information and the second first-order motion information of the second object contained in each video frame, so that the accuracy and efficiency of determining the key point position information and the first-order motion information can be improved, the accuracy and efficiency of determining the key point position change information and the motion state change information corresponding to each video frame can be improved, and the efficiency of subsequently generating a target motion video in which the first object simulates the expression motions of the second object and the fidelity of the first object simulating the expression motions of the second object can be improved.
In step S14, a target expression parameter sequence of the first object is determined based on the expression parameter sequence and the initial expression parameters extracted from the front-face blankness image.
In an alternative embodiment, if the initial expression parameters include initial key point position information and initial first-order motion information, as shown in fig. 4, fig. 4 is a flowchart illustrating a process of determining a target expression parameter sequence of the first object based on the expression parameter sequence and the initial expression parameters extracted from the front-face blankness image according to an exemplary embodiment. Accordingly, the following steps may be included:
in step S141, a target keypoint location information sequence of the first object is obtained based on the initial keypoint location information and the keypoint location change information sequence.
In step S142, a target first-order motion information sequence of the first object is obtained according to the initial first-order motion information and the motion state change information sequence.
In step S143, the target expression and motion parameter sequence is obtained according to the target key point position information sequence and the target first-order motion information sequence.
In an alternative embodiment, when the first object is the face of a person q, the front-face expressionless image may be input into the first-order motion model for expression action parameter extraction, so as to obtain the initial key point position information kp_q and the initial first-order motion information jacob_q of the first object. Exemplarily, kp_q may be coordinate points detected on the face of the person q, and jacob_q may be a Jacobian matrix based on the detected key points on the face of the person q and the variation of the motion around the key points, i.e., a matrix describing the gradient of coordinate variation.
In the embodiment of the disclosure, the key point position change information and the motion state change information of the second object (e.g., the face of the person p) contained in the video frame relative to the second object contained in the reference video frame may be migrated to the first object (e.g., the face of the person q) contained in the front-face expressionless image, so as to obtain the target expression action parameters of the first object, where the target expression action parameters may include the target key point position information kp_q^(i) and the target first-order motion information jacob_q^(i). Finally, each piece of target key point position information and target first-order motion information is sorted according to the time sequence to obtain the target expression action parameter sequence.
In an alternative embodiment, in step S141, the target key point position information kp_q^(i) corresponding to each piece of key point position change information may be calculated by the following formula:

kp_q^(i) = kp_q + s · Δkp_p^(i)
Where s may be a scale factor determined by the ratio of the size of the second object contained in the video frame to the second object contained in the reference video frame. For example, when the second object is the face of a person p, s can be determined by the ratio of the size of the face of p contained in the video frame to the size of the face of p contained in the reference video frame.
After the target key point position information corresponding to each key point position change information is obtained, the target key point position information can be sequenced according to the time sequence to obtain the target key point position information sequence.
In an alternative embodiment, in step S142, the target first-order motion information jacob_q^(i) corresponding to each piece of motion state change information may be calculated by the following formula:

jacob_q^(i) = Δjacob_p^(i) · jacob_q

After the target first-order motion information corresponding to each piece of motion state change information is obtained, the target first-order motion information may be sorted according to the time sequence to obtain the target first-order motion information sequence.
In an optional embodiment, in step S143, each piece of target key point position information in the target key point position information sequence and corresponding target first-order motion information in the target first-order motion information sequence may be combined, so as to obtain the target expression action parameter sequence.
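Following the two formulas above, a short sketch of steps S141 to S143; the multiplication of Δjacob with the initial Jacobian is again assumed to be element-wise, matching the earlier sketch of the expression parameters.

```python
def target_expression_parameters(kp_q, jac_q, expression_params, s=1.0):
    """Transfer the expression parameters of person p onto person q.

    kp_q:              (K, 2)    initial key point positions of the front-face expressionless image
    jac_q:             (K, 2, 2) initial first-order motion information
    expression_params: list of (Δkp_p^(i), Δjacob_p^(i)) pairs in time order
    s:                 scale factor, e.g. the face-size ratio between driving frame and reference frame
    """
    targets = []
    for delta_kp, delta_jac in expression_params:
        kp_t = kp_q + s * delta_kp          # kp_q^(i)    = kp_q + s · Δkp_p^(i)
        jac_t = delta_jac * jac_q           # jacob_q^(i) = Δjacob_p^(i) · jacob_q
        targets.append((kp_t, jac_t))
    return targets
```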
In the embodiment of the disclosure, the initial key point position information and the initial first-order motion information of the first object contained in the front face non-expressive image are output through the first-order motion model, so that the determination accuracy and efficiency of the target key point position information and the target motion state information can be improved, the accuracy of transferring the expression action of the second object to the first object is improved, and further, the accuracy of subsequently generating a target action video of the first object simulating the expression action of the second object and the fidelity of the first object simulating the expression action of the second object are improved.
In step S15, a foreground image sequence of the first object is determined based on the front-face blankness image and the target expression motion parameter sequence.
In an alternative embodiment, as shown in fig. 5, fig. 5 is a flowchart illustrating a process of determining a foreground image sequence of the first object according to the front-face blankness image and the target expression motion parameter sequence according to an exemplary embodiment. Accordingly, the following steps may be included:
in step S151, the front-face blankness image and the target expression motion parameter sequence are input into a first-order motion model, and each target expression motion parameter in the target expression motion parameter sequence is migrated to the front-face blankness image, so as to obtain an image sequence after expression migration.
In step S152, foreground segmentation processing is performed on each image in the image sequence after the expression transition, so as to obtain a foreground image sequence of the first object.
In the embodiment of the present disclosure, for each target expression and action parameter in the target expression and action parameter sequence, the target expression and action parameter may be transferred to the front-face non-expression image, so as to obtain an expression-transferred image sequence, and then foreground segmentation processing is performed on each image in the expression-transferred image sequence, so as to obtain a foreground image sequence of the first object.
Exemplarily, a foreground segmentation algorithm may be used to perform foreground segmentation on the image after the target expression migration, where the foreground segmentation algorithm includes but is not limited to: pixel-based methods, edge-based methods, region-based methods, and the like.
In an alternative embodiment, in the above-mentioned steps S151 to S152, the foreground image may be calculated by the following formula:

I_fg^(i) = M1(I_q, kp_q^(i), jacob_q^(i))

where M1 denotes the manner of generation of the foreground image; this embodiment does not limit the specific content of M1, and the function of M1 can be realized by a trained first-order motion model. I_q is the front-face expressionless image containing the first object.
According to the embodiment of the invention, the first-order motion model is fully utilized to transfer each target expression and action parameter to the front-face non-expression image, so that the accuracy and efficiency of expression and action transfer are improved, and the efficiency and the fidelity of foreground image generation are improved.
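Steps S151 to S152 can be sketched as below; fomm_generate and segment_foreground are hypothetical callables standing for the trained first-order motion model's generator and a pre-trained foreground segmentation model, not interfaces defined by this disclosure.

```python
def foreground_image_sequence(neutral_image, target_params,
                              fomm_generate, segment_foreground):
    """Foreground image sequence of the first object.

    fomm_generate(img, kp, jac) -> image of the first object wearing the target expression
    segment_foreground(img)     -> foreground image with the background removed
    """
    foregrounds = []
    for kp_t, jac_t in target_params:                        # kept in time order
        migrated = fomm_generate(neutral_image, kp_t, jac_t) # expression migration
        foregrounds.append(segment_foreground(migrated))     # foreground segmentation
    return foregrounds
```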
In step S16, the foreground mask image of the front face non-expression image is transformed based on the target expression motion parameter sequence to obtain a target mask image sequence.
In an alternative embodiment, the foreground mask image of the front-face blankness image may be obtained by:
and inputting the front face non-expression image into a foreground segmentation model for foreground segmentation processing to obtain the foreground mask image.
In the embodiment, the foreground segmentation processing can be performed on the front-face expressionless image through the pre-trained existing foreground segmentation model to obtain the foreground mask image, so that the accuracy of determining the foreground mask image can be improved, the utilization rate of the existing foreground segmentation model can be fully improved, and the acquisition cost of the foreground mask image is reduced.
In an alternative embodiment, as shown in fig. 6, fig. 6 is a schematic flow chart illustrating a process of transforming a foreground mask image of the front-face blankness image based on the target expression motion parameter sequence to obtain a target mask image sequence according to an exemplary embodiment. Accordingly, the following steps may be included:
in step S161, the target expression and action parameter sequence and the foreground mask image are input into a first-order motion model, and matrix conversion processing is performed on each target expression and action parameter in the target expression and action parameter sequence to obtain a grid matrix sequence corresponding to the target expression and action parameter sequence, where each grid matrix in the grid matrix sequence is used to represent the key point position information in the corresponding target expression and action parameter.
In step S162, according to each grid matrix in the grid matrix sequence, the position information of each pixel in the foreground mask image is moved to obtain a position information sequence of each pixel after the movement.
In step S163, the target mask image sequence is obtained based on the position information sequence of each pixel after the movement.
In an alternative embodiment, in the above steps S161 to S162, the deformation module in the first-order motion model may perform deformation processing on the foreground mask image on the basis of each target expression action parameter in the target expression action parameter sequence (including the target key point position information kp_q^(i) and the target first-order motion information jacob_q^(i)), so as to obtain the target mask image corresponding to each target expression action parameter. A specific calculation formula may be as follows:

mask_q^(i) = Deformation(mask_q, kp_q^(i), jacob_q^(i))

where mask_q^(i) is the target mask image, mask_q is the foreground mask image, and Deformation denotes the deformation processing.
For example, in step S161, after the target expression and action parameter sequence and the foreground mask image are input into the first-order motion model, a deformation module in the first-order motion model may obtain a coordinate vector of the target key point position information of each target expression and action parameter, and then perform matrix conversion processing on each coordinate vector to obtain a grid matrix (i.e., grid matrix) corresponding to each target expression and action parameter. The grid matrix is obtained by converting the coordinate vector of the target key point position information in the target expression action parameters, so that the grid matrix can be used for representing the target key point position information in the target expression action parameters.
Optionally, the matrix transformation process may be performed on each target expression and action parameter through preset computing software, where the preset computing software includes but is not limited to: commercial math software (e.g., matlab), open-source scientific computing libraries (e.g., NumPy), and the like.
For example, in step S162, each grid matrix may be multiplied by the position information of each pixel point in the foreground mask image to move the position information of each pixel point in the foreground mask image, so as to obtain the position information of each pixel point after moving corresponding to each grid matrix. And then sequencing the position information of each moved pixel point corresponding to each grid matrix to obtain the position information sequence of each moved pixel point.
For example, in step S163, after the position information of each moved pixel point is determined, the target mask image corresponding to each target expression and action parameter may be generated according to the position information of each moved pixel point and the pixel value of each pixel point. And then sequencing the target mask images corresponding to each target expression action parameter according to the time sequence to obtain the target mask image sequence.
Optionally, according to the position information of each moved pixel and each pixel itself, the process of generating the target mask image corresponding to each target expression action parameter may be as follows:
creating a drawing board and setting the elements in the drawing board to 0 so that the drawing board is blank; and then assigning values to the corresponding elements in the drawing board according to the moved position information of each pixel point and the pixel value of each pixel point, thereby generating the target mask image.
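The following NumPy sketch imitates steps S161 to S163 in a deliberately simplified way: instead of the first-order motion model's deformation module, every pixel of the foreground mask is moved by the displacement of its nearest key point and scattered onto a zero-initialized drawing board. The nearest-key-point rule is an assumption made only for this sketch, not the deformation defined by the disclosure.

```python
import numpy as np

def deform_mask(mask, kp_neutral, kp_target):
    """Warp the foreground mask of the expressionless image toward kp_q^(i).

    mask:       (H, W) float array in [0, 1]
    kp_neutral: (K, 2) key points of the expressionless image, as (row, col) pixel coordinates
    kp_target:  (K, 2) target key points kp_q^(i), same convention
    """
    h, w = mask.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pixels = np.stack([rows, cols], axis=-1).reshape(-1, 2)            # position of every pixel

    displacement = kp_target - kp_neutral                              # per-key-point displacement
    dists = np.linalg.norm(pixels[:, None, :] - kp_neutral[None], axis=-1)
    nearest = np.argmin(dists, axis=1)                                 # nearest key point per pixel

    moved = np.rint(pixels + displacement[nearest]).astype(int)        # moved pixel positions
    canvas = np.zeros_like(mask)                                       # blank drawing board
    inside = ((moved[:, 0] >= 0) & (moved[:, 0] < h) &
              (moved[:, 1] >= 0) & (moved[:, 1] < w))
    canvas[moved[inside, 0], moved[inside, 1]] = mask.reshape(-1)[inside]
    return canvas
```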
In the embodiment of the disclosure, the foreground mask image can be directly deformed according to the target expression and action parameter sequence determined by the expression parameter sequence to obtain a target mask image sequence fused with the subsequent target action video, that is, only the foreground mask image of one frame (namely, the front-face non-expression image) needs to be calculated, the subsequent target mask image does not need to be calculated, but the foreground mask image of the front-face non-expression image can be directly deformed by the target expression and action parameter sequence to obtain the target mask image, so that the consumption of the video generation process on system computing resources is reduced, the calculation time consumption is short, the fusion efficiency of the foreground image and the background image can be improved, and the generation efficiency of the target action video is improved. In addition, the consistency of each target mask image in the target mask image sequence obtained by directly deforming the foreground image of the front-face expressionless image is better, so that more consistent target action video can be obtained through subsequent fusion; in addition, the deformation processing is carried out through the first-order motion model, the determination accuracy of the target mask image sequence can be improved, and therefore the accuracy of target motion video generation is improved.
In step S17, a target motion video is obtained by fusing the foreground image sequence and the background image in the front-face blankness image based on the target mask image sequence.
In an alternative embodiment, as shown in fig. 7, fig. 7 is a schematic flow chart illustrating a process of obtaining the target motion video by fusing the foreground image sequence and the background image in the front-face expressionless image based on the target mask image sequence according to an exemplary embodiment. Accordingly, the following steps may be included:
in step S171, the pixel value of the pixel point in each target mask image in the target mask image sequence is used as the fusion weight of the corresponding foreground image in the foreground image sequence.
In step S172, a background image fusion weight corresponding to the fusion weight of each foreground image is obtained according to the fusion weight of each foreground image.
In step S173, each foreground image and the background image are fused based on the fusion weight of each foreground image and the background image fusion weight corresponding to the fusion weight of each foreground image, so as to obtain a target motion video frame corresponding to each foreground image.
In step S174, the target motion video frames corresponding to each foreground image are spliced to obtain the target motion video.
In the embodiment of the present disclosure, the fusion weight of each foreground image and the background image fusion weight corresponding to the fusion weight of each foreground image may be determined according to the pixel value of the pixel point in the target mask image, and then each foreground image in the foreground image sequence and the background image are fused according to the two weights, so as to obtain the target action video. Alternatively, the formula for fusing the foreground image and the background image may be as follows:
I_pred = M * I_fg + (1 - M) * I_bg

wherein I_pred denotes the target action video frame obtained by fusing each foreground image with the background image, I_bg is the background image, I_fg is the foreground image, M is the fusion weight of the foreground image (the pixel values of the corresponding target mask image), (1 - M) is the fusion weight of the background image, and * denotes the per-pixel (element-wise) product.
Specifically, the background image may be the portion of the front-face expressionless image other than the first object.
Specifically, in step S171, the pixel value of each pixel point of the target mask image may be used as the fusion weight M of the corresponding foreground image in the foreground image sequence.
Specifically, in the above step S172, (1 - M), i.e., one minus the fusion weight of each foreground image, may be taken as the background image fusion weight corresponding to the fusion weight of that foreground image.
Specifically, in step S173, a first product of the fusion weight of each foreground image (i.e., the pixel value of the pixel point of the target mask image) and the pixel value of the pixel point in that foreground image may be calculated, and a second product of the corresponding background image fusion weight and the pixel value of the pixel point in the background image may be calculated. The sum of the first product and the second product then gives the target action video frame corresponding to that foreground image.
Optionally, the first product may be calculated as the product of the pixel value of the ith pixel point of the target mask image and the pixel value of the ith pixel point of the corresponding foreground image, and the second product may be calculated as the product of the pixel value of the ith pixel point of the background image and (1 - the pixel value of the ith pixel point of the target mask image).
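The per-frame fusion described above can be sketched as follows, assuming NumPy arrays with pixel values normalized to [0, 1]; the names are illustrative, not taken from the patent.

import numpy as np

def fuse_frame(foreground: np.ndarray, background: np.ndarray, target_mask: np.ndarray) -> np.ndarray:
    # foreground, background: (H, W, 3) images; target_mask: (H, W) with values in [0, 1].
    w = target_mask[..., None]                # fusion weight of the foreground image
    first_product = w * foreground            # mask pixel value x foreground pixel value
    second_product = (1.0 - w) * background   # (1 - mask pixel value) x background pixel value
    return first_product + second_product     # target action video frame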
Specifically, in step S174, the target motion video frames corresponding to each foreground image may be spliced according to the above time sequence, so as to obtain a target motion video in which the first object simulates an expression motion of the second object.
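Splicing the fused frames into the target action video in time order might then look like the sketch below, which reuses fuse_frame from the previous sketch and assumes the imageio package with an ffmpeg backend and a frame rate of 25 fps; none of these choices are specified by the patent.

import numpy as np
import imageio

# foreground_sequence, target_mask_sequence and background are assumed to come
# from the preceding steps; frames are converted to 8-bit for encoding.
frames = [np.clip(fuse_frame(fg, background, m) * 255.0, 0, 255).astype(np.uint8)
          for fg, m in zip(foreground_sequence, target_mask_sequence)]
imageio.mimsave('target_action_video.mp4', frames, fps=25)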
In the embodiment of the disclosure, the target mask image sequence is obtained by deforming the foreground mask image, so the target mask images have good temporal continuity. Determining the fusion weights of the foreground and background images from these continuous target mask images, and fusing each foreground image in the foreground image sequence with the background image using those weights, helps the fusion produce a more coherent target motion video.
In the embodiment of the disclosure, the foreground mask image can be directly deformed, according to the target expression and action parameter sequence determined from the expression parameter sequence, to obtain the target mask image sequence used in the subsequent fusion into the target action video. That is, the foreground mask only needs to be computed for a single frame (the front-face expressionless image), and the subsequent target mask images are obtained by directly deforming this foreground mask with the target expression and action parameter sequence. This reduces the consumption of system computing resources during video generation, keeps computation time short, improves the efficiency of fusing the foreground and background images, and thus improves the efficiency of generating the target action video. In addition, the target mask images obtained through the deformation have good temporal continuity, which facilitates fusing them into a more coherent target action video.
Fig. 8 is a block diagram illustrating a video generation apparatus according to an example embodiment. Referring to fig. 8, the apparatus may include an image video acquisition module 21, a reference video frame determination module 22, a first sequence determination module 23, a second sequence determination module 24, a third sequence determination module 25, a deformation module 26, and a fusion module 27.
The image video acquiring module 21 is configured to perform acquiring a front-face blankness image of a first object and a driving video including a second object.
The reference video frame determination module 22 is configured to perform determining a reference video frame from the driving video, where a difference between a facial expression state of a second object included in the reference video frame and a front-face non-expression state of the second object is minimal.
The first sequence determining module 23 is configured to determine an expression parameter sequence of the second object contained in video frames of the driving video relative to the second object contained in the reference video frame, where the expression parameter sequence represents the change of the expression action of the second object contained in the video frames relative to the second object contained in the reference video frame.
The second sequence determining module 24 is configured to determine a target expression and action parameter sequence of the first object based on the expression parameter sequence and the initial expression and action parameters extracted from the front face non-expression image.
The third sequence determining module 25 is configured to determine a foreground image sequence of the first object according to the front face non-expression image and the target expression motion parameter sequence.
The deformation module 26 is configured to perform deformation on the foreground mask image of the front face non-expression image based on the target expression motion parameter sequence to obtain a target mask image sequence.
The fusion module 27 is configured to perform fusion of the foreground image sequence and the background image in the front-face expressionless image based on the target mask image sequence to obtain a target motion video.
In an exemplary embodiment, the deformation module 26 may include:
a matrix conversion unit configured to perform input of the target expression and action parameter sequence and the foreground mask image into a first-order motion model; performing matrix conversion processing on each target expression action parameter in the target expression action parameter sequence to obtain a grid matrix sequence corresponding to the target expression action parameter sequence, wherein each grid matrix in the grid matrix sequence is used for representing the position information of a key point in the corresponding target expression action parameter;
a moving unit configured to move the position information of each pixel point in the foreground mask image according to each grid matrix in the grid matrix sequence to obtain a moved position information sequence of each pixel point;
And the target mask image sequence determining unit is configured to obtain the target mask image sequence according to the moved position information sequence of each pixel point.
In an exemplary embodiment, the first sequence determining module 23 may include:
and the first parameter extraction unit is configured to input the reference video frame into a first-order motion model to extract expression and action parameters, so as to obtain first key point position information and first-order motion information of a second object contained in the reference video frame.
And the second parameter extraction unit is configured to perform expression action parameter extraction by inputting each video frame in the driving video into the first-order motion model, so as to obtain a second key point position information sequence and a second first-order motion information sequence of a second object contained in the driving video.
And a difference value calculating unit configured to calculate a difference value between each piece of second keypoint position information in the second keypoint position information sequence and the first keypoint position information to obtain a keypoint position change information sequence of a second object included in the driving video.
And the ratio calculation unit is configured to calculate the ratio between each second-order motion information in the second-order motion information sequence and the first-order motion information to obtain a motion state change information sequence of a second object contained in the driving video.
And the expression parameter sequence acquisition unit is configured to obtain the expression parameter sequence according to the key point position change information sequence and the motion state change information sequence.
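Purely as an illustrative sketch (NumPy, hypothetical names), the difference and ratio computed by the units above could be written as follows, where the "ratio" of the 2x2 first-order motion (Jacobian) matrices is taken as multiplication by the inverse of the reference Jacobian, which is one reasonable reading consistent with the relative mode of first-order motion models, not necessarily the patent's exact formula.

import numpy as np

def relative_expression_params(kp_ref, jac_ref, kp_seq, jac_seq):
    # kp_ref: (K, 2) key point positions of the reference video frame.
    # jac_ref: (K, 2, 2) first-order motion information of the reference video frame.
    # kp_seq, jac_seq: lists with one entry of the same shapes per driving video frame.
    kp_change = [kp - kp_ref for kp in kp_seq]                         # difference values
    motion_change = [jac @ np.linalg.inv(jac_ref) for jac in jac_seq]  # "ratios"
    return kp_change, motion_change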
In an exemplary embodiment, the initial expression motion parameters include initial key point position information and initial first-order motion information, and the second sequence determining module 24 may include:
a key point position information sequence obtaining unit configured to execute a target key point position information sequence of the first object according to the initial key point position information and the key point position change information sequence.
A first order motion information sequence obtaining unit configured to perform obtaining a target first order motion information sequence of the first object according to the initial first order motion information and the motion state change information sequence.
And the target expression action parameter sequence acquisition unit is configured to execute the target expression action parameter sequence according to the target key point position information sequence and the target first-order motion information sequence.
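Continuing the sketch above, the target expression action parameter sequence of the first object could then be assembled by applying the change information to the initial parameters extracted from the front-face expressionless image; the composition rule shown is an assumption, not the patent's exact formula.

def target_expression_params(kp_init, jac_init, kp_change, motion_change):
    # kp_init: (K, 2) initial key point position information of the first object.
    # jac_init: (K, 2, 2) initial first-order motion information of the first object.
    kp_target = [kp_init + d for d in kp_change]
    jac_target = [r @ jac_init for r in motion_change]
    return list(zip(kp_target, jac_target))   # target expression action parameter sequence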
In an exemplary embodiment, the third sequence determining module 25 may include:
and the migration unit is configured to input the front face expressionless image and the target expression action parameter sequence into a first-order motion model, and migrate each target expression action parameter in the target expression action parameter sequence to the front face expressionless image to obtain an expression-migrated image sequence.
And the foreground segmentation unit is configured to perform foreground segmentation processing on each image in the image sequence after the expression migration to obtain a foreground image sequence of the first object.
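A minimal sketch of the foreground segmentation step, assuming a soft person-segmentation mask is available from any portrait segmentation or matting model (the model itself is outside this sketch, and the names are illustrative):

import numpy as np

def extract_foreground(image: np.ndarray, person_mask: np.ndarray) -> np.ndarray:
    # image: (H, W, 3) expression-migrated frame; person_mask: (H, W) with values in [0, 1].
    return image * person_mask[..., None]

# One foreground image per expression-migrated frame, kept in time order.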
In an exemplary embodiment, the fusion module 27 may include:
and the first weight determining unit is configured to perform the step of taking the pixel value of the pixel point in each target mask image in the target mask image sequence as the fusion weight of the corresponding foreground image in the foreground image sequence.
And the second weight determining unit is configured to obtain, according to the fusion weight of each foreground image, a background image fusion weight corresponding to the fusion weight of that foreground image.
And the fusion unit is configured to fuse each foreground image with the background image based on the fusion weight of that foreground image and the corresponding background image fusion weight, so as to obtain a target action video frame corresponding to each foreground image.
And the splicing unit is configured to splice the target action video frames corresponding to each foreground image to obtain the target action video.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, there is also provided an electronic device, comprising a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the steps of any of the video generation methods described in the embodiments above when executing the instructions stored on the memory.
The electronic device may be a terminal, a server, or a similar computing device. Taking a server as an example, fig. 9 is a block diagram of an electronic device for video generation according to an exemplary embodiment. The electronic device 30 may vary considerably with configuration or performance and may include one or more Central Processing Units (CPUs) 31 (the CPU 31 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 33 for storing data, and one or more storage media 32 (e.g., one or more mass storage devices) for storing applications 323 or data 322. The memory 33 and the storage medium 32 may be transient storage or persistent storage. The program stored on the storage medium 32 may include one or more modules, each of which may include a series of instruction operations for the electronic device. Further, the central processing unit 31 may be configured to communicate with the storage medium 32 and execute the series of instruction operations in the storage medium 32 on the electronic device 30. The electronic device 30 may also include one or more power supplies 36, one or more wired or wireless network interfaces 35, one or more input/output interfaces 34, and/or one or more operating systems 321, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The input output interface 34 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the electronic device 30. In one example, the input/output Interface 34 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In an exemplary embodiment, the input/output interface 34 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 9 is merely an illustration and is not intended to limit the structure of the electronic device. For example, electronic device 30 may also include more or fewer components than shown in FIG. 9, or have a different configuration than shown in FIG. 9.
In an exemplary embodiment, there is also provided a computer-readable storage medium, wherein instructions of the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the steps of any of the video generation methods of the above embodiments.
In an exemplary embodiment, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the video generation method provided in any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided by the present disclosure may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of video generation, comprising:
acquiring a front face non-expression image of a first object and a driving video comprising a second object;
determining a reference video frame from the driving video, wherein a difference between a facial expression state of the second object contained in the reference video frame and a front-face non-expression state of the second object is minimal;
determining an expression parameter sequence of the second object contained in video frames of the driving video relative to the second object contained in the reference video frame, wherein the expression parameter sequence represents a change of an expression action of the second object contained in the video frames relative to the second object contained in the reference video frame;
determining a target expression and action parameter sequence of the first object based on the expression parameter sequence and initial expression and action parameters extracted from the front face non-expression image;
determining a foreground image sequence of the first object according to the front face non-expression image and the target expression action parameter sequence;
based on the target expression action parameter sequence, deforming the foreground mask image of the front face non-expression image to obtain a target mask image sequence;
and fusing the foreground image sequence and the background image in the front face non-expression image based on the target mask image sequence to obtain a target action video.
2. The video generation method according to claim 1, wherein the transforming the foreground mask image of the front-face expressionless image based on the target expression motion parameter sequence to obtain a target mask image sequence comprises:
inputting the target expression and action parameter sequence and the foreground mask image into a first-order motion model, and performing matrix conversion processing on each target expression and action parameter in the target expression and action parameter sequence to obtain a grid matrix sequence corresponding to the target expression and action parameter sequence, wherein each grid matrix in the grid matrix sequence is used for representing the position information of a key point in the corresponding target expression and action parameter;
according to each grid matrix in the grid matrix sequence, moving the position information of each pixel point in the foreground mask image to obtain the position information sequence of each pixel point after moving;
and obtaining the target mask image sequence according to the moved position information sequence of each pixel point.
3. The video generation method according to claim 1 or 2, wherein the determining the sequence of expression parameters of the second object included in the video frame of the driving video relative to the second object included in the reference video frame comprises:
inputting the reference video frame into a first-order motion model to extract expression and action parameters, and obtaining first key point position information and first-order motion information of a second object contained in the reference video frame;
inputting each video frame in the driving video into the first-order motion model to extract expression action parameters, and obtaining a second key point position information sequence and a second first-order motion information sequence of a second object contained in the driving video;
calculating the difference between each second key point position information in the second key point position information sequence and the first key point position information to obtain a key point position change information sequence of a second object contained in the driving video;
calculating the ratio of each second-order motion information in the second-order motion information sequence to the first-order motion information to obtain a motion state change information sequence of a second object contained in the driving video;
and obtaining the expression parameter sequence according to the key point position change information sequence and the motion state change information sequence.
4. The video generation method according to claim 3, wherein the initial expression parameters include initial key point position information and initial first-order motion information, and the determining the target expression parameter sequence of the first object based on the expression parameter sequence and the initial expression parameters extracted from the front face blankness image includes:
obtaining a target key point position information sequence of the first object according to the initial key point position information and the key point position change information sequence;
obtaining a target first-order motion information sequence of the first object according to the initial first-order motion information and the motion state change information sequence;
and obtaining the target expression action parameter sequence according to the target key point position information sequence and the target first-order motion information sequence.
5. A video generation method according to claim 1 or 2, wherein said determining a foreground image sequence of the first object based on the front face expressionless image and the target expression motion parameter sequence comprises:
inputting the front face non-expression image and the target expression action parameter sequence into a first-order motion model, and transferring each target expression action parameter in the target expression action parameter sequence to the front face non-expression image to obtain an expression transferred image sequence;
and performing foreground segmentation processing on each image in the image sequence after the expression migration to obtain a foreground image sequence of the first object.
6. The video generation method according to claim 1 or 2, wherein the fusing the foreground image sequence and the background image in the front-face blankness image based on the target mask image sequence to obtain a target motion video includes:
taking the pixel value of a pixel point in each target mask image in the target mask image sequence as the fusion weight of the corresponding foreground image in the foreground image sequence;
obtaining a background image fusion weight corresponding to the fusion weight of each foreground image according to the fusion weight of each foreground image;
fusing each foreground image with the background image based on the fusion weight of each foreground image and the background image fusion weight corresponding to the fusion weight of each foreground image to obtain a target action video frame corresponding to each foreground image;
and splicing the target action video frames corresponding to each foreground image to obtain the target action video.
7. A video generation apparatus, comprising:
an image video acquisition module configured to perform acquisition of a front-face blankness image of a first object and a drive video including a second object;
a reference video frame determination module configured to perform determination of a reference video frame from the driving video, where a difference between a facial expression state of a second object contained in the reference video frame and a front-face non-expression state of the second object is minimal;
a first sequence determination module configured to perform determining an expression parameter sequence of the second object contained in video frames of the driving video relative to the second object contained in the reference video frame, wherein the expression parameter sequence represents a change of an expression action of the second object contained in the video frames relative to the second object contained in the reference video frame;
a second sequence determination module configured to perform determining a target expression and action parameter sequence of the first object based on the expression parameter sequence and the initial expression and action parameters extracted from the front face non-expression image;
a third sequence determination module configured to perform determining a foreground image sequence of the first object according to the front face non-expressive image and the target expression action parameter sequence;
the deformation module is configured to perform deformation on a foreground mask image of the front face non-expression image based on the target expression action parameter sequence to obtain a target mask image sequence;
and the fusion module is configured to perform fusion on the foreground image sequence and the background image in the front face expressionless image based on the target mask image sequence to obtain a target action video.
8. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video generation method of any of claims 1 to 6.
9. A computer-readable storage medium whose instructions, when executed by a processor of an electronic device, cause the electronic device to perform the video generation method of any of claims 1-6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the video generation method of any of claims 1 to 6.
CN202110472857.3A 2021-04-29 2021-04-29 Video generation method, device, electronic equipment and storage medium Active CN113269700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110472857.3A CN113269700B (en) 2021-04-29 2021-04-29 Video generation method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110472857.3A CN113269700B (en) 2021-04-29 2021-04-29 Video generation method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113269700A true CN113269700A (en) 2021-08-17
CN113269700B CN113269700B (en) 2023-12-12

Family

ID=77229722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110472857.3A Active CN113269700B (en) 2021-04-29 2021-04-29 Video generation method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113269700B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341784A (en) * 2016-04-29 2017-11-10 掌赢信息科技(上海)有限公司 A kind of expression moving method and electronic equipment
CN108305267A (en) * 2018-02-14 2018-07-20 北京市商汤科技开发有限公司 Method for segmenting objects, device, equipment, storage medium and program
CN108399383A (en) * 2018-02-14 2018-08-14 深圳市商汤科技有限公司 Expression moving method, device storage medium and program
US20200234482A1 (en) * 2019-01-18 2020-07-23 Snap Inc. Systems and methods for photorealistic real-time portrait animation
CN109961496A (en) * 2019-02-22 2019-07-02 厦门美图之家科技有限公司 Expression driving method and expression driving device
CN110390704A (en) * 2019-07-11 2019-10-29 深圳追一科技有限公司 Image processing method, device, terminal device and storage medium
CN110516598A (en) * 2019-08-27 2019-11-29 北京百度网讯科技有限公司 Method and apparatus for generating image
CN110689480A (en) * 2019-09-27 2020-01-14 腾讯科技(深圳)有限公司 Image transformation method and device
CN111028305A (en) * 2019-10-18 2020-04-17 平安科技(深圳)有限公司 Expression generation method, device, equipment and storage medium
CN110941332A (en) * 2019-11-06 2020-03-31 北京百度网讯科技有限公司 Expression driving method and device, electronic equipment and storage medium
CN110889381A (en) * 2019-11-29 2020-03-17 广州华多网络科技有限公司 Face changing method and device, electronic equipment and storage medium
CN112465935A (en) * 2020-11-19 2021-03-09 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CARLOS CRIVELI et al.: "Inside-out: from basic emotions theory to the behavioral ecology view", Journal of Nonverbal Behavior *
YUAN Fei: "Facial action unit recognition and its application in an expression transfer system", China Master's Theses Full-text Database, Information Science and Technology *
ZHAN Li: "Research on 3D human head reconstruction and algorithms", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615521A (en) * 2022-03-10 2022-06-10 网易(杭州)网络有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN114615521B (en) * 2022-03-10 2024-02-23 网易(杭州)网络有限公司 Video processing method and device, computer readable storage medium and electronic equipment
CN115393945A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Voice-based image driving method and device, electronic equipment and storage medium
CN117746513A (en) * 2024-02-19 2024-03-22 成都体育学院 Motion technology teaching method and system based on video moving object detection and fusion
CN117746513B (en) * 2024-02-19 2024-04-30 成都体育学院 Motion technology teaching method and system based on video moving object detection and fusion

Also Published As

Publication number Publication date
CN113269700B (en) 2023-12-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant