CN111696185A - Method and device for generating dynamic expression image sequence by using static face image

- Publication number: CN111696185A (application CN201910186356.1A)
- Authority: CN (China)
- Prior art keywords: frame, control point, image, face image, expression
- Legal status: Pending
Classifications
- G06T13/80 — 2D [Two Dimensional] animation, e.g. using sprites (G—PHYSICS; G06—COMPUTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL; G06T13/00—Animation)
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction (G06T5/00—Image enhancement or restoration)
- G06T2207/20221 — Image fusion; Image merging (G06T2207/00—Indexing scheme for image analysis or image enhancement; G06T2207/20—Special algorithmic details; G06T2207/20212—Image combination)
- G06T2207/30201 — Face (G06T2207/30—Subject of image; Context of image processing; G06T2207/30196—Human being; Person)
Abstract
The invention provides a method and a device for generating a dynamic expression image sequence from a static face image. The method comprises the following steps: acquiring a target static face image; detecting face identification points in the target static face image with a face key point detection algorithm and using them as control points; acquiring expression action driving data corresponding to the control points; and performing local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence. The scheme of the invention realizes dynamic conversion of a single static image with a natural generated dynamic effect, and can be used for playing expressions from a single static image, generating expression packs, and the like.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for generating a dynamic expression image sequence by using a static face image, a computer storage medium and computing equipment.
Background
Generally, a video or animated picture is composed of different frames that change in time sequence; a dynamic effect is achieved by playing temporally continuous picture content at a constant frame rate. Therefore, obtaining dynamic picture content usually requires continuous shooting over a period of time. However, with cameras now ubiquitous in daily life, there is a huge number of single static pictures, and they are far from fully utilized. How to give dynamic life to such single static pictures and make them 'come alive' is therefore a problem to be solved urgently.
Disclosure of Invention
In view of the above, the present invention provides a method and apparatus for generating a dynamic expression image sequence from a static face image, together with a computer storage medium and a computing device, that overcome or at least partially solve the above problems.
According to an aspect of the embodiments of the present invention, there is provided a method for generating a dynamic expression image sequence by using a static face image, including:
acquiring a target static face image;
detecting face identification points in the target static face image by a face key point detection algorithm to be used as control points;
acquiring expression action driving data corresponding to the control point;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
Optionally, the obtaining expression and motion driving data corresponding to the control point includes:
and simulating according to the facial muscle movement mode of the facial expression action to obtain the motion trail data of the control points of a plurality of continuous frames as the expression action driving data.
Optionally, the obtaining expression and motion driving data corresponding to the control point includes:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through the face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control points as the expression action driving data.
Optionally, before calculating the motion trajectory data of the control point, the method further includes:
and carrying out face alignment on the initial frame of the driving face image sequence to the target static face image.
Optionally, performing face alignment on the initial frame of the driving face image sequence to the target static face image, including:
taking the three control points at the outer corners of the eyes and the tip of the nose in the initial frame of the driving face image sequence, together with the corresponding face identification points of the target static face image, as transformation point pairs, and solving for an alignment mapping matrix;
calculating motion trajectory data of the control points, including:
calculating the motion displacement of the control point in each frame after the initial frame of the driving human face image sequence relative to the control point in the initial frame;
and correcting the motion displacement of the control point by using the alignment mapping matrix to obtain the motion trail data of the control point.
Optionally, performing local affine transformation on the control points in the target static face image frame by frame according to the expression and motion driving data, including:
calculating to obtain a control point set of an ith frame of the dynamic expression image sequence by using the control point set of the target static face image, the control point set of the initial frame of the expression action driving data and the control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
Optionally, generating a set of control point pair groups for local affine transformation by using the set of control points of each frame of the dynamic expression image sequence, including:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of a triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
performing local affine transformation on the control points in the target static face image frame by frame according to the control point group set, including:
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
Optionally, the triangulation is Delaunay triangulation;
the local affine transformation is a dense affine transformation.
Optionally, after the control point set of the ith frame of the dynamic expression image sequence is obtained through calculation, the method further includes:
and judging the effectiveness of the control points of the designated area in the ith frame of the dynamic expression image sequence, and correcting the control points of the designated area according to the judgment result.
Optionally, after performing local affine transformation on the control points in the target static face image frame by frame, the method further includes:
and fusing each triangular surface element region after local affine transformation to the corresponding position of the target static face image to obtain the ith frame of dynamic expression image.
Optionally, the method further comprises:
and fusing preset area materials for the hidden area in each frame of image of the obtained dynamic expression image sequence.
Optionally, the hidden area comprises at least one of:
mouth region, eye region, forehead region;
the preset area material comprises at least one of the following materials:
teeth and/or tongue, eye balls, wrinkles.
Optionally, when the hidden area is a mouth area, fusing preset area materials to the hidden area in each frame image of the obtained dynamic expression image sequence, including:
calculating an initial area A0 of a mouth region in an initial frame of the dynamic expression image sequence, and setting a trigger threshold At according to the initial area;
calculating the area A of the mouth region in each frame after the initial frame of the dynamic expression image sequence;
comparing A with At;
if A > At, acquiring a mouth region material image, and fusing the mouth region material image into the mouth region of the frame of the dynamic expression image sequence.
Optionally, after comparing A with At, the method further comprises:
if At > A > A0, performing an Alpha fusion operation on the mouth region in the frame of the dynamic expression image sequence.
Optionally, acquiring a mouth region material image includes:
acquiring a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from the target static face image;
aligning the driving face image with the mouth region in the dynamic expression image of the current frame;
and cutting a mouth region in the driving face image as the mouth region material image.
Optionally, aligning the driving face image with the mouth region in the dynamic expression image of the current frame includes:
selecting the two mouth corners and an eyebrow position in the human face as three control points, and respectively acquiring the positions of the three control points in the driving human face image and in the dynamic expression image of the current frame;
according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame, solving an alignment transformation matrix;
and transforming the driving face image by using the alignment transformation matrix.
Optionally, the method further comprises:
and generating dynamic images or videos at a specified frame rate by using the dynamic expression image sequence.
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for generating a dynamic expression image sequence using a static face image, including:
the static image acquisition module is suitable for acquiring a target static face image;
the face identification point detection module is suitable for detecting the face identification points in the target static face image through a face key point detection algorithm to serve as control points;
the driving data acquisition module is suitable for acquiring expression action driving data corresponding to the control point; and
and the local affine transformation module is suitable for carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
Optionally, the driving data obtaining module is further adapted to:
and simulating according to the facial muscle movement mode of the facial expression action to obtain the motion trail data of the control points of a plurality of continuous frames as the expression action driving data.
Optionally, the driving data obtaining module is further adapted to:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through the face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control points as the expression action driving data.
Optionally, the driving data obtaining module is further adapted to:
and before calculating the motion trajectory data of the control point, performing face alignment on the initial frame of the driving face image sequence to the target static face image.
Optionally, the driving data obtaining module is further adapted to:
taking the three control points at the outer corners of the eyes and the tip of the nose in the initial frame of the driving face image sequence, together with the corresponding face identification points of the target static face image, as transformation point pairs, and solving for an alignment mapping matrix;
calculating the motion displacement of the control point in each frame after the initial frame of the driving human face image sequence relative to the control point in the initial frame;
and correcting the motion displacement of the control point by using the alignment mapping matrix to obtain the motion trail data of the control point.
Optionally, the local affine transformation module is further adapted to:
calculating to obtain a control point set of an ith frame of the dynamic expression image sequence by using the control point set of the target static face image, the control point set of the initial frame of the expression action driving data and the control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
Optionally, the local affine transformation module is further adapted to:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of a triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
Optionally, the triangulation is Delaunay triangulation;
the local affine transformation is a dense affine transformation.
Optionally, the local affine transformation module is further adapted to:
after the control point set of the ith frame of the dynamic expression image sequence is obtained through calculation, the effectiveness of the control points of the designated area in the ith frame of the dynamic expression image sequence is judged, and the control points of the designated area are corrected according to the judgment result.
Optionally, the local affine transformation module is further adapted to:
and after carrying out local affine transformation on the control points in the target static face image frame by frame, fusing each triangular surface element region subjected to local affine transformation to the corresponding position of the target static face image to obtain the ith frame of dynamic expression image.
Optionally, the apparatus further comprises:
and the hidden area processing module is suitable for fusing preset area materials in the hidden area in each frame of image of the obtained dynamic expression image sequence.
Optionally, the hidden area comprises at least one of:
mouth region, eye region, forehead region;
the preset area material comprises at least one of the following materials:
teeth and/or tongue, eye balls, wrinkles.
Optionally, the hidden area processing module is further adapted to:
when the hidden area is a mouth area, calculating an initial area A0 of the mouth area in an initial frame of the dynamic expression image sequence, and setting a trigger threshold At according to the initial area;
calculating the area A of the mouth region in each frame after the initial frame of the dynamic expression image sequence;
comparing A with At;
if A > At, acquiring a mouth region material image, and fusing the mouth region material image into the mouth region of the frame of the dynamic expression image sequence.
Optionally, the hidden area processing module is further adapted to:
after comparing A with At, if At > A > A0, performing an Alpha fusion operation on the mouth region in the frame of the dynamic expression image sequence.
Optionally, the hidden area processing module is further adapted to:
acquiring a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from the target static face image;
aligning the driving face image with the mouth region in the dynamic expression image of the current frame;
and cutting a mouth region in the driving face image as the mouth region material image.
Optionally, the hidden area processing module is further adapted to:
selecting the two mouth corners and an eyebrow position in the human face as three control points, and respectively acquiring the positions of the three control points in the driving human face image and in the dynamic expression image of the current frame;
according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame, solving an alignment transformation matrix;
and transforming the driving face image by using the alignment transformation matrix.
Optionally, the apparatus further comprises:
and the dynamic image/video synthesis module is suitable for generating dynamic images or videos at a specified frame rate by utilizing the dynamic expression image sequence.
According to yet another aspect of the embodiments of the present invention, there is also provided a computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the method for generating a sequence of dynamic expression images from static face images according to any one of the above.
According to still another aspect of the embodiments of the present invention, there is also provided a computing device including:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform a method of generating a sequence of dynamic expression images from static face images according to any of the above.
The method and device for generating a dynamic expression image sequence from a static face image provided by the embodiments of the invention first acquire a target static face image and detect face identification points in it as control points through a face key point detection algorithm; then acquire expression action driving data corresponding to the control points; and finally perform local affine transformation on the control points in the target static face image frame by frame using the expression action driving data to obtain the dynamic expression image sequence. This realizes dynamic conversion of a single static image and is applicable to playing expressions from a single static image, generating expression packs, and similar applications.
Furthermore, the control point pair group set for local affine transformation is generated by triangulating the discrete control points into triangular bins and solving the affine transformation matrix of each corresponding bin region, so the dynamic effect after transformation according to the affine transformation matrices is more natural and distortion is avoided. Furthermore, by fusing material into hidden areas such as the mouth region after the mouth opens, the influence of missing content in the generated dynamic expression images is avoided, and the generated dynamic effect is more natural and appealing.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart illustrating a method for generating a sequence of dynamic expression images using a static face image according to an embodiment of the present invention;
fig. 2 is a flowchart illustrating steps of performing a material blending operation on a hidden area when the hidden area is a mouth area according to an embodiment of the present invention;
FIG. 3 is a flow chart illustrating a method for generating a sequence of dynamic expression images using static face images according to another embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for generating a sequence of dynamic expression images using static face images according to an embodiment of the present invention; and
fig. 5 is a schematic structural diagram of an apparatus for generating a sequence of dynamic expression images using a static face image according to another embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
There are generally two approaches to converting a static image into a dynamic one. The first generates key frames through local image transformation and then produces a dynamic effect containing a continuous picture sequence by interpolation; the second, learning-based approach uses a generative model to understand the picture from its initial state and then produce the desired dynamic output. In the prior art, one method trains a generative adversarial network on a large amount of unlabeled video data; the resulting generative model can separate the foreground and background of a single input image and estimate the motion of the foreground object to generate a dynamic video. Another method uses a reference face video to drive a static target face image, imitating the face transformation in the reference video through 2D transformation to generate a dynamic picture from a single face image, but the transformation is not accurate enough. Yet another method performs reenactment based on a parametric model, achieving real-time transfer of facial expressions from a single reference video to a single target video, but its input is a video source rather than a single static image.
In order to solve the above technical problem, an embodiment of the present invention provides a method for generating a dynamic expression image sequence by using a static face image. Fig. 1 is a flowchart illustrating a method for generating a sequence of dynamic expression images using a static face image according to an embodiment of the present invention. Referring to fig. 1, the method may include at least the following steps S102 to S108.
And step S102, acquiring a target static face image.
And step S104, detecting the face identification points in the target static face image as control points through a face key point detection algorithm.
And step S106, acquiring expression action driving data corresponding to the control point.
And S108, carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
The method for generating a dynamic expression image sequence from a static face image first acquires a target static face image and detects face identification points in it as control points through a face key point detection algorithm, then acquires expression action driving data corresponding to the control points, and finally performs local affine transformation on the control points in the target static face image frame by frame using the expression action driving data to obtain the dynamic expression image sequence, thereby realizing dynamic conversion of a single static image; it can be used for playing expressions from a single static image, generating expression packs, and other applications.
The target static face image in step S102 may be a single static picture collected by a camera, or a single static picture obtained from an album, the internet, or the like. Preferably, the face in the target static face image has a natural expression of the face.
The face identification points detected in step S104 may be identification points of key parts such as mouth, eyebrow, eye, nose, and outer contour of face below eyes.
The embodiment of the invention fully utilizes the characteristics of the face image and uses a standard face key point detection algorithm, including but not limited to the open-source computer vision library dlib and face key point detection algorithms based on the CLM (Constrained Local Model) framework.
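For illustration only, the detection step maps onto the dlib library mentioned above; this is a minimal sketch, assuming dlib's standard pre-trained 68-point predictor file and taking the first detected face as the target:

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# dlib's standard pre-trained 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_control_points(image):
    """Return a (68, 2) float array of face identification points, or None.

    `image` is an 8-bit grayscale or RGB numpy array.
    """
    faces = detector(image, 1)  # upsample once to help with small faces
    if len(faces) == 0:
        return None
    shape = predictor(image, faces[0])
    return np.array([(p.x, p.y) for p in shape.parts()], dtype=np.float32)
```

The 68 points cover the mouth, eyebrows, eyes, nose and the lower face contour, matching the key parts listed above.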
The expression motion driving data corresponding to the control point may be acquired in the above step S106 in the following two ways.
Firstly, simulation is carried out according to facial muscle movement modes of facial expression actions, so that movement track data of control points of a plurality of continuous frames are obtained and used as expression action driving data.
And secondly, acquiring motion trail data of the control points by using a human face key point detection algorithm and performing necessary processing. Specifically, the method may include the steps of:
firstly, facial expression images of a plurality of continuous frames are obtained as a driving facial image sequence. The facial expression image mentioned here may be a previously prepared image or an image acquired in real time by a camera (e.g., webcam). Preferably, the initial frame driving the face image sequence is also a face image of a natural facial expression. It should be noted that, in the embodiment of the present invention, the initial frame is defined as the 0 th frame.
Then, a face key point detection algorithm is used for detecting face identification points in the face expression images of each frame as control points, and the control points in the face expression images are in one-to-one correspondence with the control points in the target static face image.
And finally, calculating the motion trail data of the control points in the facial expression image of each frame as expression action driving data.
Considering that the positions of the face in the initial frame of the driving face image sequence and in the target static face image may be inconsistent, directly driving the transformation of the target static face image with the motion trajectory data of the control points in the driving face images is not accurate enough, and image distortion may result. In this case, a step of aligning the initial frame of the driving face image sequence to the target static face image may be performed before the motion trajectory data of the control points is calculated. For example, if the face in the initial frame of the driving face image sequence looks straight ahead while the face in the target static face image leans slightly to the left, face alignment adjusts the initial frame of the driving face image sequence to the same slightly left-leaning pose, making the face positions in the two consistent.
Specifically, the face alignment step may be implemented as follows: the three control points at the outer corners of the eyes and the tip of the nose in the initial frame of the driving face image sequence, together with the corresponding face identification points of the target static face image, are taken as transformation point pairs, and an alignment mapping matrix is solved. Accordingly, the step of calculating the motion trajectory data of the control points may be implemented as: the motion displacement of each control point in every frame after the initial frame of the driving face image sequence, relative to the same control point in the initial frame, is calculated, and the calculated motion displacement is then corrected with the alignment mapping matrix to obtain the motion trajectory data of the control points. In step S108, local affine transformation is performed on the control points in the target static face image frame by frame, obtaining a sequence of continuous dynamic expression images.
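A sketch of this alignment step, assuming the 68-point dlib layout of the previous example (indices 36 and 45 for the outer eye corners and 30 for the nose tip are assumptions of the illustration, not fixed by the method):

```python
import cv2
import numpy as np

ALIGN_IDX = [36, 45, 30]  # assumed outer eye corners and nose tip

def alignment_matrix(drive_pts0, target_pts):
    """Solve the 2x3 alignment mapping matrix from three point pairs."""
    src = drive_pts0[ALIGN_IDX].astype(np.float32)
    dst = target_pts[ALIGN_IDX].astype(np.float32)
    return cv2.getAffineTransform(src, dst)

def corrected_displacements(phi_a, drive_pts_i, drive_pts0):
    """Correct frame-i control point displacements with the alignment matrix.

    Displacements are direction vectors, so only the linear 2x2 part of the
    matrix applies; its translation component cancels out.
    """
    delta = (drive_pts_i - drive_pts0).astype(np.float32)
    return delta @ phi_a[:, :2].T
```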
Affine transformation, also called affine mapping, is a geometric transformation in which one vector space undergoes a linear transformation followed by a translation into another vector space. Affine transformation of an image can realize scaling, rotation and translation, changing the picture content. In particular, a planar affine transformation is fully determined by just three pairs of corresponding points, which is convenient.
After the expression action driving data is used for conversion, each frame of the obtained dynamic expression image sequence corresponds to each frame of the expression action driving data one by one.
In an optional embodiment of the present invention, step S108 may be further embodied as the following steps:
firstly, the control point set of the i-th frame of the dynamic expression image sequence is calculated from the control point set of the target static face image and the control point sets of the initial frame and the i-th frame of the expression action driving data, where the target static face image serves as the initial frame of the dynamic expression image sequence and i is a natural number greater than or equal to 1. Specifically, the control points of the expression action driving data are temporally consistent; on the premise that the target static face image (i.e., the initial frame of the dynamic expression image sequence) and the initial frame of the expression action driving data both show frontal natural expressions, the frame-by-frame motion displacement of the control points of the expression action driving data can be used to simulate the motion displacement of the control points in the corresponding frames of the dynamic expression image sequence, yielding the control point set of each frame after the initial frame of the dynamic expression image sequence.
Then, a control point pair group set for local affine transformation is generated by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair group set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of the corresponding control point of the initial frame and the ith frame of the dynamic expression image sequence.
And finally, carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set to obtain a dynamic expression image sequence which corresponds to the expression action driving data frame by frame.
The following describes the specific process of solving for the control point set of each frame of the dynamic expression image sequence (hereinafter referred to as the target image).
Assume the target static face image (i.e., the initial frame of the dynamic expression image sequence) is t_0; applying the face key point detection algorithm to t_0 yields the control point set P^{t_0}, which serves as the initial input control point set for the affine transformation. To apply the affine transformation, the control point set to which the target static face image is mapped must be solved, i.e., the control point set of each frame after the initial frame of the dynamic expression image sequence (the target image); let P^{t_i} denote the control point set of the i-th frame of the target image. It is calculated from the control point data of the expression action to be generated: the control point set of the i-th frame of the expression action driving data (hereinafter referred to as the source data) can be denoted P^{s_i}. The source data may be simulated from the facial muscle movement patterns of the facial expression action, or may be a driving face image sequence composed of several consecutive captured frames of facial expression images. On the premise that the faces in the initial frames of both the source data and the target image show frontal natural expressions, the control point set of the initial frame of the source data can be denoted P^{s_0}.
Because the control points of the source data are temporally consistent, and the face states in the initial frames of the target image and the source data are both assumed to be frontal natural expressions, the frame-to-frame motion displacement of the source control points, P^{s_i} - P^{s_0}, can be used to simulate the motion displacement of the corresponding control points of the target image, i.e., the relationship of formula (1):

P^{t_i} - P^{t_0} = P^{s_i} - P^{s_0}    (1)

Further, based on formula (1), the control point set P^{t_i} of the i-th frame of the target image can be calculated from the detected initial-frame control point set P^{t_0} of the target image together with the simulated or detected initial-frame control point set P^{s_0} and i-th-frame control point set P^{s_i} of the source data.
In addition, when the source data is a driving face image sequence composed of several consecutive captured frames of facial expression images and the face positions in the initial frames of the source data and the target image are inconsistent, the initial frame of the source data is first aligned to the initial frame of the target image: the three control points at the outer corners of the eyes and the tip of the nose among the face control points of the initial frames of the source data and the target image are taken as transformation point pairs, and the alignment mapping matrix Φ_A is solved. In this case, Φ_A can be used to correct the motion displacement of the control points, and the control point set of the i-th frame of the target image is calculated by formula (2):

P^{t_i} = P^{t_0} + Φ_A (P^{s_i} - P^{s_0})    (2)
in order to perform local deformation operation on a target static face image by using local affine transformation to generate a dynamic expression image sequence, a control point pair set for local transformation needs to be determined. In addition, at least three pairs of control points are needed for arbitrary local application of affine transformation to the plane image, and therefore, the triangulation method is very suitable for the occasion of generating a control point pair set for local affine transformation by using a plurality of control point sets of the face.
In an optional embodiment of the present invention, the step of generating a set of control point pair groups for local affine transformation using the set of control points of each of the successive frames of the calculated dynamic expression image sequence (i.e. the target image) may be further implemented as:
firstly, triangulation is carried out on control points in a target static face image, and a triangular surface element is generated. And then, generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the subdivided triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of the triangular surface element, and each pair of vertexes comprises position information of a corresponding control point of the initial frame and the ith frame of the dynamic expression image sequence.
Correspondingly, the step of performing local affine transformation on the control points in the target static face image frame by frame according to the control point group set can also be implemented as follows:
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
Triangulation refers to the process of dividing a plane, for a given set of points in it, into triangular sub-regions with the points as vertices. The resulting planar subdivision satisfies the following conditions: (1) apart from their endpoints, edges in the subdivision contain no other points of the point set; (2) no two edges intersect; (3) all faces in the subdivision are triangular, and the union of all triangular faces is the convex hull of the point set. Through triangulation, the key parts of the face can be segmented, according to the control points in the target static face image, into triangular bins that are easy to process.
As those skilled in the art will readily appreciate, a point set admits many possible triangulations. Among triangulation methods, the Delaunay triangulation has the following excellent properties: (1) the empty-circle property, i.e., the circumcircle of any triangle contains no other triangle vertex; (2) the max-min angle property, i.e., among all possible triangulations of the scattered point set, the Delaunay triangulation maximizes the minimum triangle angle; specifically, if the diagonals of the convex quadrilateral formed by any two adjacent triangles are swapped, the smallest of the six interior angles of the two triangles does not increase; (3) uniqueness, i.e., the same result is ultimately obtained no matter from which region construction starts. Owing to these properties, the triangles generated by Delaunay triangulation are 'fuller', i.e., their three interior angles are more uniform, so the dynamic effect based on transforming Delaunay triangular bins is more natural and does not produce distortion.
The following describes the generation of the local triangular bin and the process of performing the local affine transformation in detail by using an example.
In this example, the face control points in the target static face image (i.e., the initial frame of the target image) are triangulated by the Delaunay triangulation method, generating a number of Delaunay triangular bins (Delaunay triangles). Based on the Delaunay triangulation result, the control point pair group set for local affine transformation is generated using the control point set of each frame of the target image (taking the i-th frame as an example) obtained by the foregoing calculation. Here, each unit in the control point pair group set contains a control point group composed of the three pairs of vertices of a Delaunay triangle, and each pair of control points contains the position information (including relative displacement information) of the corresponding control points of the initial frame and the i-th frame of the target image. The control point pair group set is traversed, and the affine transformation matrix of each triangular bin region is solved from the control point group composed of the three vertex pairs of its Delaunay triangle. Let T be the set of triangular bins obtained by Delaunay triangulation of the face control points, with the set size n equal to the number of triangular bins. Traversing the control point pair group set, for any triangular bin t_j in the set T, the affine transformation matrix M^i_j mapping its vertices from the initial frame to the i-th frame of the target image is solved. Then M^i_j is used to apply a dense affine transformation to the triangular bin region t_j by the following formula (3), giving the specific position of t_j in the i-th frame of the target image:

t'_j = M^i_j · t_j,  t_j ∈ T    (3)

In formula (3), t_j denotes any triangular bin region in the set T, and t'_j denotes the position of the triangular bin region t_j in the i-th frame picture of the target image after local affine transformation. When the traversal is finished, the i-th frame result of the target image (the dynamic expression image sequence) is obtained.
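For illustration, the subdivision and the per-bin transformation of formula (3) map onto scipy's Delaunay triangulation and OpenCV's affine routines. Warping the full image once per triangle, as done here for brevity, is wasteful; practical implementations warp only each triangle's bounding rectangle:

```python
import cv2
import numpy as np
from scipy.spatial import Delaunay

def warp_frame(target_img, pts0, pts_i):
    """Warp the initial frame to frame i, one Delaunay triangular bin at a time."""
    h, w = target_img.shape[:2]
    tri = Delaunay(pts0).simplices  # triangulate the initial-frame control points
    out = target_img.copy()
    for simplex in tri:
        src = pts0[simplex].astype(np.float32)   # bin t_j in the initial frame
        dst = pts_i[simplex].astype(np.float32)  # its position t'_j in frame i
        m = cv2.getAffineTransform(src, dst)     # matrix M^i_j of formula (3)
        warped = cv2.warpAffine(target_img, m, (w, h))
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillConvexPoly(mask, dst.astype(np.int32), 255)
        out[mask > 0] = warped[mask > 0]         # dense transform of the bin
    return out
```

As the text above prescribes, the triangulation is computed from the initial-frame control points, and the same vertex indices are reused for every frame so that the bins stay in correspondence across the sequence.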
Optionally, after the local affine transformation mapping the control points in the target static face image to the i-th frame picture is performed, each locally affine-transformed triangular bin region is fused to the corresponding position of the target static face image to obtain the i-th frame dynamic expression image. Specifically, a seamless cloning technique such as Poisson blending can be employed.
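OpenCV exposes seamless cloning directly; a sketch of fusing a transformed region back into the target image at the mask's centroid (the centroid choice is an assumption of the illustration):

```python
def fuse_region(region_img, target_img, region_mask):
    """Poisson-blend (seamless-clone) a region into the target image."""
    ys, xs = np.nonzero(region_mask)
    center = (int(xs.mean()), int(ys.mean()))
    return cv2.seamlessClone(region_img, target_img, region_mask,
                             center, cv2.NORMAL_CLONE)
```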
The transformation operation and the subsequent fusion operation are carried out on the target static face image by using the control point data of all frames of the source data, and the required dynamic expression image sequence can be obtained.
In addition, a face key point detection algorithm generally constrains, through its detection template, the upper-lip control points of the face to lie above the lower-lip control points. For the initial frame of the target image, the relationship between the upper- and lower-lip control points is obtained entirely by detection, so this constraint holds. However, when the control points of frames after the initial frame of the target image are calculated, the computed upper-lip control points sometimes end up below the lower-lip control points (for example, under certain expression actions) because of the intermediate conversion process. After local affine transformation, this phenomenon produces a triangular dark area in the mouth region.
In order to avoid the above situation, in an alternative embodiment, after the set of control points of each frame (taking the ith frame as an example) of the target image (dynamic expression image sequence) is calculated, the following steps may be further performed:
and judging the effectiveness of the control points of the designated area in the ith frame of the target image, and correcting the control points of the designated area according to the judgment result.
The designated areas mentioned herein are specific areas, such as the mouth area. For example, after the control point set of the ith frame of the target image is calculated, whether the upper lip control point in the mouth region is above the lower lip control point is determined. If yes, the control point is considered to be effective, and correction is not needed. If not, the control point is considered invalid, and necessary correction is carried out on the control point to enable the control point to meet normal logic, for example, the control point of the upper lip is corrected to be above the control point of the lower lip.
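A sketch of this validity check for the mouth region, again under the assumed 68-point layout (inner upper-lip indices 61-63 paired with inner lower-lip indices 67-65 are assumptions of the illustration); note that image y coordinates grow downward:

```python
UPPER_INNER = [61, 62, 63]  # assumed inner upper-lip indices
LOWER_INNER = [67, 66, 65]  # assumed inner lower-lip indices, paired columnwise

def fix_lip_control_points(pts):
    """If an upper-lip point falls below its lower-lip partner, merge them."""
    for u, l in zip(UPPER_INNER, LOWER_INNER):
        if pts[u, 1] > pts[l, 1]:  # invalid: upper lip below lower lip
            mid = 0.5 * (pts[u, 1] + pts[l, 1])
            pts[u, 1] = pts[l, 1] = mid
    return pts
```

Collapsing the two offending points to their midpoint is one possible correction satisfying the upper-lip-above-lower-lip logic; the text does not fix a particular rule.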
As those skilled in the art will readily appreciate, the source data, particularly a driving face image sequence, reveals new visual content as the expression changes, such as the teeth and tongue exposed when the mouth opens, the eyeball regions revealed when the eyes rotate, and forehead wrinkles. In this case, applying only affine transformation to the target static face image leaves this content missing in the target image.
In order to avoid the influence caused by the content missing in the target image, in an optional embodiment of the present invention, after the dynamic expression image sequence is obtained through transformation, the hidden area in each frame image of the obtained dynamic expression image sequence may be subjected to fusion of preset area materials. The hidden area referred to herein may include at least one of a mouth area, an eye area, a forehead area, and the like. Accordingly, the preset area material may include at least one of teeth and/or tongue, eyeball, wrinkles, etc.
For the application scene of the facial expression actions, each action has a fixed motion mode, and the content of the hidden area has high reusability. In practical applications, the preset region material may include a pre-made motion model material, or a region material cut from a driving face image (e.g., a face image captured in real time by webcam).
Next, a procedure of material fusion of the hidden area will be described with reference to fig. 2, taking the hidden area as the mouth area as an example.
As described above, the face in the target static face image is assumed to have a frontal natural expression, in which case the mouth is closed or approximately closed; therefore, the internal region of the mouth needs to be treated as a hidden region only when the mouth opens by a certain amount.
Fig. 2 is a flowchart illustrating the steps of performing material fusion operation on a hidden area when the hidden area is a mouth area according to an embodiment of the present invention. As shown in fig. 2, when the hidden area is a mouth area, the operation of fusing the preset area material to the hidden area in each frame image of the obtained dynamic expression image sequence includes the following steps:
in step S202, the area A of the mouth region in the initial frame of the sequence of dynamic expression images (referred to as initial area) is calculated0And according to the initial area A0Setting a trigger threshold At。
In this embodiment, the area of the mouth region is the area surrounded by the mark points inside the mouth.
In practice, the trigger threshold may be set according to the desired effect; for example, the trigger threshold At may be set in the range of 2-4 times the initial area A0.
Step S204, calculating the area A of the mouth region in each frame after the initial frame of the dynamic expression image sequence;
step S206, A and AtComparing;
step S208, if A>AtAcquiring the material image of the mouth region, and displaying the dynamic expression image sequenceThe mouth regions in the frame of the column fuse mouth region material images.
The mouth region material in this embodiment may be teeth and/or tongue.
The fusion can be performed by using a seamless cloning technique, such as Poisson Image Editing (Poisson Image Editing).
Further preferably, to make the change in the mouth region transition more naturally before the mouth opens to the trigger threshold, after comparing A with At, if At > A > A0, an Alpha fusion operation is performed on the mouth region in that frame of the dynamic expression image sequence, realizing a gradual blending transition.
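A sketch of the trigger logic of steps S202-S208 together with this gradual Alpha transition, under the same assumed landmark layout; the mouth area is computed with the shoelace formula over the ordered inner-mouth points, and the material image is assumed to be pre-aligned to the frame:

```python
INNER_MOUTH = list(range(60, 68))  # assumed inner-mouth landmark indices

def mouth_area(pts):
    """Shoelace formula over the ordered inner-mouth control points."""
    m = pts[INNER_MOUTH]
    x, y = m[:, 0], m[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def blend_mouth(frame, material, mouth_mask, a, a0, a_t):
    if a > a_t:                       # fully open: Poisson-fuse the material
        return fuse_region(material, frame, mouth_mask)
    if a_t > a > a0:                  # opening: gradual Alpha transition
        weight = (a - a0) / (a_t - a0)
        mixed = cv2.addWeighted(material, weight, frame, 1.0 - weight, 0)
        out = frame.copy()
        out[mouth_mask > 0] = mixed[mouth_mask > 0]
        return out
    return frame
```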
In an alternative embodiment of the invention, the mouth region material images may be acquired by:
first, a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from a target static face image are acquired. The driving face image here may be an image acquired by webcam, and the dynamic expression image of the current frame is a Warped target image generated by the driving of the driving face image.
Then, the driving face image is aligned with the mouth region in the dynamic expression image of the current frame, so that the influence caused by the position mismatch of the mouth regions is eliminated.
And finally, cutting the mouth region in the driving face image as a mouth region material image.
Further, the step of aligning the driving face image with the mouth region in the dynamic expression image of the current frame may be further implemented as:
the first step is to select two mouth corners and eyebrow positions in the human face as three control points, and respectively acquire the positions of the three control points in the driving human face image and the dynamic expression image of the current frame.
And secondly, solving an alignment transformation matrix according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame.
And thirdly, carrying out affine transformation on the driving face image by using the alignment transformation matrix.
Since the mouth positions in the driving face image and in the dynamic expression image of the current frame are known, a single affine transformation suffices to align the mouth regions.
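A minimal sketch of this three-point alignment and the subsequent cropping, assuming OpenCV and a 68-point landmark layout (mouth corners at indices 48 and 54, an eyebrow point at 19 — all assumptions of the sketch):

```python
import cv2
import numpy as np

MOUTH_L, MOUTH_R, BROW = 48, 54, 19  # landmark indices assumed (68-point layout)

def align_and_crop_mouth(drive_img, drive_pts, warped_pts, mouth_box):
    # one exact 2x3 affine transform solved from the three control-point pairs
    src = np.float32(np.asarray(drive_pts)[[MOUTH_L, MOUTH_R, BROW]])
    dst = np.float32(np.asarray(warped_pts)[[MOUTH_L, MOUTH_R, BROW]])
    M = cv2.getAffineTransform(src, dst)
    h, w = drive_img.shape[:2]
    aligned = cv2.warpAffine(drive_img, M, (w, h))
    x, y, bw, bh = mouth_box  # mouth bounding box in the warped frame's coordinates
    return aligned[y:y + bh, x:x + bw]  # cropped mouth material image
```

Because cv2.getAffineTransform is exact for three point pairs, this single warp maps the driving mouth onto the warped target's mouth without any iterative fitting.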
By fusing material into hidden areas such as the mouth area, the artifacts caused by missing content in the generated dynamic expression images are avoided, and the generated dynamic effect is more natural and attractive.
In an optional embodiment of the present invention, after the required dynamic expression image sequence is obtained, a dynamic image or a video may further be generated from it at a specified frame rate, extending applications such as customized expression package creation.
Having introduced the various implementations of each stage of the embodiment shown in fig. 1, the implementation of the method for generating a dynamic expression image sequence from a static face image according to the present invention will now be described in detail through a specific embodiment.
Fig. 3 is a flowchart illustrating a method for generating a sequence of dynamic expression images using static face images according to an embodiment of the present invention. Referring to fig. 3, the method may include at least the following steps S302 to S322.
Step S302, a driving face image sequence is acquired, and the face identification points in each frame are detected as control points.
In this step, the driving face image sequence consists of several consecutive frames of facial expression images captured in real time by a webcam, and the identification points are detected by a face key point detection algorithm.
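One possible realization of this detection step uses dlib's 68-point shape predictor; the detector choice and the model file path are assumptions of the sketch, since the embodiment only requires some face key point detection algorithm:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# path to the pretrained 68-point model is an assumption of this sketch
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_control_points(frame_bgr):
    """Return the face identification points of one frame as (x, y) tuples."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # no face found in this frame
    shape = predictor(gray, faces[0])
    return [(p.x, p.y) for p in shape.parts()]
```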
Step S304, a target static face image is obtained, and face identification points in the target static face image are detected as control points, wherein the face identification points in the target static face image correspond to the face identification points in each frame of the driving face image sequence one by one.
Step S306, the initial frame of the driving face image sequence is aligned to the target static face image.
The specific alignment operation in this step is as described above, and is not described herein again.
Step S308, the motion trajectory data of the control points of the driving face image sequence is calculated to obtain the position information of the control points in each frame.
In step S310, a control point set of the ith frame of the dynamic expression image sequence is calculated.
In this step, the control point set of the i-th frame of the dynamic expression image sequence is calculated from the control point set of the target static face image obtained by the preceding detection, together with the control point sets of the initial frame and the i-th frame of the driving face image sequence; the specific solving process is as described above and is not repeated here.
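The specific solving process is given earlier in the document; as one plausible realization (an assumption of this sketch, not the embodiment's exact formula), the per-point driving displacement can simply be transferred onto the target control points:

```python
import numpy as np

def ith_frame_control_points(target_pts, drive_pts0, drive_pts_i):
    # displacement transfer: S_i = S_target + (D_i - D_0), an assumed realization
    return np.asarray(target_pts, float) + (
        np.asarray(drive_pts_i, float) - np.asarray(drive_pts0, float))
```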
Step S312, Delaunay triangulation is performed on the control points in the target static face image to generate triangular surface elements, and a control point pair group set for local affine transformation is generated from the control point set of each frame of the dynamic expression image sequence based on the triangular surface elements.
In this step, each unit in the generated control point pair group set includes a control point group composed of the three pairs of vertices of a triangular surface element, and each pair of vertices includes the position information of the corresponding control point in the initial frame and in the i-th frame of the dynamic expression image sequence.
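A sketch of step S312 using scipy's Delaunay triangulation (the library choice and helper name are assumptions):

```python
import numpy as np
from scipy.spatial import Delaunay

def control_point_pair_groups(pts_initial, pts_i):
    """Triangulate the initial control points once, then pair each triangle's
    three vertices between the initial frame and the i-th frame."""
    tri = Delaunay(pts_initial)       # triangular surface elements on the target layout
    groups = []
    for a, b, c in tri.simplices:     # vertex index triples, one per triangle
        src = np.float32([pts_initial[a], pts_initial[b], pts_initial[c]])
        dst = np.float32([pts_i[a], pts_i[b], pts_i[c]])
        groups.append((src, dst))
    return groups
```

Triangulating once on the initial layout and reusing the vertex indices for every frame keeps the surface element topology stable across the whole sequence.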
In step S314, a local affine transformation matrix of the dynamic expression image sequence is calculated frame by frame.
Specifically, for each frame in the dynamic expression image sequence, the control point pair group set is traversed, and the affine transformation matrix of the triangular surface element region is solved by using the control point pair group of each triangular surface element.
In step S316, local affine transformation is performed frame by frame using the local affine transformation matrix.
Specifically, for each frame in the dynamic expression image sequence, local affine transformation is performed on each triangular surface element region by using the affine transformation matrix of the triangular surface element region.
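The per-element solve-and-warp of steps S314–S316 can be sketched as follows with OpenCV (helper name and border handling are assumptions of the sketch):

```python
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, src_tri, dst_tri):
    """Warp one triangular surface element of src_img into dst_img in place.
    src_tri, dst_tri: 3x2 float32 arrays of corresponding vertices."""
    x1, y1, w1, h1 = cv2.boundingRect(src_tri)
    x2, y2, w2, h2 = cv2.boundingRect(dst_tri)
    src_off = src_tri - np.float32([x1, y1])  # vertices relative to each patch
    dst_off = dst_tri - np.float32([x2, y2])
    patch = src_img[y1:y1 + h1, x1:x1 + w1]
    # solve the local affine transformation matrix of this element from its vertex pairs
    M = cv2.getAffineTransform(src_off, dst_off)
    warped = cv2.warpAffine(patch, M, (w2, h2), flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)
    # rasterize the destination triangle as a mask and paste only inside it
    mask = np.zeros((h2, w2), dtype=np.uint8)
    cv2.fillConvexPoly(mask, np.int32(dst_off), 255)
    roi = dst_img[y2:y2 + h2, x2:x2 + w2]
    roi[mask > 0] = warped[mask > 0]
```

Restricting each warp to the bounding rectangle of its triangle keeps the per-frame cost proportional to the face area rather than the full image.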
In step S318, the transformed picture content is merged.
In this step, for each frame in the dynamic expression image sequence, each locally affine-transformed triangular surface element region is fused to the corresponding position of the target static face image by a Seamless Clone operation, obtaining the dynamic expression image of that frame.
Step S320, fusing the preset area materials to the hidden area in each frame of image of the obtained dynamic expression image sequence.
In this step, the hidden area is a mouth area, and the material fusion operation is as described above, which is not described herein again.
In step S322, a dynamic image or video is generated at a specified frame rate using the dynamic expression image sequence.
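A minimal export sketch using OpenCV's VideoWriter (the file name and codec are assumptions; a GIF writer such as imageio could equally be used for animated expression packages):

```python
import cv2

def export_video(frames, path="expression.mp4", fps=25):
    """Write a list of same-size BGR frames out as a video at the given frame rate."""
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for f in frames:
        writer.write(f)
    writer.release()
```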
The embodiment of the invention realizes the dynamic conversion of a single static picture, and the generated dynamic effect is natural and attractive, making the method well suited to producing customized facial expression packages.
Based on the same inventive concept, the embodiment of the present invention further provides a device for generating a dynamic expression image sequence by using a static face image, which is used to support the method for generating a dynamic expression image sequence by using a static face image provided in any one of the above embodiments or a combination thereof. Fig. 4 is a schematic structural diagram of an apparatus 400 for generating a sequence of dynamic expression images by using static face images according to an embodiment of the present invention. Referring to fig. 4, the apparatus 400 may include at least: a static image acquisition module 410, a face identification point detection module 420, a drive data acquisition module 430, and a local affine transformation module 440.
The functions of the components or devices of the apparatus 400 for generating a sequence of dynamic expression images using static face images and the connection relationship between the components will now be described:
A static image acquisition module 410, adapted to obtain the target static face image.
The face identification point detection module 420 is connected to the static image acquisition module 410, and is adapted to detect a face identification point in the target static face image as a control point through a face key point detection algorithm.
A driving data acquisition module 430, adapted to obtain the expression action driving data corresponding to the control points.
A local affine transformation module 440, connected to the face identification point detection module 420 and the driving data acquisition module 430 respectively, and adapted to perform local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain the dynamic expression image sequence.
In an alternative embodiment of the present invention, the driving data obtaining module 430 is further adapted to:
simulate the facial muscle movement of a facial expression action to obtain the motion trajectory data of the control points over several consecutive frames as the expression action driving data.
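As a purely illustrative sketch of such simulated driving data (the actual muscle-movement model is not specified here; the landmark indices, amplitude, and easing curve are all assumptions):

```python
import numpy as np

def simulated_smile_trajectory(base_pts, n_frames=30, amp=8.0):
    """Yield per-frame control point sets in which the mouth corners
    (assumed indices 48 and 54) ease outward and upward, mimicking a smile."""
    frames = []
    for k in range(n_frames):
        t = 0.5 - 0.5 * np.cos(np.pi * k / max(n_frames - 1, 1))  # smooth ease in/out
        pts = np.asarray(base_pts, float).copy()
        pts[48] += (-amp * t, -amp * t)  # left corner: outward-left and up (y grows downward)
        pts[54] += (amp * t, -amp * t)   # right corner: outward-right and up
        frames.append(pts)
    return frames
```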
In another alternative embodiment of the present invention, the driving data obtaining module 430 is further adapted to:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through a face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control point as expression action driving data.
Further, the driving data obtaining module 430 is further adapted to:
before calculating the motion trajectory data of the control points, perform face alignment of the initial frame of the driving face image sequence to the target static face image.
Still further, the drive data acquisition module 430 is further adapted to:
take the three control points at the outer corners of the eyes and the nose tip in the initial frame of the driving face image sequence, together with the corresponding face identification points of the target static face image, as conversion point pairs, and solve for the alignment mapping matrix;
calculate the motion displacement of each control point in every frame after the initial frame of the driving face image sequence relative to the corresponding control point in the initial frame;
and correct the motion displacement of the control points with the alignment mapping matrix to obtain the motion trajectory data of the control points (see the sketch below).
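A hedged sketch of this alignment-and-correction pipeline; the landmark indices for the outer eye corners and the nose tip are assumptions:

```python
import cv2
import numpy as np

ALIGN_IDX = [36, 45, 30]  # left outer eye corner, right outer eye corner, nose tip (assumed)

def alignment_matrix(drive_pts0, target_pts):
    """Solve the 2x3 alignment mapping matrix from the three conversion point pairs."""
    src = np.float32(np.asarray(drive_pts0)[ALIGN_IDX])
    dst = np.float32(np.asarray(target_pts)[ALIGN_IDX])
    return cv2.getAffineTransform(src, dst)

def corrected_displacements(M, drive_pts0, drive_pts_i):
    """Map driving displacements into the target image's coordinate frame.
    Only the linear part of M acts on displacement vectors, since the
    translation cancels in M(x_i) - M(x_0)."""
    d = np.float32(drive_pts_i) - np.float32(drive_pts0)
    return d @ M[:, :2].T
```

Applying only the linear 2x2 part to displacements is exact here: for an affine map M, the difference of two mapped points depends only on the rotation/scale/shear, not on the translation.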
In an optional embodiment of the invention, the local affine transformation module 440 is further adapted to:
calculating to obtain a control point set of an ith frame of a dynamic expression image sequence by utilizing a control point set of a target static face image, a control point set of an initial frame of expression action driving data and a control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
Further, the local affine transformation module 440 is further adapted to:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using a control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of the triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
Preferably, the triangulation is a Delaunay triangulation. Accordingly, the local affine transformation is a dense affine transformation.
In an optional embodiment of the invention, the local affine transformation module 440 is further adapted to:
after the control point set of the ith frame of the dynamic expression image sequence is obtained through calculation, the effectiveness of the control points of the designated area in the ith frame of the dynamic expression image sequence is judged, and the control points of the designated area are corrected according to the judgment result.
In an optional embodiment of the invention, the local affine transformation module 440 is further adapted to:
after local affine transformation is carried out on control points in the target static face image frame by frame, all triangular surface element areas after the local affine transformation are fused to corresponding positions of the target static face image, and the ith frame of dynamic expression image is obtained.
In an alternative embodiment of the present invention, as shown in fig. 5, the apparatus 400 may further include a hidden area processing module 450. The hidden area processing module 450 may be connected to the local affine transformation module 440, and is adapted to perform fusion of preset area materials on a hidden area in each frame of image of the obtained dynamic expression image sequence.
In an optional embodiment of the invention, the hidden area comprises at least one of:
mouth area, eye area, forehead area.
Correspondingly, the preset area material comprises at least one of the following:
teeth and/or tongue, eye balls, wrinkles.
In an optional embodiment of the invention, the hidden area processing module 450 is further adapted to:
when the hidden area is the mouth area, calculating the initial area A0 of the mouth region in the initial frame of the dynamic expression image sequence, and setting a trigger threshold At according to the initial area;
Calculating the area A of a mouth region in each frame after the initial frame of the dynamic expression image sequence;
comparing A with At;
if A > At, obtaining a mouth region material image, and fusing the mouth region material image into the mouth region in that frame of the dynamic expression image sequence.
Further, the hidden area processing module 450 is further adapted to:
after comparing A with At, if At > A > A0, performing an Alpha fusion operation on the mouth region in that frame of the dynamic expression image sequence.
In an optional embodiment of the invention, the hidden area processing module 450 is further adapted to:
acquiring a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from a target static face image;
aligning the driving face image with the mouth region in the dynamic expression image of the current frame;
and cutting the mouth region in the driving face image as a mouth region material image.
Further, the hidden area processing module 450 is further adapted to:
selecting two mouth corners and eyebrow positions in the human face as three control points, and respectively acquiring the positions of the three control points in the driving human face image and the dynamic expression image of the current frame;
according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame, solving an alignment transformation matrix;
and transforming the driving face image by using an alignment transformation matrix so as to realize the alignment of the mouth region.
In an alternative embodiment of the present invention, still referring to fig. 5, the apparatus 400 may further include a motion picture/video composition module 460. The motion picture/video synthesis module 460 may be connected to the hidden area processing module 450, and is adapted to generate a motion picture or video at a specified frame rate by using the sequence of motion picture images. It is understood that the motion picture/video synthesis module 460 may be connected to the local affine transformation module 440 without the hidden region processing module 450.
Based on the same inventive concept, the embodiment of the invention also provides a computer storage medium. The computer storage medium stores computer program code which, when run on a computing device, causes the computing device to execute a method of generating a sequence of dynamic expression images from static face images according to any one or combination of the above embodiments.
Based on the same inventive concept, an embodiment of the invention also provides a computing device. The computing device may include:
a processor; and
a memory storing computer program code;
the computer program code, when executed by a processor, causes the computing device to perform a method for generating a sequence of dynamic expression images from static face images according to any one or combination of the above embodiments.
According to any one or a combination of multiple optional embodiments, the embodiment of the present invention can achieve the following advantages:
the method and device for generating a dynamic expression image sequence from a static face image provided by the embodiments of the invention first acquire a target static face image and detect face identification points in it as control points through a face key point detection algorithm. Expression action driving data corresponding to the control points is then acquired, and finally the control points in the target static face image are transformed frame by frame through local affine transformation driven by that data, yielding a dynamic expression image sequence. This realizes the dynamic conversion of a single static image and is applicable to playing expressions from a single static image, generating expression packages, and similar applications.
Furthermore, the control point pair group set for the local affine transformation is generated by triangulating the discrete control points into triangular surface elements, and the affine transformation matrix of each corresponding triangular surface element region is solved, so that the dynamic effect after transformation by these matrices is more natural and free of distortion.
Furthermore, by fusing material into hidden areas such as the mouth area once the mouth opens, the artifacts caused by missing content in the generated dynamic expression images are avoided, and the generated dynamic effect is more natural and attractive.
It is clear to those skilled in the art that the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that: the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computing device (e.g., a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention when the instructions are executed. And the aforementioned storage medium includes: u disk, removable hard disk, Read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disk, and other various media capable of storing program code.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.
According to an aspect of the embodiments of the present invention, A1. A method for generating a dynamic expression image sequence by using a static face image is provided, including:
acquiring a target static face image;
detecting face identification points in the target static face image by a face key point detection algorithm to be used as control points;
acquiring expression action driving data corresponding to the control point;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
A2. The method according to a1, wherein the acquiring of the expression motion driving data corresponding to the control point comprises:
and simulating according to the facial muscle movement mode of the facial expression action to obtain the motion trail data of the control points of a plurality of continuous frames as the expression action driving data.
A3. The method according to a1, wherein the acquiring of the expression motion driving data corresponding to the control point comprises:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through the face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control points as the expression action driving data.
A4. The method according to a3, wherein before calculating the motion trajectory data of the control point, the method further comprises:
and carrying out face alignment on the initial frame of the driving face image sequence to the target static face image.
A5. The method of a4, wherein face-aligning an initial frame of the sequence of driver face images to the target static face image comprises:
taking three control points of the outer side of the eyes and the position of the nose tip in the initial frame of the driving face image sequence and the face identification points of the target static face image as conversion point pairs, and solving to obtain an alignment mapping matrix;
calculating motion trajectory data of the control points, including:
calculating the motion displacement of the control point in each frame after the initial frame of the driving human face image sequence relative to the control point in the initial frame;
and correcting the motion displacement of the control point by using the alignment mapping matrix to obtain the motion trail data of the control point.
A6. The method of any of a1-a5, wherein performing a local affine transformation of control points in the target static face image frame by frame according to the expressive motion driving data comprises:
calculating to obtain a control point set of an ith frame of the dynamic expression image sequence by using the control point set of the target static face image, the control point set of the initial frame of the expression action driving data and the control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
A7. The method according to a6, wherein generating a set of control point pair groups for local affine transformation using the set of control points for each frame of the dynamic expression image sequence comprises:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of a triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
performing local affine transformation on the control points in the target static face image frame by frame according to the control point group set, including:
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
A8. The method of A7, wherein the triangulation is a Delaunay triangulation;
the local affine transformation is a dense affine transformation.
A9. The method according to a6, wherein after the calculating the control point set of the ith frame of the dynamic expression image sequence, the method further comprises:
and judging the effectiveness of the control points of the designated area in the ith frame of the dynamic expression image sequence, and correcting the control points of the designated area according to the judgment result.
A10. The method according to a7, wherein after performing local affine transformation on control points in the target static face image frame by frame, the method further comprises:
and fusing each triangular surface element region after local affine transformation to the corresponding position of the target static face image to obtain the ith frame of dynamic expression image.
A11. The method of any one of A3-a10, further comprising:
and fusing preset area materials for the hidden area in each frame of image of the obtained dynamic expression image sequence.
A12. The method of a11, wherein the hidden area comprises at least one of:
mouth region, eye region, forehead region;
the preset area material comprises at least one of the following materials:
teeth and/or tongue, eye balls, wrinkles.
A13. The method according to a12, wherein, when the hidden area is a mouth area, fusing preset area materials to the hidden area in each frame image of the obtained dynamic expression image sequence includes:
calculating an initial area A0 of the mouth region in the initial frame of the dynamic expression image sequence, and setting a trigger threshold At according to the initial area;
Calculating the area A of a mouth region in each frame after the initial frame of the dynamic expression image sequence;
comparing A with At;
if A > At, acquiring a mouth region material image, and fusing the mouth region material image into the mouth region of that frame of the dynamic expression image sequence.
A14. The method of A13, wherein after comparing A with At, the method further comprises:
if At > A > A0, performing an Alpha fusion operation on the mouth region in that frame of the dynamic expression image sequence.
A15. The method of a13 or a14, wherein acquiring mouth region material images comprises:
acquiring a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from the target static face image;
aligning the driving face image with the mouth region in the dynamic expression image of the current frame;
and cutting a mouth region in the driving face image as the mouth region material image.
A16. The method according to a15, wherein aligning the driving face image with the mouth region in the dynamic expression image of the current frame comprises:
selecting two mouth corners and eyebrow positions in the human face as three control points, and respectively acquiring the positions of the three control points in the driving human face image and the dynamic expression image of the current frame;
according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame, solving an alignment transformation matrix;
and transforming the driving face image by using the alignment transformation matrix.
A17. The method of any one of a1-a16, further comprising:
and generating dynamic images or videos at a specified frame rate by using the dynamic expression image sequence.
According to another aspect of the embodiments of the present invention, there is also provided B18. A device for generating a dynamic expression image sequence by using a static face image, including:
the static image acquisition module is suitable for acquiring a target static face image;
the face identification point detection module is suitable for detecting the face identification points in the target static face image through a face key point detection algorithm to serve as control points;
the driving data acquisition module is suitable for acquiring expression action driving data corresponding to the control point; and
and the local affine transformation module is suitable for carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
B19. The apparatus of B18, wherein the drive data acquisition module is further adapted to:
and simulating according to the facial muscle movement mode of the facial expression action to obtain the motion trail data of the control points of a plurality of continuous frames as the expression action driving data.
B20. The apparatus of B18, wherein the drive data acquisition module is further adapted to:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through the face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control points as the expression action driving data.
B21. The apparatus of B20, wherein the drive data acquisition module is further adapted to:
and before calculating the motion trajectory data of the control point, performing face alignment on the initial frame of the driving face image sequence to the target static face image.
B22. The apparatus of B21, wherein the drive data acquisition module is further adapted to:
taking three control points of the outer side of the eyes and the position of the nose tip in the initial frame of the driving face image sequence and the face identification points of the target static face image as conversion point pairs, and solving to obtain an alignment mapping matrix;
calculating the motion displacement of the control point in each frame after the initial frame of the driving human face image sequence relative to the control point in the initial frame;
and correcting the motion displacement of the control point by using the alignment mapping matrix to obtain the motion trail data of the control point.
B23. The apparatus of any one of B18-B22, wherein the local affine transformation module is further adapted to:
calculating to obtain a control point set of an ith frame of the dynamic expression image sequence by using the control point set of the target static face image, the control point set of the initial frame of the expression action driving data and the control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
B24. The apparatus of B23, wherein the local affine transformation module is further adapted to:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of a triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
B25. The apparatus of B24, wherein the triangulation is a Delaunay triangulation;
the local affine transformation is a dense affine transformation.
B26. The apparatus of B23, wherein the local affine transformation module is further adapted to:
after the control point set of the ith frame of the dynamic expression image sequence is obtained through calculation, the effectiveness of the control points of the designated area in the ith frame of the dynamic expression image sequence is judged, and the control points of the designated area are corrected according to the judgment result.
B27. The apparatus of B24, wherein the local affine transformation module is further adapted to:
and after carrying out local affine transformation on the control points in the target static face image frame by frame, fusing each triangular surface element region subjected to local affine transformation to the corresponding position of the target static face image to obtain the ith frame of dynamic expression image.
B28. The apparatus of any one of B20-B27, further comprising:
and the hidden area processing module is suitable for fusing preset area materials in the hidden area in each frame of image of the obtained dynamic expression image sequence.
B29. The apparatus of B28, wherein the hidden area comprises at least one of:
mouth region, eye region, forehead region;
the preset area material comprises at least one of the following materials:
teeth and/or tongue, eye balls, wrinkles.
B30. The apparatus of B29, wherein the hidden area processing module is further adapted to:
when the hidden area is the mouth area, calculating the initial area A0 of the mouth region in the initial frame of the dynamic expression image sequence, and setting a trigger threshold At according to the initial area;
Calculating the area A of a mouth region in each frame after the initial frame of the dynamic expression image sequence;
comparing A with At;
if A > At, obtaining a mouth area material image, and fusing the mouth area material image into the mouth area of that frame of the dynamic expression image sequence.
B31. The apparatus of B30, wherein the hidden area processing module is further adapted to:
after comparing A with At, if At > A > A0, performing an Alpha fusion operation on the mouth region in that frame of the dynamic expression image sequence.
B32. The apparatus of B30 or B31, wherein the hidden area processing module is further adapted to:
acquiring a dynamic expression image of a current frame and a driving face image for generating the dynamic expression image of the current frame from the target static face image;
aligning the driving face image with the mouth region in the dynamic expression image of the current frame;
and cutting a mouth region in the driving face image as the mouth region material image.
B33. The apparatus of B32, wherein the hidden area processing module is further adapted to:
selecting two mouth corners and eyebrow positions in the human face as three control points, and respectively acquiring the positions of the three control points in the driving human face image and the dynamic expression image of the current frame;
according to the positions of the three pairs of control points in the driving face image and the dynamic expression image of the current frame, solving an alignment transformation matrix;
and transforming the driving face image by using the alignment transformation matrix.
B34. The apparatus of any one of B18-B33, further comprising:
and the dynamic image/video synthesis module is suitable for generating dynamic images or videos at a specified frame rate by utilizing the dynamic expression image sequence.
According to yet another aspect of the embodiments of the present invention, there is also provided C35. A computer storage medium storing computer program code which, when run on a computing device, causes the computing device to execute the method of generating a dynamic expression image sequence from a static face image according to any one of A1-A17.
According to yet another aspect of the embodiments of the present invention, there is also provided C36. A computing device, including:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the method of generating a dynamic expression image sequence from a static face image according to any one of A1-A17.
Claims (10)
1. A method for generating a sequence of dynamic expression images using static face images, comprising:
acquiring a target static face image;
detecting face identification points in the target static face image by a face key point detection algorithm to be used as control points;
acquiring expression action driving data corresponding to the control point;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
2. The method of claim 1, wherein obtaining expression motion actuation data corresponding to the control point comprises:
and simulating according to the facial muscle movement mode of the facial expression action to obtain the motion trail data of the control points of a plurality of continuous frames as the expression action driving data.
3. The method of claim 1, wherein obtaining expression motion actuation data corresponding to the control point comprises:
acquiring facial expression images of a plurality of continuous frames as a driving facial image sequence;
detecting face identification points in the face expression image as control points through the face key point detection algorithm, wherein the control points in the face expression image correspond to the control points in the target static face image one to one;
and calculating the motion trail data of the control points as the expression action driving data.
4. The method of claim 3, wherein prior to calculating the motion trajectory data for the control point, further comprising:
and carrying out face alignment on the initial frame of the driving face image sequence to the target static face image.
5. The method of claim 4, wherein face aligning an initial frame of the sequence of driver face images to the target static face image comprises:
taking three control points of the outer side of the eyes and the position of the nose tip in the initial frame of the driving face image sequence and the face identification points of the target static face image as conversion point pairs, and solving to obtain an alignment mapping matrix;
calculating motion trajectory data of the control points, including:
calculating the motion displacement of the control point in each frame after the initial frame of the driving human face image sequence relative to the control point in the initial frame;
and correcting the motion displacement of the control point by using the alignment mapping matrix to obtain the motion trail data of the control point.
6. The method of any of claims 1-5, wherein performing a local affine transformation of control points in the target static face image frame by frame according to the expressive motion driving data comprises:
calculating to obtain a control point set of an ith frame of the dynamic expression image sequence by using the control point set of the target static face image, the control point set of the initial frame of the expression action driving data and the control point set of the ith frame, wherein the target static face image is used as the initial frame of the dynamic expression image sequence, and i is a natural number greater than or equal to 1;
generating a control point pair set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence, wherein each unit in the control point pair set comprises a control point group consisting of at least three pairs of control points, and each pair of control points comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
and carrying out local affine transformation on the control points in the target static face image frame by frame according to the control point group set.
7. The method of claim 6, wherein generating a set of control point pair groups for a local affine transformation using the set of control points for each frame of the sequence of dynamic expression images comprises:
triangulating control points in the target static face image to generate a triangular surface element;
generating a control point pair group set for local affine transformation by using the control point set of each frame of the dynamic expression image sequence based on the triangular surface element, wherein each unit in the control point pair group set comprises a control point group consisting of three pairs of vertexes of a triangular surface element, and each pair of vertexes comprises position information of corresponding control points of an initial frame and an ith frame of the dynamic expression image sequence;
performing local affine transformation on the control points in the target static face image frame by frame according to the control point group set, including:
traversing the control point pair group set, solving an affine transformation matrix of the triangular surface element region by using the control point pair group of each triangular surface element, and performing local affine transformation on the triangular surface element region by using the affine transformation matrix.
8. An apparatus for generating a sequence of dynamic expression images using static face images, comprising:
the static image acquisition module is suitable for acquiring a target static face image;
the face identification point detection module is suitable for detecting the face identification points in the target static face image through a face key point detection algorithm to serve as control points;
the driving data acquisition module is suitable for acquiring expression action driving data corresponding to the control point; and
and the local affine transformation module is suitable for carrying out local affine transformation on the control points in the target static face image frame by frame according to the expression action driving data to obtain a dynamic expression image sequence.
9. A computer storage medium having computer program code stored thereon which, when run on a computing device, causes the computing device to perform a method of generating a sequence of dynamic expression images from static face images according to any of claims 1-7.
10. A computing device, comprising:
a processor; and
a memory storing computer program code;
the computer program code, when executed by the processor, causes the computing device to perform the method of generating a sequence of dynamic expression images from static face images according to any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910186356.1A CN111696185A (en) | 2019-03-12 | 2019-03-12 | Method and device for generating dynamic expression image sequence by using static face image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111696185A (en) | 2020-09-22 |
Family
ID=72474924
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910186356.1A Pending CN111696185A (en) | 2019-03-12 | 2019-03-12 | Method and device for generating dynamic expression image sequence by using static face image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111696185A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488977A (en) * | 2020-12-22 | 2021-03-12 | 北京达佳互联信息技术有限公司 | Image processing method and device, electronic equipment and storage medium |
CN112488977B (en) * | 2020-12-22 | 2024-06-11 | 北京达佳互联信息技术有限公司 | Image processing method and device, electronic equipment and storage medium |
CN112541477A (en) * | 2020-12-24 | 2021-03-23 | 北京百度网讯科技有限公司 | Expression package generation method and device, electronic equipment and storage medium |
US11875601B2 (en) | 2020-12-24 | 2024-01-16 | Beijing Baidu Netcom Science and Technology Co., Ltd | Meme generation method, electronic device and storage medium |
CN112541477B (en) * | 2020-12-24 | 2024-05-31 | 北京百度网讯科技有限公司 | Expression pack generation method and device, electronic equipment and storage medium |
CN112766215A (en) * | 2021-01-29 | 2021-05-07 | 北京字跳网络技术有限公司 | Face fusion method and device, electronic equipment and storage medium |
CN113688753A (en) * | 2021-08-30 | 2021-11-23 | 深圳万兴软件有限公司 | Static face dynamic method, system, computer equipment and readable storage medium |
CN113688753B (en) * | 2021-08-30 | 2023-09-29 | 深圳万兴软件有限公司 | Static face dynamic method, system, computer equipment and readable storage medium |
CN113961746A (en) * | 2021-09-29 | 2022-01-21 | 北京百度网讯科技有限公司 | Video generation method and device, electronic equipment and readable storage medium |
CN113961746B (en) * | 2021-09-29 | 2023-11-21 | 北京百度网讯科技有限公司 | Video generation method, device, electronic equipment and readable storage medium |
CN113963091A (en) * | 2021-10-25 | 2022-01-21 | 深圳传音控股股份有限公司 | Image processing method, mobile terminal and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |