CN117974852A - Intelligent binding method and device for facial images and computer equipment - Google Patents
Intelligent binding method and device for facial images and computer equipment
- Publication number
- CN117974852A CN117974852A CN202311543676.0A CN202311543676A CN117974852A CN 117974852 A CN117974852 A CN 117974852A CN 202311543676 A CN202311543676 A CN 202311543676A CN 117974852 A CN117974852 A CN 117974852A
- Authority
- CN
- China
- Prior art keywords
- facial
- image
- time sequence
- face
- binding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6009—Methods for processing data by generating or executing the game program for importing or creating game content, e.g. authoring tools during game development, adapting content to different platforms, use of a scripting language to create content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/04—Indexing scheme for image data processing or generation, in general involving 3D image data
Abstract
The invention relates to the technical field of image processing, and in particular to an intelligent binding method and device for facial images and to computer equipment. The method comprises the following steps: acquiring a face video; obtaining a group of second facial images from a group of first facial images through a time sequence prediction network; performing feature extraction on the first facial image and the second facial image at each time sequence through a feature extraction network, correspondingly obtaining the binding facial features at each time sequence; performing facial image analysis on the binding facial features at each time sequence through a facial image restoration model to obtain a third facial image; and calculating the facial image binding parameters from the third facial image at each time sequence through a parameter calculation network to obtain the facial image binding parameters at each time sequence. The invention enables facial binding to balance time sequence fluency with spatial authenticity and improves the facial binding effect.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to an intelligent binding method and device for facial images and to computer equipment.
Background
Facial binding (rigging) is an important step in game and animation production: it gives a game or animation character facial expressions. For example, in the tactics mobile games currently on the market, character expressions are usually made by building many skeleton points in 3ds Max or Maya, skinning them, and then changing the appearance of the face through bone scaling, rotation and displacement.
However, when expressions are made in 3ds Max or Maya, the result depends on manual control and on the producer's subjective judgment during early design, so the actual effect of facial expression binding is difficult to grasp accurately, which directly reduces its authenticity. It is also difficult to adjust continuous expressions in a game or animation accurately at a later stage so that they have good temporal continuity; as a result, the facial expression binding in game animation appears jerky and the effect is poor. If each expression were instead adjusted into place for temporal continuity, the face would have to be adjusted point by point, frame by frame, which wastes a great deal of time and still yields a poor effect.
Therefore, in the prior art, the producer's subjective judgment reduces the authenticity of facial expression binding, and continuous expressions in a game or animation are difficult to adjust accurately, so facial expression binding in game animation appears jerky.
Disclosure of Invention
The invention aims to provide an intelligent binding method for facial images, to solve the technical problems in the prior art that the authenticity of facial expression binding is reduced and that facial expression binding in game animation appears jerky.
In order to solve the technical problems, the invention specifically provides the following technical scheme:
an intelligent binding method for facial images comprises the following steps:
acquiring a face video, wherein the face video comprises a group of first face images connected according to time sequence;
Obtaining a group of second facial images through a time sequence prediction network according to the group of first facial images, wherein each second facial image is located at the time sequence of a corresponding first facial image, the second facial images are the time sequence prediction results of the group of first facial images, and the time sequence prediction network is a neural network;
Performing feature extraction on the first face image and the second face image at each time sequence through a feature extraction network, and correspondingly obtaining binding face features at each time sequence, wherein the feature extraction network is a neural network;
performing facial image analysis on the binding facial features at each time sequence through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network;
Calculating facial image binding parameters from the third facial image at each time sequence through a parameter calculation network to obtain the facial image binding parameters at each time sequence, wherein the parameter calculation network is a neural network, the facial image binding parameters correspond to the Blend Shape binding parameters applied to a FLAME face model, and the FLAME face model automatically binds the face model to the facial image according to the Blend Shape binding parameters.
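As an illustrative aid only (not part of the claim), the five claimed steps can be sketched as a data-flow pipeline. Every function below is a hypothetical placeholder standing in for a trained neural network; the names, sizes and stand-in computations are assumptions made purely for the sketch:

```python
import numpy as np

# Hypothetical placeholder networks: each stands in for one trained network
# of the method (time sequence prediction, feature extraction, image
# restoration, parameter calculation). Names are illustrative only.
def predict_second_images(first_images):
    # G2_1 = G1_1, G2_2 = G1_2; later frames predicted from all earlier ones.
    # A running mean stands in for the LSTM here.
    second = [first_images[0], first_images[1]]
    for t in range(2, len(first_images)):
        second.append(np.mean(first_images[:t], axis=0))
    return second

def extract_binding_features(g1, g2):
    # Stand-in for the twin-CNN feature extraction network.
    return (g1.mean(), g2.mean())

def restore_third_image(features, shape):
    # Stand-in for the SRCNN-based facial image restoration model.
    return np.full(shape, np.mean(features))

def calculate_binding_params(g3, n_params=5):
    # Stand-in for the network mapping an image to Blend Shape parameters.
    return np.full(n_params, g3.mean())

# One first facial image per time sequence (tiny 4x4 grayscale frames).
rng = np.random.default_rng(0)
first_images = [rng.random((4, 4)) for _ in range(6)]

second_images = predict_second_images(first_images)
binding_params = []
for g1, g2 in zip(first_images, second_images):
    feats = extract_binding_features(g1, g2)
    g3 = restore_third_image(feats, g1.shape)
    binding_params.append(calculate_binding_params(g3))

print(len(binding_params), binding_params[0].shape)
```

The result is one binding-parameter vector per time sequence, which is the data the FLAME face model consumes in the final binding step.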
As a preferred embodiment of the present invention, obtaining a set of second face images through a time-series prediction network according to a set of first face images includes:
acquiring a time sequence of each first face image, and sequentially taking the time sequence as a prediction time sequence;
Performing deep learning with an LSTM network on each first facial image located at the time sequences preceding the prediction time sequence, to obtain a time sequence prediction model for the second facial image at the prediction time sequence;
And predicting the second face image at each prediction time sequence by using a time sequence prediction model at each prediction time sequence to obtain a group of second face images.
As a preferred embodiment of the present invention, the expression of the time sequence prediction network is:
G2_t = LSTM({G1_1, G1_2, G1_3, …, G1_{t-1}});
t ∈ [3, N];
G2_1 = G1_1, G2_2 = G1_2;
Wherein G2_t is the second facial image at the t-th time sequence; G1_1, G1_2, G1_3, …, G1_{t-1} are the first facial images at the 1st, 2nd, 3rd, …, (t-1)-th time sequences respectively; N is the total number of time sequences in the face video; G2_1 is the second facial image at the 1st time sequence; G2_2 is the second facial image at the 2nd time sequence; and t is the counting variable.
As a preferred solution of the present invention, feature extraction is performed on a first face image and a second face image at each time sequence through a feature extraction network, so as to obtain binding face features at each time sequence, including:
The first facial image and the second facial image at each time sequence are input simultaneously into the feature extraction network, and the feature extraction network outputs the binding facial features at each time sequence.
As a preferred embodiment of the present invention, the construction of the feature extraction network includes:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Inputting the first sample image at each timing to a first CNN neural network, outputting facial features of the first sample image by the first CNN neural network;
Inputting the second sample image at each time sequence to a second CNN neural network, outputting facial features of the second sample image by the second CNN neural network;
taking the difference between the facial features of the first sample image and the facial features of the second sample image as a loss function;
Based on the minimization of the loss function, learning and training the first CNN neural network and the second CNN neural network to obtain the feature extraction network;
The expression of the feature extraction network is as follows:
H1_t = CNN1(G1_t);
H2_t = CNN2(G2_t);
H3_t = H1_t or H2_t;
Loss = MSE(H1_t, H2_t);
Where H3_t is the binding facial feature at the t-th time sequence; H1_t is the facial feature of the first sample image at the t-th time sequence; H2_t is the facial feature of the second sample image at the t-th time sequence; G1_t is the first sample image at the t-th time sequence; G2_t is the second sample image at the t-th time sequence; CNN1 is the first CNN neural network; CNN2 is the second CNN neural network; Loss is the loss function value; MSE(H1_t, H2_t) is the mean square error between H1_t and H2_t; t is the counting variable; and "or" is the mathematical identifier meaning that either may be taken.
As a preferred scheme of the invention, performing facial image analysis on the binding facial features at each time sequence through the facial image restoration model to obtain the third facial image comprises the following steps:
The bound facial features at each timing are input to a facial image restoration model, and a third facial image at each timing is output by the facial image restoration model.
As a preferred embodiment of the present invention, the construction of the face image restoration model includes:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Acquiring binding facial features corresponding to the first sample image and the second sample image at each time sequence by using a feature extraction network, and taking the binding facial features as binding sample facial features at each time sequence;
Carrying out mean value fusion on the first sample image and the second sample image at each time sequence to obtain a third sample image;
taking the binding sample facial features at each time sequence as the input items of the SRCNN neural network, and taking the third sample image at each time sequence as the output items of the SRCNN neural network;
Training the SRCNN neural network on these input and output items to obtain the facial image restoration model;
The expression of the facial image restoration model is:
G3_t = SRCNN(H3_t);
Where G3_t is the third sample image at the t-th time sequence; H3_t is the binding sample facial feature at the t-th time sequence, with H3_t = H1_t or H2_t; SRCNN is the SRCNN neural network; H1_t is the facial feature of the first sample image at the t-th time sequence; H2_t is the facial feature of the second sample image at the t-th time sequence; and t is the counting variable.
As a preferred scheme of the invention, calculating the facial image binding parameters from the third facial image at each time sequence through the parameter calculation network to obtain the facial image binding parameters at each time sequence comprises the following steps:
Inputting the third facial image at each time sequence into the parameter calculation network, and outputting the facial image binding parameters at each time sequence by the parameter calculation network;
The construction of the parameter calculation network comprises the following steps:
marking the facial image binding parameters of the first sample image to obtain the facial image binding parameters of the first sample image;
taking the first sample image as an input item of the BP neural network, and taking the facial image binding parameter of the first sample image as an output item of the BP neural network;
Training the BP neural network on these input and output items to obtain the parameter calculation network;
The expression of the parameter calculation network is:
S_t = BP(G3_t);
Where S_t is the facial image binding parameter at the t-th time sequence; G3_t is the third facial image at the t-th time sequence; BP is the BP neural network; and t is the counting variable.
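As an illustrative sketch of the expression S_t = BP(G3_t): the BP neural network is a plain feed-forward regressor from the third facial image to the binding parameters. The layer sizes and the untrained random weights below are assumptions made for the sketch; in practice the weights come from the training described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes: a flattened 8x8 third facial image in, 10 Blend Shape
# binding parameters out, one hidden layer. Real weights would be trained.
n_in, n_hidden, n_out = 64, 32, 10
W1 = rng.normal(0, 0.1, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_out, n_hidden))
b2 = np.zeros(n_out)

def bp_forward(g3_t):
    """S_t = BP(G3_t): map the third facial image at time t to parameters."""
    x = g3_t.reshape(-1)              # flatten the image
    h = np.tanh(W1 @ x + b1)          # hidden layer with tanh activation
    return W2 @ h + b2                # linear output: binding parameters S_t

g3_t = rng.random((8, 8))             # a third facial image at one time sequence
s_t = bp_forward(g3_t)
print(s_t.shape)
```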
As a preferred embodiment of the present invention, the present invention provides an intelligent binding apparatus for facial images, including:
The data acquisition module is used for acquiring a face video, wherein the face video comprises a group of first face images connected according to time sequence;
The data processing module is used for obtaining a group of second face images through a time sequence prediction network according to a group of first face images;
The data processing module is further used for performing feature extraction on the first facial image and the second facial image at each time sequence through a feature extraction network, correspondingly obtaining the binding facial features at each time sequence, wherein the feature extraction network is a neural network;
The data processing module is further used for performing facial image analysis on the binding facial features at each time sequence through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network; and
The data processing module is further used for calculating the facial image binding parameters from the third facial image at each time sequence through a parameter calculation network to obtain the facial image binding parameters at each time sequence, wherein the parameter calculation network is a neural network, the facial image binding parameters correspond to the Blend Shape binding parameters applied to a FLAME face model, and the FLAME face model automatically binds the face model to the facial image according to the Blend Shape binding parameters;
The data storage module is used for storing the time sequence prediction network, the feature extraction network, the facial image restoration model and the parameter calculation network.
As a preferred aspect of the present invention, there is provided computer equipment comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the computer equipment to perform the intelligent binding method for facial images described above.
Compared with the prior art, the invention has the following beneficial effects:
By performing time sequence prediction on the facial images, the invention obtains predicted facial images that embody the time sequence law of facial expressions; it then performs feature extraction on the predicted facial images together with the real facial images and calculates the binding parameters to obtain the facial image binding parameters, so that facial binding balances time sequence fluency with spatial authenticity, and the facial binding effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be obtained from the drawings provided without inventive effort.
FIG. 1 is a flowchart of a facial image intelligent binding method provided by an embodiment of the invention;
FIG. 2 is a block diagram of a facial image intelligent binding device according to an embodiment of the present invention;
Fig. 3 is an internal structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, the present invention provides an intelligent binding method for facial images, comprising the following steps:
Acquiring a face video, wherein the face video comprises a group of first face images connected according to time sequence;
Obtaining a group of second facial images through a time sequence prediction network according to the group of first facial images, wherein each second facial image is located at the time sequence of a corresponding first facial image, the second facial images are the time sequence prediction results of the group of first facial images, and the time sequence prediction network is a neural network;
performing feature extraction on the first facial image and the second facial image at each time sequence through a feature extraction network, and correspondingly obtaining binding facial features (such as facial expression features, facial posture features and other facial features) at each time sequence, wherein the feature extraction network is a neural network;
Performing facial image analysis on the binding facial features at each time sequence through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network;
And calculating the facial image binding parameters from the third facial image at each time sequence through a parameter calculation network to obtain the facial image binding parameters at each time sequence, wherein the parameter calculation network is a neural network, the facial image binding parameters correspond to the Blend Shape binding parameters applied to a FLAME face model, and the FLAME face model automatically binds the face model to the facial image according to the Blend Shape binding parameters.
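The final binding step drives a FLAME-style face model with Blend Shape binding parameters. Below is a minimal sketch of linear blend-shape application, assuming illustrative mesh and basis sizes (a real FLAME model has about 5023 vertices and separate shape, pose and expression components):

```python
import numpy as np

# Sketch of how Blend Shape binding parameters drive a FLAME-style face
# model: the deformed mesh is the template plus a weighted sum of blend
# shape displacement bases. All sizes and data here are illustrative.
rng = np.random.default_rng(1)
n_vertices, n_blendshapes = 100, 10

template = rng.random((n_vertices, 3))                    # neutral face vertices
blendshapes = rng.normal(0, 0.01, (n_blendshapes, n_vertices, 3))

def apply_blendshapes(params):
    """Bind one time sequence's parameters S_t to the face model."""
    offset = np.tensordot(params, blendshapes, axes=1)    # sum_i w_i * B_i
    return template + offset

s_t = np.zeros(n_blendshapes)
assert np.allclose(apply_blendshapes(s_t), template)      # zero weights = neutral face

s_t[0] = 1.0                                              # activate one blend shape
deformed = apply_blendshapes(s_t)
print(deformed.shape)
```

Applying the per-time-sequence parameter vectors in order animates the mesh, which is why temporally smooth parameters yield smooth dynamic expressions.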
To realize facial binding, the invention comprehensively captures the facial expression and posture features of the face from the face video, and then binds the face model using the expression and posture features captured from the video, so that the facial image binding is more detailed and the display effect on the face model is better.
By using a face video for facial binding, the facial features required for binding can be captured more fully, and binding is performed over the time sequences, so the facial binding of the face model changes from static binding to dynamic binding: the face model acquires dynamic facial expressions and postures, which suits the characters of animations and games and meets their demand for dynamic expressions. The dynamic facial binding is obtained in one pass, with no need to build dynamic expressions and postures from static ones, which greatly improves the efficiency of facial binding for such characters.
Furthermore, to give the dynamic expressions and postures of the face model a better binding effect, the invention on the one hand analyzes the face video, uses an LSTM neural network to learn the law of change of facial expression and posture, and predicts the facial expression and posture at each single time sequence based on the learned law. Because these predictions conform to the law of change, performing facial binding with the predicted images gives the bound expressions and postures temporal continuity across the time sequences; that is, the dynamic expressions and postures on the face model are smoother.
On the other hand, although the predictions of facial expression and posture at each single time sequence are derived from the real expressions and postures in the face video, they are virtual in nature. If facial binding were performed only on the predicted images, the authenticity of the dynamic expressions and postures on the face model would be reduced. To preserve that authenticity, facial binding is also based on the facial images in the face video that carry the real expressions and postures, so that the bound expressions and postures at each time sequence remain authentic; that is, the dynamic expressions and postures on the face model are more real.
By performing facial binding jointly on the facial images with real expressions and postures and the predicted images with virtual expressions and postures, the invention preserves the spatial authenticity of the dynamic expressions and postures on the face model while keeping their temporal continuity, so that facial binding is both authentic and fluent and the binding effect is better.
According to the invention, the face video is analyzed, the LSTM neural network is used to learn the law of change of facial expression and posture, and the facial expression and posture at each single time sequence are predicted based on the learned law, specifically as follows:
obtaining a group of second face images through a time sequence prediction network according to the group of first face images, wherein the method comprises the following steps:
acquiring a time sequence of each first face image, and sequentially taking the time sequence as a prediction time sequence;
Performing deep learning with an LSTM network on each first facial image located at the time sequences preceding the prediction time sequence, to obtain a time sequence prediction model for the second facial image at the prediction time sequence;
And predicting the second face image at each prediction time sequence by using a time sequence prediction model at each prediction time sequence to obtain a group of second face images.
The expression of the timing prediction network is:
G2_t = LSTM({G1_1, G1_2, G1_3, …, G1_{t-1}});
t ∈ [3, N];
G2_1 = G1_1, G2_2 = G1_2;
Wherein G2_t is the second facial image at the t-th time sequence; G1_1, G1_2, G1_3, …, G1_{t-1} are the first facial images at the 1st, 2nd, 3rd, …, (t-1)-th time sequences respectively; N is the total number of time sequences in the face video; G2_1 is the second facial image at the 1st time sequence; G2_2 is the second facial image at the 2nd time sequence; and t is the counting variable.
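The recurrence above can be sketched as a prediction loop. The LSTM itself is stood in for by a placeholder (an exponentially weighted mean of the history, an assumption made only so the loop runs); the point of the sketch is the indexing: G2_1 = G1_1, G2_2 = G1_2, and each later G2_t is predicted from all earlier first facial images:

```python
import numpy as np

def lstm_placeholder(history):
    # Stand-in for a trained LSTM: a real implementation would consume the
    # frame history sequentially through its gates. An exponentially
    # weighted mean (recent frames weigh more) is used here for illustration.
    weights = np.array([0.5 ** (len(history) - i) for i in range(len(history))])
    weights /= weights.sum()
    return np.tensordot(weights, np.stack(history), axes=1)

def predict_sequence(first_images):
    """G2_1 = G1_1, G2_2 = G1_2, G2_t = LSTM({G1_1, ..., G1_{t-1}}) for t in [3, N]."""
    n = len(first_images)
    second = [first_images[0], first_images[1]]   # the first two frames pass through
    for t in range(3, n + 1):                     # t runs over [3, N]
        second.append(lstm_placeholder(first_images[:t - 1]))
    return second

rng = np.random.default_rng(7)
g1 = [rng.random((4, 4)) for _ in range(5)]       # N = 5 first facial images
g2 = predict_sequence(g1)
print(len(g2))
```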
The invention performs facial binding jointly on the facial images with real expressions and postures and the predicted images with virtual expressions and postures, preserving the spatial authenticity of the dynamic expressions and postures on the face model while keeping their temporal continuity, specifically as follows:
Performing feature extraction on the first face image and the second face image at each time sequence through a feature extraction network to correspondingly obtain binding face features at each time sequence, including:
The first facial image and the second facial image at each time sequence are input simultaneously into the feature extraction network, and the feature extraction network outputs the binding facial features at each time sequence.
The construction of the feature extraction network comprises the following steps:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Inputting the first sample image at each timing to a first CNN neural network, outputting facial features of the first sample image by the first CNN neural network;
Inputting the second sample image at each time sequence to a second CNN neural network, outputting facial features of the second sample image by the second CNN neural network;
taking the difference between the facial features of the first sample image and the facial features of the second sample image as a loss function;
based on the minimization of the loss function, learning and training the first CNN neural network and the second CNN neural network to obtain a feature extraction network;
the expression of the feature extraction network is:
H1_t = CNN1(G1_t);
H2_t = CNN2(G2_t);
H3_t = H1_t or H2_t;
Loss = MSE(H1_t, H2_t);
Where H3_t is the binding facial feature at the t-th time sequence; H1_t is the facial feature of the first sample image at the t-th time sequence; H2_t is the facial feature of the second sample image at the t-th time sequence; G1_t is the first sample image at the t-th time sequence; G2_t is the second sample image at the t-th time sequence; CNN1 is the first CNN neural network; CNN2 is the second CNN neural network; Loss is the loss function value; MSE(H1_t, H2_t) is the mean square error between H1_t and H2_t; t is the counting variable; and "or" is the mathematical identifier meaning that either may be taken.
When facial features (facial expression features, facial posture features and the like) are extracted, they are extracted through the constructed feature extraction network. The network structure comprises two CNN neural networks, which respectively extract facial features from the facial images with real expressions and postures (the first facial images) and from the predicted images with virtual expressions and postures (the second facial images). During training, the difference between the outputs of the two CNN neural networks is used as the loss function, so that the facial features extracted from the first facial image and from the second facial image at the same time sequence become as similar as possible. The facial features output by the feature extraction network therefore carry the characteristics of both the first and the second facial images; that is, they possess both spatial authenticity and time sequence fluency.
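A minimal numpy sketch of this training signal, with the two CNN branches stood in for by linear maps (an assumption made purely for illustration): the loss is the mean square error between the two branch outputs, and after minimization either output can serve as the binding feature:

```python
import numpy as np

# The two CNN branches are stood in for by simple linear maps; the point
# is the training signal: Loss = MSE(H1_t, H2_t), the mean square error
# between the features the two branches extract from a paired sample.
rng = np.random.default_rng(3)
W_cnn1 = rng.normal(size=(8, 16))   # stand-in for the first CNN branch
W_cnn2 = rng.normal(size=(8, 16))   # stand-in for the second CNN branch

def mse(h1, h2):
    return np.mean((h1 - h2) ** 2)

g1_t = rng.random(16)               # first sample image (flattened)
g2_t = rng.random(16)               # second sample image (flattened)
h1_t = W_cnn1 @ g1_t                # H1_t = CNN1(G1_t)
h2_t = W_cnn2 @ g2_t                # H2_t = CNN2(G2_t)
loss = mse(h1_t, h2_t)

# Minimizing this loss pulls the two feature vectors together, so after
# training either H1_t or H2_t can serve as the binding feature H3_t.
h3_t = h1_t                         # "H3_t = H1_t or H2_t"
print(loss >= 0.0, h3_t.shape)
```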
The facial image (the third facial image) restored from the facial features output by the feature extraction network therefore also has time sequence fluency and spatial authenticity, so the facial image binding parameters obtained from the third facial image give the facial expressions and postures of the bound face model both time sequence fluency and spatial authenticity.
Carrying out facial image analysis on the binding facial features at each time sequence position through a facial image restoration model to obtain a third facial image, wherein the facial image analysis comprises the following steps:
The bound facial features at each timing are input to a facial image restoration model, and a third facial image at each timing is output by the facial image restoration model.
The construction of the face image restoration model comprises the following steps:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Acquiring binding facial features corresponding to the first sample image and the second sample image at each time sequence by using a feature extraction network, and taking the binding facial features as binding sample facial features at each time sequence;
Carrying out mean value fusion on the first sample image and the second sample image at each time sequence to obtain a third sample image;
taking the binding sample facial features at each timing as input items of the SRCNN neural network, and taking the third sample image at each timing as output items of the SRCNN neural network;
Learning and training an input item of the SRCNN neural network and an output item of the SRCNN neural network by utilizing the SRCNN neural network to obtain a facial image restoration model;
The expression of the face image restoration model is:
G3_t = SRCNN(H3_t);
where G3_t is the third sample image at the t-th timing, H3_t is the binding sample facial feature at the t-th timing, SRCNN is the SRCNN neural network, H3_t = H1_t or H2_t, H1_t is the facial feature of the first sample image at the t-th timing, H2_t is the facial feature of the second sample image at the t-th timing, and t is the count variable.
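The mean-value fusion that builds the restoration model's training target can be illustrated as follows; the image sizes and pixel values are invented for the example, and the SRCNN itself is not reproduced here:

```python
# Sketch of the training-target construction: the third sample image
# G3_t is the pixel-wise mean of the first and second sample images
# at the same timing.

def mean_fuse(g1, g2):
    """Pixel-wise mean of two equally sized grayscale images,
    each given as a list of rows."""
    return [[(a + b) / 2 for a, b in zip(r1, r2)]
            for r1, r2 in zip(g1, g2)]

g1_t = [[0, 100], [200, 50]]    # first sample image at timing t
g2_t = [[100, 100], [0, 150]]   # second sample image at timing t
g3_t = mean_fuse(g1_t, g2_t)    # training target for SRCNN(H3_t)
```

Each fused image G3_t then serves as the output item paired with the binding sample facial feature H3_t during SRCNN training.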
Performing facial image binding parameter measurement according to the third facial image at each timing through a parameter measurement network to obtain facial image binding parameters at each timing, which comprises the following steps:
Inputting the third facial image at each timing into the parameter measurement network, and outputting, by the parameter measurement network, the facial image binding parameters at each timing;
The construction of the parameter measurement network comprises the following steps:
labeling the first sample image with facial image binding parameters to obtain the facial image binding parameters of the first sample image;
taking the first sample image as an input item of the BP neural network, and taking the facial image binding parameter of the first sample image as an output item of the BP neural network;
learning and training the input item of the BP neural network and the output item of the BP neural network by using the BP neural network to obtain the parameter measurement network;
The expression of the parameter measurement network is:
S_t = BP(G3_t);
where S_t is the facial image binding parameter at the t-th timing, G3_t is the third facial image at the t-th timing, BP is the BP neural network, and t is the count variable.
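A minimal sketch of the per-timing data flow S_t = BP(G3_t) follows. The trained BP neural network is replaced by a single hypothetical linear layer so that only the mapping from restored image to binding parameters is visible; the weights and shapes are assumptions, not the patent's values:

```python
# Stand-in for the parameter measurement step: flatten the restored
# image at timing t and apply one linear map in place of the trained
# BP (back-propagation) network.

def measure_params(image, weights):
    """Flatten a 2D image and apply one linear layer per parameter."""
    flat = [px for row in image for px in row]
    return [sum(w * x for w, x in zip(row, flat)) for row in weights]

g3_t = [[0.1, 0.2], [0.3, 0.4]]   # restored third facial image at timing t
w = [[1, 0, 0, 0], [0, 0, 0, 1]]  # toy weights: two "binding parameters"
s_t = measure_params(g3_t, w)     # facial image binding parameters S_t
```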
As shown in fig. 2, the present invention provides an intelligent binding apparatus for facial images, comprising:
the data acquisition module is used for acquiring a facial video which comprises a group of first facial images connected according to time sequence;
The data processing module is used for obtaining a group of second face images through a time sequence prediction network according to a group of first face images;
The data processing module is further used for performing feature extraction on the first facial image and the second facial image at each timing through a feature extraction network to correspondingly obtain binding facial features at each timing, wherein the feature extraction network is a neural network;
The data processing module is further used for performing facial image analysis on the binding facial features at each timing through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network; and
The data processing module is further used for performing facial image binding parameter measurement according to the third facial image at each timing through a parameter measurement network to obtain facial image binding parameters at each timing, wherein the parameter measurement network is a neural network, the facial image binding parameters correspond to Blender Shape binding parameters applied to a FLAME face model, and the FLAME face model is used for realizing automatic binding of the face model to the facial image according to the Blender Shape binding parameters;
the data storage module is used for storing a time sequence prediction network, a feature extraction network, a facial image restoration model and a parameter measurement network.
As shown in fig. 3, the present invention provides a computer device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor;
The memory stores instructions executable by the at least one processor to cause the computer device to perform the above intelligent binding method for facial images.
According to the present application, time-sequence prediction is performed on the facial images to obtain predicted facial images that represent the time-sequence law of facial expressions; feature extraction is then performed on both the predicted facial images and the real facial images, and the binding parameters are calculated, so that facial binding takes both time-sequence fluency and spatial authenticity into account and the facial binding effect is improved.
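The overall pipeline summarized above can be sketched as a chain of four stages; every trained network is replaced here by a labeled stand-in function, so this only fixes the order of operations (predict, extract, restore, measure) and not any real model — all names and shapes are assumptions:

```python
# End-to-end data flow of the described method with stand-in stages.

def predict(first_images):            # time-sequence prediction network (LSTM)
    return list(first_images)         # stand-in: echo the input sequence

def extract(g1_t, g2_t):              # feature extraction network (two CNNs)
    return ("feat", g1_t, g2_t)       # stand-in binding feature H3_t

def restore(h3_t):                    # facial image restoration model (SRCNN)
    return ("img",) + h3_t[1:]        # stand-in third facial image G3_t

def measure(g3_t):                    # parameter measurement network (BP)
    return {"blend_shape": g3_t[1:]}  # stand-in binding parameters S_t

first = ["f1", "f2", "f3"]            # first facial images, one per timing
second = predict(first)               # second (predicted) facial images
params = [measure(restore(extract(g1, g2)))
          for g1, g2 in zip(first, second)]
```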
The above embodiments are only exemplary embodiments of the present application and are not intended to limit the present application, the scope of which is defined by the claims. Various modifications and equivalent arrangements of this application will occur to those skilled in the art, and are intended to be within the spirit and scope of the application.
Claims (10)
1. An intelligent binding method for facial images is characterized in that: the method comprises the following steps:
acquiring a face video, wherein the face video comprises a group of first face images connected according to time sequence;
Obtaining a group of second facial images through a time sequence prediction network according to the group of first facial images, wherein each second facial image is located at the timing of a corresponding first facial image, the group of second facial images is the time-sequence prediction result of the group of first facial images, and the time sequence prediction network is a neural network;
Performing feature extraction on the first face image and the second face image at each time sequence through a feature extraction network, and correspondingly obtaining binding face features at each time sequence, wherein the feature extraction network is a neural network;
carrying out facial image analysis on the binding facial features at each time sequence position through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network;
Carrying out facial image binding parameter measurement according to the third facial image at each time sequence through a parameter measurement network to obtain facial image binding parameters at each time sequence, wherein the parameter measurement network is a neural network, the facial image binding parameters correspond to Blender Shape binding parameters applied to a FLAME facial model, and the FLAME facial model is used for realizing automatic binding of the facial model to the facial image according to the Blender Shape binding parameters.
2. The intelligent binding method for facial images according to claim 1, wherein: obtaining a group of second face images through a time sequence prediction network according to the group of first face images, wherein the method comprises the following steps:
acquiring a time sequence of each first face image, and sequentially taking the time sequence as a prediction time sequence;
Performing deep learning, by using an LSTM network, on the first facial images located at timings preceding the prediction timing, so as to obtain a time sequence prediction model for the second facial image at the prediction timing;
And predicting the second face image at each prediction time sequence by using a time sequence prediction model at each prediction time sequence to obtain a group of second face images.
3. The intelligent binding method for facial images according to claim 2, wherein:
The expression of the time sequence prediction network is as follows:
G2_t = LSTM({G1_1, G1_2, G1_3, …, G1_(t-1)});
t ∈ [3, N];
G2_1 = G1_1, G2_2 = G1_2;
Wherein, G2_t is the second facial image at the t-th timing, G1_1, G1_2, G1_3, …, G1_(t-1) are the first facial images at the 1st, 2nd, 3rd, …, (t-1)-th timings respectively, N is the total number of timings in the facial video, G2_1 is the second facial image at the 1st timing, G2_2 is the second facial image at the 2nd timing, and t is the count variable.
4. A facial image intelligent binding method according to claim 3, wherein: performing feature extraction on the first face image and the second face image at each time sequence through a feature extraction network to correspondingly obtain binding face features at each time sequence, including:
The first facial image and the second facial image at each timing are simultaneously input into the feature extraction network, and the feature extraction network outputs the binding facial features at each timing.
5. The intelligent binding method for facial images according to claim 4, wherein: the construction of the feature extraction network comprises the following steps:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Inputting the first sample image at each timing to a first CNN neural network, outputting facial features of the first sample image by the first CNN neural network;
Inputting the second sample image at each time sequence to a second CNN neural network, outputting facial features of the second sample image by the second CNN neural network;
taking the difference between the facial features of the first sample image and the facial features of the second sample image as a loss function;
Based on the minimization of the loss function, learning and training the first CNN neural network and the second CNN neural network to obtain the feature extraction network;
The expression of the feature extraction network is as follows:
H1_t = CNN1(G1_t);
H2_t = CNN2(G2_t);
H3_t = H1_t or H2_t;
Loss = MSE(H1_t, H2_t);
Where H3_t is the binding facial feature at the t-th timing, H1_t is the facial feature of the first sample image at the t-th timing, H2_t is the facial feature of the second sample image at the t-th timing, G1_t is the first sample image at the t-th timing, G2_t is the second sample image at the t-th timing, CNN1 is the first CNN neural network, CNN2 is the second CNN neural network, Loss is the loss function value, MSE(H1_t, H2_t) is the mean square error between H1_t and H2_t, t is the count variable, and "or" is the mathematical identifier denoting that either of the two may be taken.
6. The intelligent binding method for facial images according to claim 5, wherein: carrying out facial image analysis on the binding facial features at each time sequence position through a facial image restoration model to obtain a third facial image, wherein the facial image analysis comprises the following steps:
The bound facial features at each timing are input to a facial image restoration model, and a third facial image at each timing is output by the facial image restoration model.
7. The intelligent binding method for facial images according to claim 6, wherein: the construction of the facial image restoration model comprises the following steps:
Randomly selecting a first face image and a second face image at a plurality of time sequences to serve as a first sample image and a second sample image respectively;
Acquiring binding facial features corresponding to the first sample image and the second sample image at each time sequence by using a feature extraction network, and taking the binding facial features as binding sample facial features at each time sequence;
Carrying out mean value fusion on the first sample image and the second sample image at each time sequence to obtain a third sample image;
taking the binding sample facial features at each timing as input items of the SRCNN neural network, and taking the third sample image at each timing as output items of the SRCNN neural network;
Learning and training an input item of the SRCNN neural network and an output item of the SRCNN neural network by utilizing the SRCNN neural network to obtain a facial image restoration model;
The expression of the facial image restoration model is:
G3_t = SRCNN(H3_t);
where G3_t is the third sample image at the t-th timing, H3_t is the binding sample facial feature at the t-th timing, SRCNN is the SRCNN neural network, H3_t = H1_t or H2_t, H1_t is the facial feature of the first sample image at the t-th timing, H2_t is the facial feature of the second sample image at the t-th timing, and t is the count variable.
8. The intelligent binding method for facial images according to claim 7, wherein: performing facial image binding parameter measurement according to the third facial image at each timing through a parameter measurement network to obtain facial image binding parameters at each timing comprises the following steps:
Inputting the third facial image at each timing into the parameter measurement network, and outputting, by the parameter measurement network, the facial image binding parameters at each timing;
the construction of the parameter measuring network comprises the following steps:
labeling the first sample image with facial image binding parameters to obtain the facial image binding parameters of the first sample image;
taking the first sample image as an input item of the BP neural network, and taking the facial image binding parameter of the first sample image as an output item of the BP neural network;
Learning and training the input item of the BP neural network and the output item of the BP neural network by using the BP neural network to obtain the parameter measurement network;
The expression of the parameter measurement network is:
S_t = BP(G3_t);
where S_t is the facial image binding parameter at the t-th timing, G3_t is the third facial image at the t-th timing, BP is the BP neural network, and t is the count variable.
9. An intelligent binding apparatus for facial images, comprising:
The data acquisition module is used for acquiring a face video, wherein the face video comprises a group of first face images connected according to time sequence;
The data processing module is used for obtaining a group of second face images through a time sequence prediction network according to a group of first face images;
The data processing module is further used for performing feature extraction on the first facial image and the second facial image at each timing through a feature extraction network to correspondingly obtain binding facial features at each timing, wherein the feature extraction network is a neural network;
The data processing module is further used for performing facial image analysis on the binding facial features at each timing through a facial image restoration model to obtain a third facial image, wherein the facial image restoration model is a neural network; and
The data processing module is further used for performing facial image binding parameter measurement according to the third facial image at each timing through a parameter measurement network to obtain facial image binding parameters at each timing, wherein the parameter measurement network is a neural network, the facial image binding parameters correspond to Blender Shape binding parameters applied to a FLAME face model, and the FLAME face model is used for realizing automatic binding of the face model to the facial image according to the Blender Shape binding parameters;
the data storage module is used for storing a time sequence prediction network, a feature extraction network, a facial image restoration model and a parameter measurement network.
10. A computer device, characterized by comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause a computer device to perform the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311543676.0A CN117974852A (en) | 2023-11-17 | 2023-11-17 | Intelligent binding method and device for facial images and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117974852A true CN117974852A (en) | 2024-05-03 |
Family
ID=90852071
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210264563A1 (en) * | 2019-04-26 | 2021-08-26 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for displaying face of virtual role, computer device, and readable storage medium |
CN114820917A (en) * | 2021-01-29 | 2022-07-29 | 宿迁硅基智能科技有限公司 | Automatic facial skeleton binding migration method and system based on fbx file |
CN115063847A (en) * | 2022-04-29 | 2022-09-16 | 网易(杭州)网络有限公司 | Training method and device for facial image acquisition model |
CN115311394A (en) * | 2022-07-27 | 2022-11-08 | 湖南芒果无际科技有限公司 | Method, system, equipment and medium for driving digital human face animation |
CN116863043A (en) * | 2023-05-25 | 2023-10-10 | 度小满科技(北京)有限公司 | Face dynamic capture driving method and device, electronic equipment and readable storage medium |
Non-Patent Citations (1)
Title |
---|
AI技术聚合, 扎眼的阳光: "FLAME 3D face model — explanation and annotated code for 'Learning a model of facial shape and expression from 4D scans'", pages 1 - 15, Retrieved from the Internet <URL:https://aitechtogether.com/article/17540.html> *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||