CN116434303A - Facial expression capturing method, device and medium based on multi-scale feature fusion

Facial expression capturing method, device and medium based on multi-scale feature fusion

Info

Publication number
CN116434303A
Authority
CN
China
Prior art keywords
face
coefficient
identity
expression
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310331030.XA
Other languages
Chinese (zh)
Inventor
谭明奎
李振梁
刘艳霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310331030.XA priority Critical patent/CN116434303A/en
Publication of CN116434303A publication Critical patent/CN116434303A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression capturing method, device and medium based on multi-scale feature fusion, wherein the method comprises the following steps: acquiring a face image; inputting the acquired face image into a trained expression capture model, and outputting face coefficients. The expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of the backbone network and predicting identity coefficients, expression coefficients and texture coefficients from the fused features. According to the invention, on the basis of the backbone network, the multi-scale feature fusion module continuously fuses the image features of different stages of the backbone network, and the fused features are used for predicting the identity, expression and texture of the face, so that a finer prediction result can be obtained. The invention can be widely applied to the field of face image data processing.

Description

Facial expression capturing method, device and medium based on multi-scale feature fusion
Technical Field
The invention relates to the field of facial image data processing, in particular to a facial expression capturing method, device and medium based on multi-scale feature fusion.
Background
In recent years, the field of virtual digital humans has attracted increasing attention, and facial expression capture is a key technology for driving virtual digital humans. The task generally takes an image or video containing a human face as input; a deep neural network learns the image features and predicts various face coefficients, a three-dimensional face is reconstructed through a three-dimensional deformable model, and the expression coefficients of the face are finally obtained as the prediction result of expression capture. However, existing methods have some drawbacks: first, they do not consider the global and local information of the input image simultaneously, so fine facial expressions are difficult to capture; second, they do not consider the confidence differences among different input images, so the identity and expression of the face are difficult to decouple.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a facial expression capturing method, a device and a medium based on multi-scale feature fusion.
The technical scheme adopted by the invention is as follows:
a facial expression capturing method based on multi-scale feature fusion comprises the following steps:
acquiring a face image;
inputting the obtained facial image into a trained expression capturing model, and outputting a facial coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
Further, the expression capture model is obtained by training in the following manner:
acquiring face data, preprocessing the face data, and acquiring a training set;
constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
training the constructed expression capturing model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capturing model to obtain a trained expression capturing model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidences for the plurality of face images input during training, so as to optimize the identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by combining the predicted face coefficients with a three-dimensional deformable model (3DMM) and rendering the three-dimensional face into a two-dimensional image.
Further, the working mode of the coefficient prediction network is as follows:
the full-connection layer receives the output of the backbone network and outputs the pose coefficient p and the illumination coefficient γ;
let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network; the multi-scale feature fusion module fuses the features in the following manner:
a 3×3 convolution with stride 2 downsamples the previous-stage feature X_{l-1} to obtain a feature X̂_{l-1}, so that the spatial size of the previous-stage feature map stays consistent with that of the current-stage feature map;
the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and a 3×3 convolution layer fuses the features to obtain the multi-scale fused feature X_f.
Further, the reconstruction and rendering module works as follows:
constructing a three-dimensional face model according to the identity coefficient α, the expression coefficient β and the texture coefficient δ predicted by the coefficient prediction network;
rendering the constructed three-dimensional face model to a two-dimensional image by combining the pose coefficient p and the illumination coefficient γ;
wherein the obtained two-dimensional image is used for the calculation of the loss function.
Further, the three-dimensional face model is expressed as two major parts of a shape S and a texture T:
S = S̄ + B_id α + B_exp β

T = T̄ + B_tex δ

where S̄ and T̄ are the average face shape and the average texture of the three-dimensional face model, and B_id, B_exp and B_tex respectively denote the identity, expression and texture bases of the three-dimensional face model obtained by PCA dimensionality reduction.
Further, the loss function comprises an identity consistency loss function combined with a confidence weight;
by adding a confidence prediction network on the basis of the backbone network, the confidences c_1, …, c_T of the identity coefficients of the T input images are predicted, wherein each confidence c_t has the same dimension as the identity coefficient α;
the identity consistency loss function is calculated as follows:
for T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces; combined with the identity-coefficient confidences output by the confidence prediction network, a pseudo label ᾱ for the T identity coefficients is constructed, namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

the T identity coefficients are constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²

where c_t is the confidence of the t-th input image, α_t is the identity coefficient of the t-th input image, and ⊙ denotes element-wise multiplication (the division is likewise element-wise).
Further, the loss function also comprises a face area illumination loss function and a face key point loss function;
the expression of the face-region illumination loss function is as follows:

L_p = Σ_i A_i ||I_i − Î_i|| / Σ_i A_i

where A is the face-region mask, I and Î are the input image and the rendered image, and i indexes image pixels;
the expression of the face key point loss function is as follows:

L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

where P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
Further, the preprocessing the face data to obtain a training set includes:
cropping the images in the face data to a preset size;
dividing the cropped face data into three parts: a training set, a verification set and a test set;
carrying out face segmentation on the images in the training set to obtain a face region segmentation result of the images; and obtaining two-dimensional key point coordinates of the face by adopting a preset face key point detection method.
The invention adopts another technical scheme that:
a facial expression capture device based on multi-scale feature fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: according to the invention, the multi-scale feature fusion module is used for continuously fusing the image features of different stages of the backbone network on the basis of the backbone network, and the fusion features are used for predicting the identity, the expression and the texture of the face, so that a finer prediction result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
FIG. 1 is a training flow diagram of an expression capture model in an embodiment of the invention;
FIG. 2 is a diagram of an expression capture model based on multi-scale feature fusion in an embodiment of the invention;
FIG. 3 is a block diagram of a multi-scale feature fusion module in an embodiment of the invention;
fig. 4 is a schematic diagram of construction of an identity pseudo tag according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, 'several' means one or more and 'a plurality of' means two or more; greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include the stated number. The descriptions of first and second are only for the purpose of distinguishing technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
The existing expression capturing methods based on three-dimensional face reconstruction have the following problems: (1) when predicting the identity, expression and texture of the face, they do not combine the global and local information of the image, so fine facial expressions cannot be captured; (2) when computing the identity consistency loss, existing methods do not consider the confidence differences among multiple images, so the identity and expression of the face are insufficiently decoupled. For problem (1), the invention provides a multi-scale feature fusion module which, on the basis of the existing backbone network, continuously fuses image features from different stages of the backbone network and uses the fused features to predict the identity, expression and texture of the face, obtaining finer prediction results. For problem (2), the invention provides an identity consistency loss function combined with confidence weights: an additional confidence prediction network outputs the confidences of the identity coefficients of multiple input images, identity pseudo labels are computed from these confidences, and the identity consistency loss function is then computed accordingly, achieving better decoupling of face identity and expression.
The embodiment provides a facial expression capturing method based on multi-scale feature fusion, which comprises the following steps:
s101, acquiring a face image;
s102, inputting the obtained face image into a trained expression capturing model, and outputting a face coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
As an optional implementation manner, the expression capturing model is trained as follows. First, a face image dataset is constructed for training the network model and certain data preprocessing is performed. Second, a deep neural network model for predicting face coefficients is constructed: to address the difficulty of capturing fine facial expressions, a multi-scale feature fusion module is proposed, which combines the global and local information of the image to obtain more accurate expression prediction results; to address the large differences among different input images, a confidence prediction branch network is added, and an identity consistency loss function combined with confidence weights is used to achieve better decoupling of identity and expression, further improving the accuracy of expression prediction. Finally, the constructed coefficient prediction model is trained on the dataset until convergence. Further, as an alternative embodiment, the model is trained in a self-supervised manner.
Referring to fig. 1, the training steps described above specifically include steps A1-A3:
a1, acquiring face data, preprocessing the face data, and acquiring a training set;
a2, constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
a3, training the constructed expression capture model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capture model to obtain a trained expression capture model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidence degrees for a plurality of face images input during training so as to optimize identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by utilizing the predicted face coefficient and combining the three-dimensional deformable model and rendering the three-dimensional face into a two-dimensional image.
The above method is explained in detail below with reference to the drawings and specific examples.
S1: collecting and processing face data sets
S1-1: image data with different faces and expressions are collected and cut to specific sizes (e.g., 224 x 224 pixels).
S1-2: the data set is divided into three parts, namely a training set, a verification set and a test set.
S1-3: and (3) for the images in the training set, using the existing face segmentation method to obtain a face region segmentation result of the images. And meanwhile, the 2D key point coordinates of the face are obtained by using the existing face key point detection method.
S2: construction of network model
The invention aims to capture the expression of an input 2D face image, namely, predicting the expression coefficient of the face corresponding to a specific three-dimensional deformable model (3D Morphable Model,3DMM) in the image. The overall structure of the model, as shown in fig. 2, is mainly divided into three parts: (1) coefficient prediction network: learning image characteristics and predicting face coefficients by using a backbone neural network; (2) identity consistency loss in combination with confidence weights: the confidence prediction network predicts respective confidence degrees for a plurality of face images input during training, and accordingly identity consistency loss is optimized; (3) a reconstruction and rendering module: and reconstructing a three-dimensional face by combining the predicted face coefficients with the 3DMM and rendering the three-dimensional face into a 2D image.
In this embodiment, the confidence prediction network is required only during training, where the information generated by the reconstruction and rendering module participates in the calculation of the loss functions; during testing and application, only the coefficient prediction network is required, and this module generates the required expression coefficients.
S2-1, constructing a coefficient prediction network: T face images I with the same identity and different expressions are input into the coefficient prediction network to obtain the identity, expression, texture, pose and illumination coefficients (α, β, δ, p and γ) of the face. The coefficient prediction network is composed of a backbone network, a full-connection layer and a multi-scale feature fusion module.
S2-1-1: backbone network: the full connection layer receives the output of the backbone network and outputs the coefficients described above for the pre-trained ResNet50 model on ImageNet. The existing method adopts the last layer of characteristics output by the backbone network as the input of a full-connection layer to predict all coefficients. In the invention, we consider the identity, expression and texture of the face to consider the image features of different scales, namely, consider the global and local information of the image at the same time, so that a multi-scale feature fusion module shown in fig. 3 is used, the above 3 coefficients are predicted by using the fused features, and for the pose and illumination coefficients, the final layer of feature prediction of the backbone network is still utilized.
S2-1-2, multi-scale feature fusion module:
Let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network. The module works in two steps: (1) a 3×3 convolution with stride 2 downsamples the previous-stage feature X_{l-1} to obtain X̂_{l-1}, keeping the spatial size of the previous-stage feature map consistent with that of the current-stage feature map; (2) the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and a 3×3 convolution layer fuses the features to obtain the multi-scale fused feature X_f.
As shown in Fig. 2, for the 4 stages of the ResNet50 backbone, this embodiment uses 3 feature fusion modules to successively fuse the backbone features of each stage with the fused features of the previous stage, realizing multi-scale feature fusion of the image; finally, the fused features are used to predict the identity, expression and texture coefficients of the face.
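A minimal PyTorch sketch of one fusion step follows. It illustrates the mechanism described above rather than reproducing the filed implementation; the channel arguments are assumptions (for ResNet50 the four stages output 256, 512, 1024 and 2048 channels).

```python
# One multi-scale feature fusion step: downsample the previous-stage feature with a
# stride-2 3x3 convolution, concatenate with the current-stage feature, and fuse
# the result with a 3x3 convolution. Channel sizes are assumptions.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, prev_channels, cur_channels, out_channels):
        super().__init__()
        # Halves the spatial size of X_{l-1} so it matches X_l.
        self.down = nn.Conv2d(prev_channels, prev_channels, kernel_size=3, stride=2, padding=1)
        # Fuses the concatenated features into X_f.
        self.fuse = nn.Conv2d(prev_channels + cur_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x_prev, x_cur):
        x_down = self.down(x_prev)                 # downsampled previous-stage feature
        x_cat = torch.cat([x_down, x_cur], dim=1)  # channel-wise concatenation (splicing)
        return self.fuse(x_cat)                    # multi-scale fused feature X_f
```

Three such blocks chained over the four ResNet50 stages, each taking the previous fused feature and the next stage's output, correspond to the arrangement described above; full-connection heads on the final fused feature would then predict the identity, expression and texture coefficients.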
S2-2: confidence prediction network:
In order to achieve better decoupling of the identity and expression of the face, the invention provides an identity consistency loss function combined with confidence weights (detailed in step S3-3). To realize this loss function, the confidences of the identity coefficients of the T input same-identity images must be predicted. Specifically, a branch network is added on the basis of the backbone network to predict the confidences c_1, …, c_T of the identity coefficients of the T input images, where each confidence c_t has the same dimension as the identity coefficient α.
As an alternative implementation, this embodiment uses the lightweight MobileNetV3 model as the branch network to reduce training time. The last network layer of MobileNetV3 is replaced with a full-connection layer whose output dimension is set equal to the dimension of the identity coefficient α, so that the identity confidences of the T images can be output.
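The snippet below shows one way such a branch could be assembled with torchvision's MobileNetV3; it is a sketch under assumptions, not the filed code. The identity-coefficient dimension of 80 is assumed (the patent only requires the confidence to have the same dimension as α), and the Softplus that keeps confidences positive is an added assumption rather than part of the filing.

```python
# Assumed confidence-branch setup: MobileNetV3 with its last layer replaced by a
# full-connection layer whose output dimension equals that of the identity coefficient.
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

id_dim = 80  # assumed dimension of the identity coefficient alpha
conf_net = mobilenet_v3_large(weights=None)
conf_net.classifier[-1] = nn.Sequential(
    nn.Linear(conf_net.classifier[-1].in_features, id_dim),
    nn.Softplus(),  # assumption: keep the confidences positive for the weighted average
)
```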
S2-3, a reconstruction and rendering module:
s2-3-1: after obtaining the identity, expression, texture, gesture and illumination coefficients of the face through the coefficient prediction network (respectively, alpha, beta, delta, p and gamma in fig. 2), the three-dimensional face reconstruction is firstly required to be carried out by combining a three-dimensional deformable model, namely 3 DMM. In 3DMM, an arbitrary three-dimensional face can be expressed as two major parts, namely, a shape S and a texture T:
Figure BDA0004154922210000071
Figure BDA0004154922210000072
wherein the method comprises the steps of
Figure BDA0004154922210000073
And->
Figure BDA0004154922210000074
Average face shape and average texture for 3DMM model, B id ,B exp And B is connected with tex And respectively representing the identity, expression and texture substrates of the face subjected to PCA dimension reduction in the 3 DMM. By combining the above formulas with the identity, expression and texture coefficients alpha, beta and delta predicted in the step S2-1, a three-dimensional face model corresponding to the input image can be obtained.
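The following sketch assembles the shape and texture from the two formulas above; the basis matrices and coefficient dimensions are assumptions typical of 3DMMs such as the Basel Face Model, and the code is illustrative rather than the patent's implementation.

```python
# Assemble a 3DMM face from predicted coefficients; array shapes are assumptions
# (V vertices flattened to 3V values; 80/64/80 identity/expression/texture bases).
import numpy as np

def reconstruct_face(mean_shape, B_id, alpha, B_exp, beta, mean_tex, B_tex, delta):
    """S = S_bar + B_id*alpha + B_exp*beta ;  T = T_bar + B_tex*delta."""
    S = mean_shape + B_id @ alpha + B_exp @ beta   # vertex positions, flattened (3V,)
    T = mean_tex + B_tex @ delta                   # per-vertex texture, flattened (3V,)
    return S, T

# Toy usage with random bases (V = 4 vertices; realistic models use tens of thousands).
V = 4
S, T = reconstruct_face(np.zeros(3 * V), np.random.randn(3 * V, 80), np.random.randn(80),
                        np.random.randn(3 * V, 64), np.random.randn(64),
                        np.zeros(3 * V), np.random.randn(3 * V, 80), np.random.randn(80))
```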
S2-3-2: and (3) for the three-dimensional face model obtained in the last step. And three face models can be rendered to a two-dimensional image by utilizing a pinhole camera model and a spherical harmonic illumination model which are contained in the differential renderer and combining the attitude coefficient p and the illumination coefficient gamma. The two-dimensional image is used for the calculation of the subsequent loss function.
S3: calculating a loss function
The loss functions used for training are constructed from the face identity coefficients predicted in step S2 and the finally obtained two-dimensional rendered image.
S3-1: face region illumination loss function. In the T input images, for the T-th image I, a rendered image is calculated
Figure BDA0004154922210000087
Difference between the face pixel values to obtain a loss function L p The method comprises the following steps:
Figure BDA0004154922210000081
wherein A is the face region mask obtained in the step S1-3.
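A hedged PyTorch sketch of this masked photometric loss is given below; the per-pixel L2 over the colour channels and the batch handling are assumptions consistent with the formula above.

```python
# Face-region illumination (photometric) loss: colour difference inside the skin
# mask A, normalised by the mask area. A sketch of L_p, not the filed implementation.
import torch

def photometric_loss(img, rendered, mask, eps=1e-8):
    # img, rendered: (B, 3, H, W); mask: (B, 1, H, W) with 1 inside the face region
    diff = torch.norm(img - rendered, dim=1, keepdim=True)  # per-pixel L2 over RGB
    return (mask * diff).sum() / (mask.sum() + eps)
```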
S3-2: face key point loss function. The face 2D key point difference between the input image and the rendered image is calculated, namely:
L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

where P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
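A corresponding sketch of the key point loss follows; averaging over both the n key points and the batch is an assumption.

```python
# Face key point loss: mean squared distance between the 2D key points detected on
# the input image and the projected key points of the rendered face (sketch of L_lmk).
import torch

def landmark_loss(P, P_hat):
    # P, P_hat: (B, n, 2) key point coordinates
    return ((P - P_hat) ** 2).sum(dim=-1).mean()
```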
S3-3: identity consistency loss function combined with confidence weight.
The invention provides this loss function to handle the situation in which the face confidences of multiple input images differ due to pose, blur, occlusion and the like. For T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces. First, the identity-coefficient confidences output by the confidence prediction network in step S2 are combined to construct a pseudo label ᾱ for the T identity coefficients (as in Fig. 4), namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

where c_t is the confidence of the t-th input image and ⊙ denotes element-wise multiplication (the division is likewise element-wise). The T identity coefficients are then constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²
the identity coefficient pseudo tag obtained by calculating the confidence coefficient and the identity coefficient is compared with the mode of directly calculating the average value of the identity coefficient by the existing method, so that the influence caused by differences between different input images can be reduced, and the decoupling between the identity and the expression of the face can be better realized.
S4: and (3) performing deep learning model training on the preprocessed and divided data set by using the coefficient prediction network model constructed in the step S2. And updating the weight of the model by adopting a random gradient descent method through the loss function designed in the step S3 until the loss function value is converged, and storing the network weight of the model. And finally, testing and evaluating the verification set and the test set of the data set.
In summary, this embodiment proposes a facial expression capturing method based on multi-scale feature fusion, which solves the difficulty of capturing fine facial expressions through multi-scale feature fusion and achieves better decoupling of the face identity and expression through the identity consistency loss function combined with confidence weights. Table 1 shows an experimental comparison between the method of the invention and existing expression capturing methods on the FEAFA face dataset; the method of this embodiment surpasses the existing best method on the expression coefficient prediction metric.
TABLE 1
(The table is reproduced as an image in the original filing and is not included here.)
Note: Table 1 compares the present invention with other methods on different facial expressions of the FEAFA dataset; the values are the mean absolute error (MAE) between the predicted and ground-truth expression coefficients.
The embodiment also provides a facial expression capturing device based on multi-scale feature fusion, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The facial expression capturing device based on the multi-scale feature fusion can execute any combination implementation steps of the facial expression capturing method based on the multi-scale feature fusion, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the facial expression capturing method based on the multi-scale feature fusion, and when the instructions or programs are run, the steps can be implemented by any combination of the executable method embodiments, so that the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference has been made to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The facial expression capturing method based on multi-scale feature fusion is characterized by comprising the following steps of:
acquiring a face image;
inputting the obtained facial image into a trained expression capturing model, and outputting a facial coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
2. The facial expression capturing method based on multi-scale feature fusion according to claim 1, wherein the expression capturing model is obtained by training in the following manner:
acquiring face data, preprocessing the face data, and acquiring a training set;
constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
training the constructed expression capturing model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capturing model to obtain a trained expression capturing model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidence degrees for a plurality of face images input during training so as to optimize identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by utilizing the predicted face coefficient and combining the three-dimensional deformable model and rendering the three-dimensional face into a two-dimensional image.
3. The facial expression capturing method based on multi-scale feature fusion according to claim 1 or 2, wherein the coefficient prediction network works as follows:
the full-connection layer receives the output of the backbone network and outputs the pose coefficient p and the illumination coefficient γ;
let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network; the multi-scale feature fusion module fuses the features in the following manner:
the previous-stage feature X_{l-1} is downsampled to obtain a feature X̂_{l-1};
the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and the features are fused to obtain the multi-scale fused feature X_f.
4. The facial expression capturing method based on multi-scale feature fusion according to claim 2, wherein the reconstruction and rendering module works as follows:
constructing a three-dimensional face model according to the identity coefficient α, the expression coefficient β and the texture coefficient δ predicted by the coefficient prediction network;
rendering the constructed three-dimensional face model to a two-dimensional image by combining the pose coefficient p and the illumination coefficient γ;
wherein the obtained two-dimensional image is used for the calculation of the loss function.
5. The facial expression capturing method based on multi-scale feature fusion according to claim 4, wherein the three-dimensional facial model is expressed as two major parts of a shape S and a texture T:
S = S̄ + B_id α + B_exp β

T = T̄ + B_tex δ

wherein S̄ and T̄ are the average face shape and the average texture of the three-dimensional face model, and B_id, B_exp and B_tex respectively denote the identity, expression and texture bases of the three-dimensional face model obtained by PCA dimensionality reduction.
6. A facial expression capture method based on multi-scale feature fusion according to claim 2, wherein the penalty function comprises an identity consistency penalty function combined with confidence weights;
by adding a confidence prediction network on the basis of the backbone network, the confidences c_1, …, c_T of the identity coefficients of the T input images are predicted, wherein each confidence c_t has the same dimension as the identity coefficient α;
the identity consistency loss function is calculated as follows:
for T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces; combined with the identity-coefficient confidences output by the confidence prediction network, a pseudo label ᾱ for the T identity coefficients is constructed, namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

the T identity coefficients are constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²

wherein c_t is the confidence of the t-th input image, α_t is the identity coefficient of the t-th input image, and ⊙ denotes element-wise multiplication (the division is likewise element-wise).
7. The facial expression capturing method based on multi-scale feature fusion according to claim 6, wherein the loss function further comprises a face region illumination loss function and a face key point loss function;
the expression of the illumination loss function of the face area is as follows:
L_p = Σ_i A_i ||I_i − Î_i|| / Σ_i A_i

wherein A is the face-region mask, I and Î are the input image and the rendered image, and i indexes image pixels;
the expression of the face key point loss function is as follows:

L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

wherein P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
8. The facial expression capturing method based on multi-scale feature fusion according to claim 2, wherein the preprocessing of the facial data to obtain the training set comprises:
cropping the images in the face data to a preset size;
dividing the cropped face data into three parts: a training set, a verification set and a test set;
carrying out face segmentation on the images in the training set to obtain a face region segmentation result of the images; and obtaining two-dimensional key point coordinates of the face by adopting a preset face key point detection method.
9. Facial expression capturing device based on multiscale feature fusion, characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-8.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-8 when being executed by a processor.
CN202310331030.XA 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion Pending CN116434303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331030.XA CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310331030.XA CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Publications (1)

Publication Number Publication Date
CN116434303A true CN116434303A (en) 2023-07-14

Family

ID=87086607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331030.XA Pending CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN116434303A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218499A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Training method of facial expression capturing model, facial expression driving method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination