CN116434303A - Facial expression capturing method, device and medium based on multi-scale feature fusion

Facial expression capturing method, device and medium based on multi-scale feature fusion

Info

Publication number
CN116434303A
Authority
CN
China
Prior art keywords
face
coefficient
identity
expression
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310331030.XA
Other languages
Chinese (zh)
Inventor
谭明奎
李振梁
刘艳霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310331030.XA priority Critical patent/CN116434303A/en
Publication of CN116434303A publication Critical patent/CN116434303A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression capturing method, device and medium based on multi-scale feature fusion, wherein the method comprises the following steps: acquiring a face image; inputting the acquired face image into a trained expression capture model, and outputting face coefficients. The expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of the backbone network and predicting identity coefficients, expression coefficients and texture coefficients from the fused features. According to the invention, on the basis of the backbone network, the multi-scale feature fusion module continuously fuses the image features of different stages of the backbone network, and the fused features are used for predicting the identity, expression and texture of the face, so that a finer prediction result can be obtained. The invention can be widely applied to the field of face image data processing.

Description

Facial expression capturing method, device and medium based on multi-scale feature fusion
Technical Field
The invention relates to the field of facial image data processing, in particular to a facial expression capturing method, device and medium based on multi-scale feature fusion.
Background
In recent years, the field of virtual digital humans has attracted increasing attention, and facial expression capture is a key technology for driving virtual digital humans. The task generally takes an image or video containing a human face as input; a deep neural network learns the image features and predicts various face coefficients, a three-dimensional face is reconstructed through a three-dimensional deformable model, and the expression coefficients of the face are finally obtained as the prediction result of expression capture. However, existing methods have some drawbacks: first, they do not consider the global and local information of the input image simultaneously, so fine facial expressions are difficult to capture; second, they do not consider the confidence differences among different input images, so the identity and expression of the face are difficult to decouple.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a facial expression capturing method, a device and a medium based on multi-scale feature fusion.
The technical scheme adopted by the invention is as follows:
a facial expression capturing method based on multi-scale feature fusion comprises the following steps:
acquiring a face image;
inputting the obtained facial image into a trained expression capturing model, and outputting a facial coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
Further, the expression capture model is obtained by training in the following manner:
acquiring face data, preprocessing the face data, and acquiring a training set;
constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
training the constructed expression capturing model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capturing model to obtain a trained expression capturing model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidences for the plurality of face images input during training, so as to optimize the identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by combining the predicted face coefficients with a three-dimensional deformable model (3DMM) and rendering the three-dimensional face into a two-dimensional image.
Further, the working mode of the coefficient prediction network is as follows:
the full-connection layer receives the output of the backbone network and outputs the pose coefficient p and the illumination coefficient γ;
let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network; the multi-scale feature fusion module fuses the features in the following manner:
a 3×3 convolution with stride 2 downsamples the previous-stage feature X_{l-1} to obtain a feature X̂_{l-1}, so that the spatial size of the previous-stage feature map stays consistent with that of the current-stage feature map;
the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and a 3×3 convolution layer fuses the features to obtain the multi-scale fused feature X_f.
Further, the reconstruction and rendering module works as follows:
constructing a three-dimensional face model according to the identity coefficient α, the expression coefficient β and the texture coefficient δ predicted by the coefficient prediction network;
rendering the constructed three-dimensional face model to a two-dimensional image by combining the pose coefficient p and the illumination coefficient γ;
wherein the obtained two-dimensional image is used for the calculation of the loss function.
Further, the three-dimensional face model is expressed as two major parts of a shape S and a texture T:
S = S̄ + B_id α + B_exp β

T = T̄ + B_tex δ

where S̄ and T̄ are the average face shape and the average texture of the three-dimensional face model, and B_id, B_exp and B_tex respectively denote the identity, expression and texture bases of the three-dimensional face model obtained by PCA dimensionality reduction.
Further, the loss function comprises an identity consistency loss function combined with a confidence weight;
by adding a confidence prediction network on the basis of the backbone network, the confidences c_1, …, c_T of the identity coefficients of the T input images are predicted, wherein each confidence c_t has the same dimension as the identity coefficient α;
the identity consistency loss function is calculated as follows:
for T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces; combined with the identity-coefficient confidences output by the confidence prediction network, a pseudo label ᾱ for the T identity coefficients is constructed, namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

the T identity coefficients are constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²

where c_t is the confidence of the t-th input image, α_t is the identity coefficient of the t-th input image, and ⊙ denotes element-wise multiplication (the division is likewise element-wise).
Further, the loss function also comprises a face area illumination loss function and a face key point loss function;
the expression of the face-region illumination loss function is as follows:

L_p = Σ_i A_i ||I_i − Î_i|| / Σ_i A_i

where A is the face-region mask, I and Î are the input image and the rendered image, and i indexes image pixels;
the expression of the face key point loss function is as follows:

L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

where P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
Further, the preprocessing the face data to obtain a training set includes:
cropping the images in the face data to a preset size;
dividing the cropped face data into three parts: a training set, a verification set and a test set;
carrying out face segmentation on the images in the training set to obtain a face region segmentation result of the images; and obtaining two-dimensional key point coordinates of the face by adopting a preset face key point detection method.
The invention adopts another technical scheme that:
a facial expression capture device based on multi-scale feature fusion, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: according to the invention, the multi-scale feature fusion module is used for continuously fusing the image features of different stages of the backbone network on the basis of the backbone network, and the fusion features are used for predicting the identity, the expression and the texture of the face, so that a finer prediction result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made with reference to the accompanying drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and other drawings may be obtained according to these drawings without the need of inventive labor for those skilled in the art.
FIG. 1 is a training flow diagram of an expression capture model in an embodiment of the invention;
FIG. 2 is a diagram of an expression capture model based on multi-scale feature fusion in an embodiment of the invention;
FIG. 3 is a block diagram of a multi-scale feature fusion module in an embodiment of the invention;
fig. 4 is a schematic diagram of construction of an identity pseudo tag according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, 'several' means one or more and 'a plurality of' means two or more; greater than, less than, exceeding, etc. are understood to exclude the stated number, while above, below, within, etc. are understood to include the stated number. The descriptions of first and second are only for the purpose of distinguishing technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of technical features indicated, or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
The existing expression capturing methods based on three-dimensional face reconstruction have the following problems: (1) when predicting the identity, expression and texture of the face, they do not combine the global and local information of the image, so fine facial expressions cannot be captured; (2) when computing the identity consistency loss, existing methods do not consider the confidence differences among multiple images, so the identity and expression of the face are insufficiently decoupled. For problem (1), the invention provides a multi-scale feature fusion module which, on the basis of the existing backbone network, continuously fuses image features from different stages of the backbone network and uses the fused features to predict the identity, expression and texture of the face, obtaining finer prediction results. For problem (2), the invention provides an identity consistency loss function combined with confidence weights: an additional confidence prediction network outputs the confidences of the identity coefficients of multiple input images, identity pseudo labels are computed from these confidences, and the identity consistency loss function is then computed accordingly, achieving better decoupling of face identity and expression.
The embodiment provides a facial expression capturing method based on multi-scale feature fusion, which comprises the following steps:
s101, acquiring a face image;
s102, inputting the obtained face image into a trained expression capturing model, and outputting a face coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
As an optional implementation manner, the expression capturing model is trained as follows. First, a face image dataset is constructed for training the network model and certain data preprocessing is performed. Second, a deep neural network model for predicting face coefficients is constructed: to address the difficulty of capturing fine facial expressions, a multi-scale feature fusion module is proposed, which combines the global and local information of the image to obtain more accurate expression prediction results; to address the large differences among different input images, a confidence prediction branch network is added, and an identity consistency loss function combined with confidence weights is used to achieve better decoupling of identity and expression, further improving the accuracy of expression prediction. Finally, the constructed coefficient prediction model is trained on the dataset until convergence. Further, as an alternative embodiment, the model is trained in a self-supervised manner.
Referring to fig. 1, the training steps described above specifically include steps A1-A3:
a1, acquiring face data, preprocessing the face data, and acquiring a training set;
a2, constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
a3, training the constructed expression capture model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capture model to obtain a trained expression capture model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidence degrees for a plurality of face images input during training so as to optimize identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by utilizing the predicted face coefficient and combining the three-dimensional deformable model and rendering the three-dimensional face into a two-dimensional image.
The above method is explained in detail below with reference to the drawings and specific examples.
S1: collecting and processing face data sets
S1-1: image data with different faces and expressions are collected and cut to specific sizes (e.g., 224 x 224 pixels).
S1-2: the data set is divided into three parts, namely a training set, a verification set and a test set.
S1-3: and (3) for the images in the training set, using the existing face segmentation method to obtain a face region segmentation result of the images. And meanwhile, the 2D key point coordinates of the face are obtained by using the existing face key point detection method.
S2: construction of network model
The invention aims to capture the expression of an input 2D face image, namely, predicting the expression coefficient of the face corresponding to a specific three-dimensional deformable model (3D Morphable Model,3DMM) in the image. The overall structure of the model, as shown in fig. 2, is mainly divided into three parts: (1) coefficient prediction network: learning image characteristics and predicting face coefficients by using a backbone neural network; (2) identity consistency loss in combination with confidence weights: the confidence prediction network predicts respective confidence degrees for a plurality of face images input during training, and accordingly identity consistency loss is optimized; (3) a reconstruction and rendering module: and reconstructing a three-dimensional face by combining the predicted face coefficients with the 3DMM and rendering the three-dimensional face into a 2D image.
In this embodiment, the confidence prediction network is required only during training, where the information generated by the reconstruction and rendering module participates in the calculation of the loss functions; during testing and application, only the coefficient prediction network is required, and this module generates the required expression coefficients.
S2-1, constructing a coefficient prediction network: T face images I with the same identity and different expressions are input into the coefficient prediction network to obtain the identity, expression, texture, pose and illumination coefficients (α, β, δ, p and γ) of the face. The coefficient prediction network is composed of a backbone network, a full-connection layer and a multi-scale feature fusion module.
S2-1-1: backbone network: the full connection layer receives the output of the backbone network and outputs the coefficients described above for the pre-trained ResNet50 model on ImageNet. The existing method adopts the last layer of characteristics output by the backbone network as the input of a full-connection layer to predict all coefficients. In the invention, we consider the identity, expression and texture of the face to consider the image features of different scales, namely, consider the global and local information of the image at the same time, so that a multi-scale feature fusion module shown in fig. 3 is used, the above 3 coefficients are predicted by using the fused features, and for the pose and illumination coefficients, the final layer of feature prediction of the backbone network is still utilized.
S2-1-2, multi-scale feature fusion module:
Let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network. The module works in two steps: (1) a 3×3 convolution with stride 2 downsamples the previous-stage feature X_{l-1} to obtain X̂_{l-1}, keeping the spatial size of the previous-stage feature map consistent with that of the current-stage feature map; (2) the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and a 3×3 convolution layer fuses the features to obtain the multi-scale fused feature X_f.
As shown in Fig. 2, for the 4 stages of the ResNet50 backbone, this embodiment uses 3 feature fusion modules to successively fuse the backbone features of each stage with the fused features of the previous stage, realizing multi-scale feature fusion of the image; finally, the fused features are used to predict the identity, expression and texture coefficients of the face.
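A minimal PyTorch sketch of one fusion step follows. It illustrates the mechanism described above rather than reproducing the filed implementation; the channel arguments are assumptions (for ResNet50 the four stages output 256, 512, 1024 and 2048 channels).

```python
# One multi-scale feature fusion step: downsample the previous-stage feature with a
# stride-2 3x3 convolution, concatenate with the current-stage feature, and fuse
# the result with a 3x3 convolution. Channel sizes are assumptions.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, prev_channels, cur_channels, out_channels):
        super().__init__()
        # Halves the spatial size of X_{l-1} so it matches X_l.
        self.down = nn.Conv2d(prev_channels, prev_channels, kernel_size=3, stride=2, padding=1)
        # Fuses the concatenated features into X_f.
        self.fuse = nn.Conv2d(prev_channels + cur_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x_prev, x_cur):
        x_down = self.down(x_prev)                 # downsampled previous-stage feature
        x_cat = torch.cat([x_down, x_cur], dim=1)  # channel-wise concatenation (splicing)
        return self.fuse(x_cat)                    # multi-scale fused feature X_f
```

Three such blocks chained over the four ResNet50 stages, each taking the previous fused feature and the next stage's output, correspond to the arrangement described above; full-connection heads on the final fused feature would then predict the identity, expression and texture coefficients.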
S2-2: confidence prediction network:
In order to achieve better decoupling of the identity and expression of the face, the invention provides an identity consistency loss function combined with confidence weights (detailed in step S3-3). To realize this loss function, the confidences of the identity coefficients of the T input same-identity images must be predicted. Specifically, a branch network is added on the basis of the backbone network to predict the confidences c_1, …, c_T of the identity coefficients of the T input images, where each confidence c_t has the same dimension as the identity coefficient α.
As an alternative implementation, this embodiment uses the lightweight MobileNetV3 model as the branch network to reduce training time. The last network layer of MobileNetV3 is replaced with a full-connection layer whose output dimension is set equal to the dimension of the identity coefficient α, so that the identity confidences of the T images can be output.
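The snippet below shows one way such a branch could be assembled with torchvision's MobileNetV3; it is a sketch under assumptions, not the filed code. The identity-coefficient dimension of 80 is assumed (the patent only requires the confidence to have the same dimension as α), and the Softplus that keeps confidences positive is an added assumption rather than part of the filing.

```python
# Assumed confidence-branch setup: MobileNetV3 with its last layer replaced by a
# full-connection layer whose output dimension equals that of the identity coefficient.
import torch.nn as nn
from torchvision.models import mobilenet_v3_large

id_dim = 80  # assumed dimension of the identity coefficient alpha
conf_net = mobilenet_v3_large(weights=None)
conf_net.classifier[-1] = nn.Sequential(
    nn.Linear(conf_net.classifier[-1].in_features, id_dim),
    nn.Softplus(),  # assumption: keep the confidences positive for the weighted average
)
```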
S2-3, a reconstruction and rendering module:
s2-3-1: after obtaining the identity, expression, texture, gesture and illumination coefficients of the face through the coefficient prediction network (respectively, alpha, beta, delta, p and gamma in fig. 2), the three-dimensional face reconstruction is firstly required to be carried out by combining a three-dimensional deformable model, namely 3 DMM. In 3DMM, an arbitrary three-dimensional face can be expressed as two major parts, namely, a shape S and a texture T:
Figure BDA0004154922210000071
Figure BDA0004154922210000072
wherein the method comprises the steps of
Figure BDA0004154922210000073
And->
Figure BDA0004154922210000074
Average face shape and average texture for 3DMM model, B id ,B exp And B is connected with tex And respectively representing the identity, expression and texture substrates of the face subjected to PCA dimension reduction in the 3 DMM. By combining the above formulas with the identity, expression and texture coefficients alpha, beta and delta predicted in the step S2-1, a three-dimensional face model corresponding to the input image can be obtained.
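The following sketch assembles the shape and texture from the two formulas above; the basis matrices and coefficient dimensions are assumptions typical of 3DMMs such as the Basel Face Model, and the code is illustrative rather than the patent's implementation.

```python
# Assemble a 3DMM face from predicted coefficients; array shapes are assumptions
# (V vertices flattened to 3V values; 80/64/80 identity/expression/texture bases).
import numpy as np

def reconstruct_face(mean_shape, B_id, alpha, B_exp, beta, mean_tex, B_tex, delta):
    """S = S_bar + B_id*alpha + B_exp*beta ;  T = T_bar + B_tex*delta."""
    S = mean_shape + B_id @ alpha + B_exp @ beta   # vertex positions, flattened (3V,)
    T = mean_tex + B_tex @ delta                   # per-vertex texture, flattened (3V,)
    return S, T

# Toy usage with random bases (V = 4 vertices; realistic models use tens of thousands).
V = 4
S, T = reconstruct_face(np.zeros(3 * V), np.random.randn(3 * V, 80), np.random.randn(80),
                        np.random.randn(3 * V, 64), np.random.randn(64),
                        np.zeros(3 * V), np.random.randn(3 * V, 80), np.random.randn(80))
```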
S2-3-2: and (3) for the three-dimensional face model obtained in the last step. And three face models can be rendered to a two-dimensional image by utilizing a pinhole camera model and a spherical harmonic illumination model which are contained in the differential renderer and combining the attitude coefficient p and the illumination coefficient gamma. The two-dimensional image is used for the calculation of the subsequent loss function.
S3: calculating a loss function
The loss functions used for training are constructed from the face identity coefficients predicted in step S2 and the finally obtained two-dimensional rendered image.
S3-1: face region illumination loss function. In the T input images, for the T-th image I, a rendered image is calculated
Figure BDA0004154922210000087
Difference between the face pixel values to obtain a loss function L p The method comprises the following steps:
Figure BDA0004154922210000081
wherein A is the face region mask obtained in the step S1-3.
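A hedged PyTorch sketch of this masked photometric loss is given below; the per-pixel L2 over the colour channels and the batch handling are assumptions consistent with the formula above.

```python
# Face-region illumination (photometric) loss: colour difference inside the skin
# mask A, normalised by the mask area. A sketch of L_p, not the filed implementation.
import torch

def photometric_loss(img, rendered, mask, eps=1e-8):
    # img, rendered: (B, 3, H, W); mask: (B, 1, H, W) with 1 inside the face region
    diff = torch.norm(img - rendered, dim=1, keepdim=True)  # per-pixel L2 over RGB
    return (mask * diff).sum() / (mask.sum() + eps)
```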
S3-2: face key point loss function. The face 2D key point difference between the input image and the rendered image is calculated, namely:
L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

where P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
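A corresponding sketch of the key point loss follows; averaging over both the n key points and the batch is an assumption.

```python
# Face key point loss: mean squared distance between the 2D key points detected on
# the input image and the projected key points of the rendered face (sketch of L_lmk).
import torch

def landmark_loss(P, P_hat):
    # P, P_hat: (B, n, 2) key point coordinates
    return ((P - P_hat) ** 2).sum(dim=-1).mean()
```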
S3-3: identity consistency loss function combined with confidence weight.
The invention provides this loss function to handle the situation in which the face confidences of multiple input images differ due to pose, blur, occlusion and the like. For T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces. First, the identity-coefficient confidences output by the confidence prediction network in step S2 are combined to construct a pseudo label ᾱ for the T identity coefficients (as in Fig. 4), namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

where c_t is the confidence of the t-th input image and ⊙ denotes element-wise multiplication (the division is likewise element-wise). The T identity coefficients are then constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²
the identity coefficient pseudo tag obtained by calculating the confidence coefficient and the identity coefficient is compared with the mode of directly calculating the average value of the identity coefficient by the existing method, so that the influence caused by differences between different input images can be reduced, and the decoupling between the identity and the expression of the face can be better realized.
S4: and (3) performing deep learning model training on the preprocessed and divided data set by using the coefficient prediction network model constructed in the step S2. And updating the weight of the model by adopting a random gradient descent method through the loss function designed in the step S3 until the loss function value is converged, and storing the network weight of the model. And finally, testing and evaluating the verification set and the test set of the data set.
In summary, this embodiment proposes a facial expression capturing method based on multi-scale feature fusion, which solves the difficulty of capturing fine facial expressions through multi-scale feature fusion and achieves better decoupling of the face identity and expression through the identity consistency loss function combined with confidence weights. Table 1 shows an experimental comparison between the method of the invention and existing expression capturing methods on the FEAFA face dataset; the method of this embodiment surpasses the existing best method on the expression coefficient prediction metric.
TABLE 1
(The table is reproduced as an image in the original filing and is not included here.)
Note: Table 1 compares the present invention with other methods on different facial expressions of the FEAFA dataset; the values are the mean absolute error (MAE) between the predicted and ground-truth expression coefficients.
The embodiment also provides a facial expression capturing device based on multi-scale feature fusion, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The facial expression capturing device based on the multi-scale feature fusion can execute any combination implementation steps of the facial expression capturing method based on the multi-scale feature fusion, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the facial expression capturing method based on the multi-scale feature fusion, and when the instructions or programs are run, the steps can be implemented by any combination of the executable method embodiments, so that the method has corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of the present specification, reference has been made to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. The facial expression capturing method based on multi-scale feature fusion is characterized by comprising the following steps of:
acquiring a face image;
inputting the obtained facial image into a trained expression capturing model, and outputting a facial coefficient;
the expression capture model comprises a coefficient prediction network, wherein the coefficient prediction network comprises a backbone network, a full-connection layer and a multi-scale feature fusion module; the multi-scale feature fusion module is used for fusing image features of different stages of a backbone network and predicting identity coefficients, expression coefficients and texture coefficients through the fused features.
2. The facial expression capturing method based on multi-scale feature fusion according to claim 1, wherein the expression capturing model is obtained by training in the following manner:
acquiring face data, preprocessing the face data, and acquiring a training set;
constructing an expression capturing model, wherein the expression capturing model consists of a coefficient prediction network, a confidence prediction network and a reconstruction and rendering module;
training the constructed expression capturing model according to the training set and the loss function, and removing a confidence prediction network and a reconstruction and rendering module after training the expression capturing model to obtain a trained expression capturing model;
wherein the coefficient prediction network is used for learning image features and predicting face coefficients: T face images I with the same identity and different expressions are input into the coefficient prediction network, which outputs the identity coefficient α, expression coefficient β, texture coefficient δ, pose coefficient p and illumination coefficient γ of the face;
the confidence prediction network is used for predicting respective confidence degrees for a plurality of face images input during training so as to optimize identity consistency loss; the reconstruction and rendering module is used for reconstructing a three-dimensional face by utilizing the predicted face coefficient and combining the three-dimensional deformable model and rendering the three-dimensional face into a two-dimensional image.
3. The facial expression capturing method based on multi-scale feature fusion according to claim 1 or 2, wherein the coefficient prediction network works as follows:
the full-connection layer receives the output of the backbone network and outputs the pose coefficient p and the illumination coefficient γ;
let X_{l-1} and X_l denote the output features of the previous stage and the current stage of the backbone network; the multi-scale feature fusion module fuses the features in the following manner:
the previous-stage feature X_{l-1} is downsampled to obtain a feature X̂_{l-1};
the downsampled feature X̂_{l-1} is concatenated with the current-stage feature X_l, and the features are fused to obtain the multi-scale fused feature X_f.
4. The facial expression capturing method based on multi-scale feature fusion according to claim 2, wherein the reconstruction and rendering module works as follows:
constructing a three-dimensional face model according to the identity coefficient α, the expression coefficient β and the texture coefficient δ predicted by the coefficient prediction network;
rendering the constructed three-dimensional face model to a two-dimensional image by combining the pose coefficient p and the illumination coefficient γ;
wherein the obtained two-dimensional image is used for the calculation of the loss function.
5. The facial expression capturing method based on multi-scale feature fusion according to claim 4, wherein the three-dimensional facial model is expressed as two major parts of a shape S and a texture T:
S = S̄ + B_id α + B_exp β

T = T̄ + B_tex δ

wherein S̄ and T̄ are the average face shape and the average texture of the three-dimensional face model, and B_id, B_exp and B_tex respectively denote the identity, expression and texture bases of the three-dimensional face model obtained by PCA dimensionality reduction.
6. A facial expression capture method based on multi-scale feature fusion according to claim 2, wherein the penalty function comprises an identity consistency penalty function combined with confidence weights;
by adding a confidence prediction network on the basis of the backbone network, the confidences c_1, …, c_T of the identity coefficients of the T input images are predicted, wherein each confidence c_t has the same dimension as the identity coefficient α;
the identity consistency loss function is calculated as follows:
for T input images with the same identity, the coefficient prediction network obtains identity coefficient predictions α_t (t = 1, …, T) for the T faces; combined with the identity-coefficient confidences output by the confidence prediction network, a pseudo label ᾱ for the T identity coefficients is constructed, namely:

ᾱ = (Σ_{t=1}^{T} c_t ⊙ α_t) / (Σ_{t=1}^{T} c_t)

the T identity coefficients are constrained to be close to the pseudo label, giving the loss function:

L_id = (1/T) Σ_{t=1}^{T} ||α_t − ᾱ||²

wherein c_t is the confidence of the t-th input image, α_t is the identity coefficient of the t-th input image, and ⊙ denotes element-wise multiplication (the division is likewise element-wise).
7. The facial expression capturing method based on multi-scale feature fusion according to claim 6, wherein the loss function further comprises a face region illumination loss function and a face key point loss function;
the expression of the illumination loss function of the face area is as follows:
L_p = Σ_i A_i ||I_i − Î_i|| / Σ_i A_i

wherein A is the face-region mask, I and Î are the input image and the rendered image, and i indexes image pixels;
the expression of the face key point loss function is as follows:

L_lmk = (1/n) Σ_{i=1}^{n} ||P_i − P̂_i||²

wherein P and P̂ are the 2D key point coordinates of the input image and the rendered image respectively, and n is the number of face key points.
8. The facial expression capturing method based on multi-scale feature fusion according to claim 2, wherein the preprocessing of the facial data to obtain the training set comprises:
cropping the images in the face data to a preset size;
dividing the cropped face data into three parts: a training set, a verification set and a test set;
carrying out face segmentation on the images in the training set to obtain a face region segmentation result of the images; and obtaining two-dimensional key point coordinates of the face by adopting a preset face key point detection method.
9. Facial expression capturing device based on multiscale feature fusion, characterized by comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-8.
10. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-8 when being executed by a processor.
CN202310331030.XA 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion Pending CN116434303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310331030.XA CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310331030.XA CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Publications (1)

Publication Number Publication Date
CN116434303A true CN116434303A (en) 2023-07-14

Family

ID=87086607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310331030.XA Pending CN116434303A (en) 2023-03-30 2023-03-30 Facial expression capturing method, device and medium based on multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN116434303A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117218499A (en) * 2023-09-29 2023-12-12 北京百度网讯科技有限公司 Training method of facial expression capturing model, facial expression driving method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination