WO2024032464A1

WO2024032464A1 - Three-dimensional face reconstruction method, apparatus, and device, medium, and product

Info

Publication number: WO2024032464A1
Application number: PCT/CN2023/111005
Authority: WO
Inventors: 靳凯
Original assignee: 广州市百果园信息技术有限公司
Priority date: 2022-08-12
Filing date: 2023-08-03
Publication date: 2024-02-15
Also published as: CN115330947A

Abstract

A three-dimensional face reconstruction method, apparatus, and device, a medium, and a product. The method comprises: acquiring face image data, and extracting a face image in the face image data; then performing key point detection on the face image to obtain a face area image of an area where face key points are located; performing bilinear modeling of a face identity and a face expression on the face area image by using a bilinear modeling layer of a three-dimensional face reconstruction network pre-trained to a convergence state, so as to obtain a parameterized three-dimensional face model; and finally, mapping the face area image into corresponding parameter coefficients in the parameterized three-dimensional face model by using a parameter mapping layer of the three-dimensional face reconstruction network.

Description

Three-dimensional face reconstruction method and its devices, equipment, media and products

This application claims priority to the Chinese patent application with application number 202210969989.1, which was submitted to the China Patent Office on August 12, 2022. The entire content of this application is incorporated into this application by reference.

Technical field

This application relates to the field of image processing technology, for example, to a three-dimensional face reconstruction method and its device, equipment, media, and products.

Background technique

The evolution of basic network technology has promoted the development of digital people, virtual characters and 3D images. Its application in related fields such as film and television, games, and education has sharply increased the demand for three-dimensional virtual character generation technology. Among them, three-dimensional face reconstruction technology is more important in the field of three-dimensional virtual character generation technology.

Traditional 3D face reconstruction methods are based on 3DMM (3D Morphable Models, 3DMM) priors, which rely on visual signals. The deviation of visual signals can easily lead to weak generalization, so more samples are needed for training. In addition, expression migration that relies on key points is prone to unnatural expressions and a strong sense of non-realism.

To sum up, in related technologies, 3DMM based on neural networks needs rich and accurate training data to achieve better reconstruction results, which means the training cost is high. Secondly, it is difficult to effectively transfer expressions, that is, it cannot obtain three-dimensional expressions that accurately express expressions. Face images.

Contents of the invention

This application provides a three-dimensional face reconstruction method and its corresponding devices, equipment, non-volatile readable storage media, and computer program products.

According to one aspect of this application, a three-dimensional face reconstruction method is provided, including the following steps:

Obtain facial image data and extract facial images;

Perform key point detection on the face image to obtain a face area image of the area where the face key points are located;

Use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to perform bilinear modeling of facial identity and facial expression on the face area image to obtain a parameterized three-dimensional face model;

The parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the Corresponding parameter coefficients in the parameterized three-dimensional face model, the parameter coefficients include an identity coefficient corresponding to the facial identity and an expression coefficient corresponding to the facial expression.

According to another aspect of the present application, a three-dimensional face reconstruction device is provided, including:

The image acquisition module is configured to acquire face image data and extract the face image therein;

A face detection module configured to perform key point detection on the face image and obtain a face region image of the area where the face key points are located;

The face modeling module is configured to use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to a converged state to perform bilinear modeling of facial identity and facial expression on the face area image, and obtain Parametric three-dimensional face model;

A parameter mapping module configured to use the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image into corresponding parameter coefficients in the parameterized three-dimensional face model, where the parameter coefficients include the The identity coefficient corresponding to the facial identity and the expression coefficient corresponding to the facial expression.

According to another aspect of the present application, a three-dimensional face reconstruction device is provided, including a central processor and a memory. The central processor is configured to call and run a computer program stored in the memory to execute the three-dimensional face reconstruction method described in the present application. Steps of face reconstruction method.

According to another aspect of the present application, a non-volatile readable storage medium is provided, which stores a computer program implemented according to the three-dimensional face reconstruction method in the form of computer-readable instructions, and the computer program is When the computer calls the runtime, it executes the steps involved in the method.

According to another aspect of the present application, a computer program product is provided, including a computer program/instructions that, when executed by a processor, implement the steps of the method described in any embodiment of the present application.

Description of drawings

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed to be used in the description of the embodiments will be briefly introduced below. The drawings in the following description are only some embodiments of the present application. For those in the field, Ordinary technicians can also obtain other drawings based on these drawings without exerting creative work.

Figure 1 is a schematic flow chart of an embodiment of the three-dimensional face reconstruction method of the present application;

Figure 2 is a flowchart of an embodiment of the exemplary scenario application of the three-dimensional face reconstruction method of the present application. intention;

Figure 3 is a schematic diagram of the expression migration results of the three-dimensional face model in the embodiment of the present application;

Figure 4 is a schematic flowchart of obtaining a face area image in an embodiment of the present application;

Figure 5 is a schematic diagram of the results of obtaining a three-dimensional face model in an embodiment of the present application;

Figure 6 is a schematic flowchart of parameter mapping for facial feature maps in an embodiment of the present application;

Figure 7 is a schematic flowchart of training a three-dimensional face reconstruction network in an embodiment of the present application;

Figure 8 is a schematic diagram of the training framework used in the three-dimensional face reconstruction network method in the embodiment of the present application;

Figure 9 is a schematic flow chart of the calculation of the reconstruction loss function in the embodiment of the present application;

Figure 10 is a functional block diagram of the three-dimensional face reconstruction device of the present application;

Figure 11 is a schematic structural diagram of a three-dimensional face reconstruction device used in this application.

Detailed ways

The models cited or that may be cited in this application include traditional machine learning models or deep learning models. Unless expressly specified, they can be deployed on a remote server and remotely called on the client, or they can be deployed on a client with competent device capabilities. Direct call, in some embodiments, when it is run on the client, its corresponding machine intelligence can be obtained through transfer learning, so as to reduce the requirements for client hardware running resources and avoid excessive occupation of client hardware running resources.

Please refer to Figure 1. According to a three-dimensional face reconstruction method provided by this application, in one embodiment, it includes the following steps:

Step S1100: Obtain facial image data and extract facial images therein;

Face image data refers to image data with human face parts. This type of image data can be obtained through authorized live broadcast, on-demand and other legal channels. It can be video stream data or image data.

In one embodiment, when a real person performs a live broadcast in the form of a digital person, the image data of the real person needs to be collected in real time through a camera, and then sent to the backend server for further processing to generate a digital person image and replace it with the image data of the real person. describe the real person in the image data, and finally output the image data with the image of the digital person to the display terminal device facing the audience. In the above embodiment, the collected image data of real people can be used as the face image data.

In another embodiment, in some film and television works, it is often necessary to replace real people with digital human images to generate works with corresponding styles. In this embodiment, the video data that has been shot can be stored in the server, and the relevant technical personnel can capture the image data with the target person in it, and then replace it with the image data. The corresponding digital human image is generated, and finally the corresponding image file is generated. The image data with the target person can be used as the face image data.

In another embodiment, some advertising posters need to use digital human images to attract the masses. To serve this purpose, an image with a real person is first captured by a camera, and then handed over to relevant technical personnel to generate a digital image of the corresponding style. human image to replace the real person in the image. In the above embodiment, the image with a real person is the face image data.

The above embodiments are some examples of the facial image data in exemplary application scenarios. Therefore, the face image data may be a kind of video stream data or a kind of image data. To meet the needs of this application, it is necessary to further extract the face images in the face image data, that is, when the face image data is video stream data, each frame of the image is extracted as a face image; when When the face image data is image data, the face image data is a face image.

It is worth noting that the extracted face images need to be in a unified format, which can be YUV420 format, or RGB24 format, or YUV444 format, or other similar encoding formats. The unification of its image data format can make the interfaces for subsequent operations consistent, facilitating unified processing and rapid completion.

Step S1200: Perform key point detection on the face image to obtain a face region image of the area where the face key points are located;

After obtaining the face image, face detection and face key point detection are performed to detect and obtain the face area image and face key points in the face image. Optionally, for the face image, a face detection model pre-trained to a converged state is used to perform face detection to obtain face target frame information. The face target frame information includes the upper left point and the lower right point of the face part. Point coordinate information. According to the face target frame information, the image of the corresponding area position is intercepted from the face image, which is the face area image, which eliminates the interference of redundant image information in non-face areas and has the ability to focus more on face information. Features. In one embodiment, for the face region image, a face key point detection model pre-trained to a converged state is used to perform face key point detection to obtain face key point information. The face key points are key points pointing to the face part in the face area image, which can represent the location of the key areas of the face, such as eyebrows, eyes, nose, mouth, facial contour, etc.

After obtaining the face area image and face key points, a standard alignment operation also needs to be performed. In one embodiment, a preset standard three-dimensional face model can be projected into a two-dimensional plane, and the standard face key point information on the two-dimensional plane can be obtained accordingly, and the face key points and the standard face key points can be obtained. The points are aligned and matched to obtain standard transformation parameters, and the face area image is transformed into a face area image with standard size and angle according to the standard transformation parameters.

Step S1300: Use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to perform bilinear modeling of facial identity and facial expression on the face area image to obtain a parameterized three-dimensional dimensional face model;

The 3D face reconstruction network includes a two-layer structure. The first layer is a bilinear modeling layer, based on a parameterized 3D face model, used to decouple facial identity and facial expressions for the face region image. Modeling, the corresponding identity coefficient and expression coefficient need to be further determined; the second layer is the parameter mapping layer, which is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model , the parameter coefficients include an identity coefficient corresponding to the facial identity and an expression coefficient corresponding to the facial expression.

In the bilinear modeling layer, a parameterized face model is first determined as the three-dimensional face model to be optimized; in one embodiment, the parameterized face model can be BFM (Basel Face Model , BFM) model. The BFM model is based on the 3DMM (3D Morphable Models, 3DMM) statistical model. According to the principle of 3DMM, each face is a superposition of shape vectors and texture vectors. In another embodiment, which is an exemplary application example in this application, a 3DMM based on a bilinear model is used as a parameterized face model, and its parameterized representation can be:
core_tensor=vertex*identity*expression

Among them, vertex represents the face grid vertex, identity represents the identity coefficient, expression represents the expression coefficient, and core_tensor represents the tensor representation of the three-dimensional face model grid vertex.

Compared with the traditional 3DMM, the 3DMM based on the bilinear model uses coefficient multiplication to decouple the identity information and expression information of the face for modeling, and can realize the separate application of the identity coefficient and the expression coefficient to realize the expression. Migration etc. In one embodiment, people with different identities and the same expression can be represented by a set of different identity coefficients and the same expression coefficient. In another embodiment, people with the same identity and different expressions can be represented by a set of the same identity coefficients and different expression coefficients.

To explain the modeling itself in more detail, the 3DMM based on the bilinear model defines the representation of the face as the core_tensor in the above, which is the weighted combination of all three-dimensional face models in the preset three-dimensional face model library, It can be expressed uniformly as:

B ₀ =U ₀ , B = [U ₁ -U ₀ ,U ₂ -U ₀ ,…,U _m -U ₀ ]

Among them, U _i B _i ∈R ^n×(l+1) , α∈R ^m×1 , n is the base number, l is the number of expressions, and m is the number of identities.

Then the corresponding mesh vertices mapped to the three-dimensional space can be expressed as:
f ₀ + f _α

In the current embodiment, the three-dimensional face model database can be used by relevant technical personnel according to actual applications. Set according to scenarios and actual business needs. In the exemplary application, this application pre-constructs a three-dimensional face model database with a number of 79, with 46 types of expressions, that is, the vector dimension of the identity coefficient in the face model is 79. The vector dimension of the expression coefficient is 46. In other application scenarios, the number of the three-dimensional face model database, the number of expression types, the vector dimensions of the identity coefficient and the vector dimension of the expression coefficient can be adjusted according to the actual application scenario, without affecting the actual application of the method. .

Step S1400: Use the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image into corresponding parameter coefficients in the parameterized three-dimensional face model. The parameter coefficients include the parameters corresponding to the face The identity coefficient corresponding to the identity and the expression coefficient corresponding to the facial expression.

The parameter mapping layer of the three-dimensional face reconstruction network is the second layer structure of the three-dimensional face reconstruction network, which is used to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model. .

The face area image contains all the information of the target face, such as the identity information that represents the identity of the face, the expression information that represents the facial expression, etc. Therefore, the mapping relationship between it and the identity coefficient and expression coefficient in the three-dimensional face model is constructed. feasible. In addition, texture parameters, lighting parameters, posture parameters and transformation parameters can all be expressed in the face area image, and it is also feasible to construct corresponding mapping relationships based on these parameters.

Therefore, a mapping relationship can be constructed between the face area image and the identity parameters, expression parameters, texture parameters, lighting parameters, posture parameters, and transformation parameters, so that the identity coefficient, Expression coefficient, texture coefficient, lighting coefficient, posture coefficient, transformation coefficient, etc.

In one embodiment, the encoder in the three-dimensional face reconstruction network is first used to perform feature extraction on the face area image to obtain the depth features of the face area image, which is called a face feature map; secondly, Perform spatial mapping on the facial feature map to obtain all parameter coefficients, including: identity coefficient, expression coefficient, texture coefficient, lighting coefficient, posture coefficient, transformation coefficient, where the identity coefficient and expression coefficient are the double Parameter coefficients corresponding to identity parameters and expression parameters in the linear modeling layer.

Various coefficients can be called as needed for three-dimensional face reconstruction to obtain a three-dimensional face model corresponding to the face area image, which can be output as a result of three-dimensional face reconstruction. In one embodiment, the parameter coefficients corresponding to each face image, including the identity coefficients and expression coefficients, can be stored independently for later use, and can be used to arbitrarily combine to construct different three-dimensional face models, so as to obtain different effects of people. face image. For example, one identity coefficient is combined with multiple expression coefficients to generate face images corresponding to different expressions of the same person, or one expression coefficient is combined with multiple different identity coefficients to generate face images corresponding to the same expressions of different characters. face images, etc. In another embodiment, after using the parameter mapping layer of the three-dimensional face reconstruction network to map the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model, the method includes:

Three-dimensional reconstruction is performed according to the parameter coefficients to obtain a three-dimensional face model of the face area image.

In one embodiment, the identity coefficient and the expression coefficient among the parameter coefficients are used to construct the corresponding three-dimensional face model. Therefore, the above process of the present application is performed according to a face area image to obtain parameterized By using the three-dimensional face model and the identity coefficient and expression coefficient, a three-dimensional face model that effectively reflects the identity information and expression information of the face area image can be obtained.

In one embodiment, after the three-dimensional face model is reconstructed, its grid representation needs to be further determined to complete the reconstruction of the corresponding face in the three-dimensional space. Therefore, to model the three-dimensional face mesh, first define T as [V, I ₇₉ , E ₄₆ ], where V is the vertex mesh, I is the identity coefficient, and E is the expression coefficient, then the three-dimensional face The grid can be expressed as:
V _x =T×σ _exp (F _g(x) )×σ _id (F _g(x) )

_Among _them _, _V _{_} The expression coefficient output by the parameter mapping layer, σ _id (F _g(x) ) represents the identity coefficient output by the parameter mapping layer in the three-dimensional face reconstruction network.

It is worth noting that usually different faces have the same number of three-dimensional mesh vertices.

Compared with related technologies, this application uses the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to target the face after obtaining the face area image of the area where the key points of the face are located in the face image. The regional image performs bilinear modeling of identity information and expression information to obtain a parameterized three-dimensional face model; then the parameter mapping layer of the three-dimensional face reconstruction network is used to map the facial region image to the parameterized The corresponding parameter coefficients in the three-dimensional face model are used to complete the reconstruction of the three-dimensional face model. The three-dimensional face reconstruction method uses a bilinear modeling layer to decouple the identity information and expression information in the face, thereby effectively separating the expression parameters and realizing expression migration, which can greatly promote live broadcast, film and television, and animation and other related industries; secondly, the three-dimensional face reconstruction network is suitable for training using a weakly supervised learning method based on a single image, which can greatly reduce the acquisition cost and labeling cost of training data, and contribute to scale application.

Based on any of the above embodiments, please refer to Figure 2. The parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model. Afterwards, the method further includes:

Step S1500: Obtain the target parameter coefficients required to constitute the parameterized three-dimensional face model, where the target parameter coefficients include pre-specified identity coefficients and pre-specified expression coefficients;

The parameterized three-dimensional face model is constructed in the bilinear modeling layer of the three-dimensional face reconstruction network, and its undetermined parameter coefficients are the identity coefficient and the expression coefficient. In an exemplary application of this application, the vector dimension of the identity coefficient is 79, and the vector dimension of the expression coefficient is 46. when After the pre-specified identity coefficient and the pre-specified expression coefficient are determined, the parameter coefficients of the parameterized three-dimensional face model are determined, that is, the three-dimensional face model corresponding to the face area image is reconstructed. .

Step S1600: Migrate the target parameter coefficients to the three-dimensional face model of the corresponding digital person to obtain the three-dimensional face model of the digital person;

The previous step completed the reconstruction of the three-dimensional face model of the face area image, but in actual application scenario requirements, it is more inclined to apply its digital image. In one embodiment, a digital person is used to replace the face part in the face area image, in order to replace the "real person" with a "digital person" for activities such as live broadcast or communication and interaction. In this scenario, the real-time emotional simulation of "digital people" has become an urgent problem to be solved. One solution is to migrate the real expressions of the "real person" to the "digital person" so that it can simultaneously express the emotions of the "real person". Therefore, in one embodiment, the bilinear modeling layer constructed by the present application can realize the decoupling of expression information, thereby migrating the expression coefficients in the three-dimensional face model of the "real person" to the "digital person". "In the three-dimensional face model in ", the expression migration from "real person" to "digital person" can be completed.

In actual application scenarios, in order to achieve expression migration from "real people" to "digital people", the number of identities and the number of expressions, that is, the vector dimensions of the identity coefficient and the expression coefficient should be consistent. As shown in Figure 3, on this basis, the expression coefficient corresponding to the "real person" can be directly replaced with the expression coefficient in the "digital person" three-dimensional face model, and the other parameters remain unchanged, and then we can obtain 3D face model of digital human after expression transfer.

Step S1700: Render and project the three-dimensional face model of the digital human into a two-dimensional image space to obtain a digital human image.

In the previous step, after obtaining the three-dimensional face model of the "digital human", three-dimensional rendering and projection are performed based on the illumination coefficient, posture coefficient, transformation coefficient obtained in step S1400, and the texture coefficient of the "digital human" itself. In the two-dimensional image space, the image of the "digital human" is obtained, that is, the expression migration from the face area image to the "digital human" image is completed. In one embodiment, in the video stream of the live broadcast platform, the face area image in the single-frame face image is obtained and replaced with the "digital human" image, so that the "digital human" can be broadcast simultaneously. This type of application is one of the scenarios where the expression migration function of the method is applied, and it can also be used in other scenarios.

According to the above embodiments, it can be seen that from the reconstruction of the three-dimensional face model of a real person to the expression migration of a digital person, the method is aimed at decoupling modeling of identity information and expression information, which can bring huge benefits to industries such as live broadcast, film and television, and digital image. It has great application value, and its expression migration application does not affect the changes of other face information.

Based on any of the above embodiments, please refer to Figure 4 to implement key point detection on the face image. Obtain the face area image of the area where the key points of the face are located, including:

Step S1210: Perform face key point detection on the face image to obtain the face area image and face key point information;

The face detection model pre-trained to the convergence state is used to perform face detection on the face image, and the face rectangular frame information in the face image is obtained. The face rectangular frame can calibrate the position and size of the face part in the face image, and the calibration result can be represented by a set with four coordinate elements, such as S _roi . Thereafter, the corresponding area image is selected from the face image according to the set, that is, the face area image is obtained. The face area image completely contains the face part, and redundant parts of other non-face areas in the face image are removed.
S _roi ={x ₁ ,y ₁ ,x ₂ ,y ₂ }

Among them, x ₁ and y ₁ represent the pixel coordinates of the upper left corner of the detected face part, and x ₂ and y ₂ represent the pixel coordinates of the lower right corner of the face part.

Use the face key point detection model pre-trained to the convergence state to detect the face area image obtained above to obtain the face key point information. The facial key points can represent the positions of key areas of the human face, such as eyebrows, eyes, nose, mouth, facial contours, etc. All results of the facial key points can be expressed as a set of points L ⁿ . Among them, n represents the number of face key points. The number can be set by relevant technical personnel according to actual needs, and can be 5, 30, 68, 106, 240, etc.
L ⁿ ={(x ₁ ,y ₁ ),(x ₂ ,y ₂ ),…,(x _n ,y _n )}

The face detection model and face key point detection model are implemented by neural network models. In practical applications, relatively excellent face detection models and face key point detection models in related technologies can be used.

Step S1220: Align the face key points with standard face key points to obtain standard alignment parameters. The standard face key points are corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model;

Due to the diversity of actual scenes, the face contours in the face area image have different angles and sizes, which can easily interfere with subsequent three-dimensional face parameter calibration work. Therefore, it is necessary to perform standard alignment on the face area images.

After obtaining the face key points in the face area image, the face key points are also detected from the standard face image projected from the standard three-dimensional face model to the two-dimensional plane, thereby obtaining the standard face key points. The standard three-dimensional face model can be preset by relevant technical personnel. Using the relative positions, scales, and angles of the standard face key points as standards, the face key points detected from the face area image are aligned to obtain corresponding standard transformation parameters. The method used in the alignment operation can be any minimization method such as PnP, least squares method, etc. In one embodiment of the present application, the PnP method is used. The standard transformation parameters include translation transformation parameters and scale transformation parameters.

Step S1230: Align the face area image according to the standard alignment parameter.

According to the standard transformation parameters, a standard transformation is performed on the face area image S _roi and the face key point L ⁿ . After the face region image is transformed, its size is adjusted to a preset size, which is 224×224×3 in one embodiment of the present application. Through the above operations, the aligned face area image can be obtained.

It is worth noting that after the facial key points undergo standard transformation, in one embodiment, the Hough transformation can be used to obtain the posture information of the three-dimensional facial model corresponding to the facial region image. The posture information of the three-dimensional face model includes pitch angle, roll angle and rotation angle.

According to the above embodiments, it can be seen that by performing face detection and facial key point detection on the object to be processed, and then performing standard transformation, interference caused by position offset and scale deviation, as well as redundancy in non-face areas can be eliminated. subsequent interference from residual information.

Based on any of the above embodiments, please refer to Figure 5. The parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding part of the parameterized three-dimensional face model. Parameter coefficients, including:

Step S1410: Use the encoder in the three-dimensional face reconstruction network to perform feature extraction on the face area image to obtain a face feature map;

After the bilinear modeling layer of the three-dimensional face reconstruction network determines the parameterized three-dimensional face model, the encoder pre-trained to convergence is used to perform feature extraction on the face area image obtained in step S1200, and obtain Facial feature map. The face feature map can reduce the interference of redundant information in non-face area images in the face image, thereby better extracting the semantic information of the face part.

The encoder is implemented by a neural network model. The neural network model can use a variety of relatively excellent feature extraction models in related technologies, including: VGG16 model, VGG19 model, InceptionV3 model, Xception model, MobileNet model, AlexNet model, LeNet Model, ZF_Net model, ResNet18 model, ResNet34 model, ResNet_50 model, ResNet_101 model, ResNet_152 model, etc., are all mature feature extraction models. The feature extraction model is a neural network model that has been trained to convergence. In one embodiment, it is trained to convergence on the ImageNet large-scale data set.

The output of the encoder is set to a feature map. In one embodiment of the present application, the encoder directly outputs the feature map of the last convolutional layer, which is called a face feature map. The input size of the encoder is defined as N×C×H×W, and the output size is N×C'×H'×W', where N represents the number of samples, C represents the number of channels, and H×W represents the preset image. Size, C' represents the number of features, H'×W' represents the feature map size.

Step S1420: Perform spatial mapping on the facial feature map to obtain parameter coefficients in the bilinear modeling layer;

In this step, the above facial feature map is spatially mapped to obtain the parameter system of the three-dimensional facial model. numbers and related parameter coefficients for 3D rendering and 2D projection.

It should be noted that the space mapping includes semantic space mapping and parameter space mapping. The semantic space mapping maps the face feature map into a face feature vector. The face feature vector contains all the depth semantic information in the face image, which is the face identity semantic information, expression semantic information, and texture semantic information. , comprehensive representation of illumination semantic information, posture semantic information, and transformation semantic information. The parameter space mapping maps the face feature vector to the corresponding parameter subspace, thereby obtaining the coefficients of its corresponding parameters. The parameter space includes a face identity parameter space, an expression parameter space, a texture parameter space, and an illumination parameter space. , attitude parameter space, transformation parameter space.

The facial feature map is processed through the above-mentioned semantic space mapping and parameter space mapping to obtain identity coefficients, expression coefficients, texture coefficients, illumination coefficients, posture coefficients, and transformation coefficients. Among them, the identity coefficient and expression coefficient are used to reconstruct the three-dimensional face model of the face area image; the texture coefficient, lighting coefficient, posture coefficient, and transformation coefficient are used for three-dimensional rendering and two-dimensional projection.

From the above embodiments, it is easy to understand that the parameter mapping layer of the three-dimensional face reconstruction network first extracts the face feature map in the face area image, and then maps it to the semantic space to extract its semantic feature vector, and then It is mapped to different parameter spaces respectively in order to obtain the coefficients in the corresponding parameter space; it can make full use of the identity information, expression information, texture information, lighting information, posture information, and transformation information in the face area image without introducing Other additional information achieves the purpose of integrated modeling of 3D face reconstruction and rendering projection.

Based on any of the above embodiments, please refer to Figure 6. The spatial mapping is performed on the facial feature map to obtain the parameter coefficients in the bilinear modeling layer, including:

Step S1421: Perform semantic space mapping on the facial feature map to obtain a facial feature vector;

The facial feature map is N×C'×H'×W', where N represents the number of samples, C' represents the number of features, and H'×W' represents the size of the feature map.

Semantic space mapping is performed on the facial feature map x. In one embodiment, global pooling is used:
F _g (x)＝gloabl_pooling(x)＝x′[N,C′]

The F _g (x) contains rich information describing the characteristic information of the face, including identity information, shape information, texture information, lighting information, posture information and transformation information.

The F _g (x) after the semantic space mapping is a feature vector, that is, a face feature vector, represented by x'[N,C'].

Step S1422: Perform parameter space mapping on the facial feature vector to obtain parameter coefficients in the bilinear modeling layer.

In one embodiment, a corresponding number of parameter space mapping layers are designed to map the facial feature vectors into corresponding parameter subspaces for optimization, and obtain coefficients of corresponding parameters.

Specifically it can be expressed as:
F _all (x)＝{σ _id (F _g(x) ),σ _exp (F _g(x) ),σ _texture (F _g(x) ),σ _light (F _g(x) ),σ _pose ( F _g(x) ),σ _transition (F _g(x) )}

where σ(x) represents a learnable mapping function: σ(x)=Wx+b, where W represents the learnable weight, which varies with its mapping relationship in different parameter subspaces, and b represents the learnable weight Bias varies with different mapping relationships in different parameter subspaces. Here σ _id represents the learning of identity coefficients. The same person should have similar coefficient representations, and different people have different coefficient representations. The parameter size can be described as [C′,79]; σ _exp represents the learning of expression coefficients. The same expression People should have similar coefficients, such as closing eyes, opening mouth, curling lips, etc. People with different expressions should have different coefficients. For example, closing eyes and opening eyes should be inconsistent in a specific shape. The parameter size can be described as [C′, 46]; σ _texture represents the learning of texture coefficients, which are used to model real textures, and their parameters are described as [C′,79]. σ _light is used to estimate the current facial illumination, and its parameters are described as [C′,27], which represents the basis coefficients of 27 spherical harmonics. σ _pose is used to estimate the pose of the human face and contains three sub-parameters yaw, pitch and roll, corresponding to roll, pitch and rotation respectively. σ _transition is used to estimate the transformation of the three-dimensional face space, so it contains the transformation coefficients of the three axes of x, y, and z.

According to the understanding of the above embodiments, the decoupled modeling based on the bilinear modeling layer in the three-dimensional face reconstruction network can separately model the identity information and expression information, which is helpful for the scenario application of expression migration and drives Development of expression generation applications in related industries. At the same time, the spatial mapping in the parameter mapping layer is used to map the face area image with the three-dimensional face model parameters and rendering projection parameters, making full use of the feature information of the input face area image to provide more convenient acquisition of parameter coefficients. effective way.

The input of the three-dimensional face reconstruction network of this application is a face region image, and its output is a three-dimensional face model. In this application, a framework corresponding to the weakly supervised learning mechanism is constructed for the three-dimensional face reconstruction network, and the training of the three-dimensional face reconstruction network is completed. As shown in Figure 7, it shows a schematic diagram of the principle of the framework corresponding to the weakly supervised learning mechanism used to train the three-dimensional face reconstruction network of the present application. The three-dimensional face reconstruction network is trained according to this framework. Therefore, based on any of the above embodiments, please refer to Figure 8. The training process of the three-dimensional face reconstruction network includes:

Step S2100: Obtain a single sample of the preprocessed face image data;

The facial image data refers to image data with human face parts. This type of image data can be obtained through authorized live broadcast, on-demand and other legal channels. In one embodiment, it can be video stream data. Its video storage formats can be diverse, including MP4, avi, rmvb, x264, etc. In another embodiment, it may also be image data. The image data content may include indoor, outdoor, news media, sports and entertainment and other scenes, including natural scenes.

The data storage format of the image data is inconsistent due to various data sources, including RGB24, YUV444, YUV420 and other formats. In order to realize the automated application of the related technologies in this application, the data storage formats are unified. In one embodiment, image data from different sources can be converted into a unified YUV420 format. In another embodiment, image data from different sources can also be converted into a unified RGB24 format, or YUV444 format, or others. The above-mentioned preprocessing method is applied to the training and application of relevant technical methods in this application, unifying various data formats into one to improve the efficiency of technical applications without affecting its performance.

From the preprocessed face image data, whether it is video stream data or image data, one face image with a face part is extracted as a single sample for subsequent processing.

Step S2200: Obtain the face area image, face key points and three-dimensional face model posture coefficients in the single sample;

The face area image, face key points and three-dimensional face model posture coefficients are extracted from the single sample in the same manner as in step S1200 above. Specifically, the face detection model pre-trained to the convergence state is used to detect the single sample, the face rectangular frame information is obtained, and the face area image is further obtained; and then the face key point detection model pre-trained to the convergence state is used to detect the face key point detection model. Describe the face area image to obtain face key point information; align the face area image S _roi and the face key point information L ⁿ according to standard alignment parameters; finally, use Hough transform calculation on the face key points Obtain three-dimensional face pose information Y _pose .

It should be noted that the face region image is used as the input of the three-dimensional face reconstruction network, and the face key points and three-dimensional face pose information are used to calculate the loss value.

Step S2300: Use the three-dimensional face reconstruction network to reconstruct and obtain a three-dimensional face model of the face area image, and obtain a face reconstruction image through rendering and projection into two dimensions;

The bilinear modeling layer of the three-dimensional face reconstruction network is used to perform decoupled modeling of identity information and expression information, and the parameter mapping layer of the three-dimensional face reconstruction network is used to obtain the identity coefficient, expression coefficient, texture coefficient, and lighting Coefficients, attitude coefficients, transformation coefficients. The identity coefficient and expression coefficient are used to reconstruct a three-dimensional face model to obtain the face area image.

The three-dimensional rendering and two-dimensional projection of the three-dimensional face model include the following operations: estimating the surface texture of the face, assuming in advance that the face is a Lambertian surface, and using spherical harmonics to approximate the scene lighting, which can be combined with the face surface discovery and Skin texture σ _texture (F _g(x) ) to calculate the radiance of the vertex where Φ represents the spherical harmonic function basis function.

The three-dimensional rendering work of the three-dimensional face model can be completed on it, and then the camera system transformation of the face is performed, using the posture parameter σ _pose (F _g(x) ) and the transformation parameter σ _transition (F _g(x) ), Combined with the camera perspective model to perform translation and rotation changes on the three-dimensional face, it can be projected into a two-dimensional plane to obtain all the projection points L _x of the face vertices, which can be expressed as [N _v ,2], where 2 represents the x, y plane coordinate information. It should be noted that the face projection has completed the relevant transformation from the world coordinate system to the pixel coordinate system, and it matches the relevant positions of the standard face key points. At this point, the projection of the three-dimensional face model into the two-dimensional plane is completed, and the reconstructed face image is obtained.

Assuming that the input face area image is x, the reconstructed face image after rendering and projection after three-dimensional face reconstruction can be expressed as:
R(x)＝Render(F _id ,F _exp ,F _ill ,F _albedo ,F _pose ,F _transition )

Step S2400: Calculate a reconstruction loss value based on the face area image and the face reconstruction image, and update the parameters of the three-dimensional face reconstruction network based on the reconstruction loss value;

Construct a reconstruction loss function and calculate the error between the face area image and the face reconstruction image. In one embodiment, the three-dimensional reconstruction loss function is a weighted sum of four sub-loss functions: the first sub-loss function is a perceptual loss function, used to minimize the face area image and the face reconstruction image The second sub-loss function is the photometric loss function, which is used to enhance the shape and pixel-level alignment between the face area image and the face reconstruction image; the third sub-loss function is the posture loss function , used to ensure higher accuracy of the posture; the fourth sub-loss function is the reprojection loss function, used to optimize the accuracy of the projection point. The weighted sum of the above sub-loss values is the reconstruction loss value of the three-dimensional face reconstruction network under the current iteration number, that is, the error L(x).

After the error L(x) is calculated, the relevant weights can be updated according to the back propagation mechanism of the neural network.

The updated weight part is mainly the weight of the space mapping in the parameter mapping layer in the three-dimensional face reconstruction network, that is, the semantic space mapping component and the parameter space mapping component.

The direction of the weight update is a direction that makes the error L(x) smaller.

Step S2500: Repeat the above operations until the preset termination condition is triggered to end the training, and obtain the three-dimensional face reconstruction network.

Repeat the above steps, namely: obtain the sample-->obtain the face reconstruction image-->calculate the error-->update parameters. Training can be terminated until the training conditions reach the preset termination condition, indicating that the training has reached convergence. The preset termination condition can be set by relevant technical personnel according to actual application scenario requirements. In one embodiment, it can be a constraint on the number of iterations, that is, training is terminated when the number of training times reaches a preset number. In another embodiment, it can be Loss value constraint, that is, when the reconstruction loss value reaches the preset minimum value during the iterative training process, Terminate training.

According to the above embodiments, it is not difficult to understand that the weakly supervised learning mechanism based on a single face image can construct training data in large quantities at low cost, thereby effectively reducing the acquisition cost and labeling cost of training samples, which is beneficial to the rapid research and development of related technologies. Provides powerful motivation. In addition, this method can decouple and obtain facial expression models for expression migration applications, such as film and television, animation, digital humans and other related fields, which has great practical application value and commercial value.

Based on any of the above embodiments, please refer to Figure 9. Calculating the reconstruction loss value based on the aligned face area image and the reconstructed face image includes:

Step S2410: Calculate a first loss value, which is used to minimize the error between the face area image and the face reconstruction image;

The first loss value is calculated based on the depth perception of the face area image and the face reconstruction image. That is, a neural network with mature perceptual capabilities is used to pre-extract the semantic features of the face area image and the face reconstruction image, and then calculate the correlation loss value based on the semantic features.

Optionally, self-supervised modeling is first performed on the reconstructed face image. In one embodiment, a face recognition network pre-trained to a converged state is introduced to extract the top-level depth of the reconstructed face image and the face region image. feature. It should be noted that the face recognition network can use mature neural network models in related technologies, and face recognition models such as VGGNet, FaceNet, and ArcFaceNet can be used for self-supervised training. In the embodiment of this application, the ArcFaceNet network can be used, which has better effects.

Define the face area image as x, the reconstructed face reconstruction image as R(x), and the face recognition model as E(x). The perceptual loss function can be expressed as:

The above similarity loss function is used to constrain the network model so that the reconstructed face is close to the real face, and the surface texture features and lighting parameters are optimized.

Step S2420: Calculate a second loss value, the second loss value is used to enhance the shape and pixel level alignment between the face area image and the face reconstruction image;

The first loss value implicitly constrains the approximate relationship of the face feature layer. In order to further strengthen the alignment of the shape and pixel level, a second loss value is added to strengthen the face region image and the face reconstruction. Shape and pixel level alignment between images, which can be expressed as:

This constraint has a strong pixel-level constraint. Therefore, in one embodiment, a smaller weight w _photo is given to prevent the network from falling into a local solution.

Step S2430: Calculate a third loss value. The third loss value is used to ensure that the posture has higher accuracy;

The first loss value implicitly constrains and optimizes the pose. In order to further ensure that the posture has higher accuracy, the third loss value is calculated. In one embodiment, the three-dimensional face model posture coefficients in step S2200 are used as marker data, and L1 norm loss is used for numerical constraints and minimization:
L _pose ＝||σ _pose (F _g(x) )-Y _pose || ¹

Among them, σ _pose (F _g(x) )∈R ³ is the posture coefficient obtained in the forward reasoning of the three-dimensional face reconstruction network, including roll angle, pitch angle and rotation angle, and Y _pose ∈R ³ represents It is the posture coefficient of the three-dimensional face model obtained in step S2200, and also includes the roll angle, pitch angle and rotation angle.

Step S2440: Calculate a fourth loss value, which is used to optimize the accuracy of projection points in two-dimensional projection;

In order to further optimize the accuracy of face vertex mesh modeling, the fourth loss value can also be used for model constraints. Optional, face key point data extracted based on the sample and reprojected points after 3D rendering and 2D projection after 3D face reconstruction to construct reprojection error constraints. The number of vertices is consistent with the number of two-dimensional facial key point detection.

This constrains the accuracy of the projected points.

Step S2450: Calculate a reconstruction loss value, which is a weighted fusion of the first loss value, the second loss value, the third loss value, and the fourth loss value.

Weighted fusion is performed based on the four sub-loss functions constructed in the above steps. In one embodiment of this application, the complete network training loss function can be expressed as:
L(x)＝w _percep L _percep (x,R(x))+L _photo (x,R(x))+w _pose L _pose (x)+w _proj L _proj (x)

Among them, w _percep , w _pose and w _proj represent the weights expressed as the first loss value, the third loss value and the fourth loss value respectively.

According to the above embodiments, it is easy to understand that the weighted fusion reconstruction loss value calculation based on the first loss value, the second loss value, the third loss value, and the fourth loss value can more comprehensively constrain the three-dimensional face reconstruction network. All parameters obtained in the method are close to the real label values. At the same time, loss calculation and parameter update based on single samples can accelerate convergence and save training costs.

Please refer to Figure 10, a three-dimensional face reconstruction device provided according to one aspect of the present application, a In the embodiment, it includes an image acquisition module 1100, a face detection module 1200, a face modeling module 1300, and a parameter mapping module 1400. The image acquisition module 1100 is configured to acquire face image data and extract face images therein; The face detection module 1200 is configured to perform key point detection on the face image to obtain a face region image of the area where the key points of the face are located; the face modeling module 1300 is configured to use a three-dimensional model pre-trained to a converged state. The bilinear modeling layer of the face reconstruction network performs bilinear modeling of facial identity and facial expression on the face region image to obtain a parameterized three-dimensional face model; the parameter mapping module 1400 is set to use The parameter mapping layer of the three-dimensional face reconstruction network maps the face region image to the corresponding parameter coefficients in the parameterized three-dimensional face model, and the parameter coefficients include the identity corresponding to the face identity. coefficient and the expression coefficient corresponding to the facial expression.

In one embodiment, the parameter mapping module 1400 includes: a coefficient acquisition unit configured to obtain target parameter coefficients required to constitute the parameterized three-dimensional face model, wherein the target parameter coefficients include prespecified The identity coefficient and the pre-specified expression coefficient; the expression migration unit is configured to migrate the target parameter coefficients to the three-dimensional face model of the corresponding digital person to obtain the three-dimensional face model of the digital person; the rendering projection unit is configured to transfer the three-dimensional face model of the digital person. The three-dimensional face model of the digital human is rendered and projected into the two-dimensional image space to obtain the digital human image.

In one embodiment, the face detection module 1200 includes: a face detection unit configured to detect face key points on the face image to obtain face area images and face key point information; standard alignment A unit configured to align the face key points with the standard face key points to obtain standard alignment parameters. The standard face key points are corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model; face An alignment unit configured to align the face area image according to the standard alignment parameters.

In one embodiment, the modeling projection module 1400 includes: a feature encoding unit configured to use the encoder in the three-dimensional face reconstruction network to perform feature extraction on the face area image to obtain a face feature map ; A spatial mapping unit configured to perform spatial mapping on the facial feature map to obtain parameter coefficients in the bilinear modeling layer.

In one embodiment, the spatial mapping unit includes: a semantic space mapping subunit, which is configured to perform semantic space mapping on the facial feature map to obtain a facial feature vector; and a parameter space mapping subunit, which is configured to perform semantic space mapping on the facial feature map. The facial feature vector is mapped in parameter space to obtain the parameter coefficients in the bilinear modeling layer.

In one embodiment, the network training module includes: a sample acquisition unit configured to acquire a single sample of preprocessed face image data; a data acquisition unit configured to acquire a face region image in the single sample , face key points and three-dimensional face model posture coefficients; the reconstruction image unit is configured to use the three-dimensional face reconstruction network to reconstruct the three-dimensional face model to obtain the face area image, and then render and project it into two dimensions to obtain the human face model. face reconstruction image; a loss optimization unit configured to calculate a reconstruction loss value according to the face area image and the face reconstruction image, and perform the reconstruction loss on the three-dimensional face according to the reconstruction loss value. The face reconstruction network performs parameter updates; the training repetition unit is set to repeat the above operations until the preset termination condition is triggered and the training ends, and the three-dimensional face reconstruction network is obtained.

In one embodiment, the loss optimization unit includes: a first loss subunit configured to calculate a first loss value, the first loss value being used to minimize the relationship between the face region image and the face reconstruction Error between images; a second loss subunit configured to calculate a second loss value, the second loss value being used to enhance the shape and pixel-level alignment between the face region image and the face reconstruction image ; The third loss subunit is configured to calculate a third loss value, the third loss value is used to ensure that the posture has higher accuracy; the fourth loss subunit is configured to calculate a fourth loss value, the fourth loss value The value is used to optimize the accuracy of the projection points in the two-dimensional projection; the loss fusion subunit is set to calculate the reconstruction loss value, which is the first loss value, the second loss value, the third loss value, and the fourth loss value. Weighted fusion of values.

Another embodiment of the present application also provides a three-dimensional face reconstruction device. As shown in Figure 11, a schematic diagram of the internal structure of the three-dimensional face reconstruction device. The three-dimensional face reconstruction device includes a processor, a computer-readable storage medium, a memory and a network interface connected through a system bus. Among them, the computer-readable non-volatile readable storage medium of the three-dimensional face reconstruction device stores an operating system, a database and computer-readable instructions. The database can store information sequences, and the computer-readable instructions are processed by the processor. When executed, the processor can be enabled to implement a three-dimensional face reconstruction method.

The processor of the three-dimensional face reconstruction device is used to provide computing and control capabilities to support the operation of the entire three-dimensional face reconstruction device. Computer-readable instructions may be stored in the memory of the three-dimensional face reconstruction device. When executed by the processor, the computer-readable instructions may cause the processor to execute the three-dimensional face reconstruction method of the present application. The network interface of the three-dimensional face reconstruction device is used to connect and communicate with the terminal.

Those skilled in the art can understand that the structure shown in Figure 11 is only a block diagram of part of the structure related to the solution of the present application. A specific three-dimensional face reconstruction device may include more or fewer components than shown in the figure. Or combine certain parts, or have different parts arrangements.

In this embodiment, the processor is used to execute the specific functions of each module in Figure 10, and the memory stores program codes and various types of data required to execute the above modules or sub-modules. The network interface is used to realize data transmission between user terminals or servers. The non-volatile readable storage medium in this embodiment stores the program codes and data required to execute all modules in the three-dimensional face reconstruction device of the present application. The server can call the server's program codes and data to execute the functions of all modules. .

This application also provides a non-volatile readable storage medium storing computer-readable instructions. When the computer-readable instructions are executed by one or more processors, they cause one or more processors to execute any embodiment of the application. The steps of the 3D face reconstruction method.

The present application also provides a computer program product, which includes a computer program/instruction that implements the steps of the method described in any embodiment of the present application when executed by one or more processors.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments of the present application can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a non-volatile readable storage medium. , when the program is executed, it may include the processes of the above-mentioned method embodiments. Among them, the aforementioned storage media can be computer-readable storage media such as magnetic disks, optical disks, read-only memory (Read-Only Memory, ROM), or random access memory (Random Access Memory, RAM), etc.

To sum up, this application can achieve three-dimensional face reconstruction. The three-dimensional face reconstruction method uses a bilinear modeling layer to decouple the identity information and expression information in the face, thereby effectively separating the expression parameters. , realizing expression migration can greatly promote the application and development of related live broadcast, film and television and other industries; secondly, the training method of the method is based on weakly supervised learning of a single image, which can greatly reduce the acquisition cost and labeling cost of training data. , conducive to large-scale application.

Claims

A three-dimensional face reconstruction method, including:

Obtain facial image data and extract facial images;

Perform key point detection on the face image to obtain a face area image of the area where the face key points are located;

Use the bilinear modeling layer of the three-dimensional face reconstruction network pre-trained to the convergence state to perform bilinear modeling of facial identity and facial expression on the face area image to obtain a parameterized three-dimensional face model;

The parameter mapping layer of the three-dimensional face reconstruction network is used to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model, and the parameter coefficients include the parameters corresponding to the face identity. The identity coefficient and the expression coefficient corresponding to the facial expression.
The three-dimensional face reconstruction method according to claim 1, after using the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image to the corresponding parameter coefficients in the parameterized three-dimensional face model , the method includes:

Obtain target parameter coefficients required to constitute the parameterized three-dimensional face model, wherein the target parameter coefficients include a pre-specified identity coefficient and a pre-specified expression coefficient;

Migrate the target parameter coefficients to the three-dimensional face model of the corresponding digital person to obtain the three-dimensional face model of the digital person;

The three-dimensional face model of the digital human is rendered and projected into a two-dimensional image space to obtain a digital human image.
The three-dimensional face reconstruction method according to claim 1, wherein performing key point detection on the face image to obtain the face area image of the area where the face key points are located includes:

Perform face key point detection on the face image to obtain the face area image and face key point information;

Align the face key points with standard face key points to obtain standard alignment parameters. The standard face key points are corresponding face key points obtained by two-dimensional projection of a standard three-dimensional face model;

The face region images are aligned according to the standard alignment parameters.
The three-dimensional face reconstruction method according to claim 1, wherein the parameter mapping layer of the three-dimensional face reconstruction network is used to map the face region image to the corresponding parameters in the parameterized three-dimensional face model. coefficients, including:

Using the encoder in the three-dimensional face reconstruction network to perform feature extraction on the face area image to obtain a face feature map;

Perform spatial mapping on the facial feature map to obtain parameter coefficients in the bilinear modeling layer.
The three-dimensional face reconstruction method according to claim 4, wherein the face feature map is Line space mapping is used to obtain parameter coefficients in the bilinear modeling layer, including:

Perform semantic space mapping on the facial feature map to obtain a facial feature vector;

Perform parameter space mapping on the facial feature vector to obtain parameter coefficients in the bilinear modeling layer.
The three-dimensional face reconstruction method according to any one of claims 1 to 5, wherein the training process of the three-dimensional face reconstruction network includes:

Obtain a single sample of preprocessed face image data;

Obtain the face area image, face key points and three-dimensional face model posture coefficients in the single sample;

Use the three-dimensional face reconstruction network to reconstruct and obtain a three-dimensional face model of the face area image, and then render and project it into two dimensions to obtain the face reconstruction image;

Calculate a reconstruction loss value based on the face area image and the face reconstruction image, and update the parameters of the three-dimensional face reconstruction network based on the reconstruction loss value;

Repeat the above operations until the preset termination condition is triggered and the training ends, and the three-dimensional face reconstruction network is obtained.
The three-dimensional face reconstruction method according to claim 6, wherein calculating the reconstruction loss value based on the aligned face area image and the reconstructed face image includes:

Calculate a first loss value, the first loss value being used to minimize the error between the face region image and the face reconstruction image;

Calculate a second loss value, the second loss value being used to enhance the shape and pixel-level alignment between the face region image and the face reconstruction image;

Calculate a third loss value, the third loss value is used to ensure that the posture has higher accuracy;

Calculating a fourth loss value, the fourth loss value being used to optimize the accuracy of the projected points in the two-dimensional projection;

Calculate a reconstruction loss value, which is a weighted fusion of the first loss value, the second loss value, the third loss value, and the fourth loss value.
A three-dimensional face reconstruction device, including:

The image acquisition module is configured to acquire face image data and extract the face image therein;

A face detection module configured to perform key point detection on the face image and obtain a face region image of the area where the face key points are located;

Face modeling module, set up to use a dual-line 3D face reconstruction network pre-trained to a converged state The sexual modeling layer performs bilinear modeling of facial identity and facial expression on the face region image to obtain a parameterized three-dimensional face model;

A parameter mapping module configured to use the parameter mapping layer of the three-dimensional face reconstruction network to map the face area image into corresponding parameter coefficients in the parameterized three-dimensional face model, where the parameter coefficients include the The identity coefficient corresponding to the facial identity and the expression coefficient corresponding to the facial expression.
A three-dimensional face reconstruction device, including a central processor and a memory. The central processor is configured to call and run a computer program stored in the memory to perform the method according to any one of claims 1 to 7. step.
A non-volatile readable storage medium, the non-volatile readable storage medium stores a computer program implemented according to the method according to any one of claims 1 to 7 in the form of computer readable instructions, When the computer program is called and run by the computer, the steps included in the corresponding method are executed.
A computer program product, comprising a computer program/instructions which, when executed by a processor, implements the steps of the method of any one of claims 1 to 7.