CN116563457A - Three-dimensional face reconstruction method based on CLIP model - Google Patents

Three-dimensional face reconstruction method based on CLIP model

Info

Publication number
CN116563457A
CN116563457A (application CN202310376661.3A)
Authority
CN
China
Prior art keywords
face
model
loss function
feature
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310376661.3A
Other languages
Chinese (zh)
Inventor
包永堂
周鹏飞
肖欣菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN202310376661.3A priority Critical patent/CN116563457A/en
Publication of CN116563457A publication Critical patent/CN116563457A/en
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a three-dimensional face reconstruction method based on a CLIP model, which comprises the following steps: S1, acquiring a coarse feature representation by means of mask-based pre-training; S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module; S3, acquiring 3DMM parameters with a feature classifier; S4, fitting the 3DMM parameters with a BFM model to generate a three-dimensional face model; S5, rendering the 3D face model into a 2D image with a differentiable renderer; S6, optimizing the model with a loss function. The technical scheme of the invention addresses the low face reconstruction accuracy and the scarce facial geometric detail of the prior art.

Description

Three-dimensional face reconstruction method based on CLIP model
Technical Field
The invention relates to the technical field of computer vision and computer graphics, in particular to a three-dimensional face reconstruction method based on a CLIP model.
Background
In recent years, 3D face reconstruction from a single image has received increasing attention from researchers. Vetter et al. (Volker Blanz, Thomas Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the ACM SIGGRAPH Annual Conference, 1999, pp. 187-194.) first proposed the 3D morphable model (3DMM) algorithm. In the more than twenty years since, 3DMM-based methods have developed rapidly and remain the most widely used. With the advent of deep learning, some supervised 3D face reconstruction methods use deep convolutional networks to predict 3DMM parameters in place of traditional optimization, obtaining better reconstruction results. However, face data with 3D ground truth is not readily available, so unsupervised and weakly supervised learning methods have been studied extensively and have achieved acceptable results. Tewari et al. (A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In ICCV, 2017.) proposed using photometric loss as a supervisory signal during training to recover facial texture. Genova et al. (K. Genova, F. Cole, A. Maschinot, et al. Unsupervised training for 3D morphable model regression. In CVPR, 2018, pp. 8377-8386.) used a face recognition network to improve the fidelity of face reconstruction. Deng et al. (Y. Deng, J. Yang, S. Xu, et al. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In CVPRW, 2019, pp. 285-295.) employed landmark loss to improve the accuracy of face reconstruction. Shang et al. (J. Shang, T. Shen, S. Li, L. Zhou, M. Zhen, T. Fang, and L. Quan. Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency. In ECCV, 2020, pp. 53-70.) proposed a depth loss to improve face alignment accuracy. These methods continue to explore the effect of different losses on 3D face reconstruction, but they overlook facial geometric details. In summary, these methods can only reconstruct rough geometry and unfaithful textures, and cannot recover geometric details.
Some methods in the prior art can be used to recover detailed face shapes. Feng et al. (Y. Feng, H. Feng, M. J. Black, and T. Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. ACM TOG, 40(4):88:1-88:13, 2021.) proposed the Detailed Expression Capture and Animation (DECA) method, which learns common geometric details from multiple face images and generates a geometric displacement map rich in geometric details. However, the geometric displacement map learned by this method is inaccurate, and the generated geometric details are not faithful. Daněček et al. (R. Daněček, M. J. Black, T. Bolkart. EMOCA: Emotion Driven Monocular Face Capture and Animation. In CVPR, 2022.) proposed the EMOtion Capture and Animation (EMOCA) method, which uses a deep expression consistency loss to learn the geometric details of facial expressions. This method can effectively recover the geometric details of facial expressions, but it cannot generate a 3D face shape with a sense of realism. Therefore, methods that recover facial geometric details with a displacement map have difficulty learning accurate geometric details and lack geometric realism. Existing work cannot effectively capture geometric details and semantic attributes, so the generated 3D faces have few geometric details and rough textures. Furthermore, we observe that EMOCA uses an expression network to obtain more geometric details of facial expressions. We therefore consider that a powerful semantic representation network can learn geometric details and semantic attributes to guide a coarse 3D face model toward recovering more geometric details and realistic facial expressions. To this end, we exploit the strong representation capability of the CLIP (Contrastive Language-Image Pre-training) model to learn geometric details and semantic features. The CLIP model is trained on 400 million text-image pairs and can effectively acquire fine-grained semantic features. StyleCLIP (O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, pp. 2085-2094, 2021.) shows that the CLIP model can capture the geometric and semantic attributes of a human face.
Therefore, there is a need for a CLIP-model-based three-dimensional face reconstruction method with higher face reconstruction accuracy and more pronounced facial geometric details.
Disclosure of Invention
The invention mainly aims to provide a three-dimensional face reconstruction method based on a CLIP model, so as to solve the problems of low face reconstruction accuracy and scarce facial geometric details in the prior art.
In order to achieve the above purpose, the present invention provides a three-dimensional face reconstruction method based on a CLIP model, which specifically includes the following steps:
S1, acquiring a coarse feature representation by means of mask-based pre-training.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module.
S3, acquiring 3DMM parameters with a feature classifier, wherein the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code.
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model.
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model.
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Further, in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
Further, in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module.
Further, the step S2 specifically includes the following steps:
s2.1, given the roughness characteristics F0.
S2.2, using a 1×1 convolution layer to reduce the feature dimension and obtain a 256-dimensional feature vector; this process is defined as c = C(F_0), where C(·) denotes the 1×1 convolution layer.
S2.3, learning local high-frequency features and global semantic features with a parallel depth-separable residual block and a Transformer encoder, respectively; this process is defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
Further, the face model of step S4 is expressed as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

where s̄ and t̄ denote the average shape and average texture of the 3D face; A_id, A_exp and A_tex denote the principal component bases of face shape, expression and texture, respectively; and α, β and t denote the face shape, expression and texture parameters used to fit and generate the 3D face.
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

where R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
Further, in step S6, the photometric loss function is used to make the generated texture and skin tone approach those of the input image; the photometric loss function is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

where M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm.
The landmark loss function is used for weakly supervised learning and measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

where k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1.
The identity loss function is used to generate a realistic facial geometry image. An ArcFace network is trained on the VGGFace2 dataset; the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and finally the cosine similarity of the deep features is computed. The identity loss function is defined as:

L_id = 1 - <F(I_s), F(I_r)>    (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product.
The regularization loss function is used to prevent degradation of the 3D face shape, and the regularization loss is defined as:

L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

where α, β and δ denote the shape, expression and texture parameters, respectively.
Further, in step S6, the geometric feature consistency loss function is used to recover the geometric details of the face; it is defined as:
L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm.
The semantic feature consistency loss function is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model; it can be defined as:

L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

where CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity.
The feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
further, the optimization of all losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
The invention has the following beneficial effects:
1. The parameter refinement module is used to learn rich feature representations, so as to accelerate model convergence and estimate accurate face model parameters; it adopts a parallel Transformer encoder and a depth-separable residual block to learn global semantic features and local geometric features, respectively.
2. The feature fusion module provided by the invention fuses the global semantic features and the local geometric features learned by the parameter refinement module into a fine-grained feature representation, and the feature classifier then linearly classifies the fine-grained features into the different 3DMM parameters.
3. The feature consistency loss function of the invention captures geometric details using the strong representation power of the CLIP model, thereby recovering the texture, skin tone and local geometric details of the face.
4. The method provided by the invention comprises the parameter refinement module, the feature fusion module and the feature consistency loss function, and therefore achieves higher face reconstruction accuracy and more pronounced facial geometric details than existing single-image three-dimensional face reconstruction algorithms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 shows a flowchart of a three-dimensional face reconstruction method based on a CLIP model according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made apparent and fully in view of the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The three-dimensional face reconstruction method based on the CLIP model shown in fig. 1 specifically comprises the following steps:
s1, acquiring rough characteristic representation by adopting a mask pre-training mode.
Specifically, in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module. The parameter refinement module aims to learn rich feature representations from a single image so as to accelerate model convergence and estimate accurate face model parameters.
Specifically, in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module. Step S2 specifically includes the following steps: S2.1, given the coarse feature F_0; S2.2, a 1×1 convolution layer is used to reduce the feature dimension, yielding a 256-dimensional feature vector, a process defined as c = C(F_0); S2.3, local high-frequency features and global semantic features are learned with a parallel depth-separable residual block and a Transformer encoder, respectively, a process defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
The depth-separable residual block adopts residual-connected convolution operations to learn local facial detail features. It first applies batch normalization to the extracted coarse features to speed up network convergence and mitigate overfitting. Linear inference is then performed with two 3×3 convolution layers (SConv), where the first convolution layer uses grouped convolution to reduce the number of parameters, and each convolution operation is followed by a ReLU activation function. The depth-separable residual block thus captures as many local detail features as possible with as few parameters as possible. The Transformer encoder consists of three Transformer blocks and learns global semantic features from the coarse feature representation: the extracted coarse features are fed into the Transformer encoder, and each Transformer block uses a multi-head attention network to enhance global interactions among the coarse features and learn rich semantic features. The feature fusion module fuses the local detail features and the global semantic features learned by the parameter refinement module.
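For illustration only, the following is a minimal PyTorch sketch of a parameter refinement module of this kind; the class names, the channel sizes other than the 256-dimensional reduction, the number of attention heads, and the way the two branches are fused and pooled are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class DepthSepResBlock(nn.Module):
    # Residual block: batch norm, a grouped 3x3 convolution, a plain 3x3 convolution,
    # each followed by ReLU, with a skip connection (a sketch of the "SConv" block).
    def __init__(self, channels: int = 256, groups: int = 8):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(self.bn(x)))
        y = self.relu(self.conv2(y))
        return x + y  # residual connection

class ParameterRefinementModule(nn.Module):
    # 1x1 conv reduces F_0 to 256 channels; a 3-block Transformer encoder learns
    # global semantics while the depth-separable residual block learns local detail;
    # the branches are summed, approximating F_c = Cat(T(c) + DW(c)).
    def __init__(self, in_channels: int = 2048, dim: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, dim, kernel_size=1)
        self.local = DepthSepResBlock(dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

    def forward(self, f0):
        c = self.reduce(f0)                              # B x 256 x H x W
        b, d, h, w = c.shape
        tokens = c.flatten(2).transpose(1, 2)            # B x (H*W) x 256
        t = self.encoder(tokens).transpose(1, 2).reshape(b, d, h, w)
        fused = t + self.local(c)                        # T(c) + DW(c)
        return fused.mean(dim=(2, 3))                    # pooled fine-grained feature F_c
```

The global-average pooling at the end is purely an assumption made so that a fixed-length vector can be handed to the feature classifier of step S3.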
S3, acquiring 3DMM parameters with a feature classifier: the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code composed of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code, namely an 80-dimensional shape parameter α, a 64-dimensional expression parameter β, an 80-dimensional texture parameter t, a 6-dimensional pose parameter ρ and a 27-dimensional illumination parameter l.
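A sketch of the feature classifier of step S3, under the assumption that a single linear layer maps the pooled F_c to the 257-dimensional code; the split sizes follow the dimensions listed above.

```python
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    # Maps the fine-grained feature F_c to the 257-d 3DMM parameter code and splits it
    # into shape (80), expression (64), texture (80), pose (6) and illumination (27).
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 257)

    def forward(self, fc_feat):
        code = self.fc(fc_feat)
        alpha, beta, tex, pose, light = torch.split(code, [80, 64, 80, 6, 27], dim=-1)
        return alpha, beta, tex, pose, light
```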
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model.
Specifically, the BFM dataset is used as prior knowledge of the face to generate the three-dimensional face model. The BFM dataset consists of scanned facial data of 200 subjects, from which a PCA model generates three-dimensional data representing face shape and texture. The 3D faces generated from the BFM dataset share the same topology, and the semantic meaning of each triangular patch is fixed. To recover facial expressions, the FaceWarehouse dataset is used to generate 3D facial expressions. The face model of step S4 is expressed as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

where s̄ and t̄ denote the average shape and average texture of the 3D face; A_id, A_exp and A_tex denote the principal component bases of face shape, expression and texture, respectively; and α, β and t denote the face shape, expression and texture parameters used to fit and generate the 3D face.
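A sketch of formula (1) with pre-loaded BFM/FaceWarehouse arrays; the dictionary keys, array layouts and the NumPy representation are assumptions made for illustration.

```python
import numpy as np

def fit_3dmm(alpha, beta, tex, bfm):
    # bfm holds the mean shape/texture and principal-component bases:
    # shape_mean, tex_mean: (3N,); id_basis: (3N, 80); exp_basis: (3N, 64); tex_basis: (3N, 80)
    shape = bfm["shape_mean"] + bfm["id_basis"] @ alpha + bfm["exp_basis"] @ beta   # geometry part of eq. (1)
    texture = bfm["tex_mean"] + bfm["tex_basis"] @ tex                              # texture part of eq. (1)
    return shape.reshape(-1, 3), texture.reshape(-1, 3)  # N vertices and per-vertex colors
```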
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, which is the most commonly used projection method for rendering a scene, and the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

where R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
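Formula (2) can be sketched as follows; the final division by depth (the usual perspective divide to obtain pixel coordinates) is an assumed convention, since the patent only states the camera-space transform.

```python
import numpy as np

def project_vertices(vertices, R, T, f):
    # v = f x R x S_3d + T, then perspective divide to image coordinates.
    # vertices: (N, 3); R: (3, 3) rotation; T: (3,) translation; f: focal length.
    cam = f * (vertices @ R.T) + T
    return cam[:, :2] / cam[:, 2:3]   # (N, 2) projected vertex positions
```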
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model.
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Specifically, in step S6, the photometric loss function is used to reduce the per-pixel difference between the rendered image and the input image, so that the generated texture and skin tone approach those of the input image. The photometric loss function is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

where M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm.
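A PyTorch sketch of formula (3); the skin mask M_I is assumed to be supplied as a 0/1 tensor (for example from a face-parsing step, which is not described in the patent), and averaging over the masked pixels is also an assumption.

```python
import torch

def photometric_loss(i_s, i_r, mask):
    # L_photo = || M_I ⊙ (I_s - I_r) ||_1, averaged over masked pixels.
    # i_s, i_r: (B, 3, H, W) input and rendered images; mask: (B, 1, H, W).
    diff = mask * (i_s - i_r)
    return diff.abs().sum() / (3.0 * mask.sum().clamp(min=1.0))
```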
To align the 2D face and the 3D face, weakly supervised learning is performed with the landmark loss, which measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

where k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1.
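A sketch of formula (4); which keypoints receive the weight 20 is not spelled out in the text, so the indices are left to the caller.

```python
import torch

def landmark_loss(k_gt, k_proj, heavy_idx=()):
    # k_gt, k_proj: (B, 68, 2) detected and projected 2D keypoints.
    # heavy_idx: indices of the keypoints weighted 20; all others are weighted 1.
    w = torch.ones(k_gt.shape[1], device=k_gt.device)
    if len(heavy_idx) > 0:
        w[list(heavy_idx)] = 20.0
    per_point = ((k_gt - k_proj) ** 2).sum(dim=-1)   # squared distance per keypoint
    return (w * per_point).mean()
```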
The identity loss function is used to generate a more realistic facial geometry image by extracting deep features of the face image from a high-dimensional perspective with a face recognition network. Because the ArcFace network achieves high face recognition accuracy on face datasets, it is used as the facial deep-feature extractor. The ArcFace network is trained on the VGGFace2 dataset; the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and finally the cosine similarity of the deep features is computed. The identity loss function is defined as:
L_id = 1 - <F(I_s), F(I_r)>    (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product.
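A sketch of formula (5), assuming `arcface` is a pre-trained ArcFace embedding network that returns a 512-d feature per image (loading that network is outside the sketch).

```python
import torch.nn.functional as F

def identity_loss(arcface, i_s, i_r):
    # L_id = 1 - <F(I_s), F(I_r)> on L2-normalized ArcFace embeddings.
    e_s = F.normalize(arcface(i_s), dim=-1)
    e_r = F.normalize(arcface(i_r), dim=-1)
    return 1.0 - (e_s * e_r).sum(dim=-1).mean()
```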
to prevent 3D face shape degradation, the present invention will be directed to 3DMM parameter constrained regularization. Experimental results show that regularization can effectively prevent the degradation of the facial shape parameters, the expression parameters and the texture parameters. Defining regularization loss as:
L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

where α, β and δ denote the shape, expression and texture parameters, respectively.
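The regularization term of formula (9) is direct; whether the norms are squared is not stated in the text, so this sketch follows the formula as written.

```python
def regularization_loss(alpha, beta, delta):
    # L_reg = ||alpha||_2 + ||beta||_2 + ||delta||_2 over the 3DMM parameter vectors.
    return alpha.norm(p=2) + beta.norm(p=2) + delta.norm(p=2)
```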
Inspired by CLIPasso, a feature consistency loss function is proposed to recover the geometric details of the face. The CLIP model effectively captures geometric detail features and high-level semantic features; it is trained on 400 million image-text pairs and has strong attribute representation capability. We also find that a feature consistency loss function based on the CLIP model captures facial geometric details more easily than a VGG-based perceptual loss function, and that while generative adversarial networks can recover facial texture details, their training is unstable. The feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function. The geometric feature consistency loss function measures the geometric similarity between the input image and the rendered image: the layer-2 and layer-3 feature maps are first extracted with the CLIP image encoder, and the mean squared error between the layer-2 and layer-3 feature maps of the input image and the rendered image is then computed.
Specifically, in step S6, the geometric feature consistency loss function is used to recover the geometric details of the face and is defined as:

L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm.
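A sketch of formula (6) that grabs the layer-2 and layer-3 feature maps of the RN50x4 CLIP image encoder with forward hooks; the layer2/layer3 module names follow the public OpenAI CLIP ModifiedResNet and are an assumption, as is the expectation that I_s and I_r are already resized and normalized for that encoder.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

clip_rn, _ = clip.load("RN50x4", jit=False)
for p in clip_rn.parameters():
    p.requires_grad_(False)

_feats = {}
clip_rn.visual.layer2.register_forward_hook(lambda m, i, o: _feats.__setitem__("l2", o))
clip_rn.visual.layer3.register_forward_hook(lambda m, i, o: _feats.__setitem__("l3", o))

def _layer_features(img):
    # Run the visual backbone once and return the hooked layer-2/3 feature maps.
    clip_rn.visual(img.type(clip_rn.dtype))
    return _feats["l2"], _feats["l3"]

def geometric_loss(i_s, i_r, weights=(1.0, 0.5)):
    # L_geometric: weighted squared error between layer-2/3 CLIP feature maps of I_s and I_r.
    fs = _layer_features(i_s)
    fr = _layer_features(i_r)
    return sum(w * F.mse_loss(a.float(), b.float())
               for w, a, b in zip(weights, fs, fr))
```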
The semantic feature consistency loss function enforces consistency of high-dimensional semantic attributes; it is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model. The CLIP image encoder first extracts the 512-dimensional feature vectors of the rendered image and the input image, and the cosine similarity of the two 512-dimensional feature vectors is then computed. It can be defined as:

L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

where CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity.
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
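A sketch of formulas (7) and (8) with the ViT-B/32 CLIP image encoder; `geometric_loss` is the sketch from the previous paragraph, and preprocessing of I_s and I_r to CLIP's input resolution and normalization is again assumed.

```python
import torch.nn.functional as F
import clip

clip_vit, _ = clip.load("ViT-B/32", jit=False)
for p in clip_vit.parameters():
    p.requires_grad_(False)

def semantic_loss(i_s, i_r):
    # L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r)) on the 512-d image embeddings.
    e_s = clip_vit.encode_image(i_s.type(clip_vit.dtype)).float()
    e_r = clip_vit.encode_image(i_r.type(clip_vit.dtype)).float()
    return 1.0 - F.cosine_similarity(e_s, e_r, dim=-1).mean()

def feature_consistency_loss(i_s, i_r):
    # L_cl = L_geometric + L_semantic, formula (8).
    return geometric_loss(i_s, i_r) + semantic_loss(i_s, i_r)
```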
specifically, the optimization of all losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
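Putting the pieces together, a sketch of the weighted objective in formula (10) that uses the loss sketches above and the weights listed in the text.

```python
def total_loss(i_s, i_r, mask, k_gt, k_proj, alpha, beta, delta, arcface):
    # L_all = λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg
    return (1.0      * photometric_loss(i_s, i_r, mask)
            + 2.0    * identity_loss(arcface, i_s, i_r)
            + 1.7e-3 * landmark_loss(k_gt, k_proj)
            + 2.0    * feature_consistency_loss(i_s, i_r)
            + 1e-4   * regularization_loss(alpha, beta, delta))
```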
Thereby reconstructing the three-dimensional face model.
The invention provides a three-dimensional face reconstruction method based on a CLIP model. First, a feature consistency loss is proposed to learn more geometric details: the CLIP model is used as supervisory information, and the similarity of geometric details between the input image and the rendered image is encouraged during training. The CLIP model can also capture the state of the eyes, addressing inaccurate eye closure without any dedicated eye-closure loss. Second, to accelerate model convergence and estimate accurate face model parameters, a parameter refinement module is proposed that uses a parallel Transformer encoder and a depth-separable residual block to learn global semantic features and local detail features, respectively. The feature fusion module fuses the global semantic and local detail features into a rich feature representation, accelerating convergence during training. Finally, to drive the 3D face model, the invention for the first time uses expression text to guide 3D facial expression transfer. Unlike DECA, which uses a reference image for expression transfer, the invention uses text to guide the reconstruction of a 3D face with a specific expression while preserving facial identity consistency. As a result, higher face reconstruction accuracy and more pronounced facial geometric details are achieved.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; rather, the invention is intended to cover modifications, adaptations, additions and alternatives falling within the spirit and scope of the invention.

Claims (8)

1. A three-dimensional face reconstruction method based on a CLIP model, characterized by comprising the following steps:
S1, acquiring a coarse feature representation by means of mask-based pre-training;
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module;
S3, acquiring 3DMM parameters with a feature classifier, wherein the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code;
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model;
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model;
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function, the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
2. The three-dimensional face reconstruction method based on the CLIP model according to claim 1, characterized in that in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
3. The three-dimensional face reconstruction method based on the CLIP model according to claim 2, characterized in that in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module.
4. The three-dimensional face reconstruction method based on the CLIP model according to claim 3, wherein the step S2 specifically comprises the following steps:
S2.1, given the coarse feature F_0;
S2.2, using a 1×1 convolution layer to reduce the feature dimension and obtain a 256-dimensional feature vector, this process being defined as c = C(F_0);
S2.3, learning local high-frequency features and global semantic features with a parallel depth-separable residual block and a Transformer encoder, respectively, this process being defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
5. The three-dimensional face reconstruction method according to claim 1, wherein the face model in step S4 is represented as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

wherein s̄ and t̄ represent the average shape and average texture of the 3D face; A_id, A_exp and A_tex represent the principal component bases of face shape, expression and texture, respectively; and α, β and t represent the face shape, expression and texture parameters used to fit and generate the 3D face;
the camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

wherein R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
6. The three-dimensional face reconstruction method according to claim 1, wherein in step S6, the photometric loss function is used to make the generated texture and skin tone approach those of the input image, and is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

wherein M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm;
the landmark loss function is used for weakly supervised learning and measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image; it is defined as:
L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

wherein k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1;
the identity loss function is used to generate a realistic facial geometry image: an ArcFace network is trained on the VGGFace2 dataset, the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and the cosine similarity of the deep features is finally computed; the identity loss function is defined as:
L_id = 1 - <F(I_s), F(I_r)>    (5)

wherein F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product;
the regularization loss function is used for preventing the 3D face shape from degrading, and the regularization loss is defined as follows:
L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

wherein α, β and δ denote the shape, expression and texture parameters, respectively.
7. The three-dimensional face reconstruction method according to claim 6, wherein in step S6, a geometric feature consistency loss function is defined as:
L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

wherein CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm;
the semantic feature consistency loss function is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model, and can be defined as:
L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

wherein CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity;
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
8. the three-dimensional face reconstruction method based on the CLIP model according to claim 7, wherein the optimization of all the losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

wherein λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
CN202310376661.3A 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model Pending CN116563457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310376661.3A CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310376661.3A CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Publications (1)

Publication Number Publication Date
CN116563457A true CN116563457A (en) 2023-08-08

Family

ID=87485263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310376661.3A Pending CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Country Status (1)

Country Link
CN (1) CN116563457A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993929A (en) * 2023-09-27 2023-11-03 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium
CN116993929B (en) * 2023-09-27 2024-01-16 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium

Similar Documents

Publication Publication Date Title
Feng et al. Learning an animatable detailed 3D face model from in-the-wild images
Daněček et al. Emoca: Emotion driven monocular face capture and animation
Wang et al. Detecting photoshopped faces by scripting photoshop
Kundu et al. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare
Fu et al. Deep ordinal regression network for monocular depth estimation
Zhou et al. Fully convolutional mesh autoencoder using efficient spatially varying kernels
Huynh et al. Mesoscopic facial geometry inference using deep neural networks
CN110991281B (en) Dynamic face recognition method
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
Chen et al. I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling
CN109448083A (en) A method of human face animation is generated from single image
Liu et al. Normalized face image generation with perceptron generative adversarial networks
Gao et al. Semi-supervised 3D face representation learning from unconstrained photo collections
Ji et al. SurfaceNet+: An end-to-end 3D neural network for very sparse multi-view stereopsis
CN116563457A (en) Three-dimensional face reconstruction method based on CLIP model
Zhang et al. Weakly-supervised multi-face 3d reconstruction
Li et al. Multi-attribute regression network for face reconstruction
Basak et al. 3D face-model reconstruction from a single image: A feature aggregation approach using hierarchical transformer with weak supervision
Chen et al. Transformer-based 3d face reconstruction with end-to-end shape-preserved domain transfer
Zheng et al. GCM-Net: Towards effective global context modeling for image inpainting
Ren et al. Facial geometric detail recovery via implicit representation
Lin et al. Single-shot implicit morphable faces with consistent texture parameterization
Yin et al. Segmentation-reconstruction-guided facial image de-occlusion
Zhao et al. Generative landmarks guided eyeglasses removal 3D face reconstruction
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination