CN116563457A - Three-dimensional face reconstruction method based on CLIP model - Google Patents

Three-dimensional face reconstruction method based on CLIP model

Info

Publication number
CN116563457A
CN116563457A (application CN202310376661.3A)
Authority
CN
China
Prior art keywords
face
model
loss function
feature
clip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310376661.3A
Other languages
Chinese (zh)
Inventor
包永堂
周鹏飞
肖欣菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN202310376661.3A priority Critical patent/CN116563457A/en
Publication of CN116563457A publication Critical patent/CN116563457A/en
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides a three-dimensional face reconstruction method based on a CLIP model, which comprises the following steps: S1, acquiring a coarse feature representation by means of mask-based pre-training; S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module; S3, acquiring 3DMM parameters with a feature classifier; S4, fitting the 3DMM parameters with a BFM model to generate a three-dimensional face model; S5, rendering the 3D face model into a 2D image with a differentiable renderer; S6, optimizing the model with a loss function. The technical scheme of the invention addresses the low face reconstruction accuracy and the scarce facial geometric detail of the prior art.

Description

Three-dimensional face reconstruction method based on CLIP model
Technical Field
The invention relates to the technical field of computer vision and computer graphics, in particular to a three-dimensional face reconstruction method based on a CLIP model.
Background
In recent years, 3D face reconstruction from a single image has received increasing attention from researchers. Vetter et al. (Volker Blanz, Thomas Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the ACM SIGGRAPH Annual Conference, 1999, pp. 187-194.) first proposed the 3D morphable model (3DMM) algorithm. In the more than twenty years since, 3DMM-based methods have developed rapidly and remain the most widely used. With the advent of deep learning, some supervised 3D face reconstruction methods use deep convolutional networks to predict 3DMM parameters in place of traditional optimization, obtaining better reconstruction results. However, face data with 3D ground truth is not readily available, so unsupervised and weakly supervised learning methods have been studied extensively and have achieved acceptable results. Tewari et al. (A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In ICCV, 2017.) proposed using photometric loss as a supervisory signal during training to recover facial texture. Genova et al. (K. Genova, F. Cole, A. Maschinot, et al. Unsupervised training for 3D morphable model regression. In CVPR, 2018, pp. 8377-8386.) used a face recognition network to improve the fidelity of face reconstruction. Deng et al. (Y. Deng, J. Yang, S. Xu, et al. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In CVPRW, 2019, pp. 285-295.) employed landmark loss to improve the accuracy of face reconstruction. Shang et al. (J. Shang, T. Shen, S. Li, L. Zhou, M. Zhen, T. Fang, and L. Quan. Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency. In ECCV, 2020, pp. 53-70.) proposed a depth loss to improve face alignment accuracy. These methods continue to explore the effect of different losses on 3D face reconstruction, but they overlook facial geometric details. In summary, these methods can only reconstruct rough geometry and unfaithful textures, and cannot recover geometric details.
Some methods in the prior art can be used to recover detailed face shapes. Feng et al. (Y. Feng, H. Feng, M. J. Black, and T. Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. ACM TOG, 40(4):88:1-88:13, 2021.) proposed the Detailed Expression Capture and Animation (DECA) method, which learns common geometric details from multiple face images and generates a geometric displacement map rich in geometric details. However, the geometric displacement map learned by this method is inaccurate, and the generated geometric details are not faithful. Daněček et al. (R. Daněček, M. J. Black, T. Bolkart. EMOCA: Emotion Driven Monocular Face Capture and Animation. In CVPR, 2022.) proposed the EMOtion Capture and Animation (EMOCA) method, which uses a deep expression consistency loss to learn the geometric details of facial expressions. This method can effectively recover the geometric details of facial expressions, but it cannot generate a 3D face shape with a sense of realism. Therefore, methods that recover facial geometric details with a displacement map have difficulty learning accurate geometric details and lack geometric realism. Existing work cannot effectively capture geometric details and semantic attributes, so the generated 3D faces have few geometric details and rough textures. Furthermore, we observe that EMOCA uses an expression network to obtain more geometric details of facial expressions. We therefore consider that a powerful semantic representation network can learn geometric details and semantic attributes to guide a coarse 3D face model toward recovering more geometric details and realistic facial expressions. To this end, we exploit the strong representation capability of the CLIP (Contrastive Language-Image Pre-training) model to learn geometric details and semantic features. The CLIP model is trained on 400 million text-image pairs and can effectively acquire fine-grained semantic features. StyleCLIP (O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, pp. 2085-2094, 2021.) shows that the CLIP model can capture the geometric and semantic attributes of a human face.
Therefore, there is a need for a CLIP-model-based three-dimensional face reconstruction method with higher face reconstruction accuracy and more pronounced facial geometric details.
Disclosure of Invention
The invention mainly aims to provide a three-dimensional face reconstruction method based on a CLIP model, so as to solve the problems of low face reconstruction accuracy and scarce facial geometric details in the prior art.
In order to achieve the above purpose, the present invention provides a three-dimensional face reconstruction method based on a CLIP model, which specifically includes the following steps:
S1, acquiring a coarse feature representation by means of mask-based pre-training.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module.
S3, acquiring 3DMM parameters with a feature classifier, wherein the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code.
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model.
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model.
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Further, in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
Further, in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module.
Further, the step S2 specifically includes the following steps:
s2.1, given the roughness characteristics F0.
S2.2, using a 1×1 convolution layer to reduce the feature dimension and obtain a 256-dimensional feature vector; this process is defined as c = C(F_0), where C(·) denotes the 1×1 convolution layer.
S2.3, learning local high-frequency features and global semantic features with a parallel depth-separable residual block and a Transformer encoder, respectively; this process is defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
Further, the face model of step S4 is expressed as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

where s̄ and t̄ denote the average shape and average texture of the 3D face; A_id, A_exp and A_tex denote the principal component bases of face shape, expression and texture, respectively; and α, β and t denote the face shape, expression and texture parameters used to fit and generate the 3D face.
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

where R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
Further, in step S6, the photometric loss function is used to make the generated texture and skin tone approach those of the input image; the photometric loss function is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

where M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm.
The landmark loss function is used for weakly supervised learning and measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

where k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1.
The identity loss function is used to generate a realistic facial geometry image. An ArcFace network is trained on the VGGFace2 dataset; the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and finally the cosine similarity of the deep features is computed. The identity loss function is defined as:

L_id = 1 - <F(I_s), F(I_r)>    (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product.
The regularization loss function is used to prevent degradation of the 3D face shape, and the regularization loss is defined as:

L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

where α, β and δ denote the shape, expression and texture parameters, respectively.
Further, in step S6, the geometric feature consistency loss function is used to recover the geometric details of the face; it is defined as:
L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm.
The semantic feature consistency loss function is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model; it can be defined as:

L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

where CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity.
The feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
further, the optimization of all losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
The invention has the following beneficial effects:
1. The parameter refinement module is used to learn rich feature representations, so as to accelerate model convergence and estimate accurate face model parameters; it adopts a parallel Transformer encoder and a depth-separable residual block to learn global semantic features and local geometric features, respectively.
2. The feature fusion module provided by the invention fuses the global semantic features and the local geometric features learned by the parameter refinement module into a fine-grained feature representation, and the feature classifier then linearly classifies the fine-grained features into the different 3DMM parameters.
3. The feature consistency loss function of the invention captures geometric details using the strong representation power of the CLIP model, thereby recovering the texture, skin tone and local geometric details of the face.
4. The method provided by the invention comprises the parameter refinement module, the feature fusion module and the feature consistency loss function, and therefore achieves higher face reconstruction accuracy and more pronounced facial geometric details than existing single-image three-dimensional face reconstruction algorithms.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 shows a flowchart of a three-dimensional face reconstruction method based on a CLIP model according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made apparent and fully in view of the accompanying drawings, in which some, but not all embodiments of the invention are shown. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The three-dimensional face reconstruction method based on the CLIP model shown in fig. 1 specifically comprises the following steps:
s1, acquiring rough characteristic representation by adopting a mask pre-training mode.
Specifically, in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module. The parameter refinement module aims to learn rich feature representations from a single image so as to accelerate model convergence and estimate accurate face model parameters.
Specifically, in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module. Step S2 specifically includes the following steps: S2.1, given the coarse feature F_0; S2.2, a 1×1 convolution layer is used to reduce the feature dimension, yielding a 256-dimensional feature vector, a process defined as c = C(F_0); S2.3, local high-frequency features and global semantic features are learned with a parallel depth-separable residual block and a Transformer encoder, respectively, a process defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
The depth-separable residual block adopts residual-connected convolution operations to learn local facial detail features. It first applies batch normalization to the extracted coarse features to speed up network convergence and mitigate overfitting. Linear inference is then performed with two 3×3 convolution layers (SConv), where the first convolution layer uses grouped convolution to reduce the number of parameters, and each convolution operation is followed by a ReLU activation function. The depth-separable residual block thus captures as many local detail features as possible with as few parameters as possible. The Transformer encoder consists of three Transformer blocks and learns global semantic features from the coarse feature representation: the extracted coarse features are fed into the Transformer encoder, and each Transformer block uses a multi-head attention network to enhance global interactions among the coarse features and learn rich semantic features. The feature fusion module fuses the local detail features and the global semantic features learned by the parameter refinement module.
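For illustration only, the following is a minimal PyTorch sketch of a parameter refinement module of this kind; the class names, the channel sizes other than the 256-dimensional reduction, the number of attention heads, and the way the two branches are fused and pooled are assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class DepthSepResBlock(nn.Module):
    # Residual block: batch norm, a grouped 3x3 convolution, a plain 3x3 convolution,
    # each followed by ReLU, with a skip connection (a sketch of the "SConv" block).
    def __init__(self, channels: int = 256, groups: int = 8):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(self.conv1(self.bn(x)))
        y = self.relu(self.conv2(y))
        return x + y  # residual connection

class ParameterRefinementModule(nn.Module):
    # 1x1 conv reduces F_0 to 256 channels; a 3-block Transformer encoder learns
    # global semantics while the depth-separable residual block learns local detail;
    # the branches are summed, approximating F_c = Cat(T(c) + DW(c)).
    def __init__(self, in_channels: int = 2048, dim: int = 256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, dim, kernel_size=1)
        self.local = DepthSepResBlock(dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)

    def forward(self, f0):
        c = self.reduce(f0)                              # B x 256 x H x W
        b, d, h, w = c.shape
        tokens = c.flatten(2).transpose(1, 2)            # B x (H*W) x 256
        t = self.encoder(tokens).transpose(1, 2).reshape(b, d, h, w)
        fused = t + self.local(c)                        # T(c) + DW(c)
        return fused.mean(dim=(2, 3))                    # pooled fine-grained feature F_c
```

The global-average pooling at the end is purely an assumption made so that a fixed-length vector can be handed to the feature classifier of step S3.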
S3, acquiring 3DMM parameters with a feature classifier: the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code composed of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code, namely an 80-dimensional shape parameter α, a 64-dimensional expression parameter β, an 80-dimensional texture parameter t, a 6-dimensional pose parameter ρ and a 27-dimensional illumination parameter l.
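A sketch of the feature classifier of step S3, under the assumption that a single linear layer maps the pooled F_c to the 257-dimensional code; the split sizes follow the dimensions listed above.

```python
import torch
import torch.nn as nn

class FeatureClassifier(nn.Module):
    # Maps the fine-grained feature F_c to the 257-d 3DMM parameter code and splits it
    # into shape (80), expression (64), texture (80), pose (6) and illumination (27).
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 257)

    def forward(self, fc_feat):
        code = self.fc(fc_feat)
        alpha, beta, tex, pose, light = torch.split(code, [80, 64, 80, 6, 27], dim=-1)
        return alpha, beta, tex, pose, light
```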
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model.
Specifically, the BFM dataset is used as prior knowledge of the face to generate the three-dimensional face model. The BFM dataset consists of scanned facial data of 200 subjects, from which a PCA model generates three-dimensional data representing face shape and texture. The 3D faces generated from the BFM dataset share the same topology, and the semantic meaning of each triangular patch is fixed. To recover facial expressions, the FaceWarehouse dataset is used to generate 3D facial expressions. The face model of step S4 is expressed as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

where s̄ and t̄ denote the average shape and average texture of the 3D face; A_id, A_exp and A_tex denote the principal component bases of face shape, expression and texture, respectively; and α, β and t denote the face shape, expression and texture parameters used to fit and generate the 3D face.
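A sketch of formula (1) with pre-loaded BFM/FaceWarehouse arrays; the dictionary keys, array layouts and the NumPy representation are assumptions made for illustration.

```python
import numpy as np

def fit_3dmm(alpha, beta, tex, bfm):
    # bfm holds the mean shape/texture and principal-component bases:
    # shape_mean, tex_mean: (3N,); id_basis: (3N, 80); exp_basis: (3N, 64); tex_basis: (3N, 80)
    shape = bfm["shape_mean"] + bfm["id_basis"] @ alpha + bfm["exp_basis"] @ beta   # geometry part of eq. (1)
    texture = bfm["tex_mean"] + bfm["tex_basis"] @ tex                              # texture part of eq. (1)
    return shape.reshape(-1, 3), texture.reshape(-1, 3)  # N vertices and per-vertex colors
```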
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, which is the most commonly used projection method for rendering a scene, and the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

where R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
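Formula (2) can be sketched as follows; the final division by depth (the usual perspective divide to obtain pixel coordinates) is an assumed convention, since the patent only states the camera-space transform.

```python
import numpy as np

def project_vertices(vertices, R, T, f):
    # v = f x R x S_3d + T, then perspective divide to image coordinates.
    # vertices: (N, 3); R: (3, 3) rotation; T: (3,) translation; f: focal length.
    cam = f * (vertices @ R.T) + T
    return cam[:, :2] / cam[:, 2:3]   # (N, 2) projected vertex positions
```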
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model.
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Specifically, in step S6, the photometric loss function is used to reduce the per-pixel difference between the rendered image and the input image, so that the generated texture and skin tone approach those of the input image. The photometric loss function is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

where M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm.
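A PyTorch sketch of formula (3); the skin mask M_I is assumed to be supplied as a 0/1 tensor (for example from a face-parsing step, which is not described in the patent), and averaging over the masked pixels is also an assumption.

```python
import torch

def photometric_loss(i_s, i_r, mask):
    # L_photo = || M_I ⊙ (I_s - I_r) ||_1, averaged over masked pixels.
    # i_s, i_r: (B, 3, H, W) input and rendered images; mask: (B, 1, H, W).
    diff = mask * (i_s - i_r)
    return diff.abs().sum() / (3.0 * mask.sum().clamp(min=1.0))
```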
To align the 2D face and the 3D face, weakly supervised learning is performed with the landmark loss, which measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

where k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1.
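A sketch of formula (4); which keypoints receive the weight 20 is not spelled out in the text, so the indices are left to the caller.

```python
import torch

def landmark_loss(k_gt, k_proj, heavy_idx=()):
    # k_gt, k_proj: (B, 68, 2) detected and projected 2D keypoints.
    # heavy_idx: indices of the keypoints weighted 20; all others are weighted 1.
    w = torch.ones(k_gt.shape[1], device=k_gt.device)
    if len(heavy_idx) > 0:
        w[list(heavy_idx)] = 20.0
    per_point = ((k_gt - k_proj) ** 2).sum(dim=-1)   # squared distance per keypoint
    return (w * per_point).mean()
```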
The identity loss function is used to generate a more realistic facial geometry image by extracting deep features of the face image from a high-dimensional perspective with a face recognition network. Because the ArcFace network achieves high face recognition accuracy on face datasets, it is used as the facial deep-feature extractor. The ArcFace network is trained on the VGGFace2 dataset; the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and finally the cosine similarity of the deep features is computed. The identity loss function is defined as:
L_id = 1 - <F(I_s), F(I_r)>    (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product.
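A sketch of formula (5), assuming `arcface` is a pre-trained ArcFace embedding network that returns a 512-d feature per image (loading that network is outside the sketch).

```python
import torch.nn.functional as F

def identity_loss(arcface, i_s, i_r):
    # L_id = 1 - <F(I_s), F(I_r)> on L2-normalized ArcFace embeddings.
    e_s = F.normalize(arcface(i_s), dim=-1)
    e_r = F.normalize(arcface(i_r), dim=-1)
    return 1.0 - (e_s * e_r).sum(dim=-1).mean()
```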
to prevent 3D face shape degradation, the present invention will be directed to 3DMM parameter constrained regularization. Experimental results show that regularization can effectively prevent the degradation of the facial shape parameters, the expression parameters and the texture parameters. Defining regularization loss as:
L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

where α, β and δ denote the shape, expression and texture parameters, respectively.
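The regularization term of formula (9) is direct; whether the norms are squared is not stated in the text, so this sketch follows the formula as written.

```python
def regularization_loss(alpha, beta, delta):
    # L_reg = ||alpha||_2 + ||beta||_2 + ||delta||_2 over the 3DMM parameter vectors.
    return alpha.norm(p=2) + beta.norm(p=2) + delta.norm(p=2)
```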
Inspired by CLIPasso, a feature consistency loss function is proposed to recover the geometric details of the face. The CLIP model effectively captures geometric detail features and high-level semantic features; it is trained on 400 million image-text pairs and has strong attribute representation capability. We also find that a feature consistency loss function based on the CLIP model captures facial geometric details more easily than a VGG-based perceptual loss function, and that while generative adversarial networks can recover facial texture details, their training is unstable. The feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function. The geometric feature consistency loss function measures the geometric similarity between the input image and the rendered image: the layer-2 and layer-3 feature maps are first extracted with the CLIP image encoder, and the mean squared error between the layer-2 and layer-3 feature maps of the input image and the rendered image is then computed.
Specifically, in step S6, the geometric feature consistency loss function is used to recover the geometric details of the face and is defined as:

L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm.
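A sketch of formula (6) that grabs the layer-2 and layer-3 feature maps of the RN50x4 CLIP image encoder with forward hooks; the layer2/layer3 module names follow the public OpenAI CLIP ModifiedResNet and are an assumption, as is the expectation that I_s and I_r are already resized and normalized for that encoder.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

clip_rn, _ = clip.load("RN50x4", jit=False)
for p in clip_rn.parameters():
    p.requires_grad_(False)

_feats = {}
clip_rn.visual.layer2.register_forward_hook(lambda m, i, o: _feats.__setitem__("l2", o))
clip_rn.visual.layer3.register_forward_hook(lambda m, i, o: _feats.__setitem__("l3", o))

def _layer_features(img):
    # Run the visual backbone once and return the hooked layer-2/3 feature maps.
    clip_rn.visual(img.type(clip_rn.dtype))
    return _feats["l2"], _feats["l3"]

def geometric_loss(i_s, i_r, weights=(1.0, 0.5)):
    # L_geometric: weighted squared error between layer-2/3 CLIP feature maps of I_s and I_r.
    fs = _layer_features(i_s)
    fr = _layer_features(i_r)
    return sum(w * F.mse_loss(a.float(), b.float())
               for w, a, b in zip(weights, fs, fr))
```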
The semantic feature consistency loss function enforces consistency of high-dimensional semantic attributes; it is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model. The CLIP image encoder first extracts the 512-dimensional feature vectors of the rendered image and the input image, and the cosine similarity of the two 512-dimensional feature vectors is then computed. It can be defined as:

L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

where CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity.
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
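A sketch of formulas (7) and (8) with the ViT-B/32 CLIP image encoder; `geometric_loss` is the sketch from the previous paragraph, and preprocessing of I_s and I_r to CLIP's input resolution and normalization is again assumed.

```python
import torch.nn.functional as F
import clip

clip_vit, _ = clip.load("ViT-B/32", jit=False)
for p in clip_vit.parameters():
    p.requires_grad_(False)

def semantic_loss(i_s, i_r):
    # L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r)) on the 512-d image embeddings.
    e_s = clip_vit.encode_image(i_s.type(clip_vit.dtype)).float()
    e_r = clip_vit.encode_image(i_r.type(clip_vit.dtype)).float()
    return 1.0 - F.cosine_similarity(e_s, e_r, dim=-1).mean()

def feature_consistency_loss(i_s, i_r):
    # L_cl = L_geometric + L_semantic, formula (8).
    return geometric_loss(i_s, i_r) + semantic_loss(i_s, i_r)
```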
specifically, the optimization of all losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
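Putting the pieces together, a sketch of the weighted objective in formula (10) that uses the loss sketches above and the weights listed in the text.

```python
def total_loss(i_s, i_r, mask, k_gt, k_proj, alpha, beta, delta, arcface):
    # L_all = λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg
    return (1.0      * photometric_loss(i_s, i_r, mask)
            + 2.0    * identity_loss(arcface, i_s, i_r)
            + 1.7e-3 * landmark_loss(k_gt, k_proj)
            + 2.0    * feature_consistency_loss(i_s, i_r)
            + 1e-4   * regularization_loss(alpha, beta, delta))
```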
Thereby reconstructing the three-dimensional face model.
The invention provides a three-dimensional face reconstruction method based on a CLIP model. First, a feature consistency loss is proposed to learn more geometric details: the CLIP model is used as supervisory information, and the similarity of geometric details between the input image and the rendered image is encouraged during training. The CLIP model can also capture the state of the eyes, addressing inaccurate eye closure without any dedicated eye-closure loss. Second, to accelerate model convergence and estimate accurate face model parameters, a parameter refinement module is proposed that uses a parallel Transformer encoder and a depth-separable residual block to learn global semantic features and local detail features, respectively. The feature fusion module fuses the global semantic and local detail features into a rich feature representation, accelerating convergence during training. Finally, to drive the 3D face model, the invention for the first time uses expression text to guide 3D facial expression transfer. Unlike DECA, which uses a reference image for expression transfer, the invention uses text to guide the reconstruction of a 3D face with a specific expression while preserving facial identity consistency. As a result, higher face reconstruction accuracy and more pronounced facial geometric details are achieved.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; rather, the invention is intended to cover modifications, adaptations, additions and alternatives falling within the spirit and scope of the invention.

Claims (8)

1. A three-dimensional face reconstruction method based on a CLIP model, characterized by comprising the following steps:
S1, acquiring a coarse feature representation by means of mask-based pre-training;
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module;
S3, acquiring 3DMM parameters with a feature classifier, wherein the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code;
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model;
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model;
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function, the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
2. The three-dimensional face reconstruction method based on the CLIP model according to claim 1, characterized in that in step S1, given an input image I_s ∈ R^{h×w×3}, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked manner and used to extract a coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of the coarse feature.
3. The three-dimensional face reconstruction method based on the CLIP model according to claim 2, characterized in that in step S2, the parameter refinement stage learns a fine-grained feature representation F_c = H_PRM(F_0) from the coarse feature F_0, where H_PRM(·) denotes the parameter refinement module.
4. The three-dimensional face reconstruction method based on the CLIP model according to claim 3, wherein the step S2 specifically comprises the following steps:
S2.1, given the coarse feature F_0;
S2.2, using a 1×1 convolution layer to reduce the feature dimension and obtain a 256-dimensional feature vector, this process being defined as c = C(F_0);
S2.3, learning local high-frequency features and global semantic features with a parallel depth-separable residual block and a Transformer encoder, respectively, this process being defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
5. The three-dimensional face reconstruction method according to claim 1, wherein the face model in step S4 is represented as:
S = s̄ + A_id·α + A_exp·β,   T = t̄ + A_tex·t    (1)

wherein s̄ and t̄ represent the average shape and average texture of the 3D face; A_id, A_exp and A_tex represent the principal component bases of face shape, expression and texture, respectively; and α, β and t represent the face shape, expression and texture parameters used to fit and generate the 3D face;
the camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, the perspective camera projection process can be expressed as:
v = f × R × S_3d + T    (2)

wherein R is the rotation matrix, T is the translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
6. The three-dimensional face reconstruction method according to claim 1, wherein in step S6, the photometric loss function is used to make the generated texture and skin tone approach those of the input image, and is defined as:
L_photo = ||M_I ⊙ (I_s - I_r)||_1    (3)

wherein M_I is the mask region of the facial skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image, respectively, and ||·||_1 is the L_1 norm;
the landmark loss function is used for weakly supervised learning and measures the distance between the 68 keypoints of the 3D face projected into the image plane and the corresponding keypoints of the input image; it is defined as:
L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i - k'_i||^2    (4)

wherein k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, N = 68, and w_i is the weight of the i-th keypoint; a small subset of keypoints is weighted 20, and all other keypoints are weighted 1;
the identity loss function is used to generate a realistic facial geometry image: an ArcFace network is trained on the VGGFace2 dataset, the trained network is then used to extract 512-dimensional deep face features from the input image and the rendered image, and the cosine similarity of the deep features is finally computed; the identity loss function is defined as:
L_id = 1 - <F(I_s), F(I_r)>    (5)

wherein F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image, respectively, and <·,·> is the vector inner product;
the regularization loss function is used for preventing the 3D face shape from degrading, and the regularization loss is defined as follows:
L_reg = ||α||_2 + ||β||_2 + ||δ||_2    (9)

wherein α, β and δ denote the shape, expression and texture parameters, respectively.
7. The three-dimensional face reconstruction method according to claim 6, wherein in step S6, a geometric feature consistency loss function is defined as:
L_geometric = Σ_l w_l ||CLIP_l(I_s) - CLIP_l(I_r)||_2^2    (6)

wherein CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image, respectively, and ||·||_2 is the L_2 norm;
the semantic feature consistency loss function is used to make the texture and skin tone approach the input image and to address the eye-closure problem of the 3D face model, and can be defined as:
L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r))    (7)

wherein CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image, respectively, and cos(·,·) is the cosine similarity;
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
8. the three-dimensional face reconstruction method based on the CLIP model according to claim 7, wherein the optimization of all the losses of the objective function is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)

wherein λ_photo = 1, λ_id = 2, λ_lm = 1.7×10^{-3}, λ_reg = 1×10^{-4} and λ_cl = 2 are the weights of the corresponding loss terms.
CN202310376661.3A 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model Pending CN116563457A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310376661.3A CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310376661.3A CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Publications (1)

Publication Number Publication Date
CN116563457A true CN116563457A (en) 2023-08-08

Family

ID=87485263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310376661.3A Pending CN116563457A (en) 2023-04-11 2023-04-11 Three-dimensional face reconstruction method based on CLIP model

Country Status (1)

Country Link
CN (1) CN116563457A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116993929A (en) * 2023-09-27 2023-11-03 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium
CN116993929B (en) * 2023-09-27 2024-01-16 北京大学深圳研究生院 Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium

Similar Documents

Publication Publication Date Title
Feng et al. Learning an animatable detailed 3D face model from in-the-wild images
Daněček et al. Emoca: Emotion driven monocular face capture and animation
Wang et al. Detecting photoshopped faces by scripting photoshop
Kundu et al. 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare
Fu et al. Deep ordinal regression network for monocular depth estimation
Zhou et al. Fully convolutional mesh autoencoder using efficient spatially varying kernels
Huynh et al. Mesoscopic facial geometry inference using deep neural networks
CN110991281B (en) Dynamic face recognition method
CN110728219B (en) 3D face generation method based on multi-column multi-scale graph convolution neural network
Chen et al. I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling
CN109448083A (en) A method of human face animation is generated from single image
Liu et al. Normalized face image generation with perceptron generative adversarial networks
Gao et al. Semi-supervised 3D face representation learning from unconstrained photo collections
Ji et al. SurfaceNet+: An end-to-end 3D neural network for very sparse multi-view stereopsis
CN116563457A (en) Three-dimensional face reconstruction method based on CLIP model
Zhang et al. Weakly-supervised multi-face 3d reconstruction
Li et al. Multi-attribute regression network for face reconstruction
Basak et al. 3D face-model reconstruction from a single image: A feature aggregation approach using hierarchical transformer with weak supervision
Chen et al. Transformer-based 3d face reconstruction with end-to-end shape-preserved domain transfer
Zheng et al. GCM-Net: Towards effective global context modeling for image inpainting
Ren et al. Facial geometric detail recovery via implicit representation
Lin et al. Single-shot implicit morphable faces with consistent texture parameterization
Yin et al. Segmentation-reconstruction-guided facial image de-occlusion
Zhao et al. Generative landmarks guided eyeglasses removal 3D face reconstruction
Maxim et al. A survey on the current state of the art on deep learning 3D reconstruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination