CN116563457A - Three-dimensional face reconstruction method based on CLIP model - Google Patents
Three-dimensional face reconstruction method based on CLIP model
- Publication number: CN116563457A
- Application number: CN202310376661.3A
- Authority: CN (China)
- Prior art keywords: face, model, loss function, feature, CLIP
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
- G06V40/171—Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a three-dimensional face reconstruction method based on a CLIP model, comprising the following steps: S1, acquiring a coarse feature representation by means of mask pre-training; S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module; S3, obtaining 3DMM parameters with a feature classifier; S4, fitting the 3DMM parameters with a BFM model to generate a three-dimensional face model; S5, rendering the 3D face model into a 2D image with a differentiable renderer; S6, optimizing the model with loss functions. The technical scheme of the invention addresses the low face reconstruction accuracy and sparse facial geometric detail of the prior art.
Description
Technical Field
The invention relates to the technical field of computer vision and computer graphics, in particular to a three-dimensional face reconstruction method based on a CLIP model.
Background
In recent years, 3D face reconstruction from a single image has received increasing attention from researchers. Blanz and Vetter (Volker Blanz, Thomas Vetter, "A morphable model for the synthesis of 3D faces," in Proceedings of the ACM SIGGRAPH Annual Conference, 1999, pp. 187-194) first proposed the 3D morphable model (3DMM) algorithm. Over the past twenty years, the 3DMM approach has developed rapidly and become the most widely used. With the advent of deep learning, some supervised 3D face reconstruction methods use deep convolutional networks to predict 3DMM parameters, replacing traditional optimization and obtaining better reconstructions. However, face data with 3D ground truth is not readily available, so unsupervised and weakly supervised learning methods have been studied extensively and have achieved acceptable results. Tewari et al. (A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt, "MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction," in ICCV, 2017) proposed using a photometric loss as the supervisory signal during training to recover facial texture. Genova et al. (K. Genova, F. Cole, A. Maschinot, "Unsupervised training for 3D morphable model regression," in CVPR, 2018, pp. 8377-8386) used a face recognition network to improve the fidelity of face reconstruction. Deng et al. (Y. Deng, J. Yang, S. Xu, "Accurate 3D face reconstruction with weakly-supervised learning: from single image to image set," in CVPRW, 2019, pp. 285-295) employed a landmark loss to improve reconstruction accuracy. Shang et al. (J. Shang, T. Shen, S. Li, L. Zhou, M. Zhen, T. Fang, and L. Quan, "Self-supervised monocular 3D face reconstruction by occlusion-aware multi-view geometry consistency," in ECCV, 2020, pp. 53-70) proposed a depth loss to improve face alignment accuracy.
These methods continue to explore the effect of different losses on 3D face reconstruction, but they overlook facial geometric details. In short, they reconstruct only coarse geometry and low-fidelity textures and cannot recover geometric details.
Some prior-art methods can recover detailed face shapes. Feng et al. (Y. Feng, H. Feng, M. J. Black, and T. Bolkart, "Learning an animatable detailed 3D face model from in-the-wild images," TOG, 40(4):88:1-88:13, 2021) proposed the Detailed Expression Capture and Animation (DECA) method, which learns common geometric details from multi-view face images and generates a geometric displacement map rich in detail. However, the displacement map learned by this method is inaccurate, and the generated geometric details are not realistic. Danecek et al. (R. Danecek, M. J. Black, T. Bolkart, "EMOCA: Emotion Driven Monocular Face Capture and Animation," in CVPR, 2022) proposed the Emotion Capture and Animation (EMOCA) method, which uses a depth-aware expression consistency loss to learn geometric details under facial expressions. This method effectively recovers the geometric details of facial expressions, but it cannot generate a 3D face shape with a sense of realism. Thus, methods that recover facial geometric details through displacement maps struggle to learn accurate geometric details and lack geometric realism. Existing work cannot effectively capture geometric details and semantic attributes, so the generated 3D faces have few geometric details and coarse textures. Furthermore, we observe that the EMOCA method uses an expression network to obtain more geometric details of facial expressions. We therefore reason that a powerful semantic representation network can learn geometric details and semantic attributes to guide a coarse 3D face model toward recovering more geometric details and realistic facial expressions. To this end, we exploit the powerful representation capability of the CLIP (Contrastive Language-Image Pre-training) model to learn geometric details and semantic features.
The CLIP model is trained on 400 million text-image pairs and can effectively capture fine-grained semantic features. StyleCLIP (O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-driven manipulation of StyleGAN imagery," in CVPR, pp. 2085-2094, 2021) shows that the CLIP model can capture geometric and semantic attributes of a human face.
Therefore, a CLIP-model-based three-dimensional face reconstruction method with higher face reconstruction accuracy and more pronounced facial geometric detail is needed.
Disclosure of Invention
The invention mainly aims to provide a three-dimensional face reconstruction method based on a CLIP model, which aims to solve the problems of lower face reconstruction precision and fewer face geometric details in the prior art.
In order to achieve the above purpose, the present invention provides a three-dimensional face reconstruction method based on a CLIP model, which specifically includes the following steps:
s1, acquiring rough characteristic representation by adopting a mask pre-training mode.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, the parameter refinement module comprising: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module.
S3, acquiring the 3DMM parameters with a feature classifier: the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code.
S4, fitting the 3DMM parameters with a BFM model to generate the three-dimensional face model, where the fitted parameters at this stage are divided between a face model and a camera model.
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate the rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d are the vertices of the 3D face model.
S6, optimizing the model with a loss function comprising a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Further, in step S1, given an input image I_s, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked fashion and used to extract the coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of F_0.
Further, in step S2, the parameter refinement stage learns the fine-grained feature representation F_c = H_PRM(F_0) from the coarse features F_0, where H_PRM(·) denotes the parameter refinement module.
Further, the step S2 specifically includes the following steps:
S2.1, given the coarse features F_0.
S2.2, a 1×1 convolution layer reduces the feature dimension to give a 256-dimensional feature vector; this process is defined as c = C(F_0).
S2.3, parallel depth-separable residual blocks and a Transformer encoder learn local high-frequency features and global semantic features respectively; this process is defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
Further, the face model of step S4 is expressed as:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing an average shape of the 3D face; a is that id ,A exp And A tex The main component basis of the shape, the expression and the texture of the face are respectively represented, and alpha, beta and t respectively represent the shape, the expression and the texture parameters of the face and are used for fitting and generating the 3D face.
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera; the perspective projection can be expressed as:

v = f × R × S_3d + T   (2)

where R is the rotation matrix, T is the translation vector, S_3d are the vertices of the 3D face model, and f is the camera focal length.
Further, in step S6, the photometric loss function is used to bring the generated texture skin tone close to that of the input image. The photometric loss function is defined as:

L_photo = ||M_I ⊙ (I_s − I_r)||_1   (3)

where M_I is the mask region of the face skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image respectively, and ||·||_1 is the L1 norm.
The landmark loss function is used for weakly supervised learning; it measures the distances between the 68 keypoints of the input image and the corresponding keypoints of the 3D face projected onto the image. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i − k'_i||²   (4)

where N = 68, k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, and w_i is the weight of the i-th keypoint: a subset of keypoints is weighted 20 and all other keypoints are weighted 1.
The identity loss function is used to generate realistic facial geometry. An ArcFace network is trained on the VGGFace2 dataset; the trained network extracts 512-dimensional deep face features from the input image and the rendered image, and the cosine similarity of the deep features is computed. The identity loss function is defined as:

L_id = 1 − ⟨F(I_s), F(I_r)⟩   (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image respectively, and ⟨·,·⟩ is the vector inner product.
The regularization loss function is used to prevent degradation of the 3D face shape and is defined as:

L_reg = ||α||² + ||β||² + ||δ||²   (9)

where α, β and δ represent the shape, expression and texture parameters respectively.
Further, in step S6, a geometric feature consistency loss function is used to restore geometric details of the face, where the geometric feature consistency loss function is defined as:
wherein, CLIP l Layers 2 and 3, w, of the RN50X4CLIP model l Is CLIP l Weights, w l ={1,1/2},I s 、I r Respectively an input image and a rendered image, I.I 2 Is L 2 A paradigm.
The semantic feature consistency loss function is used to bring the texture skin tone close to the input image and to address the eye-closure problem of the 3D face model. It can be defined as:

L_semantic = 1 − cos(CLIP(I_s), CLIP(I_r))   (7)

where CLIP denotes the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image respectively, and cos(·,·) is the cosine similarity.
The feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic   (8).
Further, the overall optimization objective over all losses is defined as:

L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)   (10)

where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10⁻³, λ_reg = 1×10⁻⁴ and λ_cl = 2 are the weights of the corresponding losses.
The invention has the following beneficial effects:
1. The parameter refinement module learns rich feature representations to accelerate model convergence and estimate accurate face model parameters; it employs a parallel Transformer encoder and a depth-separable residual block to learn global semantic features and local geometric features.
2. The feature fusion module provided by the invention fuses the global semantic features and the local geometric features learned by the parameter refinement module into fine-grained feature representations, and then the feature classifier linearly classifies the fine-grained features into different 3DMM parameters.
3. The feature consistency loss function of the invention captures geometric details by using a powerful-representation CLIP model, thereby recovering the texture skin color and local geometric details of the face.
4. The method provided by the invention comprises a parameter refinement module, a feature fusion module and a feature consistency loss function, so that the method has higher face reconstruction precision and more obvious face geometric details compared with the existing single image three-dimensional face reconstruction algorithm.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art. In the drawings:
fig. 1 shows a flowchart of a three-dimensional face reconstruction method based on a CLIP model according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The three-dimensional face reconstruction method based on the CLIP model shown in fig. 1 specifically comprises the following steps:
s1, acquiring rough characteristic representation by adopting a mask pre-training mode.
Specifically, in step S1, given an input image I_s, where h and w denote the height and width of the face image, a residual network is pre-trained on the VGGFace2 dataset in a masked fashion and used to extract the coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels of F_0.
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, the parameter refinement module comprising: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module. The parameter refinement module aims to learn rich feature representations from a single image, accelerating model convergence and estimating accurate face model parameters.
Specifically, in step S2, the parameter refinement stage learns the fine-grained feature representation F_c = H_PRM(F_0) from the coarse features, where H_PRM(·) denotes the parameter refinement module. Step S2 comprises the following steps: S2.1, given the coarse features F_0; S2.2, a 1×1 convolution layer reduces the feature dimension to give a 256-dimensional feature vector, a process defined as c = C(F_0); S2.3, parallel depth-separable residual blocks and a Transformer encoder learn local high-frequency features and global semantic features respectively, a process defined as F_c = Cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
The depth-separable residual block uses residual-connected convolution operations to learn local facial detail features. It first applies batch normalization to the extracted coarse features to speed up network convergence and mitigate overfitting. Linear inference is then performed with two 3×3 convolution layers (SConv), where the first uses grouped convolution to reduce the number of parameters. Each convolution is followed by a ReLU activation. The depth-separable residual block captures as many local detail features as possible using as few parameters as possible. The Transformer encoder consists of three Transformer blocks and learns global semantic features from the coarse feature representation: the extracted coarse features are fed into the Transformer encoder, where each Transformer block uses a multi-head attention network to enhance global interaction among the coarse features and learn rich semantic features. The feature fusion module fuses the local detail features and the global semantic features learned by the parameter refinement module.
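A back-of-envelope parameter count illustrates why the depth-separable design above can capture local detail "using as few parameters as possible": a depthwise 3×3 convolution followed by a 1×1 pointwise convolution needs far fewer weights than a standard 3×3 convolution. The channel width (256) matches the feature-vector size of step S2.2; everything else is illustrative, not taken from the patent.

```python
# Parameter-count comparison: standard 3x3 convolution versus a
# depthwise-separable one (depthwise 3x3 + pointwise 1x1).

def standard_conv_params(c_in, c_out, k=3):
    # one k x k filter per (input channel, output channel) pair
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise

c = 256
std = standard_conv_params(c, c)        # 589824 weights
sep = depthwise_separable_params(c, c)  # 67840 weights
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.2f}x")
```

At this width the separable block uses roughly one-ninth of the weights, which is the saving the grouped/depthwise design trades for.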
S3, obtaining the 3DMM parameters with a feature classifier: the fine-grained feature representation F_c is mapped by the feature classifier to a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code: an 80-dimensional shape parameter α ∈ R^80, a 64-dimensional expression parameter β ∈ R^64, an 80-dimensional texture parameter t ∈ R^80, a 6-dimensional pose parameter ρ ∈ R^6, and a 27-dimensional illumination parameter l ∈ R^27.
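The dimensions above (80 + 64 + 80 + 6 + 27 = 257) can be made concrete by slicing the classifier output into its components. The ordering of the components within the vector is an assumption for illustration.

```python
import numpy as np

# Split the 257-dimensional 3DMM parameter code into its components,
# using the dimensions stated in the text; the component order here
# is a hypothetical convention, not taken from the patent.

def split_3dmm_code(p):
    assert p.shape == (257,)
    alpha = p[0:80]     # shape code
    beta = p[80:144]    # expression code
    t = p[144:224]      # texture code
    rho = p[224:230]    # pose code
    l = p[230:257]      # illumination code
    return alpha, beta, t, rho, l

alpha, beta, t, rho, l = split_3dmm_code(np.zeros(257))
```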
S4, adopting a BFM model to fit 3DMM parameters to generate a three-dimensional face model, wherein the fitting parameters at the stage are divided into the face model and a camera model.
Specifically, a BFM dataset is used as prior knowledge of the face to generate the three-dimensional face model. The BFM dataset consists of scanned facial data from 200 subjects, and three-dimensional data representing face shape and texture is generated by a PCA model. The 3D faces generated from the BFM dataset share the same topology, and the semantic meaning of each triangular patch is fixed. To recover facial expressions, the FaceWarehouse dataset is used to generate 3D facial expressions. The face model of step S4 is expressed as:
S = S̄ + A_id·α + A_exp·β,   T = T̄ + A_tex·t   (1)

where S̄ and T̄ represent the average shape and average texture of the 3D face; A_id, A_exp and A_tex represent the principal component bases of face shape, expression and texture respectively; and α, β and t represent the face shape, expression and texture parameters used to fit and generate the 3D face.
The camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, the most commonly used projection for rendering a scene. The perspective projection can be expressed as:

v = f × R × S_3d + T   (2)

where R is the rotation matrix, T is the translation vector, S_3d are the vertices of the 3D face model, and f is the camera focal length.
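Equations (1) and (2) together can be sketched as a few lines of linear algebra: build a face shape as a linear 3DMM combination, then project its vertices with the perspective camera model. The bases below are tiny random stand-ins for the BFM bases, and the focal length is illustrative.

```python
import numpy as np

# Sketch of equations (1) and (2) with random stand-in bases:
# S = S_mean + A_id @ alpha + A_exp @ beta, then v = f * R * S_3d + T.

rng = np.random.default_rng(0)
n_vertices = 5

S_mean = rng.standard_normal((3 * n_vertices,))     # average face shape
A_id = rng.standard_normal((3 * n_vertices, 80))    # identity basis
A_exp = rng.standard_normal((3 * n_vertices, 64))   # expression basis
alpha = rng.standard_normal(80)                     # shape parameters
beta = rng.standard_normal(64)                      # expression parameters

# Equation (1): linear 3DMM combination, reshaped to (n_vertices, 3)
S_3d = (S_mean + A_id @ alpha + A_exp @ beta).reshape(n_vertices, 3)

# Equation (2): rigid transform plus focal scaling, applied per vertex
f = 1015.0                          # camera focal length (illustrative)
R = np.eye(3)                       # rotation matrix
T = np.array([0.0, 0.0, 10.0])      # translation vector
v = f * (S_3d @ R.T) + T
```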
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate the rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d are the vertices of the 3D face model.
S6, optimizing the model with a loss function comprising a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
Specifically, in step S6, the photometric loss function is used to reduce the pixel-wise difference between the rendered image and the input image, bringing the generated texture skin tone close to that of the input image. The photometric loss function is defined as:

L_photo = ||M_I ⊙ (I_s − I_r)||_1   (3)

where M_I is the mask region of the face skin, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image respectively, and ||·||_1 is the L1 norm.
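Equation (3) can be written as a one-line function. Whether the masked L1 norm is summed or averaged over pixels is not specified in the text; a plain sum is used here.

```python
import numpy as np

# Equation (3): masked L1 photometric loss between input image I_s and
# rendered image I_r, restricted to the face-skin mask M_I.

def photometric_loss(I_s, I_r, M_I):
    return np.abs(M_I * (I_s - I_r)).sum()

I_s = np.ones((4, 4, 3))            # stand-in input image
I_r = np.zeros((4, 4, 3))           # stand-in rendered image
M_I = np.zeros((4, 4, 3))
M_I[:2] = 1.0                        # only the top half counts as "skin"
loss = photometric_loss(I_s, I_r, M_I)   # 2 * 4 * 3 = 24 masked entries
```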
To align the 2D and 3D faces, weakly supervised learning is performed with the landmark loss, measuring the distances between the 68 keypoints of the input image and those of the 3D face projected onto it. The landmark loss function is defined as:

L_lm = (1/N) Σ_{i=1}^{N} w_i ||k_i − k'_i||²   (4)

where N = 68, k_i is the i-th keypoint of the input image, k'_i is the i-th keypoint of the reconstructed 3D face model after projection, and w_i is the weight of the i-th keypoint: a subset of keypoints is weighted 20 and all other keypoints are weighted 1.
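The weighted keypoint distance of equation (4) can be sketched as follows. The normalization (mean over keypoints) and which indices carry weight 20 are assumptions for illustration; the text only states that a subset of keypoints is weighted 20.

```python
import numpy as np

# Equation (4) as described: weighted squared distance over 68 projected
# keypoints, with weight 20 on a hypothetical subset and 1 elsewhere.

def landmark_loss(k, k_proj, heavy_idx, heavy_w=20.0):
    w = np.ones(len(k))
    w[list(heavy_idx)] = heavy_w
    d2 = np.sum((k - k_proj) ** 2, axis=1)   # squared 2D distances
    return float(np.mean(w * d2))

k = np.zeros((68, 2))                 # ground-truth keypoints
k_proj = np.full((68, 2), 0.1)        # projected model keypoints
loss = landmark_loss(k, k_proj, heavy_idx=range(48, 68))
```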
The identity loss function is used to generate more realistic facial geometry by extracting deep, high-level features of the face images with a face recognition network. Because the ArcFace network achieves high face recognition accuracy on face datasets, it is used as the deep face feature extractor: an ArcFace network is trained on the VGGFace2 dataset, the trained network extracts 512-dimensional deep face features from the input image and the rendered image, and the cosine similarity of the deep features is computed. The identity loss function is defined as:

L_id = 1 − ⟨F(I_s), F(I_r)⟩   (5)

where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image respectively, and ⟨·,·⟩ is the vector inner product.
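Equation (5) reduces to one minus an inner product of embeddings. The ArcFace network is stood in for by plain vectors here; embeddings are unit-normalized so the inner product equals the cosine similarity, matching the cosine computation described above.

```python
import numpy as np

# Equation (5): identity loss on 512-dim deep face features.  The
# pre-trained ArcFace extractor F(.) is replaced by fixed vectors.

def identity_loss(f_s, f_r):
    f_s = f_s / np.linalg.norm(f_s)
    f_r = f_r / np.linalg.norm(f_r)
    return 1.0 - float(f_s @ f_r)

f_same = np.ones(512)
loss_same = identity_loss(f_same, f_same)   # identical identities -> ~0

f_a = np.zeros(512); f_a[0] = 1.0
f_b = np.zeros(512); f_b[1] = 1.0
loss_diff = identity_loss(f_a, f_b)         # orthogonal identities -> 1
```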
To prevent 3D face shape degradation, the present invention applies a regularization constraint to the 3DMM parameters. Experimental results show that regularization effectively prevents degradation of the face shape, expression and texture parameters. The regularization loss is defined as:

L_reg = ||α||² + ||β||² + ||δ||²   (9)

where α, β and δ represent the shape, expression and texture parameters respectively.
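Equation (9) is a plain squared-norm penalty that keeps the predicted codes close to the mean face:

```python
import numpy as np

# Equation (9): squared-norm regularization of the shape, expression and
# texture codes, preventing degenerate 3DMM parameters.

def regularization_loss(alpha, beta, delta):
    return float(np.sum(alpha**2) + np.sum(beta**2) + np.sum(delta**2))

alpha = np.full(80, 0.1)    # shape parameters
beta = np.full(64, 0.1)     # expression parameters
delta = np.full(80, 0.1)    # texture parameters
loss = regularization_loss(alpha, beta, delta)   # (80+64+80)*0.01 = 2.24
```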
Inspired by CLIPasso, a feature consistency loss function is proposed to recover the geometric details of the face. The CLIP model, trained on 400 million image-text pairs, can effectively capture geometric detail features and high-level semantic features and has strong attribute representation capability. We also found that a feature consistency loss based on the CLIP model captures facial geometric details more easily than a VGG-based perceptual loss. Generative adversarial networks can recover facial texture details, but their training is unstable. The feature consistency loss comprises a geometric feature consistency loss and a semantic feature consistency loss. The geometric feature consistency loss measures the geometric similarity of the input image and the rendered image: we first extract the layer-2 and layer-3 feature maps with a CLIP image encoder, then compute the mean squared error between the corresponding feature maps of the input image and the rendered image.
Specifically, in step S6, the geometric feature consistency loss function is used to recover the geometric details of the face, and is defined as:

L_geometric = Σ_l w_l · ‖CLIP_l(I_s) − CLIP_l(I_r)‖₂²    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image respectively, and ‖·‖₂ is the L₂ norm.
The semantic feature consistency loss function enforces consistency of high-dimensional semantic attributes: it is used to make the texture skin color approach that of the input image and to address the eye-closure problem of the 3D face model. First, the CLIP image encoder extracts 512-dimensional feature vectors from the rendered image and the input image; then the cosine similarity distance between the two feature vectors is computed. It is defined as:
L_semantic = 1 − cos(CLIP(I_s), CLIP(I_r))    (7)
where CLIP is the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image respectively, and cos(·) is the cosine distance;
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
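A sketch of Eq. (6)-(8), assuming the CLIP feature maps and embeddings have already been extracted (the CLIP encoder calls themselves are stubbed out, so only the loss arithmetic is illustrated):

```python
import numpy as np

def geometric_loss(maps_s, maps_r, weights=(1.0, 0.5)):
    # L_geometric: weighted mean-square error over the layer-2 and
    # layer-3 CLIP feature maps, with w_l = {1, 1/2} as in Eq. (6).
    return sum(w * float(np.mean((a - b) ** 2))
               for w, a, b in zip(weights, maps_s, maps_r))

def semantic_loss(emb_s, emb_r):
    # L_semantic = 1 - cos(CLIP(I_s), CLIP(I_r)) on the 512-d embeddings.
    cos = np.dot(emb_s, emb_r) / (np.linalg.norm(emb_s) * np.linalg.norm(emb_r))
    return 1.0 - float(cos)

def feature_consistency_loss(maps_s, maps_r, emb_s, emb_r):
    # L_cl = L_geometric + L_semantic  (Eq. 8)
    return geometric_loss(maps_s, maps_r) + semantic_loss(emb_s, emb_r)
```

When the rendered image matches the input exactly, both terms vanish.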
Specifically, the optimization objective over all losses is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)
where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10⁻³, λ_reg = 1×10⁻⁴ and λ_cl = 2 are the weights of the corresponding losses.
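The weighted sum of Eq. (10), with the stated weights, can be sketched as follows (the per-batch loss values passed in are placeholders):

```python
# Fixed weights from Eq. (10).
WEIGHTS = {"photo": 1.0, "id": 2.0, "lm": 1.7e-3, "cl": 2.0, "reg": 1e-4}

def total_loss(losses):
    # L_all = sum_k lambda_k * L_k over the five loss terms of Eq. (10).
    return sum(WEIGHTS[k] * losses[k] for k in WEIGHTS)
```

During training, each loss term would be computed per batch and combined with these fixed weights before backpropagation.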
Thereby reconstructing the three-dimensional face model.
The invention provides a three-dimensional face reconstruction method based on a CLIP model. First, a feature consistency loss is proposed to learn more geometric details: the CLIP model is used as supervision, and similarity of geometric details between the input image and the rendered image is encouraged during training. The CLIP model can also capture the state of the eyes, solving the problem of inaccurate eye closure without any dedicated eye-closure loss. Second, to accelerate model convergence and estimate accurate face model parameters, a parameter refinement module is proposed: a parallel Transformer encoder and a depth-separable residual block learn global semantic features and local detail features respectively, and a feature fusion module fuses them into a rich representation, speeding up convergence during training. Finally, to drive the 3D face model, the invention uses expression text to guide 3D facial expression transfer for the first time. Unlike DECA, which uses a reference image for expression transfer, the invention uses text guidance to reconstruct a 3D face with a specific expression while preserving identity consistency. The method thereby achieves higher face reconstruction accuracy and more salient facial geometric details.
It should be understood that the above description is not intended to limit the invention to the particular embodiments disclosed; rather, the invention is intended to cover modifications, adaptations, additions and alternatives falling within its spirit and scope.
Claims (8)
1. The three-dimensional face reconstruction method based on the CLIP model is characterized by comprising the following steps of:
S1, acquiring a coarse feature representation by means of mask pre-training;
S2, learning a fine-grained feature representation from the coarse features with a parameter refinement module, wherein the parameter refinement module comprises: a depth-separable residual block for learning local facial detail features, a Transformer encoder for learning global semantic features from the coarse feature representation, and a feature fusion module for fusing the local detail features and the global semantic features learned by the parameter refinement module;
S3, acquiring 3DMM parameters with a feature classifier, wherein the fine-grained feature representation F_c is passed through the feature classifier to obtain a low-dimensional 3DMM parameter code consisting of a shape code α, an expression code β, a texture code t, a pose code ρ and an illumination code l, forming a 257-dimensional parameter code;
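As an illustrative sketch, the 257-dimensional code can be split back into its components. The per-component sizes below (80 shape, 64 expression, 80 texture, 6 pose, 27 illumination) are an assumption following common 3DMM practice; the claim specifies only the 257-dimensional total:

```python
import numpy as np

# Hypothetical layout of the 257-d parameter code (sizes are assumed).
SPLITS = (("alpha", 80), ("beta", 64), ("t", 80), ("rho", 6), ("l", 27))

def split_params(code):
    # Slice the flat code vector into named 3DMM components in order.
    out, i = {}, 0
    for name, dim in SPLITS:
        out[name] = code[i:i + dim]
        i += dim
    assert i == 257, "component sizes must sum to the 257-d code length"
    return out
```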
S4, fitting the 3DMM parameters with a BFM model to generate a three-dimensional face model, wherein the fitted parameters at this stage are divided into a face model and a camera model;
S5, rendering the 3D face model into a 2D image with a differentiable renderer to generate a rendered image I_r = R(S_3d), where R(·) denotes the differentiable renderer and S_3d denotes the vertices of the 3D face model;
S6, optimizing the model with a loss function, wherein the loss function comprises a coarse loss function and a feature consistency loss function; the coarse loss function comprises a photometric loss function, an identity loss function, a landmark loss function and a regularization loss function, and the feature consistency loss function comprises a geometric feature consistency loss function and a semantic feature consistency loss function.
2. The three-dimensional face reconstruction method based on the CLIP model according to claim 1, wherein in step S1, given an input image I_s, where h and w denote the height and width of the face image, a residual network pre-trained on the VGGFace2 data set with a mask strategy extracts the coarse feature representation F_0 = H_b(I_s), where H_b(·) denotes the pre-trained residual network and c denotes the number of channels.
3. The three-dimensional face reconstruction method based on the CLIP model according to claim 2, wherein in step S2, the parameter refinement stage learns the fine-grained feature representation F_c = H_PRM(F_0) from the coarse features F_0, where H_PRM(·) denotes the parameter refinement module.
4. The three-dimensional face reconstruction method based on the CLIP model according to claim 3, wherein the step S2 specifically comprises the following steps:
S2.1, given the coarse features F_0;
S2.2, reducing the feature dimension with a 1×1 convolution layer to obtain a 256-dimensional feature vector, a process defined as c = C(F_0);
S2.3, learning local high-frequency features and global semantic features respectively with a parallel depth-separable residual block and a Transformer encoder, a process defined as F_c = cat(T(c) + DW(c)), where T(·) denotes the Transformer encoder and DW(·) denotes the depth-separable residual block.
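A toy sketch of steps S2.1-S2.3, with the Transformer encoder and depth-separable residual block passed in as stand-in callables (the real sub-networks are learned; only the data flow is illustrated):

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels:
    # x: (C_in, H, W), w: (C_out, C_in) -> output (C_out, H, W).
    return np.tensordot(w, x, axes=([1], [0]))

def refine(F0, w_reduce, transformer, dw_block):
    # S2.2: reduce the channel dimension of the coarse features.
    c = conv1x1(F0, w_reduce)
    # S2.3: F_c = cat(T(c) + DW(c)) -- parallel global/local branches
    # whose outputs are summed (the cat in the claim then packs them).
    return transformer(c) + dw_block(c)
```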
5. The three-dimensional face reconstruction method according to claim 1, wherein the face model of step S4 is expressed as:

S = s + A_id·α + A_exp·β + A_tex·t    (1)

where s denotes the average shape of the 3D face; A_id, A_exp and A_tex denote the principal component bases of face shape, expression and texture respectively; and α, β and t denote the face shape, expression and texture parameters respectively, used in fitting to generate the 3D face;
the camera model of step S4 projects the 3D face model into the 2D image using a perspective camera, and the perspective projection process can be expressed as:

v = f·R·S_3d + T    (2)
where R is a rotation matrix, T is a translation vector, S_3d denotes the vertices of the 3D face model, and f is the camera focal length.
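Eq. (2) as a literal sketch; a full perspective camera would also divide by depth, but the claim states only the rigid transform and focal scaling, so that is all that is coded here:

```python
import numpy as np

def project(S3d, R, T, f):
    # v = f * R * S_3d + T : rotate each vertex, scale by the focal
    # length, then translate. S3d: (N, 3), R: (3, 3), T: (3,), f: scalar.
    return f * (S3d @ R.T) + T
```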
6. The three-dimensional face reconstruction method according to claim 1, wherein in step S6, the photometric loss function is used to make the generated texture skin color approach the texture skin color of the input image, and is defined as:
L_photo = ‖M_I ⊙ (I_s − I_r)‖₁    (3)
where M_I is the mask region of the face skin tone, ⊙ denotes the Hadamard product, I_s and I_r are the input image and the rendered image respectively, and ‖·‖₁ is the L₁ norm;
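Eq. (3) in code, with the skin-region mask applied as a Hadamard product; the mask itself would come from a separate face-parsing step, which is assumed here:

```python
import numpy as np

def photometric_loss(I_s, I_r, mask):
    # L_photo = || M_I (.) (I_s - I_r) ||_1 : elementwise mask, then
    # the L1 norm of the masked difference over the skin-tone region.
    return float(np.sum(np.abs(mask * (I_s - I_r))))
```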
the landmark loss function is used for weakly supervised learning and measures the distance between the 68 key points of the input image and those of the 3D face projected onto the image plane; it is defined as:

L_lm = (1/68) Σᵢ wᵢ ‖kᵢ − k′ᵢ‖²    (4)

where k_i is the i-th key point of the input image, k′_i is the i-th key point of the reconstructed 3D face model after projection, and w_i is the weight of the i-th key point; selected key points are assigned weight 20 and all other key points weight 1;
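A sketch of the landmark term; the claim assigns weight 20 to some key points and 1 to the rest but does not specify which, so the set of heavily weighted indices is left as a parameter:

```python
import numpy as np

def landmark_loss(k, k_proj, heavy_idx=()):
    # Mean of w_i * ||k_i - k'_i||^2 over the 68 key points; which
    # points receive weight 20 is unspecified, hence the parameter.
    w = np.ones(len(k))
    w[list(heavy_idx)] = 20.0
    d2 = np.sum((k - k_proj) ** 2, axis=1)
    return float(np.mean(w * d2))
```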
the identity loss function is used for generating realistic face geometry: an ArcFace network is trained on the VGGFace2 data set, the trained network extracts 512-dimensional deep features from the input image and the rendered image, and the cosine similarity of the deep features is computed; the identity loss function is defined as:
L_id = 1 − ⟨F(I_s), F(I_r)⟩    (5)
where F(·) is the pre-trained ArcFace network, I_s and I_r are the input image and the rendered image respectively, and ⟨·,·⟩ is the vector inner product;
the regularization loss function is used to prevent 3D face shape degradation and is defined as:
L_reg = ‖α‖² + ‖β‖² + ‖δ‖²    (9)
where α, β and δ denote the shape, expression and texture parameters respectively.
7. The three-dimensional face reconstruction method according to claim 6, wherein in step S6, the geometric feature consistency loss function is defined as:

L_geometric = Σ_l w_l · ‖CLIP_l(I_s) − CLIP_l(I_r)‖₂²    (6)

where CLIP_l denotes layers 2 and 3 of the RN50x4 CLIP model, w_l is the weight of CLIP_l with w_l = {1, 1/2}, I_s and I_r are the input image and the rendered image respectively, and ‖·‖₂ is the L₂ norm;
the semantic feature consistency loss function is used to make the texture skin color approach the input image and to solve the eye-closure problem of the 3D face model, and can be defined as:
L_semantic = 1 − cos(CLIP(I_s), CLIP(I_r))    (7)
where CLIP is the FC layer of the ViT-B/32 CLIP model, I_s and I_r are the input image and the rendered image respectively, and cos(·) is the cosine distance;
the feature consistency loss function is defined as:
L_cl = L_geometric + L_semantic    (8).
8. The three-dimensional face reconstruction method based on the CLIP model according to claim 7, wherein the optimization objective over all losses is defined as:
L_all = min(λ_photo·L_photo + λ_id·L_id + λ_lm·L_lm + λ_cl·L_cl + λ_reg·L_reg)    (10)
where λ_photo = 1, λ_id = 2, λ_lm = 1.7×10⁻³, λ_reg = 1×10⁻⁴ and λ_cl = 2 are the weights of the corresponding losses.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310376661.3A CN116563457A (en) | 2023-04-11 | 2023-04-11 | Three-dimensional face reconstruction method based on CLIP model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116563457A true CN116563457A (en) | 2023-08-08 |
Family
ID=87485263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310376661.3A Pending CN116563457A (en) | 2023-04-11 | 2023-04-11 | Three-dimensional face reconstruction method based on CLIP model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116563457A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116993929A (en) * | 2023-09-27 | 2023-11-03 | 北京大学深圳研究生院 | Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium |
CN116993929B (en) * | 2023-09-27 | 2024-01-16 | 北京大学深圳研究生院 | Three-dimensional face reconstruction method and device based on human eye dynamic change and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Feng et al. | Learning an animatable detailed 3D face model from in-the-wild images | |
Daněček et al. | Emoca: Emotion driven monocular face capture and animation | |
Wang et al. | Detecting photoshopped faces by scripting photoshop | |
Kundu et al. | 3d-rcnn: Instance-level 3d object reconstruction via render-and-compare | |
Fu et al. | Deep ordinal regression network for monocular depth estimation | |
Zhou et al. | Fully convolutional mesh autoencoder using efficient spatially varying kernels | |
Huynh et al. | Mesoscopic facial geometry inference using deep neural networks | |
CN110991281B (en) | Dynamic face recognition method | |
CN110728219B (en) | 3D face generation method based on multi-column multi-scale graph convolution neural network | |
Chen et al. | I2uv-handnet: Image-to-uv prediction network for accurate and high-fidelity 3d hand mesh modeling | |
CN109448083A (en) | A method of human face animation is generated from single image | |
Liu et al. | Normalized face image generation with perceptron generative adversarial networks | |
Gao et al. | Semi-supervised 3D face representation learning from unconstrained photo collections | |
Ji et al. | SurfaceNet+: An end-to-end 3D neural network for very sparse multi-view stereopsis | |
CN116563457A (en) | Three-dimensional face reconstruction method based on CLIP model | |
Zhang et al. | Weakly-supervised multi-face 3d reconstruction | |
Li et al. | Multi-attribute regression network for face reconstruction | |
Basak et al. | 3D face-model reconstruction from a single image: A feature aggregation approach using hierarchical transformer with weak supervision | |
Chen et al. | Transformer-based 3d face reconstruction with end-to-end shape-preserved domain transfer | |
Zheng et al. | GCM-Net: Towards effective global context modeling for image inpainting | |
Ren et al. | Facial geometric detail recovery via implicit representation | |
Lin et al. | Single-shot implicit morphable faces with consistent texture parameterization | |
Yin et al. | Segmentation-reconstruction-guided facial image de-occlusion | |
Zhao et al. | Generative landmarks guided eyeglasses removal 3D face reconstruction | |
Maxim et al. | A survey on the current state of the art on deep learning 3D reconstruction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||