CN114549387A - Face image highlight removal method based on pseudo label - Google Patents
- Publication number
- CN114549387A (application number CN202210208825.7A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- data set
- module
- highlight
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30201: Face
Abstract
The invention relates to the fields of image processing, computer vision and deep learning, and in particular to a pseudo-label-based method for removing highlights from face images. The method obtains a synthetic face data set through a rendering engine and, together with real highlighted face images, forms a labeled data set and an unlabeled data set; trains a convolutional neural network with the labeled data set; improves the generalization ability of the neural network with a pseudo-label method; and feeds a highlighted face image into the convolutional neural network to obtain a highlight-free picture. The de-highlighted face images obtained by the method look natural and do not damage facial texture detail.
Description
Technical Field
The invention relates to the fields of image processing, computer vision and deep learning, and in particular to a pseudo-label-based face image highlight removal method.
Background
Specular reflection on a human face produces facial highlights, which in most cases visibly degrade image quality and reduce the aesthetics of the face. Highlight removal from face images is therefore crucial in fields such as film and television production, virtual reality, and face relighting. Existing face-image highlight removal algorithms fall mainly into the following categories:
1) Numerical optimization methods based on prior assumptions. These algorithms rely on assumptions about the physical characteristics of facial highlights and on hand-designed priors and constraints built on them; the highlight and non-highlight components of the face are separated by minimizing an optimization objective.
2) Image processing methods based on chromaticity space. By analyzing the chromaticity space of the image, pixels are divided into highlight and non-highlight pixels, and highlight pixels are converted into non-highlight pixels through image-processing computations.
3) Data-driven deep learning methods. Using deep learning, a model takes a highlighted image as input and a highlight-free image as output, and learns the ability to remove facial highlights from large-scale data.
With the continued development of deep learning, data-driven methods have become the mainstream direction for face highlight removal. Because labeled face highlight data sets are difficult to obtain, synthetic data sets are usually produced with a rendering engine. However, synthetic face images still differ from real face images, and a de-highlighting network trained only on synthetic faces, when applied to real faces, can cause problems such as color distortion and over-smoothed images.
Disclosure of Invention
To overcome the shortcomings of deep-learning face highlight removal algorithms, the invention provides a pseudo-label-based face image highlight removal method, which comprises the following steps:
S1, obtaining a synthetic face data set through a rendering engine and, together with real highlighted face images, forming a labeled data set and an unlabeled data set;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
s4, inputting the face image with highlight into the convolutional neural network which completes training to obtain a highlight-removed picture;
when training the convolutional neural network, the sum of the loss between labeled data and their labels and the loss between unlabeled data and their pseudo labels is used as the loss function of the network, and the network is trained by back-propagation.
Further, the method for acquiring the tagged data set and the untagged data set comprises the following steps:
selecting several face 3D models, loading them into a physically based rendering engine, selecting several HDR environment lights, and rendering the face models; rendering the diffuse reflection component D and the specular reflection component S of the face separately according to the Phong illumination model; expressing the rendered highlighted face image as I = D + S; pairing each highlighted face image with its original to form synthetic highlight/highlight-free face image pairs, which constitute the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
Further, the convolutional neural network comprises an encoder and a decoder. The image input to the network is the input of the encoder, which comprises 5 cascaded convolution modules; the output of each of the first four modules is max-pooled before being fed into the next module;
the output of the encoder is the input of the decoder, which comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and one convolution layer. The first deconvolution module takes the decoder input and performs a deconvolution operation to produce an output d1; the first attention module fuses d1 with the output of the penultimate convolution module of the encoder, and its output is denoted x1;
d1 and x1 are concatenated and fed into the first convolution module for convolution; the output of the first convolution module is the input of the second deconvolution module, whose deconvolution yields an output d2; the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, and its output is denoted x2;
d2 and x2 are concatenated and fed into the second convolution module; the output of the second convolution module is the input of the third deconvolution module, whose deconvolution yields an output d3; the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, and its output is denoted x3;
d3 and x3 are concatenated and fed into the third convolution module; the output of the third convolution module is the input of the fourth deconvolution module, whose deconvolution yields an output d4; the fourth attention module fuses d4 with the output of the fifth-from-last (i.e. first) convolution module of the encoder, and its output is denoted x4;
d4 and x4 are concatenated and fed into the fourth convolution module, and the output of the fourth convolution module is fed into the convolution layer of the decoder to obtain the decoder output.
Furthermore, each convolution module consists of two cascaded convolution layers; each convolution layer performs, in sequence, a convolution whose kernel count equals the number of output channels, with window size 3 x 3 and stride 1 x 1, a normalization operation, and relu activation;
the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part and fuses them: each part passes, in sequence, through a convolution with window size 3 x 3 and stride 1 x 1 and a normalization operation, yielding the main part X and the secondary part S; X and S are added, and the sum passes, in sequence, through a convolution with window size 3 x 3 and stride 1 x 1, a normalization operation, and sigmoid activation, giving the fusion result XS; XS is multiplied by X to give the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer: the up-sampling layer up-samples the input image to twice its size, after which the convolution layer performs, in sequence, a convolution whose kernel count equals the number of output channels, with window size 3 x 3 and stride 1 x 1, a normalization operation, and relu activation.
Further, the max-pooling in the encoder uses kernel size 2 and stride 2, and the kernel counts of the five encoder convolution modules are 64, 128, 256, 512 and 1024 in sequence; the kernel count of each attention module in the decoder equals that of the corresponding deconvolution module, the kernel counts of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, and the convolution layer of the decoder has 3 kernels, window size 1 x 1 and stride 1 x 1.
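As an illustrative sketch only, the attention fusion described above can be written in NumPy as follows; 1 x 1 pointwise convolutions (plain matrix multiplies) stand in for the 3 x 3 convolutions, normalization is omitted, and all names and weights are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_fuse(main, skip, w_main, w_skip, w_gate):
    """Sketch of the attention module: fuse the decoder (main) branch
    with the encoder (skip) branch. 1x1 convolutions stand in for the
    3x3 convolutions of the patent; normalization is omitted."""
    # main, skip: (H, W, C); w_*: (C, C) pointwise convolution weights
    X = main @ w_main               # main part after convolution
    S = skip @ w_skip               # secondary part after convolution
    XS = sigmoid((X + S) @ w_gate)  # fusion result, a gate in (0, 1)
    return X * XS                   # gated main part is the module output

H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
main = rng.normal(size=(H, W, C))
skip = rng.normal(size=(H, W, C))
w = [rng.normal(scale=0.1, size=(C, C)) for _ in range(3)]
out = attention_fuse(main, skip, *w)
print(out.shape)
```

The multiplicative gate lets the skip connection suppress or pass through decoder features channel by channel, which is the role the attention modules play between encoder and decoder here.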
Further, the method for using the pseudo label to improve the generalization capability of the convolutional neural network comprises the following steps:
s31, generating pseudo labels for the unlabeled data sets through a Gaussian process;
s32, calculating the error of the unlabeled data set according to the generated pseudo label;
S33, calculating the loss function of the convolutional neural network from the error of the unlabeled data set and the error of the labeled data set, and training the network through back-propagation, the loss function being expressed as:
L_total = L_sup + λ_unsup · L_unsup;
where L_total is the total loss of the convolutional neural network; L_sup is the loss on the labeled data set; L_unsup is the loss on the unlabeled data set; λ_unsup is the weight of the unlabeled loss.
Further, for the unlabeled data set, generating pseudo labels through the Gaussian process comprises the following steps:
when the labeled data set is first used, recording the feature vectors z^L obtained from the last convolution module of the encoder into a memory matrix Z^L;
sparsely coding Z^L and learning a dictionary F of labeled-data feature vectors;
when the unlabeled data set is fed into the neural network, obtaining the corresponding feature vector z^U from the last convolution module and projecting z^U onto the learned feature space F;
given the labeled data set and its feature vectors, the distribution of an unlabeled feature vector is equivalent to a Gaussian distribution, and the mean of this equivalent Gaussian is taken as the pseudo label of the unlabeled feature vector.
Further, the distribution of an unlabeled feature vector is equivalent to a Gaussian distribution; the mean of the equivalent Gaussian is expressed as:
μ = K(z^U, Z^L)[K(Z^L, Z^L) + I]^(−1) Z^L;
the variance of the equivalent Gaussian is expressed as:
σ² = K(z^U, z^U) − K(z^U, Z^L)[K(Z^L, Z^L) + I]^(−1) K(Z^L, z^U);
where K(X, Y) is the kernel function, expressed as K(X, Y) = ⟨X, Y⟩/(|X||Y|), in which ⟨X, Y⟩ denotes the inner product of vectors X and Y and |X| denotes the length of vector X; K(z^U, z^U) has a value of 1; I is the identity matrix.
Further, the loss function on the unlabeled data set is expressed as:
L_unsup = (1/σ²)‖z^U − μ‖₂² + ‖Fα − z^U‖₂² + λ₁‖α‖₁;
where z^U is the feature vector output by the last convolution module of the encoder; μ is the pseudo label obtained by the Gaussian process, whose value is the mean of the Gaussian distribution; σ² is the variance of the Gaussian distribution; ‖·‖₂ is the L2 norm; λ₁ is the sparsity coefficient; α is the sparse vector.
Further, the loss L_sup on the labeled data set is expressed as:
L_sup = L_pixel + L_perception;
L_pixel = ‖y_pred − y‖₁;
L_perception = λ₂‖Φ_VGG(y_pred) − Φ_VGG(y)‖₂²;
where y_pred is the prediction of the neural network; y is the ground-truth label; ‖·‖₁ is the L1 distance between images; λ₂ is a weight; Φ_VGG denotes the VGG16 network; Φ_VGG(y_pred) is the feature produced by VGG16 for the prediction and Φ_VGG(y) the feature produced by VGG16 for the ground truth; ‖·‖₂² is the squared L2 distance between images.
According to the pseudo-label-based face highlight removal method, pseudo labels for unannotated face pictures are produced with a Gaussian-process model; the pseudo labels enhance the generalization ability of the neural network on real face images and improve the highlight removal effect on real faces.
Drawings
FIG. 1 is a simplified flowchart of an example of the pseudo-label-based face image highlight removal algorithm of the present invention;
FIG. 2 is a schematic diagram of the structure of the highlight removal network of the present invention;
FIG. 3 is a schematic diagram of training the highlight removal network of the present invention;
fig. 4 shows test results of the trained neural network on the CelebA data set (first row: original images; second row: corresponding highlight-removed results).
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a face image highlight removal method based on a pseudo label, which specifically comprises the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
and S4, inputting the face image with highlight into the convolutional neural network which is trained, and obtaining the picture without highlight.
In this embodiment, a flow of a face image highlight removal method based on a pseudo label is as shown in fig. 1 to 4, and specifically includes the following steps:
S10, a face data set (containing both specular-highlighted faces and diffuse highlight-free faces) is rendered with a rendering engine and, together with real-world highlighted images, forms a labeled data set and an unlabeled data set for training the neural network.
S101, several face 3D models (obj files) are selected and loaded into a physically based rendering engine, and several HDR (high dynamic range) environment lights are selected to render the face models. According to the Phong illumination model, the diffuse reflection component D and the specular reflection component S of the face are rendered separately, and according to the formula:
I=D+S
an image I is obtained, where S is regarded as the facial highlight, I is the highlighted face image, and D is the highlight-free face image. These operations are repeated to generate a large number of highlight/highlight-free synthetic face image pairs as the labeled data set.
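A toy NumPy sketch of forming one labeled pair via I = D + S; the arrays here are random stand-ins for actual Phong renders of a face mesh:

```python
import numpy as np

# Toy stand-ins for the rendered diffuse (D) and specular (S) components;
# in the patent these come from Phong-model renders of a face 3D model.
rng = np.random.default_rng(1)
D = rng.uniform(0.0, 0.8, size=(8, 8, 3))  # highlight-free face image
S = np.zeros_like(D)
S[2:4, 2:4, :] = 0.5                       # a small specular lobe

I = np.clip(D + S, 0.0, 1.0)               # highlighted image I = D + S

# (I, D) forms one highlight / highlight-free training pair
pair = (I, D)
print(I.shape)
```

Repeating this over many models and lighting conditions yields the synthetic labeled data set described above.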
S102, manually selecting a plurality of highlight face pictures from the real face data set, and taking the pictures as a non-label data set.
S20, the convolutional neural network is first trained with the labeled data set. Specifically:
s201, the structure of the convolutional neural network.
The convolutional neural network can be specifically divided into an encoder (encoding) and a decoder (decoding), and specifically includes:
1) Encoder: the encoder comprises 5 cascaded convolution modules conv_block. The first convolution module takes the image as input; the latter four take the output of the previous module as input. Max pooling is applied after each of the first 4 modules, with kernel size 2 and stride 2, and the kernel counts of the five convolution modules are [64, 128, 256, 512, 1024].
2) Decoder: the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and one convolution layer. The first deconvolution module takes the output of the last encoder module as input and produces an output d1; the first attention module takes d1 and the output e4 of the penultimate encoder convolution module as input and produces an output x1; d1 is concatenated with x1 and fed into the first convolution module. The output of that convolution module is fed to the second deconvolution module to obtain an output d2. The second attention module takes d2 and the output e3 of the third-from-last encoder convolution module as input to produce an output x2; d2 and x2 are concatenated and fed into the second convolution module. And so on; finally, the output d4 of the last convolution module is fed into the convolution layer. The kernel counts of the 4 deconvolution modules are [512, 256, 128, 64], and each attention module has the same kernel count as its deconvolution module. The final convolution layer has 3 kernels, window size 1 x 1 and stride 1 x 1. The overall structure of the convolutional neural network is shown in fig. 2. The convolution modules in the decoder have the same structure as those in the encoder, with kernel counts [512, 256, 128, 64]; the other convolution parameters are consistent with the encoder and are not repeated here.
The convolution module, attention module, deconvolution module and convolution layer in the convolutional neural network are specified as follows:
1) Convolution module conv_block: the convolution module consists of two convolution layers, identical except for their inputs. Each convolution layer has as many kernels as output channels ch, window size 3 x 3 and stride 1 x 1, followed by normalization, with relu as the activation function. The convolution modules in the encoder and decoder share the same structure and differ only in kernel count: the 5 encoder modules have 64, 128, 256, 512 and 1024 kernels in sequence, and the 4 decoder modules have 512, 256, 128 and 64.
2) Attention module attu_block: the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part; each part undergoes a convolution with window size 3 x 3 and stride 1 x 1 followed by normalization, giving the main part X and the secondary part S. X and S are added, and the sum undergoes a convolution with window size 3 x 3 and stride 1 x 1, normalization, and sigmoid activation, giving XS; multiplying X by XS yields the output of the attention module.
3) Deconvolution module up_block: the deconvolution module first up-samples the input image to twice its original size, then applies a convolution layer with as many kernels as output channels ch, window size 3 x 3 and stride 1 x 1, followed by normalization and relu activation; the kernel counts of the 4 deconvolution modules are 512, 256, 128 and 64 respectively.
4) Convolution layer: the output of the last convolution module of the decoder is fed into a convolution layer whose kernel count equals the number of channels of the input image, with window size 3 x 3 and stride 1 x 1.
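A minimal NumPy sketch of the up_block's first step, nearest-neighbour 2x up-sampling; the subsequent convolution, normalization and relu are omitted:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling, as performed by the up_block
    before its convolution (a minimal sketch of that step only)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.arange(4.0).reshape(2, 2)   # a tiny 2x2 feature map
y = upsample2x(x)
print(y.shape)  # (4, 4)
```

Each input pixel becomes a 2x2 block in the output, doubling both spatial dimensions before the 3 x 3 convolution refines the result.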
S202: for the labeled data set (synthetic images), the error between the prediction and the label is calculated. The pixel error between the label and the prediction is expressed as:
Lpixel=||ypred-y||1;
meanwhile, the perceptual error between the label and the prediction is calculated and expressed as:
L_perception = λ₂‖Φ_VGG(y_pred) − Φ_VGG(y)‖₂²;
where y_pred is the prediction of the neural network; y is the ground-truth label; ‖·‖₁ is the L1 distance between images; λ₂ is a weight that those skilled in the art can adjust as needed; Φ_VGG denotes the VGG16 network; Φ_VGG(y_pred) is the feature produced by VGG16 for the prediction and Φ_VGG(y) the feature produced by VGG16 for the ground truth; ‖·‖₂² is the squared L2 distance between images.
The sum of the two parts is the supervised loss function of the neural network:
L_sup = L_pixel + L_perception.
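A minimal sketch of the supervised loss in NumPy; phi_stub is a hypothetical stand-in (average-pooled patches) for the real VGG16 feature extractor Φ_VGG:

```python
import numpy as np

def phi_stub(img):
    """Hypothetical stand-in for the VGG16 feature extractor Phi_VGG:
    average-pooled 2x2 patches instead of real network activations."""
    return img.reshape(4, 2, 4, 2, 3).mean(axis=(1, 3))

def l_sup(y_pred, y, lam2=0.1):
    """L_sup = L_pixel + L_perception (lam2 is an illustrative weight)."""
    l_pixel = np.abs(y_pred - y).mean()        # L1 pixel loss
    fp, fy = phi_stub(y_pred), phi_stub(y)
    l_perc = lam2 * np.sum((fp - fy) ** 2)     # perceptual (feature) loss
    return l_pixel + l_perc

rng = np.random.default_rng(2)
y = rng.uniform(size=(8, 8, 3))                # ground-truth label
y_pred = y + 0.05 * rng.normal(size=y.shape)   # noisy prediction
loss = l_sup(y_pred, y)
print(loss >= 0.0)
```

The pixel term anchors per-pixel color, while the feature term penalizes differences in higher-level structure, which is why the two are summed.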
and S30, using a pseudo label method to improve the generalization ability of the neural network. The method specifically comprises the following steps:
s301, generating a pseudo label through a Gaussian process for a label-free data set (a real image), specifically:
when the labeled data set is first used, the feature vectors z_i^L obtained from the last convolution module of the encoder are recorded into a memory matrix Z^L, i.e. Z^L = [z_1^L, ..., z_{N_l}^L]^T, where N_l denotes the number of annotated images and i is the index of the image. Assuming each z_i^L is a 1 x M feature vector, Z^L is an N_l x M matrix.
According to sparse representation theory, a sample set X = {x_1, ..., x_n} can be expressed as a linear combination of a set of basis vectors Φ = {φ_1, ..., φ_k}. Thus, if an overcomplete set of basis vectors D = {φ_1, ..., φ_k} is learned from the sample set X, X can be decomposed as:
X = Dα;
where X is the sample set, D is the set of basis vectors (the dictionary) and α is the coefficient vector; this decomposition is sparse coding. The aim of sparse coding is that the recombined decomposition be as close as possible to the original sample set while α is as sparse as possible, i.e.:
min |α|₀  s.t.  Dα = X
Preferably, Z^L is sparsely coded, and a dictionary F of labeled-data feature vectors is obtained by learning:
min_{F, α_l} ‖Z^L − Fα_l‖₂² + λ₁‖α_l‖₁;
where α_l is the coefficient vector corresponding to the labeled data set.
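A NumPy sketch of the sparse-coding step for a fixed dictionary, using the standard ISTA iteration; the patent additionally learns the dictionary F from Z^L, whereas here F, z and all names are illustrative:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding operator for the L1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_code(z, F, lam=0.01, steps=200):
    """ISTA sketch: min_a 0.5*||z - F a||_2^2 + lam*|a|_1
    for a fixed dictionary F."""
    L = np.linalg.norm(F, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(F.shape[1])
    for _ in range(steps):
        grad = F.T @ (F @ a - z)       # gradient of the quadratic term
        a = soft(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(3)
F = rng.normal(size=(16, 32))          # random 16x32 dictionary (toy)
a_true = np.zeros(32); a_true[[3, 10]] = [1.0, -0.5]
z = F @ a_true                         # a feature vector with sparse code
a = sparse_code(z, F)
print(np.linalg.norm(F @ a - z) < np.linalg.norm(z))
```

Note the objective relaxes the |α|₀ constraint above to an L1 penalty, which is the standard convex surrogate used in sparse coding.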
Meanwhile, when the unlabeled data set is fed into the neural network, the last convolution module likewise yields the corresponding feature vector z^U. The method assumes that the feature vectors of the unlabeled data set and those of the labeled data set belong to the same vector space; therefore z^U and Z^L can share one dictionary F. Thus, when training with the unlabeled data set, the feature vector z^U obtained from the last convolution module of the encoder can be projected onto the learned feature space F. Preferably, the annotated data and the unlabeled data can be jointly modeled using a Gaussian Process (GP).
The gaussian process is a function whose core is modeling the function using an infinite multidimensional variable gaussian distribution. A gaussian process may be determined by a mean function (mean function) and a covariance function (covariance function).
v and v' are random variables, f is a gaussian process, E is desired, m (v) is a mean function, and K is a kernel function (covariance). The gaussian process can be defined as:
for a set of random variables V, denoted as V1,v2,...,vn]In other words, the result of the gaussian process conforms to a multidimensional gaussian distribution, namely:
thus, the joint distribution of the feature vectors of tagged and untagged datasets can be modeled as a multivariate gaussian:
wherein z isLGaussian process of feature vectors for tagged data sets, zUGaussian process of feature vectors for unlabeled datasets, μLIs the mean, μ, of the feature vectors of the tagged datasetUMean of the feature vectors for the unlabeled dataset; with the gaussian process, the distribution of the unlabeled dataset feature vectors can be calculated given the labeled dataset and the labeled dataset feature vectors. The last feature vector space is modeled using a Gaussian Process (GP). Unlabeled sample feature vectorThe distribution of (d) is equivalent to a gaussian distribution:
wherein z_L is the labeled data and z_U | z_L is a multivariate Gaussian distribution, the mean of the Gaussian process being:

μ̂_U = μ_U + K(V_U, V_L)[K(V_L, V_L) + σ²I]^(-1)(z_L − μ_L);

and the variance being:

σ̂²_U = K(V_U, V_U) − K(V_U, V_L)[K(V_L, V_L) + σ²I]^(-1)K(V_L, V_U);
wherein ⟨X, Y⟩ represents the inner product of vector X and vector Y, and |X| represents the modular length of vector X; the covariance is taken as K(X, Y) = ⟨X, Y⟩/(|X||Y|), so that K(v, v) has a value of 1; I is an identity matrix.
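The pseudo-label computation above, i.e. the posterior mean and variance of the unlabeled features given the labeled ones under the normalized inner-product kernel, can be sketched in a few lines. A zero prior mean and the small noise level are simplifying assumptions made only for this sketch:

```python
import numpy as np

def cosine_kernel(A, B):
    """K[i, j] = <a_i, b_j> / (|a_i| |b_j|), so K(v, v) = 1 as in the text."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def gp_pseudo_label(V_L, z_L, V_U, noise=1e-3):
    """Standard GP regression posterior (zero prior mean assumed here).

    The posterior mean is used as the 'pseudo label' of the unlabeled features.
    """
    K_LL = cosine_kernel(V_L, V_L) + noise * np.eye(len(V_L))
    K_UL = cosine_kernel(V_U, V_L)
    K_UU = cosine_kernel(V_U, V_U)
    mean = K_UL @ np.linalg.solve(K_LL, z_L)                 # pseudo label
    cov = K_UU - K_UL @ np.linalg.solve(K_LL, K_UL.T)        # posterior covariance
    return mean, np.diag(cov)

rng = np.random.default_rng(1)
V_L = rng.standard_normal((5, 4))          # labeled feature vectors
z_L = rng.standard_normal(5)               # their target values
# Query the GP at two points it has already seen: uncertainty should be near zero.
mean, var = gp_pseudo_label(V_L, z_L, V_L[:2])
```

At feature vectors the GP has already observed, the posterior variance collapses toward zero, which is why the mean is trusted as a label there.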
S302, calculating the error of the unlabeled data, specifically:
For the unlabeled data, the formula:

L_unsup = Σ_j ||ẑ_j − μ̂_j||²_2 / σ̂²_j + λ_1||α||_1;

is adopted to calculate the error between the space vector obtained by the encoder and the pseudo-label space vector, wherein ẑ_j is the feature vector predicted by the last convolution block, μ̂_j is the pseudo label obtained by the Gaussian process, and σ̂²_j is the variance of the Gaussian process corresponding to the j-th convolution layer. The sparse term requires that, after sparse representation, the decomposition result be as close as possible to the feature vectors of the labeled data set while keeping the sparse vector α as sparse as possible; λ_1 is the sparse coefficient, which can be adjusted by the user.
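A minimal numeric sketch of this unsupervised error, assuming the variance-weighted form reconstructed above (the function and variable names here are illustrative, not the patent's):

```python
import numpy as np

def unsup_loss(z_hat, mu_hat, var_hat, alpha, lam1=0.1):
    """Pull the encoder feature z_hat toward the GP pseudo label mu_hat,
    weighted by the inverse GP variance, plus an L1 sparsity penalty on alpha."""
    data_term = np.sum((z_hat - mu_hat) ** 2 / var_hat)
    sparse_term = lam1 * np.sum(np.abs(alpha))
    return data_term + sparse_term

z_hat = np.array([1.0, 2.0, 3.0])     # encoder feature vector
mu_hat = np.array([1.0, 2.5, 3.0])    # pseudo label (GP posterior mean)
var_hat = np.array([0.5, 0.5, 0.5])   # per-dimension GP variance
alpha = np.array([0.0, 1.0, 0.0])     # sparse code
loss = unsup_loss(z_hat, mu_hat, var_hat, alpha, lam1=0.1)
# data term = 0.25 / 0.5 = 0.5; sparse term = 0.1; loss = 0.6
```

Dimensions with high GP variance (uncertain pseudo labels) contribute less, which is the point of the inverse-variance weighting.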
S303, combining the loss functions of the labeled data set and the unlabeled data set, and further training the network.
Combining the losses of the labeled data and the unlabeled data, the loss function provided by the invention is:
L_total = L_sup + λ_unsup · L_unsup;
wherein λ_unsup is the weight of the unlabeled data set loss; this parameter can be adjusted by the user.
And S40, inputting the face image with highlight into the trained convolutional neural network to obtain the highlight-removing result.
In an example of the present invention, a flow chart for synthesis and training is shown in FIG. 1. Based on a human face OBJ model and HDR environment illumination, a synthetic face data set is obtained by specifying a specular reflection component, a diffuse reflection component and their combination, and rendering with a rendering engine. The synthesized face data set, whose illumination parameters are known, serves as the labeled data set, and real faces with highlight are collected as the unlabeled data set; the two data sets are input into the neural network model, and a "pseudo label" for each real face is extracted with the help of the synthetic face data.
When training on the labeled data, the feature vector of the labeled image at the last layer is extracted; on this basis, a dictionary describing the labeled-data feature vectors is learned by sparse coding, and each feature vector is encoded into a sparse coefficient using the dictionary. When an unlabeled image is input, a Gaussian process is computed using the sparse representation and the unlabeled image's last-layer feature vector, and the mean of the Gaussian process is used as the "pseudo label" of the unlabeled image. The neural network model trains the network's ability to remove highlights by simultaneously reducing the distance between the synthetic image and its label and between the real image and its "pseudo label". The neural network ψ is obtained after training; applying ψ to an arbitrarily input face image yields the result after highlight removal.
1) Training data synthesis
Take a number of 3D face models (which can be generated automatically by a three-dimensional face modeling algorithm), covering different categories such as gender and face shape (wide or narrow, fat or thin). Load the models into a physically based renderer (such as Mitsuba) and select different HDR environment illuminations to render them. Set the parameters of the renderer's illumination model to obtain the specular reflection component S and the diffuse reflection component D of the face, and add S and D by image addition to obtain the face image I with highlight. The data set is constructed from D and I, with D as the label for I. This process is repeated to obtain a sufficiently large labeled data set.
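Assembling one labeled training pair from the rendered components can be sketched as follows. The renderer itself (e.g. Mitsuba) is outside this sketch, and clipping the sum to [0, 1] is an implementation detail assumed here; the patent simply adds D and S:

```python
import numpy as np

def make_pair(D, S):
    """Compose a labeled pair: input I = D + S (clipped), label = D."""
    I = np.clip(D + S, 0.0, 1.0)   # additive composition of diffuse + specular
    return I, D                     # (face with highlight, de-highlight label)

D = np.full((4, 4, 3), 0.4)                # stand-in diffuse rendering
S = np.zeros((4, 4, 3)); S[1, 1] = 0.9     # a small specular spot
I, label = make_pair(D, S)                  # I saturates at the highlight pixel
```

Repeating this over many models and HDR environments yields the labeled set; the label is always the diffuse component D.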
2) Neural network training
In addition to the labeled data set, a number of real faces with highlight are selected (manually) as the unlabeled data set. Whether labeled or unlabeled, every image can be regarded as the sum of a diffuse reflection image and a specular reflection image:
I=D+S;
With the neural network ψ shown in fig. 2, the goal is that, given an input image x with highlight, the network outputs the de-highlighted result ψ(x).
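As a rough bookkeeping sketch (not part of the patent text), the feature-map sizes implied by the claimed architecture, i.e. five encoder blocks with 2x2 stride-2 max pooling between them and four 2x upsampling decoder stages, can be traced as follows; the 256x256 input size is an illustrative assumption:

```python
# Channel counts as listed in the claims for the encoder and decoder stages.
enc_channels = [64, 128, 256, 512, 1024]
dec_channels = [512, 256, 128, 64]

def trace_sizes(hw=256):
    """Return (channels, spatial size) after each encoder/decoder stage."""
    sizes = []
    for c in enc_channels:
        sizes.append((c, hw))
        if c != enc_channels[-1]:
            hw //= 2              # 2x2 max pool, stride 2, between encoder blocks
    for c in dec_channels:
        hw *= 2                   # each deconvolution module upsamples by 2
        sizes.append((c, hw))
    return sizes

sizes = trace_sizes(256)
# bottleneck: (1024, 16); final decoder stage restores (64, 256)
```

Four poolings take 256 down to 16 at the 1024-channel bottleneck, and the four upsampling stages bring it back to full resolution before the final 1x1 convolution.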
The network is trained according to the network training architecture shown in fig. 3. The network training is divided into a supervised part and an unsupervised part, wherein the supervised part comprises:
For the labeled data set (composite images), the following is used:

L_pixel = ||y_pred − y||_1;

to compute the pixel error between the label and the prediction, where y_pred is the prediction result of the neural network, y is the real label, and ||·||_1 is the L1 distance between images.
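The pixel term can be computed in one line; whether the patent sums or averages over pixels is not specified, so a mean absolute error is assumed here:

```python
import numpy as np

def pixel_loss(y_pred, y):
    """L_pixel = ||y_pred - y||_1, taken as a per-pixel mean absolute error."""
    return np.mean(np.abs(y_pred - y))

y = np.zeros((2, 2, 3))               # stand-in ground-truth label
y_pred = np.full((2, 2, 3), 0.25)     # stand-in network prediction
loss = pixel_loss(y_pred, y)          # every pixel differs by 0.25
```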
Meanwhile, the following is adopted:

L_perception = λ_2||Φ_VGG(y_pred) − Φ_VGG(y)||²_2;

to calculate the perceptual error between the label and the prediction, where Φ_VGG denotes the VGG16 network and λ_2 is a weight.
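The perceptual term compares deep features rather than pixels. In the sketch below, `phi` is a stand-in for the VGG16 feature extractor Φ_VGG (a real implementation would use pretrained VGG16 activations); its definition and `lam2` are assumptions for illustration only:

```python
import numpy as np

def phi(img):
    """Toy 'feature extractor': per-channel means and stds stand in for
    VGG16 activations, purely to make the loss runnable here."""
    return np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])

def perceptual_loss(y_pred, y, lam2=1.0):
    d = phi(y_pred) - phi(y)
    return lam2 * np.sum(d ** 2)   # squared L2 norm of the feature difference

y = np.zeros((4, 4, 3))
y_pred = np.full((4, 4, 3), 0.5)
loss = perceptual_loss(y_pred, y)   # 3 channels differ in mean by 0.5 each
```

Feature-space distances penalize structural differences that a pure pixel loss treats the same as uniform shifts.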
Supervised losses include:
L_sup = L_pixel + L_perception;
the unsupervised part comprises:
For the unlabeled data, the following is adopted:

L_unsup = Σ_j ||ẑ_j − μ̂_j||²_2 / σ̂²_j + λ_1||α||_1;

to calculate the error between the space vector obtained by the encoder and the pseudo-label space vector.
The loss function of the neural network proposed by the present invention is:
L_total = L_sup + λ_unsup · L_unsup;
the optimization goal of the neural network is this loss function. The overall training process of the neural network is shown in fig. 3.
The training configuration is as follows: 1000 epochs, a batch size of 4, and the Adam optimizer. The method of the invention was tested as follows: the experimental platform is a PC with an NVIDIA GeForce GTX 1080Ti GPU and 12G of video memory. The software configuration comprises the Ubuntu 18.04 system, CUDA 11.3, Python 3.8.0, and the PyTorch framework. Tests show that the method provided by the invention effectively removes highlight.
3) De-highlighting network applications
The face image with highlight is input into the trained neural network ψ to obtain the face with highlight removed, as shown in fig. 4: the first row of fig. 4 shows the face images with highlight, and the second row shows the images after highlight removal by the method, demonstrating that the method effectively removes highlight while preserving most details of the image.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A face image highlight removal method based on a pseudo label is characterized by specifically comprising the following steps:
S1, acquiring a synthesized face data set through a rendering engine, and forming a labeled data set and an unlabeled data set together with real face images with highlight;
S2, training the convolutional neural network by using the labeled data set;
S3, acquiring pseudo labels for the data in the unlabeled data set, and training the convolutional neural network by using the pseudo-labeled data set;
S4, inputting the face image with highlight into the trained convolutional neural network to obtain the picture with highlight removed.
2. The face image highlight removal method based on a pseudo label according to claim 1, wherein the acquisition method for the labeled data set and the unlabeled data set comprises the following steps:
selecting a plurality of 3D face models, loading them into a physically based renderer, selecting a plurality of HDR environment lights, and rendering the face models; rendering the diffuse reflection part D and the specular reflection part S of the face respectively according to the Phong illumination model, and expressing the rendered face image with highlight as I = D + S; the obtained face image with highlight and its original image form a highlight/de-highlight synthetic face image pair, and these synthetic face image pairs form the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
3. The method for removing highlight from human face image based on pseudo label according to claim 1, wherein the convolutional neural network comprises an encoder and a decoder, the image inputted into the convolutional neural network is used as the input of the encoder, the encoder comprises 5 cascaded convolutional modules, and the output of each convolutional module is inputted into the next convolutional module after being maximally pooled;
taking the output of the encoder as the input of the decoder, wherein the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer; the first deconvolution module takes the input of the decoder as its input and performs a deconvolution operation to obtain an output d1, and the first attention module fuses the output of the penultimate convolution module in the encoder with d1, its output being recorded as x1;
d1 and x1 are spliced together and input into the first convolution module for convolution; the output of the first convolution module is used as the input of the second deconvolution module for a deconvolution operation to obtain an output d2, and the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, its output being recorded as x2;
d2 and x2 are spliced together and input into the second convolution module for convolution; the output of the second convolution module is used as the input of the third deconvolution module for a deconvolution operation to obtain an output d3, and the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, its output being recorded as x3;
d3 and x3 are spliced together and input into the third convolution module for convolution; the output of the third convolution module is used as the input of the fourth deconvolution module for a deconvolution operation to obtain an output d4, and the fourth attention module fuses d4 with the output of the fifth-from-last convolution module of the encoder, its output being recorded as x4;
d4 and x4 are spliced together and input into the fourth convolution module for convolution, and the output of the fourth convolution module is used as the input of the decoder's convolution layer for convolution, obtaining the output result of the decoder.
4. The face image highlight removal method based on a pseudo label according to claim 3, wherein the convolution module is composed of two cascaded convolution layers, each convolution layer sequentially performing a convolution operation with the number of convolution kernels equal to the number of image channels, a convolution window size of 3 x 3 and a step size of 1 x 1, a normalization operation, and activation with the ReLU function;
the attention module takes the input from the decoder as a main part and the input from the encoder as a secondary part; the main and secondary parts each sequentially undergo a convolution operation with a convolution window size of 3 x 3 and a step size of 1 x 1 and a normalization operation, yielding a main part X and a secondary part S; X and S are added, then sequentially subjected to a convolution operation with a convolution window size of 3 x 3 and a step size of 1 x 1, a normalization operation and activation with the sigmoid function to obtain a fusion result XS, and XS is multiplied by the main part X to obtain the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer: the up-sampling layer up-samples the input image to twice the size of the image input to the module, after which the convolution layer sequentially performs a convolution operation with the number of convolution kernels equal to the number of output channels, a convolution window size of 3 x 3 and a step size of 1 x 1, a normalization operation, and activation with the ReLU function.
5. The face image highlight removal method based on a pseudo label according to claim 3, wherein the convolution kernel size for max pooling in the encoder is 2 with a step size of 2, and the numbers of convolution kernels of the five convolution modules in the encoder are 64, 128, 256, 512 and 1024 in sequence; the number of convolution kernels of each attention module in the decoder is the same as that of the corresponding deconvolution module, the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, and the convolution layer in the decoder has 3 convolution kernels, a convolution window size of 1 x 1 and a step size of 1 x 1.
6. The method for removing highlight from human face image based on pseudo label according to claim 2, wherein the method of using pseudo label to improve the generalization ability of convolutional neural network comprises the following steps:
S31, generating pseudo labels for the unlabeled data set through a Gaussian process;
S32, calculating the error of the unlabeled data set according to the generated pseudo labels;
S33, calculating the loss function of the convolutional neural network according to the error of the unlabeled data set and the error of the labeled data set, and training the convolutional neural network through back propagation, wherein the loss function of the convolutional neural network is expressed as:
L_total = L_sup + λ_unsup · L_unsup;
wherein L_total is the total loss of the convolutional neural network; L_sup is the loss of the labeled data set part; L_unsup is the loss of the unlabeled data set part; λ_unsup is the weight of the unlabeled data set loss.
7. The face image highlight removal method based on a pseudo label according to claim 6, wherein generating pseudo labels for the unlabeled data set through a Gaussian process comprises the following steps:
when the labeled data set is used for the first time, recording the feature vectors obtained from the last convolution module in the encoder as a matrix;
performing sparse coding on this matrix and learning a dictionary F of labeled data set feature vectors;
when the unlabeled data set is fed into the neural network, obtaining the corresponding feature vector from the last convolution block, and projecting this feature vector onto the learned feature vector space F;
given the labeled data set and its feature vectors, the distribution of an unlabeled data set feature vector is equivalent to a Gaussian distribution, and the mean of the equivalent Gaussian distribution is taken as the pseudo label of the unlabeled data set feature vector.
8. The face image highlight removal method based on a pseudo label according to claim 7, wherein the distribution of the unlabeled data set feature vectors is equivalent to a Gaussian distribution, the mean of the equivalent Gaussian distribution being expressed as:

μ̂_U = μ_U + K(V_U, V_L)[K(V_L, V_L) + σ²I]^(-1)(z_L − μ_L);

and the variance of the equivalent Gaussian distribution being expressed as:

σ̂²_U = K(V_U, V_U) − K(V_U, V_L)[K(V_L, V_L) + σ²I]^(-1)K(V_L, V_U).
9. The face image highlight removal method based on a pseudo label according to claim 8, wherein the loss function of the unlabeled data set is expressed as:

L_unsup = Σ_j ||ẑ_j − μ̂_j||²_2 / σ̂²_j + λ_1||α||_1;

wherein ẑ_j is the feature vector output by the last convolution module of the encoder; μ̂_j is the pseudo label obtained by the Gaussian process, whose value is the mean of the Gaussian distribution; σ̂²_j is the variance of the Gaussian process corresponding to the j-th convolution layer; ||·||_2 is the L2 norm; λ_1 is the sparse coefficient; α is the sparse vector.
10. The face image highlight removal method based on a pseudo label according to claim 6, wherein the loss L_sup of the labeled data set part is expressed as:

L_sup = L_pixel + L_perception;

L_pixel = ||y_pred − y||_1;

L_perception = λ_2||Φ_VGG(y_pred) − Φ_VGG(y)||²_2;

wherein y_pred is the prediction result of the neural network; y is the real label; ||·||_1 is the L1 distance between images; λ_2 is a weight; Φ_VGG represents the VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||²_2 is the squared L2 norm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210208825.7A CN114549387A (en) | 2022-03-03 | 2022-03-03 | Face image highlight removal method based on pseudo label |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114549387A true CN114549387A (en) | 2022-05-27 |
Family
ID=81661283
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210208825.7A Pending CN114549387A (en) | 2022-03-03 | 2022-03-03 | Face image highlight removal method based on pseudo label |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114549387A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113361548A (en) * | 2021-07-05 | 2021-09-07 | 北京理工导航控制科技股份有限公司 | Local feature description and matching method for highlight image |
CN113361548B (en) * | 2021-07-05 | 2023-11-14 | 北京理工导航控制科技股份有限公司 | Local feature description and matching method for highlight image |
CN115131252A (en) * | 2022-09-01 | 2022-09-30 | 杭州电子科技大学 | Metal object surface highlight removal method based on secondary coding and decoding structure |
CN115131252B (en) * | 2022-09-01 | 2022-11-29 | 杭州电子科技大学 | Metal object surface highlight removal method based on secondary coding and decoding structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||