CN114549387A - Face image highlight removal method based on pseudo label - Google Patents

Face image highlight removal method based on pseudo label

Info

Publication number
CN114549387A
CN114549387A
Authority
CN
China
Prior art keywords: convolution, data set, module, highlight, output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208825.7A
Other languages
Chinese (zh)
Inventor
黄颖
王泽荃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210208825.7A priority Critical patent/CN114549387A/en
Publication of CN114549387A publication Critical patent/CN114549387A/en
Pending legal-status Critical Current

Classifications

    • G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N 3/045 — Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/20081 — Special algorithmic details; Training; Learning
    • G06T 2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30201 — Subject of image; Human being; Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the fields of image processing, computer vision and deep learning, and in particular to a pseudo-label-based method for removing highlight from face images. The method obtains a synthetic face data set through a rendering engine and, together with real highlight face images, forms a labeled data set and an unlabeled data set; trains a convolutional neural network with the labeled data set; improves the generalization ability of the network with a pseudo-label method; and inputs a face image with highlight into the convolutional neural network to obtain a highlight-removed picture. The highlight-removed face image obtained by the method conforms to natural face appearance, and facial texture details are not damaged.

Description

Face image highlight removal method based on pseudo label
Technical Field
The invention relates to the field of image processing, computer vision and deep learning, in particular to a face image highlight removal method based on a pseudo label.
Background
Specular reflection on a human face produces facial highlight, which in most cases noticeably degrades image quality and reduces the aesthetic quality of the face. Removing highlight from face images is therefore crucial in fields such as film and television production, virtual reality, and face relighting. Existing face-image highlight removal algorithms mainly comprise the following categories:
1) a numerical optimization method based on a priori assumptions. This type of algorithm relies on making assumptions about the physical characteristics of the highlight of the face, and manually designing prior assumptions and constraints based thereon. The aim of separating highlight components and non-highlight components of the human face is achieved by minimizing an optimization function.
2) An image processing method based on image chromaticity space. In the method, different pixels are divided into highlight pixels and non-highlight pixels in a chromaticity space by analyzing the chromaticity space of the image, and the highlight pixels are converted into the non-highlight pixels through image processing calculation.
3) Data-driven deep learning methods. With the help of deep learning technology, such algorithms take a highlight image as the model input and a highlight-free image as the model output, and the model learns the ability to remove facial highlight from large-scale data.
With the continuous development of deep learning technology, data-driven deep learning methods have become the mainstream direction for face highlight removal. Because it is difficult to obtain labeled face highlight data, a composite data set usually has to be produced with a rendering engine. However, synthesized face images still differ from real face images, and a de-highlighting neural network trained only on synthetic faces, when applied to real faces, can cause problems such as color distortion and over-smoothed results.
Disclosure of Invention
In order to overcome the defect of a deep learning face highlight removing algorithm, the invention provides a face image highlight removing method based on a pseudo label, which specifically comprises the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
s4, inputting the face image with highlight into the convolutional neural network which completes training to obtain a highlight-removed picture;
when the convolutional neural network is trained, the sum of the loss between the labeled data and their labels and the loss between the unlabeled data and their pseudo labels is used as the loss function of the convolutional neural network, and the network is trained by back propagation.
Further, the method for acquiring the tagged data set and the untagged data set comprises the following steps:
selecting a plurality of face 3D models, loading them into a physics-based rendering engine, selecting a plurality of HDR environment lights, and rendering the face models; rendering the diffuse reflection part D and the specular reflection part S of a face separately according to the Phong illumination model, and expressing the rendered face image with highlight as I = D + S; the obtained highlight face image and its original image form a highlight/highlight-removed synthetic face image pair, and these pairs form the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
Further, the convolutional neural network comprises an encoder and a decoder, the image input into the convolutional neural network is used as the input of the encoder, the encoder comprises 5 cascaded convolutional modules, and the output of each convolutional module is input into the next convolutional module after being subjected to maximum pooling;
taking the output of the encoder as the input of the decoder, wherein the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer; the first deconvolution module performs deconvolution on the input of the decoder, and its output is recorded as d1; the first attention module fuses d1 with the output of the penultimate convolution module of the encoder, and its output is recorded as x1;
d1 and x1 are spliced together and input into the first convolution module for convolution; the output of the first convolution module is used as the input of the second deconvolution module for deconvolution, obtaining an output result d2; the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, and its output is recorded as x2;
d2 and x2 are spliced together and input into the second convolution module for convolution; the output of the second convolution module is used as the input of the third deconvolution module for deconvolution, obtaining an output result d3; the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, and its output is recorded as x3;
d3 and x3 are spliced together and input into the third convolution module for convolution; the output of the third convolution module is used as the input of the fourth deconvolution module for deconvolution, obtaining an output result d4; the fourth attention module fuses d4 with the output of the fifth-from-last convolution module of the encoder, and its output is recorded as x4;
d4 and x4 are spliced together and input into the fourth convolution module for convolution; the output of the fourth convolution module is used as the input of the convolution layer of the decoder for convolution, obtaining the output result of the decoder.
Furthermore, the convolution module is composed of two cascaded convolution layers; each convolution layer sequentially performs a convolution operation whose number of convolution kernels equals the number of output channels, with a 3 × 3 convolution window and a step size of 1 × 1, a normalization operation, and activation with the ReLU function;
the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part and fuses them: the main part and the secondary part each pass, in sequence, through a convolution operation with a 3 × 3 window and a 1 × 1 step and a normalization operation, giving a main component X and a secondary component S; X and S are added and then sequentially pass through a convolution operation with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the sigmoid function, giving a fusion result XS; the fusion result XS is multiplied by the main component X to form the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer; the up-sampling layer up-samples the input image to twice its original size, after which the convolution layer sequentially performs a convolution operation whose number of kernels equals the number of output channels, with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the ReLU function.
Further, the convolution kernel size for maximum pooling in the encoder is 2 with a step size of 2, and the numbers of convolution kernels of the five convolution modules of the encoder are 64, 128, 256, 512 and 1024 in sequence; the number of convolution kernels of each attention module in the decoder is the same as that of the corresponding deconvolution module, the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, the number of convolution kernels of the convolution layer in the decoder is 3, the convolution window size is 1 × 1, and the step size is 1 × 1.
Further, the method for using the pseudo label to improve the generalization capability of the convolutional neural network comprises the following steps:
s31, generating pseudo labels for the unlabeled data sets through a Gaussian process;
s32, calculating the error of the unlabeled data set according to the generated pseudo label;
s33, calculating a loss function of the convolutional neural network according to the error of the unlabeled data set and the error of the labeled data set, and training the convolutional neural network through back propagation, wherein the loss function of the convolutional neural network is expressed as:
L_total = L_sup + λ_unsup · L_unsup
wherein L_total is the total loss of the convolutional neural network; L_sup is the loss of the labeled data set part; L_unsup is the loss of the unlabeled data set part; λ_unsup is the weight of the unlabeled data set loss.
Further, for the unlabeled dataset, the process of generating the pseudo label by the gaussian process comprises the following steps:
when the labeled data set is used for training, the feature vectors z_l obtained from the last convolution module in the encoder are recorded into a matrix Z_L; sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors is learned;
when the unlabeled data set is fed into the neural network, the corresponding feature vector z_u is obtained from the last convolution module, and z_u is projected onto the learned feature vector space F;
in the case where the tagged data set and the tagged data set feature vector are known, the distribution of the unlabeled data set feature vector is equivalent to a gaussian distribution, and the mean in the equivalent gaussian distribution is taken as the pseudo-tag of the unlabeled data set feature vector.
Further, the distribution of the feature vectors of the unlabeled dataset is equivalent to a gaussian distribution, and the mean of the equivalent gaussian distribution is expressed as:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
the variance of the equivalent Gaussian distribution is expressed as:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
wherein Z_L and Z_U are the feature vector matrices of the labeled data set and of the unlabeled data set respectively; K(X, Y) is a kernel function expressed as K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|), where ⟨X, Y⟩ represents the inner product of vector X and vector Y and |X| represents the modular length of vector X; σ has a value of 1; and I is an identity matrix.
Further, the loss function of the unlabeled dataset is represented as:
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
wherein z_u is the feature vector output by the last convolution module of the encoder; ẑ_u is the pseudo label obtained by the Gaussian process, whose value is the mean μ_u of the equivalent Gaussian distribution; σ_u² is the variance of the equivalent Gaussian distribution; ||·||₂ is the L2 norm; λ₁ is the sparse coefficient; and α is the sparse vector.
Further, the loss L_sup of the labeled data set part is expressed as:
L_sup = L_pixel + L_perception
L_pixel = ||y_pred − y||₁
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
According to the pseudo-label-based face highlight removal method, pseudo labels for unannotated face pictures are produced by jointly modeling labeled and unlabeled data with a Gaussian process; these pseudo labels enhance the generalization ability of the neural network on real face images and improve the highlight removal effect on real faces.
Drawings
FIG. 1 is a simplified flowchart of an embodiment of the pseudo-label-based face image highlight removal algorithm of the present invention;
FIG. 2 is a schematic diagram of the structure of the highlight removal network of the present invention;
FIG. 3 is a schematic diagram of the present invention for training highlight removal networks;
FIG. 4 shows test results of the trained neural network on the CelebA data set (the first row shows the original images; the second row shows the corresponding highlight-removed results).
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a face image highlight removal method based on a pseudo label, which specifically comprises the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
and S4, inputting the face image with highlight into the convolutional neural network which is trained, and obtaining the picture without highlight.
In this embodiment, a flow of a face image highlight removal method based on a pseudo label is as shown in fig. 1 to 4, and specifically includes the following steps:
S10, a synthetic face data set (containing both highlight faces and highlight-free diffuse faces) is rendered with a rendering engine and, together with real-world highlight images, forms a labeled data set and an unlabeled data set to train the neural network.
S101, selecting a plurality of face 3D models (obj files) and loading them into a physics-based rendering engine, selecting a plurality of HDR (high dynamic range) environment lights, and rendering the face models. Rendering according to the Phong illumination model yields the diffuse reflection part D and the specular reflection part S of the face separately, and according to the formula:
I=D+S
an image I is obtained, where S is regarded as the highlight of the face, I is the face image with highlight, and D is the face image without highlight. Repeating the above operations generates a large number of highlight/highlight-removed synthetic face image pairs as the labeled data set.
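For illustration only, the compositing step in S101 above can be sketched as follows; the file paths, the use of imageio, and the clamping to the valid range are assumptions for the sketch and are not prescribed by the invention.

```python
# Minimal sketch: assemble one labeled training pair (I, D) from rendered components.
# Assumes the renderer exported the diffuse image D and the specular image S as 8-bit files;
# paths and naming are illustrative only.
import numpy as np
import imageio.v2 as imageio

def make_labeled_pair(diffuse_path, specular_path):
    D = imageio.imread(diffuse_path).astype(np.float32) / 255.0   # highlight-free face (label)
    S = imageio.imread(specular_path).astype(np.float32) / 255.0  # specular highlight component
    I = np.clip(D + S, 0.0, 1.0)                                  # I = D + S, face image with highlight
    return I, D                                                   # (network input, label)
```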
S102, manually selecting a plurality of highlight face pictures from the real face data set, and taking the pictures as a non-label data set.
S20, the convolutional neural network is first trained with the labeled data set. The method specifically comprises the following steps:
s201, the structure of the convolutional neural network.
The convolutional neural network can be specifically divided into an encoder (encoding) and a decoder (decoding), and specifically includes:
1) Encoder: the encoder comprises 5 successive convolution modules conv_block. The first convolution module takes the image as input, and each of the last four convolution modules takes the (max-pooled) output of the previous convolution module as input; max pooling is performed after each of the first 4 convolution modules, with a pooling kernel size of 2 and a stride of 2. The numbers of convolution kernels of the five convolution modules are [64, 128, 256, 512, 1024].
2) Decoder: the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer. The first deconvolution module of the decoder accepts as input the output of the last convolution module of the encoder, producing output d1; the first attention module accepts d1 and the output e4 of the penultimate convolution module of the encoder as inputs, producing output x1; d1 is concatenated with x1 and fed into the first convolution module. The output of that convolution module is fed into the second deconvolution module to obtain an output d2. The second attention module accepts d2 and the output e3 of the third-from-last convolution module of the encoder as inputs, producing an output x2; d2 is concatenated with x2 and fed into the second convolution module. And so on; finally, the output of the last convolution module is fed into the convolution layer. The numbers of convolution kernels of the 4 deconvolution modules are [512, 256, 128, 64] respectively. The number of convolution kernels of each attention module is the same as that of the corresponding deconvolution module. The number of kernels of the final convolution layer is 3, its convolution window size is 1 × 1, and its step size is 1 × 1. The overall structure of the convolutional neural network is shown in Fig. 2. The convolution module structure in the decoder is consistent with that in the encoder; the numbers of convolution kernels of the decoder, from input to output, are [512, 256, 128, 64, 3]; the other convolution parameters are consistent with the encoder and are not described again here.
The convolution module, attention module, deconvolution module and convolution layer in the convolutional neural network are specified as follows (an illustrative implementation sketch of these blocks is given after this list):
1) Convolution module conv _ block: the convolution module is composed of two convolution layers. The two convolution layers are identical except for the inputs. The number of convolution layer convolution kernels is the number of output channels ch, the size of a convolution window is 3 x3, the step size is 1 x1, then normalization is carried out, and the activation function is a relu function. The convolution structures adopted in the encoder and the decoder are the same, and only the convolution kernels of the convolution layers are different in number, wherein the convolution kernels of 5 convolution modules in the encoder are 64,128,256,512 and 1024 in sequence, and the convolution kernels of 4 convolution modules in the decoder are 512,256,128 and 64 in sequence.
2) Attention module attu_block: the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part. Each part undergoes a convolution operation with a 3 x 3 window and a 1 x 1 step followed by normalization, giving a main component X and a secondary component S; X and S are added to obtain XS, which then undergoes a convolution with a 3 x 3 window and a 1 x 1 step, normalization, and a sigmoid activation; multiplying X by XS yields the output of the attention module.
3) Deconvolution module up_block: the deconvolution module first up-samples the input image to twice its original size. A convolution layer is then applied, whose number of convolution kernels is the number of output channels ch, with a 3 x 3 convolution window and a 1 x 1 step, followed by normalization and a ReLU activation; the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively.
4) Convolution layer: the output of the last convolution module of the decoder is input into a convolution layer for convolution; the number of convolution kernels of this convolution layer is the number of channels of the input image, the convolution window size is 3 × 3, and the step size is 1 × 1.
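The following PyTorch sketch illustrates the three building blocks described in this list under the stated hyper-parameters (3 x 3 convolutions with step 1, a normalization layer, ReLU/sigmoid activations, 2x up-sampling). Batch normalization and nearest-neighbour up-sampling are assumptions, since the description only says "normalization" and "up-sampling"; the sketch is illustrative and not the patented implementation itself.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """conv_block: two cascaded 3x3 conv + normalization + ReLU layers."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),          # normalization variant assumed
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class AttentionBlock(nn.Module):
    """attu_block: fuse a decoder feature (main part) with an encoder skip feature (secondary part)."""
    def __init__(self, ch):
        super().__init__()
        self.main_conv = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))
        self.skip_conv = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))
        self.fuse = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(ch),
            nn.Sigmoid(),
        )

    def forward(self, main, skip):
        X = self.main_conv(main)    # main part from the decoder
        S = self.skip_conv(skip)    # secondary part from the encoder
        XS = self.fuse(X + S)       # fused attention map
        return X * XS               # output of the attention module

class UpBlock(nn.Module):
    """up_block: 2x up-sampling followed by 3x3 conv + normalization + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.up(x))
```

For example, the first decoder stage would chain UpBlock(1024, 512) on the encoder output, AttentionBlock(512) on (d1, e4), and ConvBlock(1024, 512) on their concatenation, matching the kernel counts given above.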
S202: for tagged datasets (composite images), the error between the prediction and the tag is calculated. The pixel error between the label and the prediction result is expressed as:
L_pixel = ||y_pred − y||₁
meanwhile, the perceptual error between the label and the prediction result is calculated and expressed as:
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight that those skilled in the art can adjust according to actual needs; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
The sum of these two parts forms the supervised loss function of the neural network:
L_sup = L_pixel + L_perception
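As an illustration, the supervised loss can be written in PyTorch as below; the choice of VGG16 feature layer, the ImageNet weights, and the default value of λ₂ are assumptions (the invention leaves λ₂ to be adjusted as needed), and a recent torchvision is assumed for the weights argument.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SupervisedLoss(nn.Module):
    """L_sup = L_pixel + L_perception as described above (sketch)."""
    def __init__(self, lambda2=0.01, feature_layer=16):
        super().__init__()
        # Frozen VGG16 feature extractor; the cut-off layer is an illustrative choice.
        self.vgg = vgg16(weights='IMAGENET1K_V1').features[:feature_layer].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.lambda2 = lambda2

    def forward(self, y_pred, y):
        l_pixel = torch.mean(torch.abs(y_pred - y))                    # L1 distance between images
        l_percep = torch.mean((self.vgg(y_pred) - self.vgg(y)) ** 2)   # squared L2 on VGG features (averaged for scale)
        return l_pixel + self.lambda2 * l_percep
```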
and S30, using a pseudo label method to improve the generalization ability of the neural network. The method specifically comprises the following steps:
s301, generating a pseudo label through a Gaussian process for a label-free data set (a real image), specifically:
When the labeled data set is used for training, the feature vector z_l^(i) obtained from the last convolution module in the encoder is recorded into a matrix Z_L = [z_l^(1), ..., z_l^(N_l)], wherein N_l represents the number of annotated images and i represents the index of the image. Assuming that z_l^(i) is a 1 × M feature vector, Z_L is an N_l × M matrix.
According to sparse representation theory, a sample set X = {x_1, ..., x_n} may be represented by a linear combination of a set of basis vectors Φ = {φ_1, ..., φ_k}:
x = Σ_i α_i φ_i
Thus, if an over-complete set of basis vectors D = {φ_1, ..., φ_k} is learned from the sample set X = {x_1, ..., x_n}, X can be decomposed as
X = Dα
where X is the sample set, D is the basis vector group (dictionary), and α is the coefficient vector; this kind of decomposition is sparse coding. The aim of sparse coding is that the recombined decomposition result is as close as possible to the original sample set while α is as sparse as possible, namely:
min ||α||₀  s.t.  Dα = X
Preferably, sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors can be obtained through learning:
Z_L = F α_l
wherein α_l is the coefficient vector corresponding to the labeled data set.
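A simple way to realize this dictionary learning step is to alternate a few ISTA iterations for the sparse codes with a least-squares dictionary update, as sketched below; the solver, the ℓ1 relaxation of the ℓ0 objective, the number of atoms, and the row-wise dictionary convention (Z ≈ A·F, the transpose of the X = Dα form) are assumptions, since the invention does not prescribe how the sparse coding is solved.

```python
import torch

def ista_codes(Z, F, lam=0.1, steps=50):
    """Sparse codes A such that Z ≈ A @ F, via ISTA on 0.5*||Z - A F||^2 + lam*||A||_1."""
    L = torch.linalg.matrix_norm(F, ord=2) ** 2 + 1e-8        # Lipschitz constant of the smooth part
    A = torch.zeros(Z.shape[0], F.shape[0])
    for _ in range(steps):
        grad = (A @ F - Z) @ F.T                              # gradient of the reconstruction term
        A = torch.nn.functional.softshrink(A - grad / L, lambd=float(lam / L))
    return A

def learn_dictionary(Z_L, n_atoms=256, lam=0.1, iters=10):
    """Learn a dictionary F (atoms as rows) for the labeled feature matrix Z_L (N_l x M)."""
    F = torch.randn(n_atoms, Z_L.shape[1])
    F = F / F.norm(dim=1, keepdim=True)
    for _ in range(iters):
        A = ista_codes(Z_L, F, lam)                           # sparse coefficients alpha_l
        F = torch.linalg.lstsq(A, Z_L).solution               # least-squares dictionary update
        F = F / (F.norm(dim=1, keepdim=True) + 1e-8)          # renormalize the atoms
    return F, A
```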
Meanwhile, when the unlabeled data set is fed into the neural network, the last convolution block also yields a corresponding feature vector z_u. The method assumes that the feature vectors z_u of the unlabeled data set and the feature vectors z_l of the labeled data set belong to the same vector space; therefore, z_u and z_l can share one dictionary F. Thus, when training with the unlabeled data set, the feature vector z_u obtained from the last convolution module in the encoder can be projected onto the learned feature vector space F. Preferably, the annotated data and the unlabeled data can be jointly modeled using a Gaussian Process (GP).
The core of a Gaussian process is modeling a function with an infinite-dimensional multivariate Gaussian distribution. A Gaussian process is determined by a mean function and a covariance function:
m(v) = E[f(v)]
K(v, v') = E[(f(v) − m(v))(f(v') − m(v'))]
where v and v' are random variables, f is the function modeled by the Gaussian process, E denotes expectation, m(v) is the mean function, and K is the kernel (covariance) function. The Gaussian process can be defined as:
f(v) ~ GP(m(v), K(v, v'))
For a set of random variables V = [v_1, v_2, ..., v_n], the result of the Gaussian process conforms to a multidimensional Gaussian distribution, namely:
f(V) ~ N(m(V), K(V, V))
Thus, the joint distribution of the feature vectors of the labeled and unlabeled data sets can be modeled as a multivariate Gaussian:
[z_L; z_U] ~ N([μ_L; μ_U], [[K(Z_L, Z_L), K(Z_L, Z_U)], [K(Z_U, Z_L), K(Z_U, Z_U)]])
wherein z_L is the Gaussian process of the feature vectors of the labeled data set, z_U is the Gaussian process of the feature vectors of the unlabeled data set, μ_L is the mean of the feature vectors of the labeled data set, and μ_U is the mean of the feature vectors of the unlabeled data set. With the Gaussian process, the distribution of the unlabeled data set feature vectors can be calculated given the labeled data set and the labeled data set feature vectors. The last feature vector space is modeled using the Gaussian Process (GP); the distribution of an unlabeled sample feature vector z_u is then equivalent to a Gaussian distribution:
z_u | Z_L ~ N(μ_u, Σ_u)
where Z_L is the labeled data and N(μ_u, Σ_u) is a multivariate Gaussian distribution, wherein μ_u is the mean of the Gaussian process:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
and Σ_u is the variance of the Gaussian process:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
σ is set to 1. K(X, Y) is a kernel function defined as:
K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|)
wherein ⟨X, Y⟩ represents the inner product of vector X and vector Y, and |X| represents the modular length of vector X; I is an identity matrix.
Preferably, the mean μ_u of the Gaussian process is taken as the pseudo-label feature vector ẑ_u.
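Under the formulas above, the pseudo-label computation can be sketched as follows; the normalized inner-product kernel, σ = 1, and the use of the labeled feature matrix itself as the regression target follow the reading adopted here and should be treated as assumptions rather than the definitive formulation.

```python
import torch

def cosine_kernel(X, Y):
    """K(X, Y) = <X, Y> / (|X| |Y|), applied row-wise to feature matrices."""
    Xn = X / (X.norm(dim=1, keepdim=True) + 1e-8)
    Yn = Y / (Y.norm(dim=1, keepdim=True) + 1e-8)
    return Xn @ Yn.T

def gp_pseudo_labels(Z_L, Z_U, sigma=1.0):
    """Posterior mean (pseudo label) and covariance of unlabeled features given labeled ones."""
    K_LL = cosine_kernel(Z_L, Z_L)
    K_UL = cosine_kernel(Z_U, Z_L)
    K_UU = cosine_kernel(Z_U, Z_U)
    A = K_LL + (sigma ** 2) * torch.eye(Z_L.shape[0])
    # Solve linear systems instead of forming an explicit inverse.
    mean = K_UL @ torch.linalg.solve(A, Z_L)            # pseudo labels: mean of the Gaussian
    cov = K_UU - K_UL @ torch.linalg.solve(A, K_UL.T)   # posterior covariance
    return mean, cov
```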
S302, calculating the error of the unlabeled data, specifically:
for the unlabeled data, the formula
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
is adopted to calculate the error between the feature vector obtained by the encoder and the feature vector of the pseudo label, wherein z_u is the feature vector predicted by the last convolution block, ẑ_u is the pseudo label obtained by the Gaussian process, and σ_u² is the variance of the corresponding Gaussian distribution. The sparse term requires that, after sparse representation, the decomposition result is as close as possible to the feature vectors of the labeled data set while the sparse vector α is as sparse as possible; λ₁ is the sparse coefficient, which can be adjusted by the user.
S303, synthesizing the loss functions of the labeled data set and the unlabeled data set, and further training the network.
The loss function of the marked data and the unmarked data is synthesized, and the loss function provided by the invention is as follows:
L_total = L_sup + λ_unsup · L_unsup
wherein λ_unsup is the weight of the unlabeled data set loss; this parameter can be adjusted by the user.
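Putting the supervised and unsupervised parts together, one training step can be sketched as follows; it reuses SupervisedLoss, ista_codes and gp_pseudo_labels from the earlier sketches, and the encode() method, the labeled feature bank and all weight values are illustrative assumptions rather than details fixed by the invention (the per-term variance weighting from the L_unsup formula is omitted for brevity).

```python
import torch

def training_step(model, sup_loss, optimizer, labeled_batch, unlabeled_batch,
                  dictionary_F, labeled_feature_bank, lambda_unsup=0.1, lambda1=0.1):
    """One optimization step on L_total = L_sup + lambda_unsup * L_unsup (sketch)."""
    x_l, y = labeled_batch            # synthetic highlight image and its highlight-free label
    x_u = unlabeled_batch             # real highlight face image without a label

    y_pred = model(x_l)
    L_sup = sup_loss(y_pred, y)       # L_pixel + L_perception

    z_u = model.encode(x_u)           # assumed hook exposing last-encoder-block features
    with torch.no_grad():
        z_hat, _ = gp_pseudo_labels(labeled_feature_bank, z_u)   # pseudo labels from the GP
        alpha = ista_codes(z_u, dictionary_F)                    # sparse codes of the unlabeled features
    L_unsup = ((z_u - z_hat) ** 2).sum(dim=1).mean() + lambda1 * alpha.abs().sum(dim=1).mean()

    loss = L_sup + lambda_unsup * L_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```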
And S40, inputting the face image with highlight into the trained convolutional neural network to obtain the highlight-removing result.
In an example of the present invention, a flow chart for synthesis and training is shown in FIG. 1. The method comprises the steps that based on a human face OBJ model and HDR environment illumination, a synthetic human face data set is obtained by giving a specular reflection component, a diffuse reflection component and a combination of the specular reflection component and the diffuse reflection component and rendering by using a rendering engine; a synthesized face data set with given illumination parameters is used as a data set with a label, a real face with highlight is collected as a data set without the label, the two data sets are input into a neural network model, and a 'pseudo label' of the real face is extracted through the synthesized face data.
When the labeled data is trained, the feature vectors of the labeled images at the last layer are extracted; on this basis, a dictionary describing the labeled feature vectors is learned by sparse coding, and the feature vectors are encoded into sparse coefficients with this dictionary. When an unlabeled image is input, a Gaussian process is computed from the sparse representation and the unlabeled image's feature vector at the last layer, and the mean of the Gaussian process is used as the 'pseudo label' of the unlabeled image. The neural network model learns the ability to remove highlight by simultaneously reducing the distance between the synthetic image and its label and between the real image and its 'pseudo label'. The neural network ψ is obtained after training is finished; applying ψ to an arbitrarily input face image yields the result after highlight removal.
1) Training data synthesis
A number of 3D face models are taken (they can be automatically generated with a three-dimensional face modeling algorithm), covering different genders, face shapes, builds and other categories. The models are put into a physically based renderer (such as Mitsuba), and different HDR environment illuminations are selected to render them. The parameters of the renderer's illumination model are set to obtain the specular reflection component S and the diffuse reflection component D of the face, and S and D are added by image addition to obtain the face image I with highlight. The data set is constructed using D and I, with D as the label of I. This process is repeated to obtain a sufficiently large labeled data set.
2) Neural network training
Besides the labeled data set, a number of real highlight face images are selected (manually) as the unlabeled data set. Whether labeled or unlabeled, each image can be regarded as the sum of a diffuse reflection image and a specular reflection image:
I = D + S;
With the neural network ψ shown in Fig. 2, for an input highlight image x it is desired to obtain the de-highlighting result
ŷ = ψ(x)
The network is trained according to the network training architecture shown in fig. 3. The network training is divided into a supervised part and an unsupervised part, wherein the supervised part comprises:
For the labeled data set (composite images),
L_pixel = ||y_pred − y||₁
is used to compute the pixel error between the label and the prediction, where y_pred is the prediction result of the neural network, y is the real label, and ||·||₁ is the L1 distance between images.
Meanwhile, the following steps are adopted:
Figure BDA0003530198900000133
a perceptual error between the label and the prediction is calculated.
Supervised losses include:
L_sup = L_pixel + L_perception
the unsupervised part comprises:
For the unlabeled data,
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
is adopted to calculate the error between the feature vector obtained by the encoder and the feature vector of the pseudo label.
The loss function of the neural network proposed by the present invention is:
L_total = L_sup + λ_unsup · L_unsup
the optimization goal of the neural network is this loss function. The overall training process of the neural network is shown in fig. 3.
The training configuration of this part is as follows: 1000 epochs, batch size 4, Adam optimizer. The method of the invention was tested as follows: the experimental platform is a PC, the GPU is an NVIDIA GeForce GTX 1080Ti with 12 GB of video memory, and the software configuration comprises Ubuntu 18.04, CUDA 11.3, Python 3.8.0 and the PyTorch framework. Tests show that the method provided by the invention can effectively remove highlight.
3) De-highlighting network applications
The face image with highlight is input into the trained neural network ψ to obtain the face with highlight removed. As shown in Fig. 4, the first row shows face images with highlight and the second row shows the images after highlight removal by the method; it can be seen that the method effectively removes highlight while keeping most details of the image.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A face image highlight removal method based on a pseudo label is characterized by specifically comprising the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
and S4, inputting the face image with highlight into the convolutional neural network which is trained, and obtaining the picture without highlight.
2. The method for removing highlight from a face image based on a pseudo label according to claim 1, characterized in that the method for acquiring the labeled data set and the unlabeled data set comprises the following steps:
selecting a plurality of face 3D models, loading them into a physics-based rendering engine, selecting a plurality of HDR environment lights, and rendering the face models; rendering the diffuse reflection part D and the specular reflection part S of a face separately according to the Phong illumination model, and expressing the rendered face image with highlight as I = D + S; the obtained highlight face image and its original image form a highlight/highlight-removed synthetic face image pair, and these pairs form the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
3. The method for removing highlight from human face image based on pseudo label according to claim 1, wherein the convolutional neural network comprises an encoder and a decoder, the image inputted into the convolutional neural network is used as the input of the encoder, the encoder comprises 5 cascaded convolutional modules, and the output of each convolutional module is inputted into the next convolutional module after being maximally pooled;
taking the output of the encoder as the input of the decoder, wherein the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer; the first deconvolution module performs deconvolution on the input of the decoder, and its output is recorded as d1; the first attention module fuses d1 with the output of the penultimate convolution module of the encoder, and its output is recorded as x1;
d1 and x1 are spliced together and input into the first convolution module for convolution; the output of the first convolution module is used as the input of the second deconvolution module for deconvolution, obtaining an output result d2; the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, and its output is recorded as x2;
d2 and x2 are spliced together and input into the second convolution module for convolution; the output of the second convolution module is used as the input of the third deconvolution module for deconvolution, obtaining an output result d3; the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, and its output is recorded as x3;
d3 and x3 are spliced together and input into the third convolution module for convolution; the output of the third convolution module is used as the input of the fourth deconvolution module for deconvolution, obtaining an output result d4; the fourth attention module fuses d4 with the output of the fifth-from-last convolution module of the encoder, and its output is recorded as x4;
d4 and x4 are spliced together and input into the fourth convolution module for convolution; the output of the fourth convolution module is used as the input of the convolution layer of the decoder for convolution, obtaining the output result of the decoder.
4. The method for removing highlight from a face image based on a pseudo label according to claim 3, characterized in that the convolution module is composed of two cascaded convolution layers; each convolution layer sequentially performs a convolution operation whose number of convolution kernels equals the number of output channels, with a 3 × 3 convolution window and a step size of 1 × 1, a normalization operation, and activation with the ReLU function;
the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part and fuses them: the main part and the secondary part each pass, in sequence, through a convolution operation with a 3 × 3 window and a 1 × 1 step and a normalization operation, giving a main component X and a secondary component S; X and S are added and then sequentially pass through a convolution operation with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the sigmoid function, giving a fusion result XS; the fusion result XS is multiplied by the main component X to form the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer; the up-sampling layer up-samples the input image to twice its original size, after which the convolution layer sequentially performs a convolution operation whose number of kernels equals the number of output channels, with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the ReLU function.
5. The method for removing highlight from a face image based on a pseudo label according to claim 3, characterized in that the convolution kernel size for maximum pooling in the encoder is 2 with a step size of 2, and the numbers of convolution kernels of the five convolution modules of the encoder are 64, 128, 256, 512 and 1024 in sequence; the number of convolution kernels of each attention module in the decoder is the same as that of the corresponding deconvolution module, the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, the number of convolution kernels of the convolution layer in the decoder is 3, the convolution window size is 1 × 1, and the step size is 1 × 1.
6. The method for removing highlight from human face image based on pseudo label according to claim 2, wherein the method of using pseudo label to improve the generalization ability of convolutional neural network comprises the following steps:
s31, generating pseudo labels for the unlabeled data sets through a Gaussian process;
s32, calculating the error of the unlabeled data set according to the generated pseudo label;
s33, calculating a loss function of the convolutional neural network according to the error of the unlabeled data set and the error of the labeled data set, and training the convolutional neural network through back propagation, wherein the loss function of the convolutional neural network is expressed as:
L_total = L_sup + λ_unsup · L_unsup
wherein L_total is the total loss of the convolutional neural network; L_sup is the loss of the labeled data set part; L_unsup is the loss of the unlabeled data set part; λ_unsup is the weight of the unlabeled data set loss.
7. The method for removing highlight from a face image based on a pseudo label according to claim 6, characterized in that, for the unlabeled data set, the process of generating the pseudo label through a Gaussian process comprises the following steps:
when the labeled data set is used for training, the feature vectors z_l obtained from the last convolution module in the encoder are recorded into a matrix Z_L; sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors is learned;
when the unlabeled data set is fed into the neural network, the corresponding feature vector z_u is obtained from the last convolution module, and z_u is projected onto the learned feature vector space F;
in the case where the tagged data set and the tagged data set feature vector are known, the distribution of the unlabeled data set feature vector is equivalent to a gaussian distribution, and the mean in the equivalent gaussian distribution is taken as the pseudo-tag of the unlabeled data set feature vector.
8. The method for removing highlights from a face image based on a pseudo label according to claim 7, characterized in that the distribution of the feature vectors of the unmarked data set is equivalent to a gaussian distribution, and the mean of the equivalent gaussian distribution is expressed as:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
the variance of the equivalent Gaussian distribution is expressed as:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
wherein Z_L and Z_U are the feature vector matrices of the labeled data set and of the unlabeled data set respectively; K(X, Y) is a kernel function expressed as K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|), where ⟨X, Y⟩ represents the inner product of vector X and vector Y and |X| represents the modular length of vector X; σ has a value of 1; and I is an identity matrix.
9. The method for removing highlights from a face image based on a pseudo label according to claim 8, wherein the loss function of the unlabeled data set is expressed as:
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
wherein z_u is the feature vector output by the last convolution module of the encoder; ẑ_u is the pseudo label obtained by the Gaussian process, whose value is the mean μ_u of the equivalent Gaussian distribution; σ_u² is the variance of the equivalent Gaussian distribution; ||·||₂ is the L2 norm; λ₁ is the sparse coefficient; and α is the sparse vector.
10. The method for removing highlight from a face image based on a pseudo label according to claim 6, characterized in that the loss L_sup of the labeled data set part is expressed as:
L_sup = L_pixel + L_perception
L_pixel = ||y_pred − y||₁
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
CN202210208825.7A 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label Pending CN114549387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208825.7A CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208825.7A CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Publications (1)

Publication Number Publication Date
CN114549387A true CN114549387A (en) 2022-05-27

Family

ID=81661283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208825.7A Pending CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Country Status (1)

Country Link
CN (1) CN114549387A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361548A (en) * 2021-07-05 2021-09-07 北京理工导航控制科技股份有限公司 Local feature description and matching method for highlight image
CN113361548B (en) * 2021-07-05 2023-11-14 北京理工导航控制科技股份有限公司 Local feature description and matching method for highlight image
CN115131252A (en) * 2022-09-01 2022-09-30 杭州电子科技大学 Metal object surface highlight removal method based on secondary coding and decoding structure
CN115131252B (en) * 2022-09-01 2022-11-29 杭州电子科技大学 Metal object surface highlight removal method based on secondary coding and decoding structure

Similar Documents

Publication Publication Date Title
Zheng et al. Ultra-high-definition image dehazing via multi-guided bilateral learning
EP3678059B1 (en) Image processing method, image processing apparatus, and a neural network training method
Ghiasi et al. Exploring the structure of a real-time, arbitrary neural artistic stylization network
CN111784602B (en) Method for generating countermeasure network for image restoration
CN111583135B (en) Nuclear prediction neural network Monte Carlo rendering image denoising method
CN112465718B (en) Two-stage image restoration method based on generation of countermeasure network
Messaoud et al. Structural consistency and controllability for diverse colorization
CN114549387A (en) Face image highlight removal method based on pseudo label
CN111832570A (en) Image semantic segmentation model training method and system
CN111783658B (en) Two-stage expression animation generation method based on dual-generation reactance network
CN109993820B (en) Automatic animation video generation method and device
CN113205449A (en) Expression migration model training method and device and expression migration method and device
CN114820341A (en) Image blind denoising method and system based on enhanced transform
CN115393231B (en) Defect image generation method and device, electronic equipment and storage medium
Wu et al. FW-GAN: Underwater image enhancement using generative adversarial network with multi-scale fusion
Chen et al. Domain adaptation for underwater image enhancement via content and style separation
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
Salmona et al. Deoldify: A review and implementation of an automatic colorization method
CN112270692A (en) Monocular video structure and motion prediction self-supervision method based on super-resolution
CN113096001A (en) Image processing method, electronic device and readable storage medium
Kim et al. A multi-purpose convolutional neural network for simultaneous super-resolution and high dynamic range image reconstruction
CN114694081A (en) Video sample generation method based on multivariate attribute synthesis
Jiang et al. Real noise image adjustment networks for saliency-aware stylistic color retouch
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
Liu et al. Sketch to portrait generation with generative adversarial networks and edge constraint

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination