CN114549387A - Face image highlight removal method based on pseudo label - Google Patents

Face image highlight removal method based on pseudo label

Info

Publication number
CN114549387A
CN114549387A
Authority
CN
China
Prior art keywords: convolution, data set, module, highlight, output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208825.7A
Other languages
Chinese (zh)
Inventor
黄颖
王泽荃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210208825.7A priority Critical patent/CN114549387A/en
Publication of CN114549387A publication Critical patent/CN114549387A/en
Pending legal-status Critical Current

Classifications

    • G06T 5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06N 3/045 — Neural networks; Architecture; Combinations of networks
    • G06N 3/08 — Neural networks; Learning methods
    • G06T 2207/20081 — Special algorithmic details; Training; Learning
    • G06T 2207/20084 — Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30201 — Subject of image; Human being; Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the fields of image processing, computer vision and deep learning, and in particular to a pseudo-label-based method for removing highlight from face images. The method obtains a synthetic face data set through a rendering engine and, together with real highlight face images, forms a labeled data set and an unlabeled data set; trains a convolutional neural network with the labeled data set; improves the generalization ability of the network with a pseudo-label method; and inputs a face image with highlight into the convolutional neural network to obtain a highlight-removed picture. The highlight-removed face image obtained by the method conforms to natural face appearance, and facial texture details are not damaged.

Description

Face image highlight removal method based on pseudo label
Technical Field
The invention relates to the field of image processing, computer vision and deep learning, in particular to a face image highlight removal method based on a pseudo label.
Background
Specular reflection on a human face produces facial highlight, which in most cases noticeably degrades image quality and reduces the aesthetic quality of the face. Removing highlight from face images is therefore crucial in fields such as film and television production, virtual reality, and face relighting. Existing face-image highlight removal algorithms mainly comprise the following categories:
1) a numerical optimization method based on a priori assumptions. This type of algorithm relies on making assumptions about the physical characteristics of the highlight of the face, and manually designing prior assumptions and constraints based thereon. The aim of separating highlight components and non-highlight components of the human face is achieved by minimizing an optimization function.
2) An image processing method based on image chromaticity space. In the method, different pixels are divided into highlight pixels and non-highlight pixels in a chromaticity space by analyzing the chromaticity space of the image, and the highlight pixels are converted into the non-highlight pixels through image processing calculation.
3) Data-driven deep learning methods. With the help of deep learning technology, such algorithms take a highlight image as the model input and a highlight-free image as the model output, and the model learns the ability to remove facial highlight from large-scale data.
With the continuous development of deep learning technology, data-driven deep learning methods have become the mainstream direction for face highlight removal. Because it is difficult to obtain labeled face highlight data, a composite data set usually has to be produced with a rendering engine. However, synthesized face images still differ from real face images, and a de-highlighting neural network trained only on synthetic faces, when applied to real faces, can cause problems such as color distortion and over-smoothed results.
Disclosure of Invention
In order to overcome the defect of a deep learning face highlight removing algorithm, the invention provides a face image highlight removing method based on a pseudo label, which specifically comprises the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
s4, inputting the face image with highlight into the convolutional neural network which completes training to obtain a highlight-removed picture;
when the convolutional neural network is trained, the sum of the loss between the labeled data and their labels and the loss between the unlabeled data and their pseudo labels is used as the loss function of the convolutional neural network, and the network is trained by back propagation.
Further, the method for acquiring the tagged data set and the untagged data set comprises the following steps:
selecting a plurality of face 3D models, loading them into a physics-based rendering engine, selecting a plurality of HDR environment lights, and rendering the face models; rendering the diffuse reflection part D and the specular reflection part S of a face separately according to the Phong illumination model, and expressing the rendered face image with highlight as I = D + S; the obtained highlight face image and its original image form a highlight/highlight-removed synthetic face image pair, and these pairs form the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
Further, the convolutional neural network comprises an encoder and a decoder, the image input into the convolutional neural network is used as the input of the encoder, the encoder comprises 5 cascaded convolutional modules, and the output of each convolutional module is input into the next convolutional module after being subjected to maximum pooling;
taking the output of the encoder as the input of the decoder, wherein the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer; the first deconvolution module performs deconvolution on the input of the decoder, and its output is recorded as d1; the first attention module fuses d1 with the output of the penultimate convolution module of the encoder, and its output is recorded as x1;
d1 and x1 are spliced together and input into the first convolution module for convolution; the output of the first convolution module is used as the input of the second deconvolution module for deconvolution, obtaining an output result d2; the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, and its output is recorded as x2;
d2 and x2 are spliced together and input into the second convolution module for convolution; the output of the second convolution module is used as the input of the third deconvolution module for deconvolution, obtaining an output result d3; the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, and its output is recorded as x3;
d3 and x3 are spliced together and input into the third convolution module for convolution; the output of the third convolution module is used as the input of the fourth deconvolution module for deconvolution, obtaining an output result d4; the fourth attention module fuses d4 with the output of the fifth-from-last convolution module of the encoder, and its output is recorded as x4;
d4 and x4 are spliced together and input into the fourth convolution module for convolution; the output of the fourth convolution module is used as the input of the convolution layer of the decoder for convolution, obtaining the output result of the decoder.
Furthermore, the convolution module is composed of two cascaded convolution layers; each convolution layer sequentially performs a convolution operation whose number of convolution kernels equals the number of output channels, with a 3 × 3 convolution window and a step size of 1 × 1, a normalization operation, and activation with the ReLU function;
the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part and fuses them: the main part and the secondary part each pass, in sequence, through a convolution operation with a 3 × 3 window and a 1 × 1 step and a normalization operation, giving a main component X and a secondary component S; X and S are added and then sequentially pass through a convolution operation with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the sigmoid function, giving a fusion result XS; the fusion result XS is multiplied by the main component X to form the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer; the up-sampling layer up-samples the input image to twice its original size, after which the convolution layer sequentially performs a convolution operation whose number of kernels equals the number of output channels, with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the ReLU function.
Further, the convolution kernel size for maximum pooling in the encoder is 2 with a step size of 2, and the numbers of convolution kernels of the five convolution modules of the encoder are 64, 128, 256, 512 and 1024 in sequence; the number of convolution kernels of each attention module in the decoder is the same as that of the corresponding deconvolution module, the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, the number of convolution kernels of the convolution layer in the decoder is 3, the convolution window size is 1 × 1, and the step size is 1 × 1.
Further, the method for using the pseudo label to improve the generalization capability of the convolutional neural network comprises the following steps:
s31, generating pseudo labels for the unlabeled data sets through a Gaussian process;
s32, calculating the error of the unlabeled data set according to the generated pseudo label;
s33, calculating a loss function of the convolutional neural network according to the error of the unlabeled data set and the error of the labeled data set, and training the convolutional neural network through back propagation, wherein the loss function of the convolutional neural network is expressed as:
L_total = L_sup + λ_unsup · L_unsup
wherein L_total is the total loss of the convolutional neural network; L_sup is the loss of the labeled data set part; L_unsup is the loss of the unlabeled data set part; λ_unsup is the weight of the unlabeled data set loss.
Further, for the unlabeled dataset, the process of generating the pseudo label by the gaussian process comprises the following steps:
when the labeled data set is used for training, the feature vectors z_l obtained from the last convolution module in the encoder are recorded into a matrix Z_L; sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors is learned;
when the unlabeled data set is fed into the neural network, the corresponding feature vector z_u is obtained from the last convolution module, and z_u is projected onto the learned feature vector space F;
in the case where the tagged data set and the tagged data set feature vector are known, the distribution of the unlabeled data set feature vector is equivalent to a gaussian distribution, and the mean in the equivalent gaussian distribution is taken as the pseudo-tag of the unlabeled data set feature vector.
Further, the distribution of the feature vectors of the unlabeled dataset is equivalent to a gaussian distribution, and the mean of the equivalent gaussian distribution is expressed as:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
the variance of the equivalent Gaussian distribution is expressed as:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
wherein Z_L and Z_U are the feature vector matrices of the labeled data set and of the unlabeled data set respectively; K(X, Y) is a kernel function expressed as K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|), where ⟨X, Y⟩ represents the inner product of vector X and vector Y and |X| represents the modular length of vector X; σ has a value of 1; and I is an identity matrix.
Further, the loss function of the unlabeled dataset is represented as:
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
wherein z_u is the feature vector output by the last convolution module of the encoder; ẑ_u is the pseudo label obtained by the Gaussian process, whose value is the mean μ_u of the equivalent Gaussian distribution; σ_u² is the variance of the equivalent Gaussian distribution; ||·||₂ is the L2 norm; λ₁ is the sparse coefficient; and α is the sparse vector.
Further, the loss L_sup of the labeled data set part is expressed as:
L_sup = L_pixel + L_perception
L_pixel = ||y_pred − y||₁
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
According to the pseudo-label-based face highlight removal method, pseudo labels for unannotated face pictures are produced by jointly modeling labeled and unlabeled data with a Gaussian process; these pseudo labels enhance the generalization ability of the neural network on real face images and improve the highlight removal effect on real faces.
Drawings
FIG. 1 is a simplified flowchart of an embodiment of the pseudo-label-based face image highlight removal algorithm of the present invention;
FIG. 2 is a schematic diagram of the structure of the highlight removal network of the present invention;
FIG. 3 is a schematic diagram of the present invention for training highlight removal networks;
FIG. 4 shows test results of the trained neural network on the CelebA data set (the first row shows the original images; the second row shows the corresponding highlight-removed results).
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a face image highlight removal method based on a pseudo label, which specifically comprises the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
and S4, inputting the face image with highlight into the convolutional neural network which is trained, and obtaining the picture without highlight.
In this embodiment, a flow of a face image highlight removal method based on a pseudo label is as shown in fig. 1 to 4, and specifically includes the following steps:
S10, a synthetic face data set (containing both highlight faces and highlight-free diffuse faces) is rendered with a rendering engine and, together with real-world highlight images, forms a labeled data set and an unlabeled data set to train the neural network.
S101, selecting a plurality of face 3D models (obj files) and loading them into a physics-based rendering engine, selecting a plurality of HDR (high dynamic range) environment lights, and rendering the face models. Rendering according to the Phong illumination model yields the diffuse reflection part D and the specular reflection part S of the face separately, and according to the formula:
I=D+S
an image I is obtained, where S is regarded as the highlight of the face, I is the face image with highlight, and D is the face image without highlight. Repeating the above operations generates a large number of highlight/highlight-removed synthetic face image pairs as the labeled data set.
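For illustration only, the compositing step in S101 above can be sketched as follows; the file paths, the use of imageio, and the clamping to the valid range are assumptions for the sketch and are not prescribed by the invention.

```python
# Minimal sketch: assemble one labeled training pair (I, D) from rendered components.
# Assumes the renderer exported the diffuse image D and the specular image S as 8-bit files;
# paths and naming are illustrative only.
import numpy as np
import imageio.v2 as imageio

def make_labeled_pair(diffuse_path, specular_path):
    D = imageio.imread(diffuse_path).astype(np.float32) / 255.0   # highlight-free face (label)
    S = imageio.imread(specular_path).astype(np.float32) / 255.0  # specular highlight component
    I = np.clip(D + S, 0.0, 1.0)                                  # I = D + S, face image with highlight
    return I, D                                                   # (network input, label)
```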
S102, manually selecting a plurality of highlight face pictures from the real face data set, and taking the pictures as a non-label data set.
S20, the convolutional neural network is first trained with the labeled data set. The method specifically comprises the following steps:
s201, the structure of the convolutional neural network.
The convolutional neural network can be specifically divided into an encoder (encoding) and a decoder (decoding), and specifically includes:
1) Encoder: the encoder comprises 5 successive convolution modules conv_block. The first convolution module takes the image as input, and each of the last four convolution modules takes the (max-pooled) output of the previous convolution module as input; max pooling is performed after each of the first 4 convolution modules, with a pooling kernel size of 2 and a stride of 2. The numbers of convolution kernels of the five convolution modules are [64, 128, 256, 512, 1024].
2) Decoder: the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer. The first deconvolution module of the decoder accepts as input the output of the last convolution module of the encoder, producing output d1; the first attention module accepts d1 and the output e4 of the penultimate convolution module of the encoder as inputs, producing output x1; d1 is concatenated with x1 and fed into the first convolution module. The output of that convolution module is fed into the second deconvolution module to obtain an output d2. The second attention module accepts d2 and the output e3 of the third-from-last convolution module of the encoder as inputs, producing an output x2; d2 is concatenated with x2 and fed into the second convolution module. And so on; finally, the output of the last convolution module is fed into the convolution layer. The numbers of convolution kernels of the 4 deconvolution modules are [512, 256, 128, 64] respectively. The number of convolution kernels of each attention module is the same as that of the corresponding deconvolution module. The number of kernels of the final convolution layer is 3, its convolution window size is 1 × 1, and its step size is 1 × 1. The overall structure of the convolutional neural network is shown in Fig. 2. The convolution module structure in the decoder is consistent with that in the encoder; the numbers of convolution kernels of the decoder, from input to output, are [512, 256, 128, 64, 3]; the other convolution parameters are consistent with the encoder and are not described again here.
The convolution module, attention module, deconvolution module and convolution layer in the convolutional neural network are specified as follows (an illustrative implementation sketch of these blocks is given after this list):
1) Convolution module conv _ block: the convolution module is composed of two convolution layers. The two convolution layers are identical except for the inputs. The number of convolution layer convolution kernels is the number of output channels ch, the size of a convolution window is 3 x3, the step size is 1 x1, then normalization is carried out, and the activation function is a relu function. The convolution structures adopted in the encoder and the decoder are the same, and only the convolution kernels of the convolution layers are different in number, wherein the convolution kernels of 5 convolution modules in the encoder are 64,128,256,512 and 1024 in sequence, and the convolution kernels of 4 convolution modules in the decoder are 512,256,128 and 64 in sequence.
2) Attention module attu_block: the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part. Each part undergoes a convolution operation with a 3 x 3 window and a 1 x 1 step followed by normalization, giving a main component X and a secondary component S; X and S are added to obtain XS, which then undergoes a convolution with a 3 x 3 window and a 1 x 1 step, normalization, and a sigmoid activation; multiplying X by XS yields the output of the attention module.
3) Deconvolution module up_block: the deconvolution module first up-samples the input image to twice its original size. A convolution layer is then applied, whose number of convolution kernels is the number of output channels ch, with a 3 x 3 convolution window and a 1 x 1 step, followed by normalization and a ReLU activation; the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively.
4) Convolution layer: the output of the last convolution module of the decoder is input into a convolution layer for convolution; the number of convolution kernels of this convolution layer is the number of channels of the input image, the convolution window size is 3 × 3, and the step size is 1 × 1.
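The following PyTorch sketch illustrates the three building blocks described in this list under the stated hyper-parameters (3 x 3 convolutions with step 1, a normalization layer, ReLU/sigmoid activations, 2x up-sampling). Batch normalization and nearest-neighbour up-sampling are assumptions, since the description only says "normalization" and "up-sampling"; the sketch is illustrative and not the patented implementation itself.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """conv_block: two cascaded 3x3 conv + normalization + ReLU layers."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),          # normalization variant assumed
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

class AttentionBlock(nn.Module):
    """attu_block: fuse a decoder feature (main part) with an encoder skip feature (secondary part)."""
    def __init__(self, ch):
        super().__init__()
        self.main_conv = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))
        self.skip_conv = nn.Sequential(nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.BatchNorm2d(ch))
        self.fuse = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(ch),
            nn.Sigmoid(),
        )

    def forward(self, main, skip):
        X = self.main_conv(main)    # main part from the decoder
        S = self.skip_conv(skip)    # secondary part from the encoder
        XS = self.fuse(X + S)       # fused attention map
        return X * XS               # output of the attention module

class UpBlock(nn.Module):
    """up_block: 2x up-sampling followed by 3x3 conv + normalization + ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(self.up(x))
```

For example, the first decoder stage would chain UpBlock(1024, 512) on the encoder output, AttentionBlock(512) on (d1, e4), and ConvBlock(1024, 512) on their concatenation, matching the kernel counts given above.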
S202: for tagged datasets (composite images), the error between the prediction and the tag is calculated. The pixel error between the label and the prediction result is expressed as:
L_pixel = ||y_pred − y||₁
meanwhile, the perceptual error between the label and the prediction result is calculated and expressed as:
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight that those skilled in the art can adjust according to actual needs; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
The sum of these two parts forms the supervised loss function of the neural network:
L_sup = L_pixel + L_perception
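As an illustration, the supervised loss can be written in PyTorch as below; the choice of VGG16 feature layer, the ImageNet weights, and the default value of λ₂ are assumptions (the invention leaves λ₂ to be adjusted as needed), and a recent torchvision is assumed for the weights argument.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SupervisedLoss(nn.Module):
    """L_sup = L_pixel + L_perception as described above (sketch)."""
    def __init__(self, lambda2=0.01, feature_layer=16):
        super().__init__()
        # Frozen VGG16 feature extractor; the cut-off layer is an illustrative choice.
        self.vgg = vgg16(weights='IMAGENET1K_V1').features[:feature_layer].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.lambda2 = lambda2

    def forward(self, y_pred, y):
        l_pixel = torch.mean(torch.abs(y_pred - y))                    # L1 distance between images
        l_percep = torch.mean((self.vgg(y_pred) - self.vgg(y)) ** 2)   # squared L2 on VGG features (averaged for scale)
        return l_pixel + self.lambda2 * l_percep
```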
and S30, using a pseudo label method to improve the generalization ability of the neural network. The method specifically comprises the following steps:
s301, generating a pseudo label through a Gaussian process for a label-free data set (a real image), specifically:
When the labeled data set is used for training, the feature vector z_l^(i) obtained from the last convolution module in the encoder is recorded into a matrix Z_L = [z_l^(1), ..., z_l^(N_l)], wherein N_l represents the number of annotated images and i represents the index of the image. Assuming that z_l^(i) is a 1 × M feature vector, Z_L is an N_l × M matrix.
According to sparse representation theory, a sample set X = {x_1, ..., x_n} may be represented by a linear combination of a set of basis vectors Φ = {φ_1, ..., φ_k}:
x = Σ_i α_i φ_i
Thus, if an over-complete set of basis vectors D = {φ_1, ..., φ_k} is learned from the sample set X = {x_1, ..., x_n}, X can be decomposed as
X = Dα
where X is the sample set, D is the basis vector group (dictionary), and α is the coefficient vector; this kind of decomposition is sparse coding. The aim of sparse coding is that the recombined decomposition result is as close as possible to the original sample set while α is as sparse as possible, namely:
min ||α||₀  s.t.  Dα = X
Preferably, sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors can be obtained through learning:
Z_L = F α_l
wherein α_l is the coefficient vector corresponding to the labeled data set.
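A simple way to realize this dictionary learning step is to alternate a few ISTA iterations for the sparse codes with a least-squares dictionary update, as sketched below; the solver, the ℓ1 relaxation of the ℓ0 objective, the number of atoms, and the row-wise dictionary convention (Z ≈ A·F, the transpose of the X = Dα form) are assumptions, since the invention does not prescribe how the sparse coding is solved.

```python
import torch

def ista_codes(Z, F, lam=0.1, steps=50):
    """Sparse codes A such that Z ≈ A @ F, via ISTA on 0.5*||Z - A F||^2 + lam*||A||_1."""
    L = torch.linalg.matrix_norm(F, ord=2) ** 2 + 1e-8        # Lipschitz constant of the smooth part
    A = torch.zeros(Z.shape[0], F.shape[0])
    for _ in range(steps):
        grad = (A @ F - Z) @ F.T                              # gradient of the reconstruction term
        A = torch.nn.functional.softshrink(A - grad / L, lambd=float(lam / L))
    return A

def learn_dictionary(Z_L, n_atoms=256, lam=0.1, iters=10):
    """Learn a dictionary F (atoms as rows) for the labeled feature matrix Z_L (N_l x M)."""
    F = torch.randn(n_atoms, Z_L.shape[1])
    F = F / F.norm(dim=1, keepdim=True)
    for _ in range(iters):
        A = ista_codes(Z_L, F, lam)                           # sparse coefficients alpha_l
        F = torch.linalg.lstsq(A, Z_L).solution               # least-squares dictionary update
        F = F / (F.norm(dim=1, keepdim=True) + 1e-8)          # renormalize the atoms
    return F, A
```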
Meanwhile, when the unlabeled data set is fed into the neural network, the last convolution block also yields a corresponding feature vector z_u. The method assumes that the feature vectors z_u of the unlabeled data set and the feature vectors z_l of the labeled data set belong to the same vector space; therefore, z_u and z_l can share one dictionary F. Thus, when training with the unlabeled data set, the feature vector z_u obtained from the last convolution module in the encoder can be projected onto the learned feature vector space F. Preferably, the annotated data and the unlabeled data can be jointly modeled using a Gaussian Process (GP).
The core of a Gaussian process is modeling a function with an infinite-dimensional multivariate Gaussian distribution. A Gaussian process is determined by a mean function and a covariance function:
m(v) = E[f(v)]
K(v, v') = E[(f(v) − m(v))(f(v') − m(v'))]
where v and v' are random variables, f is the function modeled by the Gaussian process, E denotes expectation, m(v) is the mean function, and K is the kernel (covariance) function. The Gaussian process can be defined as:
f(v) ~ GP(m(v), K(v, v'))
For a set of random variables V = [v_1, v_2, ..., v_n], the result of the Gaussian process conforms to a multidimensional Gaussian distribution, namely:
f(V) ~ N(m(V), K(V, V))
Thus, the joint distribution of the feature vectors of the labeled and unlabeled data sets can be modeled as a multivariate Gaussian:
[z_L; z_U] ~ N([μ_L; μ_U], [[K(Z_L, Z_L), K(Z_L, Z_U)], [K(Z_U, Z_L), K(Z_U, Z_U)]])
wherein z_L is the Gaussian process of the feature vectors of the labeled data set, z_U is the Gaussian process of the feature vectors of the unlabeled data set, μ_L is the mean of the feature vectors of the labeled data set, and μ_U is the mean of the feature vectors of the unlabeled data set. With the Gaussian process, the distribution of the unlabeled data set feature vectors can be calculated given the labeled data set and the labeled data set feature vectors. The last feature vector space is modeled using the Gaussian Process (GP); the distribution of an unlabeled sample feature vector z_u is then equivalent to a Gaussian distribution:
z_u | Z_L ~ N(μ_u, Σ_u)
where Z_L is the labeled data and N(μ_u, Σ_u) is a multivariate Gaussian distribution, wherein μ_u is the mean of the Gaussian process:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
and Σ_u is the variance of the Gaussian process:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
σ is set to 1. K(X, Y) is a kernel function defined as:
K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|)
wherein ⟨X, Y⟩ represents the inner product of vector X and vector Y, and |X| represents the modular length of vector X; I is an identity matrix.
Preferably, the mean μ_u of the Gaussian process is taken as the pseudo-label feature vector ẑ_u.
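Under the formulas above, the pseudo-label computation can be sketched as follows; the normalized inner-product kernel, σ = 1, and the use of the labeled feature matrix itself as the regression target follow the reading adopted here and should be treated as assumptions rather than the definitive formulation.

```python
import torch

def cosine_kernel(X, Y):
    """K(X, Y) = <X, Y> / (|X| |Y|), applied row-wise to feature matrices."""
    Xn = X / (X.norm(dim=1, keepdim=True) + 1e-8)
    Yn = Y / (Y.norm(dim=1, keepdim=True) + 1e-8)
    return Xn @ Yn.T

def gp_pseudo_labels(Z_L, Z_U, sigma=1.0):
    """Posterior mean (pseudo label) and covariance of unlabeled features given labeled ones."""
    K_LL = cosine_kernel(Z_L, Z_L)
    K_UL = cosine_kernel(Z_U, Z_L)
    K_UU = cosine_kernel(Z_U, Z_U)
    A = K_LL + (sigma ** 2) * torch.eye(Z_L.shape[0])
    # Solve linear systems instead of forming an explicit inverse.
    mean = K_UL @ torch.linalg.solve(A, Z_L)            # pseudo labels: mean of the Gaussian
    cov = K_UU - K_UL @ torch.linalg.solve(A, K_UL.T)   # posterior covariance
    return mean, cov
```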
S302, calculating the error of the unlabeled data, specifically:
for the unlabeled data, the formula
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
is adopted to calculate the error between the feature vector obtained by the encoder and the feature vector of the pseudo label, wherein z_u is the feature vector predicted by the last convolution block, ẑ_u is the pseudo label obtained by the Gaussian process, and σ_u² is the variance of the corresponding Gaussian distribution. The sparse term requires that, after sparse representation, the decomposition result is as close as possible to the feature vectors of the labeled data set while the sparse vector α is as sparse as possible; λ₁ is the sparse coefficient, which can be adjusted by the user.
S303, synthesizing the loss functions of the labeled data set and the unlabeled data set, and further training the network.
The loss function of the marked data and the unmarked data is synthesized, and the loss function provided by the invention is as follows:
L_total = L_sup + λ_unsup · L_unsup
wherein λ_unsup is the weight of the unlabeled data set loss; this parameter can be adjusted by the user.
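Putting the supervised and unsupervised parts together, one training step can be sketched as follows; it reuses SupervisedLoss, ista_codes and gp_pseudo_labels from the earlier sketches, and the encode() method, the labeled feature bank and all weight values are illustrative assumptions rather than details fixed by the invention (the per-term variance weighting from the L_unsup formula is omitted for brevity).

```python
import torch

def training_step(model, sup_loss, optimizer, labeled_batch, unlabeled_batch,
                  dictionary_F, labeled_feature_bank, lambda_unsup=0.1, lambda1=0.1):
    """One optimization step on L_total = L_sup + lambda_unsup * L_unsup (sketch)."""
    x_l, y = labeled_batch            # synthetic highlight image and its highlight-free label
    x_u = unlabeled_batch             # real highlight face image without a label

    y_pred = model(x_l)
    L_sup = sup_loss(y_pred, y)       # L_pixel + L_perception

    z_u = model.encode(x_u)           # assumed hook exposing last-encoder-block features
    with torch.no_grad():
        z_hat, _ = gp_pseudo_labels(labeled_feature_bank, z_u)   # pseudo labels from the GP
        alpha = ista_codes(z_u, dictionary_F)                    # sparse codes of the unlabeled features
    L_unsup = ((z_u - z_hat) ** 2).sum(dim=1).mean() + lambda1 * alpha.abs().sum(dim=1).mean()

    loss = L_sup + lambda_unsup * L_unsup
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```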
And S40, inputting the face image with highlight into the trained convolutional neural network to obtain the highlight-removing result.
In an example of the present invention, a flow chart for synthesis and training is shown in FIG. 1. The method comprises the steps that based on a human face OBJ model and HDR environment illumination, a synthetic human face data set is obtained by giving a specular reflection component, a diffuse reflection component and a combination of the specular reflection component and the diffuse reflection component and rendering by using a rendering engine; a synthesized face data set with given illumination parameters is used as a data set with a label, a real face with highlight is collected as a data set without the label, the two data sets are input into a neural network model, and a 'pseudo label' of the real face is extracted through the synthesized face data.
When the labeled data is trained, the feature vectors of the labeled images at the last layer are extracted; on this basis, a dictionary describing the labeled feature vectors is learned by sparse coding, and the feature vectors are encoded into sparse coefficients with this dictionary. When an unlabeled image is input, a Gaussian process is computed from the sparse representation and the unlabeled image's feature vector at the last layer, and the mean of the Gaussian process is used as the 'pseudo label' of the unlabeled image. The neural network model learns the ability to remove highlight by simultaneously reducing the distance between the synthetic image and its label and between the real image and its 'pseudo label'. The neural network ψ is obtained after training is finished; applying ψ to an arbitrarily input face image yields the result after highlight removal.
1) Training data synthesis
A number of 3D face models are taken (they can be automatically generated with a three-dimensional face modeling algorithm), covering different genders, face shapes, builds and other categories. The models are put into a physically based renderer (such as Mitsuba), and different HDR environment illuminations are selected to render them. The parameters of the renderer's illumination model are set to obtain the specular reflection component S and the diffuse reflection component D of the face, and S and D are added by image addition to obtain the face image I with highlight. The data set is constructed using D and I, with D as the label of I. This process is repeated to obtain a sufficiently large labeled data set.
2) Neural network training
Besides the labeled data set, a number of real highlight face images are selected (manually) as the unlabeled data set. Whether labeled or unlabeled, each image can be regarded as the sum of a diffuse reflection image and a specular reflection image:
I = D + S;
With the neural network ψ shown in Fig. 2, for an input highlight image x it is desired to obtain the de-highlighting result
ŷ = ψ(x)
The network is trained according to the network training architecture shown in fig. 3. The network training is divided into a supervised part and an unsupervised part, wherein the supervised part comprises:
For the labeled data set (composite images),
L_pixel = ||y_pred − y||₁
is used to compute the pixel error between the label and the prediction, where y_pred is the prediction result of the neural network, y is the real label, and ||·||₁ is the L1 distance between images.
Meanwhile, the following steps are adopted:
Figure BDA0003530198900000133
a perceptual error between the label and the prediction is calculated.
Supervised losses include:
L_sup = L_pixel + L_perception
the unsupervised part comprises:
For the unlabeled data,
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
is adopted to calculate the error between the feature vector obtained by the encoder and the feature vector of the pseudo label.
The loss function of the neural network proposed by the present invention is:
L_total = L_sup + λ_unsup · L_unsup
the optimization goal of the neural network is this loss function. The overall training process of the neural network is shown in fig. 3.
The training configuration of this part is as follows: 1000 epochs, batch size 4, Adam optimizer. The method of the invention was tested as follows: the experimental platform is a PC, the GPU is an NVIDIA GeForce GTX 1080Ti with 12 GB of video memory, and the software configuration comprises Ubuntu 18.04, CUDA 11.3, Python 3.8.0 and the PyTorch framework. Tests show that the method provided by the invention can effectively remove highlight.
3) De-highlighting network applications
The face image with highlight is input into the trained neural network ψ to obtain the face with highlight removed. As shown in Fig. 4, the first row shows face images with highlight and the second row shows the images after highlight removal by the method; it can be seen that the method effectively removes highlight while keeping most details of the image.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A face image highlight removal method based on a pseudo label is characterized by specifically comprising the following steps:
s1, acquiring a synthesized face data set through a rendering engine, and forming a tagged data set and a non-tagged data set with the real face image with highlight;
s2, training the convolutional neural network by using the labeled data set;
s3, acquiring a pseudo label of data in the unlabeled data set, and training the convolutional neural network by using the data set with the pseudo label;
and S4, inputting the face image with highlight into the convolutional neural network which is trained, and obtaining the picture without highlight.
2. The method for removing highlight from a face image based on a pseudo label according to claim 1, characterized in that the method for acquiring the labeled data set and the unlabeled data set comprises the following steps:
selecting a plurality of face 3D models, loading them into a physics-based rendering engine, selecting a plurality of HDR environment lights, and rendering the face models; rendering the diffuse reflection part D and the specular reflection part S of a face separately according to the Phong illumination model, and expressing the rendered face image with highlight as I = D + S; the obtained highlight face image and its original image form a highlight/highlight-removed synthetic face image pair, and these pairs form the labeled data set;
and selecting a plurality of highlight face pictures from the face data set as a label-free data set.
3. The method for removing highlight from human face image based on pseudo label according to claim 1, wherein the convolutional neural network comprises an encoder and a decoder, the image inputted into the convolutional neural network is used as the input of the encoder, the encoder comprises 5 cascaded convolutional modules, and the output of each convolutional module is inputted into the next convolutional module after being maximally pooled;
taking the output of the encoder as the input of the decoder, wherein the decoder comprises 4 attention modules, 4 convolution modules, 4 deconvolution modules and a convolution layer; the first deconvolution module performs deconvolution on the input of the decoder, and its output is recorded as d1; the first attention module fuses d1 with the output of the penultimate convolution module of the encoder, and its output is recorded as x1;
d1 and x1 are spliced together and input into the first convolution module for convolution; the output of the first convolution module is used as the input of the second deconvolution module for deconvolution, obtaining an output result d2; the second attention module fuses d2 with the output of the third-from-last convolution module of the encoder, and its output is recorded as x2;
d2 and x2 are spliced together and input into the second convolution module for convolution; the output of the second convolution module is used as the input of the third deconvolution module for deconvolution, obtaining an output result d3; the third attention module fuses d3 with the output of the fourth-from-last convolution module of the encoder, and its output is recorded as x3;
d3 and x3 are spliced together and input into the third convolution module for convolution; the output of the third convolution module is used as the input of the fourth deconvolution module for deconvolution, obtaining an output result d4; the fourth attention module fuses d4 with the output of the fifth-from-last convolution module of the encoder, and its output is recorded as x4;
d4 and x4 are spliced together and input into the fourth convolution module for convolution; the output of the fourth convolution module is used as the input of the convolution layer of the decoder for convolution, obtaining the output result of the decoder.
4. The method for removing highlight from a face image based on a pseudo label according to claim 3, characterized in that the convolution module is composed of two cascaded convolution layers; each convolution layer sequentially performs a convolution operation whose number of convolution kernels equals the number of output channels, with a 3 × 3 convolution window and a step size of 1 × 1, a normalization operation, and activation with the ReLU function;
the attention module takes the input from the decoder as the main part and the input from the encoder as the secondary part and fuses them: the main part and the secondary part each pass, in sequence, through a convolution operation with a 3 × 3 window and a 1 × 1 step and a normalization operation, giving a main component X and a secondary component S; X and S are added and then sequentially pass through a convolution operation with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the sigmoid function, giving a fusion result XS; the fusion result XS is multiplied by the main component X to form the output of the attention module;
the deconvolution module comprises an up-sampling layer and a convolution layer; the up-sampling layer up-samples the input image to twice its original size, after which the convolution layer sequentially performs a convolution operation whose number of kernels equals the number of output channels, with a 3 × 3 window and a 1 × 1 step, a normalization operation, and activation with the ReLU function.
5. The method for removing highlight from a face image based on a pseudo label according to claim 3, characterized in that the convolution kernel size for maximum pooling in the encoder is 2 with a step size of 2, and the numbers of convolution kernels of the five convolution modules of the encoder are 64, 128, 256, 512 and 1024 in sequence; the number of convolution kernels of each attention module in the decoder is the same as that of the corresponding deconvolution module, the numbers of convolution kernels of the 4 deconvolution modules are 512, 256, 128 and 64 respectively, the number of convolution kernels of the convolution layer in the decoder is 3, the convolution window size is 1 × 1, and the step size is 1 × 1.
6. The method for removing highlight from human face image based on pseudo label according to claim 2, wherein the method of using pseudo label to improve the generalization ability of convolutional neural network comprises the following steps:
s31, generating pseudo labels for the unlabeled data sets through a Gaussian process;
s32, calculating the error of the unlabeled data set according to the generated pseudo label;
s33, calculating a loss function of the convolutional neural network according to the error of the unlabeled data set and the error of the labeled data set, and training the convolutional neural network through back propagation, wherein the loss function of the convolutional neural network is expressed as:
L_total = L_sup + λ_unsup · L_unsup
wherein L_total is the total loss of the convolutional neural network; L_sup is the loss of the labeled data set part; L_unsup is the loss of the unlabeled data set part; λ_unsup is the weight of the unlabeled data set loss.
7. The method for removing highlight from a face image based on a pseudo label according to claim 6, characterized in that, for the unlabeled data set, the process of generating the pseudo label through a Gaussian process comprises the following steps:
when the labeled data set is used for training, the feature vectors z_l obtained from the last convolution module in the encoder are recorded into a matrix Z_L; sparse coding is performed on Z_L, and a dictionary F of labeled data set feature vectors is learned;
when the unlabeled data set is fed into the neural network, the corresponding feature vector z_u is obtained from the last convolution module, and z_u is projected onto the learned feature vector space F;
in the case where the tagged data set and the tagged data set feature vector are known, the distribution of the unlabeled data set feature vector is equivalent to a gaussian distribution, and the mean in the equivalent gaussian distribution is taken as the pseudo-tag of the unlabeled data set feature vector.
8. The method for removing highlights from a face image based on a pseudo label according to claim 7, characterized in that the distribution of the feature vectors of the unmarked data set is equivalent to a gaussian distribution, and the mean of the equivalent gaussian distribution is expressed as:
μ_u = K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · Z_L
the variance of the equivalent Gaussian distribution is expressed as:
Σ_u = K(Z_U, Z_U) − K(Z_U, Z_L) · [K(Z_L, Z_L) + σ²I]⁻¹ · K(Z_L, Z_U)
wherein Z_L and Z_U are the feature vector matrices of the labeled data set and of the unlabeled data set respectively; K(X, Y) is a kernel function expressed as K(X, Y) = ⟨X, Y⟩ / (|X|·|Y|), where ⟨X, Y⟩ represents the inner product of vector X and vector Y and |X| represents the modular length of vector X; σ has a value of 1; and I is an identity matrix.
9. The method for removing highlights from a face image based on a pseudo label according to claim 8, wherein the loss function of the unlabeled data set is expressed as:
L_unsup = ||z_u − ẑ_u||₂² / σ_u² + λ₁ · ||α||₁
wherein z_u is the feature vector output by the last convolution module of the encoder; ẑ_u is the pseudo label obtained by the Gaussian process, whose value is the mean μ_u of the equivalent Gaussian distribution; σ_u² is the variance of the equivalent Gaussian distribution; ||·||₂ is the L2 norm; λ₁ is the sparse coefficient; and α is the sparse vector.
10. The method for removing highlight from a face image based on a pseudo label according to claim 6, characterized in that the loss L_sup of the labeled data set part is expressed as:
L_sup = L_pixel + L_perception
L_pixel = ||y_pred − y||₁
L_perception = λ₂ · ||Φ_VGG(y_pred) − Φ_VGG(y)||₂²
wherein y_pred is the prediction result of the neural network; y is the real label; ||·||₁ is the L1 distance between images; λ₂ is a weight; Φ_VGG represents a VGG16 network; Φ_VGG(y_pred) is the feature value generated by VGG16 for the prediction result of the neural network; Φ_VGG(y) is the feature value generated by VGG16 for the real label; ||·||₂² is the square of the L2 distance between the images.
CN202210208825.7A 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label Pending CN114549387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208825.7A CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208825.7A CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Publications (1)

Publication Number Publication Date
CN114549387A true CN114549387A (en) 2022-05-27

Family

ID=81661283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208825.7A Pending CN114549387A (en) 2022-03-03 2022-03-03 Face image highlight removal method based on pseudo label

Country Status (1)

Country Link
CN (1) CN114549387A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361548A (en) * 2021-07-05 2021-09-07 北京理工导航控制科技股份有限公司 Local feature description and matching method for highlight image
CN113361548B (en) * 2021-07-05 2023-11-14 北京理工导航控制科技股份有限公司 Local feature description and matching method for highlight image
CN115131252A (en) * 2022-09-01 2022-09-30 杭州电子科技大学 Metal object surface highlight removal method based on secondary coding and decoding structure
CN115131252B (en) * 2022-09-01 2022-11-29 杭州电子科技大学 Metal object surface highlight removal method based on secondary coding and decoding structure

Similar Documents

Publication Publication Date Title
Zheng et al. Ultra-high-definition image dehazing via multi-guided bilateral learning
EP3678059B1 (en) Image processing method, image processing apparatus, and a neural network training method
Ghiasi et al. Exploring the structure of a real-time, arbitrary neural artistic stylization network
CN111784602B (en) Method for generating countermeasure network for image restoration
CN111583135B (en) Nuclear prediction neural network Monte Carlo rendering image denoising method
CN112465718B (en) Two-stage image restoration method based on generation of countermeasure network
Messaoud et al. Structural consistency and controllability for diverse colorization
CN114549387A (en) Face image highlight removal method based on pseudo label
CN111832570A (en) Image semantic segmentation model training method and system
CN111783658B (en) Two-stage expression animation generation method based on dual-generation reactance network
CN109993820B (en) Automatic animation video generation method and device
CN113205449A (en) Expression migration model training method and device and expression migration method and device
CN114820341A (en) Image blind denoising method and system based on enhanced transform
CN115393231B (en) Defect image generation method and device, electronic equipment and storage medium
Wu et al. FW-GAN: Underwater image enhancement using generative adversarial network with multi-scale fusion
Chen et al. Domain adaptation for underwater image enhancement via content and style separation
CN115170915A (en) Infrared and visible light image fusion method based on end-to-end attention network
Salmona et al. Deoldify: A review and implementation of an automatic colorization method
CN112270692A (en) Monocular video structure and motion prediction self-supervision method based on super-resolution
CN113096001A (en) Image processing method, electronic device and readable storage medium
Kim et al. A multi-purpose convolutional neural network for simultaneous super-resolution and high dynamic range image reconstruction
CN114694081A (en) Video sample generation method based on multivariate attribute synthesis
Jiang et al. Real noise image adjustment networks for saliency-aware stylistic color retouch
CN117499711A (en) Training method, device, equipment and storage medium of video generation model
Liu et al. Sketch to portrait generation with generative adversarial networks and edge constraint

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination