CN113505120B - Double-stage noise cleaning method for large-scale face data set

Double-stage noise cleaning method for large-scale face data set

Info

Publication number
CN113505120B
CN113505120B (application CN202111061863.6A)
Authority
CN
China
Prior art keywords
noise
data set
samples
training
closed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111061863.6A
Other languages
Chinese (zh)
Other versions
CN113505120A (en)
Inventor
龚勋 (Gong Xun)
陈锐 (Chen Rui)
吴世杰 (Wu Shijie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202111061863.6A
Publication of CN113505120A
Application granted
Publication of CN113505120B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a two-stage noise cleaning method for large-scale face data sets. A two-stage strategy lets the network spontaneously detect closed set noise samples and open set noise samples during training; the closed set noise samples are finally reused, while the open set noise samples are deleted from the training set. After a data set is cleaned with this method, the trained model improves greatly in LFW accuracy compared with the uncleaned data set; statistics show that more than 90% of the noise is correctly identified, a very good noise recognition result. Model training results on the cleaned data set are likewise greatly superior to those on the uncleaned data set, as demonstrated on common test sets such as LFW, AgeDB and CFP-FP.

Description

Double-stage noise cleaning method for large-scale face data set
Technical Field
The invention relates to the technical field of noise cleaning of face data sets, in particular to a two-stage noise cleaning method of a large-scale face data set.
Background
Noise in a face data set refers to samples with incorrect labels introduced when the data set is collected and produced. Such noise comprises open set noise and closed set noise. Closed set noise is also called label flipping: a sample that actually belongs to class A is labeled as class B. Open set noise samples do not themselves belong to any class in the training set, yet are labeled as one of the training classes.
Noisy data is inevitable in large-scale face data sets, and popular face data sets today contain tens of millions of samples. Cleaning all the face data manually would cost enormous time and money, and even then the noise could not be completely eliminated. Noise samples can seriously damage the performance of the trained model, so an effective cleaning method with a high recognition rate for large-scale face data noise is urgently needed.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a two-stage noise cleaning method for large-scale face data sets. It adopts a two-stage strategy that lets a network spontaneously detect closed set noise samples and open set noise samples during training, and finally reuses the closed set noise samples while deleting the open set noise samples from the training set, thereby solving the problems described in the background art.
In order to achieve this purpose, the invention provides the following technical scheme: a two-stage noise cleaning method for a large-scale face data set comprises the following steps:
s1, constructing an initial face data set D1;
s2, detecting the closed set noise samples and the corresponding real categories in the data set D1, and outputting a closed set noise list file containing closed set noise;
s3, completing the reuse of closed set noise to generate a data set D2;
s4, training is continued by taking the data set D2 as input, and an open set noise list file containing open set noise is output;
and S5, deleting the open set noise in the data set D2 according to the open set noise list file from step S4, finally generating a cleaned data set (the sketch below chains these steps together).
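The following Python sketch shows how the five steps chain together. The four stage functions are passed in as callables because the patent describes their behavior but not their code; illustrative sketches of each stage appear later in this document.

```python
from typing import Callable

# Minimal end-to-end sketch of the two-stage pipeline (S1-S5).
# The stage callables are hypothetical placeholders, not patent code.
def clean_face_dataset(
    d1_root: str,
    detect_closed_set: Callable[[str], str],      # S2 -> closed-set noise list file
    reuse_closed_set: Callable[[str, str], str],  # S3 -> root of data set D2
    detect_open_set: Callable[[str], str],        # S4 -> open-set noise list file
    delete_open_set: Callable[[str, str], None],  # S5 -> deletes listed samples
) -> str:
    closed_list = detect_closed_set(d1_root)          # S2
    d2_root = reuse_closed_set(d1_root, closed_list)  # S3
    open_list = detect_open_set(d2_root)              # S4
    delete_open_set(d2_root, open_list)               # S5
    return d2_root                                    # cleaned data set
```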
Preferably, the initial face data set D1 of step S1 is a labeled data set after face detection and alignment.
Preferably, detecting the closed set noise samples and their corresponding true categories in the data set D1 in step S2 is specifically: the data set D1 is input into a ResNet50 network and trained, and closed set noise is detected using the BoundaryF1 Loss function.
Preferably, the BoundaryF1 Loss function formula is as follows:

L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}

\text{if }\max\{\cos(\theta_l+m)\mid l\neq y_i\}-\cos(\theta_{y_i})>0,\ \text{then } y_i\leftarrow l

where N denotes the number of samples in a batch, i denotes the i-th sample within the batch, y_i and j each denote the class of a sample label, with y_i corresponding to the label of the i-th sample, n denotes the total number of classes in the training set, s is a scaling factor, l ranges over all classes different from that of the i-th sample, m is a penalty term, θ_{y_i} denotes the angle between the feature vector of the i-th sample and the class-center feature vector of class y_i, and θ_j denotes the angle between the feature vector of the i-th sample and the class-center feature vector of class j, reflected as a geodesic distance on the normalized hypersphere.
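Because the published equations survive only as images, the following PyTorch sketch shows one way a BoundaryF1-style loss with the label-correction rule could look under the ArcFace-style reading above. The class name, the hyperparameter defaults (s=64, m=0.5) and the returned flip mask are illustrative assumptions, not the patent's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryF1Loss(nn.Module):
    # Sketch of an ArcFace-style margin loss plus the closed-set rule:
    # if some other class l satisfies cos(theta_l + m) > cos(theta_{y_i}),
    # flip the label to l (the sample is treated as closed-set noise).
    def __init__(self, num_classes: int, feat_dim: int,
                 s: float = 64.0, m: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, feats, labels):
        cos = F.linear(F.normalize(feats), F.normalize(self.weight))  # (B, n)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        cos_m = torch.cos(theta + self.m)                 # cos(theta_j + m)
        # closed-set detection: best non-label class with the margin applied
        tmp = cos_m.clone()
        tmp.scatter_(1, labels.view(-1, 1), float("-inf"))
        best_other, best_idx = tmp.max(dim=1)
        flip = best_other > cos.gather(1, labels.view(-1, 1)).squeeze(1)
        new_labels = torch.where(flip, best_idx, labels)  # y_i <- l
        # margin logits on the (possibly corrected) label class
        logits = cos.clone()
        logits.scatter_(1, new_labels.view(-1, 1),
                        cos_m.gather(1, new_labels.view(-1, 1)))
        return F.cross_entropy(self.s * logits, new_labels), flip
```

The flip mask can be accumulated across epochs to build the closed set noise list file described above.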
Preferably, the reuse of the closed set noise in step S3 is completed to generate the data set D2, specifically: according to the closed set noise list file output in step S2, the closed set noise samples are moved to the directories of their detected true categories in the data set D1, completing the reuse of the closed set noise and generating the data set D2; a file-moving sketch follows.
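A minimal sketch of this reuse step. It assumes each line of the list file holds a sample's absolute path followed by its detected true category, which is consistent with the embodiment's description of the list file, though the exact format is not specified in the patent text.

```python
import os
import shutil

def reuse_closed_set(d1_root: str, list_file: str) -> str:
    # S3: move each detected closed-set noise sample into the directory
    # of its detected true category (reuse, not deletion).
    with open(list_file) as f:
        for line in f:
            sample_path, true_class = line.rsplit(maxsplit=1)
            dest_dir = os.path.join(d1_root, true_class)
            os.makedirs(dest_dir, exist_ok=True)
            shutil.move(sample_path, dest_dir)
    return d1_root  # D1 with moved samples is the data set D2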
Preferably, outputting the open set noise list file containing the open set noise in step S4 is specifically: the data set D2 is input into the ResNet50 network for continued training with the NoiseDropLoss loss function. During training, the product of each sample's feature-layer vector and the weight-center-layer vector is saved, and the value corresponding to the class to which the sample belongs is taken as cos θ. A fixed-size queue keeps the newest cos θ values and discards the oldest; the final queue is saved as training progresses. This set of data is fitted by a Gaussian mixture model with the number of submodels set to 3, yielding the parameters of three Gaussian distributions. The mean of the second Gaussian distribution is used as a decision threshold: samples below the threshold are regarded as open set noise samples and are output as the open set noise list file.
Preferably, the probability function of the Gaussian mixture model is as follows:

P(x\mid\theta)=\sum_{k=1}^{K}\alpha_k\,\phi(x\mid\theta_k)

where x denotes one observation, θ denotes all parameters of the model, k indexes the k-th submodel, K denotes the number of submodels, α_k denotes the probability that the observation belongs to the k-th submodel, and φ(x|θ_k) denotes the Gaussian distribution density function of the k-th submodel.
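A minimal sketch of the fitting-and-threshold step using scikit-learn. Reading "the second Gaussian distribution" as the component with the middle of the three sorted means is an interpretation, since the patent does not define the component ordering.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def open_set_threshold(cos_values: np.ndarray) -> float:
    # Fit a 3-component GMM to the saved cos(theta) values (step S4)
    gmm = GaussianMixture(n_components=3).fit(cos_values.reshape(-1, 1))
    means = np.sort(gmm.means_.ravel())
    # "mean of the second Gaussian" read as the middle of the three
    # sorted component means -- an assumption, not spelled out in the text
    return float(means[1])
```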
The NoiseDropLoss function is given by two formulas [equation images in the original publication], where L denotes the loss function for a batch of samples and T is a modulation function that makes its decision based on d, assigning different values to different samples.
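The NoiseDropLoss formulas themselves are not recoverable from the text, so the sketch below implements one natural reading consistent with the description and the loss's name: the modulation function T drops (zero-weights) samples whose cos θ falls below the GMM threshold d and passes the rest through unchanged. This is an assumption, not the patent's verified formula.

```python
import torch
import torch.nn.functional as F

def noise_drop_loss(logits: torch.Tensor, cos_target: torch.Tensor,
                    labels: torch.Tensor, d: float) -> torch.Tensor:
    # Per-sample cross-entropy modulated by T(cos_target; d):
    # suspected open-set noise (cos_target < d) contributes zero loss.
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    t = (cos_target >= d).float()   # one plausible modulation function T
    return (t * per_sample).mean()
```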
The invention has the beneficial effects that: the invention adopts a two-stage strategy so that the network spontaneously detects closed set noise samples and open set noise samples during training, finally reuses the closed set noise samples and deletes the open set noise samples from the training set, which improves the noise recognition rate. Model training results on the cleaned data set are also greatly superior to those on the uncleaned data set, as demonstrated on common test sets such as LFW, AgeDB and CFP-FP.
Drawings
FIG. 1 is a schematic diagram of the process steps of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the method adopts a two-stage strategy: the network spontaneously detects closed set noise samples and open set noise samples during training, and finally the closed set noise samples are reused while the open set noise samples are deleted from the training set.
This embodiment performs experiments on the WebFace data set, which contains 10,572 categories and 490,623 samples. This data set, denoted D1, has already been labeled by human or machine and has undergone face detection and alignment, so it can be used directly to train a neural network model. Although the sample labels have been annotated, a face data set is huge and contains a large number of noise samples, which degrade the performance of the trained model; the data set therefore needs noise cleaning. In this embodiment, we replace some samples with closed set noise and others with open set noise, extract features using the ResNet50 model, and perform the experiments.
In the first stage, closed set noise samples in the training set and their corresponding true classes are detected. Training is carried out using ResNet50, BoundaryF1 Loss detects the closed set noise, and finally a closed set noise list file containing the closed set noise is output. The formula for BoundaryF1 Loss is as follows:
L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j=1,\,j\neq y_i}^{n}e^{s\cos\theta_j}}

\text{if }\max\{\cos(\theta_l+m)\mid l\neq y_i\}-\cos(\theta_{y_i})>0,\ \text{then } y_i\leftarrow l

where N denotes the number of samples in a batch, i denotes the i-th sample within the batch, y_i and j each denote the class of a sample label, with y_i corresponding to the label of the i-th sample, n denotes the total number of classes in the training set, s is a scaling factor, l ranges over all classes different from that of the i-th sample, m is a penalty term, θ_{y_i} denotes the angle between the feature vector of the i-th sample and the class-center feature vector of class y_i, and θ_j denotes the angle between the feature vector of the i-th sample and the class-center feature vector of class j, reflected as a geodesic distance on the normalized hypersphere.
According to the list generated in the preceding training, the closed set noise samples are moved to the corresponding category directories in the original training set. (Each category's samples live in the directory of that category; so if a closed set noise sample sits in directory 1, i.e., is labeled class 1, but is detected as actually belonging to class n, it is moved to directory n. The list file stores the absolute path of each detected closed set noise sample together with its detected category.) This completes the reuse of the closed set noise and generates the data set D2. The closed set noise is moved to the directory of its true category rather than deleted, so that it can still be used to train the model; this treatment of the closed set noise is what we call reuse.
In our experiments, the first stage achieves a recognition rate of 84% on closed set noise.
In the second stage, the data set D2 is input into the ResNet50 network for continued training with the NoiseDropLoss loss function. During training, the product of each sample's feature-layer vector and the weight-center-layer vector is saved, and the value corresponding to the class to which the sample belongs is taken as cos θ. A fixed-size queue keeps the newest cos θ values and discards the oldest. As training progresses, the maintained queue q reflects the state of the training: drawn as a histogram, it exhibits two peaks. At this point the data are fitted by a Gaussian mixture model with the number of submodels set to 3, yielding the parameters of three Gaussian distributions. The probability function of the Gaussian mixture model is as follows:
P(x\mid\theta)=\sum_{k=1}^{K}\alpha_k\,\phi(x\mid\theta_k)

where x denotes one observation, θ denotes all parameters of the model, k indexes the k-th submodel, K denotes the number of submodels, α_k denotes the probability that the observation belongs to the k-th submodel, and φ(x|θ_k) denotes the Gaussian distribution density function of the k-th submodel.
The mean of the second Gaussian distribution is taken as the decision threshold; samples below the threshold are treated as noise samples and output to the open set noise list, realizing the cleaning of the noise samples. A sketch of collecting the cos θ statistics follows.
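A minimal sketch of collecting the per-sample cos θ statistic into a fixed-size queue during training, where cos θ is the product of the normalized feature vector and the normalized weight-center vector of the sample's labeled class. The queue length is an illustrative choice, as the patent does not specify one.

```python
from collections import deque

import torch
import torch.nn.functional as F

COS_QUEUE = deque(maxlen=100_000)  # fixed size: newest kept, oldest dropped

def record_cos_theta(feats: torch.Tensor, weight: torch.Tensor,
                     labels: torch.Tensor) -> None:
    # cos(theta) between each sample's feature and the weight-center
    # vector of its labeled class
    cos = F.linear(F.normalize(feats), F.normalize(weight))    # (B, num_classes)
    cos_target = cos.gather(1, labels.view(-1, 1)).squeeze(1)  # (B,)
    COS_QUEUE.extend(cos_target.detach().cpu().tolist())
```

A histogram of COS_QUEUE then shows the two peaks described above, and the threshold d can be obtained by fitting the GMM to its contents.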
After the threshold d is obtained using the Gaussian mixture model described above, the NoiseDropLoss function is used; it is formed by combining two formulas [equation images in the original publication], where L represents the loss function for a batch of samples and T is a modulation function that makes its decision based on d, assigning different values to different samples. Statistics show that the method correctly identifies more than 90% of the open set noise and greatly improves the training effect on the data set.
After training, the noise samples are cleaned according to the list. The whole process can then be carried out again, with the decision to stop based on whether the amount of detected noise and the shape of the histogram indicate a sufficiently clean data set. In the second stage, an open set noise recognition rate above 98% can be achieved. After the data set is cleaned, the trained model improves by 1-2 percentage points on LFW compared with the uncleaned data set, and by 2-8 percentage points on other data sets such as AgeDB, CFP-FP, CALFW, CPLFW and SLLFW.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (3)

1. A two-stage noise cleaning method for a large-scale face data set, characterized by comprising the following steps:
s1, constructing an initial face data set D1;
s2, detecting the closed set noise samples and the corresponding real categories in the data set D1, and outputting a closed set noise list file containing closed set noise;
s3, completing the reuse of closed set noise to generate a data set D2;
s4, training is continued by taking the data set D2 as input, and an open set noise list file containing open set noise is output;
s5, deleting the open set noise in the data set D2 according to the open set noise list file in the step S4, and finally generating a cleaned clean data set;
detecting the closed set noise samples and the corresponding true classes in the data set D1 in the step S2 is specifically: inputting the data set D1 into a ResNet50 network for training, and detecting closed set noise by using the BoundaryF1 Loss function;
the reuse of the closed set noise in step S3 is completed, and a data set D2 is generated, specifically: according to the closed set noise list file output in the step S2, moving the closed set noise samples to the corresponding category directories in the data set D1, completing the reuse of the closed set noise and generating a data set D2;
the step S4 of outputting the open set noise list file containing the open set noise specifically includes: inputting the data set D2 into a ResNet50 network for continued training using the NoiseDropLoss loss function, saving the product of the feature layer and the weight center layer of each sample in the training process, taking the value corresponding to the class of the sample as cos θ, continuously storing the newest cos θ and discarding the oldest cos θ by using a fixed-size queue, saving the final queue as training proceeds, fitting the group of data by a Gaussian mixture model with the number of submodels set to 3 to obtain the parameters of three Gaussian distributions, taking the mean value of the second Gaussian distribution as a judgment threshold, regarding the samples below the threshold as open set noise samples, and outputting them as the open set noise list file;
the BoundarryF 1 Loss function formula is as follows:
Figure FDA0003329808140000021
where if max{cos(θl+m)for all l≠yi}-cos(θyi)>0∶yi=l;
where N represents the number of samples in a batch, i represents the ith sample in a batch, yiAnd j each represent a category of a certain sample label, and yiCorresponding to the label of the ith sample, n represents the total number of classes of the training set, s is a scaling factor, l represents all classes different from the ith sample, m is a penalty item, and thetayiThe feature vector representing the ith sample and the yiThe included angle between class center feature vectors of the classes; thetajRepresenting an included angle between the characteristic vector of the ith sample and the class center characteristic vector of the jth class, and reflecting the included angle as a geodesic distance on the normalized hypersphere;
the NoiseDropLoss function consists of two formulas [equation images in the original publication], wherein L represents the loss function of a batch of samples, and T is a modulation function that makes its decision based on d, assigning different values to different samples.
2. The method of two-stage noise cleaning of large-scale face data sets of claim 1, wherein: the initial face data set D1 of the step S1 is a data set that has been labeled and on which face detection and alignment have been performed.
3. The method of two-stage noise cleaning of large-scale face data sets of claim 1, wherein: the probability function of the Gaussian mixture model is as follows:
P(x\mid\theta)=\sum_{k=1}^{K}\alpha_k\,\phi(x\mid\theta_k)

wherein x represents one observation, θ represents all parameters of the model, k indexes the k-th submodel, K represents the number of submodels, α_k represents the probability that the observation belongs to the k-th submodel, and φ(x|θ_k) represents the Gaussian distribution density function of the k-th submodel.
CN202111061863.6A 2021-09-10 2021-09-10 Double-stage noise cleaning method for large-scale face data set Active CN113505120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061863.6A CN113505120B (en) 2021-09-10 2021-09-10 Double-stage noise cleaning method for large-scale face data set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061863.6A CN113505120B (en) 2021-09-10 2021-09-10 Double-stage noise cleaning method for large-scale face data set

Publications (2)

Publication Number Publication Date
CN113505120A CN113505120A (en) 2021-10-15
CN113505120B (en) 2021-12-21

Family

ID=78016726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061863.6A Active CN113505120B (en) 2021-09-10 2021-09-10 Double-stage noise cleaning method for large-scale face data set

Country Status (1)

Country Link
CN (1) CN113505120B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114756541B (en) * 2022-05-25 2022-12-06 济南银华信息技术有限公司 Big data feature cleaning decision method and system for artificial intelligence training
CN115145904B (en) * 2022-07-06 2023-04-07 北京正远达科技有限公司 Big data cleaning method and big data acquisition system for AI cloud computing training

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399608B (en) * 2018-03-01 2021-10-15 桂林电子科技大学 High-dimensional image denoising method based on tensor dictionary and total variation
CN110245550B (en) * 2019-04-22 2021-05-11 北京云识图信息技术有限公司 Human face noise data set CNN training method based on total cosine distribution
CN110674120B (en) * 2019-08-09 2024-01-19 国电新能源技术研究院有限公司 Wind farm data cleaning method and device
US20210089964A1 (en) * 2019-09-20 2021-03-25 Google Llc Robust training in the presence of label noise
CN110598037B (en) * 2019-09-23 2022-01-04 腾讯科技(深圳)有限公司 Image searching method, device and storage medium
US11599792B2 (en) * 2019-09-24 2023-03-07 Salesforce.Com, Inc. System and method for learning with noisy labels as semi-supervised learning
CN112101328A (en) * 2020-11-19 2020-12-18 四川新网银行股份有限公司 Method for identifying and processing label noise in deep learning
CN113361346B (en) * 2021-05-25 2022-12-23 天津大学 Scale parameter self-adaptive face recognition method for replacing adjustment parameters
CN113361201B (en) * 2021-06-10 2023-08-25 南京大学 Crowd-sourced acquired tag data cleaning method based on noise tag learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251855A (en) * 2008-03-27 2008-08-27 腾讯科技(深圳)有限公司 Equipment, system and method for cleaning internet web page

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Meng Xiaochao et al. Label noise cleaning method based on active learning. Journal of Shaanxi Normal University (Natural Science Edition). 2020, (02). *
Pan Wei et al. An effective redundant data cleaning technique for multi-source RFID data. Journal of Northwestern Polytechnical University. 2011, (03). *

Also Published As

Publication number Publication date
CN113505120A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN110796186A (en) Dry and wet garbage identification and classification method based on improved YOLOv3 network
CN109190442B (en) Rapid face detection method based on deep cascade convolution neural network
CN111414862B (en) Expression recognition method based on neural network fusion key point angle change
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN113505120B (en) Double-stage noise cleaning method for large-scale face data set
CN110298391A (en) A kind of iterative increment dialogue intention classification recognition methods based on small sample
CN108229550B (en) Cloud picture classification method based on multi-granularity cascade forest network
CN105469080B (en) A kind of facial expression recognizing method
CN109214444B (en) Game anti-addiction determination system and method based on twin neural network and GMM
CN105930792A (en) Human action classification method based on video local feature dictionary
CN110414587A (en) Depth convolutional neural networks training method and system based on progressive learning
CN114092742A (en) Small sample image classification device and method based on multiple angles
CN115205521A (en) Kitchen waste detection method based on neural network
CN112800232A (en) Big data based case automatic classification and optimization method and training set correction method
CN109740672B (en) Multi-stream feature distance fusion system and fusion method
CN112766134B (en) Expression recognition method for strengthening distinction between classes
CN110827809B (en) Language identification and classification method based on condition generation type confrontation network
CN112801113A (en) Data denoising method based on multi-scale reliable clustering
JP2003256839A (en) Method for selecting characteristics of pattern, method for classifying pattern, method for judging pattern, and its program and its device
CN116524521A (en) English character recognition method and system based on deep learning
CN116403252A (en) Face recognition classification method based on multi-target feature selection of bidirectional dynamic grouping
CN112800959B (en) Difficult sample mining method for data fitting estimation in face recognition
CN115100694A (en) Fingerprint quick retrieval method based on self-supervision neural network
CN113989567A (en) Garbage picture classification method and device
CN113569835A (en) Water meter numerical value reading method based on target detection and segmentation identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant