CN113869463B - Long tail noise learning method based on cross enhancement matching - Google Patents

Long tail noise learning method based on cross enhancement matching

Info

Publication number
CN113869463B
CN113869463B
Authority
CN
China
Prior art keywords
data
enhancement
noise
sample
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111457536.2A
Other languages
Chinese (zh)
Other versions
CN113869463A (en
Inventor
程乐超
茅一宁
苏慧
冯尊磊
宋明黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202111457536.2A priority Critical patent/CN113869463B/en
Publication of CN113869463A publication Critical patent/CN113869463A/en
Application granted granted Critical
Publication of CN113869463B publication Critical patent/CN113869463B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a long-tail noise learning method based on cross-enhancement matching, which is used to solve image classification problems that exhibit long-tail characteristics and noisy labels at the same time. For the noise characteristics of the data, the method screens noise samples by matching the prediction results obtained from weakly enhanced and strongly enhanced data respectively, and introduces a noise-rejection regularization measure to eliminate the influence of the identified noise samples. For the long-tail characteristics of the data, the method applies a new prediction penalty based on an online prior distribution to avoid bias toward the head classes. The method is simple to implement and flexible, and it estimates the fitting degree of each class in real time, so it achieves a marked improvement in classification performance on long-tail data, noisy data, and training data exhibiting both characteristics.

Description

Long tail noise learning method based on cross enhancement matching
Technical Field
The invention relates to the field of image classification, in particular to a method for classifying images under the condition that a noise label and long tail distribution exist simultaneously.
Background
In recent years, convolutional neural networks (CNNs) have been widely used in the field of computer vision. With a fixed amount of training data, overfitting becomes increasingly prominent as the number of parameters grows, and improving overall performance therefore demands more accurately labeled data. However, obtaining a large number of accurately labeled samples is often quite expensive. Crowd-sourcing by non-experts or automatic system labeling is a practical alternative, but it easily introduces mislabeled samples. Many benchmark data sets, such as ImageNet, CIFAR-10/-100, MNIST and QuickDraw, contain 3% to 10% noisily labeled samples. Existing research on noisy labels has generally focused on separating correctly and incorrectly labeled samples while neglecting the distribution of the data. In the real world, data often follows a long-tail distribution: a few head categories dominate the data set while the remaining categories have too few samples. It is therefore important for practical applications to study how to train a model on a data set that exhibits both a long-tail distribution and label noise, as shown in Fig. 1.
Noisy-label learning has received much attention in recent years and has achieved impressive results. Because a convolutional neural network learns the simple, general patterns of clean data before fitting noisy samples during training, most existing methods use a cross-entropy loss to fit the model predictions to the data labels. However, on a data set with a long-tail distribution the training data is dominated by the head classes, and the cross-entropy loss has difficulty distinguishing correct from incorrect samples of the tail classes. For long-tail image classification, a series of data re-balancing strategies such as re-weighting and re-sampling balance the training data according to the number of samples per class. However, in the presence of label noise the true number of samples per class is unknown, and the raw sample count does not reflect the real-time fitting degree of a class. Based on the above analysis, existing deep convolutional networks still lack an effective solution for data sets that have both long-tail features and label noise.
Disclosure of Invention
In order to overcome the defects of the prior art, namely that the co-training strategies used for noisy data have difficulty distinguishing correct from incorrect samples of tail categories on long-tail data, and that the data re-balancing strategies used for long-tail classification perform poorly on noisy data, the invention adopts the following technical solution:
a long tail noise learning method based on cross enhancement matching comprises the following steps:
step S1: according to the data noise characteristics, respectively adopting a weak data enhancement strategy and a strong data enhancement strategy for each sample image, carrying out cross-enhancement matching (cross-enhancement matching) on the prediction results of the weak enhancement data and the strong enhancement data, and improving cross entropy loss into cross-enhancement matching loss for screening noise samples;
step S2: aiming at the sample data difference caused by the weak and strong data enhancement strategies, a dual-branch batch normalization method is adopted, and different parameters are used for the feature maps of the weakly enhanced data and the strongly enhanced data during model training;
step S3: the large-loss samples screened out by cross-enhancement matching are treated as high-confidence noise samples, and a noise-rejection regularization measure is used to eliminate the negative influence of these high-confidence noise samples on model training;
step S4: for the head-class classification advantage caused by the long-tail features of the data, the classification prior probability is estimated from online predictions so that the fitting degree of each class is truly reflected, and a prediction penalty based on the online prior distribution (online prior distribution) is used to smooth the prediction results of the head classes;
step S5: according to the data noise characteristics, a staged training strategy is used, and only the cross entropy loss and the online prior distribution loss of weak enhancement data are calculated in a preheating stage; in the formal training stage, the cross enhancement matching loss and the online prior distribution loss of the weak enhancement data and the strong enhancement data are calculated, and a regularization measure for eliminating noise is added.
Further, in step S1, given a training data set D = {(x_i, y_i)} containing N sample images and K image classes, where x_i is a sample image and y_i is its noisy label (i.e. y_i is not necessarily correct), the prediction result of the classification model is denoted p(x; θ) = f(x; θ), where θ are the network parameters, f is the mapping function, and p(x; θ) is a K-dimensional prediction vector. Whether x_i is a correctly labeled sample is determined according to a cross-enhancement matching loss L_cm(x_i), defined over the following quantities: the weakly enhanced data x_i^w and the strongly enhanced data x_i^s; the class prediction ŷ_i with the highest confidence for the sample image x_i; the confidence p_k(x_i) that the i-th sample image x_i carries label k; a weight parameter λ; and the one-hot vector Y_i of y_i, with T denoting the transpose.
Further, in step S1, data whose cross-enhancement matching loss is smaller than a threshold τ obtained with the OTSU method (Otsu's algorithm for determining an image binarization segmentation threshold) are recognized as correct data and form the correct data set D_c. In the training phase, the cross-enhancement matching loss is computed only on the data in D_c, and these per-sample losses are summed to give the total cross-enhancement matching loss.
further, in the step S2, in order to avoid a negative effect of the sample data difference caused by the weak data enhancement and strong data enhancement strategies on the feature extraction, a dual-branch batch processing standardization method is adopted, specifically, a weak enhancement data difference is subjected to a batch processing standardization process
Figure 47800DEST_PATH_IMAGE013
And strong enhancement data
Figure 989211DEST_PATH_IMAGE014
Calculating different mean and variance according to exponential moving average accumulation
Figure 696136DEST_PATH_IMAGE024
Figure 355787DEST_PATH_IMAGE025
Figure 173571DEST_PATH_IMAGE026
Wherein
Figure 703909DEST_PATH_IMAGE027
Is a constant number of times, and is,
Figure 581735DEST_PATH_IMAGE028
the number of sample images in a batch (batch) is shown, and the normalized output of the batch is
Figure 292465DEST_PATH_IMAGE029
Figure 789305DEST_PATH_IMAGE030
Wherein
Figure 564363DEST_PATH_IMAGE031
Figure 222877DEST_PATH_IMAGE032
Are all characteristic graphs of the middle layer of the neural network,
Figure 716176DEST_PATH_IMAGE031
the neural network batches the layer inputs representing weak enhancement inputs,
Figure 751128DEST_PATH_IMAGE033
which represents a weakly enhanced mean value, is,
Figure 646271DEST_PATH_IMAGE034
the variance of the weak enhancement is indicated,
Figure 741266DEST_PATH_IMAGE032
representing a strong enhancement input a neural network batches the layer inputs,
Figure 243833DEST_PATH_IMAGE035
which represents the average of the strong enhancement,
Figure 410372DEST_PATH_IMAGE036
it is meant that the variance is strongly enhanced,
Figure 769810DEST_PATH_IMAGE037
Figure 160340DEST_PATH_IMAGE038
Figure 238017DEST_PATH_IMAGE039
Figure 4985DEST_PATH_IMAGE040
are all learnable radiation parameters.
Further, in step S2, in the training phase a separate set of batch normalization parameters is trained for the weakly enhanced data and for the strongly enhanced data; in the testing phase, only the weak data enhancement strategy and the batch normalization parameters of the weakly enhanced data are used, and the batch normalization parameters of the strong data enhancement strategy are discarded.
Further, in step S3, according to the cross-enhancement matching loss, the screened-out noise samples form a high-confidence error data set D_e, whose size |D_e| is restricted by a constant upper bound. The screened large-loss samples are treated as high-confidence noise samples, and the network model is regularized through a noise-rejection regularization measure: for each sample image x_j in D_e, which carries a specific (wrong) label y_j, a regularization term is applied to constrain the network predictions on D_e and prevent them from fitting the wrong noise labels, where p_{y_j}(x_j) denotes the confidence that the j-th sample image x_j carries the label y_j.
Further, in step S4, the classification prior probability is estimated from online predictions; for the k-th class the prior probability π_k is dynamically evaluated as an exponential moving average, where η is a constant and π_k is initialized to the ratio of the number of samples of the class to the total number of samples, i.e. π_k = N_k / N, with N_k denoting the number of training samples of the k-th class in the training data of N samples.
Further, in step S4, the prediction penalty L_pr based on the online prior distribution is used to smooth the labels according to the prior distribution, so that labels with a higher prior probability receive stronger smoothing and the optimization of tail categories is strengthened; in the corresponding formula, x denotes the sample image, π the prior probability, and p(x; θ) = f(x; θ) the prediction result of the classification model, where θ are the network parameters and f is the mapping function.
Further, in step S4, the prediction penalty L_pr based on the online prior distribution is added to the cross-entropy loss function L_ce with a constant weighting coefficient α. The resulting loss L_ce + α·L_pr can be rewritten in a form expressed through the noisy label y, the confidence p_k(x_i) that the i-th sample image x_i carries label k, and the prior probability π_k of the k-th class.
Further, in step S5, training is divided into a preheating (warm-up) stage and a formal stage, which compute the losses and update the parameters as follows:
step S5.1: in the preheating stage, the cross-entropy loss and the online prior distribution penalty are computed on the weakly enhanced data only; the warm-up loss sums, over the training data set D, the cross-entropy loss L_ce between the prediction on the weakly enhanced sample x^w and its label y, plus α times the online-prior prediction penalty L_pr computed on x^w, where α is a constant weighting coefficient;
step S5.2: in the formal training stage, the cross-enhancement matching loss L_cm, the noise-rejection regularization loss, and the online prior distribution prediction penalties of the weakly and strongly enhanced data are combined; the correct data set D_c and the high-confidence error data set D_e are screened out, the total loss is formed from these terms, and the network parameters are updated with stochastic gradient descent (SGD).
The advantages and beneficial effects of the invention are as follows: according to the noise characteristics of the data, noise samples are screened by matching the prediction results obtained from weakly enhanced and strongly enhanced data respectively, and a noise-rejection regularization measure (leave-noise-out regularization) is introduced to eliminate the influence of the identified noise samples; for the long-tail characteristics of the data, a new prediction penalty based on an online prior distribution (online prior distribution) is applied to avoid bias toward the head classes.
Drawings
FIG. 1 is a schematic of a data set with both long tail distribution and tag noise.
Fig. 2 is a flow chart of the method of the present invention.
FIG. 3a to FIG. 3e are graphs of the change in test accuracy of the method of the invention and of other methods under different settings of the symmetric noise rate and the imbalance factor (the specific values are given only in the original drawings).
FIG. 4a to FIG. 4e are graphs of the change in test accuracy of the method of the invention and of other methods under different settings of the imbalance factor and the asymmetric noise rate (the specific values are given only in the original drawings).
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
As shown in fig. 1 and 2, a long tail noise learning method based on cross enhanced matching includes the following steps:
the method comprises the following steps: according to the data noise characteristics, a weak data enhancement strategy and a strong data enhancement strategy are respectively adopted for each sample, cross-enhancement matching (cross-entropy matching) is carried out on the prediction results of weak enhancement data and strong enhancement, and cross entropy loss is improved into cross-enhancement matching loss for screening the noise samples.
The invention involves two data enhancement strategies, namely weak data enhancement and strong data enhancement. Weak data enhancement (weak augmentation) is implemented as simple random flipping and cropping, while strong data enhancement (strong augmentation) uses the AutoAugment implementation and adopts the data enhancement policy automatically selected by a search algorithm on ImageNet.
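For illustration, the following PyTorch/torchvision sketch shows one way the two enhancement branches could be set up; the 32×32 crop size and the two-view dataset wrapper are assumptions, while the use of random flip/crop for the weak branch and of the ImageNet-searched AutoAugment policy for the strong branch follows the description above.

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy
from torch.utils.data import Dataset

# Weak augmentation: simple random crop and horizontal flip (CIFAR-style 32x32 assumed).
weak_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Strong augmentation: AutoAugment with the policy searched on ImageNet, as stated above.
strong_aug = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])

class TwoViewDataset(Dataset):
    """Wraps a base dataset so that each item yields a weak view, a strong view,
    the (noisy) label and the sample index."""
    def __init__(self, base):
        self.base = base
    def __len__(self):
        return len(self.base)
    def __getitem__(self, i):
        img, label = self.base[i]
        return weak_aug(img), strong_aug(img), label, i
```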
Given a training data set D = {(x_i, y_i)} containing N sample images and K image class labels, where x_i is a sample image and y_i is its noisy label (i.e. y_i is not necessarily correct), the prediction result of the classification model is defined as p(x; θ) = f(x; θ), where θ are the network parameters, f is the mapping function, and p(x; θ) is a K-dimensional prediction vector. Whether x_i is a correctly labeled sample is determined according to a cross-enhancement matching loss L_cm(x_i), defined over the following quantities: the weakly enhanced and strongly enhanced samples x_i^w and x_i^s; the class prediction ŷ_i with the highest confidence for x_i; the confidence p_k(x_i) that the i-th sample image carries label k; a weight parameter λ; and the one-hot vector Y_i of y_i, with T denoting the transpose.
Data whose cross-enhancement matching loss is smaller than a threshold τ obtained with the OTSU method (Otsu's algorithm for determining an image binarization segmentation threshold) are recognized as correct data and form the correct data set D_c. In the training phase, the cross-enhancement matching loss is computed only on the data in D_c, and these per-sample losses are summed to give the total cross-enhancement matching loss.
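As an illustrative sketch (not the patent's exact implementation), the Otsu threshold can be computed directly on the vector of per-sample cross-enhancement matching losses and then used to split the samples into the correct set and the noise candidates; the function names below are hypothetical.

```python
import numpy as np

def otsu_threshold(losses, bins=256):
    """Otsu's method applied to per-sample loss values instead of pixel intensities."""
    hist, edges = np.histogram(losses, bins=bins)
    prob = hist.astype(np.float64) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = prob[:k].sum(), prob[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (prob[:k] * centers[:k]).sum() / w0
        mu1 = (prob[k:] * centers[k:]).sum() / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2   # between-class variance
        if between_var > best_var:
            best_var, best_t = between_var, centers[k]
    return best_t

def split_by_otsu(per_sample_loss):
    """Samples whose loss falls below the Otsu threshold form the 'correct' set D_c;
    the remaining large-loss samples are candidates for the high-confidence error set."""
    tau = otsu_threshold(per_sample_loss)
    clean_idx = np.where(per_sample_loss < tau)[0]
    noisy_idx = np.where(per_sample_loss >= tau)[0]
    return clean_idx, noisy_idx, tau
```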
step two: aiming at sample data difference caused by weak data enhancement and strong data enhancement strategies, a dual-branch batch normalization method is adopted, and different parameters are respectively used for model training of feature maps of weak enhancement and strong enhancement data.
In order to avoid the negative influence on feature extraction of the sample data difference caused by the weak and strong data enhancement strategies, a dual-branch batch normalization method is adopted. Specifically, during batch normalization, separate means and variances are accumulated by exponential moving average (EMA in Fig. 2) for the weakly enhanced data x^w and the strongly enhanced data x^s, namely μ_w, σ_w² and μ_s, σ_s², where m is a constant (the moving-average momentum) and B denotes the number of samples in a batch. The normalized outputs of the batch are obtained by standardizing the intermediate-layer feature maps h_w and h_s of the neural network with their respective statistics and applying separate affine transformations, where h_w is the batch-normalization layer input for the weakly enhanced input, μ_w and σ_w² are the weakly enhanced mean and variance, h_s is the batch-normalization layer input for the strongly enhanced input, μ_s and σ_s² are the strongly enhanced mean and variance, and γ_w, β_w, γ_s, β_s are all learnable affine parameters.
In the training stage, a separate set of batch normalization parameters is trained for the weakly enhanced data and for the strongly enhanced data; in the testing stage, only the weak enhancement strategy and the batch normalization parameters of the weakly enhanced data are used, and the batch normalization parameters of the strong enhancement strategy are discarded.
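A minimal sketch of the dual-branch batch normalization, assuming PyTorch's built-in BatchNorm2d supplies the exponential-moving-average statistics and learnable affine parameters of each branch; the momentum value and the module/argument names are assumptions.

```python
import torch.nn as nn

class DualBranchBatchNorm2d(nn.Module):
    """Two BatchNorm2d branches with independent running statistics and affine
    parameters: one for weakly augmented batches, one for strongly augmented
    batches.  At test time only the weak branch is used, matching step two."""
    def __init__(self, num_features, momentum=0.1):
        super().__init__()
        self.bn_weak = nn.BatchNorm2d(num_features, momentum=momentum)
        self.bn_strong = nn.BatchNorm2d(num_features, momentum=momentum)

    def forward(self, x, branch="weak"):
        if self.training and branch == "strong":
            return self.bn_strong(x)
        # Weak branch during training, and the only branch used at evaluation time.
        return self.bn_weak(x)
```

In practice each BatchNorm2d layer of the backbone would be replaced by such a module, with the branch flag threaded through the forward pass of the network.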
Step three: regarding the large loss samples screened by the cross-enhancement matching as noise samples with high confidence, a regularization measure (LNOR in fig. 2) for eliminating noise is used to eliminate the negative influence of the samples on the model training.
For the large-loss samples screened out by cross-enhancement matching, a high-confidence error data set D_e is defined, whose size |D_e| is restricted by a constant upper bound. The selected high-loss samples are regarded as high-confidence noise samples, and the optimization of the network model is regularized through the noise-rejection regularization measure. Specifically, for each sample x_j in D_e, which is assumed to belong to a specific (wrong) label class y_j, a regularization term is applied to constrain the network predictions on D_e and prevent them from fitting the wrong noise labels, where p_{y_j}(x_j) denotes the confidence that the j-th sample image x_j carries the label y_j.
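The regularization formula itself appears only as an image in the original text, so the sketch below merely illustrates one plausible form of a leave-noise-out regularizer: it pushes the predicted confidence on the identified wrong label toward zero. Both this specific form and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def leave_noise_out_regularizer(logits_e, wrong_labels):
    """Hedged sketch: for samples in the high-confidence error set D_e, penalize
    -log(1 - p_{y_j}(x_j)) so the model stops fitting the identified wrong label y_j."""
    probs = F.softmax(logits_e, dim=1)                           # [|D_e|, K]
    p_wrong = probs.gather(1, wrong_labels.view(-1, 1)).squeeze(1)
    return -torch.log((1.0 - p_wrong).clamp_min(1e-6)).mean()
```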
Step four: aiming at the head class classification advantages caused by the long tail features of the data, a new online prior distribution (online prior distribution) -based prediction penalty is implemented to smooth the prediction result of the head class.
Aiming at the head-class classification advantage caused by the long-tail features of the data, the classification prior probability is estimated from online predictions in order to truly reflect the fitting degree of each class; for the k-th class the prior probability π_k is dynamically evaluated as an exponential moving average, where η is a constant and π_k is initialized to the ratio of the number of samples of the class to the total number of samples, i.e. π_k = N_k / N, with N_k denoting the number of training samples of the k-th class in the training data of N samples.
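A sketch of how the online prior could be tracked: the initialization from label counts follows the description above, while the momentum value and the use of hard argmax counts for the batch statistics are assumptions.

```python
import torch

class OnlinePrior:
    """Tracks the per-class prior probability pi_k online.  pi is initialised from
    the (noisy) label counts N_k / N and then updated as an exponential moving
    average of the model's own batch predictions."""
    def __init__(self, class_counts, eta=0.9):
        counts = torch.as_tensor(class_counts, dtype=torch.float32)
        self.pi = counts / counts.sum()          # pi_k = N_k / N
        self.eta = eta

    @torch.no_grad()
    def update(self, logits):
        pred = logits.argmax(dim=1)
        freq = torch.bincount(pred, minlength=self.pi.numel()).float()
        freq = freq / freq.sum().clamp_min(1.0)
        self.pi = self.eta * self.pi + (1.0 - self.eta) * freq
```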
The prediction penalty L_pr based on the online prior distribution is used to smooth the labels according to the prior distribution, so that labels with a higher prior probability receive stronger smoothing and the optimization of tail categories is strengthened. The prediction penalty L_pr is added to the cross-entropy loss function with a constant weighting coefficient α; the resulting loss can be rewritten in a form expressed through the confidence p_k(x_i) of the i-th sample image for label k and the prior probability π_k of the k-th class.
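Since the penalty formula is given only as an image in the original, the sketch below shows one plausible instantiation in which the penalty is the prior-weighted log-probability, so that classes with a larger online prior are penalized (smoothed) more strongly; this specific form and the value of alpha are assumptions.

```python
import torch.nn.functional as F

def prior_penalized_ce(logits, targets, pi, alpha=0.1):
    """Hedged sketch of cross-entropy plus an online-prior prediction penalty.
    ASSUMED penalty form: sum_k pi_k * log p_k(x), which discourages predictions
    biased toward high-prior (head) classes; alpha is the constant weight."""
    log_p = F.log_softmax(logits, dim=1)                    # [B, K]
    ce = F.nll_loss(log_p, targets)                         # -log p_y
    penalty = (pi.unsqueeze(0) * log_p).sum(dim=1).mean()   # sum_k pi_k log p_k
    return ce + alpha * penalty
```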
Step five: a phased training strategy is used based on the data noise signature. In the preheating stage, only the cross entropy loss and the online prior distribution loss of the weak enhancement data are calculated; in the formal training stage, the cross enhancement matching loss and the online prior distribution loss of the weak enhancement data and the strong enhancement data are calculated, and a regularization measure for eliminating noise is added.
The training is divided into a preheating stage and a formal stage, which compute the losses and update the parameters as follows:
Step 5.1: in the preheating stage, only the weakly enhanced data are used to compute the cross-entropy loss and the online prior distribution penalty, summed over the training data set with the constant weighting coefficient α.
Step 5.2: in the formal training stage, the cross-enhancement matching loss, the noise-rejection regularization loss, and the online prior distribution prediction penalties of the weakly and strongly enhanced data are combined; the correct data set D_c and the high-confidence error data set D_e are screened out, the total loss function is formed from these terms, and the network parameters are updated using stochastic gradient descent (SGD).
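A skeleton of the two-stage schedule, reusing the OnlinePrior and prior_penalized_ce sketches above and leaving the exact composition of the formal-stage loss to a caller-supplied function; all names are illustrative assumptions, not the patent's code.

```python
import torch

def run_training(model, loader, prior, warmup_epochs, total_epochs,
                 formal_loss_fn, alpha=0.1, lr=0.05):
    """Two-stage schedule sketch.  `formal_loss_fn(model, batch, prior)` stands in
    for the combined cross-enhancement matching, leave-noise-out and prior-penalty
    loss of step 5.2, whose exact composition is not reproduced here."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(total_epochs):
        for x_w, x_s, y, idx in loader:
            logits_w = model(x_w)
            prior.update(logits_w)           # keep the online prior up to date
            if epoch < warmup_epochs:
                # Preheating stage: cross-entropy + prior penalty on weak data only.
                loss = prior_penalized_ce(logits_w, y, prior.pi, alpha)
            else:
                # Formal stage: matching loss on D_c, noise rejection on D_e,
                # and prior penalties on both the weak and strong views.
                loss = formal_loss_fn(model, (x_w, x_s, y, idx), prior)
            opt.zero_grad()
            loss.backward()
            opt.step()
```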
In the prediction stage, the sample image to be classified is input into a model trained with the cross-enhancement-matching-based long-tail noise learning method, and the classification result is output.
The invention defines the ratio between the number of samples of the class with the most samples and that of the class with the fewest samples as the imbalance factor γ, i.e. γ = N_max / N_min. The type of long-tail data distribution used in the experiments of the invention is an exponential decay distribution.
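A common way to build such an exponentially decayed subset is sketched below under the usual CIFAR-LT convention; the function names are illustrative and this is not necessarily the patent's exact sampling code.

```python
import numpy as np

def longtail_sample_counts(n_max, num_classes, gamma):
    """Exponential-decay class sizes: class k keeps n_max * gamma**(-k/(C-1)) samples,
    so the ratio between the largest and smallest class equals the imbalance factor gamma."""
    return [int(n_max * gamma ** (-k / (num_classes - 1)))
            for k in range(num_classes)]

def make_longtail_indices(labels, n_max, gamma, seed=0):
    """Sub-sample a balanced label array into an exponentially decayed long-tail subset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    num_classes = int(labels.max()) + 1
    keep = []
    for k, n_k in enumerate(longtail_sample_counts(n_max, num_classes, gamma)):
        idx = np.where(labels == k)[0]
        keep.extend(rng.choice(idx, size=min(n_k, len(idx)), replace=False))
    return np.array(keep)
```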
For the setting of the noise data there are two cases, class-independent noise and class-dependent noise. Class-independent noise assumes that mislabeled samples are randomly and uniformly distributed, while class-dependent noise focuses on human labeling errors caused by visual similarity. The invention defines the error probability of a sample label as the noise rate ρ. For class-independent noise, each sample has probability ρ of being randomly mislabeled as an arbitrary other class; for class-dependent noise, the samples of each pair of classes are labeled as the opposite class of the pair with probability ρ.
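The two noise settings can be injected as in the following sketch; the choice of class pairs for the class-dependent case is left to the caller, and the helper names are illustrative.

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Class-independent noise: each sample is relabelled, with probability
    noise_rate, to a uniformly random different class."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)
    flip = rng.random(len(noisy)) < noise_rate
    for i in np.where(flip)[0]:
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy

def add_pairwise_noise(labels, noise_rate, pairs, seed=0):
    """Class-dependent noise: for each (a, b) in pairs, samples of class a are
    relabelled as b (and b as a) with probability noise_rate."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels, copy=True)
    swap = {}
    for a, b in pairs:
        swap[a], swap[b] = b, a
    flip = rng.random(len(noisy)) < noise_rate
    for i in np.where(flip)[0]:
        if noisy[i] in swap:
            noisy[i] = swap[noisy[i]]
    return noisy
```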
The experiments are conducted with the PyTorch framework on the CIFAR-10 data set, using ResNet-32 as the network model and an SGD optimizer with an initial learning rate of 0.05 and a cosine annealing scheduler. 100 iterations of training are set in both training stages, with a batch size of 128; the remaining parameter values are given in the original text. All experiments of the invention are trained from scratch.
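A sketch of the stated experimental setup; the momentum and weight-decay values and the ResNet-32 implementation are assumptions not given in this text.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

def build_training_objects(model, train_set, epochs=100, batch_size=128,
                           lr=0.05, momentum=0.9, weight_decay=5e-4):
    """Optimizer, scheduler and loader matching the stated settings (SGD with an
    initial learning rate of 0.05, cosine annealing, batch size 128, 100 training
    rounds per stage); momentum and weight decay are assumed values, and `model`
    is expected to be a CIFAR-style ResNet-32 provided elsewhere."""
    optimizer = SGD(model.parameters(), lr=lr,
                    momentum=momentum, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    loader = DataLoader(train_set, batch_size=batch_size,
                        shuffle=True, num_workers=4)
    return optimizer, scheduler, loader
```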
As shown in FIGS. 3a-3e, CE, Co-teaching+, JoCoR, Co-matching and the method of the present invention are compared on the CIFAR-10 data set with a ResNet-32 network for learning from long-tail distributed noisy samples; the curves of test accuracy show that the accuracy of the method of the invention is superior to that of the other methods, where the symmetric noise rate and the imbalance factor are as indicated in the original figures.
As shown in FIGS. 4a-4e, CE, LDAM, Mixup, MiSLAS and the method of the invention are compared on the CIFAR-10 data set with a ResNet-32 network for learning from long-tail distributed noisy samples; the accuracy of the method of the invention is superior to that of the other methods, where the imbalance factor and the asymmetric noise rate are as indicated in the original figures.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A long tail noise learning method based on cross enhancement matching is characterized by comprising the following steps:
step S1: according to the data noise characteristics, a weak data enhancement strategy and a strong data enhancement strategy are respectively adopted for each sample image;
step S2: aiming at the sample data difference caused by the weak and strong data enhancement strategies, a dual-branch batch normalization method is adopted, and different parameters are used for the feature maps of the weakly enhanced data and the strongly enhanced data during model training;
step S3: carrying out cross enhancement matching on the prediction results of the weak enhancement data and the strong enhancement data to screen a noise sample, and eliminating the negative influence of the noise sample on model training by using a noise elimination regularization measure on the noise sample screened by the cross enhancement matching;
step S4: for the head class classification advantages caused by the long tail characteristics of the data, estimating the classification prior probability from online prediction, and based on the prediction penalty of online prior distribution, smoothing the prediction result of the head class;
step S5: according to the data noise characteristics, a staged training strategy is used, and only the cross entropy loss and the online prior distribution loss of weak enhancement data are calculated in a preheating stage; in the formal training stage, the cross enhancement matching loss and the online prior distribution loss of the weak enhancement data and the strong enhancement data are calculated, and a regularization measure for eliminating noise is added.
2. The long-tail noise learning method based on cross-enhancement matching according to claim 1, wherein in step S3, given a training data set D = {(x_i, y_i)} containing N sample images and K image classes, where x_i is a sample image and y_i is its noisy label, the prediction result of the classification model is denoted p(x; θ) = f(x; θ), where θ are the network parameters, f is the mapping function, and p(x; θ) is a K-dimensional prediction vector; whether x_i is a correctly labeled sample is determined according to a cross-enhancement matching loss defined over the following quantities: the weakly enhanced data x_i^w and the strongly enhanced data x_i^s, the class prediction ŷ_i with the highest confidence for the sample image x_i, the confidence p_k(x_i) of the i-th sample image x_i for the k-th class, a weight parameter λ, and the one-hot vector Y_i of y_i, with T denoting the transpose.
3. The method according to claim 2, wherein in step S1, data whose cross-enhancement matching loss is smaller than a threshold τ obtained with the OTSU method are recognized as correct data and form the correct data set D_c; in the training phase, only the data in D_c are used to compute the cross-enhancement matching loss, and these per-sample losses are summed to give the total loss.
4. The long-tail noise learning method based on cross-enhancement matching according to claim 1, wherein in step S2, a dual-branch batch normalization method is used: separate means and variances are accumulated by exponential moving average for the weakly enhanced data x^w and the strongly enhanced data x^s, namely μ_w, σ_w² and μ_s, σ_s², where m is a constant and B denotes the number of sample images in a batch; the normalized outputs of the batch are obtained by standardizing the intermediate-layer feature maps h_w and h_s of the neural network with their respective statistics and applying separate affine transformations, where h_w is the batch-normalization layer input for the weakly enhanced input, μ_w and σ_w² are the weakly enhanced mean and variance, h_s is the batch-normalization layer input for the strongly enhanced input, μ_s and σ_s² are the strongly enhanced mean and variance, and γ_w, β_w, γ_s, β_s are all learnable affine parameters.
5. The long-tail noise learning method based on cross-enhancement matching according to claim 1, wherein in step S2, in the training phase a separate set of batch normalization parameters is trained for the weakly enhanced data and for the strongly enhanced data; in the testing phase, only the weak data enhancement strategy and the batch normalization parameters of the weakly enhanced data are used, and the batch normalization parameters of the strong data enhancement strategy are discarded.
6. The method according to claim 3, wherein in step S3, according to the cross-enhancement matching loss, the screened-out noise samples form a high-confidence error data set D_e, whose size |D_e| is restricted by a constant upper bound; the screened large-loss samples are treated as high-confidence noise samples, and the network model is regularized through a noise-rejection regularization measure: for each sample image x_j in D_e, which belongs to a specific (wrong) label class y_j, a regularization term is applied to constrain the network prediction results on D_e, where p_{y_j}(x_j) denotes the confidence that the j-th sample image x_j carries the label y_j.
7. The method of claim 3, wherein in step S4, the classification prior probability is estimated from online predictions; for the k-th class the prior probability π_k is dynamically evaluated as an exponential moving average, where η is a constant and π_k is initialized to the ratio of the number of samples of the class to the total number of samples, i.e. π_k = N_k / N, with N_k denoting the number of training samples of the k-th class in the training data of N samples and the N_k satisfying Σ_k N_k = N.
8. The long-tail noise learning method based on cross-enhancement matching according to claim 1, wherein in step S4, the prediction penalty L_pr based on the online prior distribution is used to smooth the labels according to the prior distribution, so that labels with a higher prior probability receive stronger smoothing; in the corresponding formula, x denotes the sample image, π the prior probability, and p(x; θ) = f(x; θ) the prediction result of the classification model, where θ are the network parameters and f is the mapping function.
9. The long-tail noise learning method based on cross-enhancement matching according to claim 8, wherein in step S4, the prediction penalty L_pr based on the online prior distribution is added to the cross-entropy loss function L_ce with a constant weighting coefficient α; the resulting loss can be rewritten in a form expressed through the noisy label y, the confidence p_k(x_i) of the i-th sample image x_i for the k-th class, and the prior probability π_k of the k-th class.
10. The long-tail noise learning method based on cross-enhancement matching according to claim 1, wherein the training in step S5 is divided into a preheating stage and a formal stage, which compute the losses and update the parameters as follows:
step S5.1: in the preheating stage, the cross-entropy loss and the online prior distribution penalty are computed on the weakly enhanced data: the warm-up loss sums, over the training data set D, the cross-entropy loss L_ce between the prediction on the weakly enhanced sample x^w and the label y of the sample data, plus α times the online-prior prediction penalty computed on the weakly enhanced data, where α is a constant weighting coefficient;
step S5.2: in the formal training stage, the cross-enhancement matching loss, the noise-rejection regularization loss, and the online prior distribution prediction penalties of the weakly and strongly enhanced data are combined; the correct data set D_c and the high-confidence error data set D_e are screened out, the total loss function is formed from these terms, and the network parameters are updated using stochastic gradient descent.
CN202111457536.2A 2021-12-02 2021-12-02 Long tail noise learning method based on cross enhancement matching Active CN113869463B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111457536.2A CN113869463B (en) 2021-12-02 2021-12-02 Long tail noise learning method based on cross enhancement matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111457536.2A CN113869463B (en) 2021-12-02 2021-12-02 Long tail noise learning method based on cross enhancement matching

Publications (2)

Publication Number Publication Date
CN113869463A CN113869463A (en) 2021-12-31
CN113869463B true CN113869463B (en) 2022-04-15

Family

ID=78985557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111457536.2A Active CN113869463B (en) 2021-12-02 2021-12-02 Long tail noise learning method based on cross enhancement matching

Country Status (1)

Country Link
CN (1) CN113869463B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863193B (en) * 2022-07-07 2022-12-02 之江实验室 Long-tail learning image classification and training method and device based on mixed batch normalization

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516207A (en) * 2021-09-10 2021-10-19 之江实验室 Long-tail distribution image classification method with noise label

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516207A (en) * 2021-09-10 2021-10-19 之江实验室 Long-tail distribution image classification method with noise label

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Label Noise Types and Their Effects on Deep Learning"; Gorkem Algan et al.; arXiv:2003.10471v1 [cs.CV]; 20200323; full text *
"Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition from a Domain Adaptation Perspective"; Muhammad Abdullah Jamal et al.; arXiv:2003.10780v1 [cs.CV]; 20200324; full text *
"Label Noise Filtering Method Based on Data Distribution"; Chen Qingqiang et al.; The 6th CCF Big Data Conference; 20201130; full text *

Also Published As

Publication number Publication date
CN113869463A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
CN107316061B (en) Deep migration learning unbalanced classification integration method
CN109101938B (en) Multi-label age estimation method based on convolutional neural network
CN111079847B (en) Remote sensing image automatic labeling method based on deep learning
CN109815979B (en) Weak label semantic segmentation calibration data generation method and system
CN114283287B (en) Robust field adaptive image learning method based on self-training noise label correction
CN111239137B (en) Grain quality detection method based on transfer learning and adaptive deep convolution neural network
CN113869463B (en) Long tail noise learning method based on cross enhancement matching
CN111144462B (en) Unknown individual identification method and device for radar signals
CN113657449A (en) Traditional Chinese medicine tongue picture greasy classification method containing noise labeling data
CN116385373A (en) Pathological image classification method and system combining stable learning and hybrid enhancement
CN116894985A (en) Semi-supervised image classification method and semi-supervised image classification system
CN116912568A (en) Noise-containing label image recognition method based on self-adaptive class equalization
CN113608223B (en) Single-station Doppler weather radar strong precipitation estimation method based on double-branch double-stage depth model
CN113902944A (en) Model training and scene recognition method, device, equipment and medium
CN111598580A (en) XGboost algorithm-based block chain product detection method, system and device
CN109376619A (en) A kind of cell detection method
CN116486150A (en) Uncertainty perception-based regression error reduction method for image classification model
CN116129185A (en) Fuzzy classification method for tongue-like greasy feature of traditional Chinese medicine based on collaborative updating of data and model
CN114445649A (en) Method for detecting RGB-D single image shadow by multi-scale super-pixel fusion
CN112465821A (en) Multi-scale pest image detection method based on boundary key point perception
CN114863193B (en) Long-tail learning image classification and training method and device based on mixed batch normalization
CN117173494B (en) Noise-containing label image recognition method and system based on class balance sample selection
CN114745231A (en) AI communication signal identification method and device based on block chain
CN116245866B (en) Mobile face tracking method and system
CN113313179B (en) Noise image classification method based on l2p norm robust least square method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant