CN113516207B

CN113516207B - Long-tail distribution image classification method with noise label

Info

Publication number: CN113516207B
Application number: CN202111059448.7A
Authority: CN
Inventors: 程乐超; 茅一宁; 冯尊磊; 宋明黎
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2022-01-25
Anticipated expiration: 2041-09-10
Also published as: CN113516207A

Abstract

The invention discloses a long-tail distribution image classification method with a noise label, which is used for learning through relaxation interval loss depending on a sample and assisting with an anti-noise data enhancement strategy and solving the problem of image classification with a long-tail characteristic and a noise label at the same time. According to the data noise characteristics, introducing a relaxation variable dependent on the sample when calculating the sample function interval to relax the interval constraint, and calculating the smooth relaxation loss dependent on the sample according to the sample interval classification; according to the data long tail characteristics, a data enhancement strategy adjusted in stages is implemented, samples are enhanced strongly and weakly respectively, and a sample screening mechanism based on relaxation loss is provided in the formal training stage for screening out noise data. The method is simple and convenient to implement, flexible in means, and capable of remarkably improving the classification effect on long-tail data, noise data and training data with the characteristics of the long-tail data and the noise data.

Description

Long-tail distribution image classification method with noise label

Technical Field

The invention relates to the field of image classification, in particular to a method for classifying images under noise labels and long tail distribution data.

Background

In recent years, Convolutional Neural Networks (CNNs) have been widely used in the field of computer vision. In the case of a fixed amount of training data, the overfitting phenomenon is increasingly prominent due to the increase of the number of parameters, and the requirement for accurately labeling data is also increased in order to improve the overall performance. However, obtaining a large number of accurately labeled samples is often quite expensive. In this regard, non-expert crowd-sourcing or systematic tagging is a practical solution, however this easily leads to mislabeling of tags. Many reference datasets, such as ImageNet, CIFAR-10/-100, MNIST, QuickDraw, etc., contain 3% to 10% noise label samples. Existing research on noisy labels has generally focused on splitting correctly labeled and incorrectly labeled samples, but neglecting the distribution of the data. In the real world, data often presents the characteristic of long tail distribution, several main categories in the data set occupy the dominance, and the data of other categories is insufficient in quantity. Therefore, in the current image classification task based on the deep neural network, how to classify data with long tail features and noise labels simultaneously to reduce the influence of the noise labels under long tail distribution is very important in practical application.

Disclosure of Invention

In order to solve the defects of the prior art and achieve the purpose of reducing the influence of noise labels under long tail distribution, the invention adopts the following technical scheme:

a method for classifying long-tail distribution images with noise labels comprises the following steps:

s1, according to the data noise characteristics, each sample image and the noise label thereof

At sample intervals

On the basis of (2), introducing a relaxation variable

Forming sample relaxation intervals of noisy samples

；

The sample interval is

Class interval of

Wherein

Is shown as

A sample

Is marked with a label

Is a category

I.e. specimen

Belong to the category

And the process of, accordingly,

indicates all belong to the category

A set of sequence numbers of samples of (1);

the sample relaxation intervals were:

wherein,

indicating the sample image and its correct label,

representing a prediction function for predicting to which class a sample image belongs,

in order to be a sample space, the sample space,Nis the total number of samples and is,

is composed of

A set of tags for each of the categories,

the representation of the real number field is performed,

is shown and

different noise labels

And x corresponding thereto, the largest value among the values obtained by the prediction function,

，

representing an optimal interval; the traditional DNN classification network usually follows a linear conversion layer after a feature extractor, but the strategy is easy to generate the situation that the classifier falls into linear inseparability when fitting data with noise, so the invention proposes a relaxation variable

Introducing slack variables with relaxed spacing constraints

Sample relaxation interval of

The tolerance of classification prediction results is increased;

according to the sample interval, the smooth relaxation Loss (Slack Loss) depending on the sample is calculated in a segmented mode

；

S2, according to the data long tail characteristics, implementing the divisionData Augmentation strategy for stage adjustment for noisy Data sets

Each group of samples in (1)

For sample image

Respectively carrying out weak data enhancement and strong data enhancement to obtain corresponding weak enhancement data and strong enhancement data, dividing training into a preheating stage and a formal stage, considering the negative influence of a strong data enhancement method on a high noise rate data set, respectively calculating and adding relaxation loss in the training stage by using the weak enhancement data and the strong enhancement data, and calculating and adding the relaxation loss in the noise rate

And

as the weight, in the preheating stage, directly calculating the relaxation loss of the weak enhancement data and the strong enhancement data; in the formal training stage, a group of sample images are screened and relaxed to be used as pure data according to the relaxation loss in the preheating stage, residual noise data are screened out, and the relaxation loss is calculated. The method of injecting strong data enhancement during the warm-up phase of training may improve performance for training of low noise data sets, but is counterproductive as the noise of the data set increases. Conversely, the weak data enhancement during the warm-up phase can greatly improve the performance of the high noise data training. Based on this summary, the present invention divides model training into two phases, adjusting the enhancement strategy at different phases.

Further, the relaxation loss in S1 is:

further onThe warm-up stage in S2, using weak enhancement data directly

And strong enhancement data

Calculating the relaxation loss as the noise rate

And

as weights, the overall loss is calculated:

wherein,

。

further, the formal training phase in S2 includes the following steps:

s21, screening out the slack loss according to the slack loss in the preheating stage

、

As weak enhancement data

And strong enhancement data

Front of minimum medium slack loss

A partial sample image;

s22, according toWeak enhanced data after screening

From strong enhancement data

Obtained by intermediate sampling

According to the screened strong enhancement data

From weak enhancement of data

Obtained by intermediate sampling

Screening out the remaining noise data;

s23, obtaining

、

As correct sample image, at the noise rate

And

as weight, calculating the overall loss, returning the loss, updating network parameters:

wherein,

。

further, in the S21, the

、

The following screens were used:

。

further, in the above S1, an optimum interval is set

、

For training data points

Sample interval

Greater than the optimum interval

Therefore, it needs to be pushed to the class boundary to make the data boundary more gradual; for the sample interval at

Data points within the interval

，

In the opposite direction, so that the data point has a certain probability of turning into the other side of the class boundary;

、

indicating for a category

And

is not an exact formula but is stated to be inversely proportional to the number of samples corresponding to the class in view of the relationship between the two classes

And

is/are as follows

To the power. Thereby setting the sample-dependent tolerance range.

Further, the relaxation variable in S1

Will be uniformly distributed

Multiplication by

From which the slack variable is extracted

I.e. by

，

Representing the noise rate, i.e., the probability of sample label error.

Further, for the setting of the long tail distribution data, the total number of samples is

In each category of training data

The number of training samples is

Satisfy the following requirements

The ratio of the most sample number class to the least sample number class is used as an imbalance factor (imbalance factor)

I.e. by

。

Further, the sample image and its noise label in S1

By transition matrix (transition matrix)

Representation represents a noise label:

wherein,

representing a sample image

The corresponding category of the content file,

is shown asnThe number of images of the sample is determined,

representing categories

Is classified into categoriesjThe probability of (a) of (b) being,

. For the setting of the noise data, there are 2 cases, i.e., class-independent noise (class-dependent noise) and class-dependent noise (class-dependent noise). The category-independent noise assumes that the mislabeled samples are randomly and uniformly distributed, and the category-dependent noise focuses on the phenomenon of artificial labeling error caused by visual similarity. Both types of noise distributions may use a transition matrix

And (4) showing.

Further, the sample image and its noise label in S1

Sampled in noisy data sets

Corresponding to the sample image and its correct label

Sampled in a clean data set

Wherein

Is shown asnThe number of images of the sample is determined,

representing a sample image

The corresponding category of the content file,

as to the number of samples,

average sampling from potential distribution of data

。

The invention has the advantages and beneficial effects that:

starting from the category correlation interval, the method introduces the relaxation variable depending on the sample, relaxes the interval constraint, and increases the tolerance of the classification prediction result, thereby bearing the risk of wrong classification caused by noise or unbalanced distribution; considering the negative influence of a strong data enhancement method on a high-noise-rate data set, the method calculates the relaxation loss in the training stage by using weak enhancement data and strong enhancement data respectively; the method for injecting strong data enhancement in the preheating stage of training can improve the performance of the training of low-noise data sets, but can be counterproductive when the noise of the data sets is increased, and on the contrary, the weak data enhancement in the preheating stage can greatly improve the performance of the training of high-noise data. Finally, the effect of noise signatures under long tail distributions is reduced.

Drawings

FIG. 1a is a graph of accuracy versus loss variation for noise sample learning on a CIFAR-10 data set.

FIG. 1b is a graph of accuracy versus loss variation for long tail distribution learning on a CIFAR-10 dataset.

FIG. 2a is a graph of the distribution of class independent noise (asymmetric noise ratio)

）。

FIG. 2b shows the distribution of class-dependent noise (asymmetric noise ratio)

）。

Fig. 2c is a distribution diagram of class independent noise under a long tail distribution.

Fig. 2d is a distribution diagram of class-correlated noise under a long-tailed distribution.

FIG. 3 is a graph of sample dependence tolerance in the present invention.

Detailed Description

The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.

Noise label learning has received a lot of attention in recent years and also has achieved surprising effects. However, existing deep neural networks DNN still have drawbacks in addressing noise labeling and long tail learning. As shown in fig. 1a, in which the noise ratio is symmetrical

When DNN is used to fit the noise label, the fluctuation in validation accuracy accounts for the noise capacity of the model. As shown in fig. 1b, in which the imbalance factor

The application of DNN in long tail distribution learning also embodies similar characteristics, i.e. DNN fits the main class first, then gradually the tail class. From the above analysis, it can be found that the contradiction between the few-sample class learning and the noise fitting confounds the prediction of the neural network, giving the noise label learning band under the long tail distributionNew challenges arise.

Deep Neural Networks (DNNs), during training with noisy data, tend to remember common patterns first, and then fit noise samples step by step. A similar process occurs in class imbalance learning, where the network tends to fit the main class first, then over-fit the tail class step by step. In this regard, the present invention starts with class-aware intervals (class-dependent margin), and introduces a sample-dependent relaxation Variable (Slack Variable) for taking up the risk of misclassification due to noise or unbalanced distribution. In addition, the invention also provides an anti-noise data enhancement strategy.

1. Experimental setup and preparation

The invention mainly solves the problem of classifying data with long tail characteristics and noise labels simultaneously in an image classification task. Defining an input space as a set of input spaces

，

Is as followsnAn input image is input to the image processing device,Nis the total number of samples and is,

the label sets of individual categories are

Potential distribution of data

. Training data pairs

Sample at

，

Is the total number of samples and is,

representing an input image

A corresponding category. The image classification task designed in the present invention therefore needs to be aimed at by constraining

Deriving a prediction function

Indicating to input

To proceed with

Calculating and outputting classification results

So that

The number of (a) is the largest, in short,

predicting each input image

To which category the prediction result belongs, and outputting the prediction result

；

The optimization goal is to maximize the correct number of prediction results,

representing a real number domain.

For the setting of the long tail distribution data, the total number of samples is set as

In each category of training data

Is defined as the number of training samples

Satisfy the following requirements

. The invention defines the sample number ratio between the most sample number category and the least sample number category as an unbalance factor (imbalance factor)

I.e. by

. As shown in fig. 2a, 2b, the distribution of the long tail data generally follows an exponential decay.

For the setting of the noise data, there are 2 cases, i.e., class-independent noise (class-dependent noise) and class-dependent noise (class-dependent noise). The category-independent noise assumes that the mislabeled samples are randomly and uniformly distributed, and the category-dependent noise focuses on the phenomenon of artificial labeling error caused by visual similarity. Both types of noise distribution may use a transition matrix (transition matrix)

Is shown, each element therein

Representative categories

Is classified into categories

The probability of (c). Corresponding correct sample and label thereof

Sampling in clean data sets

Samples representing data and noise labels therefor

Sampling in noisy data sets

Wherein

As to the number of samples,

can be defined as:

as shown in FIGS. 2a and 2b, the class-independent noise, or symmetric noise, is assumed to be of a certain class

Are evenly distributed over the other classes, i.e.

，

Representing the noise rate, i.e., the probability of sample label error; while class-dependent noise, or asymmetric noise, assumes a certain oneCategories

Are all error labeled as another specific category

I.e. by

. As shown in fig. 2c and 2d, the setting of noise data under a long tail distribution.

2. Sample dependent tolerance range setting

Conventional DNN classification networks typically follow the feature extractor with a linear conversion layer, however this strategy tends to produce cases where the classifier falls into linearity inseparability when fitting to noisy data. Therefore, the present invention proposes a relaxation variable

To relax the interval constraint and increase the tolerance of classification. Relaxation variables

At corresponding optimum intervals empirically

To be restricted, i.e.

(ii) a For data with noisy samples, there is a noise due to each sample

The probability of (noise rate) being a wrong sample, so here we will distribute uniformly

Multiplication by noise rate

From which the slack variable is extracted

I.e. by

。

At sample intervals

The present invention defines the relaxation interval as:

the relaxation interval increases the tolerance of the classification prediction results. As shown in FIG. 3, the optimum interval is set

、

For training data points

Interval of function

Greater than the optimum interval

Therefore, it needs to be pushed to the class boundary to make the data boundary more gradual; for the function interval at

Data points within the interval

，

、

relative to that in the above theoretical calculation

In the present embodiment, the description is given for the category

And

is not an exact formula but specifies that they are inversely proportional to the number of samples of the class in view of the relationship between the two classes

And

is/are as follows

To the power.

3. Loss of slack space

Relaxed interval for a particular noise sample

The sample dependent relaxation loss is defined as:

4. weak data enhancement strategy and strong data enhancement strategy

The present invention relates to 2 data enhancement strategies, namely weak data enhancement and strong data enhancement. The implementation of weak data enhancement (weak augmentation) is simple random flip (flip) and crop (crop), while strong data enhancement (strong augmentation) uses the implementation of AutoAutoAutoAutoaugmentation and adopts a data enhancement strategy automatically selected by a search algorithm on ImageNet.

5. Anti-noise data enhancement strategy implemented in stages

The method of injecting strong data enhancement during the warm-up phase of training may improve performance for training of low noise data sets, but is counterproductive as the noise of the data set increases. Conversely, the weak data enhancement during the warm-up phase can greatly improve the performance of the high noise data training. Based on this summary, the present invention divides model training into two phases, adjusting the enhancement strategy at different phases. In the warm-up phase, weak enhancement data is directly used

And strong enhancement data

Calculating the loss, namely:

in the formal training stage, the proportion of the screening quantity to the total quantity of the samples is that

Of (2) a sample

The remaining noise data is filtered out as "correct samples" for calculating the loss and update parameters, defining the loss as:

wherein

。

6. Training phase data screening strategy

The invention screens the data in the formal training phase and has the noise rate of

The training data of (1) is selected as the proportion of the total number of samples

Effective sample of

As "correct samples" for calculating the loss and update parameters, the remaining noise data is filtered out. The screening process firstly separately

Is defined as:

namely, it is

Respectively representing weakly enhanced data

And strong enhancement data

Front of minimum medium slack loss

A portion of the sample. Then, according to the screened weak enhancement data

From strong enhancement data

Obtained by intermediate sampling

(ii) a Similarly, according to the screened strong enhancement data

From weak enhancement of data

Obtained by intermediate sampling

The remaining noise data is filtered out.

Specifically, the sample-dependent relaxation interval loss learning method comprises the following steps:

step 1: according to the data noise characteristics, each sample and the noise label thereof

After calculating the sample function interval (functional margin)

When a sample dependent relaxation Variable (Slack Variable) is introduced

And calculating the sample-dependent smooth relaxation Loss (Slack Loss) according to the sample interval by sections by relaxing the interval constraint

。

Data samples and noise signatures thereof

Sampling in noisy data sets

Corresponding to correct sample and label thereof

Sampling in clean data sets

Wherein

As to the number of samples,

a noise rate representing the probability of sample label error,

average sampling from potential distribution of data

，

In order to input the space, the input device is provided with a display,

is composed of

A set of labels for each category.

For the prediction function

Spacing of samples

Is defined as:

at the same time, classify

Is defined as

Wherein

Is shown as

Sample (serial number

Sample (1)

Is marked with a label

Is a category

I.e. specimen

Belong to the category

And the process of, accordingly,

indicates all belong to the category

A set of sequence numbers of samples of (1).

Relaxed interval for a particular noise sample

At sample intervals

On the basis of (2), introducing a relaxation variable

The relaxation interval is defined as:

relaxation variance of sample

At corresponding optimum intervals empirically

To be restricted, i.e.

(ii) a For noise rate of

Will be evenly distributed

Multiplication by noise rate

From which the slack variable is extracted

I.e. by

。

Relaxing intervals for noise samples

The sample dependent relaxation loss is defined as:

step 2: according to the Data long tail characteristic, a Data Augmentation strategy (Data Augmentation) adjusted in stages is implemented. And respectively carrying out strong data enhancement and weak data enhancement on the sample. In the preheating stage, directly calculating relaxation loss; in the formal training phase, a mechanism is provided to screen small loss samples as clean data, to screen out noisy data, and to calculate slack loss.

For noisy data sets

Each group of samples in (1)

To input of

Respectively carrying out weak data enhancement and strong data enhancement to obtain corresponding weak enhancement data

And strong enhancement data

。

Considering the negative effect of strong data enhancement method on high noise rate data set, the invention uses the relaxation loss of training stage with weak enhancement data respectively

And strong enhancement data

Are calculated and added to obtain the noise ratio

And

as weights, the penalty is defined as:

wherein

。

The training is divided into a preheating stage and a formal stage, and the loss and the parameters are calculated and updated in the following modes respectively:

2.1: in the warm-up phase, weak enhancement data is directly used

And strong enhancement data

Calculating the loss, namely:

2.2: in the formal training stage, according to the relaxation loss of the samples, the proportion of the screening quantity to the total amount of the samples is

Of (2) a sample

。

in the formal training stage, the sample is screened

Is first defined separately

Enhancing data for weaknesses

And strong enhancement data

Front of minimum medium slack loss

A portion of the sample; then, according to the screened weak enhancement data

From strong enhancement data

Obtained by intermediate sampling

(ii) a Similarly, according to the screened strong enhancement data

From weak enhancement of data

Obtained by intermediate sampling

The remaining noise data is filtered out. Using the obtained

、

And calculating the loss by using the formula, returning the loss and updating the network parameters.

As shown in Table 1, on the CIFAR-10 and CIFAR-100 data sets with noise, using ResNet34 as a general network framework, noise ratios were set to the class-independent noise and the class-dependent noise, respectively

And

compared with Bootstrap, Forward, GCE, SCE and other methods. For class independent noise, the relaxation penalty proposed herein outweighs all other approaches. For class-dependent noise, the relaxation penalty is in

Is slightly better than other methods, but in

The accuracy is not high. To this end, we give a reasonable explanation: the relaxation variables introduced by the invention add random disturbance to the sample label distribution, and the random disturbance may have negative effects because the category-related noise is accurate in the non-corresponding categories. And when the noise rate is small (

) Regularization adjustment of relaxation loss may balance this negative effect.

As shown in Table 2, on CIFAR-10 and CIFAR-100 datasets with long tail distributions, using ResNet34 as a general network framework, an imbalance factor is set

And Focal localMixup, CE-DRW, CE-DRS, LDAM-DRW, BBN, etc. It can be seen that when the data is extremely unbalanced, the accuracy of the classification result of the invention is much higher than that of other methods. When in use

The performance of relaxation loss is somewhat inadequate for reasons similar to the interpretation of noise learning in the previous section.

TABLE 1

Different methods classify the results in accuracy (%) on the noisy data set CIFAR-10/100, with the highest accuracy being marked in bold and the second accuracy being marked in oblique bold. Wherein the Slack Loss method uses the Slack Loss (Slack Loss) in the present invention as a Loss function, but does not use a data enhancement strategy; the Slack Loss + method uses the relaxation Loss and data enhancement strategy of the present invention, i.e., the complete method given by the present invention.

TABLE 2

The classification result accuracy (%) of the different methods on the long tail data set CIFAR-10/100, the highest accuracy being marked in bold, the second accuracy being marked in oblique bold. Wherein the Slack Loss method uses the Slack Loss (Slack Loss) in the present invention as a Loss function.

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A long-tail distribution image classification method with a noise label is characterized by comprising the following steps:

s1, according to the data noise characteristics, the sample image and the noise label thereof

At sample intervals

On the basis of (2), introducing a relaxation variable

Forming sample relaxation intervals of noisy samples

；

The sample interval is

Class interval of

Wherein

Is shown as

A sample

Is marked with a label

Is a category

，

Indicates all belong to the category

A set of sequence numbers of samples of (1);

the sample relaxation intervals were:

wherein,

indicating the sample image and its correct label,

is composed of

A set of tags for each of the categories,

the representation of the real number field is performed,

is shown and

different noise labels

，

representing an optimal interval;

piecewise computing sample-dependent relaxation loss from sample interval

；

S2, according to the data long tail characteristic, the data enhancement strategy adjusted by stages is used for the sample image

Respectively performing weak data enhancement and strong data enhancement to obtain corresponding weak enhancement data and strong enhancement data, dividing training into a preheating stage and a formal stage, and directly calculating relaxation losses of the weak enhancement data and the strong enhancement data in the preheating stage; in the formal training stage, a group of sample images are screened and relaxed to be used as pure data according to the relaxation loss in the preheating stage, residual noise data are screened out, and the relaxation loss is calculated.

2. The method for classifying long-tail distribution images with noise labels as claimed in claim 1, wherein the relaxation loss in S1 is:

。

3. the method for classifying long tail distribution images with noise labels as claimed in claim 1, wherein the pre-heating stage in S2 directly uses weak enhancement data

And strong enhancement data

Calculating the relaxation loss as the noise rate

And

as weights, the overall loss is calculated:

wherein,

。

4. the method for classifying long-tail distribution images with noise labels as claimed in claim 1, wherein the formal training phase in S2 includes the following steps:

、

As weak enhancement data

And strong enhancement data

Front of minimum medium slack loss

A partial sample image;

s22, according to the screened weak enhancement data

From strong enhancement data

Obtained by intermediate sampling

According to the screened strong enhancement data

From weak enhancement of data

Obtained by intermediate sampling

Screening out the remaining noise data;

s23, obtaining

、

As correct sample image, at the noise rate

And

as weights, the overall loss is calculated:

wherein,

。

5. the method for classifying long tail distribution images with noise labels as claimed in claim 4, wherein in S21, the method comprises

、

The following screens were used:

。

6. the method for classifying long tail distribution images with noise labels as claimed in claim 1, wherein in S1, an optimal interval is set

、

For training data points

Sample interval

Greater than the optimum interval

Pushing it to the class boundary, making the data boundary more gradual; for the sample interval at

Data points within the interval

，

、

indicating for a category

And

is inversely proportional to the number of samples corresponding to the class

And

is/are as follows

To the power.

7. The method for classifying long tail distribution images with noise labels as claimed in claim 1, wherein the relaxation variables in S1

Will be uniformly distributed

Multiplication by

From which the slack variable is extracted

I.e. by

，

Representing the noise rate, i.e., the probability of sample label error.

8. The method of claim 1, wherein the total number of samples is

In each category of training data

The number of training samples is

Satisfy the following requirements

The ratio of the most sample number class to the least sample number class is used as the imbalance factor

I.e. by

。

9. The method for classifying long tail distribution images with noise labels as claimed in claim 1, wherein the sample image and its noise label in S1

By means of a transfer matrix

Represents the noise label:

wherein,

representing a sample image

The corresponding category of the content file,

is shown asnThe number of images of the sample is determined,

representing categories

Is classified into categoriesjThe probability of (a) of (b) being,

。

10. the method for classifying long tail distribution images with noise labels as claimed in claim 1, wherein the sample image in S1And its noise label