CN110363228A - Noise label correction method - Google Patents
Noise label correction method
- Publication number
- Publication number: CN110363228A
- Application number: CN201910562002.2A
- Authority
- CN
- China
- Prior art keywords
- sample
- label
- samples
- noise
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The present invention provides a noise label re-labeling method comprising the following steps. Step 1: use a base classifier to classify the observed samples and estimate the noise rate, thereby identifying noise-label data. Step 2: use the base classifier to re-label the noise-label samples, yielding a clean dataset in which the noisy labels have been corrected.
Description
Technical Field
The present invention relates to data mining technology, and in particular to a noise label correction method.
Background Art
Traditional supervised classification usually assumes that the labels of a dataset are complete, i.e., that every sample carries a correct, noise-free label. In the real world, however, the labeling process is random enough that sample labels are easily corrupted by noise, making them inaccurate. Label noise is usually tied to how a dataset is collected. For example, during annotation the information provided to annotators may be insufficient, leading them to misclassify samples; the classification task itself may be subjective; or the annotators may lack the expertise needed to guarantee correct labels. Popular data-annotation platforms are another source of noisy data: they rely on large pools of registered users to carry out crowdsourced labeling, e.g., Amazon Mechanical Turk, Datatang, and JD's micro-work platform. Because of the annotators' limited expertise and individual differences, the resulting labels do not fully match the ground truth, and different annotators may judge the same sample differently, producing inconsistent labels for identical samples.

Noise in a dataset can be divided into feature noise and label noise according to where it occurs; label noise generally harms model performance more than feature noise does (Mirylenka K, Giannakopoulos G, Do L M, et al. On classifier behavior in the presence of mislabeling noise. Data Mining and Knowledge Discovery, 2017). In binary classification, the PU (positive-unlabeled) learning problem was proposed based on the noise distributions of the positive and negative sets (Khetan A, Lipton Z C, Anandkumar A. Learning from Noisy Singly-labeled Data, 2017). PU learning is a binary classification task in which only some of the positive training samples are labeled and all other samples are unlabeled. A PU problem can be handled by treating every unlabeled sample as a negative sample, which turns it into a noisy binary classification problem. Noisy labels not only seriously degrade the classification accuracy of a model but also increase its complexity. Designing classification algorithms that tolerate noisy label data therefore has significant research and application value.
For classification with noisy labels, Frénay and Verleysen surveyed a range of strategies, including noise-cleaning algorithms, label-noise-robust methods, and label-noise modeling methods (Frénay B, Verleysen M. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 2014). Label-noise-robust methods rely on a model's intrinsic tolerance to noise; since models differ in their sensitivity to label noise, a classifier that is insensitive to it must be chosen. For example, in empirical risk minimization for binary classification, a loss function measures the cost of misclassification and the classifier is learned by minimizing the empirical loss over the samples; the 0-1 loss is a common choice. Under uniform label noise, the 0-1 loss and the least-squares loss are robust to noisy labels, whereas other losses, such as (1) the exponential loss, (2) the logarithmic loss, and (3) the hinge loss, are not robust even under a uniform noise distribution. Most learning algorithms in machine learning are not fully robust to noisy labels and only work well when the training data is disturbed by a small amount of label noise.

With the development of deep learning, neural networks are often used to handle noisy labels in image classification. For example, Mnih and Hinton proposed incorporating a noise model into the neural network, but their approach only considers binary classification and assumes symmetric label noise (Mnih V, Hinton G. Learning to Label Aerial Images from Noisy Data. International Conference on Machine Learning, 2013).
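The robustness claim above for the 0-1 loss under uniform label noise can be checked with a small numeric example (illustrative only, not part of the patent): a loss is tolerant to uniform label flips when l(t, +1) + l(t, -1) is constant in the score t, because flipping labels then only rescales and shifts the expected risk without moving its minimizer. This holds for the 0-1 loss but not for the hinge or log loss.

```python
# Illustrative check: l(t,+1) + l(t,-1) is constant in t for the 0-1 loss
# (noise-tolerant under uniform flips) but score-dependent for hinge/log loss.
import math

def zero_one(t, y):      # 0-1 loss on signed score t, label y in {-1, +1}
    return 0.0 if t * y > 0 else 1.0

def hinge(t, y):
    return max(0.0, 1.0 - t * y)

def log_loss(t, y):
    return math.log(1.0 + math.exp(-t * y))

scores = [-2.0, -0.5, 0.3, 1.7]
zo_sums = [zero_one(t, 1) + zero_one(t, -1) for t in scores]
hg_sums = [hinge(t, 1) + hinge(t, -1) for t in scores]
print(zo_sums)  # constant across scores: symmetric, hence noise-tolerant
print(hg_sums)  # varies with the score: not noise-tolerant
```

The same check applied to `log_loss` also produces score-dependent sums, matching the list of non-robust losses above.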
Solving the noisy-label learning problem with a noise-cleaning strategy typically takes two steps: (1) estimate the noise rates, and (2) use the noise rates together with the predictions. To estimate the noise rates, Scott et al. established a lower-bound method for estimating the inverted noise rates (Blanchard G, Flaska M, Handy G, et al. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising. Journal of Machine Learning Research, 2013); however, the unbounded function this method produces may fail to converge. With additional assumptions, Scott (2015) proposed a time-efficient noise-rate estimator, but its estimation performance is poor (Scott C. A Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy Labels, 2015). Liu and Tao modified the loss function by importance reweighting, but the weights are derived from predicted probabilities and may therefore be sensitive to inaccurate estimates (Liu T, Tao D. Classification with Noisy Labels by Importance Reweighting. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014). Natarajan et al. (2013) did not propose a noise estimator but treated the noise rates as parameters optimized by cross-validation (Natarajan N, Dhillon I S, Ravikumar P K, et al. Learning with Noisy Labels. Advances in Neural Information Processing Systems, Curran Associates Inc., 2013). They proposed two ways to modify the loss function: the first constructs an unbiased estimator of the loss under the clean distribution from the noisy distribution, but the estimator can be non-convex even when the original loss is convex; the second builds a label-dependent loss function such that, for the 0-1 loss, its minimum risk equals the risk under the clean distribution. Northcutt et al. proposed learning with confident examples: threshold values are computed from the base classifier's predicted probabilities on the noisy data, and samples identified as noise-labeled are removed according to the ranking of the base classifier's predictions, a process called rank pruning (Northcutt C G, Wu T, Chuang I L. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels, 2017).
SUMMARY OF THE INVENTION
The object of the present invention is to provide a noise label correction method.
The technical scheme for achieving the object of the present invention is a noise label correction method comprising the following steps:
Step 1: use a base classifier to predict each sample, obtaining predicted probabilities; take the expected value of the predicted probabilities over the observed positive sample set and over the observed negative sample set as a lower-bound threshold and an upper-bound threshold, respectively; use these two thresholds to judge the true label of each observed sample and thereby identify noise-label data.
Step 2: use the base classifier to re-label the noise-label samples, obtaining a clean dataset in which the noisy labels have been corrected, where:
In step 2, for a binary classification result, after the noise-label samples are identified, the samples are sorted in ascending order of the base classifier's predicted probability. In the observed positive sample set, the first samples in this order, as many as the estimated number of noise-label samples in that set, are re-labeled as 0; in the observed negative sample set, the last samples, again as many as its estimated noise count, are re-labeled as 1.
In step 2, for a multi-class classification result, the classification result matrix predicted by the base classifier over all sample data is used: each noise sample is re-labeled with the label of highest predicted probability other than its current label.
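The two binary-classification steps above can be sketched in plain Python. This is a minimal illustration under the notation used in this description; `noise_label_correct` and `DummyClf` are illustrative names, not from the patent, and the base classifier may be any object exposing `fit(X, s)` and `predict_proba(X)` returning P(s=1|x) per sample.

```python
# Sketch of steps 1-2 (binary case): expectation thresholds -> noise-rate
# estimates -> Bayes-inverted rates -> rank-based relabeling.
def noise_label_correct(X, s, clf):
    clf.fit(X, s)
    g = clf.predict_proba(X)                        # g(x) = P(s=1|x), one float per sample
    pos = [i for i, si in enumerate(s) if si == 1]  # observed positive set
    neg = [i for i, si in enumerate(s) if si == 0]  # observed negative set
    lb = sum(g[i] for i in pos) / len(pos)          # LB_{y=1}: mean g over observed positives
    ub = sum(g[i] for i in neg) / len(neg)          # UB_{y=0}: mean g over observed negatives
    n11 = sum(g[i] >= lb for i in pos)              # observed 1, judged true 1
    n01 = sum(g[i] >= lb for i in neg)              # observed 0, judged true 1
    n10 = sum(g[i] <= ub for i in pos)              # observed 1, judged true 0
    n00 = sum(g[i] <= ub for i in neg)              # observed 0, judged true 0
    rho1 = n01 / (n01 + n11)                        # estimate of P(s=0|y=1)
    rho0 = n10 / (n10 + n00)                        # estimate of P(s=1|y=0)
    p_s1 = len(pos) / len(s)                        # P(s=1) in the observed data
    denom = max(1.0 - rho1 - rho0, 1e-12)
    p_y1 = min(max((p_s1 - rho0) / denom, 0.0), 1.0)  # from p_s1 = (1-rho1)p_y1 + rho0(1-p_y1)
    pi1 = rho0 * (1.0 - p_y1) / max(p_s1, 1e-12)    # Bayes: P(y=0|s=1)
    pi0 = rho1 * p_y1 / max(1.0 - p_s1, 1e-12)      # Bayes: P(y=1|s=0)
    k_pos = round(pi1 * len(pos))                   # estimated noise count among positives
    k_neg = round(pi0 * len(neg))                   # estimated noise count among negatives
    s_clean = list(s)
    for i in sorted(pos, key=lambda i: g[i])[:k_pos]:   # lowest-ranked positives -> 0
        s_clean[i] = 0
    if k_neg:
        for i in sorted(neg, key=lambda i: g[i])[-k_neg:]:  # highest-ranked negatives -> 1
            s_clean[i] = 1
    return s_clean

class DummyClf:
    """Stand-in base classifier returning fixed probabilities (for the demo)."""
    def __init__(self, probs): self.probs = probs
    def fit(self, X, s): pass
    def predict_proba(self, X): return self.probs

# Toy demo: samples 3 and 4 carry flipped labels relative to their g values.
s_obs = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.85, 0.8, 0.2, 0.75, 0.1, 0.15, 0.05]
s_clean = noise_label_correct(list(range(8)), s_obs, DummyClf(probs))
print(s_clean)
```

On this toy data the method flips exactly the two implausible labels, returning `[1, 1, 1, 0, 1, 0, 0, 0]`.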
Further, step 1 specifically comprises:
Step 1.1: the base classifier predicts each sample, giving the predicted probability g(x) = P(s=1|x). Let the noise rate ρ1 = P(s=0|y=1) denote the probability that a sample whose true label is 1 is mislabeled as 0, and let

N_{11} denote the number of samples with observed label 1 and true label 1,
N_{01} denote the number of samples with observed label 0 and true label 1,
N_{10} denote the number of samples with observed label 1 and true label 0,
N_{00} denote the number of samples with observed label 0 and true label 0.
Step 1.2: use the classification results of the base classifier to judge each sample's true label. The lower-bound threshold LB_{y=1} decides whether the true label is 1: when the prediction g(x) of an observed sample exceeds this lower-bound threshold, take the sample's true label to be 1; when the prediction is below the upper-bound threshold UB_{y=0}, take the sample's true label to be 0.
Step 1.3: compute the four counts N_{sy} of samples with observed label s and judged true label y. With P~ the observed positive sample set and N~ the observed negative sample set, the lower and upper thresholds are set to the expected value of the classification probability g(x) over the positive and negative observed samples, respectively:

LB_{y=1} = E_{x in P~}[g(x)],  UB_{y=0} = E_{x in N~}[g(x)]

so that

N_{11} = |{x in P~ : g(x) >= LB_{y=1}}|,  N_{01} = |{x in N~ : g(x) >= LB_{y=1}}|,
N_{10} = |{x in P~ : g(x) <= UB_{y=0}}|,  N_{00} = |{x in N~ : g(x) <= UB_{y=0}}|.
Step 1.4: compute the noise-rate estimates from the counts of samples with observed label s and judged true label y:

ρ1^ = N_{01} / (N_{01} + N_{11}),  ρ0^ = N_{10} / (N_{10} + N_{00}).
Step 1.5: by Bayes' theorem, derive the inverted noise rates π1^ = P(y=0|s=1) and π0^ = P(y=1|s=0) from the noise-rate estimates:

π1^ = ρ0^ P(y=0) / P(s=1),  π0^ = ρ1^ P(y=1) / P(s=0).
Step 1.6: the product of the inverted noise rate π1^ and the size of the observed positive set gives the number of samples in the observed positive set whose true label is 0, and likewise π0^ times the size of the observed negative set gives the number of samples in the observed negative set whose true label is 1. Sort the samples in ascending order of the base classifier's prediction g(x): in the observed positive set, the first samples in this order, up to that estimated count, are regarded as the noise-label samples of the positive set; in the observed negative set, the last samples, up to its estimated count, are regarded as the noise-label samples of the negative set.
Further, in step 2, for the binary classification case, the clean dataset with corrected noise labels is obtained as follows:
After the noise-label samples are identified, sort the samples in ascending order of the base classifier's predicted probability g(x) = P(s=1|x). In the observed positive sample set, re-label the first samples, those with the smallest g(x) values, as 0, up to the estimated noise count of that set; in the observed negative sample set, re-label the last samples, those with the largest g(x) values, as 1, up to its estimated noise count. The sets obtained this way are the re-labeled positive sample set and the re-labeled negative sample set.
Further, in step 2, for the multi-class classification case, the labels of the noise samples are re-labeled to obtain the clean dataset, as follows:
When the base classifier predicts all sample data, it must record the probability that each sample belongs to each class, giving the classification result matrix psx = {p_ij | i in [N], j in [K]}, an N×K probability matrix, where N is the number of samples and K the number of label classes. The i-th row p_i = (p_i1, p_i2, ..., p_iK) gives the probabilities that sample x_i belongs to each label class under the base classifier f(x), and the entry p_ij is the probability that sample x_i belongs to class k_j.
When a sample x_i is judged to carry a noise label, the probability matrix psx is used to re-label it with the label of highest predicted probability other than its current label:

y_i^relabel = k_max, where k_max = arg max_{k != s_i} p_ik

and k_max is the label class with the largest probability in the base classifier's output for x_i, excluding the sample's original noisy label s_i.
Compared with the prior art, the present invention has the following advantages: (1) it provides a general solution for learning with noisy labels that applies to any form of classifier; (2) it identifies noise samples at a high rate and makes full use of all sample information, improving the classifier's robustness in noisy environments; (3) the algorithm applies to both binary and multi-class classification problems.
The present invention is further described below with reference to the accompanying drawings.
Description of the Drawings
Figure 1 is a schematic flowchart of the method of the present invention.
Figure 2 is a schematic diagram of the process of identifying noise-label data with the base classifier.
Figure 3 is a schematic diagram of the noise-label sample re-labeling process.
Detailed Description
With reference to Figure 1, the method classifies the observed samples with a base classifier, estimates the noise rate, and identifies the noise-label data, as follows:
Step 1: with reference to Figure 2, use the base classifier to classify the observed samples, estimate the noise rate, and identify the noise-label data, as follows:
Step 1.1: the base classifier clf is fitted to the samples by clf.fit(X, s), giving the predicted probability g(x) = P(s=1|x). Any existing classification algorithm can serve as the base classifier, as long as it outputs predicted probabilities for the samples.

The noise rate ρ1 = P(s=0|y=1) is the probability that a sample whose true label is 1 is mislabeled as 0, i.e., the proportion of samples with true label 1 whose observed label is 0. The sample counts in the four cases are denoted as follows: N_{11}, the number of samples with observed label 1 and true label 1; N_{01}, with observed label 0 and true label 1; N_{10}, with observed label 1 and true label 0; and N_{00}, with observed label 0 and true label 0.
Step 1.2: because the true distribution of the samples is unknown, the base classifier's classification results are used to judge the true labels. The lower-bound threshold LB_{y=1} decides whether a sample's true label is 1: when the prediction of an observed sample on the base classifier g(x) exceeds LB_{y=1}, the sample's true label can be assumed to be 1. Likewise, the upper-bound threshold UB_{y=0} decides whether an observed sample's true label is 0.

Step 1.3: compute the counts. With P~ the observed positive sample set and N~ the observed negative sample set, the thresholds are set to the expected value of the classification probability g(x) = P(s=1|x) over the positive and negative observed samples:

LB_{y=1} = E_{x in P~}[g(x)],  UB_{y=0} = E_{x in N~}[g(x)]

and the counts are

N_{11} = |{x in P~ : g(x) >= LB_{y=1}}|,  N_{01} = |{x in N~ : g(x) >= LB_{y=1}}|,
N_{10} = |{x in P~ : g(x) <= UB_{y=0}}|,  N_{00} = |{x in N~ : g(x) <= UB_{y=0}}|.
Step 1.4: compute the noise-rate estimates from the counts of samples with observed label s and judged true label y:

ρ1^ = N_{01} / (N_{01} + N_{11}),  ρ0^ = N_{10} / (N_{10} + N_{00}).

Step 1.5: by Bayes' theorem, derive the inverted noise rates from the noise-rate estimates:

π1^ = P(y=0|s=1) = ρ0^ P(y=0) / p_s1,  π0^ = P(y=1|s=0) = ρ1^ P(y=1) / (1 - p_s1),

where p_s1 = P(s=1) is the fraction of positive samples in the observed sample set. Since the inverted noise rates give the probabilities that observed positive or negative samples actually carry true label 0 or 1, π1^ times the size of the observed positive set is the number of samples in that set whose true label is 0, i.e., its noise count, and π0^ times the size of the observed negative set is the number of samples in that set whose true label is 1, i.e., its noise count. Finally, the samples are sorted in ascending order of the base classifier's prediction g(x): in the observed positive set, the first samples in this order, up to the positive noise count, are regarded as the noise-label samples of the positive set; in the observed negative set, the last samples, up to the negative noise count, are regarded as the noise-label samples of the negative set.
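The Bayes inversion in step 1.5 can be checked numerically. This is a sketch with assumed values, not from the patent; the symbols follow the definitions above (rho1 = P(s=0|y=1), rho0 = P(s=1|y=0), pi1 = P(y=0|s=1), pi0 = P(y=1|s=0), p_s1 = P(s=1)).

```python
# Worked numeric check of the Bayes inversion from noise rates to inverted
# noise rates, using assumed rates and an assumed true class prior.
rho1, rho0 = 0.2, 0.1   # assumed noise rates
p_y1 = 0.5              # assumed true class prior P(y=1)

# Forward model for the observed positive fraction:
# p_s1 = (1 - rho1) * p_y1 + rho0 * (1 - p_y1)
p_s1 = (1 - rho1) * p_y1 + rho0 * (1 - p_y1)   # 0.8*0.5 + 0.1*0.5 = 0.45

# Bayes' theorem: pi1 = P(y=0|s=1) = P(s=1|y=0) P(y=0) / P(s=1), etc.
pi1 = rho0 * (1 - p_y1) / p_s1                 # 0.05 / 0.45
pi0 = rho1 * p_y1 / (1 - p_s1)                 # 0.10 / 0.55

# In practice P(y=1) is unknown; inverting the forward model recovers it
# from p_s1 and the noise-rate estimates:
p_y1_rec = (p_s1 - rho0) / (1 - rho1 - rho0)
print(p_s1, pi1, pi0, p_y1_rec)
```

The recovered prior `p_y1_rec` matches the assumed 0.5, confirming that the forward model and the Bayes inversion are consistent with each other.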
Step 2: with reference to Figure 3, use the base classifier's classification results to re-label the noise-label samples, obtaining a clean dataset in which the noisy labels have been corrected, as follows:
Step 2.1 (binary classification): after the noise-label samples are identified, sort the samples in ascending order of the base classifier's predicted probability g(x) = P(s=1|x). In the observed positive sample set, re-label the first samples, those with the smallest g(x) values, as 0, up to the estimated noise count of the positive set; in the observed negative sample set, re-label the last samples, those with the largest g(x) values, as 1, up to the estimated noise count of the negative set. The sets obtained this way are the re-labeled positive sample set and the re-labeled negative sample set.
Step 2.2 (multi-class classification): when there are more than two label classes, re-labeling a noise sample must consider which class the sample most likely belongs to and assign that label, chosen according to the base classifier's classification results over all samples. The base classifier therefore records, for every sample, the probability of belonging to each class, producing the classification result matrix psx = {p_ij | i in [N], j in [K]}, an N×K probability matrix (N samples, K label classes). Row i, p_i = (p_i1, p_i2, ..., p_iK), gives the probabilities that sample x_i belongs to each class under the base classifier f(x), with p_ij the probability that x_i belongs to class k_j. When a sample x_i is judged to carry a noise label, psx is used to re-label it with the class of highest predicted probability other than its current label:

y_i^relabel = k_max, where k_max = arg max_{k != s_i} p_ik

and k_max is the class with the largest probability in the base classifier's output for x_i, excluding the sample's original noisy label s_i. The data obtained after this re-labeling is the correct dataset with the noise labels fixed.
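The multi-class re-labeling rule above amounts to an argmax over classes that excludes the sample's current noisy label. A minimal sketch follows; the `relabel` helper and the probability rows are illustrative, not from the patent.

```python
# Multi-class relabeling: pick the most probable class, excluding the
# sample's current (noisy) label.
def relabel(psx_row, current_label):
    # argmax over class indices k != current_label
    return max((k for k in range(len(psx_row)) if k != current_label),
               key=lambda k: psx_row[k])

# Illustrative rows of the psx matrix (one row per flagged noise sample).
psx = [
    [0.10, 0.70, 0.20],   # observed label 0 -> most likely other class is 1
    [0.55, 0.05, 0.40],   # observed label 1 -> most likely other class is 0
]
noisy_labels = [0, 1]
new_labels = [relabel(row, s) for row, s in zip(psx, noisy_labels)]
print(new_labels)  # [1, 0]
```

Excluding the current label matters: for the second row a plain argmax would return class 0 anyway, but for a sample whose noisy label happens to have the highest probability, a plain argmax would leave the label unchanged instead of correcting it.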
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910562002.2A CN110363228B (en) | 2019-06-26 | 2019-06-26 | Noise label correction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110363228A true CN110363228A (en) | 2019-10-22 |
CN110363228B CN110363228B (en) | 2022-09-06 |
Family
ID=68216503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910562002.2A Active CN110363228B (en) | 2019-06-26 | 2019-06-26 | Noise label correction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110363228B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426826A (en) * | 2015-11-09 | 2016-03-23 | 张静 | Crowd-sourced tagging data quality improvement method based on tag noise correction |
CN107292330A (en) * | 2017-05-02 | 2017-10-24 | 南京航空航天大学 | Iterative label noise identification algorithm based on supervised and semi-supervised learning information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110363228B (en) | Noise label correction method | |
Xu et al. | Generating representative samples for few-shot classification | |
CN111814584B (en) | Vehicle re-identification method in multi-view environment based on multi-center metric loss | |
CN110837850B (en) | An Unsupervised Domain Adaptation Method Based on Adversarial Learning Loss Function | |
CN109948561B (en) | Method and system for unsupervised image and video pedestrian re-identification based on transfer network | |
CN109101938B (en) | Multi-label age estimation method based on convolutional neural network | |
CN103390279B (en) | Target foreground co-segmentation method combining saliency detection and discriminative learning | |
JP2019521443A (en) | Cell annotation method and annotation system using adaptive additional learning | |
CN110334687A (en) | A Pedestrian Retrieval Enhancement Method Based on Pedestrian Detection, Attribute Learning and Pedestrian Recognition | |
CN110110792A (en) | Multi-label data stream classification method based on incremental learning | |
CN108596199A (en) | Unbalanced data classification method based on EasyEnsemble algorithms and SMOTE algorithms | |
CN108960073A (en) | Cross-modal image steganalysis method for biomedical literature | |
CN107368534B (en) | A method for predicting social network user attributes | |
CN108228684B (en) | Method and device for training clustering model, electronic equipment and computer storage medium | |
Luo et al. | Learning from the past: Continual meta-learning with Bayesian graph neural networks | |
CN111160959B (en) | User click conversion prediction method and device | |
Zhu et al. | Semi-supervised streaming learning with emerging new labels | |
CN112819065A (en) | Unsupervised pedestrian sample mining method and unsupervised pedestrian sample mining system based on multi-clustering information | |
CN107403188A (en) | A kind of quality evaluation method and device | |
CN111814713A (en) | An expression recognition method based on BN parameter transfer learning | |
CN107067022B (en) | Method, device and equipment for establishing image classification model | |
CN117061322A (en) | Internet of things flow pool management method and system | |
WO2020024444A1 (en) | Group performance grade recognition method and apparatus, and storage medium and computer device | |
CN113313179A (en) | Noise image classification method based on l2p norm robust least square method | |
KR20100116404A (en) | Method and apparatus of dividing separated cell and grouped cell from image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |