CN110363228A - Noise label correction method - Google Patents
Noise label correction method
- Publication number
- Publication number: CN110363228A
- Application number: CN201910562002.2A
- Authority
- CN
- China
- Prior art keywords
- sample
- label
- samples
- noise
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
Abstract
The present invention provides a noise label re-labeling method comprising the following steps. Step 1: use a base classifier to classify the observed samples and estimate the noise rate, thereby identifying noise-label data. Step 2: use the base classifier to re-label the noise-label samples, yielding a clean dataset in which the noisy labels have been corrected.
Description
Technical Field
The present invention relates to data mining technology, and in particular to a noise label correction method.
Background Art
Traditional supervised classification usually assumes that the labels of a dataset are complete, i.e., that every sample carries a correct, noise-free label. In the real world, however, the labeling process is random enough that sample labels are easily corrupted by noise, making them inaccurate. Label noise is usually tied to how a dataset is collected. For example, during annotation the information provided to annotators may be insufficient, leading them to misclassify samples; the classification task itself may be subjective; or the annotators may lack the expertise needed to guarantee correct labels. Popular data-annotation platforms are another source of noisy data: they rely on large pools of registered users to carry out crowdsourced labeling, e.g., Amazon Mechanical Turk, Datatang, and JD's micro-work platform. Because of the annotators' limited expertise and individual differences, the resulting labels do not fully match the ground truth, and different annotators may judge the same sample differently, producing inconsistent labels for identical samples.

Noise in a dataset can be divided into feature noise and label noise according to where it occurs; label noise generally harms model performance more than feature noise does (Mirylenka K, Giannakopoulos G, Do L M, et al. On classifier behavior in the presence of mislabeling noise. Data Mining and Knowledge Discovery, 2017). In binary classification, the PU (positive-unlabeled) learning problem was proposed based on the noise distributions of the positive and negative sets (Khetan A, Lipton Z C, Anandkumar A. Learning from Noisy Singly-labeled Data, 2017). PU learning is a binary classification task in which only some of the positive training samples are labeled and all other samples are unlabeled. A PU problem can be handled by treating every unlabeled sample as a negative sample, which turns it into a noisy binary classification problem. Noisy labels not only seriously degrade the classification accuracy of a model but also increase its complexity. Designing classification algorithms that tolerate noisy label data therefore has significant research and application value.
For classification with noisy labels, Frénay and Verleysen surveyed a range of strategies, including noise-cleaning algorithms, label-noise-robust methods, and label-noise modeling methods (Frénay B, Verleysen M. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 2014). Label-noise-robust methods rely on a model's intrinsic tolerance to noise; since models differ in their sensitivity to label noise, a classifier that is insensitive to it must be chosen. For example, in empirical risk minimization for binary classification, a loss function measures the cost of misclassification and the classifier is learned by minimizing the empirical loss over the samples; the 0-1 loss is a common choice. Under uniform label noise, the 0-1 loss and the least-squares loss are robust to noisy labels, whereas other losses, such as (1) the exponential loss, (2) the logarithmic loss, and (3) the hinge loss, are not robust even under a uniform noise distribution. Most learning algorithms in machine learning are not fully robust to noisy labels and only work well when the training data is disturbed by a small amount of label noise.

With the development of deep learning, neural networks are often used to handle noisy labels in image classification. For example, Mnih and Hinton proposed incorporating a noise model into the neural network, but their approach only considers binary classification and assumes symmetric label noise (Mnih V, Hinton G. Learning to Label Aerial Images from Noisy Data. International Conference on Machine Learning, 2013).
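The robustness claim above for the 0-1 loss under uniform label noise can be checked with a small numeric example (illustrative only, not part of the patent): a loss is tolerant to uniform label flips when l(t, +1) + l(t, -1) is constant in the score t, because flipping labels then only rescales and shifts the expected risk without moving its minimizer. This holds for the 0-1 loss but not for the hinge or log loss.

```python
# Illustrative check: l(t,+1) + l(t,-1) is constant in t for the 0-1 loss
# (noise-tolerant under uniform flips) but score-dependent for hinge/log loss.
import math

def zero_one(t, y):      # 0-1 loss on signed score t, label y in {-1, +1}
    return 0.0 if t * y > 0 else 1.0

def hinge(t, y):
    return max(0.0, 1.0 - t * y)

def log_loss(t, y):
    return math.log(1.0 + math.exp(-t * y))

scores = [-2.0, -0.5, 0.3, 1.7]
zo_sums = [zero_one(t, 1) + zero_one(t, -1) for t in scores]
hg_sums = [hinge(t, 1) + hinge(t, -1) for t in scores]
print(zo_sums)  # constant across scores: symmetric, hence noise-tolerant
print(hg_sums)  # varies with the score: not noise-tolerant
```

The same check applied to `log_loss` also produces score-dependent sums, matching the list of non-robust losses above.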
Solving the noisy-label learning problem with a noise-cleaning strategy typically takes two steps: (1) estimate the noise rates, and (2) use the noise rates together with the predictions. To estimate the noise rates, Scott et al. established a lower-bound method for estimating the inverted noise rates (Blanchard G, Flaska M, Handy G, et al. Classification with Asymmetric Label Noise: Consistency and Maximal Denoising. Journal of Machine Learning Research, 2013); however, the unbounded function this method produces may fail to converge. With additional assumptions, Scott (2015) proposed a time-efficient noise-rate estimator, but its estimation performance is poor (Scott C. A Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy Labels, 2015). Liu and Tao modified the loss function by importance reweighting, but the weights are derived from predicted probabilities and may therefore be sensitive to inaccurate estimates (Liu T, Tao D. Classification with Noisy Labels by Importance Reweighting. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014). Natarajan et al. (2013) did not propose a noise estimator but treated the noise rates as parameters optimized by cross-validation (Natarajan N, Dhillon I S, Ravikumar P K, et al. Learning with Noisy Labels. Advances in Neural Information Processing Systems, Curran Associates Inc., 2013). They proposed two ways to modify the loss function: the first constructs an unbiased estimator of the loss under the clean distribution from the noisy distribution, but the estimator can be non-convex even when the original loss is convex; the second builds a label-dependent loss function such that, for the 0-1 loss, its minimum risk equals the risk under the clean distribution. Northcutt et al. proposed learning with confident examples: threshold values are computed from the base classifier's predicted probabilities on the noisy data, and samples identified as noise-labeled are removed according to the ranking of the base classifier's predictions, a process called rank pruning (Northcutt C G, Wu T, Chuang I L. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels, 2017).
SUMMARY OF THE INVENTION
The object of the present invention is to provide a noise label correction method.
The technical scheme for achieving the object of the present invention is a noise label correction method comprising the following steps:
Step 1: use a base classifier to predict each sample, obtaining predicted probabilities; take the expected value of the predicted probabilities over the observed positive sample set and over the observed negative sample set as a lower-bound threshold and an upper-bound threshold, respectively; use these two thresholds to judge the true label of each observed sample and thereby identify noise-label data.
Step 2: use the base classifier to re-label the noise-label samples, obtaining a clean dataset in which the noisy labels have been corrected, where:
In step 2, for a binary classification result, after the noise-label samples are identified, the samples are sorted in ascending order of the base classifier's predicted probability. In the observed positive sample set, the first samples in this order, as many as the estimated number of noise-label samples in that set, are re-labeled as 0; in the observed negative sample set, the last samples, again as many as its estimated noise count, are re-labeled as 1.
In step 2, for a multi-class classification result, the classification result matrix predicted by the base classifier over all sample data is used: each noise sample is re-labeled with the label of highest predicted probability other than its current label.
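The two binary-classification steps above can be sketched in plain Python. This is a minimal illustration under the notation used in this description; `noise_label_correct` and `DummyClf` are illustrative names, not from the patent, and the base classifier may be any object exposing `fit(X, s)` and `predict_proba(X)` returning P(s=1|x) per sample.

```python
# Sketch of steps 1-2 (binary case): expectation thresholds -> noise-rate
# estimates -> Bayes-inverted rates -> rank-based relabeling.
def noise_label_correct(X, s, clf):
    clf.fit(X, s)
    g = clf.predict_proba(X)                        # g(x) = P(s=1|x), one float per sample
    pos = [i for i, si in enumerate(s) if si == 1]  # observed positive set
    neg = [i for i, si in enumerate(s) if si == 0]  # observed negative set
    lb = sum(g[i] for i in pos) / len(pos)          # LB_{y=1}: mean g over observed positives
    ub = sum(g[i] for i in neg) / len(neg)          # UB_{y=0}: mean g over observed negatives
    n11 = sum(g[i] >= lb for i in pos)              # observed 1, judged true 1
    n01 = sum(g[i] >= lb for i in neg)              # observed 0, judged true 1
    n10 = sum(g[i] <= ub for i in pos)              # observed 1, judged true 0
    n00 = sum(g[i] <= ub for i in neg)              # observed 0, judged true 0
    rho1 = n01 / (n01 + n11)                        # estimate of P(s=0|y=1)
    rho0 = n10 / (n10 + n00)                        # estimate of P(s=1|y=0)
    p_s1 = len(pos) / len(s)                        # P(s=1) in the observed data
    denom = max(1.0 - rho1 - rho0, 1e-12)
    p_y1 = min(max((p_s1 - rho0) / denom, 0.0), 1.0)  # from p_s1 = (1-rho1)p_y1 + rho0(1-p_y1)
    pi1 = rho0 * (1.0 - p_y1) / max(p_s1, 1e-12)    # Bayes: P(y=0|s=1)
    pi0 = rho1 * p_y1 / max(1.0 - p_s1, 1e-12)      # Bayes: P(y=1|s=0)
    k_pos = round(pi1 * len(pos))                   # estimated noise count among positives
    k_neg = round(pi0 * len(neg))                   # estimated noise count among negatives
    s_clean = list(s)
    for i in sorted(pos, key=lambda i: g[i])[:k_pos]:   # lowest-ranked positives -> 0
        s_clean[i] = 0
    if k_neg:
        for i in sorted(neg, key=lambda i: g[i])[-k_neg:]:  # highest-ranked negatives -> 1
            s_clean[i] = 1
    return s_clean

class DummyClf:
    """Stand-in base classifier returning fixed probabilities (for the demo)."""
    def __init__(self, probs): self.probs = probs
    def fit(self, X, s): pass
    def predict_proba(self, X): return self.probs

# Toy demo: samples 3 and 4 carry flipped labels relative to their g values.
s_obs = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.9, 0.85, 0.8, 0.2, 0.75, 0.1, 0.15, 0.05]
s_clean = noise_label_correct(list(range(8)), s_obs, DummyClf(probs))
print(s_clean)
```

On this toy data the method flips exactly the two implausible labels, returning `[1, 1, 1, 0, 1, 0, 0, 0]`.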
Further, step 1 specifically comprises:
Step 1.1: the base classifier predicts each sample, giving the predicted probability g(x) = P(s=1|x). Let the noise rate ρ1 = P(s=0|y=1) denote the probability that a sample whose true label is 1 is mislabeled as 0, and let

N_{11} denote the number of samples with observed label 1 and true label 1,
N_{01} denote the number of samples with observed label 0 and true label 1,
N_{10} denote the number of samples with observed label 1 and true label 0,
N_{00} denote the number of samples with observed label 0 and true label 0.
Step 1.2: use the classification results of the base classifier to judge each sample's true label. The lower-bound threshold LB_{y=1} decides whether the true label is 1: when the prediction g(x) of an observed sample exceeds this lower-bound threshold, take the sample's true label to be 1; when the prediction is below the upper-bound threshold UB_{y=0}, take the sample's true label to be 0.
Step 1.3: compute the four counts N_{sy} of samples with observed label s and judged true label y. With P~ the observed positive sample set and N~ the observed negative sample set, the lower and upper thresholds are set to the expected value of the classification probability g(x) over the positive and negative observed samples, respectively:

LB_{y=1} = E_{x in P~}[g(x)],  UB_{y=0} = E_{x in N~}[g(x)]

so that

N_{11} = |{x in P~ : g(x) >= LB_{y=1}}|,  N_{01} = |{x in N~ : g(x) >= LB_{y=1}}|,
N_{10} = |{x in P~ : g(x) <= UB_{y=0}}|,  N_{00} = |{x in N~ : g(x) <= UB_{y=0}}|.
Step 1.4: compute the noise-rate estimates from the counts of samples with observed label s and judged true label y:

ρ1^ = N_{01} / (N_{01} + N_{11}),  ρ0^ = N_{10} / (N_{10} + N_{00}).
Step 1.5: by Bayes' theorem, derive the inverted noise rates π1^ = P(y=0|s=1) and π0^ = P(y=1|s=0) from the noise-rate estimates:

π1^ = ρ0^ P(y=0) / P(s=1),  π0^ = ρ1^ P(y=1) / P(s=0).
Step 1.6: the product of the inverted noise rate π1^ and the size of the observed positive set gives the number of samples in the observed positive set whose true label is 0, and likewise π0^ times the size of the observed negative set gives the number of samples in the observed negative set whose true label is 1. Sort the samples in ascending order of the base classifier's prediction g(x): in the observed positive set, the first samples in this order, up to that estimated count, are regarded as the noise-label samples of the positive set; in the observed negative set, the last samples, up to its estimated count, are regarded as the noise-label samples of the negative set.
Further, in step 2, for the binary classification case, the clean dataset with corrected noise labels is obtained as follows:
After the noise-label samples are identified, sort the samples in ascending order of the base classifier's predicted probability g(x) = P(s=1|x). In the observed positive sample set, re-label the first samples, those with the smallest g(x) values, as 0, up to the estimated noise count of that set; in the observed negative sample set, re-label the last samples, those with the largest g(x) values, as 1, up to its estimated noise count. The sets obtained this way are the re-labeled positive sample set and the re-labeled negative sample set.
Further, in step 2, for the multi-class classification case, the labels of the noise samples are re-labeled to obtain the clean dataset, as follows:
When the base classifier predicts all sample data, it must record the probability that each sample belongs to each class, giving the classification result matrix psx = {p_ij | i in [N], j in [K]}, an N×K probability matrix, where N is the number of samples and K the number of label classes. The i-th row p_i = (p_i1, p_i2, ..., p_iK) gives the probabilities that sample x_i belongs to each label class under the base classifier f(x), and the entry p_ij is the probability that sample x_i belongs to class k_j.
When a sample x_i is judged to carry a noise label, the probability matrix psx is used to re-label it with the label of highest predicted probability other than its current label:

y_i^relabel = k_max, where k_max = arg max_{k != s_i} p_ik

and k_max is the label class with the largest probability in the base classifier's output for x_i, excluding the sample's original noisy label s_i.
Compared with the prior art, the present invention has the following advantages: (1) it provides a general solution for learning with noisy labels that applies to any form of classifier; (2) it identifies noise samples at a high rate and makes full use of all sample information, improving the classifier's robustness in noisy environments; (3) the algorithm applies to both binary and multi-class classification problems.
The present invention is further described below with reference to the accompanying drawings.
Description of the Drawings
Figure 1 is a schematic flowchart of the method of the present invention.
Figure 2 is a schematic diagram of the process of identifying noise-label data with the base classifier.
Figure 3 is a schematic diagram of the noise-label sample re-labeling process.
Detailed Description
With reference to Figure 1, the method classifies the observed samples with a base classifier, estimates the noise rate, and identifies the noise-label data, as follows:
Step 1: with reference to Figure 2, use the base classifier to classify the observed samples, estimate the noise rate, and identify the noise-label data, as follows:
Step 1.1: the base classifier clf is fitted to the samples by clf.fit(X, s), giving the predicted probability g(x) = P(s=1|x). Any existing classification algorithm can serve as the base classifier, as long as it outputs predicted probabilities for the samples.

The noise rate ρ1 = P(s=0|y=1) is the probability that a sample whose true label is 1 is mislabeled as 0, i.e., the proportion of samples with true label 1 whose observed label is 0. The sample counts in the four cases are denoted as follows: N_{11}, the number of samples with observed label 1 and true label 1; N_{01}, with observed label 0 and true label 1; N_{10}, with observed label 1 and true label 0; and N_{00}, with observed label 0 and true label 0.
Step 1.2: because the true distribution of the samples is unknown, the base classifier's classification results are used to judge the true labels. The lower-bound threshold LB_{y=1} decides whether a sample's true label is 1: when the prediction of an observed sample on the base classifier g(x) exceeds LB_{y=1}, the sample's true label can be assumed to be 1. Likewise, the upper-bound threshold UB_{y=0} decides whether an observed sample's true label is 0.

Step 1.3: compute the counts. With P~ the observed positive sample set and N~ the observed negative sample set, the thresholds are set to the expected value of the classification probability g(x) = P(s=1|x) over the positive and negative observed samples:

LB_{y=1} = E_{x in P~}[g(x)],  UB_{y=0} = E_{x in N~}[g(x)]

and the counts are

N_{11} = |{x in P~ : g(x) >= LB_{y=1}}|,  N_{01} = |{x in N~ : g(x) >= LB_{y=1}}|,
N_{10} = |{x in P~ : g(x) <= UB_{y=0}}|,  N_{00} = |{x in N~ : g(x) <= UB_{y=0}}|.
Step 1.4: compute the noise-rate estimates from the counts of samples with observed label s and judged true label y:

ρ1^ = N_{01} / (N_{01} + N_{11}),  ρ0^ = N_{10} / (N_{10} + N_{00}).

Step 1.5: by Bayes' theorem, derive the inverted noise rates from the noise-rate estimates:

π1^ = P(y=0|s=1) = ρ0^ P(y=0) / p_s1,  π0^ = P(y=1|s=0) = ρ1^ P(y=1) / (1 - p_s1),

where p_s1 = P(s=1) is the fraction of positive samples in the observed sample set. Since the inverted noise rates give the probabilities that observed positive or negative samples actually carry true label 0 or 1, π1^ times the size of the observed positive set is the number of samples in that set whose true label is 0, i.e., its noise count, and π0^ times the size of the observed negative set is the number of samples in that set whose true label is 1, i.e., its noise count. Finally, the samples are sorted in ascending order of the base classifier's prediction g(x): in the observed positive set, the first samples in this order, up to the positive noise count, are regarded as the noise-label samples of the positive set; in the observed negative set, the last samples, up to the negative noise count, are regarded as the noise-label samples of the negative set.
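The Bayes inversion in step 1.5 can be checked numerically. This is a sketch with assumed values, not from the patent; the symbols follow the definitions above (rho1 = P(s=0|y=1), rho0 = P(s=1|y=0), pi1 = P(y=0|s=1), pi0 = P(y=1|s=0), p_s1 = P(s=1)).

```python
# Worked numeric check of the Bayes inversion from noise rates to inverted
# noise rates, using assumed rates and an assumed true class prior.
rho1, rho0 = 0.2, 0.1   # assumed noise rates
p_y1 = 0.5              # assumed true class prior P(y=1)

# Forward model for the observed positive fraction:
# p_s1 = (1 - rho1) * p_y1 + rho0 * (1 - p_y1)
p_s1 = (1 - rho1) * p_y1 + rho0 * (1 - p_y1)   # 0.8*0.5 + 0.1*0.5 = 0.45

# Bayes' theorem: pi1 = P(y=0|s=1) = P(s=1|y=0) P(y=0) / P(s=1), etc.
pi1 = rho0 * (1 - p_y1) / p_s1                 # 0.05 / 0.45
pi0 = rho1 * p_y1 / (1 - p_s1)                 # 0.10 / 0.55

# In practice P(y=1) is unknown; inverting the forward model recovers it
# from p_s1 and the noise-rate estimates:
p_y1_rec = (p_s1 - rho0) / (1 - rho1 - rho0)
print(p_s1, pi1, pi0, p_y1_rec)
```

The recovered prior `p_y1_rec` matches the assumed 0.5, confirming that the forward model and the Bayes inversion are consistent with each other.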
Step 2: with reference to Figure 3, use the base classifier's classification results to re-label the noise-label samples, obtaining a clean dataset in which the noisy labels have been corrected, as follows:
Step 2.1 (binary classification): after the noise-label samples are identified, sort the samples in ascending order of the base classifier's predicted probability g(x) = P(s=1|x). In the observed positive sample set, re-label the first samples, those with the smallest g(x) values, as 0, up to the estimated noise count of the positive set; in the observed negative sample set, re-label the last samples, those with the largest g(x) values, as 1, up to the estimated noise count of the negative set. The sets obtained this way are the re-labeled positive sample set and the re-labeled negative sample set.
Step 2.2 (multi-class classification): when there are more than two label classes, re-labeling a noise sample must consider which class the sample most likely belongs to and assign that label, chosen according to the base classifier's classification results over all samples. The base classifier therefore records, for every sample, the probability of belonging to each class, producing the classification result matrix psx = {p_ij | i in [N], j in [K]}, an N×K probability matrix (N samples, K label classes). Row i, p_i = (p_i1, p_i2, ..., p_iK), gives the probabilities that sample x_i belongs to each class under the base classifier f(x), with p_ij the probability that x_i belongs to class k_j. When a sample x_i is judged to carry a noise label, psx is used to re-label it with the class of highest predicted probability other than its current label:

y_i^relabel = k_max, where k_max = arg max_{k != s_i} p_ik

and k_max is the class with the largest probability in the base classifier's output for x_i, excluding the sample's original noisy label s_i. The data obtained after this re-labeling is the correct dataset with the noise labels fixed.
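The multi-class re-labeling rule above amounts to an argmax over classes that excludes the sample's current noisy label. A minimal sketch follows; the `relabel` helper and the probability rows are illustrative, not from the patent.

```python
# Multi-class relabeling: pick the most probable class, excluding the
# sample's current (noisy) label.
def relabel(psx_row, current_label):
    # argmax over class indices k != current_label
    return max((k for k in range(len(psx_row)) if k != current_label),
               key=lambda k: psx_row[k])

# Illustrative rows of the psx matrix (one row per flagged noise sample).
psx = [
    [0.10, 0.70, 0.20],   # observed label 0 -> most likely other class is 1
    [0.55, 0.05, 0.40],   # observed label 1 -> most likely other class is 0
]
noisy_labels = [0, 1]
new_labels = [relabel(row, s) for row, s in zip(psx, noisy_labels)]
print(new_labels)  # [1, 0]
```

Excluding the current label matters: for the second row a plain argmax would return class 0 anyway, but for a sample whose noisy label happens to have the highest probability, a plain argmax would leave the label unchanged instead of correcting it.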
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910562002.2A CN110363228B (en) | 2019-06-26 | 2019-06-26 | Noise label correction method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110363228A true CN110363228A (en) | 2019-10-22 |
CN110363228B CN110363228B (en) | 2022-09-06 |
Family
ID=68216503
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910562002.2A Active CN110363228B (en) | 2019-06-26 | 2019-06-26 | Noise label correction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110363228B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105426826A (en) * | 2015-11-09 | 2016-03-23 | 张静 | Crowd-sourced tagging data quality improvement method based on tag noise correction |
CN107292330A (en) * | 2017-05-02 | 2017-10-24 | 南京航空航天大学 | Iterative label noise identification algorithm based on supervised and semi-supervised learning information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110363228B (en) | Noise label correction method | |
Xu et al. | Generating representative samples for few-shot classification | |
CN111814584B (en) | Vehicle re-identification method in multi-view environment based on multi-center metric loss | |
CN110837850B (en) | An Unsupervised Domain Adaptation Method Based on Adversarial Learning Loss Function | |
CN109948561B (en) | Method and system for unsupervised image and video pedestrian re-identification based on transfer network | |
CN109101938B (en) | Multi-label age estimation method based on convolutional neural network | |
CN103390279B (en) | Target foreground co-segmentation method combining saliency detection and discriminative learning | |
JP2019521443A (en) | Cell annotation method and annotation system using adaptive additional learning | |
CN110334687A (en) | A Pedestrian Retrieval Enhancement Method Based on Pedestrian Detection, Attribute Learning and Pedestrian Recognition | |
CN110110792A (en) | Multi-label data stream classification method based on incremental learning | |
CN108596199A (en) | Unbalanced data classification method based on EasyEnsemble algorithms and SMOTE algorithms | |
CN108960073A (en) | Cross-modal image steganalysis method for biomedical literature | |
CN107368534B (en) | A method for predicting social network user attributes | |
CN108228684B (en) | Method and device for training clustering model, electronic equipment and computer storage medium | |
Luo et al. | Learning from the past: Continual meta-learning with Bayesian graph neural networks | |
CN111160959B (en) | User click conversion prediction method and device | |
Zhu et al. | Semi-supervised streaming learning with emerging new labels | |
CN112819065A (en) | Unsupervised pedestrian sample mining method and unsupervised pedestrian sample mining system based on multi-clustering information | |
CN107403188A (en) | A kind of quality evaluation method and device | |
CN111814713A (en) | An expression recognition method based on BN parameter transfer learning | |
CN107067022B (en) | Method, device and equipment for establishing image classification model | |
CN117061322A (en) | Internet of things flow pool management method and system | |
WO2020024444A1 (en) | Group performance grade recognition method and apparatus, and storage medium and computer device | |
CN113313179A (en) | Noise image classification method based on l2p norm robust least square method | |
KR20100116404A (en) | Method and apparatus of dividing separated cell and grouped cell from image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |