CN107292330B - An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning - Google Patents

An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning Download PDF

Info

Publication number
CN107292330B
CN107292330B (application CN201710315861.2A)
Authority
CN
China
Prior art keywords
noise
supervised learning
sample
value
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710315861.2A
Other languages
Chinese (zh)
Other versions
CN107292330A (en
Inventor
关东海
魏红强
袁伟伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201710315861.2A priority Critical patent/CN107292330B/en
Publication of CN107292330A publication Critical patent/CN107292330A/en
Application granted granted Critical
Publication of CN107292330B publication Critical patent/CN107292330B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses an iterative label noise identification algorithm based on the dual information of supervised learning and semi-supervised learning, belonging to the field of machine learning and data mining. The invention combines supervised and semi-supervised learning. For the supervised learning part, a supervised noise identification result is produced by soft multiple voting. For the semi-supervised learning part, a classification model trained on the pure data produced by the supervised part labels the unlabeled data set; the newly labeled data then serve as a training set, and a weighted KNN method tests the labeled data set to produce a second noise identification result. Finally, the two noise identification results are combined into the final result. The algorithm also proceeds iteratively: the samples fed into each iteration are those remaining after the noise detected in the previous iteration has been filtered out. Compared with traditional noise identification algorithms, the invention combines more complementary information and, aided by iteration, better promotes the accuracy of noise identification.


Description

Iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning
Technical Field
The invention relates to the technical field of data mining and machine learning, and in particular to an iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning.
Background
Much of the training data used in practical machine learning applications is noisy; causes include human error, hardware faults, errors in the data collection process, and the like. The traditional remedy is to preprocess the source data manually before applying machine learning algorithms so as to obtain pure source data. Manual cleaning, however, is labor-intensive, tedious and time-consuming, and still cannot guarantee that the data are completely correct, which has a non-negligible influence on subsequent algorithm applications. Data noise generally falls into two categories: attribute noise, meaning inaccurate sample attribute values, and class noise, meaning inaccurate sample labels [1]. The influence of class noise is generally larger than that of attribute noise.
Methods for handling class noise fall into two groups: robust algorithms [2, 3] and noise detection algorithms [4, 5, 6, 7]. Robust algorithms are designed mainly by improving existing algorithms so that they are less affected by class noise, whereas noise detection algorithms detect and remove the noise before the data are used. By comparison, class-noise detection algorithms are more effective and more versatile.
Existing class-noise detection algorithms are mainly of two types: those based on supervised learning and those based on semi-supervised learning. The supervised approaches are represented by ensemble-learning algorithms, most notably majority filtering and consensus filtering [7]. In these algorithms, the training data is first randomly divided into subsets, and each subset is then checked for noise individually. The basic idea is to let multiple classifiers, trained on the remaining subsets, vote on each held-out sample. Such algorithms comprise two steps, sample division and multi-classifier voting; because both are performed only once, they are single-voting label noise detection methods. Single voting has two shortcomings: its result depends strongly on the particular sample division, and the likelihood of missing noise is greater. Although an improved algorithm (the multiple-voting class-noise detection method [8]) was later developed to address these deficiencies, some noise is still missed. In the algorithm based on semi-supervised learning [6], a classification model is trained on the known labeled data and used to label the unlabeled data; the newly labeled data are added to the existing labeled data set to enlarge the training set, so that a better classification model can be trained from the larger training set to detect label noise more reliably.
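The fold-and-vote logic of majority/consensus filtering [7] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the scalar features and the toy `one_nn` classifier are assumptions made for brevity, and `cross_val_filter` only shows the cross-validated ensemble vote.

```python
import random

def one_nn(train_x, train_y, test_x):
    """Toy 1-nearest-neighbour classifier on scalar features (illustrative)."""
    return [train_y[min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))]
            for x in test_x]

def cross_val_filter(data, labels, classifiers, n_folds=3, scheme="majority"):
    """Split the data into folds, train every classifier on the complement of
    each fold, and flag a held-out sample when the ensemble vote disagrees
    with its label.  Each classifier is a function
    (train_x, train_y, test_x) -> list of predicted labels."""
    idx = list(range(len(data)))
    random.shuffle(idx)                       # random sample division
    folds = [idx[i::n_folds] for i in range(n_folds)]
    suspects = set()
    for fold in folds:
        train = [i for i in idx if i not in fold]
        preds = [clf([data[i] for i in train], [labels[i] for i in train],
                     [data[i] for i in fold]) for clf in classifiers]
        for col, i in enumerate(fold):
            wrong = sum(1 for p in preds if p[col] != labels[i])
            # consensus: all classifiers must disagree; majority: more than half
            if (scheme == "consensus" and wrong == len(preds)) or \
               (scheme == "majority" and wrong > len(preds) / 2):
                suspects.add(i)
    return suspects
```

With two well-separated clusters and one flipped label, the flipped sample is flagged because the classifiers trained on the other folds predict its true class.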
For supervised learning alone, the hidden information in the unlabeled data is neither exploited nor explored, and the probability of missing noise is high. For semi-supervised learning alone, the original labeled data set itself contains noise, so the labels it assigns to the unlabeled data are also noisy; if this labeling noise outweighs the information gained, a very poor classification model is ultimately obtained.
Reference documents:
[1] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study, Artificial Intelligence Review 22 (3) (2004) 177-210.
[2] J. Bootkrajang, A. Kaban, Classification of mislabelled microarrays using robust sparse logistic regression, Bioinformatics 29 (7) (2013) 870-877.
[3] J. Saez, M. Galar, J. Luengo, F. Herrera, A first study on decomposition strategies with data with class noise using decision trees, in: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 7209, 2012, pp. 25-35.
[4] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybernet. 2 (3) (1972) 408-421.
[5] J. Young, J. Ashburner, S. Ourselin, Wrapper methods to correct mislabeled training data, in: 3rd International Workshop on Pattern Recognition in Neuroimaging, 2013, pp. 170-173.
[6] D. Guan, W. Yuan, et al., Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell. 35 (3) (2011) 345-358.
[7] C.E. Brodley, M.A. Friedl, Identifying mislabeled training data, J. Artif. Intell. Res. 11 (1999) 131-167.
[8] D. Guan, W. Yuan, T. Ma, et al., Detecting potential labeling errors for bioinformatics by multiple voting, Knowledge-Based Systems 66 (2014) 28-35.
Disclosure of the Invention
The invention aims to provide an iterative label noise identification algorithm based on the dual information of supervised learning and semi-supervised learning. By combining the two sources of information, with parameters and strategies configurable according to the actual situation, the algorithm avoids the noise detection problems of either single source, can effectively improve identification accuracy, and, through iteration, can discover noise data more thoroughly.
The iterative label noise identification algorithm based on the dual information of supervised learning and semi-supervised learning disclosed by the invention comprises the following steps:
step 1) determine the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum number of iterations maxIter, the number of voting rounds numVote, the final noise-identification voting confidence numFinalConfidence, the number of random partitions numCross, the number of classifiers numClassifier, the per-round noise-identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initialize the voting-round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly divide E into numCross equal-sized subsets E1, E2, …, EnumCross; initialize the parameter i = 1;
step 3) use the samples in E − Ei as training data, select numClassifier different classification algorithms, and train numClassifier different classifiers H1, H2, …, HnumClassifier;
step 4) use H1, H2, …, HnumClassifier to classify the samples in Ei, compute the numConfidence of each sample, and store the results in a table;
step 5) iterate steps 2) to 4), incrementing i by 1 after each iteration until i equals numCross, then stop; after this round of voting is finished, compute the numConfidence of all samples and store the results in a table;
step 6) iterate steps 2) to 5), incrementing t by 1 after each iteration until t = numVote, generating numVote tables;
step 7) jointly analyze the numVote tables and aggregate the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, storing the results in one table. Initialize a set En, and store every sample whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold in En as a suspicious sample;
step 8) take E′ = E − En as a training set, generate numClassifier classifiers based on numClassifier classification algorithms, and label the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) take the data set E as the test set and the labeled data set as the training set, compute the numFinalConfidence(e)′ of each sample with the weighted KNN algorithm, and store the results in a table;
step 10) for each sample, add its values in the two tables to obtain the final Confidence table; samples whose value is smaller than the specified threshold ConfidenceThreshold are regarded as noise. Let the detected noise set be Enm; then E = E − Enm;
step 11) if Enm ≠ ∅, iterate steps 2) to 10), incrementing m by 1 after each iteration, until Enm = ∅ or m = maxIter;
step 12) return E, the clean sample set after the noise has been removed; the algorithm ends.
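The twelve steps above can be condensed into the following high-level loop. This is a sketch under stated assumptions: `detect_supervised` and `detect_semisupervised` are hypothetical callbacks standing in for the multi-voting procedure (steps 2-7) and the labeling plus weighted-KNN procedure (steps 8-9), and the combination rule simply sums the two scores as in step 10.

```python
def iterative_noise_filter(L, U, detect_supervised, detect_semisupervised,
                           threshold=0.4, max_iter=100):
    """Iterative dual-information loop: filter noise from E until the
    detected noise set En_m is empty or maxIter is reached."""
    E = list(L)
    for _ in range(max_iter):
        s_sup = detect_supervised(E)                       # numFinalConfidence(e)
        suspects = {e for e in E if s_sup[e] < threshold}  # suspicious set En
        s_semi = detect_semisupervised(E, U, suspects)     # numFinalConfidence(e)'
        noise = [e for e in E                              # final Confidence check
                 if s_sup[e] + s_semi[e] < threshold]      # En_m
        if not noise:                                      # En_m empty -> stop
            break
        E = [e for e in E if e not in noise]               # E = E - En_m
    return E
```

The loop terminates either because an iteration finds no new noise or because the iteration budget maxIter is exhausted, matching step 11.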
Further, in step 3), numClassifier is selected as an odd number, which facilitates voting. The classification algorithms are one or more of k-nearest neighbor, decision tree, Bayes, neural network and support vector machine. The choice of numClassifier is affected by the data set. For small-sample data sets, a larger numClassifier value should be used to ensure diversity among the classifiers. When the label noise of the sample set is high, a larger numClassifier value should likewise be chosen: a larger numClassifier ensures a high label-noise identification rate in each iteration, which helps reduce the number of iterations and improves the efficiency of the algorithm. On the other hand, a smaller numClassifier may be selected when the sample set is larger and the label noise ratio is lower. For example, numClassifier may be set to 3.
In another improvement, in step 7), the larger the threshold ConfidenceThreshold is set, the larger the set En of suspected noise obtained by the supervised learning part becomes; the training data E′ = E − En used to label the unlabeled data set U is then cleaner, the obtained labels are more accurate, and so is the detection of noise in E when these data serve as the training set. However, ConfidenceThreshold must not be too large, as that would cause some accurately labeled data in E to be treated as noise, making the E′ data set so small that a good classification model cannot be trained to label the unlabeled data set U.
In another improvement, the threshold ConfidenceThreshold in step 7) may be chosen from conventional values, such as 0.1, 0.2, 0.3 or 0.4. An optimized ConfidenceThreshold value may also be computed from independent validation samples. The specific steps are: a) estimate the noise ratio of the noise data to be processed from prior knowledge, b) add random noise to the validation samples, c) traverse candidate ConfidenceThreshold values and compute, for each value, the accuracy with which the algorithm identifies the noise in the validation samples, and d) select the ConfidenceThreshold with the highest identification accuracy.
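The calibration procedure a)-d) can be sketched as follows. `run_filter` is a hypothetical callback standing in for the full detection algorithm at a given threshold, and binary 0/1 labels are assumed for the noise-injection step.

```python
import random

def calibrate_threshold(val_x, val_y, run_filter, noise_ratio,
                        candidates=(0.1, 0.2, 0.3, 0.4)):
    """Inject random label noise at the estimated ratio into clean validation
    samples, then pick the ConfidenceThreshold that best recovers the
    injected noise.  run_filter(x, y, threshold) -> set of flagged indices."""
    n = len(val_y)
    flipped = set(random.sample(range(n), int(noise_ratio * n)))  # step b)
    noisy_y = [1 - y if i in flipped else y for i, y in enumerate(val_y)]
    best, best_acc = candidates[0], -1.0
    for th in candidates:                                         # step c)
        flagged = run_filter(val_x, noisy_y, th)
        # a sample is handled correctly if it is flagged iff it was flipped
        correct = len(flagged & flipped) + (n - len(flagged | flipped))
        acc = correct / n
        if acc > best_acc:                                        # step d)
            best, best_acc = th, acc
    return best
```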
The beneficial effects of the invention are as follows. The iterative label noise identification algorithm adopts a dual-information mode combining supervised and semi-supervised learning: the data is no longer judged by a single source of information; supervised learning makes one judgment, semi-supervised learning makes another, and the two judgments are finally combined into the final classification result. For the supervised learning part, noise identification uses multiple voting, and the sample order is randomly shuffled before each round of voting to guarantee diversity among the votes. After the suspicious noise set En produced by the supervised learning part is filtered out via E′ = E − En, part of the suspicious data is removed first; E′ is then used to train a classification model that labels the unlabeled data set U; the labeled data set serves as the training set, and a weighted KNN classification algorithm tests the data in E to obtain the numFinalConfidence(e)′ of each sample in E. Finally, by combining the two results numFinalConfidence(e) and numFinalConfidence(e)′, the detected noise set Enm (the noise set detected in the m-th iteration) and the pure data set E − Enm are obtained. In addition, the algorithm identifies noise iteratively: the samples fed into each iteration are the pure samples output after filtering noise in the previous iteration, so all noise data can be identified more comprehensively and thoroughly. The identification algorithm solves the problem of low identification accuracy in existing label noise identification algorithms and ensures high noise identification accuracy.
Drawings
FIG. 1 is a flow chart of the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning according to the present invention.
Detailed Description
The iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning of the present invention includes the following steps:
step 1) determine the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum number of iterations maxIter, the number of voting rounds numVote, the final noise-identification voting confidence numFinalConfidence, the number of random partitions numCross, the number of classifiers numClassifier, the per-round noise-identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initialize the voting-round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly divide E into numCross equal-sized subsets E1, E2, …, EnumCross; initialize the parameter i = 1;
step 3) use the samples in E − Ei as training data, select numClassifier different classification algorithms, and train numClassifier different classifiers H1, H2, …, HnumClassifier. numClassifier is chosen as an odd number, such as 3, 5 or 7, although it is not limited to the odd numbers enumerated; the classification algorithms are one or more of k-nearest neighbor, decision tree, Bayes, neural network and support vector machine;
step 4) use H1, H2, …, HnumClassifier to classify the samples in Ei, compute the numConfidence of each sample, and store the results in a table;
step 5) iterate steps 2) to 4), incrementing i by 1 after each iteration until i equals numCross, then stop; after this round of voting is finished, compute the numConfidence of all samples and store the results in a table;
step 6) iteratively executing steps 2) to 5), adding 1 to the t value after each iteration until t is numVote, and generating numVote tables;
step 7) jointly analyze the numVote tables and aggregate the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, storing the results in one table. Initialize a set En, and store every sample whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold in En as a suspicious sample. A larger threshold ConfidenceThreshold yields a larger En and thus a purer E′; however, it must not be too large, otherwise the set E′ becomes so small that a good model cannot be trained for labeling U. A ConfidenceThreshold of 0.4 is therefore a preferred example, and other suitable values may be selected;
step 8) take E′ = E − En as a training set, generate numClassifier classifiers based on numClassifier classification algorithms, and label the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) take the data set E as the test set and the labeled data set as the training set, compute the numFinalConfidence(e)′ of each sample with the weighted KNN algorithm, and store the results in a table. The K value of the weighted KNN may be 3, 5, 7, 9, etc.; K = 5 is a preferred example, and any other suitable value may be selected;
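As one possible reading of the weighted-KNN step, the sketch below computes a confidence for a tested label as the distance-weighted fraction of neighbors that agree with it. The exact weighting scheme is not specified in the text, so inverse-distance weighting is an assumption (it is the common choice), and scalar features are used for brevity.

```python
def weighted_knn_confidence(train, test_point, test_label, k=5):
    """train: list of (feature, label) pairs built from the newly labeled
    unlabeled set.  Returns the weighted fraction of the k nearest
    neighbors' votes that agree with test_label, i.e. a score playing the
    role of numFinalConfidence(e)'."""
    dists = sorted((abs(x - test_point), y) for x, y in train)[:k]
    total = agree = 0.0
    for d, y in dists:
        w = 1.0 / (d + 1e-9)           # inverse-distance weight
        total += w
        if y == test_label:
            agree += w
    return agree / total if total else 0.0
```

A sample whose label disagrees with all nearby newly-labeled points receives a confidence near 0 and is therefore likely to fall below ConfidenceThreshold in step 10.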
step 10) for each sample, add its values in the two tables to obtain the final Confidence table; samples whose value is smaller than the specified threshold ConfidenceThreshold are regarded as noise. Let the detected noise set be Enm; then E = E − Enm;
step 11) if Enm ≠ ∅, iterate steps 2) to 10), incrementing m by 1 after each iteration, until Enm = ∅ or m = maxIter;
step 12) return E, the clean sample set after the noise has been removed; the algorithm ends.
The test results on 2 data sets from the UCI database, and the performance improvement over existing label noise identification algorithms, are described in detail below. The identification algorithm proposed here is compared with the currently popular multiple-voting identification algorithms MFCF and CFMF and the semi-supervised algorithms CFAUD and MFAUD (for MFCF and CFMF see reference [8]; for CFAUD and MFAUD see reference [6]). Because the raw UCI data contain neither label noise nor unlabeled data, for each selected data set a large portion of the samples have their labels removed to serve as the unlabeled data set, and noise is artificially added to the remaining labeled data at different noise ratios: 10%, 20%, 30% and 40%. In this example, the performance of a label noise detection algorithm is measured by the number of mislabeling errors. The error count has two parts: noisy samples wrongly diagnosed as good data, denoted E1, and good data wrongly diagnosed as noise, denoted E2. The smaller the value of E1 + E2, the higher the accuracy of the algorithm.
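The E1/E2 error count can be expressed directly; the mapping of E1 to missed noise and E2 to wrongly flagged good data follows the reading of the paragraph above and should be treated as an assumption.

```python
def count_errors(flagged, true_noise):
    """E1 = noisy samples the filter missed (diagnosed as good);
    E2 = good samples the filter wrongly flagged as noise."""
    flagged, true_noise = set(flagged), set(true_noise)
    e1 = len(true_noise - flagged)   # missed noise
    e2 = len(flagged - true_noise)   # good data flagged as noise
    return e1, e2
```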
TABLE 1 Data sets
Data set          Number of samples  Number of features
Breast            683                9
Credit-screening  653                14
The parameters are set as follows: numCross = 3, numClassifier = 3 (the three classification algorithms are naive Bayes, decision tree and nearest neighbor), maxIter = 100, numVote = 5, ConfidenceThreshold = 0.4.
TABLE 2 Breast data set, results at 10% noise ratio
TABLE 3 Breast data set, results at 20% noise ratio
TABLE 4 Breast data set, results at 30% noise ratio
TABLE 5 Breast data set, results at 40% noise ratio
TABLE 6 Credit-screening data set, results at 10% noise ratio
TABLE 7 Credit-screening data set, results at 20% noise ratio
TABLE 8 Credit-screening data set, results at 30% noise ratio
TABLE 9 Credit-screening data set, results at 40% noise ratio
(The numerical results of tables 2-9 appear as images in the original publication and are not reproduced here.)
As shown in tables 2-9 above, across the different noise ratios on the two data sets used in the experiment, the proposed algorithm is superior in stability to the two families of conventional algorithms.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and are not used for limiting the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be covered within the scope of the claims of the present invention.

Claims (4)

1. An iterative label noise identification method based on dual information of supervised learning and semi-supervised learning, characterized by comprising the following steps:
step 1) determine the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum number of iterations maxIter, the number of voting rounds numVote, the final noise-identification voting confidence numFinalConfidence, the number of random partitions numCross, the number of classifiers numClassifier, the per-round noise-identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for discriminating noise; initialize the voting-round counter t = 1, the iteration counter m = 1, and the sample set to be processed E = L;
step 2) randomly divide E into numCross equal-sized subsets E1, E2, …, EnumCross, where i = 1:n; initialize the parameter i = 1;
step 3) use the samples in E − Ei as training data, select numClassifier different classification algorithms, and train numClassifier different classifiers H1, H2, …, HnumClassifier;
step 4) use H1, H2, …, HnumClassifier to classify the samples in Ei, compute the numConfidence of each sample, and store the results in a table;
step 5) iterate step 2) to step 4), incrementing i by 1 after each iteration until i equals numCross, then stop; after this round of voting is completed, compute the numConfidence of all samples and store the results in a table;
in step 4) and step 5), each element of the table corresponds to one sample of the sample set E to be processed together with its probability of being correctly labeled, numConfidence;
step 6) iterate step 2) to step 5), incrementing t by 1 after each iteration until t = numVote, generating numVote tables;
step 7) jointly analyze the numVote tables and aggregate the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, storing the results in one table; initialize a set En, and store every sample whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold in En as a suspicious sample;
the ConfidenceThreshold value in step 7) is selected as a value between 0.1 and 0.4;
step 8) take E′ = E − En as a training set, generate numClassifier classifiers based on numClassifier classification methods, and label the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) take the data set E as the test set and the labeled data set as the training set, compute the numFinalConfidence(e)′ of each sample through the weighted KNN algorithm, and store the results in a table;
step 10) for each sample, add and average its values in the table containing numFinalConfidence(e) and the table containing numFinalConfidence(e)′ to obtain the final Confidence table; samples whose value is smaller than the specified threshold ConfidenceThreshold are regarded as noise; let the detected noise set be Enm, then E = E − Enm;
step 11) if Enm ≠ ∅, iterate step 2) to step 10), incrementing m by 1 after each iteration, until Enm = ∅ or m = maxIter;
step 12) return the value E, where E is the pure sample set after the noise has been removed, and the method ends.

2. The iterative label noise identification method based on dual information of supervised learning and semi-supervised learning according to claim 1, characterized in that in step 3), numClassifier is selected as an odd number.

3. The iterative label noise identification method based on dual information of supervised learning and semi-supervised learning according to claim 2, characterized in that numClassifier = 3.

4. The iterative label noise identification method based on dual information of supervised learning and semi-supervised learning according to claim 1, characterized in that the ConfidenceThreshold value in step 7) is computed and optimized through independent validation samples; the specific steps include: a) estimating the noise ratio of the noise data to be processed from prior knowledge, b) adding random noise to the validation samples, c) traversing candidate ConfidenceThreshold values and computing, for each value, the accuracy with which the identification method identifies the noise in the validation samples, and d) selecting the ConfidenceThreshold with the highest identification accuracy.
CN201710315861.2A 2017-05-02 2017-05-02 An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning Expired - Fee Related CN107292330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710315861.2A CN107292330B (en) 2017-05-02 2017-05-02 An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710315861.2A CN107292330B (en) 2017-05-02 2017-05-02 An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning

Publications (2)

Publication Number Publication Date
CN107292330A CN107292330A (en) 2017-10-24
CN107292330B true CN107292330B (en) 2021-08-06

Family

ID=60094401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710315861.2A Expired - Fee Related CN107292330B (en) 2017-05-02 2017-05-02 An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning

Country Status (1)

Country Link
CN (1) CN107292330B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862386A * 2017-11-03 2018-03-30 郑州云海信息技术有限公司 A data processing method and device
CN108021931A 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 A data sample label processing method and device
CN108021940B (en) * 2017-11-30 2023-04-18 中国银联股份有限公司 Data classification method and system based on machine learning
US20190244138A1 (en) * 2018-02-08 2019-08-08 Apple Inc. Privatized machine learning using generative adversarial networks
CN110163376B (en) * 2018-06-04 2023-11-03 腾讯科技(深圳)有限公司 Sample detection method, media object identification method, device, terminal and medium
CN108985365B (en) * 2018-07-05 2021-10-01 重庆大学 Multi-source heterogeneous data fusion method based on deep subspace switching ensemble learning
CN109213656A * 2018-07-23 2019-01-15 武汉智领云科技有限公司 An interactive big data anomaly detection system and method
JP7299002B2 (en) * 2018-08-23 2023-06-27 ファナック株式会社 Discriminator and machine learning method
EP3807821A1 (en) 2018-09-28 2021-04-21 Apple Inc. Distributed labeling for supervised learning
CN109800785B (en) * 2018-12-12 2021-12-28 中国科学院信息工程研究所 Data classification method and device based on self-expression correlation
US11568324B2 (en) * 2018-12-20 2023-01-31 Samsung Display Co., Ltd. Adversarial training method for noisy labels
CN110189305B (en) * 2019-05-14 2023-09-22 上海大学 A multi-task tongue image automatic analysis method
CN110363228B (en) * 2019-06-26 2022-09-06 南京理工大学 Noise label correction method
CN110633758A (en) * 2019-09-20 2019-12-31 四川长虹电器股份有限公司 Method for detecting and locating cancer region aiming at small sample or sample unbalance
US11853908B2 (en) 2020-05-13 2023-12-26 International Business Machines Corporation Data-analysis-based, noisy labeled and unlabeled datapoint detection and rectification for machine-learning
CN111784595B (en) * 2020-06-10 2023-08-29 北京科技大学 Dynamic tag smooth weighting loss method and device based on historical record
CN113269258A (en) * 2021-05-27 2021-08-17 郑州大学 Semi-supervised learning label noise defense algorithm based on AdaBoost
CN113887742A (en) * 2021-10-26 2022-01-04 重庆邮电大学 A data classification method and system based on support vector machine
CN114065135A (en) * 2021-11-12 2022-02-18 西安热工研究院有限公司 A stochastic denoising statistical method and system based on cumulative semaphore
CN114218872B (en) * 2021-12-28 2023-03-24 浙江大学 Remaining service life prediction method based on DBN-LSTM semi-supervised joint model
CN114708470A (en) * 2022-04-15 2022-07-05 杭州网易智企科技有限公司 Identification methods, media and computing devices of illegal images
CN117421657B (en) * 2023-10-27 2024-11-01 江苏开放大学(江苏城市职业学院) Method and system for screening and learning picture samples with noise labels based on oversampling strategy

Citations (1)

Publication number Priority date Publication date Assignee Title
CN105046236A * 2015-08-11 2015-11-11 南京航空航天大学 Iterative label noise recognition algorithm based on multiple voting

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN103886330B * 2014-03-27 2017-03-01 西安电子科技大学 Classification method based on semi-supervised SVM ensemble learning
CN104318242A * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 An efficient SVM-based active semi-supervised learning algorithm
CN104598813B * 2014-12-09 2017-05-17 西安电子科技大学 Computer intrusion detection method based on ensemble learning and semi-supervised SVM
WO2016138041A2 (en) * 2015-02-23 2016-09-01 Cellanyx Diagnostics, Llc Cell imaging and analysis to differentiate clinically relevant sub-populations of cells
CN105930411A (en) * 2016-04-18 2016-09-07 苏州大学 Classifier training method, classifier and sentiment classification system
CN106096622B (en) * 2016-04-26 2019-11-08 北京航空航天大学 Semi-supervised hyperspectral remote sensing image classification and labeling method
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 A Relation Extraction Method Combining Clause-Level Remote Supervision and Semi-Supervised Ensemble Learning
CN106294590B * 2016-07-29 2019-05-31 重庆邮电大学 A semi-supervised-learning-based spam user filtering method for social networks

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN105046236A * 2015-08-11 2015-11-11 南京航空航天大学 Iterative label noise recognition algorithm based on multiple voting

Non-Patent Citations (2)

Title
Co-training Semi-supervised Active Learning Algorithm with Noise Filtering; Zhan Yongzhao et al.; Pattern Recognition and Artificial Intelligence; Oct. 2009; Vol. 22, No. 5; Abstract, Sections 1-5 *
Research on Label Noise Based on Ensemble Semi-supervised Learning; Jin Long et al.; China Master's Theses Full-text Database, Information Science and Technology; Dec. 15, 2013; Vol. 2013, No. S2; pp. I140-91 *

Also Published As

Publication number Publication date
CN107292330A (en) 2017-10-24

Similar Documents

Publication Publication Date Title
CN107292330B (en) An Iterative Label Noise Recognition Algorithm Based on Dual Information of Supervised Learning and Semi-Supervised Learning
JP7643692B2 (en) AI Methods for Cleaning Data to Train Artificial Intelligence (AI) Models
CN110689081B (en) Weak supervision target classification and positioning method based on bifurcation learning
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain
CN105046236A Iterative label noise recognition algorithm based on multiple voting
Ndung'u Data preparation for machine learning modelling
CN114550831B (en) A gastric cancer proteomic classification framework identification method based on deep learning feature extraction
CN114757433A (en) A rapid identification method for the relative risk of antibiotic resistance in drinking water sources
Shoohi et al. DCGAN for Handling Imbalanced Malaria Dataset based on Over-Sampling Technique and using CNN.
CN107679550A A method for assessing the classification usability of datasets
CN107247873A A method for identifying differentially methylated sites
CN113627522B (en) Image classification method, device, equipment and storage medium based on relational network
CN113674862A (en) A method for predicting the incidence of acute renal injury based on machine learning
CN105631465A (en) Density peak-based high-efficiency hierarchical clustering method
Nurmalasari et al. Classification for papaya fruit maturity level with convolutional neural network
CN109597944B (en) Single-classification microblog rumor detection model based on deep belief network
CN113033694B (en) Data cleaning method based on deep learning
CN105701501A (en) Trademark image identification method
CN116151107B (en) Method, system and electronic equipment for identifying ore potential of magma type nickel cobalt
CN108304546B (en) A Medical Image Retrieval Method Based on Content Similarity and Softmax Classifier
CN113392086B (en) Medical database construction method, device and equipment based on Internet of things
CN116630694A (en) A target classification method, system and electronic equipment for more marked images
CN118820813B (en) Product cluster analysis method based on deep learning model
Wang et al. Progressive Ensemble Learning for In-sample Imagery Data Cleaning
Mustafa Automated machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210806