Iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning
Technical Field
The invention relates to the technical field of data mining and machine learning, and in particular to an iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning.
Background
Much of the training data used in practical applications of machine learning is noisy; the causes include human error, hardware faults, errors in the data collection process, and the like. The traditional remedy is to manually preprocess the source data before applying machine learning algorithms in order to obtain clean data. However, this manual work is labor-intensive, tedious and time-consuming, and it cannot guarantee that the data are completely correct, which has a non-negligible influence on the subsequent application of algorithms. Data noise generally falls into two categories: attribute noise, which refers to inaccurate sample attribute values, and class noise, which refers to inaccurate sample labels [1]. The influence of class noise is greater than that of attribute noise.
Methods for handling class noise fall into two categories: robust algorithms [2, 3] and noise detection algorithms [4, 5, 6, 7]. Robust algorithms are designed mainly by improving existing algorithms so that they are less affected by class noise, whereas noise detection algorithms detect and remove the noise before the noisy data are used. In comparison, class noise detection algorithms are more effective and more versatile.
Existing class noise detection algorithms are mainly of two types: those based on supervised learning and those based on semi-supervised learning. Representative supervised algorithms are based on ensemble learning, the best known being majority filtering and consensus filtering [7]. In these algorithms the training data are first randomly divided into subsets, and each subset is then checked for noise individually. The basic idea of the detection is the voting of multiple classifiers trained on the remaining subsets. This type of algorithm mainly comprises two steps: sample division and multi-classifier voting. Because the sample division and the multi-classifier voting are performed only once, these methods belong to label noise detection based on a single vote. Single-vote label noise detection has two shortcomings: the result of a single vote is strongly affected by the sample division, and the likelihood of missing noise is high. Although an improved algorithm (a multiple-voting class noise detection method [8]) was later developed to address these deficiencies, some noise is still missed. There is also an algorithm based on semi-supervised learning [6], whose idea is to train a classification model on the known labeled data, use it to label the unlabeled data, and add the newly labeled data to the existing labeled data set to enlarge the training set, so that a better classification model can be trained on the larger training set to better detect label noise.
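For readers unfamiliar with these filters, a minimal Python sketch of the single-voting (majority/consensus filtering) idea described above is given below; it assumes scikit-learn estimators and NumPy arrays, and all function and parameter names are illustrative rather than taken from the cited works.

```python
# Minimal sketch of single-voting class noise filtering (majority / consensus style).
# Assumes scikit-learn estimators and NumPy arrays; names are illustrative only.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def filter_single_vote(X, y, base_estimators, n_subsets=3, consensus=False, seed=0):
    """Return a boolean mask marking samples flagged as label noise.

    consensus=False -> majority filtering: flag when most classifiers disagree.
    consensus=True  -> consensus filtering: flag only when all classifiers disagree.
    """
    disagree = np.zeros((len(y), len(base_estimators)), dtype=bool)
    kf = KFold(n_splits=n_subsets, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):           # each subset is judged by classifiers
        for j, est in enumerate(base_estimators):     # trained on the remaining subsets
            model = clone(est).fit(X[train_idx], y[train_idx])
            disagree[test_idx, j] = model.predict(X[test_idx]) != y[test_idx]
    votes_against = disagree.sum(axis=1)
    if consensus:
        return votes_against == len(base_estimators)
    return votes_against > len(base_estimators) / 2
```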
For the supervised learning approach, the hidden information in the unlabeled data is neither used nor explored, and the probability of missing noise remains high. For the semi-supervised approach, the original labeled data set itself contains noise, so the labels assigned to the unlabeled data may also be noisy; when the noise level of the original labeled data set is high, a very poor classification model is finally obtained.
References:
[1] X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artificial Intelligence Review 22 (3) (2004) 177-210.
[2] J. Bootkrajang, A. Kaban, Classification of mislabelled microarrays using robust sparse logistic regression, Bioinformatics 29 (7) (2013) 870-877.
[3] J. Saez, M. Galar, J. Luengo, F. Herrera, A first study on decomposition strategies with data with class noise using decision trees, in: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 7209, 2012, pp. 25-35.
[4] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybernet. 2 (3) (1972) 408-421.
[5] J. Young, J. Ashburner, S. Ourselin, Wrapper methods to correct mislabeled training data, in: 3rd International Workshop on Pattern Recognition in Neuroimaging, 2013, pp. 170-173.
[6] D. Guan, W. Yuan, et al., Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell. 35 (3) (2011) 345-358.
[7] C.E. Brodley, M.A. Friedl, Identifying mislabeled training data, J. Artif. Intell. Res. 11 (1999) 131-167.
[8] D. Guan, W. Yuan, T. Ma, et al., Detecting potential labeling errors for bioinformatics by multiple voting, Knowledge-Based Systems 66 (2014) 28-35.
Disclosure of the Invention
The invention aims to provide an iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning. By combining the two kinds of information, corresponding parameters and strategies can be set according to the actual situation, the weaknesses of noise detection based on either supervised learning or semi-supervised learning alone are avoided, the identification accuracy is effectively improved, and noise data are discovered more thoroughly through iteration.
The iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning disclosed by the invention comprises the following steps:
step 1) determining the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum iteration number maxIter, the number of voting rounds numVote, the final noise identification voting confidence numFinalConfidence, the number of random partitions numPartition, the number of classifiers numClassifier, the noise identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initializing the voting round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equal-sized subsets E_1, E_2, ..., E_numPartition, and initializing a parameter i = 1;
step 3) using the samples in E - E_i as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H_1, H_2, ..., H_numClassifier;
step 4) using H_1, H_2, ..., H_numClassifier to classify the samples in E_i, calculating the numConfidence of each sample, and storing the calculation results into a table;
step 5) iteratively executing steps 3) to 4), adding 1 to the value of i after each iteration, until i equals numPartition; after the voting is finished, calculating the numConfidence of all samples and storing them into a table;
step 6) iteratively executing steps 2) to 5), adding 1 to the value of t after each iteration, until t equals numVote, thereby generating numVote tables;
step 7) comprehensively analyzing the numVote tables and aggregating the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, and storing the numFinalConfidence(e) into one table; initializing a set En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined threshold ConfidenceThreshold into En as suspicious samples;
step 8) taking E' = E - En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and using these classifiers to label the unlabeled sample set U to obtain a labeled sample set;
step 9) taking the labeled sample set obtained in step 8) as the training set and the data set E as the test set, calculating the numFinalConfidence(e)' of each sample in E with a weighted KNN algorithm, and storing the numFinalConfidence(e)' into a second table;
step 10) adding the values of the same sample in the two tables to obtain a final Confidence table, and regarding the samples whose final values are smaller than the specified threshold ConfidenceThreshold as noise; letting the noise set detected in the m-th iteration be E_noise(m), then setting E = E - E_noise(m);
step 11) iteratively executing steps 2) to 10), adding 1 to the value of m after each iteration, until E_noise(m) is the empty set or m equals maxIter;
step 12) returning the value E, where E is the clean sample set after the noise has been deleted, and the algorithm ends.
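A minimal Python sketch of the supervised voting part of steps 2) to 7) is given below. It assumes scikit-learn and NumPy, takes numConfidence to be the fraction of the numClassifier classifiers that agree with a sample's current label, and aggregates the numVote tables by averaging; both choices are assumptions of this sketch, and all helper names are illustrative rather than part of the claimed method.

```python
# Sketch of steps 2)-7): numVote rounds of random partitioning, per-sample voting
# confidence, and the suspicious set En.  The definition of numConfidence as the
# fraction of agreeing classifiers is an assumption of this sketch.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def supervised_confidence(X, y, estimators, num_partition=3, num_vote=5, seed=0):
    conf_tables = np.zeros((num_vote, len(y)))              # one table per voting round
    for t in range(num_vote):                               # step 6): numVote rounds
        kf = KFold(n_splits=num_partition, shuffle=True, random_state=seed + t)
        for train_idx, test_idx in kf.split(X):             # steps 2)-5)
            agree = np.zeros(len(test_idx))
            for est in estimators:                          # numClassifier classifiers
                model = clone(est).fit(X[train_idx], y[train_idx])
                agree += model.predict(X[test_idx]) == y[test_idx]
            conf_tables[t, test_idx] = agree / len(estimators)   # numConfidence
    return conf_tables.mean(axis=0)                         # step 7): numFinalConfidence(e)

def suspicious_set(num_final_confidence, confidence_threshold=0.4):
    # indices of the suspicious samples En
    return np.where(num_final_confidence < confidence_threshold)[0]
```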
Further, in step 3), numClassifier is chosen as an odd number, which facilitates decisive voting. The classification algorithm is one or more of k-nearest neighbors, decision tree, naive Bayes, neural network and support vector machine. The choice of numClassifier is also affected by the data set. For small-sample data sets, a larger numClassifier value should be used to ensure diversity among the classifiers. When the label noise of the sample set is high, a larger numClassifier value should likewise be chosen; a larger numClassifier ensures a high label noise identification rate in each iteration, which helps reduce the number of iterations and improve the efficiency of the algorithm. Conversely, a smaller numClassifier may be chosen when the sample set is large and the label noise rate is low. For example, numClassifier may be set to 3.
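As one possible realization of such a classifier pool (not prescribed by the invention), an odd-sized ensemble can be assembled from standard scikit-learn classifiers; the concrete choices below are illustrative.

```python
# Illustrative assembly of the numClassifier diverse classifiers (odd count for voting).
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def make_estimators(num_classifier=3):
    pool = [GaussianNB(),
            DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5),
            MLPClassifier(max_iter=500, random_state=0),
            SVC(random_state=0)]
    assert num_classifier % 2 == 1, "an odd number of classifiers is preferred for voting"
    return pool[:num_classifier]
```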
In another improvement, in step 7), the larger the threshold ConfidenceThreshold is set, the larger the set En of suspected noise obtained by the supervised learning part becomes, the cleaner the training data E' = E - En used to label the unlabeled data set U is, the higher the accuracy of the obtained labels is, and the more accurately the labeled data can then serve as training data for detecting the noise in E. However, ConfidenceThreshold should not be too large, otherwise some accurately labeled data in E would be treated as noise, making the E' data set too small to train a good classification model for labeling the unlabeled data set U.
In another improvement, the threshold ConfidenceThreshold in step 7) may be chosen from conventional values, for example ConfidenceThreshold = 0.1, 0.2, 0.3 or 0.4. An optimized ConfidenceThreshold value may also be computed from a separate set of verification samples, as follows: a) estimating the noise ratio of the data to be processed from prior knowledge; b) adding random noise to the verification samples at that ratio; c) traversing the candidate ConfidenceThreshold values and computing the algorithm's identification accuracy on the noise in the verification samples for each value; d) selecting the ConfidenceThreshold with the highest identification accuracy.
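A sketch of this calibration procedure a) to d) follows, reusing the supervised_confidence helper sketched earlier. The noise-injection scheme, the candidate threshold grid, and the accuracy definition (correctly judged samples divided by the total) are assumptions of this sketch.

```python
# Sketch of calibrating ConfidenceThreshold on a separate verification set (steps a-d).
# Assumes integer class labels and the supervised_confidence helper sketched above.
import numpy as np

def inject_label_noise(y, noise_ratio, rng):
    classes = np.unique(y)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_ratio * len(y)), replace=False)
    for i in idx:                                    # b) flip randomly chosen labels
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy, idx

def calibrate_threshold(X_val, y_val, estimators, noise_ratio,
                        candidates=(0.1, 0.2, 0.3, 0.4)):
    rng = np.random.default_rng(0)
    y_noisy, noise_idx = inject_label_noise(y_val, noise_ratio, rng)   # a) + b)
    conf = supervised_confidence(X_val, y_noisy, estimators)
    best_threshold, best_acc = None, -1.0
    for thr in candidates:                           # c) traverse candidate thresholds
        flagged = set(np.where(conf < thr)[0])
        truth = set(noise_idx.tolist())
        correct = len(flagged & truth) + (len(y_val) - len(flagged | truth))
        acc = correct / len(y_val)
        if acc > best_acc:
            best_threshold, best_acc = thr, acc
    return best_threshold                            # d) threshold with highest accuracy
```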
The invention has the following beneficial effects: the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning combines supervised and semi-supervised learning, so the data are no longer judged by a single source of information; supervised learning makes one judgment on the data, semi-supervised learning makes another, and the two judgments are finally combined to obtain the final result. In the supervised learning part, noise is identified by multiple rounds of voting, and the sample order is randomly shuffled before each round to ensure diversity among the votes. After the suspicious noise set En is obtained by the supervised learning part, part of the suspicious data is first filtered out by setting E' = E - En; E' is then used as the training set for the unlabeled data set U, the trained classification model labels U, the labeled data set is in turn used as a training set, and a weighted KNN classifier tests the data in E to obtain numFinalConfidence(e)' for each sample in E. Finally, the two results numFinalConfidence(e) and numFinalConfidence(e)' are combined to obtain the noise set detected in the m-th iteration and the purified data set E. In addition, the identification algorithm adopts an iterative approach: the samples input to each iteration are the clean samples output after filtering noise in the previous iteration, so all noise data can be identified more comprehensively and thoroughly. The identification algorithm solves the problem of the low identification accuracy of existing label noise identification algorithms and ensures a high accuracy of noise identification.
Drawings
FIG. 1 is a flow chart of the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning according to the present invention.
Detailed Description
The iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning of the present invention includes the following steps:
step 1) determining the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum iteration number maxIter, the number of voting rounds numVote, the final noise identification voting confidence numFinalConfidence, the number of random partitions numPartition, the number of classifiers numClassifier, the noise identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initializing the voting round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equal-sized subsets E_1, E_2, ..., E_numPartition, and initializing a parameter i = 1;
step 3) using the samples in E - E_i as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H_1, H_2, ..., H_numClassifier. numClassifier is chosen to be an odd number, such as 3, 5 or 7, although it is not limited to these enumerated odd numbers; the classification algorithm is one or more of k-nearest neighbors, decision tree, naive Bayes, neural network and support vector machine;
step 4) using H_1, H_2, ..., H_numClassifier to classify the samples in E_i, calculating the numConfidence of each sample, and storing the calculation results into a table;
step 5) iteratively executing steps 3) to 4), adding 1 to the value of i after each iteration, until i equals numPartition; after the voting is finished, calculating the numConfidence of all samples and storing them into a table;
step 6) iteratively executing steps 2) to 5), adding 1 to the value of t after each iteration, until t equals numVote, thereby generating numVote tables;
step 7) comprehensively analyzing the numVote tables and aggregating the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, and storing the numFinalConfidence(e) into one table; initializing a set En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined threshold ConfidenceThreshold into En as suspicious samples. A larger threshold ConfidenceThreshold yields a larger En and therefore a purer E', but it must not be too large, otherwise the set E' becomes small and a good model cannot be trained to label U. ConfidenceThreshold = 0.4 is therefore a preferred example, and other suitable values may also be selected;
step 8) taking E' = E - En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and using these classifiers to label the unlabeled sample set U to obtain a labeled sample set;
step 9) taking the labeled sample set obtained in step 8) as the training set and the data set E as the test set, calculating the numFinalConfidence(e)' of each sample in E with a weighted KNN algorithm, and storing the numFinalConfidence(e)' into a second table. The value K of the weighted KNN may be 3, 5, 7, 9, etc.; K = 5 is a preferred example, and any other suitable value may also be selected;
step 10) adding the values of the same sample in the two tables to obtain a final Confidence table, and regarding the samples whose final values are smaller than the specified threshold ConfidenceThreshold as noise; letting the noise set detected in the m-th iteration be E_noise(m), then setting E = E - E_noise(m);
step 11) iteratively executing steps 2) to 10), adding 1 to the value of m after each iteration, until E_noise(m) is the empty set or m equals maxIter;
step 12) returning the value E, where E is the clean sample set after the noise has been deleted, and the algorithm ends.
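The semi-supervised part and the outer iteration of steps 8) to 12) can be sketched as follows, reusing the supervised_confidence and suspicious_set helpers shown earlier. The majority vote used to label U, the use of a distance-weighted KNN's class probability as numFinalConfidence(e)', and the plain addition of the two confidence tables follow the description above, but their exact numerical scales and integer-encoded class labels are assumptions of this sketch; all helper names are illustrative.

```python
# Sketch of steps 8)-12): label U with classifiers trained on E' = E - En, score E with a
# distance-weighted KNN trained on the newly labeled data, add the two confidence tables,
# remove the detected noise, and iterate.
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def semi_supervised_confidence(X_E, y_E, X_U, estimators, en_idx, k=5):
    keep = np.setdiff1d(np.arange(len(y_E)), en_idx)            # E' = E - En
    preds = []
    for est in estimators:                                      # step 8): label U
        preds.append(clone(est).fit(X_E[keep], y_E[keep]).predict(X_U))
    preds = np.stack(preds)
    y_U = np.array([np.bincount(col).argmax() for col in preds.T])   # majority vote
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_U, y_U)
    proba = knn.predict_proba(X_E)                              # step 9): weighted KNN on E
    cols = np.searchsorted(knn.classes_, y_E)                   # assumes all classes occur in y_U
    return proba[np.arange(len(y_E)), cols]                     # confidence in the current label

def iterate_noise_removal(X, y, X_U, estimators, threshold=0.4, max_iter=100):
    idx = np.arange(len(y))                                     # indices of the current E
    for m in range(max_iter):                                   # step 11)
        conf_sup = supervised_confidence(X[idx], y[idx], estimators)
        en = suspicious_set(conf_sup, threshold)
        conf_semi = semi_supervised_confidence(X[idx], y[idx], X_U, estimators, en)
        final_conf = conf_sup + conf_semi                       # step 10): add the two tables
        noise = np.where(final_conf < threshold)[0]
        if len(noise) == 0:                                     # E_noise(m) empty -> stop
            break
        idx = np.delete(idx, noise)                             # E = E - E_noise(m)
    return idx                                                  # step 12): indices of clean E
```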
The test results on 2 data sets from the UCI database, and the performance improvement over existing label noise recognition algorithms, are described in detail below. The proposed recognition algorithm is compared with the currently popular multiple-voting recognition algorithms MFCF and CFMF and with the semi-supervised algorithms CFAUD and MFAUD (for MFCF and CFMF see reference [8]; for CFAUD and MFAUD see reference [6]). Because the data in the raw UCI database contain neither label noise nor unlabeled data, for each selected data set a large portion of the samples have their labels removed to serve as the unlabeled data set, and noise is artificially added to the remaining labeled data at different noise ratios of 10%, 20%, 30% and 40%. In these examples, the performance of a label noise detection algorithm is measured by the number of identification errors, which consists of two parts: E1, the number of noise samples incorrectly diagnosed as good data, and E2, the number of good samples incorrectly diagnosed as noise. The smaller the value of E1 + E2, the higher the accuracy of the algorithm.
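For completeness, the E1 and E2 error counts described above can be computed as in the following sketch, assuming the indices of the injected noise and of the samples flagged by the algorithm are known; all names are illustrative.

```python
# Sketch of the E1 / E2 evaluation: E1 = missed noise, E2 = clean samples flagged as noise.
def error_counts(true_noise_idx, detected_idx):
    true_noise = set(int(i) for i in true_noise_idx)
    detected = set(int(i) for i in detected_idx)
    e1 = len(true_noise - detected)    # noise samples diagnosed as good data
    e2 = len(detected - true_noise)    # good samples diagnosed as noise
    return e1, e2, e1 + e2             # smaller E1 + E2 means higher accuracy
```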
TABLE 1 Data sets

Data set         | Number of samples | Number of features
Breast           | 683               | 9
Credit-screening | 653               | 14
The parameters are set as follows: numPartition = 3, numClassifier = 3 (the three classification algorithms are naive Bayes, decision tree and nearest neighbor), maxIter = 100, numVote = 5, and ConfidenceThreshold = 0.4.
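Expressed with the illustrative helpers sketched earlier, these settings correspond to the following call; the data loading and the labeled/unlabeled split below are stand-ins for illustration only and do not reproduce the UCI sets of Table 1.

```python
# Experimental settings wired into the earlier sketches (stand-in data, for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_lab, X_unlab, y_lab, _ = train_test_split(X_all, y_all, test_size=0.6, random_state=0)

estimators = make_estimators(num_classifier=3)       # naive Bayes, decision tree, nearest neighbor
clean_idx = iterate_noise_removal(X_lab, y_lab, X_unlab, estimators,
                                  threshold=0.4,     # ConfidenceThreshold
                                  max_iter=100)      # maxIter; numPartition=3, numVote=5 defaults
print(len(clean_idx), "samples kept as clean")
```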
TABLE 2 Breast data set, results at 10% noise ratio
TABLE 3 Breast data set, results at 20% noise ratio
TABLE 4 Breast data set, results at 30% noise ratio
TABLE 5 Breast data set, results at 40% noise ratio
TABLE 6 Credit-screening data set, results at 10% noise ratio
TABLE 7 Credit-screening data set, results at 20% noise ratio
TABLE 8 Credit-screening data set, results at 30% noise ratio
TABLE 9 Credit-screening data set, results at 40% noise ratio
As shown in Tables 2-9 above, on the two data sets used in the experiments and under the different noise ratios, the proposed algorithm is superior in stability to the conventional comparison algorithms.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and are not used for limiting the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be covered within the scope of the claims of the present invention.