Iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning
Technical Field
The invention relates to the technical field of data mining and machine learning, and in particular to an iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning.
Background
Much of the training data used in practical applications of machine learning is noisy; the causes include human error, hardware faults, errors in the data collection process, and the like. The traditional remedy is to manually preprocess the source data before applying machine learning algorithms in order to obtain clean data. However, this manual work is labor-intensive, tedious and time-consuming, and it cannot guarantee that the data are completely correct, which has a non-negligible influence on the subsequent application of algorithms. Data noise generally falls into two categories: attribute noise, which refers to inaccurate sample attribute values, and class noise, which refers to inaccurate sample labels [1]. The influence of class noise is greater than that of attribute noise.
Methods for handling class noise fall into two categories: robust algorithms [2, 3] and noise detection algorithms [4, 5, 6, 7]. Robust algorithms are designed mainly by improving existing algorithms so that they are less affected by class noise, whereas noise detection algorithms detect and remove the noise before the noisy data are used. In comparison, class noise detection algorithms are more effective and more versatile.
Existing class noise detection algorithms are mainly of two types: those based on supervised learning and those based on semi-supervised learning. Representative supervised algorithms are based on ensemble learning, the best known being majority filtering and consensus filtering [7]. In these algorithms the training data are first randomly divided into subsets, and each subset is then checked for noise individually. The basic idea of the detection is the voting of multiple classifiers trained on the remaining subsets. This type of algorithm mainly comprises two steps: sample division and multi-classifier voting. Because the sample division and the multi-classifier voting are performed only once, these methods belong to label noise detection based on a single vote. Single-vote label noise detection has two shortcomings: the result of a single vote is strongly affected by the sample division, and the likelihood of missing noise is high. Although an improved algorithm (a multiple-voting class noise detection method [8]) was later developed to address these deficiencies, some noise is still missed. There is also an algorithm based on semi-supervised learning [6], whose idea is to train a classification model on the known labeled data, use it to label the unlabeled data, and add the newly labeled data to the existing labeled data set to enlarge the training set, so that a better classification model can be trained on the larger training set to better detect label noise.
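For readers unfamiliar with these filters, a minimal Python sketch of the single-voting (majority/consensus filtering) idea described above is given below; it assumes scikit-learn estimators and NumPy arrays, and all function and parameter names are illustrative rather than taken from the cited works.

```python
# Minimal sketch of single-voting class noise filtering (majority / consensus style).
# Assumes scikit-learn estimators and NumPy arrays; names are illustrative only.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def filter_single_vote(X, y, base_estimators, n_subsets=3, consensus=False, seed=0):
    """Return a boolean mask marking samples flagged as label noise.

    consensus=False -> majority filtering: flag when most classifiers disagree.
    consensus=True  -> consensus filtering: flag only when all classifiers disagree.
    """
    disagree = np.zeros((len(y), len(base_estimators)), dtype=bool)
    kf = KFold(n_splits=n_subsets, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):           # each subset is judged by classifiers
        for j, est in enumerate(base_estimators):     # trained on the remaining subsets
            model = clone(est).fit(X[train_idx], y[train_idx])
            disagree[test_idx, j] = model.predict(X[test_idx]) != y[test_idx]
    votes_against = disagree.sum(axis=1)
    if consensus:
        return votes_against == len(base_estimators)
    return votes_against > len(base_estimators) / 2
```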
For the supervised learning approach, the hidden information in the unlabeled data is neither used nor explored, and the probability of missing noise remains high. For the semi-supervised approach, the original labeled data set itself contains noise, so the labels assigned to the unlabeled data may also be noisy; when the noise level of the original labeled data set is high, a very poor classification model is finally obtained.
References:
[1] X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artificial Intelligence Review 22 (3) (2004) 177-210.
[2] J. Bootkrajang, A. Kaban, Classification of mislabelled microarrays using robust sparse logistic regression, Bioinformatics 29 (7) (2013) 870-877.
[3] J. Saez, M. Galar, J. Luengo, F. Herrera, A first study on decomposition strategies with data with class noise using decision trees, in: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 7209, 2012, pp. 25-35.
[4] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Trans. Syst. Man Cybernet. 2 (3) (1972) 408-421.
[5] J. Young, J. Ashburner, S. Ourselin, Wrapper methods to correct mislabeled training data, in: 3rd International Workshop on Pattern Recognition in Neuroimaging, 2013, pp. 170-173.
[6] D. Guan, W. Yuan, et al., Identifying mislabeled training data with the aid of unlabeled data, Appl. Intell. 35 (3) (2011) 345-358.
[7] C.E. Brodley, M.A. Friedl, Identifying mislabeled training data, J. Artif. Intell. Res. 11 (1999) 131-167.
[8] D. Guan, W. Yuan, T. Ma, et al., Detecting potential labeling errors for bioinformatics by multiple voting, Knowledge-Based Systems 66 (2014) 28-35.
Disclosure of the Invention
The invention aims to provide an iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning. By combining the two kinds of information, corresponding parameters and strategies can be set according to the actual situation, the weaknesses of noise detection based on either supervised learning or semi-supervised learning alone are avoided, the identification accuracy is effectively improved, and noise data are discovered more thoroughly through iteration.
The iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning disclosed by the invention comprises the following steps:
step 1) determining the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum iteration number maxIter, the number of voting rounds numVote, the final noise identification voting confidence numFinalConfidence, the number of random partitions numPartition, the number of classifiers numClassifier, the noise identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initializing the voting round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equal-sized subsets E_1, E_2, ..., E_numPartition, and initializing a parameter i = 1;
step 3) using the samples in E - E_i as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H_1, H_2, ..., H_numClassifier;
step 4) using H_1, H_2, ..., H_numClassifier to classify the samples in E_i, calculating the numConfidence of each sample, and storing the calculation results into a table;
step 5) iteratively executing steps 3) to 4), adding 1 to the value of i after each iteration, until i equals numPartition; after the voting is finished, calculating the numConfidence of all samples and storing them into a table;
step 6) iteratively executing steps 2) to 5), adding 1 to the value of t after each iteration, until t equals numVote, thereby generating numVote tables;
step 7) comprehensively analyzing the numVote tables and aggregating the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, and storing the numFinalConfidence(e) into one table; initializing a set En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined threshold ConfidenceThreshold into En as suspicious samples;
step 8) taking E' = E - En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and using these classifiers to label the unlabeled sample set U to obtain a labeled sample set;
step 9) taking the labeled sample set obtained in step 8) as the training set and the data set E as the test set, calculating the numFinalConfidence(e)' of each sample in E with a weighted KNN algorithm, and storing the numFinalConfidence(e)' into a second table;
step 10) adding the values of the same sample in the two tables to obtain a final Confidence table, and regarding the samples whose final values are smaller than the specified threshold ConfidenceThreshold as noise; letting the noise set detected in the m-th iteration be E_noise(m), then setting E = E - E_noise(m);
step 11) iteratively executing steps 2) to 10), adding 1 to the value of m after each iteration, until E_noise(m) is the empty set or m equals maxIter;
step 12) returning the value E, where E is the clean sample set after the noise has been deleted, and the algorithm ends.
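A minimal Python sketch of the supervised voting part of steps 2) to 7) is given below. It assumes scikit-learn and NumPy, takes numConfidence to be the fraction of the numClassifier classifiers that agree with a sample's current label, and aggregates the numVote tables by averaging; both choices are assumptions of this sketch, and all helper names are illustrative rather than part of the claimed method.

```python
# Sketch of steps 2)-7): numVote rounds of random partitioning, per-sample voting
# confidence, and the suspicious set En.  The definition of numConfidence as the
# fraction of agreeing classifiers is an assumption of this sketch.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def supervised_confidence(X, y, estimators, num_partition=3, num_vote=5, seed=0):
    conf_tables = np.zeros((num_vote, len(y)))              # one table per voting round
    for t in range(num_vote):                               # step 6): numVote rounds
        kf = KFold(n_splits=num_partition, shuffle=True, random_state=seed + t)
        for train_idx, test_idx in kf.split(X):             # steps 2)-5)
            agree = np.zeros(len(test_idx))
            for est in estimators:                          # numClassifier classifiers
                model = clone(est).fit(X[train_idx], y[train_idx])
                agree += model.predict(X[test_idx]) == y[test_idx]
            conf_tables[t, test_idx] = agree / len(estimators)   # numConfidence
    return conf_tables.mean(axis=0)                         # step 7): numFinalConfidence(e)

def suspicious_set(num_final_confidence, confidence_threshold=0.4):
    # indices of the suspicious samples En
    return np.where(num_final_confidence < confidence_threshold)[0]
```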
Further, in step 3), numClassifier is chosen as an odd number, which facilitates decisive voting. The classification algorithm is one or more of k-nearest neighbors, decision tree, naive Bayes, neural network and support vector machine. The choice of numClassifier is also affected by the data set. For small-sample data sets, a larger numClassifier value should be used to ensure diversity among the classifiers. When the label noise of the sample set is high, a larger numClassifier value should likewise be chosen; a larger numClassifier ensures a high label noise identification rate in each iteration, which helps reduce the number of iterations and improve the efficiency of the algorithm. Conversely, a smaller numClassifier may be chosen when the sample set is large and the label noise rate is low. For example, numClassifier may be set to 3.
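As one possible realization of such a classifier pool (not prescribed by the invention), an odd-sized ensemble can be assembled from standard scikit-learn classifiers; the concrete choices below are illustrative.

```python
# Illustrative assembly of the numClassifier diverse classifiers (odd count for voting).
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def make_estimators(num_classifier=3):
    pool = [GaussianNB(),
            DecisionTreeClassifier(random_state=0),
            KNeighborsClassifier(n_neighbors=5),
            MLPClassifier(max_iter=500, random_state=0),
            SVC(random_state=0)]
    assert num_classifier % 2 == 1, "an odd number of classifiers is preferred for voting"
    return pool[:num_classifier]
```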
In another improvement, in step 7), the larger the threshold ConfidenceThreshold is set, the larger the set En of suspected noise obtained by the supervised learning part becomes, the cleaner the training data E' = E - En used to label the unlabeled data set U is, the higher the accuracy of the obtained labels is, and the more accurately the labeled data can then serve as training data for detecting the noise in E. However, ConfidenceThreshold should not be too large, otherwise some accurately labeled data in E would be treated as noise, making the E' data set too small to train a good classification model for labeling the unlabeled data set U.
In another improvement, the threshold ConfidenceThreshold in step 7) may be chosen from conventional values, for example ConfidenceThreshold = 0.1, 0.2, 0.3 or 0.4. An optimized ConfidenceThreshold value may also be computed from a separate set of verification samples, as follows: a) estimating the noise ratio of the data to be processed from prior knowledge; b) adding random noise to the verification samples at that ratio; c) traversing the candidate ConfidenceThreshold values and computing the algorithm's identification accuracy on the noise in the verification samples for each value; d) selecting the ConfidenceThreshold with the highest identification accuracy.
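A sketch of this calibration procedure a) to d) follows, reusing the supervised_confidence helper sketched earlier. The noise-injection scheme, the candidate threshold grid, and the accuracy definition (correctly judged samples divided by the total) are assumptions of this sketch.

```python
# Sketch of calibrating ConfidenceThreshold on a separate verification set (steps a-d).
# Assumes integer class labels and the supervised_confidence helper sketched above.
import numpy as np

def inject_label_noise(y, noise_ratio, rng):
    classes = np.unique(y)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_ratio * len(y)), replace=False)
    for i in idx:                                    # b) flip randomly chosen labels
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy, idx

def calibrate_threshold(X_val, y_val, estimators, noise_ratio,
                        candidates=(0.1, 0.2, 0.3, 0.4)):
    rng = np.random.default_rng(0)
    y_noisy, noise_idx = inject_label_noise(y_val, noise_ratio, rng)   # a) + b)
    conf = supervised_confidence(X_val, y_noisy, estimators)
    best_threshold, best_acc = None, -1.0
    for thr in candidates:                           # c) traverse candidate thresholds
        flagged = set(np.where(conf < thr)[0])
        truth = set(noise_idx.tolist())
        correct = len(flagged & truth) + (len(y_val) - len(flagged | truth))
        acc = correct / len(y_val)
        if acc > best_acc:
            best_threshold, best_acc = thr, acc
    return best_threshold                            # d) threshold with highest accuracy
```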
The invention has the following beneficial effects: the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning combines supervised and semi-supervised learning, so the data are no longer judged by a single source of information; supervised learning makes one judgment on the data, semi-supervised learning makes another, and the two judgments are finally combined to obtain the final result. In the supervised learning part, noise is identified by multiple rounds of voting, and the sample order is randomly shuffled before each round to ensure diversity among the votes. After the suspicious noise set En is obtained by the supervised learning part, part of the suspicious data is first filtered out by setting E' = E - En; E' is then used as the training set for the unlabeled data set U, the trained classification model labels U, the labeled data set is in turn used as a training set, and a weighted KNN classifier tests the data in E to obtain numFinalConfidence(e)' for each sample in E. Finally, the two results numFinalConfidence(e) and numFinalConfidence(e)' are combined to obtain the noise set detected in the m-th iteration and the purified data set E. In addition, the identification algorithm adopts an iterative approach: the samples input to each iteration are the clean samples output after filtering noise in the previous iteration, so all noise data can be identified more comprehensively and thoroughly. The identification algorithm solves the problem of the low identification accuracy of existing label noise identification algorithms and ensures a high accuracy of noise identification.
Drawings
FIG. 1 is a flow chart of the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning according to the present invention.
Detailed Description
The iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning of the present invention includes the following steps:
step 1) determining the algorithm input variables, including the sample set L to be processed, the unlabeled sample set U, the maximum iteration number maxIter, the number of voting rounds numVote, the final noise identification voting confidence numFinalConfidence, the number of random partitions numPartition, the number of classifiers numClassifier, the noise identification voting confidence numConfidence, and the confidence threshold ConfidenceThreshold for judging noise; initializing the voting round counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equal-sized subsets E_1, E_2, ..., E_numPartition, and initializing a parameter i = 1;
step 3) using the samples in E - E_i as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H_1, H_2, ..., H_numClassifier. numClassifier is chosen to be an odd number, such as 3, 5 or 7, although it is not limited to these enumerated odd numbers; the classification algorithm is one or more of k-nearest neighbors, decision tree, naive Bayes, neural network and support vector machine;
step 4) using H_1, H_2, ..., H_numClassifier to classify the samples in E_i, calculating the numConfidence of each sample, and storing the calculation results into a table;
step 5) iteratively executing steps 3) to 4), adding 1 to the value of i after each iteration, until i equals numPartition; after the voting is finished, calculating the numConfidence of all samples and storing them into a table;
step 6) iteratively executing steps 2) to 5), adding 1 to the value of t after each iteration, until t equals numVote, thereby generating numVote tables;
step 7) comprehensively analyzing the numVote tables and aggregating the numConfidence of each sample to obtain the numFinalConfidence(e) of each sample e, and storing the numFinalConfidence(e) into one table; initializing a set En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined threshold ConfidenceThreshold into En as suspicious samples. A larger threshold ConfidenceThreshold yields a larger En and therefore a purer E', but it must not be too large, otherwise the set E' becomes small and a good model cannot be trained to label U. ConfidenceThreshold = 0.4 is therefore a preferred example, and other suitable values may also be selected;
step 8) taking E' = E - En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and using these classifiers to label the unlabeled sample set U to obtain a labeled sample set;
step 9) taking the labeled sample set obtained in step 8) as the training set and the data set E as the test set, calculating the numFinalConfidence(e)' of each sample in E with a weighted KNN algorithm, and storing the numFinalConfidence(e)' into a second table. The value K of the weighted KNN may be 3, 5, 7, 9, etc.; K = 5 is a preferred example, and any other suitable value may also be selected;
step 10) adding the values of the same sample in the two tables to obtain a final Confidence table, and regarding the samples whose final values are smaller than the specified threshold ConfidenceThreshold as noise; letting the noise set detected in the m-th iteration be E_noise(m), then setting E = E - E_noise(m);
step 11) iteratively executing steps 2) to 10), adding 1 to the value of m after each iteration, until E_noise(m) is the empty set or m equals maxIter;
step 12) returning the value E, where E is the clean sample set after the noise has been deleted, and the algorithm ends.
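The semi-supervised part and the outer iteration of steps 8) to 12) can be sketched as follows, reusing the supervised_confidence and suspicious_set helpers shown earlier. The majority vote used to label U, the use of a distance-weighted KNN's class probability as numFinalConfidence(e)', and the plain addition of the two confidence tables follow the description above, but their exact numerical scales and integer-encoded class labels are assumptions of this sketch; all helper names are illustrative.

```python
# Sketch of steps 8)-12): label U with classifiers trained on E' = E - En, score E with a
# distance-weighted KNN trained on the newly labeled data, add the two confidence tables,
# remove the detected noise, and iterate.
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def semi_supervised_confidence(X_E, y_E, X_U, estimators, en_idx, k=5):
    keep = np.setdiff1d(np.arange(len(y_E)), en_idx)            # E' = E - En
    preds = []
    for est in estimators:                                      # step 8): label U
        preds.append(clone(est).fit(X_E[keep], y_E[keep]).predict(X_U))
    preds = np.stack(preds)
    y_U = np.array([np.bincount(col).argmax() for col in preds.T])   # majority vote
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_U, y_U)
    proba = knn.predict_proba(X_E)                              # step 9): weighted KNN on E
    cols = np.searchsorted(knn.classes_, y_E)                   # assumes all classes occur in y_U
    return proba[np.arange(len(y_E)), cols]                     # confidence in the current label

def iterate_noise_removal(X, y, X_U, estimators, threshold=0.4, max_iter=100):
    idx = np.arange(len(y))                                     # indices of the current E
    for m in range(max_iter):                                   # step 11)
        conf_sup = supervised_confidence(X[idx], y[idx], estimators)
        en = suspicious_set(conf_sup, threshold)
        conf_semi = semi_supervised_confidence(X[idx], y[idx], X_U, estimators, en)
        final_conf = conf_sup + conf_semi                       # step 10): add the two tables
        noise = np.where(final_conf < threshold)[0]
        if len(noise) == 0:                                     # E_noise(m) empty -> stop
            break
        idx = np.delete(idx, noise)                             # E = E - E_noise(m)
    return idx                                                  # step 12): indices of clean E
```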
The test results on 2 data sets from the UCI database, and the performance improvement over existing label noise recognition algorithms, are described in detail below. The proposed recognition algorithm is compared with the currently popular multiple-voting recognition algorithms MFCF and CFMF and with the semi-supervised algorithms CFAUD and MFAUD (for MFCF and CFMF see reference [8]; for CFAUD and MFAUD see reference [6]). Because the data in the raw UCI database contain neither label noise nor unlabeled data, for each selected data set a large portion of the samples have their labels removed to serve as the unlabeled data set, and noise is artificially added to the remaining labeled data at different noise ratios of 10%, 20%, 30% and 40%. In these examples, the performance of a label noise detection algorithm is measured by the number of identification errors, which consists of two parts: E1, the number of noise samples incorrectly diagnosed as good data, and E2, the number of good samples incorrectly diagnosed as noise. The smaller the value of E1 + E2, the higher the accuracy of the algorithm.
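For completeness, the E1 and E2 error counts described above can be computed as in the following sketch, assuming the indices of the injected noise and of the samples flagged by the algorithm are known; all names are illustrative.

```python
# Sketch of the E1 / E2 evaluation: E1 = missed noise, E2 = clean samples flagged as noise.
def error_counts(true_noise_idx, detected_idx):
    true_noise = set(int(i) for i in true_noise_idx)
    detected = set(int(i) for i in detected_idx)
    e1 = len(true_noise - detected)    # noise samples diagnosed as good data
    e2 = len(detected - true_noise)    # good samples diagnosed as noise
    return e1, e2, e1 + e2             # smaller E1 + E2 means higher accuracy
```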
TABLE 1 Data sets

Data set         | Number of samples | Number of features
Breast           | 683               | 9
Credit-screening | 653               | 14
The parameters are set as follows: numPartition = 3, numClassifier = 3 (the three classification algorithms are naive Bayes, decision tree and nearest neighbor), maxIter = 100, numVote = 5, and ConfidenceThreshold = 0.4.
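Expressed with the illustrative helpers sketched earlier, these settings correspond to the following call; the data loading and the labeled/unlabeled split below are stand-ins for illustration only and do not reproduce the UCI sets of Table 1.

```python
# Experimental settings wired into the earlier sketches (stand-in data, for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)
X_lab, X_unlab, y_lab, _ = train_test_split(X_all, y_all, test_size=0.6, random_state=0)

estimators = make_estimators(num_classifier=3)       # naive Bayes, decision tree, nearest neighbor
clean_idx = iterate_noise_removal(X_lab, y_lab, X_unlab, estimators,
                                  threshold=0.4,     # ConfidenceThreshold
                                  max_iter=100)      # maxIter; numPartition=3, numVote=5 defaults
print(len(clean_idx), "samples kept as clean")
```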
TABLE 2 Breast data set, results at 10% noise ratio
TABLE 3 Breast data set, results at 20% noise ratio
TABLE 4 Breast data set, results at 30% noise ratio
TABLE 5 Breast data set, results at 40% noise ratio
TABLE 6 Credit-screening data set, results at 10% noise ratio
TABLE 7 Credit-screening data set, results at 20% noise ratio
TABLE 8 Credit-screening data set, results at 30% noise ratio
TABLE 9 Credit-screening data set, results at 40% noise ratio
As shown in Tables 2-9 above, on the two data sets used in the experiments and under the different noise ratios, the proposed algorithm is superior in stability to the conventional comparison algorithms.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and are not used for limiting the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be covered within the scope of the claims of the present invention.