CN107292330B - Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning - Google Patents

Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning Download PDF

Info

Publication number
CN107292330B
CN107292330B (application CN201710315861.2A)
Authority
CN
China
Prior art keywords
noise
supervised learning
sample
value
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710315861.2A
Other languages
Chinese (zh)
Other versions
CN107292330A (en)
Inventor
关东海
魏红强
袁伟伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics filed Critical Nanjing University of Aeronautics and Astronautics
Priority to CN201710315861.2A
Publication of CN107292330A
Application granted
Publication of CN107292330B
Legal status: Expired - Fee Related


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate


Abstract

The invention discloses an iterative label noise identification algorithm based on the double information of supervised learning and semi-supervised learning, belonging to the fields of machine learning and data mining. The invention combines supervised and semi-supervised learning. The supervised learning part generates a noise identification result by soft multiple voting. The semi-supervised learning part labels an unlabeled data set with a classification model trained on the clean data produced by the supervised part, takes the newly labeled data as a training set, and detects the labeled data set with a weighted KNN method to generate a second noise identification result. The two noise identification results are finally combined into the final identification result. The algorithm also runs iteratively: the samples to be examined in each iteration are those remaining after noise was filtered out in the previous iteration. Compared with traditional noise identification algorithms, the method combines more complementary information and is assisted by the iterative mode, so it better improves noise identification accuracy.

Description

Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning
Technical Field
The invention relates to the technical field of data mining and machine learning, in particular to an iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning.
Background
Many training data sets used in practical applications of machine learning are noisy; the causes include human error, hardware device error, errors in the data collection process, and so on. The traditional remedy is to preprocess the source data manually before applying machine learning algorithms in order to obtain clean source data. However, manual work is laborious, tedious and time-consuming, and cannot guarantee that the data are completely correct, which has a non-negligible influence on subsequent algorithm applications. Data noise generally falls into two categories: attribute noise, meaning inaccurate sample attribute values, and class noise, meaning inaccurate sample labels [1]. Class noise has a greater influence than attribute noise.
There are two approaches to handling class noise: designing robust algorithms [2, 3] and noise detection algorithms [4, 5, 6, 7]. Robust algorithms are designed mainly by improving existing algorithms so that they are less affected by class noise. Noise detection algorithms instead detect and remove the noise before the noisy data are used. In comparison, class noise detection algorithms are more effective and more versatile.
Existing class noise detection algorithms are mainly of two types: supervised-learning-based and semi-supervised-learning-based. The representatives of the supervised type are ensemble-learning-based algorithms, notably majority filtering and consensus filtering [7]. In these algorithms, the training data are first randomly divided into subsets, and each subset is then checked for noise individually. The basic idea of the detection is voting by multiple classifiers trained on the remaining subsets. This type of algorithm mainly comprises two steps: sample division and multi-classifier voting. Because the sample division and the multi-classifier voting are performed only once, these are label noise detection methods based on a single vote. Single-vote methods have two deficiencies: the result of a single vote is strongly affected by the sample division, and the likelihood of missing noise is greater. An improved algorithm (the multiple-voting class noise detection method [8]) was later developed for these deficiencies, but some noise is still missed. In the semi-supervised algorithm [6], the idea is to train a classification model on the known labeled data, label the unlabeled data with it, and add the newly labeled data to the existing labeled data set to enlarge the training set; a better classification model can then be trained from the larger training set to detect label noise better.
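The partition-and-vote filtering idea described above can be sketched in a few lines of Python. Everything below is illustrative: the three 1-D toy classifiers and the data set are invented for the example, and the rule shown is the stricter consensus filter (flag a sample only when every classifier disagrees with its recorded label):

```python
import random

def one_nn(train, x):
    # 1-nearest-neighbor on 1-D points: label of the closest training value
    return min(train, key=lambda t: abs(t[0] - x))[1]

def three_nn(train, x):
    # 3-nearest-neighbor majority label
    labels = [y for _, y in sorted(train, key=lambda t: abs(t[0] - x))[:3]]
    return max(set(labels), key=labels.count)

def nearest_centroid(train, x):
    # label of the class whose mean value is closest to x
    groups = {}
    for v, y in train:
        groups.setdefault(y, []).append(v)
    centroids = {y: sum(vs) / len(vs) for y, vs in groups.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def consensus_filter(data, num_folds=3, seed=0):
    """Flag a sample as noise only when every classifier, trained on the
    other folds, disagrees with its recorded label."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::num_folds] for i in range(num_folds)]
    flagged = set()
    for fold in folds:
        held = set(fold)
        train = [data[i] for i in idx if i not in held]
        for i in fold:
            x, y = data[i]
            if all(clf(train, x) != y
                   for clf in (one_nn, three_nn, nearest_centroid)):
                flagged.add(i)
    return flagged

# Two well-separated classes plus one mislabeled point (index 12).
data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0),
        (10.0, 1), (10.1, 1), (10.2, 1), (10.3, 1), (10.4, 1), (10.5, 1),
        (0.05, 1)]  # lies among class 0 but carries label 1
suspects = consensus_filter(data)
```

Majority filtering uses the same structure with the vote rule relaxed to "most classifiers disagree" instead of "all".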
For supervised learning, the hidden information in unlabeled data is neither used nor explored, so the possibility of missed noise is high. For semi-supervised learning, the original labeled data set itself contains noise, so the labels it assigns to the unlabeled data are noisy as well; if the noise ratio of the original labeled data set is high, a very poor classification model is finally obtained.
Reference documents:
[1] Zhu, Xingquan, and Xindong Wu. "Class noise vs. attribute noise: A quantitative study." Artificial Intelligence Review 22.3 (2004): 177-210.
[2] J. Bootkrajang, A. Kaban. Classification of mislabelled microarrays using robust sparse logistic regression. Bioinformatics 29(7) (2013) 870-877.
[3] J. Saez, M. Galar, J. Luengo, F. Herrera. A first study on decomposition strategies with data with class noise using decision trees. In: Hybrid Artificial Intelligent Systems, Lecture Notes in Computer Science, vol. 7209, 2012, pp. 25-35.
[4] D.L. Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybernet. 2(3) (1972) 431-433.
[5] J. Young, J. Ashburner, S. Ourselin. Wrapper methods to correct mislabeled training data. In: 3rd International Workshop on Pattern Recognition in Neuroimaging, 2013, pp. 170-173.
[6] D. Guan, W. Yuan, et al. Identifying mislabeled training data with the aid of unlabeled data. Appl. Intell. 35(3) (2011) 345-358.
[7] C.E. Brodley, M.A. Friedl. Identifying mislabeled training data. J. Artif. Intell. Res. 11 (1999) 131-167.
[8] Guan D, Yuan W, Ma T, et al. Detecting potential labeling errors for bioinformatics by multiple voting. Knowledge-Based Systems, 2014, 66: 28-35.
Disclosure of Invention
The invention aims to provide an iterative label noise identification algorithm based on the double information of supervised learning and semi-supervised learning. By using supervised and semi-supervised information together, corresponding parameters and strategies can be set according to the actual situation, the problems of noise detection based on either single source of information are avoided, the identification accuracy is effectively improved, and the iterative mode discovers noise data more thoroughly.
The iterative label noise identification algorithm based on the double information of supervised learning and semi-supervised learning, disclosed by the invention, comprises the following steps of:
step 1) determining algorithm input variables, including a sample set L to be processed and an unlabeled sample set U, a maximum iteration number maxIter, a multiple-voting number numVote, a final noise-identification voting confidence numFinalConfidence, a random partition number numPartition, a classifier number numClassifier, a noise-identification voting confidence numConfidence, and a confidence threshold ConfidenceThreshold for judging noise; initializing the multiple-voting counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equally sized subsets E1, E2, …, EnumPartition, and initializing the parameter i = 1;
step 3) taking the samples of E that are not in Ei as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H1, H2, …, HnumClassifier;
step 4) using H1, H2, …, HnumClassifier to classify the samples in Ei, calculating the numConfidence of each sample, and storing the results in a table;
step 5) iteratively executing steps 2) to 4), adding 1 to i after each iteration until i = numPartition, then stopping; after the voting is finished, the numConfidence values of all samples have been calculated and stored in a table;
step 6) iteratively executing steps 2) to 5), adding 1 to t after each iteration until t = numVote, and generating numVote tables;
step 7) comprehensively analyzing the numVote tables, aggregating the numConfidence of each sample to obtain numFinalConfidence(e) for each sample e, and storing it in a table; initializing En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold in En as suspicious samples;
step 8) taking E′ = E − En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and labeling the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) taking the data set E as the test set and the newly labeled data set as the training set, calculating numFinalConfidence(e)′ of each sample with a weighted KNN algorithm, and storing it in a table named numConfidence;
step 10) adding the values of the same samples in the numFinalConfidence table and the numConfidence table to obtain a final confidence table, and regarding samples whose value is smaller than the specified threshold ConfidenceThreshold as noise; letting the noise detected in the m-th iteration be Enoise(m), then setting E = E − Enoise(m);
step 11) if Enoise(m) ≠ ∅, iteratively executing steps 2) to 10), adding 1 to m after each iteration, until Enoise(m) = ∅ or m = maxIter;
step 12) returning E, the clean sample set after the noise has been deleted; the algorithm ends.
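As a rough illustration of steps 2) to 7), the following Python sketch (the toy 1-D classifiers and data are invented for the example; the parameter names follow the text) repeats the random partition numVote times, lets numClassifier classifiers vote on each held-out sample, and averages the agreeing-vote fraction into a per-sample confidence; a toy threshold then yields the suspicious set En:

```python
import random

def one_nn(train, x):
    # label of the nearest 1-D training value
    return min(train, key=lambda t: abs(t[0] - x))[1]

def three_nn(train, x):
    # 3-nearest-neighbor majority label
    labels = [y for _, y in sorted(train, key=lambda t: abs(t[0] - x))[:3]]
    return max(set(labels), key=labels.count)

def nearest_centroid(train, x):
    # label of the class whose mean value is closest to x
    groups = {}
    for v, y in train:
        groups.setdefault(y, []).append(v)
    centroids = {y: sum(vs) / len(vs) for y, vs in groups.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - x))

CLASSIFIERS = (one_nn, three_nn, nearest_centroid)   # numClassifier = 3, odd

def num_final_confidence(E, num_partition=3, num_vote=5, seed=0):
    """Steps 2)-7): repeat the random partition numVote times, let the
    classifiers vote on each held-out sample, and average the fraction of
    votes agreeing with the sample's current label."""
    rng = random.Random(seed)
    conf = [0.0] * len(E)
    for _ in range(num_vote):                  # step 6): numVote rounds
        idx = list(range(len(E)))
        rng.shuffle(idx)                       # fresh random partition
        folds = [idx[i::num_partition] for i in range(num_partition)]
        for fold in folds:                     # steps 2)-5)
            held = set(fold)
            train = [E[i] for i in idx if i not in held]
            for i in fold:
                x, y = E[i]
                agree = sum(1 for clf in CLASSIFIERS if clf(train, x) == y)
                conf[i] += agree / len(CLASSIFIERS)    # numConfidence
    return [c / num_vote for c in conf]        # step 7): aggregate

E = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.5, 0),
     (10.0, 1), (10.1, 1), (10.2, 1), (10.3, 1), (10.4, 1), (10.5, 1),
     (0.05, 1)]                                # index 12 is mislabeled
conf = num_final_confidence(E)
En = {i for i, c in enumerate(conf) if c < 0.3}   # toy ConfidenceThreshold
```

With these toy data the mislabeled point receives confidence 0 and is the only member of En; every clean sample keeps at least one agreeing classifier in every round.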
Further, in step 3), numClassifier is chosen to be an odd number, which facilitates voting. The classification algorithm is one or more of k-nearest neighbors, decision tree, Bayes, neural network, and support vector machine. The choice of numClassifier is affected by the data set. For small-sample data sets, a larger numClassifier value should be used to ensure diversity among the classifiers. When the label noise of the sample set is high, a larger numClassifier value should also be used: a larger numClassifier ensures a high label-noise identification rate in each iteration, which helps reduce the number of iterations and improves the efficiency of the algorithm. Conversely, a smaller numClassifier may be chosen when the sample set is larger and its label-noise ratio is lower. For example, numClassifier may be set to 3.
In another improvement, in step 7), the larger the threshold ConfidenceThreshold is set, the larger the set En of suspected noise obtained by the supervised learning part, the cleaner the training data E′ = E − En used to label the unlabeled data set U, the more accurate the resulting labels, and the more accurate the detection of noise in E when those labels are used as training data. However, ConfidenceThreshold must not be too large, otherwise some accurately labeled data in E will be treated as noise, making the E′ data set so small that a good classification model cannot be trained to label the unlabeled data set U.
In another improvement, the threshold ConfidenceThreshold in step 7) may take a conventional value, such as 0.1, 0.2, 0.3 or 0.4. An optimized ConfidenceThreshold may also be calculated from separate validation samples, as follows: a) estimate the noise ratio of the data to be processed from prior knowledge; b) add random noise to the validation samples; c) traverse the candidate ConfidenceThreshold values and calculate the algorithm's noise-identification accuracy on the validation samples under each value; d) select the ConfidenceThreshold with the highest identification accuracy.
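The calibration procedure a) to d) can be sketched as follows. The detector is passed in as a function returning a per-sample confidence in [0, 1]; the toy confidence scorer and validation data below are invented purely for the example:

```python
import random

def calibrate_threshold(clean_val, noise_ratio, confidence_fn,
                        candidates=(0.1, 0.2, 0.3, 0.4), seed=0):
    rng = random.Random(seed)
    n_flip = int(len(clean_val) * noise_ratio)        # step a: assumed ratio
    flipped = set(rng.sample(range(len(clean_val)), n_flip))
    noisy = [(x, 1 - y) if i in flipped else (x, y)   # step b: inject noise
             for i, (x, y) in enumerate(clean_val)]
    conf = confidence_fn(noisy)
    best_t, best_acc = None, -1.0
    for t in candidates:                              # step c: traverse
        predicted = {i for i, c in enumerate(conf) if c < t}
        correct = sum((i in predicted) == (i in flipped)
                      for i in range(len(clean_val)))
        acc = correct / len(clean_val)
        if acc > best_acc:
            best_t, best_acc = t, acc                 # step d: keep the best
    return best_t

# Toy detector: high confidence when the sign of x matches the label.
def toy_confidence(data):
    return [0.9 if (x >= 0) == (y == 1) else 0.35 for x, y in data]

val = [(-1.0, 0), (-2.0, 0), (-3.0, 0), (-4.0, 0), (-5.0, 0),
       (1.0, 1), (2.0, 1), (3.0, 1), (4.0, 1), (5.0, 1)]
best = calibrate_threshold(val, noise_ratio=0.2, confidence_fn=toy_confidence)
```

Here only the largest candidate flags the injected noise (the toy scorer gives mismatched samples confidence 0.35), so the sweep selects 0.4.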
The invention has the beneficial effects that: the iterative label noise identification algorithm based on the double information of supervised learning and semi-supervised learning uses a dual-information mode combining supervised and semi-supervised learning, so the data are no longer judged by a single source of information. Supervised learning makes one judgment on the data, semi-supervised learning makes another, and the two judgments are finally combined into the final classification result. For the supervised learning part, noise identification uses multiple voting, and the sample order is randomly shuffled before each vote to guarantee voting diversity. After the suspicious noise set En obtained by the supervised part is removed via E′ = E − En, a classification model trained on E′ labels the unlabeled data set U; the labeled data set then serves as the training set, and a weighted KNN classification algorithm tests the data in E to obtain numFinalConfidence(e)′ for each sample in E. Finally, the two confidence results numFinalConfidence(e) and numFinalConfidence(e)′ are combined to obtain the detected noise set Enoise(m) (the noise set detected in the m-th iteration) and the clean data set E − Enoise(m). In addition, the identification algorithm is iterative: the samples to be examined in each iteration are the clean samples output after filtering noise in the previous iteration, so all noise data can be identified more comprehensively and thoroughly. The identification algorithm solves the problem of low identification accuracy in existing label noise identification algorithms and ensures high noise-identification accuracy.
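The outer iteration described above reduces to the skeleton below. The confidence function stands in for the combined supervised-plus-semi-supervised score of steps 2) to 10); the toy mean-based scorer is invented purely to show how a later pass can catch noise that the first pass missed:

```python
def iterative_noise_filter(E, confidence_fn, threshold, max_iter):
    """Repeat detection on the surviving samples until no noise is found
    or maxIter passes have run; return the clean set (step 12)."""
    for m in range(max_iter):
        conf = confidence_fn(E)                       # stands for steps 2)-10)
        noise = {i for i, c in conf.items() if c < threshold}
        if not noise:                                 # step 11) stop test
            break
        E = {i: v for i, v in E.items() if i not in noise}
    return E

# Toy scorer: a value far above the current mean looks like noise.
def toy_confidence(E):
    mean = sum(E.values()) / len(E)
    return {i: (1.0 if v <= 2 * mean else 0.0) for i, v in E.items()}

E = {0: 1, 1: 1, 2: 1, 3: 1, 4: 5, 5: 50}
clean = iterative_noise_filter(E, toy_confidence, threshold=0.5, max_iter=100)
```

On the toy data the outlier 50 is removed in the first pass, which shifts the mean enough for the second pass to remove 5 as well; a single pass would have kept it.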
Drawings
FIG. 1 is a flow chart of an iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning according to the present invention.
Detailed Description
The following describes an iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning in detail with reference to the accompanying drawings.
As shown in fig. 1, the iterative label noise identification algorithm based on dual information of supervised learning and semi-supervised learning of the present invention includes the following steps:
step 1) determining algorithm input variables, including a sample set L to be processed and an unlabeled sample set U, a maximum iteration number maxIter, a multiple-voting number numVote, a final noise-identification voting confidence numFinalConfidence, a random partition number numPartition, a classifier number numClassifier, a noise-identification voting confidence numConfidence, and a confidence threshold ConfidenceThreshold for judging noise; initializing the multiple-voting counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equally sized subsets E1, E2, …, EnumPartition, and initializing the parameter i = 1;
step 3) taking the samples of E that are not in Ei as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H1, H2, …, HnumClassifier. numClassifier is chosen to be an odd number, such as 3, 5 or 7, although not limited to these; the classification algorithm is one or more of k-nearest neighbors, decision tree, Bayes, neural network, and support vector machine;
step 4) using H1, H2, …, HnumClassifier to classify the samples in Ei, calculating the numConfidence of each sample, and storing the results in a table;
step 5) iteratively executing steps 2) to 4), adding 1 to i after each iteration until i = numPartition, then stopping; after the voting is finished, the numConfidence values of all samples have been calculated and stored in a table;
step 6) iteratively executing steps 2) to 5), adding 1 to t after each iteration until t = numVote, and generating numVote tables;
step 7) comprehensively analyzing the numVote tables, aggregating the numConfidence of each sample to obtain numFinalConfidence(e) for each sample e, and storing it in a table. En is initialized, and the samples whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold are stored in En as suspicious samples. The larger the threshold ConfidenceThreshold, the larger the resulting En and the purer the subsequent E′; but it must not be too large, otherwise the set E′ becomes so small that a good model cannot be trained for labeling U. ConfidenceThreshold = 0.4 is therefore a preferred example; other suitable values may also be chosen;
step 8) taking E′ = E − En as a training set, generating numClassifier classifiers based on the numClassifier classification algorithms, and labeling the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) taking the data set E as the test set and the newly labeled data set as the training set, calculating numFinalConfidence(e)′ of each sample with a weighted KNN algorithm, and storing it in a table named numConfidence. The K value of the weighted KNN may be 3, 5, 7, 9, etc.; K = 5 is a preferred example, and any other suitable value may be chosen;
step 10) adding the values of the same samples in the numFinalConfidence table and the numConfidence table to obtain a final confidence table, and regarding samples whose value is smaller than the specified threshold ConfidenceThreshold as noise; letting the noise detected in the m-th iteration be Enoise(m), then setting E = E − Enoise(m);
step 11) if Enoise(m) ≠ ∅, iteratively executing steps 2) to 10), adding 1 to m after each iteration, until Enoise(m) = ∅ or m = maxIter;
step 12) returning E, the clean sample set after the noise has been deleted; the algorithm ends.
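Steps 8) and 9) can be illustrated with the following sketch, assuming a simple 1-NN labeler for U and inverse-distance weighting for the KNN confidence (the toy 1-D data, the choice of labeler, and the weighting scheme are all assumptions of this example). The filtered set E′ first labels the unlabeled set U; the newly labeled set then scores every sample of E by the weighted agreement of its neighbors with the sample's stored label:

```python
def one_nn_label(train, x):
    # step 8): label an unlabeled point with a model trained on E' (1-NN here)
    return min(train, key=lambda t: abs(t[0] - x))[1]

def weighted_knn_confidence(train, sample, k=5, eps=1e-6):
    # step 9): fraction of inverse-distance weight carried by neighbors
    # that agree with the sample's stored label
    x, y = sample
    neighbors = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    weights = [1.0 / (abs(xv - x) + eps) for xv, _ in neighbors]
    agree = sum(w for w, (_, yv) in zip(weights, neighbors) if yv == y)
    return agree / sum(weights)

E_prime = [(0.0, 0), (0.1, 0), (0.2, 0), (1.0, 1), (1.1, 1), (1.2, 1)]
U = [0.05, 0.15, 0.95, 1.05, 0.25, 1.15]            # unlabeled points
labeled_U = [(x, one_nn_label(E_prime, x)) for x in U]

E = E_prime + [(0.3, 1)]                            # (0.3, 1) is suspect
conf = [weighted_knn_confidence(labeled_U, s) for s in E]
```

The suspect sample (0.3, 1) sits among class-0 neighbors, so almost all of its KNN weight disagrees with its stored label and its confidence is very low.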
The test results on 2 data sets from the UCI repository, and the performance improvement over existing label noise identification algorithms, are described below. The proposed identification algorithm is compared with the currently popular multiple-voting identification algorithms MFCF and CFMF and the semi-supervised algorithms CFAUD and MFAUD (for MFCF and CFMF see reference [8]; for CFAUD and MFAUD see reference [6]). Because the raw UCI data contain neither label noise nor unlabeled data, for each selected data set a large portion of the samples have their labels removed to serve as the unlabeled data set, and noise is artificially added to the remaining labeled data at different noise ratios: 10%, 20%, 30% and 40%. In this example, the performance of a label-noise detection algorithm is measured by the number of mislabeling errors. The error count has two parts: noisy data wrongly diagnosed as good, denoted E1, and good data wrongly diagnosed as noise, denoted E2. The smaller the value of E1 + E2, the higher the algorithm's accuracy.
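The E1 + E2 error count can be computed directly from the definitions above; the index sets in this sketch are hypothetical:

```python
def error_count(true_noise, detected_noise):
    """E1: noisy samples missed; E2: clean samples wrongly flagged."""
    e1 = len(true_noise - detected_noise)   # noise diagnosed as good
    e2 = len(detected_noise - true_noise)   # good diagnosed as noise
    return e1, e2, e1 + e2

true_noise = {2, 5, 9}     # indices where noise was injected
detected = {2, 5, 7}       # indices the detector flagged
e1, e2, total = error_count(true_noise, detected)
```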
TABLE 1 data set
Data set Number of samples Number of features
Breast 683 9
Credit-screening 653 14
The parameters are set as follows: numPartition = 3, numClassifier = 3 (the three classification algorithms are naive Bayes, decision tree, and nearest neighbor), maxIter = 100, numVote = 5, and ConfidenceThreshold = 0.4.
TABLE 2 Breast data set, results at 10% noise ratio
TABLE 3 Breast data set, results at 20% noise ratio
TABLE 4 Breast data set, results at 30% noise ratio
TABLE 5 Breast data set, results at 40% noise ratio
TABLE 6 Credit-screening data set, results at 10% noise ratio
TABLE 7 Credit-screening data set, results at 20% noise ratio
TABLE 8 Credit-screening data set, results at 30% noise ratio
TABLE 9 Credit-screening data set, results at 40% noise ratio
(The numeric contents of Tables 2-9 are rendered as images in the original publication and are not reproduced here.)
As shown in Tables 2-9 above, on the two data sets used in the experiments and at the different noise ratios, the proposed algorithm is superior to the compared algorithms in stability.
In summary, the above embodiments are only used for illustrating the technical solutions of the present invention, and are not used for limiting the protection scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall be covered within the scope of the claims of the present invention.

Claims (4)

1. An iterative label noise identification method based on double information of supervised learning and semi-supervised learning is characterized by comprising the following steps:
step 1) determining algorithm input variables, including a sample set L to be processed and an unlabeled sample set U, a maximum iteration number maxIter, a multiple-voting number numVote, a final noise-identification voting confidence numFinalConfidence, a random partition number numPartition, a classifier number numClassifier, a noise-identification voting confidence numConfidence, and a confidence threshold ConfidenceThreshold for judging noise; initializing the multiple-voting counter t = 1, the iteration counter m = 1, and the sample set E = L;
step 2) randomly dividing E into numPartition equally sized subsets Ei, i = 1, …, numPartition; initializing the parameter i = 1;
step 3) taking the samples of E that are not in Ei as training data, selecting numClassifier different classification algorithms, and training numClassifier different classifiers H1, H2, …, HnumClassifier;
step 4) using H1, H2, …, HnumClassifier to classify the samples in Ei, calculating the numConfidence of each sample, and storing the results in a table;
step 5) iteratively executing steps 2) to 4), adding 1 to i after each iteration until i = numPartition, then stopping; after the voting is finished, the numConfidence values of all samples have been calculated and stored in a table;
in steps 4) and 5), each element in the table corresponds to a sample in the sample set E to be processed and the probability numConfidence that it is correctly labeled;
step 6) iteratively executing steps 2) to 5), adding 1 to t after each iteration until t = numVote, and generating numVote tables;
step 7) comprehensively analyzing the numVote tables, aggregating the numConfidence of each sample to obtain numFinalConfidence(e) for each sample e, and storing it in a table; initializing En, and storing the samples whose numFinalConfidence(e) is smaller than the predetermined ConfidenceThreshold in En as suspicious samples;
the ConfidenceThreshold value in step 7) is selected as a value between 0.1 and 0.4;
step 8) taking E′ = E − En as a training set, generating numClassifier classifiers based on the numClassifier classification methods, and labeling the unlabeled sample set U with these classifiers to obtain a labeled sample set;
step 9) taking the data set E as the test set and the labeled data set as the training set, calculating numFinalConfidence(e)′ of each sample with a weighted KNN algorithm, and storing numFinalConfidence(e)′ in a table;
step 10) adding the values of the same samples in the table containing numFinalConfidence(e) and the table containing numFinalConfidence(e)′ to obtain a final confidence table, and regarding samples whose value is smaller than the specified threshold ConfidenceThreshold as noise; letting the noise detected in the m-th iteration be Enoise(m), then setting E = E − Enoise(m);
step 11) if Enoise(m) ≠ ∅, iteratively executing steps 2) to 10), adding 1 to m after each iteration, until Enoise(m) = ∅ or m = maxIter;
step 12) returning E, the clean sample set after the noise has been deleted; the method ends.
2. The iterative label noise identification method based on double information of supervised learning and semi-supervised learning as claimed in claim 1, wherein in step 3), numClassifier is selected as an odd number.
3. The iterative label noise identification method based on double information of supervised learning and semi-supervised learning as claimed in claim 2, wherein numClassifier is set to 3.
4. The iterative label noise identification method based on double information of supervised learning and semi-supervised learning as claimed in claim 1, wherein the ConfidenceThreshold value in step 7) is calculated and optimized from separate validation samples, specifically: a) estimating the noise ratio of the noise data to be processed from prior knowledge; b) adding random noise to the validation samples; c) traversing the candidate ConfidenceThreshold values and calculating the identification accuracy of the method on the noise in the validation samples under each value; and d) selecting the ConfidenceThreshold with the highest identification accuracy.
CN201710315861.2A 2017-05-02 2017-05-02 Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning Expired - Fee Related CN107292330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710315861.2A CN107292330B (en) 2017-05-02 2017-05-02 Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning


Publications (2)

Publication Number Publication Date
CN107292330A CN107292330A (en) 2017-10-24
CN107292330B true CN107292330B (en) 2021-08-06

Family

ID=60094401

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710315861.2A Expired - Fee Related CN107292330B (en) 2017-05-02 2017-05-02 Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning

Country Status (1)

Country Link
CN (1) CN107292330B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107862386A (en) * 2017-11-03 2018-03-30 郑州云海信息技术有限公司 A kind of method and device of data processing
CN108021931A (en) * 2017-11-20 2018-05-11 阿里巴巴集团控股有限公司 A kind of data sample label processing method and device
CN108021940B (en) * 2017-11-30 2023-04-18 中国银联股份有限公司 Data classification method and system based on machine learning
US20190244138A1 (en) * 2018-02-08 2019-08-08 Apple Inc. Privatized machine learning using generative adversarial networks
CN110163376B (en) * 2018-06-04 2023-11-03 腾讯科技(深圳)有限公司 Sample detection method, media object identification method, device, terminal and medium
CN108985365B (en) * 2018-07-05 2021-10-01 重庆大学 Multi-source heterogeneous data fusion method based on deep subspace switching ensemble learning
CN109213656A (en) * 2018-07-23 2019-01-15 武汉智领云科技有限公司 A kind of interactive mode big data dysgnosis detection system and method
JP7299002B2 (en) * 2018-08-23 2023-06-27 ファナック株式会社 Discriminator and machine learning method
EP3807821A1 (en) 2018-09-28 2021-04-21 Apple Inc. Distributed labeling for supervised learning
CN109800785B (en) * 2018-12-12 2021-12-28 中国科学院信息工程研究所 Data classification method and device based on self-expression correlation
CN110189305B (en) * 2019-05-14 2023-09-22 上海大学 Automatic analysis method for multitasking tongue picture
CN110363228B (en) * 2019-06-26 2022-09-06 南京理工大学 Noise label correction method
CN110633758A (en) * 2019-09-20 2019-12-31 四川长虹电器股份有限公司 Method for detecting and locating cancer region aiming at small sample or sample unbalance
US11853908B2 (en) 2020-05-13 2023-12-26 International Business Machines Corporation Data-analysis-based, noisy labeled and unlabeled datapoint detection and rectification for machine-learning
CN111784595B (en) * 2020-06-10 2023-08-29 北京科技大学 Dynamic tag smooth weighting loss method and device based on historical record
CN113269258A (en) * 2021-05-27 2021-08-17 郑州大学 Semi-supervised learning label noise defense algorithm based on AdaBoost
CN113887742A (en) * 2021-10-26 2022-01-04 重庆邮电大学 Data classification method and system based on support vector machine
CN114218872B (en) * 2021-12-28 2023-03-24 浙江大学 DBN-LSTM semi-supervised joint model-based residual service life prediction method
CN117421657A (en) * 2023-10-27 2024-01-19 江苏开放大学(江苏城市职业学院) Sampling and learning method and system for noisy labels based on oversampling strategy

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105046236A (en) * 2015-08-11 2015-11-11 南京航空航天大学 Iterative tag noise recognition algorithm based on multiple voting

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9053391B2 (en) * 2011-04-12 2015-06-09 Sharp Laboratories Of America, Inc. Supervised and semi-supervised online boosting algorithm in machine learning framework
CN103886330B (en) * 2014-03-27 2017-03-01 西安电子科技大学 Sorting technique based on semi-supervised SVM integrated study
CN104318242A (en) * 2014-10-08 2015-01-28 中国人民解放军空军工程大学 High-efficiency SVM active half-supervision learning algorithm
CN104598813B (en) * 2014-12-09 2017-05-17 西安电子科技大学 Computer intrusion detection method based on integrated study and semi-supervised SVM
CA2977262A1 (en) * 2015-02-23 2016-09-01 Cellanyx Diagnostics, Llc Cell imaging and analysis to differentiate clinically relevant sub-populations of cells
CN105930411A (en) * 2016-04-18 2016-09-07 苏州大学 Classifier training method, classifier and sentiment classification system
CN106096622B (en) * 2016-04-26 2019-11-08 北京航空航天大学 Semi-supervised Classification of hyperspectral remote sensing image mask method
CN106294593B (en) * 2016-07-28 2019-04-09 浙江大学 In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN106294590B (en) * 2016-07-29 2019-05-31 重庆邮电大学 A kind of social networks junk user filter method based on semi-supervised learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105046236A (en) * 2015-08-11 2015-11-11 南京航空航天大学 Iterative tag noise recognition algorithm based on multiple voting

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Co-training semi-supervised active learning algorithm with noise filtering; Zhan Yongzhao et al.; Pattern Recognition and Artificial Intelligence; 2009-10-31; Vol. 22, No. 5; Abstract, Sections 1-5 *
Research on label noise based on ensemble semi-supervised learning; Jin Long et al.; China Masters' Theses Full-text Database, Information Science and Technology; 2013-12-15; Vol. 2013, No. S2; pp. I140-91 *

Also Published As

Publication number Publication date
CN107292330A (en) 2017-10-24

Similar Documents

Publication Publication Date Title
CN107292330B (en) Iterative label noise identification algorithm based on double information of supervised learning and semi-supervised learning
WO2017084408A1 (en) Method and system for checking cargo
CN112756759B (en) Spot welding robot workstation fault judgment method
CN111009321A (en) Application method of machine learning classification model in juvenile autism auxiliary diagnosis
CN111105041B (en) Machine learning method and device for intelligent data collision
CN114281809B (en) Multi-source heterogeneous data cleaning method and device
CN113095229B (en) Self-adaptive pedestrian re-identification system and method for unsupervised domain
Wen et al. Comparison of four machine learning techniques for the prediction of prostate cancer survivability
Shoohi et al. DCGAN for Handling Imbalanced Malaria Dataset based on Over-Sampling Technique and using CNN.
Jin et al. Confusion Graph: Detecting Confusion Communities in Large Scale Image Classification.
CN113516638A (en) Neural network internal feature importance visualization analysis and feature migration method
CN113674862A (en) Acute renal function injury onset prediction method based on machine learning
CN114817856B (en) Beam-pumping unit fault diagnosis method based on structural information retention domain adaptation network
CN116741393A (en) Medical record-based thyroid disease dataset classification model construction method, classification device and computer-readable medium
CN114757433A (en) Method for quickly identifying relative risk of drinking water source antibiotic resistance
CN108154189A (en) Grey relational cluster method based on LDTW distances
CN117034110A (en) Stem cell exosome detection method based on deep learning
CN110502669A (en) The unsupervised chart dendrography learning method of lightweight and device based on the side N DFS subgraph
CN113392086B (en) Medical database construction method, device and equipment based on Internet of things
Nurmalasari et al. Classification for Papaya Fruit Maturity Level with Convolutional Neural Network
CN115345248A (en) Deep learning-oriented data depolarization method and device
Prajapati et al. Handling Missing Values: Application to University Data Set
CN109308936B (en) Grain crop production area identification method, grain crop production area identification device and terminal identification equipment
Ndung'u Data Preparation for Machine Learning Modelling
CN104463205B (en) Data classification method based on chaos depth wavelet network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210806