CN111353553A - Method and device for cleaning error labeling data, computer equipment and storage medium - Google Patents
Method and device for cleaning error labeling data, computer equipment and storage medium
- Publication number
- CN111353553A CN111353553A CN202010182397.6A CN202010182397A CN111353553A CN 111353553 A CN111353553 A CN 111353553A CN 202010182397 A CN202010182397 A CN 202010182397A CN 111353553 A CN111353553 A CN 111353553A
- Authority
- CN
- China
- Prior art keywords
- sample
- samples
- logloss
- classification model
- error
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
Abstract
The invention discloses a method and a device for cleaning mislabeled data, a computer device and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; using the trained classification model to run inference on each sample to obtain its prediction probability p_ij, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. logloss_fl with N = 1; sorting the logloss_fl values of the samples in descending order; and taking a number of top-ranked samples and marking them as error samples. The invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.
Description
Technical Field
The present invention relates to the field of data processing, and in particular to a method and an apparatus for cleaning mislabeled data, a computer device, and a storage medium.
Background
At present, many artificial intelligence / deep learning models are supervised models built on accurately labeled data. Obtaining accurately labeled data is costly, and many labeled datasets inevitably contain some mislabeled data, which significantly degrades the trained model.
In the prior art, the accuracy of labeled data is mostly checked by manual cross-inspection, which is inefficient and of limited accuracy.
Disclosure of Invention
The invention aims to provide a method, an apparatus, a computer device and a storage medium for cleaning mislabeled data, so as to solve the problems that in the prior art data verification is inefficient and its accuracy needs to be improved.
The embodiment of the invention provides a method for cleaning mislabeled data, which comprises the following steps:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
using the trained classification model to run inference on each sample to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample, namely logloss_fl with N = 1:
logloss_fl = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · p_ij^γ · log(p_ij)
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl values of the samples in descending order;
and taking a number of top-ranked samples and marking them as error samples.
Preferably, γ is greater than 1.
Preferably, γ is greater than 1 and less than 2.
Preferably, γ is 1.5.
Preferably, taking a number of top-ranked samples and marking them as error samples comprises:
taking the samples whose logloss_fl is greater than a threshold and marking them as error samples.
Preferably, after the top-ranked samples are taken and marked as error samples, the method further comprises:
re-labeling the error samples selected by the classification model.
Preferably, the classification model is a neural network multi-classification model.
The embodiment of the present invention further provides a device for cleaning mislabeled data, wherein the device comprises:
a model training unit for training a deep learning classification model, the loss function of the classification model being logloss_fl;
a sample inference unit for running inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, i.e. logloss_fl with N = 1;
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl values of the samples in descending order;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
The embodiment of the present invention further provides a computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for cleaning mislabeled data described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for cleaning mislabeled data described above.
The embodiment of the invention provides a method and a device for cleaning mislabeled data, a computer device and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; using the trained classification model to run inference on each sample to obtain its prediction probability p_ij, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. logloss_fl with N = 1, wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that it is not j, and γ is a hyperparameter; sorting the logloss_fl values of the samples in descending order; and taking a number of top-ranked samples and marking them as error samples. The embodiment of the invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart illustrating a method for cleaning error marked data according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an apparatus for cleaning error labeling data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for cleaning error marked data according to an embodiment of the present invention, the method including steps S101 to S105:
S101, training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
S102, running inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
S103, calculating the loss function logloss_fl of each sample, namely logloss_fl with N = 1:
logloss_fl = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · p_ij^γ · log(p_ij)
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
S104, sorting the logloss_fl values of the samples in descending order;
S105, taking a number of top-ranked samples and marking them as error samples.
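The five steps can be sketched end to end as follows. This is a minimal illustration, not the patent's implementation: the trained classifier itself is elided (the `probs` list stands in for its inference output), and the sample data, γ = 1.5 and top-k of 1 are illustrative assumptions.

```python
import math

GAMMA = 1.5  # hyperparameter γ; the description prefers values in (1, 2)

def sample_logloss_fl(probs, label, gamma=GAMMA):
    """Per-sample logloss_fl (N = 1): -sum_j y_ij * p_ij^γ * log(p_ij).
    Only the true-class term survives because y_ij is one-hot."""
    p = probs[label]
    return -(p ** gamma) * math.log(p)

def rank_suspect_samples(all_probs, labels, top_k):
    """S103-S105: compute logloss_fl per sample, sort in descending order,
    and return the indices of the top_k most suspect samples."""
    losses = [sample_logloss_fl(p, y) for p, y in zip(all_probs, labels)]
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:top_k]

# Illustrative inference output p_ij for 4 samples over C = 3 classes;
# sample 2 gets a low probability on its annotated class, so it is suspect.
probs = [
    [0.90, 0.05, 0.05],
    [0.05, 0.92, 0.03],
    [0.30, 0.10, 0.60],  # annotated as class 0, but the model prefers class 2
    [0.88, 0.07, 0.05],
]
labels = [0, 1, 0, 0]
suspects = rank_suspect_samples(probs, labels, top_k=1)
print(suspects)  # [2] -- sample 2 carries the largest logloss_fl
```

In practice the flagged indices would then be handed back for re-labeling, as described in step S105 and the embodiment that follows it.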
The embodiment of the invention improves the identification of error samples in deep learning. By means of the loss function, the loss of mislabeled samples (error samples) is greatly exposed at inference time, which identifies them; at the same time, the model pays more attention to correctly labeled samples (correct samples) during training, reducing the influence of the mislabeled samples on the model.
In step S101, a deep learning classification model is first trained; the classification model is a neural network multi-classification model, and its loss function is logloss_fl.
In step S102, inference is run on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C.
In the embodiment of the invention, the probability p_ij that the model assigns to a simple sample is large, while the probability p_ij assigned to a difficult sample is small.
In step S103, the loss function logloss_fl of each sample is calculated (regarding each sample as a set of one), i.e. the logloss_fl of each sample with N = 1:
logloss_fl = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · p_ij^γ · log(p_ij)
wherein γ is a hyperparameter; theoretically, any value in (1, +∞) is effective. In one embodiment, γ is greater than 1. Preferably, γ is greater than 1 and less than 2. Preferably, γ is 1.5.
In the prior art, when training a classification model, cross entropy (logarithmic loss) is generally used as the loss of the classification model; the logarithmic loss is defined as follows:
logloss = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · log(p_ij)
where N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, otherwise y_ij = 0, and p_ij is the probability predicted by the model that the i-th sample belongs to category j. logloss is the model loss; it measures the difference between the predicted probability distribution and the true probability distribution, and smaller values are better.
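The logarithmic loss above can be written directly. A minimal sketch, where `probs[i][j]` plays the role of p_ij and `labels[i]` is the annotated class of sample i, so the one-hot y_ij collapses to a single term per sample:

```python
import math

def logloss(probs, labels):
    """Cross entropy over N samples: -(1/N) * sum_i sum_j y_ij * log(p_ij).
    With one-hot y_ij, only the annotated class contributes per sample."""
    n = len(labels)
    return -sum(math.log(probs[i][labels[i]]) for i in range(n)) / n

# A confident correct prediction costs little; a 50/50 one costs ln 2.
print(round(logloss([[0.5, 0.5]], [0]), 4))  # 0.6931
```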
From the above, it can be seen that during training an error sample and a correct sample carry the same weight as individual samples. If the neural network has enough parameters, it can easily learn the error samples, which then become hard to distinguish at test time. The embodiment of the present invention therefore improves the loss function to obtain logloss_fl.
Compared with logloss, logloss_fl reduces the loss of a simple sample, i.e. one to which the model assigns a large probability p_ij, only slightly, while the loss of a difficult sample, to which the model assigns a small p_ij, is reduced much more. During training, the model will therefore tend to reduce the loss of simple samples and will, to some extent, ignore the difficult samples (which are mostly mislabeled). The classification model finally learned thus outputs a high probability for correctly labeled samples and a low probability for mislabeled samples, so after model inference the samples with relatively large loss are mostly mislabeled. In this way, most mislabeled samples can be screened out by the loss function.
In the embodiment of the invention, the mislabeled samples make up only a small part of all samples, so the model learns little from them and their output p_ij tends to be small by itself; in addition, a mislabeled sample contradicts the correctly labeled samples to some extent, so the model is uncertain about it and its output p_ij will be relatively small. As training proceeds, logloss_fl increasingly ignores the samples with small p_ij, so the error samples become harder and harder for the model to fit.
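The down-weighting behavior described here can be made concrete by comparing the two losses on a single true-class probability. This sketch assumes the focal-style weighting -p^γ·log(p) for the true-class term (reconstructed from the numerical example in this description) and γ = 1.5:

```python
import math

def logloss_single(p):
    """Plain log loss for the true class of one sample."""
    return -math.log(p)

def logloss_fl_single(p, gamma=1.5):
    """logloss_fl for one sample: the true-class term weighted by p**gamma,
    so the smaller the model's probability, the more the loss is suppressed."""
    return -(p ** gamma) * math.log(p)

for p in (0.9, 0.6, 0.3):
    kept = logloss_fl_single(p) / logloss_single(p)  # equals p**1.5
    print(f"p={p}: keeps {kept:.2%} of its logloss")
```

A sample the model is confident about (p = 0.9) keeps most of its loss during training, while an uncertain one (p = 0.3) is largely ignored, which is exactly the effect the description attributes to logloss_fl.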
For example, for sample i with true category j, the values of logloss and logloss_fl when p_ij is large (0.9) and small (0.6) are as follows (γ = 1.5):

p_ij | logloss | logloss_fl
---|---|---
0.9 | 0.1054 | 0.0900
0.6 | 0.5108 | 0.2374
As shown above, when p_ij is large, the reduction from logloss to logloss_fl is smaller than when p_ij is small; logloss_fl therefore ignores the large-p_ij cases only slightly, while largely ignoring the small-p_ij cases.
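The two rows of the table above can be reproduced numerically. A quick check, assuming γ = 1.5 and the focal-style weighting -p^γ·log(p) for the true-class term:

```python
import math

def losses(p, gamma=1.5):
    """Return (logloss, logloss_fl) for a single sample's true class."""
    ll = -math.log(p)        # plain log loss
    fl = (p ** gamma) * ll   # logloss_fl = -p^γ · log(p)
    return round(ll, 4), round(fl, 4)

print(losses(0.9))  # (0.1054, 0.09)
print(losses(0.6))  # (0.5108, 0.2374)
```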
In step S104, the logloss_fl values of the samples are sorted in descending order.
In one embodiment, step S105 comprises:
taking the samples whose logloss_fl is greater than a threshold and marking them as error samples.
In this embodiment, a number of top-ranked samples are taken out; these are the samples that the model considers most likely to be mislabeled.
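Selecting by threshold instead of by rank position can be sketched as follows; the threshold value here is an illustrative assumption, not one fixed by the patent:

```python
def flag_by_threshold(losses_fl, threshold=0.2):
    """Return the indices of samples whose logloss_fl exceeds the threshold;
    these are marked as suspected error samples for re-labeling."""
    return [i for i, loss in enumerate(losses_fl) if loss > threshold]

print(flag_by_threshold([0.09, 0.24, 0.05, 0.31]))  # [1, 3]
```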
In an embodiment, after the step S105, the method further includes:
and re-labeling the error samples selected by the classification model.
By applying this loss function to identify and discover mislabeled samples, the embodiment of the invention greatly improves the recall and accuracy of discovering mislabeled samples and greatly reduces manual inspection.
As shown in fig. 2, an embodiment of the invention further provides an apparatus 200 for cleaning mislabeled data, which comprises:
a model training unit 201 for training a deep learning classification model, the loss function of the classification model being logloss_fl;
a sample inference unit 202 for running inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit 203 for calculating the loss function logloss_fl of each sample, i.e. logloss_fl with N = 1;
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit 204 for sorting the logloss_fl values of the samples in descending order;
and a marking unit 205 for taking out a number of top-ranked samples and marking them as error samples.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, can implement the method provided by the above-described embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The invention also provides a computer device, which may include a memory and a processor, wherein the memory stores a computer program, and the processor may implement the method provided by the above embodiment when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Claims (10)
1. A method for cleaning mislabeled data, comprising:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
using the trained classification model to run inference on each sample to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample, namely logloss_fl with N = 1:
logloss_fl = -(1/N) · Σ_{i=1}^{N} Σ_{j=1}^{C} y_ij · p_ij^γ · log(p_ij)
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl values of the samples in descending order;
and taking a number of top-ranked samples and marking them as error samples.
2. The method for cleaning mislabeled data according to claim 1, wherein γ is greater than 1.
3. The method for cleaning mislabeled data according to claim 2, wherein γ is greater than 1 and less than 2.
4. The method for cleaning mislabeled data according to claim 3, wherein γ is 1.5.
5. The method for cleaning mislabeled data according to claim 1, wherein taking a number of top-ranked samples and marking them as error samples comprises:
taking the samples whose logloss_fl is greater than a threshold and marking them as error samples.
6. The method for cleaning mislabeled data according to claim 1, wherein after the top-ranked samples are taken and marked as error samples, the method further comprises:
re-labeling the error samples selected by the classification model.
7. The method for cleaning mislabeled data according to claim 1, wherein the classification model is a neural network multi-classification model.
8. An apparatus for cleaning mislabeled data, comprising:
a model training unit for training a deep learning classification model, the loss function of the classification model being logloss_fl;
a sample inference unit for running inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, i.e. logloss_fl with N = 1;
wherein N is the number of samples, C is the number of categories, y_ij = 1 indicates that the category of the i-th sample is j, y_ij = 0 indicates that the category of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl values of the samples in descending order;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for cleaning mislabeled data according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method for cleaning mislabeled data according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010182397.6A CN111353553A (en) | 2020-03-16 | 2020-03-16 | Method and device for cleaning error labeling data, computer equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010182397.6A CN111353553A (en) | 2020-03-16 | 2020-03-16 | Method and device for cleaning error labeling data, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111353553A true CN111353553A (en) | 2020-06-30 |
Family
ID=71197621
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010182397.6A Pending CN111353553A (en) | 2020-03-16 | 2020-03-16 | Method and device for cleaning error labeling data, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111353553A (en) |
-
2020
- 2020-03-16 CN CN202010182397.6A patent/CN111353553A/en active Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116204769A (en) * | 2023-03-06 | 2023-06-02 | 深圳市乐易网络股份有限公司 | Data cleaning method, system and storage medium based on data classification and identification |
CN116204769B (en) * | 2023-03-06 | 2023-12-05 | 深圳市乐易网络股份有限公司 | Data cleaning method, system and storage medium based on data classification and identification |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106951925B (en) | Data processing method, device, server and system | |
CN110704732B (en) | Cognitive diagnosis based time-sequence problem recommendation method and device | |
CN110928764B (en) | Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium | |
JP6973625B2 (en) | Learning equipment, learning methods and learning programs | |
CN110889463A (en) | Sample labeling method and device, server and machine-readable storage medium | |
CN110852755A (en) | User identity identification method and device for transaction scene | |
CN111506598B (en) | Fault discrimination method, system and device based on small sample self-learning fault migration | |
Rady et al. | Time series forecasting using tree based methods | |
CN108470194B (en) | Feature screening method and device | |
CN111611486A (en) | Deep learning sample labeling method based on online education big data | |
US11132790B2 (en) | Wafer map identification method and computer-readable recording medium | |
CN113988044B (en) | Method for judging error question reason type | |
CN113762401A (en) | Self-adaptive classification task threshold adjusting method, device, equipment and storage medium | |
CN111353553A (en) | Method and device for cleaning error labeling data, computer equipment and storage medium | |
JP5684084B2 (en) | Misclassification detection apparatus, method, and program | |
CN112115996B (en) | Image data processing method, device, equipment and storage medium | |
CN112015861A (en) | Intelligent test paper algorithm based on user historical behavior analysis | |
CN110334080B (en) | Knowledge base construction method for realizing autonomous learning | |
CN111414930B (en) | Deep learning model training method and device, electronic equipment and storage medium | |
Alasalmi et al. | Classification uncertainty of multiple imputed data | |
US20210397960A1 (en) | Reliability evaluation device and reliability evaluation method | |
CN115660060A (en) | Model training method, detection method, device, equipment and storage medium | |
KR102303111B1 (en) | Training Data Quality Assessment Technique for Machine Learning-based Software | |
CN108959594B (en) | Capacity level evaluation method and device based on time-varying weighting | |
CN114743048A (en) | Method and device for detecting abnormal straw picture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |