CN111353553A - Method and device for cleaning error labeling data, computer equipment and storage medium

Info

Publication number
CN111353553A
Authority
CN
China
Prior art keywords
sample
samples
logloss
classification model
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010182397.6A
Other languages
Chinese (zh)
Inventor
黄鸿康
涂天牧
刘新宇
赵寒枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xinlian Credit Reporting Co ltd
Original Assignee
Shenzhen Xinlian Credit Reporting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xinlian Credit Reporting Co ltd filed Critical Shenzhen Xinlian Credit Reporting Co ltd
Priority to CN202010182397.6A priority Critical patent/CN111353553A/en
Publication of CN111353553A publication Critical patent/CN111353553A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention discloses a method and a device for cleaning mislabeled data, computer equipment and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1; sorting the logloss_fl values of the samples from large to small; and taking a number of top-ranked samples and marking them as error samples. The invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.

Description

Method and device for cleaning error labeling data, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular to a method and an apparatus for cleaning mislabeled data, a computer device, and a storage medium.
Background
At present, many artificial intelligence / deep learning models are supervised models built on accurately labeled data. Acquiring accurately labeled data is very costly, and labeled datasets inevitably contain some mislabeled data, which greatly affects the trained model.
In the prior art, checking the accuracy of labeled data mostly relies on manual cross-checking, which is inefficient and whose accuracy needs improvement.
Disclosure of Invention
The invention aims to provide a method, an apparatus, computer equipment and a storage medium for cleaning mislabeled data, so as to solve the problems in the prior art that data verification is inefficient and its accuracy needs improvement.
The embodiment of the invention provides a method for cleaning error marking data, which comprises the following steps:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl of each sample from large to small;
and taking a number of top-ranked samples and marking them as error samples.
Preferably, γ is greater than 1.
Preferably, γ is greater than 1 and less than 2.
Preferably, γ is 1.5.
Preferably, taking a number of top-ranked samples and marking them as error samples comprises:
taking samples with logloss_fl greater than a threshold and marking them as error samples.
Preferably, after taking the top-ranked samples and marking them as error samples, the method further comprises:
and re-labeling the error samples selected by the classification model.
Preferably, the classification model is a neural network multi-classification model.
The embodiment of the present invention further provides an apparatus for cleaning mislabeled data, the apparatus comprising:
a model training unit for training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit for performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl of each sample from large to small;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for cleaning mislabeled data described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for cleaning mislabeled data described above.
The embodiment of the invention provides a method and a device for cleaning mislabeled data, computer equipment and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter; sorting the logloss_fl of each sample from large to small; and taking a number of top-ranked samples and marking them as error samples. The embodiment of the invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for cleaning mislabeled data according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an apparatus for cleaning mislabeled data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for cleaning mislabeled data according to an embodiment of the present invention, the method including steps S101 to S105:
S101, training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
S102, performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
S103, calculating the loss function logloss_fl of each sample to obtain the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
S104, sorting the logloss_fl of each sample from large to small;
and S105, taking a number of top-ranked samples and marking them as error samples.
The embodiment of the invention improves how mislabeled samples are identified in deep learning. By means of the loss function, the loss of mislabeled samples (error samples) is greatly exposed during inference, which achieves the effect of identifying them; meanwhile, correctly labeled samples (correct samples) receive more attention during model training, reducing the influence of mislabeled samples on the model.
In step S101, a deep learning classification model is first trained; the classification model is a neural network multi-classification model, and the loss function of the classification model is logloss_fl.
In step S102, inference is performed on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C.
Here, the probability p_ij that the model gives a simple sample is large, and the probability p_ij that the model gives a difficult sample is small.
In step S103, the loss function logloss_fl of each sample is calculated (each sample is regarded as its own set), obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where γ is a hyperparameter; theoretically, any value in (1, +∞) is effective. In one embodiment, γ is greater than 1. Preferably, γ is greater than 1 and less than 2. Preferably, γ is 1.5.
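For concreteness, the per-sample loss can be written directly from the formula above. The following is a minimal PyTorch sketch under our own naming (the function name `logloss_fl` and the arguments `probs` and `labels` are illustrative, not prescribed by the patent); it assumes `probs` already holds softmax probabilities:

```python
import torch

def logloss_fl(probs: torch.Tensor, labels: torch.Tensor, gamma: float = 1.5) -> torch.Tensor:
    """Per-sample logloss_fl, i.e. the N = 1 case of the formula above.

    probs:  (N, C) predicted class probabilities (rows sum to 1)
    labels: (N,)   integer class indices, so y_ij = 1 only at j = labels[i]
    """
    # Only the y_ij = 1 term survives the sum over j, so pick p_ij of the annotated class.
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return -(p_true ** gamma) * torch.log(p_true)  # shape (N,), one loss value per sample

# During training (step S101) the batch loss is simply the mean:
# loss = logloss_fl(torch.softmax(logits, dim=1), labels).mean()
```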
In the prior art, when training a classification model, cross entropy (logarithmic loss) is generally used as the loss of the classification model, and the logarithmic loss is defined as follows:
$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 denotes that the class of the i-th sample is j (otherwise y_ij = 0), and p_ij is the model's predicted probability that the i-th sample belongs to class j. logloss is the model loss, used to measure the difference between the predicted probability distribution and the true probability distribution; smaller values are better.
From the above it can be seen that, during training, an error sample and a correct sample carry the same weight as individual samples. If the neural network has enough parameters, the error samples can easily be learned (memorized), and they then become difficult to distinguish during testing. Therefore, the embodiment of the present invention improves the loss function to obtain logloss_fl.
With logloss_fl, a simple sample, i.e. one to which the model gives a large probability p_ij, keeps most of its loss in the final computation, whereas a difficult sample, to which the model gives a small probability p_ij, has its computed logloss_fl strongly suppressed. Therefore, the model tends to reduce the loss of simple samples first, while difficult samples (mostly mislabeled samples) are ignored by the model to some extent. As a result, the classification model obtained by the final learning outputs a higher probability for correctly labeled samples and a lower probability for mislabeled samples. Therefore, after model inference, the samples with relatively large losses are mostly mislabeled samples, and most mislabeled samples can be screened out by means of this loss function.
In the embodiment of the invention, mislabeled samples form only a small portion of the total samples, and the model itself learns them less, so their output p_ij tends to be small to begin with; in addition, a mislabeled sample contradicts the correctly labeled samples to a certain extent, making the model uncertain, so its output p_ij will be relatively small. During training, logloss_fl increasingly ignores the samples with comparatively small p_ij, so the error samples become harder and harder for the model to learn.
For example, for a sample i whose true class is j, the values of logloss and logloss_fl when p_ij is large (0.9) and small (0.6) are as follows (γ = 1.5):

p_ij    logloss    logloss_fl
0.9     0.1054     0.0900
0.6     0.5108     0.2374
As can be seen above, when p_ij is larger, the reduction from logloss to logloss_fl is smaller than when p_ij is smaller. Hence, using logloss_fl, the cases with larger p_ij are barely ignored, while the cases with smaller p_ij are largely ignored.
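These numbers follow directly from the two formulas; the following quick check (our own snippet, not part of the patent) reproduces the table:

```python
import math

gamma = 1.5
for p in (0.9, 0.6):
    logloss = -math.log(p)               # cross-entropy term for the true class
    logloss_fl = (p ** gamma) * logloss  # reweighted by p^gamma as in logloss_fl
    print(f"p_ij={p}: logloss={logloss:.4f}, logloss_fl={logloss_fl:.4f}")

# p_ij=0.9: logloss=0.1054, logloss_fl=0.0900
# p_ij=0.6: logloss=0.5108, logloss_fl=0.2374
```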
In step S104, the logloss_fl of each sample is sorted from large to small.
in one embodiment, the step S105 includes:
samples with loglos _ fl greater than the threshold are taken and labeled as erroneous samples.
In this embodiment, a number of top-ranked samples are taken out; these are the samples that the model considers most likely to be mislabeled.
In an embodiment, after the step S105, the method further includes:
and re-labeling the error samples selected by the classification model.
The embodiment of the invention identifies and discovers mislabeled samples by applying this loss function, greatly improving the recall and accuracy of discovering mislabeled samples and greatly reducing manual inspection.
As shown in fig. 2, an embodiment of the invention further provides an apparatus 200 for cleaning mislabeled data, which includes:
a model training unit 201, configured to train a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit 202, configured to perform inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit 203, configured to calculate the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit 204, configured to sort the logloss_fl of each sample from large to small;
and a marking unit 205, configured to take out a number of top-ranked samples and mark them as error samples.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, can implement the method provided by the above-described embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The invention also provides a computer device, which may include a memory and a processor, wherein the memory stores a computer program, and the processor may implement the method provided by the above embodiment when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for cleaning mislabeled data, comprising:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample to obtain the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl of each sample from large to small;
and taking a number of top-ranked samples and marking them as error samples.
2. The method for cleaning mislabeled data according to claim 1, wherein γ is greater than 1.
3. The method for cleaning mislabeled data according to claim 2, wherein γ is greater than 1 and less than 2.
4. The method for cleaning mislabeled data according to claim 3, wherein γ is 1.5.
5. The method for cleaning mislabeled data according to claim 1, wherein taking a number of top-ranked samples and marking them as error samples comprises:
taking samples with logloss_fl greater than a threshold and marking them as error samples.
6. The method for cleaning mislabeled data according to claim 1, wherein after taking the top-ranked samples and marking them as error samples, the method further comprises:
and re-labeling the error samples selected by the classification model.
7. The method for cleaning mislabeled data according to claim 1, wherein the classification model is a neural network multi-classification model.
8. An apparatus for cleaning mislabeled data, comprising:
a model training unit for training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit for performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl of each sample from large to small;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for cleaning mislabeled data according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method for cleaning mislabeled data according to any one of claims 1 to 7.
CN202010182397.6A 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium Pending CN111353553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010182397.6A CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010182397.6A CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111353553A (en) 2020-06-30

Family

ID=71197621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010182397.6A Pending CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111353553A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204769A (en) * 2023-03-06 2023-06-02 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification
CN116204769B (en) * 2023-03-06 2023-12-05 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification

Similar Documents

Publication Publication Date Title
CN106951925B (en) Data processing method, device, server and system
CN110704732B (en) Cognitive diagnosis based time-sequence problem recommendation method and device
CN110928764B (en) Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium
JP6973625B2 (en) Learning equipment, learning methods and learning programs
CN110889463A (en) Sample labeling method and device, server and machine-readable storage medium
CN110852755A (en) User identity identification method and device for transaction scene
CN111506598B (en) Fault discrimination method, system and device based on small sample self-learning fault migration
Rady et al. Time series forecasting using tree based methods
CN108470194B (en) Feature screening method and device
CN111611486A (en) Deep learning sample labeling method based on online education big data
US11132790B2 (en) Wafer map identification method and computer-readable recording medium
CN113988044B (en) Method for judging error question reason type
CN113762401A (en) Self-adaptive classification task threshold adjusting method, device, equipment and storage medium
CN111353553A (en) Method and device for cleaning error labeling data, computer equipment and storage medium
JP5684084B2 (en) Misclassification detection apparatus, method, and program
CN112115996B (en) Image data processing method, device, equipment and storage medium
CN112015861A (en) Intelligent test paper algorithm based on user historical behavior analysis
CN110334080B (en) Knowledge base construction method for realizing autonomous learning
CN111414930B (en) Deep learning model training method and device, electronic equipment and storage medium
Alasalmi et al. Classification uncertainty of multiple imputed data
US20210397960A1 (en) Reliability evaluation device and reliability evaluation method
CN115660060A (en) Model training method, detection method, device, equipment and storage medium
KR102303111B1 (en) Training Data Quality Assessment Technique for Machine Learning-based Software
CN108959594B (en) Capacity level evaluation method and device based on time-varying weighting
CN114743048A (en) Method and device for detecting abnormal straw picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination