CN111353553A - Method and device for cleaning error labeling data, computer equipment and storage medium

Info

Publication number
CN111353553A
Authority
CN
China
Prior art keywords
sample
samples
logloss
classification model
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010182397.6A
Other languages
Chinese (zh)
Inventor
黄鸿康
涂天牧
刘新宇
赵寒枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xinlian Credit Reporting Co ltd
Original Assignee
Shenzhen Xinlian Credit Reporting Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xinlian Credit Reporting Co ltd filed Critical Shenzhen Xinlian Credit Reporting Co ltd
Priority to CN202010182397.6A priority Critical patent/CN111353553A/en
Publication of CN111353553A publication Critical patent/CN111353553A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention discloses a method and a device for cleaning mislabeled data, computer equipment and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1; sorting the logloss_fl values of the samples from large to small; and taking a number of top-ranked samples and marking them as error samples. The invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.

Description

Method and device for cleaning error labeling data, computer equipment and storage medium
Technical Field
The present invention relates to the field of data processing, and in particular to a method and an apparatus for cleaning mislabeled data, a computer device, and a storage medium.
Background
At present, many artificial intelligence / deep learning models are supervised models built on accurately labeled data. Acquiring accurately labeled data is very costly, and labeled datasets inevitably contain some mislabeled data, which greatly affects the trained model.
In the prior art, checking the accuracy of labeled data mostly relies on manual cross-checking, which is inefficient and whose accuracy needs improvement.
Disclosure of Invention
The invention aims to provide a method, an apparatus, computer equipment and a storage medium for cleaning mislabeled data, so as to solve the problems in the prior art that data verification is inefficient and its accuracy needs improvement.
The embodiment of the invention provides a method for cleaning error marking data, which comprises the following steps:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl of each sample from large to small;
and taking a number of top-ranked samples and marking them as error samples.
Preferably, γ is greater than 1.
Preferably, γ is greater than 1 and less than 2.
Preferably, γ is 1.5.
Preferably, taking a number of top-ranked samples and marking them as error samples comprises:
taking samples with logloss_fl greater than a threshold and marking them as error samples.
Preferably, after taking the top-ranked samples and marking them as error samples, the method further comprises:
and re-labeling the error samples selected by the classification model.
Preferably, the classification model is a neural network multi-classification model.
The embodiment of the present invention further provides an apparatus for cleaning mislabeled data, the apparatus comprising:
a model training unit for training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit for performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl of each sample from large to small;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the method for cleaning mislabeled data described above is implemented.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for cleaning mislabeled data described above.
The embodiment of the invention provides a method and a device for cleaning mislabeled data, computer equipment and a storage medium, wherein the method comprises the following steps: training a deep learning classification model whose loss function is logloss_fl; performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C; calculating the loss function logloss_fl of each sample, i.e. the logloss_fl obtained by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter; sorting the logloss_fl of each sample from large to small; and taking a number of top-ranked samples and marking them as error samples. The embodiment of the invention greatly improves the accuracy of finding mislabeled samples and improves inspection efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart illustrating a method for cleaning mislabeled data according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an apparatus for cleaning mislabeled data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, fig. 1 is a schematic flow chart illustrating a method for cleaning mislabeled data according to an embodiment of the present invention, the method including steps S101 to S105:
S101, training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
S102, performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
S103, calculating the loss function logloss_fl of each sample to obtain the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
S104, sorting the logloss_fl of each sample from large to small;
and S105, taking a number of top-ranked samples and marking them as error samples.
The embodiment of the invention improves how mislabeled samples are identified in deep learning. By means of the loss function, the loss of mislabeled samples (error samples) is greatly exposed during inference, which achieves the effect of identifying them; meanwhile, correctly labeled samples (correct samples) receive more attention during model training, reducing the influence of mislabeled samples on the model.
In step S101, a deep learning classification model is first trained; the classification model is a neural network multi-classification model, and the loss function of the classification model is logloss_fl.
In step S102, inference is performed on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C.
Here, the probability p_ij that the model gives a simple sample is large, and the probability p_ij that the model gives a difficult sample is small.
In step S103, the loss function logloss_fl of each sample is calculated (each sample is regarded as its own set), obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where γ is a hyperparameter; theoretically, any value in (1, +∞) is effective. In one embodiment, γ is greater than 1. Preferably, γ is greater than 1 and less than 2. Preferably, γ is 1.5.
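For concreteness, the per-sample loss can be written directly from the formula above. The following is a minimal PyTorch sketch under our own naming (the function name `logloss_fl` and the arguments `probs` and `labels` are illustrative, not prescribed by the patent); it assumes `probs` already holds softmax probabilities:

```python
import torch

def logloss_fl(probs: torch.Tensor, labels: torch.Tensor, gamma: float = 1.5) -> torch.Tensor:
    """Per-sample logloss_fl, i.e. the N = 1 case of the formula above.

    probs:  (N, C) predicted class probabilities (rows sum to 1)
    labels: (N,)   integer class indices, so y_ij = 1 only at j = labels[i]
    """
    # Only the y_ij = 1 term survives the sum over j, so pick p_ij of the annotated class.
    p_true = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    return -(p_true ** gamma) * torch.log(p_true)  # shape (N,), one loss value per sample

# During training (step S101) the batch loss is simply the mean:
# loss = logloss_fl(torch.softmax(logits, dim=1), labels).mean()
```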
In the prior art, when training a classification model, cross entropy (logarithmic loss) is generally used as the loss of the classification model, and the logarithmic loss is defined as follows:
$$\mathrm{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 denotes that the class of the i-th sample is j (otherwise y_ij = 0), and p_ij is the model's predicted probability that the i-th sample belongs to class j. logloss is the model loss, used to measure the difference between the predicted probability distribution and the true probability distribution; smaller values are better.
From the above it can be seen that, during training, an error sample and a correct sample carry the same weight as individual samples. If the neural network has enough parameters, the error samples can easily be learned (memorized), and they then become difficult to distinguish during testing. Therefore, the embodiment of the present invention improves the loss function to obtain logloss_fl.
With logloss_fl, a simple sample, i.e. one to which the model gives a large probability p_ij, keeps most of its loss in the final computation, whereas a difficult sample, to which the model gives a small probability p_ij, has its computed logloss_fl strongly suppressed. Therefore, the model tends to reduce the loss of simple samples first, while difficult samples (mostly mislabeled samples) are ignored by the model to some extent. As a result, the classification model obtained by the final learning outputs a higher probability for correctly labeled samples and a lower probability for mislabeled samples. Therefore, after model inference, the samples with relatively large losses are mostly mislabeled samples, and most mislabeled samples can be screened out by means of this loss function.
In the embodiment of the invention, mislabeled samples form only a small portion of the total samples, and the model itself learns them less, so their output p_ij tends to be small to begin with; in addition, a mislabeled sample contradicts the correctly labeled samples to a certain extent, making the model uncertain, so its output p_ij will be relatively small. During training, logloss_fl increasingly ignores the samples with comparatively small p_ij, so the error samples become harder and harder for the model to learn.
For example, for a sample i whose true class is j, the values of logloss and logloss_fl when p_ij is large (0.9) and small (0.6) are as follows (γ = 1.5):

p_ij    logloss    logloss_fl
0.9     0.1054     0.0900
0.6     0.5108     0.2374
As can be seen above, when p_ij is larger, the reduction from logloss to logloss_fl is smaller than when p_ij is smaller. Hence, using logloss_fl, the cases with larger p_ij are barely ignored, while the cases with smaller p_ij are largely ignored.
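These numbers follow directly from the two formulas; the following quick check (our own snippet, not part of the patent) reproduces the table:

```python
import math

gamma = 1.5
for p in (0.9, 0.6):
    logloss = -math.log(p)               # cross-entropy term for the true class
    logloss_fl = (p ** gamma) * logloss  # reweighted by p^gamma as in logloss_fl
    print(f"p_ij={p}: logloss={logloss:.4f}, logloss_fl={logloss_fl:.4f}")

# p_ij=0.9: logloss=0.1054, logloss_fl=0.0900
# p_ij=0.6: logloss=0.5108, logloss_fl=0.2374
```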
In step S104, the logloss_fl of each sample is sorted from large to small.
in one embodiment, the step S105 includes:
samples with loglos _ fl greater than the threshold are taken and labeled as erroneous samples.
In this embodiment, a number of top-ranked samples are taken out; these are the samples that the model considers most likely to be mislabeled.
In an embodiment, after the step S105, the method further includes:
and re-labeling the error samples selected by the classification model.
The embodiment of the invention identifies and discovers mislabeled samples by applying this loss function, greatly improving the recall and accuracy of discovering mislabeled samples and greatly reducing manual inspection.
As shown in fig. 2, an embodiment of the invention further provides an apparatus 200 for cleaning mislabeled data, which includes:
a model training unit 201, configured to train a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit 202, configured to perform inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit 203, configured to calculate the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit 204, configured to sort the logloss_fl of each sample from large to small;
and a marking unit 205, configured to take out a number of top-ranked samples and mark them as error samples.
Since the embodiments of the apparatus portion and the method portion correspond to each other, please refer to the description of the embodiments of the method portion for the embodiments of the apparatus portion, which is not repeated here.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed, can implement the method provided by the above-described embodiments. The storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The invention also provides a computer device, which may include a memory and a processor, wherein the memory stores a computer program, and the processor may implement the method provided by the above embodiment when calling the computer program in the memory. Of course, the computer device may also include various network interfaces, power supplies, and the like.
The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (10)

1. A method for cleaning mislabeled data, comprising:
training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
calculating the loss function logloss_fl of each sample to obtain the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
sorting the logloss_fl of each sample from large to small;
and taking a number of top-ranked samples and marking them as error samples.
2. The method for cleaning mislabeled data according to claim 1, wherein γ is greater than 1.
3. The method for cleaning mislabeled data according to claim 2, wherein γ is greater than 1 and less than 2.
4. The method for cleaning mislabeled data according to claim 3, wherein γ is 1.5.
5. The method for cleaning mislabeled data according to claim 1, wherein taking a number of top-ranked samples and marking them as error samples comprises:
taking samples with logloss_fl greater than a threshold and marking them as error samples.
6. The method for cleaning mislabeled data according to claim 1, wherein after taking the top-ranked samples and marking them as error samples, the method further comprises:
and re-labeling the error samples selected by the classification model.
7. The method for cleaning mislabeled data according to claim 1, wherein the classification model is a neural network multi-classification model.
8. An apparatus for cleaning mislabeled data, comprising:
a model training unit for training a deep learning classification model, wherein the loss function of the classification model is logloss_fl;
a sample inference unit for performing inference on each sample with the trained classification model to obtain the prediction probability p_ij of each sample, i = 1, 2, …, N; j = 1, 2, …, C;
a loss function calculating unit for calculating the loss function logloss_fl of each sample, obtaining the logloss_fl of each sample by taking N = 1;
$$\mathrm{logloss\_fl} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{ij}\,p_{ij}^{\gamma}\,\log(p_{ij})$$
where N is the number of samples, C is the number of classes, y_ij = 1 indicates that the class of the i-th sample is j, y_ij = 0 indicates that the class of the i-th sample is not j, and γ is a hyperparameter;
a sorting unit for sorting the logloss_fl of each sample from large to small;
and a marking unit for taking out a number of top-ranked samples and marking them as error samples.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for cleaning mislabeled data according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the method for cleaning mislabeled data according to any one of claims 1 to 7.
CN202010182397.6A 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium Pending CN111353553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010182397.6A CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010182397.6A CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111353553A (en) 2020-06-30

Family

ID=71197621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010182397.6A Pending CN111353553A (en) 2020-03-16 2020-03-16 Method and device for cleaning error labeling data, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111353553A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204769A (en) * 2023-03-06 2023-06-02 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification
CN116204769B (en) * 2023-03-06 2023-12-05 深圳市乐易网络股份有限公司 Data cleaning method, system and storage medium based on data classification and identification

Similar Documents

Publication Publication Date Title
CN106951925B (en) Data processing method, device, server and system
CN110704732B (en) Cognitive diagnosis based time-sequence problem recommendation method and device
CN110928764B (en) Automated evaluation method for crowdsourcing test report of mobile application and computer storage medium
JP6973625B2 (en) Learning equipment, learning methods and learning programs
CN110889463A (en) Sample labeling method and device, server and machine-readable storage medium
CN110852755A (en) User identity identification method and device for transaction scene
CN111506598B (en) Fault discrimination method, system and device based on small sample self-learning fault migration
Rady et al. Time series forecasting using tree based methods
CN108470194B (en) Feature screening method and device
CN111611486A (en) Deep learning sample labeling method based on online education big data
US11132790B2 (en) Wafer map identification method and computer-readable recording medium
CN113988044B (en) Method for judging error question reason type
CN113762401A (en) Self-adaptive classification task threshold adjusting method, device, equipment and storage medium
CN111353553A (en) Method and device for cleaning error labeling data, computer equipment and storage medium
JP5684084B2 (en) Misclassification detection apparatus, method, and program
CN112115996B (en) Image data processing method, device, equipment and storage medium
CN112015861A (en) Intelligent test paper algorithm based on user historical behavior analysis
CN110334080B (en) Knowledge base construction method for realizing autonomous learning
CN111414930B (en) Deep learning model training method and device, electronic equipment and storage medium
Alasalmi et al. Classification uncertainty of multiple imputed data
US20210397960A1 (en) Reliability evaluation device and reliability evaluation method
CN115660060A (en) Model training method, detection method, device, equipment and storage medium
KR102303111B1 (en) Training Data Quality Assessment Technique for Machine Learning-based Software
CN108959594B (en) Capacity level evaluation method and device based on time-varying weighting
CN114743048A (en) Method and device for detecting abnormal straw picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination