CN112734035A - Data processing method and device and readable storage medium - Google Patents


Info

Publication number
CN112734035A
CN112734035A
Authority
CN
China
Prior art keywords
label
verification
sample picture
probability
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011644826.3A
Other languages
Chinese (zh)
Other versions
CN112734035B (en
Inventor
张翼
顾华鑫
李辰
廖强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jiahua Chain Cloud Technology Co ltd
Original Assignee
Chengdu Jiahua Chain Cloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Jiahua Chain Cloud Technology Co ltd filed Critical Chengdu Jiahua Chain Cloud Technology Co ltd
Priority to CN202011644826.3A priority Critical patent/CN112734035B/en
Publication of CN112734035A publication Critical patent/CN112734035A/en
Application granted granted Critical
Publication of CN112734035B publication Critical patent/CN112734035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The application provides a data processing method and device and a readable storage medium. The data processing method comprises the following steps: acquiring a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label; respectively inputting the plurality of sample pictures into the plurality of verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are pairwise different and the labeling label is one of the plurality of labels; determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determining a fitting distribution of the plurality of labels according to the average cross entropies of the plurality of sample pictures; and determining whether the labeling label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution. The method improves the accuracy and efficiency of label cleaning.

Description

Data processing method and device and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus, and a readable storage medium.
Background
For the training samples of a neural network model, the corresponding labels are usually annotated manually. Manually labeled labels can be inaccurate and therefore need to be cleaned.
In the prior art, label cleaning mainly relies on manually screening out wrong labels. Manual screening of wrong labels places high demands on data labeling practitioners; in particular, in some specialized data industries, personnel from outside the industry may need a long training period, and manual screening is inefficient.
Therefore, the existing label washing mode has low accuracy and efficiency.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data processing method and apparatus, and a readable storage medium, so as to improve accuracy and efficiency of label cleaning.
In a first aspect, an embodiment of the present application provides a data processing method, including: acquiring a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label; respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are pairwise different and the labeling label is one of the plurality of labels; determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determining a fitting distribution of the labels according to the average cross entropies of the sample pictures; and determining whether the labeling label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
In the embodiment of the application, compared with the prior art, the probability that the label of each sample picture is one of the preset labels is respectively output through a plurality of pre-trained verification models, and the average cross entropy between the label corresponding to each sample picture and the labels is determined based on the probability, wherein the average cross entropy can represent the distance between the real label and the label; and finally, determining whether the label is a correct label according to the probability of the average cross entropy in the fit distribution. In the process, manual screening is not needed, and the label cleaning efficiency is improved; meanwhile, the label is analyzed more scientifically based on the average cross entropy, the verification model and the probability distribution, and the label cleaning accuracy is improved.
As a possible implementation manner, before the obtaining a plurality of sample pictures and a plurality of pre-trained verification models, the method further includes: acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In the embodiment of the application, the cross training of the verification model is realized through the cross data set selected from a plurality of sample pictures, and the accuracy of the verification model is improved.
As a possible implementation manner, after the training a plurality of initial verification models through the cross data sets respectively to obtain a plurality of trained verification models, the method further includes: determining the accuracy and recall rate of the trained verification models; determining quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained verification models according to the quality scores.
In the embodiment of the application, the quality score is determined through the accuracy and the recall rate, and then the verification model is optimized through the quality score, so that the accuracy of the verification model is improved.
As a possible implementation manner, the determining, according to the verification results output by the multiple verification models, an average cross entropy between the label corresponding to each sample picture and the plurality of labels includes: calculating, by the formula

\bar{H}_j = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(j)} \ln p_{i,k}^{(j)}

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels; wherein \bar{H}_j is the average cross entropy of the jth sample picture; N is the number of verification models and K is the number of preset labels; p_{i,k}^{(j)} is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label; y_k^{(j)} = 1 if the labeling label corresponding to the jth sample picture is the kth label, and y_k^{(j)} = 0 if it is not.
In the embodiment of the application, the average cross entropy is determined by combining the relation between the labeling label and the labels, so that the average cross entropy can effectively reflect the distance between the real label and the labeling label.
As a possible implementation manner, the determining whether the label tag corresponding to each sample picture is a correct tag according to the probability of the average cross entropy of each sample picture in the fitting distribution includes: substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture; and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
In the embodiment of the application, the probability that the label of each sample picture is the corresponding labeling label of each sample picture is determined through the average cross entropy and the fitting distribution, and then the correctness of the labeling label is effectively and accurately determined through the probability threshold.
As a possible implementation manner, after determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fit distribution, the method further includes: acquiring a target sample picture; the label corresponding to the target sample picture is an error label; and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
In the embodiment of the application, based on the result output by the verification model, besides the judgment on the correctness of the label, the correction of the error label can be realized.
As a possible implementation manner, the correcting the error label according to the verification result of the target sample picture output by the multiple verification models includes: calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In the embodiment of the application, the label with the maximum average probability is determined as the correct label corresponding to the target sample picture, so that the labeled label is effectively corrected.
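A minimal sketch of this averaging-based correction (function name hypothetical; each inner list is one verification model's output for the target sample picture):

```python
def correct_by_average_probability(model_outputs):
    """Return the index of the label whose probability, averaged across
    all verification models, is largest; this label is taken as the
    corrected label for the target sample picture."""
    num_models = len(model_outputs)
    num_labels = len(model_outputs[0])
    avg = [sum(out[k] for out in model_outputs) / num_models
           for k in range(num_labels)]
    return max(range(num_labels), key=lambda k: avg[k])

# Two models, two labels (e.g. apple=0, banana=1); averages are 0.25 and 0.75:
best = correct_by_average_probability([[0.3, 0.7], [0.2, 0.8]])  # -> 1
```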
As a possible implementation manner, the correcting the error label according to the verification result of the target sample picture output by the multiple verification models includes: determining a frequency of a probability that the probability of each of the plurality of tags is the maximum according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum frequency as a correct label corresponding to the target sample picture.
In the embodiment of the application, the label is effectively corrected by determining the frequency of the maximum probability of each label and then determining the label with the maximum frequency as the correct label corresponding to the target sample picture.
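A minimal sketch of this frequency-based correction (function name hypothetical): each verification model casts one "vote" for its highest-probability label, and the most frequent winner is taken as the corrected label.

```python
from collections import Counter

def correct_by_argmax_frequency(model_outputs):
    """Count, across verification models, how often each label index has
    the highest probability; the most frequent argmax is returned as the
    corrected label for the target sample picture."""
    votes = Counter(max(range(len(out)), key=out.__getitem__)
                    for out in model_outputs)
    return votes.most_common(1)[0][0]

# Three models; label 1 has the highest probability in two of them:
winner = correct_by_argmax_frequency([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])  # -> 1
```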
In a second aspect, an embodiment of the present application provides a data processing apparatus, which includes functional modules for implementing the data processing method described in the first aspect and any one of the possible implementation manners of the first aspect.
In a third aspect, an embodiment of the present application provides a readable storage medium, where a computer program is stored on the readable storage medium, and the computer program is executed by a computer to perform the data processing method described in the first aspect and any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a data processing method provided in an embodiment of the present application;
fig. 2 is a functional block diagram of a data processing apparatus according to an embodiment of the present application.
Icon: 200-a data processing apparatus; 210-an obtaining module; 220-processing module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The technical scheme provided by the embodiment of the application can be applied to an application scene needing label cleaning, the cleaned label is a manually labeled label, and the label cleaning result can comprise two aspects, namely judging whether the label is a correctly labeled label or not on one hand, and correcting a wrongly labeled label on the other hand.
In the embodiment of the present application, the processed object may be a picture, and therefore the cleaned tag may also be a tag of the picture. The labels of pictures can take different forms: some picture tags mark the target objects in the picture and may represent the category or name of those objects; other picture tags mark the attributes of the target object in the picture and may represent those attributes, such as whether the retina in a retinal image is a healthy retina. From another perspective, the tags may be used to classify the pictures, or the tags may represent the recognition or detection results of the pictures.
Of course, for other non-image processing objects, the tag may also be cleaned by referring to the technical solution of the embodiment of the present application, and the image application scenario defined in the embodiment of the present application does not limit the technical solution of the embodiment of the present application.
The hardware operating environment of the technical scheme provided by the embodiment of the application can be a front end, such as a client, or a back end, such as a server; the system may also be a system composed of a front end and a back end, in which the front end is configured to receive a data processing request, and then the front end sends the data processing request and corresponding data to the back end, and the back end performs data processing based on the data processing request and the corresponding data.
Based on the introduction of the application scenario and the hardware operating environment, referring to fig. 1, a flowchart of a data processing method provided in an embodiment of the present application is shown, where the data processing method includes:
step 110: and acquiring a plurality of sample pictures and a plurality of pre-trained verification models. Each sample picture corresponds to one labeling label.
Step 120: and respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model. The verification result comprises the following steps: the probability that the label of each sample picture is each of a plurality of preset labels is obtained; the annotation label belongs to a label of the plurality of labels.
Step 130: and determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models.
Step 140: and determining the fitting distribution of the labels according to the average cross entropy of the sample pictures.
Step 150: and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
In the embodiment of the application, compared with the prior art, the probability that the label of each sample picture is one of the preset labels is respectively output through a plurality of pre-trained verification models, and the average cross entropy between the label corresponding to each sample picture and the labels is determined based on the probability, wherein the average cross entropy can represent the distance between the real label and the label; and finally, determining whether the label is a correct label according to the probability of the average cross entropy in the fit distribution. In the process, manual screening is not needed, and the label cleaning efficiency is improved; meanwhile, the label is analyzed more scientifically based on the average cross entropy, the verification model and the probability distribution, and the label cleaning accuracy is improved.
A detailed implementation of steps 110-150 is described next.
In step 110, the plurality of sample pictures are pictures corresponding to labeling labels, where the labeling labels are manually labeled. The plurality of verification models are pre-trained models used to predict the labels of the sample pictures. Next, how the plurality of sample pictures and the plurality of verification models are prepared will be described.
The plurality of sample pictures are the pictures whose labels are to be cleaned; they may be provided by a user or obtained from a database. As an optional implementation manner, a plurality of pictures are obtained first. These pictures may include two types: one type is pictures already marked with labels, which can be used directly as sample pictures; the other is pictures not yet marked with labels, which can be used as sample pictures after labels are marked on them.
As an alternative embodiment, the training process of the plurality of verification models includes: acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; respectively training a plurality of initial verification models through a cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In this embodiment, assuming that the obtained multiple pictures are the target data set, a metadata set may be obtained based on the target data set, where the metadata set may be directly the target data set, or may be obtained by sampling from the target data set by a random sampling method.
Then, the training data set and the validation data set are selected from the metadata set. Before selecting them, the sample pictures in the metadata set can be cross-split according to the format of the data set. Data sets come in two formats, detection data sets and classification data sets. For a detection data set, the pictures need to be cropped according to the labels to form a classification data set, that is, the pictures are labeled as in the foregoing embodiment to obtain labeled sample pictures. A classification data set requires no cropping, since its pictures already correspond to labels as in the foregoing embodiment.
Selecting a part of sample pictures from the classification data set as a training data set based on the classification data set; and the other part of the sample picture is used as a verification data set. And then training the initial verification model by using the training data set, and testing the verification model obtained by training by using the verification data set so as to evaluate the performance index of the model.
Since there are multiple verification models, different groupings of training data sets and verification data sets can be formed from the classification data set for cross-training the different verification models. As an alternative embodiment, assume there are multiple data sets and multiple verification models to be trained: some of the data sets are used as training data sets in turn, the remaining data sets are used as verification data sets, and the resulting groupings are used to cross-train the different verification models. For example, assume that there are 10 data sets and 10 verification models to be trained: the training data set corresponding to the 1st verification model is the 1st to 9th data sets, and the corresponding verification data set is the 10th data set; the training data set corresponding to the 2nd verification model is the 2nd to 10th data sets, and the corresponding verification data set is the 1st data set; the data sets corresponding to the other verification models follow by analogy. In practical application, the grouping of the data sets can be specified in advance; different groupings are then formed according to the preset grouping scheme, yielding different training and verification data sets used to train the verification models.
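The rotating train/validation grouping described here is essentially k-fold cross-validation. A purely illustrative sketch (not the patent's implementation; zero-based indices are used, so the model at index i validates on subset i and trains on the remaining subsets):

```python
# Sketch of the rotating grouping: with 10 subsets and 10 verification
# models, each model holds out one subset for validation and trains on
# the other nine (k-fold cross-validation).
def cross_groups(num_subsets):
    groups = []
    for i in range(num_subsets):
        train = [j for j in range(num_subsets) if j != i]
        validation = [i]
        groups.append((train, validation))
    return groups

groups = cross_groups(10)
# groups[0] trains on subsets 1..9 and validates on subset 0, and so on.
```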
In the embodiment of the present application, the structure of the verification model may adopt an EfficientNet (a neural network) proposed by Google brain team, or may adopt other feasible network structures. During training, the training data set is used for training a classifier in the verification model so that the classifier can predict the probability that the picture is taken as each label; the verification data set is used for testing the verification model obtained through training so as to test whether the verification model can realize the prediction of the label probability. The training and testing of the verification model is well within the skill of the art and will not be described in detail herein.
In the embodiment of the present application, after the verification model is trained, it may predict the probability that the label of a sample picture is each label. For example, if there are three labels in total, the verification model may output the probability that the label of the sample picture is the first label, the probability that it is the second label, and the probability that it is the third label. In addition, the sample picture also corresponds to a labeling label, which is one of the three labels; the labeling label may or may not be the correct label.
In this embodiment of the present application, after training of multiple verification models is completed, quality evaluation may also be performed on the multiple verification models, which may also be understood as evaluating the prediction capability of the verification models, and therefore, the method further includes: determining the accuracy and recall rate of the trained verification models; determining the quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained multiple verification models according to the quality scores.
Wherein, the accuracy P can be calculated by:

P = TP / (TP + FP)

and the recall rate R can be calculated by:

R = TP / (TP + FN)

TP is the number of sample pictures, among the sample pictures predicted by the verification model, whose prediction result is correct; a prediction result is correct when the label with the maximum probability output by the verification model is the correct label of the sample picture. FP is the number of sample pictures whose prediction result is the first kind of error; for the first kind of error, the label with the maximum probability output by the verification model is a non-labeling label that is not the correct label. FN is the number of sample pictures whose prediction result is the second kind of error; for the second kind of error, the label with the maximum probability output by the verification model is the labeling label, but the labeling label is not the correct label.
Based on the accuracy and the recall rate, the quality score F1 of the verification model can be calculated by:

F1 = 2 × P × R / (P + R)
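As an illustrative sketch of the metric computation above (the counts TP, FP, FN are taken as inputs; the function name is hypothetical):

```python
def quality_score(tp, fp, fn):
    """Compute precision P, recall R and quality score F1 from
    prediction counts, following P = TP/(TP+FP), R = TP/(TP+FN),
    F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Example counts (illustrative only):
p, r, f1 = quality_score(tp=80, fp=10, fn=20)
```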
Based on the quality scores, the verification model can be continuously optimized. For example: a quality score threshold is preset, the quality score of the verification model is calculated after each round of training, and if the quality score does not reach the threshold, the verification model is trained again until the quality score reaches the threshold. When the verification model is retrained, the cross data set may or may not be replaced.
In the embodiment of the application, the quality score is determined through the accuracy and the recall rate, and then the verification model is optimized through the quality score, so that the accuracy of the verification model is improved.
Based on the above description of the verification models, in step 120, the plurality of sample pictures are respectively input into the plurality of verification models. For example, assuming 10 sample pictures and 10 verification models, the 10 sample pictures are input into each verification model. Correspondingly, each verification model outputs a plurality of verification results; for example, the verification result output by the 1st verification model comprises the probability that the 1st sample picture is each of the plurality of labels, the probability that the 2nd sample picture is each of the plurality of labels, and similarly for the 3rd to 10th sample pictures. It can be understood that when the sample pictures are input into the verification models, their labeling labels are not input together; the per-label probabilities output by a verification model therefore include the probability of the labeling label.
In step 120, the preset labels may be set according to the labeling labels corresponding to the plurality of sample pictures. The labeling labels corresponding to different sample pictures may be the same or different, and the preset labels are all the distinct labels among those labeling labels. For example: among the plurality of sample pictures, 3 pictures have labeling label A, 4 pictures have labeling label B, and 3 pictures have labeling label C; then the plurality of labels comprise: label A, label B, and label C.
Based on the verification result obtained in step 120, in step 130, an average cross entropy between the label and the plurality of labels corresponding to each sample picture is determined according to the verification results output by the plurality of verification models.
As an optional implementation manner, the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels is calculated by the following formula:

\bar{H}_j = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(j)} \ln p_{i,k}^{(j)}

wherein \bar{H}_j is the average cross entropy of the jth sample picture; N is the number of verification models and K is the number of preset labels; p_{i,k}^{(j)} is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label; y_k^{(j)} = 1 if the labeling label corresponding to the jth sample picture is the kth label, and y_k^{(j)} = 0 if it is not.
By way of example, assume that the preset plurality of labels comprises apple and banana, where the label represents the name of the object in the sample picture; apple is the 1st label and banana is the 2nd label. Assume further that there are 2 verification models and 10 sample pictures.
For the 5th sample picture, the labeling label is banana. The probability output by the 1st verification model that the label of this sample picture is apple is 0.3, and that it is banana is 0.7; the probability output by the 2nd verification model that the label is apple is 0.8, and that it is banana is 0.2. Then, for the 5th sample picture, the average cross entropy is: -(0×log0.3 + 1×log0.7 + 0×log0.8 + 1×log0.2)/2 ≈ 0.983 (log denoting the natural logarithm).
Continuing with the example, for the 6th sample picture, whose labeling label is banana, the probability output by the 1st verification model that the label is apple is 0.3 and that it is banana is 0.7, and the probability output by the 2nd verification model that the label is apple is 0.2 and that it is banana is 0.8; the average cross entropy for the 6th sample picture is therefore -(0×log0.3 + 1×log0.7 + 0×log0.2 + 1×log0.8)/2 ≈ 0.2899.
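The two worked examples can be reproduced with a short sketch (function name hypothetical; natural logarithms are used, which match the values 0.983 and 0.2899 above):

```python
import math

def average_cross_entropy(probs, label_index):
    """probs: one per-label probability list per verification model for a
    single sample picture; label_index: index of the sample's annotation
    label among the preset labels. Because y_k = 1 only for the
    annotation label, the inner sum reduces to -log of its probability."""
    total = 0.0
    for model_probs in probs:
        total -= math.log(model_probs[label_index])
    return total / len(probs)

# 5th sample picture (annotation label banana, index 1):
h5 = average_cross_entropy([[0.3, 0.7], [0.8, 0.2]], label_index=1)  # ≈ 0.983
# 6th sample picture (annotation label banana, index 1):
h6 = average_cross_entropy([[0.3, 0.7], [0.2, 0.8]], label_index=1)  # ≈ 0.2899
```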
Further, the average cross entropy implicitly reflects the distance between the correct label and the labeling label. For example, suppose that in reality the correct label of the 5th sample picture is apple while its labeling label is banana: the distance between the labeling label and the correct label is large, which indicates that the labeling label is a wrong label. The correct label of the 6th sample picture is banana and its labeling label is also banana: the distance between them is small, which indicates that the labeling label is the correct label.
From step 130, an average cross entropy for each sample picture may be determined, based on which, in step 140, a fitted distribution of the plurality of labels is determined from the average cross entropy of the plurality of sample pictures.
As an alternative implementation, in step 140 the cross entropy values corresponding to all the labels may be collected, a histogram distribution calculated, and a continuous distribution fitted separately for each label. The fitted distribution can be expressed as:

D_k = fit(H̄_k)

wherein H̄_k denotes all the average cross entropies corresponding to the kth label, for example all the average cross entropies corresponding to the apple label, and D_k is the distribution obtained by fitting all the average cross entropies of the kth label.
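A minimal sketch of step 140, assuming a normal distribution is fitted per label with the standard library's `statistics.NormalDist` (the patent leaves the distribution family open, and the function name is hypothetical):

```python
from collections import defaultdict
from statistics import NormalDist

def fit_label_distributions(entropies, labels):
    """Group the average cross entropies by annotated label index and fit one
    continuous distribution D_k per label (normal fit assumed here)."""
    by_label = defaultdict(list)
    for h, k in zip(entropies, labels):
        by_label[k].append(h)
    return {k: NormalDist.from_samples(v) for k, v in by_label.items()}

# Toy data: label 0 = "apple", label 1 = "banana"
entropies = [0.3, 0.25, 0.35, 0.9, 1.0, 0.95]
labels    = [0,   0,    0,    1,   1,   1]
dists = fit_label_distributions(entropies, labels)
print(dists[0].mean, dists[1].mean)
```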
After the fitted distribution is obtained in step 140, in step 150 it is determined, according to the probability of the average cross entropy of each sample picture in the fitted distribution, whether the labeling label corresponding to each sample picture is the correct label. As an optional implementation manner, this step includes: substituting the average cross entropy of each sample picture into the probability density function of the fitted distribution to obtain the probability that the label of each sample picture is the labeling label corresponding to that sample picture; and determining whether the labeling label corresponding to each sample picture is the correct label according to this probability and a preset probability threshold.
In this embodiment, each sample picture corresponds to an average cross entropy, which is substituted into the probability density function of the fitted distribution D_k to obtain the probability of the sample picture under the fitted distribution; this probability represents the probability that the labeling label of the sample picture is its corresponding label. The correctness of the labeling label is then judged against a preset probability threshold: if the obtained probability is greater than or equal to the probability threshold, the labeling label corresponding to the sample picture is the correct label; if the obtained probability is smaller than the probability threshold, the labeling label corresponding to the sample picture is not the correct label.
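The threshold test described above can be sketched as follows. Note that the patent treats the probability density value as a probability; the distribution parameters, function name, and threshold value here are illustrative assumptions:

```python
from statistics import NormalDist

def label_is_correct(avg_entropy, dist, threshold):
    """Substitute the sample's average cross entropy into the probability
    density function of the fitted distribution; the labeling label is
    judged correct when the resulting value reaches the preset threshold."""
    return dist.pdf(avg_entropy) >= threshold

d = NormalDist(mu=0.3, sigma=0.1)  # assumed fitted distribution for one label
print(label_is_correct(0.32, d, threshold=0.5))  # near the mode -> True
print(label_is_correct(0.95, d, threshold=0.5))  # far in the tail -> False
```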
In the embodiment of the application, the probability that the label of each sample picture is its corresponding labeling label is determined through the average cross entropy and the fitted distribution, and the correctness of the labeling label is then effectively and accurately determined through the probability threshold.
In the embodiment of the present application, after the determination result is obtained in step 150, the sample picture with the correct label can be automatically marked according to the determination result, for example, the sample picture with the correct label is marked as the correct label sample; and marking the sample picture with the wrong label as the wrong label sample.
And correcting the labeling label of the sample picture which is labeled by mistake. Thus, after step 150, the method further comprises: acquiring a target sample picture; the label corresponding to the target sample picture is an error label; and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
As described in the foregoing embodiment, based on the determination result each sample picture can be marked as a correctly or incorrectly labeled sample picture; therefore, the incorrectly labeled sample pictures can be directly obtained as the target sample pictures.
For the target sample picture, each verification model outputs a corresponding verification result. Based on the verification result output by each verification model, correction of the error label can be realized, and in the embodiment of the application, two optional correction modes are provided.
A first optional correction mode, namely calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by a plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In this modification, the average of the probabilities that the label of the target sample picture output by each verification model is a given label is calculated. For example, assume that there are 2 verification models and 2 preset labels; the probability that the label of the target sample picture output by the 1st verification model is the 1st label is a, and the probability that it is the 2nd label is b; the probability that the label output by the 2nd verification model is the 1st label is c, and the probability that it is the 2nd label is d. Then the average probability of the 1st label is (a + c)/2, and the average probability of the 2nd label is (b + d)/2.
And determining, based on the average probability, the label with the maximum average probability as the correct label corresponding to the target sample picture. Continuing with the example: if (a + c)/2 > (b + d)/2, the correct label corresponding to the target sample picture is the 1st label; if (a + c)/2 < (b + d)/2, the correct label corresponding to the target sample picture is the 2nd label; if (a + c)/2 is equal to (b + d)/2, this correction mode alone cannot determine the correct label.
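A sketch of the first correction mode under the assumptions above. The list-of-lists probability layout and the function name are illustrative; a tie returns None, mirroring the case where this mode cannot decide:

```python
def correct_by_average_probability(model_probs):
    """Average, over the verification models, the probability assigned to
    each preset label; return the index of the label with the highest
    average, or None when there is a tie (this mode then cannot decide)."""
    n_models = len(model_probs)
    n_labels = len(model_probs[0])
    avg = [sum(p[k] for p in model_probs) / n_models for k in range(n_labels)]
    best = max(avg)
    winners = [k for k, a in enumerate(avg) if a == best]
    return winners[0] if len(winners) == 1 else None

# 2 verification models, 2 preset labels; (a, b) and (c, d) from the example
print(correct_by_average_probability([[0.3, 0.7], [0.2, 0.8]]))  # 1 (2nd label)
print(correct_by_average_probability([[0.4, 0.6], [0.6, 0.4]]))  # None (tie)
```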
In the embodiment of the application, the label with the maximum average probability is determined as the correct label corresponding to the target sample picture, so that the labeled label is effectively corrected.
A second optional correction mode: determining, according to the verification results of the target sample picture output by the multiple verification models, the frequency with which each of the multiple labels has the maximum probability; and determining the label with the maximum frequency as the correct label corresponding to the target sample picture.
In this correction mode, each verification model outputs different probabilities for the labels of the target sample picture, but among those probabilities one label has the maximum probability. Therefore, the maximum-probability label output by each verification model can be counted, and the frequency with which each label is the maximum-probability label is then determined based on the statistics.
For example, assume that there are 10 verification models and the preset labels include label A and label B. For the target sample picture, label A has the greater probability in the output results of 8 verification models, so the frequency of label A is 0.8; label B has the greater probability in the output results of the other 2 verification models, so the frequency of label B is 0.2.
Further, after the frequency of each label which is the maximum probability label is determined, the label with the maximum frequency is determined as the correct label corresponding to the target sample picture. Continuing with the example, label a described above will be the correct label for the target sample picture.
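The second correction mode can be sketched as follows. The function name is hypothetical, and a tie returns None, consistent with marking the picture as an unknown-label sample:

```python
from collections import Counter

def correct_by_vote_frequency(model_probs):
    """For each verification model take the label with the maximum output
    probability, then pick the label voted for most often; return None on
    a tie between the top labels."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in model_probs)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # frequencies tied: this mode cannot decide
    return ranked[0][0]

# 10 verification models: label A (index 0) wins in 8, label B (index 1) in 2
probs = [[0.9, 0.1]] * 8 + [[0.2, 0.8]] * 2
print(correct_by_vote_frequency(probs))  # 0, i.e. label A with frequency 0.8
```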
The two modifications may be used in combination or individually. Such as: label correction is carried out on the target sample pictures in a first correction mode; or the target sample picture is subjected to label correction by adopting a second correction mode.
Alternatively, the target sample picture is first corrected by the first correction mode; if the first correction mode cannot complete the correction, for example because more than one label has the highest average probability (such as (a + c)/2 = (b + d)/2), the second correction mode is used. If the second correction mode also cannot complete the correction, for example because the frequencies of label A and label B are equal, the target sample picture can be marked as an unknown-label sample picture. Of course, the order of implementation of the first correction mode and the second correction mode may be reversed.
Or, the target sample picture firstly adopts a first correction mode to perform label correction, and then verifies the correction result of the first correction mode by using a second correction mode, such as: determining a correct label through a second correction mode, and determining that the correction is correct if the correct label is the same as the correct label determined by the first correction mode; if the correct label is different from the correct label determined by the first correction mode, it is determined that the correction may be wrong, and at this time, the target sample picture may be marked as an unknown label sample. Of course, the order of implementation of the first modification and the second modification may be reversed.
The application modes of the two correction modes can be flexibly set according to actual application scenarios, and in the embodiment of the present application, the application modes are merely exemplary and are not limited.
In the foregoing embodiment, if the target sample picture cannot be corrected, it is marked as an unknown-label sample picture, and various processing methods may be adopted for such pictures, for example feeding back the unknown-label sample pictures so that a professional annotates the correct labels.
Based on the same inventive concept, please refer to fig. 2, an embodiment of the present application further provides a data processing apparatus 200, which includes an obtaining module 210 and a processing module 220.
The obtaining module 210 is configured to: acquire a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label. The processing module 220 is configured to: respectively input the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are different from one another and the labeling label is one of the plurality of preset labels; determine the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determine the fitted distribution of the plurality of labels according to the average cross entropy of the plurality of sample pictures; and determine whether the labeling label corresponding to each sample picture is the correct label according to the probability of the average cross entropy of each sample picture in the fitted distribution.
In this embodiment of the present application, the obtaining module 210 is further configured to obtain a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; the processing module 220 is further configured to: training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In an embodiment of the present application, the processing module 220 is further configured to: determining the accuracy and recall rate of the trained verification models; determining quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained verification models according to the quality scores.
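The patent does not fix how the accuracy rate and recall rate are combined into a quality score; the F-beta measure is one common choice and is shown here purely as an illustrative assumption:

```python
def quality_score(precision, recall, beta=1.0):
    """Combine accuracy rate (precision) and recall into a single quality
    score using the F-beta measure (an assumed, illustrative choice)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(quality_score(0.8, 0.6), 4))  # F1 score: 0.6857
```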
In this embodiment of the application, the processing module 220 is specifically configured to: calculate, by the formula

H̄_j = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_jk · log(p_ijk),

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels; wherein H̄_j is the average cross entropy of the jth sample picture, N is the number of verification models, K is the number of preset labels, p_ijk is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label, y_jk = 1 when the labeling label corresponding to the jth sample picture is the kth preset label, and y_jk = 0 when it is not.
In this embodiment of the application, the processing module 220 is specifically configured to: substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture; and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
In this embodiment of the application, the obtaining module 210 is further configured to obtain a target sample picture; the label corresponding to the target sample picture is an error label; the processing module 220 is further configured to: and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
In this embodiment of the application, the processing module 220 is specifically configured to: calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In this embodiment of the application, the processing module 220 is further specifically configured to: determine, according to the verification results of the target sample picture output by the plurality of verification models, the frequency with which each of the plurality of labels has the maximum probability; and determine the label with the maximum frequency as the correct label corresponding to the target sample picture.
The functional blocks of the data processing apparatus 200 correspond to the steps of the data processing method described in the foregoing embodiment, and therefore, the implementation of the functional blocks may refer to the implementation of the steps of the data processing method described in the foregoing embodiment, and the description thereof is not repeated in the embodiment of the present application.
Based on the same inventive concept, embodiments of the present application further provide a readable storage medium, where a computer program is stored, and when the computer program is executed by a computer, the data processing method described in the foregoing embodiments is executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A data processing method, comprising:
acquiring a plurality of sample pictures and a plurality of pre-trained verification models; each sample picture corresponds to one label;
respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model; the verification result comprises: the probability that the label of each sample picture is each of a plurality of preset labels; the labeling label is one of the plurality of preset labels, and the plurality of preset labels are different from one another;
determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models;
determining fitting distribution of the labels according to the average cross entropy of the sample pictures;
and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
2. The method of claim 1, wherein prior to said obtaining the plurality of sample pictures and the plurality of pre-trained verification models, the method further comprises:
acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures;
training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
3. The method of claim 2, wherein after the training of the plurality of initial verification models by the respective cross data sets to obtain the trained plurality of verification models, the method further comprises:
determining the accuracy and recall rate of the trained verification models;
determining quality scores of the trained verification models according to the accuracy rate and the recall rate;
and optimizing the trained verification models according to the quality scores.
4. The method of claim 1, wherein determining the average cross entropy between the label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models comprises:
calculating, by the formula

H̄_j = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_jk · log(p_ijk),

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels;
wherein H̄_j is the average cross entropy of the jth sample picture, N is the number of verification models, K is the number of preset labels, p_ijk is the probability that the label of the jth sample picture output by the ith verification model is the kth preset label, y_jk = 1 when the labeling label corresponding to the jth sample picture is the kth preset label, and y_jk = 0 when it is not.
5. The method of claim 1, wherein determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fit distribution comprises:
substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture;
and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
6. The method of claim 1, wherein after determining whether the label corresponding to each sample picture is the correct label according to the probability of the average cross entropy of each sample picture in the fit distribution, the method further comprises:
acquiring a target sample picture; the label corresponding to the target sample picture is an error label;
and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
7. The method according to claim 6, wherein the correcting the error label according to the verification result of the target sample picture output by the plurality of verification models comprises:
calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models;
and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
8. The method according to claim 6, wherein the correcting the error label according to the verification result of the target sample picture output by the plurality of verification models comprises:
determining, according to the verification results of the target sample picture output by the plurality of verification models, the frequency with which each of the plurality of labels has the maximum probability;
and determining the label with the maximum frequency as a correct label corresponding to the target sample picture.
9. A data processing apparatus, comprising:
the acquisition module is used for acquiring a plurality of sample pictures and a plurality of pre-trained verification models; each sample picture corresponds to one label;
a processing module to:
respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model; the verification result comprises: the probability that the label of each sample picture is each of a plurality of preset labels; the labeling label is one of the plurality of preset labels; the preset labels are different from one another;
determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models;
determining fitting distribution of the labels according to the average cross entropy of the sample pictures;
and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a computer, performs the data processing method according to any one of claims 1 to 8.
CN202011644826.3A 2020-12-31 2020-12-31 Data processing method and device and readable storage medium Active CN112734035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644826.3A CN112734035B (en) 2020-12-31 2020-12-31 Data processing method and device and readable storage medium


Publications (2)

Publication Number Publication Date
CN112734035A true CN112734035A (en) 2021-04-30
CN112734035B CN112734035B (en) 2023-10-27

Family

ID=75609406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644826.3A Active CN112734035B (en) 2020-12-31 2020-12-31 Data processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN112734035B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061330A1 (en) * 2015-08-31 2017-03-02 International Business Machines Corporation Method, system and computer program product for learning classification model
CN109784391A (en) * 2019-01-04 2019-05-21 杭州比智科技有限公司 Sample mask method and device based on multi-model
CN110163234A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of model training method, device and storage medium
CN110472665A (en) * 2019-07-17 2019-11-19 新华三大数据技术有限公司 Model training method, file classification method and relevant apparatus
US20190354857A1 (en) * 2018-05-17 2019-11-21 Raytheon Company Machine learning using informed pseudolabels
CN110490237A (en) * 2019-08-02 2019-11-22 Oppo广东移动通信有限公司 Data processing method, device, storage medium and electronic equipment
CN110991496A (en) * 2019-11-15 2020-04-10 北京三快在线科技有限公司 Method and device for training model
CN111209972A (en) * 2020-01-09 2020-05-29 中国科学院计算技术研究所 Image classification method and system based on hybrid connectivity deep convolution neural network
WO2020152627A1 (en) * 2019-01-23 2020-07-30 Aptiv Technologies Limited Automatically choosing data samples for annotation
CN111553315A (en) * 2020-05-14 2020-08-18 哈尔滨工业大学(深圳) Satellite image-based poverty prediction model construction and poverty prediction method
CN111582366A (en) * 2020-05-07 2020-08-25 清华大学 Image processing method, device and equipment
CN111832627A (en) * 2020-06-19 2020-10-27 华中科技大学 Image classification model training method, classification method and system for suppressing label noise
CN111897996A (en) * 2020-08-10 2020-11-06 北京达佳互联信息技术有限公司 Topic label recommendation method, device, equipment and storage medium
CN112115995A (en) * 2020-09-11 2020-12-22 北京邮电大学 Image multi-label classification method based on semi-supervised learning
US20200401852A1 (en) * 2019-06-20 2020-12-24 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating information assessment model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. RUSIECKI: "Trimmed categorical cross-entropy for deep learning with label noise", IMAGE AND VISION PROCESSING AND DISPLAY TECHNOLOGY, vol. 55, no. 6, pages 319 - 320, XP006075931, DOI: 10.1049/el.2018.7980 *
姚佳奇 等: "基于标签语义相似的动态多标签文本分类算法", 计算机工程与应用, vol. 56, no. 19, pages 94 - 98 *
牛中超: "基于深度学习的问题标签预测研究与实现", 中国优秀硕士学位论文全文数据库 信息科技辑, no. 6, pages 138 - 1154 *

Also Published As

Publication number Publication date
CN112734035B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant