CN112734035A - Data processing method and device and readable storage medium - Google Patents


Info

Publication number
CN112734035A
CN112734035A
Authority
CN
China
Prior art keywords
label
verification
sample picture
probability
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011644826.3A
Other languages
Chinese (zh)
Other versions
CN112734035B (en
Inventor
张翼
顾华鑫
李辰
廖强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Jiahua Chain Cloud Technology Co ltd
Original Assignee
Chengdu Jiahua Chain Cloud Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Jiahua Chain Cloud Technology Co ltd filed Critical Chengdu Jiahua Chain Cloud Technology Co ltd
Priority to CN202011644826.3A priority Critical patent/CN112734035B/en
Publication of CN112734035A publication Critical patent/CN112734035A/en
Application granted granted Critical
Publication of CN112734035B publication Critical patent/CN112734035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Abstract

The application provides a data processing method and device and a readable storage medium. The data processing method comprises the following steps: acquiring a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label; respectively inputting the plurality of sample pictures into the plurality of verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are pairwise different and the labeling label is one of the plurality of labels; determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determining a fitting distribution of the plurality of labels according to the average cross entropies of the plurality of sample pictures; and determining whether the labeling label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution. The method improves the accuracy and efficiency of label cleaning.

Description

Data processing method and device and readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data processing method and apparatus, and a readable storage medium.
Background
For the training samples of a neural network model, the corresponding labels are usually annotated manually. Manually labeled labels can be inaccurate and therefore need to be cleaned.
In the prior art, label cleaning mainly relies on manually screening out wrong labels. Manual screening of wrong labels places high demands on data labeling practitioners; in particular, in some specialized data industries, personnel from outside the industry may need a long training period, and manual screening is inefficient.
Therefore, the existing label washing mode has low accuracy and efficiency.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data processing method and apparatus, and a readable storage medium, so as to improve accuracy and efficiency of label cleaning.
In a first aspect, an embodiment of the present application provides a data processing method, including: acquiring a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label; respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are pairwise different and the labeling label is one of the plurality of labels; determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determining a fitting distribution of the labels according to the average cross entropies of the sample pictures; and determining whether the labeling label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
In the embodiment of the application, compared with the prior art, the probability that the label of each sample picture is one of the preset labels is respectively output through a plurality of pre-trained verification models, and the average cross entropy between the label corresponding to each sample picture and the labels is determined based on the probability, wherein the average cross entropy can represent the distance between the real label and the label; and finally, determining whether the label is a correct label according to the probability of the average cross entropy in the fit distribution. In the process, manual screening is not needed, and the label cleaning efficiency is improved; meanwhile, the label is analyzed more scientifically based on the average cross entropy, the verification model and the probability distribution, and the label cleaning accuracy is improved.
As a possible implementation manner, before the obtaining a plurality of sample pictures and a plurality of pre-trained verification models, the method further includes: acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In the embodiment of the application, the cross training of the verification model is realized through the cross data set selected from a plurality of sample pictures, and the accuracy of the verification model is improved.
As a possible implementation manner, after the training a plurality of initial verification models through the cross data sets respectively to obtain a plurality of trained verification models, the method further includes: determining the accuracy and recall rate of the trained verification models; determining quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained verification models according to the quality scores.
In the embodiment of the application, the quality score is determined through the accuracy and the recall rate, and then the verification model is optimized through the quality score, so that the accuracy of the verification model is improved.
As a possible implementation manner, the determining, according to the verification results output by the multiple verification models, an average cross entropy between the label corresponding to each sample picture and the plurality of labels includes: calculating, by the formula

\bar{H}_j = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(j)} \ln p_{i,k}^{(j)}

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels; wherein \bar{H}_j is the average cross entropy of the jth sample picture; N is the number of verification models and K is the number of preset labels; p_{i,k}^{(j)} is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label; y_k^{(j)} = 1 if the labeling label corresponding to the jth sample picture is the kth label, and y_k^{(j)} = 0 if it is not.
In the embodiment of the application, the average cross entropy is determined by combining the relation between the labeling label and the labels, so that the average cross entropy can effectively reflect the distance between the real label and the labeling label.
As a possible implementation manner, the determining whether the label tag corresponding to each sample picture is a correct tag according to the probability of the average cross entropy of each sample picture in the fitting distribution includes: substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture; and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
In the embodiment of the application, the probability that the label of each sample picture is the corresponding labeling label of each sample picture is determined through the average cross entropy and the fitting distribution, and then the correctness of the labeling label is effectively and accurately determined through the probability threshold.
As a possible implementation manner, after determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fit distribution, the method further includes: acquiring a target sample picture; the label corresponding to the target sample picture is an error label; and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
In the embodiment of the application, based on the result output by the verification model, besides the judgment on the correctness of the label, the correction of the error label can be realized.
As a possible implementation manner, the correcting the error label according to the verification result of the target sample picture output by the multiple verification models includes: calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In the embodiment of the application, the label with the maximum average probability is determined as the correct label corresponding to the target sample picture, so that the labeled label is effectively corrected.
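A minimal sketch of this averaging-based correction (function name hypothetical; each inner list is one verification model's output for the target sample picture):

```python
def correct_by_average_probability(model_outputs):
    """Return the index of the label whose probability, averaged across
    all verification models, is largest; this label is taken as the
    corrected label for the target sample picture."""
    num_models = len(model_outputs)
    num_labels = len(model_outputs[0])
    avg = [sum(out[k] for out in model_outputs) / num_models
           for k in range(num_labels)]
    return max(range(num_labels), key=lambda k: avg[k])

# Two models, two labels (e.g. apple=0, banana=1); averages are 0.25 and 0.75:
best = correct_by_average_probability([[0.3, 0.7], [0.2, 0.8]])  # -> 1
```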
As a possible implementation manner, the correcting the error label according to the verification result of the target sample picture output by the multiple verification models includes: determining a frequency of a probability that the probability of each of the plurality of tags is the maximum according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum frequency as a correct label corresponding to the target sample picture.
In the embodiment of the application, the label is effectively corrected by determining the frequency of the maximum probability of each label and then determining the label with the maximum frequency as the correct label corresponding to the target sample picture.
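A minimal sketch of this frequency-based correction (function name hypothetical): each verification model casts one "vote" for its highest-probability label, and the most frequent winner is taken as the corrected label.

```python
from collections import Counter

def correct_by_argmax_frequency(model_outputs):
    """Count, across verification models, how often each label index has
    the highest probability; the most frequent argmax is returned as the
    corrected label for the target sample picture."""
    votes = Counter(max(range(len(out)), key=out.__getitem__)
                    for out in model_outputs)
    return votes.most_common(1)[0][0]

# Three models; label 1 has the highest probability in two of them:
winner = correct_by_argmax_frequency([[0.6, 0.4], [0.3, 0.7], [0.1, 0.9]])  # -> 1
```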
In a second aspect, an embodiment of the present application provides a data processing apparatus, which includes functional modules for implementing the data processing method described in the first aspect and any one of the possible implementation manners of the first aspect.
In a third aspect, an embodiment of the present application provides a readable storage medium, where a computer program is stored on the readable storage medium, and the computer program is executed by a computer to perform the data processing method described in the first aspect and any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a data processing method provided in an embodiment of the present application;
fig. 2 is a functional block diagram of a data processing apparatus according to an embodiment of the present application.
Icon: 200-a data processing apparatus; 210-an obtaining module; 220-processing module.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The technical scheme provided by the embodiment of the application can be applied to an application scene needing label cleaning, the cleaned label is a manually labeled label, and the label cleaning result can comprise two aspects, namely judging whether the label is a correctly labeled label or not on one hand, and correcting a wrongly labeled label on the other hand.
In the embodiment of the present application, the processed object may be a picture, and therefore the cleaned tag may also be a tag of the picture. The labels of pictures can take different forms: some picture tags mark the target objects in the picture and may represent the category or name of those objects; other picture tags mark the attributes of the target object in the picture and may represent those attributes, such as whether the retina in a retinal image is a healthy retina. From another perspective, the tags may be used to classify the pictures, or the tags may represent the recognition or detection results of the pictures.
Of course, for other non-image processing objects, the tag may also be cleaned by referring to the technical solution of the embodiment of the present application, and the image application scenario defined in the embodiment of the present application does not limit the technical solution of the embodiment of the present application.
The hardware operating environment of the technical scheme provided by the embodiment of the application can be a front end, such as a client, or a back end, such as a server; the system may also be a system composed of a front end and a back end, in which the front end is configured to receive a data processing request, and then the front end sends the data processing request and corresponding data to the back end, and the back end performs data processing based on the data processing request and the corresponding data.
Based on the introduction of the application scenario and the hardware operating environment, referring to fig. 1, a flowchart of a data processing method provided in an embodiment of the present application is shown, where the data processing method includes:
step 110: and acquiring a plurality of sample pictures and a plurality of pre-trained verification models. Each sample picture corresponds to one labeling label.
Step 120: and respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model. The verification result comprises the following steps: the probability that the label of each sample picture is each of a plurality of preset labels is obtained; the annotation label belongs to a label of the plurality of labels.
Step 130: and determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models.
Step 140: and determining the fitting distribution of the labels according to the average cross entropy of the sample pictures.
Step 150: and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
In the embodiment of the application, compared with the prior art, the probability that the label of each sample picture is one of the preset labels is respectively output through a plurality of pre-trained verification models, and the average cross entropy between the label corresponding to each sample picture and the labels is determined based on the probability, wherein the average cross entropy can represent the distance between the real label and the label; and finally, determining whether the label is a correct label according to the probability of the average cross entropy in the fit distribution. In the process, manual screening is not needed, and the label cleaning efficiency is improved; meanwhile, the label is analyzed more scientifically based on the average cross entropy, the verification model and the probability distribution, and the label cleaning accuracy is improved.
A detailed implementation of steps 110-150 is described next.
In step 110, the plurality of sample pictures are pictures corresponding to labeling labels, where the labeling labels are manually labeled. The plurality of verification models are pre-trained models used to predict the labels of the sample pictures. Next, how the plurality of sample pictures and the plurality of verification models are prepared will be described.
The plurality of sample pictures are the pictures whose labels are to be cleaned; they may be provided by a user or obtained from a database. As an optional implementation manner, a plurality of pictures are obtained first. These pictures may include two types: one type is pictures already marked with labels, which can be used directly as sample pictures; the other is pictures not yet marked with labels, which can be used as sample pictures after labels are marked on them.
As an alternative embodiment, the training process of the plurality of verification models includes: acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; respectively training a plurality of initial verification models through a cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In this embodiment, assuming that the obtained multiple pictures are the target data set, a metadata set may be obtained based on the target data set, where the metadata set may be directly the target data set, or may be obtained by sampling from the target data set by a random sampling method.
Then, the training data set and the validation data set are selected from the metadata set. Before selecting them, the sample pictures in the metadata set can be cross-split according to the format of the data set. Data sets come in two formats, detection data sets and classification data sets. For a detection data set, the pictures need to be cropped according to the labels to form a classification data set, that is, the pictures are labeled as in the foregoing embodiment to obtain labeled sample pictures. A classification data set requires no cropping, since its pictures already correspond to labels as in the foregoing embodiment.
Selecting a part of sample pictures from the classification data set as a training data set based on the classification data set; and the other part of the sample picture is used as a verification data set. And then training the initial verification model by using the training data set, and testing the verification model obtained by training by using the verification data set so as to evaluate the performance index of the model.
Since there are multiple verification models, different groupings of training data sets and verification data sets can be formed from the classification data set for cross-training the different verification models. As an alternative embodiment, assume there are multiple data sets and multiple verification models to be trained: some of the data sets are used as training data sets in turn, the remaining data sets are used as verification data sets, and the resulting groupings are used to cross-train the different verification models. For example, assume that there are 10 data sets and 10 verification models to be trained: the training data set corresponding to the 1st verification model is the 1st to 9th data sets, and the corresponding verification data set is the 10th data set; the training data set corresponding to the 2nd verification model is the 2nd to 10th data sets, and the corresponding verification data set is the 1st data set; the data sets corresponding to the other verification models follow by analogy. In practical application, the grouping of the data sets can be specified in advance; different groupings are then formed according to the preset grouping scheme, yielding different training and verification data sets used to train the verification models.
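The rotating train/validation grouping described here is essentially k-fold cross-validation. A purely illustrative sketch (not the patent's implementation; zero-based indices are used, so the model at index i validates on subset i and trains on the remaining subsets):

```python
# Sketch of the rotating grouping: with 10 subsets and 10 verification
# models, each model holds out one subset for validation and trains on
# the other nine (k-fold cross-validation).
def cross_groups(num_subsets):
    groups = []
    for i in range(num_subsets):
        train = [j for j in range(num_subsets) if j != i]
        validation = [i]
        groups.append((train, validation))
    return groups

groups = cross_groups(10)
# groups[0] trains on subsets 1..9 and validates on subset 0, and so on.
```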
In the embodiment of the present application, the structure of the verification model may adopt an EfficientNet (a neural network) proposed by Google brain team, or may adopt other feasible network structures. During training, the training data set is used for training a classifier in the verification model so that the classifier can predict the probability that the picture is taken as each label; the verification data set is used for testing the verification model obtained through training so as to test whether the verification model can realize the prediction of the label probability. The training and testing of the verification model is well within the skill of the art and will not be described in detail herein.
In the embodiment of the present application, after the verification model is trained, it may predict the probability that the label of a sample picture is each label. For example, if there are three labels in total, the verification model may output the probability that the label of the sample picture is the first label, the probability that it is the second label, and the probability that it is the third label. In addition, the sample picture also corresponds to a labeling label, which is one of the three labels; the labeling label may or may not be the correct label.
In this embodiment of the present application, after training of multiple verification models is completed, quality evaluation may also be performed on the multiple verification models, which may also be understood as evaluating the prediction capability of the verification models, and therefore, the method further includes: determining the accuracy and recall rate of the trained verification models; determining the quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained multiple verification models according to the quality scores.
Wherein, the accuracy P can be calculated by:

P = TP / (TP + FP)

and the recall rate R can be calculated by:

R = TP / (TP + FN)

TP is the number of sample pictures, among the sample pictures predicted by the verification model, whose prediction result is correct; a prediction result is correct when the label with the maximum probability output by the verification model is the correct label of the sample picture. FP is the number of sample pictures whose prediction result is the first kind of error; for the first kind of error, the label with the maximum probability output by the verification model is a non-labeling label that is not the correct label. FN is the number of sample pictures whose prediction result is the second kind of error; for the second kind of error, the label with the maximum probability output by the verification model is the labeling label, but the labeling label is not the correct label.
Based on the accuracy and the recall rate, the quality score F1 of the verification model can be calculated by:

F1 = 2 × P × R / (P + R)
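As an illustrative sketch of the metric computation above (the counts TP, FP, FN are taken as inputs; the function name is hypothetical):

```python
def quality_score(tp, fp, fn):
    """Compute precision P, recall R and quality score F1 from
    prediction counts, following P = TP/(TP+FP), R = TP/(TP+FN),
    F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Example counts (illustrative only):
p, r, f1 = quality_score(tp=80, fp=10, fn=20)
```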
Based on the quality scores, the verification model can be continuously optimized. For example: a quality score threshold is preset, the quality score of the verification model is calculated after each round of training, and if the quality score does not reach the threshold, the verification model is trained again until the quality score reaches the threshold. When the verification model is retrained, the cross data set may or may not be replaced.
In the embodiment of the application, the quality score is determined through the accuracy and the recall rate, and then the verification model is optimized through the quality score, so that the accuracy of the verification model is improved.
Based on the above description of the verification models, in step 120, the plurality of sample pictures are respectively input into the plurality of verification models. For example, assuming 10 sample pictures and 10 verification models, the 10 sample pictures are input into each verification model. Correspondingly, each verification model outputs a plurality of verification results; for example, the verification result output by the 1st verification model comprises the probability that the 1st sample picture is each of the plurality of labels, the probability that the 2nd sample picture is each of the plurality of labels, and similarly for the 3rd to 10th sample pictures. It can be understood that when the sample pictures are input into the verification models, their labeling labels are not input together; the per-label probabilities output by a verification model therefore include the probability of the labeling label.
In step 120, the preset labels may be set according to the labeling labels corresponding to the plurality of sample pictures. The labeling labels corresponding to different sample pictures may be the same or different, and the preset labels are all the distinct labels among those labeling labels. For example: among the plurality of sample pictures, 3 pictures have labeling label A, 4 pictures have labeling label B, and 3 pictures have labeling label C; then the plurality of labels comprise: label A, label B, and label C.
Based on the verification result obtained in step 120, in step 130, an average cross entropy between the label and the plurality of labels corresponding to each sample picture is determined according to the verification results output by the plurality of verification models.
As an optional implementation manner, the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels is calculated by the following formula:

\bar{H}_j = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(j)} \ln p_{i,k}^{(j)}

wherein \bar{H}_j is the average cross entropy of the jth sample picture; N is the number of verification models and K is the number of preset labels; p_{i,k}^{(j)} is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label; y_k^{(j)} = 1 if the labeling label corresponding to the jth sample picture is the kth label, and y_k^{(j)} = 0 if it is not.
By way of example, assume that the preset plurality of labels comprises apple and banana, where the label represents the name of the object in the sample picture; apple is the 1st label and banana is the 2nd label. Assume further that there are 2 verification models and 10 sample pictures.
For the 5th sample picture, the labeling label is banana. The probability output by the 1st verification model that the label of this sample picture is apple is 0.3, and that it is banana is 0.7; the probability output by the 2nd verification model that the label is apple is 0.8, and that it is banana is 0.2. Then, for the 5th sample picture, the average cross entropy is: -(0×log0.3 + 1×log0.7 + 0×log0.8 + 1×log0.2)/2 ≈ 0.983 (log denoting the natural logarithm).
Continuing with the example, for the 6th sample picture, whose labeling label is banana, the probability output by the 1st verification model that the label is apple is 0.3 and that it is banana is 0.7, and the probability output by the 2nd verification model that the label is apple is 0.2 and that it is banana is 0.8; the average cross entropy for the 6th sample picture is therefore -(0×log0.3 + 1×log0.7 + 0×log0.2 + 1×log0.8)/2 ≈ 0.2899.
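The two worked examples can be reproduced with a short sketch (function name hypothetical; natural logarithms are used, which match the values 0.983 and 0.2899 above):

```python
import math

def average_cross_entropy(probs, label_index):
    """probs: one per-label probability list per verification model for a
    single sample picture; label_index: index of the sample's annotation
    label among the preset labels. Because y_k = 1 only for the
    annotation label, the inner sum reduces to -log of its probability."""
    total = 0.0
    for model_probs in probs:
        total -= math.log(model_probs[label_index])
    return total / len(probs)

# 5th sample picture (annotation label banana, index 1):
h5 = average_cross_entropy([[0.3, 0.7], [0.8, 0.2]], label_index=1)  # ≈ 0.983
# 6th sample picture (annotation label banana, index 1):
h6 = average_cross_entropy([[0.3, 0.7], [0.2, 0.8]], label_index=1)  # ≈ 0.2899
```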
Further, the average cross entropy implicitly reflects the distance between the correct label and the labeling label. For example, suppose that in reality the correct label of the 5th sample picture is apple while its labeling label is banana: the distance between the labeling label and the correct label is large, which indicates that the labeling label is a wrong label. The correct label of the 6th sample picture is banana and its labeling label is also banana: the distance between them is small, which indicates that the labeling label is the correct label.
From step 130, an average cross entropy for each sample picture may be determined, based on which, in step 140, a fitted distribution of the plurality of labels is determined from the average cross entropy of the plurality of sample pictures.
As an alternative implementation, in step 140 the cross entropy values corresponding to all the labels may be collected, a histogram distribution calculated, and a continuous distribution fitted separately for each label. The fitted distribution can be expressed as:

D_k = fit(H̄_k)

wherein H̄_k denotes all the average cross entropies corresponding to the kth label, for example all the average cross entropies corresponding to the apple label, and D_k is the distribution obtained by fitting all the average cross entropies of the kth label.
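A minimal sketch of step 140, assuming a normal distribution is fitted per label with the standard library's `statistics.NormalDist` (the patent leaves the distribution family open, and the function name is hypothetical):

```python
from collections import defaultdict
from statistics import NormalDist

def fit_label_distributions(entropies, labels):
    """Group the average cross entropies by annotated label index and fit one
    continuous distribution D_k per label (normal fit assumed here)."""
    by_label = defaultdict(list)
    for h, k in zip(entropies, labels):
        by_label[k].append(h)
    return {k: NormalDist.from_samples(v) for k, v in by_label.items()}

# Toy data: label 0 = "apple", label 1 = "banana"
entropies = [0.3, 0.25, 0.35, 0.9, 1.0, 0.95]
labels    = [0,   0,    0,    1,   1,   1]
dists = fit_label_distributions(entropies, labels)
print(dists[0].mean, dists[1].mean)
```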
After the fitted distribution is obtained in step 140, in step 150 it is determined, according to the probability of the average cross entropy of each sample picture in the fitted distribution, whether the labeling label corresponding to each sample picture is the correct label. As an optional implementation manner, this step includes: substituting the average cross entropy of each sample picture into the probability density function of the fitted distribution to obtain the probability that the label of each sample picture is the labeling label corresponding to that sample picture; and determining whether the labeling label corresponding to each sample picture is the correct label according to this probability and a preset probability threshold.
In this embodiment, each sample picture corresponds to an average cross entropy, which is substituted into the probability density function of the fitted distribution D_k to obtain the probability of the sample picture under the fitted distribution; this probability represents the probability that the labeling label of the sample picture is its corresponding label. The correctness of the labeling label is then judged against a preset probability threshold: if the obtained probability is greater than or equal to the probability threshold, the labeling label corresponding to the sample picture is the correct label; if the obtained probability is smaller than the probability threshold, the labeling label corresponding to the sample picture is not the correct label.
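The threshold test described above can be sketched as follows. Note that the patent treats the probability density value as a probability; the distribution parameters, function name, and threshold value here are illustrative assumptions:

```python
from statistics import NormalDist

def label_is_correct(avg_entropy, dist, threshold):
    """Substitute the sample's average cross entropy into the probability
    density function of the fitted distribution; the labeling label is
    judged correct when the resulting value reaches the preset threshold."""
    return dist.pdf(avg_entropy) >= threshold

d = NormalDist(mu=0.3, sigma=0.1)  # assumed fitted distribution for one label
print(label_is_correct(0.32, d, threshold=0.5))  # near the mode -> True
print(label_is_correct(0.95, d, threshold=0.5))  # far in the tail -> False
```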
In the embodiment of the application, the probability that the label of each sample picture is its corresponding labeling label is determined through the average cross entropy and the fitted distribution, and the correctness of the labeling label is then effectively and accurately determined through the probability threshold.
In the embodiment of the present application, after the determination result is obtained in step 150, the sample picture with the correct label can be automatically marked according to the determination result, for example, the sample picture with the correct label is marked as the correct label sample; and marking the sample picture with the wrong label as the wrong label sample.
And correcting the labeling label of the sample picture which is labeled by mistake. Thus, after step 150, the method further comprises: acquiring a target sample picture; the label corresponding to the target sample picture is an error label; and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
As described in the foregoing embodiment, based on the determination result each sample picture can be marked as a correctly or incorrectly labeled sample picture; therefore, the incorrectly labeled sample pictures can be directly obtained as the target sample pictures.
For the target sample picture, each verification model outputs a corresponding verification result. Based on the verification result output by each verification model, correction of the error label can be realized, and in the embodiment of the application, two optional correction modes are provided.
A first optional correction mode, namely calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by a plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In this modification, the average of the probabilities that the label of the target sample picture output by each verification model is a given label is calculated. For example, assume that there are 2 verification models and 2 preset labels; the probability that the label of the target sample picture output by the 1st verification model is the 1st label is a, and the probability that it is the 2nd label is b; the probability that the label output by the 2nd verification model is the 1st label is c, and the probability that it is the 2nd label is d. Then the average probability of the 1st label is (a + c)/2, and the average probability of the 2nd label is (b + d)/2.
And determining, based on the average probability, the label with the maximum average probability as the correct label corresponding to the target sample picture. Continuing with the example: if (a + c)/2 > (b + d)/2, the correct label corresponding to the target sample picture is the 1st label; if (a + c)/2 < (b + d)/2, the correct label corresponding to the target sample picture is the 2nd label; if (a + c)/2 is equal to (b + d)/2, this correction mode alone cannot determine the correct label.
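A sketch of the first correction mode under the assumptions above. The list-of-lists probability layout and the function name are illustrative; a tie returns None, mirroring the case where this mode cannot decide:

```python
def correct_by_average_probability(model_probs):
    """Average, over the verification models, the probability assigned to
    each preset label; return the index of the label with the highest
    average, or None when there is a tie (this mode then cannot decide)."""
    n_models = len(model_probs)
    n_labels = len(model_probs[0])
    avg = [sum(p[k] for p in model_probs) / n_models for k in range(n_labels)]
    best = max(avg)
    winners = [k for k, a in enumerate(avg) if a == best]
    return winners[0] if len(winners) == 1 else None

# 2 verification models, 2 preset labels; (a, b) and (c, d) from the example
print(correct_by_average_probability([[0.3, 0.7], [0.2, 0.8]]))  # 1 (2nd label)
print(correct_by_average_probability([[0.4, 0.6], [0.6, 0.4]]))  # None (tie)
```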
In the embodiment of the application, the label with the maximum average probability is determined as the correct label corresponding to the target sample picture, so that the labeled label is effectively corrected.
A second optional correction mode: determining, according to the verification results of the target sample picture output by the multiple verification models, the frequency with which each of the multiple labels has the maximum probability; and determining the label with the maximum frequency as the correct label corresponding to the target sample picture.
In this correction mode, each verification model outputs different probabilities for the labels of the target sample picture, but among those probabilities one label has the maximum probability. Therefore, the maximum-probability label output by each verification model can be counted, and the frequency with which each label is the maximum-probability label is then determined based on the statistics.
For example, assume that there are 10 verification models and the preset labels include label A and label B. For the target sample picture, label A has the greater probability in the output results of 8 verification models, so the frequency of label A is 0.8; label B has the greater probability in the output results of the other 2 verification models, so the frequency of label B is 0.2.
Further, after the frequency of each label which is the maximum probability label is determined, the label with the maximum frequency is determined as the correct label corresponding to the target sample picture. Continuing with the example, label a described above will be the correct label for the target sample picture.
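The second correction mode can be sketched as follows. The function name is hypothetical, and a tie returns None, consistent with marking the picture as an unknown-label sample:

```python
from collections import Counter

def correct_by_vote_frequency(model_probs):
    """For each verification model take the label with the maximum output
    probability, then pick the label voted for most often; return None on
    a tie between the top labels."""
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in model_probs)
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # frequencies tied: this mode cannot decide
    return ranked[0][0]

# 10 verification models: label A (index 0) wins in 8, label B (index 1) in 2
probs = [[0.9, 0.1]] * 8 + [[0.2, 0.8]] * 2
print(correct_by_vote_frequency(probs))  # 0, i.e. label A with frequency 0.8
```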
The two modifications may be used in combination or individually. Such as: label correction is carried out on the target sample pictures in a first correction mode; or the target sample picture is subjected to label correction by adopting a second correction mode.
Alternatively, the target sample picture is first corrected by the first correction mode; if the first correction mode cannot complete the correction, for example because more than one label has the highest average probability (such as (a + c)/2 = (b + d)/2), the second correction mode is used. If the second correction mode also cannot complete the correction, for example because the frequencies of label A and label B are equal, the target sample picture can be marked as an unknown-label sample picture. Of course, the order of implementation of the first correction mode and the second correction mode may be reversed.
Or, the target sample picture firstly adopts a first correction mode to perform label correction, and then verifies the correction result of the first correction mode by using a second correction mode, such as: determining a correct label through a second correction mode, and determining that the correction is correct if the correct label is the same as the correct label determined by the first correction mode; if the correct label is different from the correct label determined by the first correction mode, it is determined that the correction may be wrong, and at this time, the target sample picture may be marked as an unknown label sample. Of course, the order of implementation of the first modification and the second modification may be reversed.
The application modes of the two correction modes can be flexibly set according to actual application scenarios, and in the embodiment of the present application, the application modes are merely exemplary and are not limited.
In the foregoing embodiment, if the target sample picture cannot be corrected, it is marked as an unknown-label sample picture, and various processing methods may be adopted for such pictures, for example feeding back the unknown-label sample pictures so that a professional annotates the correct labels.
Based on the same inventive concept, please refer to fig. 2, an embodiment of the present application further provides a data processing apparatus 200, which includes an obtaining module 210 and a processing module 220.
The obtaining module 210 is configured to: acquire a plurality of sample pictures and a plurality of pre-trained verification models, each sample picture corresponding to one labeling label. The processing module 220 is configured to: respectively input the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model, the verification result comprising the probability that the label of each sample picture is each of a plurality of preset labels, wherein the preset labels are different from one another and the labeling label is one of the plurality of preset labels; determine the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models; determine the fitted distribution of the plurality of labels according to the average cross entropy of the plurality of sample pictures; and determine whether the labeling label corresponding to each sample picture is the correct label according to the probability of the average cross entropy of each sample picture in the fitted distribution.
In this embodiment of the present application, the obtaining module 210 is further configured to obtain a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures; the processing module 220 is further configured to: training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
In an embodiment of the present application, the processing module 220 is further configured to: determining the accuracy and recall rate of the trained verification models; determining quality scores of the trained verification models according to the accuracy rate and the recall rate; and optimizing the trained verification models according to the quality scores.
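The patent does not fix how the accuracy rate and recall rate are combined into a quality score; the F-beta measure is one common choice and is shown here purely as an illustrative assumption:

```python
def quality_score(precision, recall, beta=1.0):
    """Combine accuracy rate (precision) and recall into a single quality
    score using the F-beta measure (an assumed, illustrative choice)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(round(quality_score(0.8, 0.6), 4))  # F1 score: 0.6857
```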
In this embodiment of the application, the processing module 220 is specifically configured to: calculate, by the formula

H̄_j = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_jk · log(p_ijk),

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels; wherein H̄_j is the average cross entropy of the jth sample picture, N is the number of verification models, K is the number of preset labels, p_ijk is the probability, output by the ith verification model, that the label of the jth sample picture is the kth preset label, y_jk = 1 when the labeling label corresponding to the jth sample picture is the kth preset label, and y_jk = 0 when it is not.
In this embodiment of the application, the processing module 220 is specifically configured to: substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture; and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
In this embodiment of the application, the obtaining module 210 is further configured to obtain a target sample picture; the label corresponding to the target sample picture is an error label; the processing module 220 is further configured to: and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
In this embodiment of the application, the processing module 220 is specifically configured to: calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models; and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
In this embodiment of the application, the processing module 220 is further specifically configured to: determine, according to the verification results of the target sample picture output by the plurality of verification models, the frequency with which each of the plurality of labels has the maximum probability; and determine the label with the maximum frequency as the correct label corresponding to the target sample picture.
The functional blocks of the data processing apparatus 200 correspond to the steps of the data processing method described in the foregoing embodiment, and therefore, the implementation of the functional blocks may refer to the implementation of the steps of the data processing method described in the foregoing embodiment, and the description thereof is not repeated in the embodiment of the present application.
Based on the same inventive concept, embodiments of the present application further provide a readable storage medium, where a computer program is stored, and when the computer program is executed by a computer, the data processing method described in the foregoing embodiments is executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A data processing method, comprising:
acquiring a plurality of sample pictures and a plurality of pre-trained verification models; each sample picture corresponds to one label;
respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model; the verification result comprises: the probability that the label of each sample picture is each of a plurality of preset labels; the labeling label is one of the plurality of preset labels, and the plurality of preset labels are different from one another;
determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models;
determining fitting distribution of the labels according to the average cross entropy of the sample pictures;
and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
2. The method of claim 1, wherein prior to said obtaining the plurality of sample pictures and the plurality of pre-trained verification models, the method further comprises:
acquiring a cross data set; the cross data set comprises a training data set and a verification data set, the training data set comprises a plurality of first sample pictures, the verification data set comprises a plurality of second sample pictures, and the plurality of first sample pictures and the plurality of second sample pictures are all selected from the plurality of sample pictures;
training a plurality of initial verification models respectively through the cross data set to obtain a plurality of trained verification models; the training data set is used for training a classifier in an initial verification model, and the verification data set is used for testing the verification model obtained through training.
3. The method of claim 2, wherein after the training of the plurality of initial verification models by the respective cross data sets to obtain the trained plurality of verification models, the method further comprises:
determining the accuracy and recall rate of the trained verification models;
determining quality scores of the trained verification models according to the accuracy rate and the recall rate;
and optimizing the trained verification models according to the quality scores.
4. The method of claim 1, wherein determining the average cross entropy between the label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models comprises:
calculating, by the formula

H̄_j = -(1/N) · Σ_{i=1}^{N} Σ_{k=1}^{K} y_jk · log(p_ijk),

the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels;
wherein H̄_j is the average cross entropy of the jth sample picture, N is the number of verification models, K is the number of preset labels, p_ijk is the probability that the label of the jth sample picture output by the ith verification model is the kth preset label, y_jk = 1 when the labeling label corresponding to the jth sample picture is the kth preset label, and y_jk = 0 when it is not.
5. The method of claim 1, wherein determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fit distribution comprises:
substituting the average cross entropy of each sample picture into the probability density function of the fitting distribution to obtain the probability that the label of each sample picture is the label corresponding to each sample picture;
and determining whether the label corresponding to each sample picture is a correct label or not according to the probability that the label of each sample picture is the label corresponding to each sample picture and a preset probability threshold.
6. The method of claim 1, wherein after determining whether the label corresponding to each sample picture is the correct label according to the probability of the average cross entropy of each sample picture in the fit distribution, the method further comprises:
acquiring a target sample picture; the label corresponding to the target sample picture is an error label;
and correcting the error label according to the verification result of the target sample picture output by the plurality of verification models.
7. The method according to claim 6, wherein the correcting the error label according to the verification result of the target sample picture output by the plurality of verification models comprises:
calculating the average probability that the label of the target sample picture is each preset label in a plurality of labels according to the verification result of the target sample picture output by the plurality of verification models;
and determining the label with the maximum average probability as a correct label corresponding to the target sample picture.
8. The method according to claim 6, wherein the correcting the error label according to the verification result of the target sample picture output by the plurality of verification models comprises:
determining, according to the verification results of the target sample picture output by the plurality of verification models, the frequency with which each of the plurality of labels has the maximum probability;
and determining the label with the maximum frequency as a correct label corresponding to the target sample picture.
9. A data processing apparatus, comprising:
the acquisition module is used for acquiring a plurality of sample pictures and a plurality of pre-trained verification models; each sample picture corresponds to one label;
a processing module to:
respectively inputting the multiple sample pictures into the multiple verification models to obtain a verification result output by each verification model; the verification result comprises: the probability that the label of each sample picture is each of a plurality of preset labels; the labeling label is one of the plurality of preset labels; the preset labels are different from one another;
determining the average cross entropy between the labeling label corresponding to each sample picture and the plurality of labels according to the verification results output by the plurality of verification models;
determining fitting distribution of the labels according to the average cross entropy of the sample pictures;
and determining whether the label corresponding to each sample picture is a correct label according to the probability of the average cross entropy of each sample picture in the fitting distribution.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when executed by a computer, performs the data processing method according to any one of claims 1 to 8.
CN202011644826.3A 2020-12-31 2020-12-31 Data processing method and device and readable storage medium Active CN112734035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011644826.3A CN112734035B (en) 2020-12-31 2020-12-31 Data processing method and device and readable storage medium


Publications (2)

Publication Number Publication Date
CN112734035A true CN112734035A (en) 2021-04-30
CN112734035B CN112734035B (en) 2023-10-27

Family

ID=75609406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011644826.3A Active CN112734035B (en) 2020-12-31 2020-12-31 Data processing method and device and readable storage medium

Country Status (1)

Country Link
CN (1) CN112734035B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170061330A1 (en) * 2015-08-31 2017-03-02 International Business Machines Corporation Method, system and computer program product for learning classification model
CN109784391A (en) * 2019-01-04 2019-05-21 杭州比智科技有限公司 Sample mask method and device based on multi-model
CN110163234A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of model training method, device and storage medium
CN110472665A (en) * 2019-07-17 2019-11-19 新华三大数据技术有限公司 Model training method, file classification method and relevant apparatus
US20190354857A1 (en) * 2018-05-17 2019-11-21 Raytheon Company Machine learning using informed pseudolabels
CN110490237A (en) * 2019-08-02 2019-11-22 Oppo广东移动通信有限公司 Data processing method, device, storage medium and electronic equipment
CN110991496A (en) * 2019-11-15 2020-04-10 北京三快在线科技有限公司 Method and device for training model
CN111209972A (en) * 2020-01-09 2020-05-29 中国科学院计算技术研究所 Image classification method and system based on hybrid connectivity deep convolution neural network
WO2020152627A1 (en) * 2019-01-23 2020-07-30 Aptiv Technologies Limited Automatically choosing data samples for annotation
CN111553315A (en) * 2020-05-14 2020-08-18 哈尔滨工业大学(深圳) Satellite image-based poverty prediction model construction and poverty prediction method
CN111582366A (en) * 2020-05-07 2020-08-25 清华大学 Image processing method, device and equipment
CN111832627A (en) * 2020-06-19 2020-10-27 华中科技大学 Image classification model training method, classification method and system for suppressing label noise
CN111897996A (en) * 2020-08-10 2020-11-06 北京达佳互联信息技术有限公司 Topic label recommendation method, device, equipment and storage medium
CN112115995A (en) * 2020-09-11 2020-12-22 北京邮电大学 Image multi-label classification method based on semi-supervised learning
US20200401852A1 (en) * 2019-06-20 2020-12-24 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for generating information assessment model


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. RUSIECKI: "Trimmed categorical cross-entropy for deep learning with label noise", IMAGE AND VISION PROCESSING AND DISPLAY TECHNOLOGY, vol. 55, no. 6, pages 319 - 320, XP006075931, DOI: 10.1049/el.2018.7980 *
姚佳奇 等: "基于标签语义相似的动态多标签文本分类算法", 计算机工程与应用, vol. 56, no. 19, pages 94 - 98 *
牛中超: "基于深度学习的问题标签预测研究与实现", 中国优秀硕士学位论文全文数据库 信息科技辑, no. 6, pages 138 - 1154 *

Also Published As

Publication number Publication date
CN112734035B (en) 2023-10-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant