CN110222791B - Sample labeling information auditing method and device - Google Patents
Info
- Publication number
- CN110222791B (application number CN201910538177.XA)
- Authority
- CN
- China
- Prior art keywords
- sample
- identification
- recognition
- auditing
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a method and a device for auditing sample labeling information. The method comprises: acquiring labeled samples to be audited and forming a training sample set; dividing the training sample set into a preset number of sub-sample sets, training the different sub-sample sets respectively, and establishing different first recognition models; acquiring a recognition sample set used for testing, recognizing each recognition sample in the recognition sample set with each of the established first recognition models to obtain each first recognition model's recognition result for the recognition sample, counting the number of occurrences of each recognition result and, when a recognition result whose number of occurrences is not smaller than a preset threshold exists, determining the first recognition models corresponding to recognition results whose number of occurrences is smaller than the preset threshold as target recognition models; and auditing the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models. With the scheme provided by the invention, the labeling results of samples can be audited quickly.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for auditing sample labeling information, electronic equipment and a computer-readable storage medium.
Background
In the field of artificial intelligence model training, samples need to be labeled, for example, the samples are labeled manually, or the samples are identified and labeled automatically through a pre-established neural network identification model. In order to ensure the accuracy of model training, whether the labeling information of the sample is accurate or not needs to be checked.
Currently, the annotation information of all annotated samples is usually reviewed manually. However, since the number of samples in the sample set is large, it takes much time and labor to review the labeled information of the samples.
Disclosure of Invention
The invention aims to provide a method and a device for auditing sample labeling information, electronic equipment and a computer readable storage medium, which are used for quickly auditing the labeling information of a sample. The specific technical scheme is as follows:
in a first aspect, the present invention provides a method for auditing sample labeling information, where the method includes:
acquiring a labeled sample to be examined and forming a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
dividing the training sample set into a preset number of sub-sample sets, respectively training different sub-sample sets, and establishing different first recognition models; the first recognition model is a neural network-based model;
acquiring an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models to obtain an identification result of each first identification model on the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists;
and performing annotation information examination on the annotated samples in the sub-sample set corresponding to the target identification model.
Optionally, when there are a plurality of recognition results whose occurrence times are not less than the preset threshold, the method further includes:
determining a first recognition model corresponding to the target recognition result as a target recognition model;
and the target recognition result is a recognition result except the recognition result with the largest occurrence frequency in the plurality of recognition results.
Optionally, when the occurrence frequency of each recognition result is less than the preset threshold, the method further includes:
and auditing the identification sample to obtain an auditing result of the identification sample.
Optionally, after the audit is performed on the identification sample, the method further includes:
judging whether the auditing result of the identification sample exists in the identification results of different first identification models for the identification sample;
if the identification result is different from the auditing result, determining a first identification model with the identification result different from the auditing result as a target identification model;
if not, all of the first recognition models are determined to be target recognition models.
Optionally, the preset number is greater than or equal to 3.
Preferably, the preset number is greater than or equal to 5.
Optionally, the dividing the training sample set into a preset number of sub-sample sets includes:
and averagely dividing the training sample set into a preset number of sub-sample sets, wherein the difference between the number of samples in any two sub-sample sets is less than or equal to 1.
Optionally, the obtaining of the identification sample set used as the test includes:
and acquiring part of the labeled samples from the labeled samples needing to be examined to form an identification sample set used for testing.
Optionally, the examining and verifying labeled information of labeled samples in the sub-sample set corresponding to the target recognition model includes:
and sending the labeled samples in the sub-sample set corresponding to the target identification model to a verification client so that the verification client can perform labeled information verification on the received labeled samples.
Optionally, the verification client is a client that audits the received marked sample through a second recognition model established through pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold; or
The verification client is a client for carrying out manual examination on the received labeling sample.
In a second aspect, the present invention further provides an apparatus for auditing sample labeling information, where the apparatus includes:
the acquisition module is used for acquiring the marked samples needing to be audited and forming a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
the training module is used for dividing the training sample set into a preset number of sub sample sets, respectively training different sub sample sets and establishing different first recognition models; the first recognition model is a neural network-based model;
the identification module is used for acquiring an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models respectively to obtain an identification result of each first identification model on the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists;
and the auditing module is used for auditing the labeling information of the labeling samples in the sub-sample set corresponding to the target identification model.
Optionally, the identification module is further configured to:
when a plurality of recognition results with the occurrence times not less than a preset threshold exist, determining a first recognition model corresponding to the target recognition result as a target recognition model;
and the target recognition result is a recognition result except the recognition result with the largest occurrence frequency in the plurality of recognition results.
Optionally, the identification module is further configured to:
and when the occurrence frequency of each identification result is smaller than the preset threshold value, auditing the identification sample to obtain an auditing result of the identification sample.
Optionally, the identification module is further configured to:
after the identification sample is audited, judging whether the audit result of the identification sample exists in the identification results of the identification sample of different first identification models; if the identification result is different from the auditing result, determining a first identification model with the identification result different from the auditing result as a target identification model; if not, all of the first recognition models are determined to be target recognition models.
Optionally, the preset number is greater than or equal to 3.
Preferably, the preset number is greater than or equal to 5.
Optionally, the training module divides the training sample set into a preset number of sub-sample sets, specifically:
and averagely dividing the training sample set into a preset number of sub-sample sets, wherein the difference between the number of samples in any two sub-sample sets is less than or equal to 1.
Optionally, the identification module obtains an identification sample set used for the test, specifically:
and acquiring part of the labeled samples from the labeled samples needing to be examined to form an identification sample set used for testing.
Optionally, the auditing module performs annotation information auditing on the annotated samples in the sub-sample set corresponding to the target identification model, specifically:
and sending the labeled samples in the sub-sample set corresponding to the target identification model to a verification client so that the verification client can perform labeled information verification on the received labeled samples.
Optionally, the verification client is a client that audits the received marked sample through a second recognition model established through pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold; or
The verification client is a client for carrying out manual examination on the received labeling sample.
In a third aspect, the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
the memory is used for storing a computer program;
the processor is configured to implement the steps of the method for auditing sample labeling information according to the first aspect when executing the program stored in the memory.
In a fourth aspect, the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for auditing the sample annotation information according to the first aspect.
Compared with the prior art, the invention first acquires the labeled samples to be audited and forms a training sample set; then divides the training sample set into a preset number of sub-sample sets, trains the different sub-sample sets respectively, and establishes different first recognition models; then acquires a recognition sample set used for testing, recognizes each recognition sample in the recognition sample set with each of the established first recognition models to obtain each first recognition model's recognition result for the recognition sample, counts the number of occurrences of each recognition result and, when a recognition result whose number of occurrences is not smaller than a preset threshold exists, determines the first recognition models corresponding to recognition results whose number of occurrences is smaller than the preset threshold as target recognition models; and finally audits the labeling information of the labeled samples in the sub-sample sets corresponding to the determined target recognition models. Compared with manually auditing the labeling information of all samples in a sample set as in the prior art, the invention enables quick auditing of sample labeling information and reduces time and labor costs; meanwhile, dividing the training sample set into a plurality of sub-sample sets and training a plurality of first recognition models makes the scheme particularly suitable for scenarios in which the training sample set contains a large number of labeled samples to be audited.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of an auditing method for sample annotation information according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of an apparatus for auditing sample annotation information according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes a method, an apparatus, an electronic device, and a computer-readable storage medium for auditing sample annotation information according to embodiments of the present invention in further detail with reference to the accompanying drawings. The advantages and features of the present invention will become more fully apparent from the appended claims and the following description. It is to be noted that the drawings are in a very simplified form and are not to a precise scale, and are provided merely to facilitate a convenient and clear description of the embodiments of the present invention.
In order to solve the problems in the prior art, embodiments of the present invention provide an auditing method and apparatus for sample annotation information, an electronic device, and a computer-readable storage medium.
It should be noted that the method for auditing sample annotation information according to the embodiment of the present invention can be applied to an apparatus for auditing sample annotation information according to the embodiment of the present invention, and the apparatus for auditing sample annotation information can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, and the like, and the mobile terminal may be a hardware device having various operating systems, such as a mobile phone and a tablet computer.
Fig. 1 is a schematic flowchart of an auditing method for sample annotation information according to an embodiment of the present invention. Referring to fig. 1, an auditing method for sample annotation information may include the following steps:
step S101, obtaining a labeled sample needing to be examined and forming a training sample set; and the labeling sample is labeled with labeling information in advance.
The labeled samples to be audited may be samples manually identified and labeled through a client, or samples automatically identified and labeled by a recognition model established through pre-training; this embodiment does not limit how the samples are labeled.
The labeled samples may be pictures of various types of objects, such as test papers, animals and plants, scenic spots, vehicles, human faces or other body parts, articles, bills, and the like. Taking a test paper as an example, the process of labeling a test paper sample may be: identifying the regions of all questions on the test paper with a region recognition model, segmenting the regions into region sample pictures, recognizing the text content of each region sample picture with a character recognition model, and labeling it accordingly.
The type of the labeled sample is not limited in this embodiment, but the types of the labeled samples constituting the same training sample set must be the same, and the types of the labeling information of the labeled samples must also be the same. For example, each labeled sample forming the training sample set a is a picture containing characters, and the labeled information is the content of the characters on the picture. For another example, all the labeled samples forming the training sample set B are face images, and the labeled information is gender. For another example, all the labeled samples forming the training sample set C are face images, and the labeled information is age. In practical applications, for the training sample sets B and C, the samples in the two sample sets may be the same, but because the type of the labeled information is different, two different training sample sets are formed.
Step S102, dividing the training sample set into a preset number of sub-sample sets, respectively training different sub-sample sets, and establishing different first recognition models; the first recognition model is a neural network-based model.
In this embodiment, the labeled samples in the training sample set may be distributed among the preset number of sub-sample sets as evenly as possible: after the samples are divided equally, the remaining samples are distributed one at a time to successive sub-sample sets until all samples have been allocated.
Specifically, the training sample set is equally divided into a preset number of sub-sample sets, and the difference between the numbers of samples in any two sub-sample sets is less than or equal to 1. For example, the number of samples in the training sample set is 1002, and the number of the sub-sample sets is set to 10, then according to the above allocation principle, 1000 samples are first equally allocated to 10 sub-sample sets, and then the remaining 2 samples are allocated to 2 of the sub-sample sets, so that the difference between the numbers of samples in any two sub-sample sets does not exceed 1.
This allocation keeps the number of samples in each sub-sample set approximately the same, so that when the sub-sample sets are each trained to establish the first recognition models, the recognition accuracies of the resulting models do not differ merely because of differences in the number of training samples.
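To make the allocation concrete, the following is a minimal Python sketch of the even split described above. It is illustrative only and not part of the disclosure; the function name split_into_subsets and the shuffling step are assumptions.

```python
import random

def split_into_subsets(samples, preset_number):
    """Split a list of labeled samples into `preset_number` sub-sample sets
    whose sizes differ by at most 1 (illustrative sketch)."""
    shuffled = samples[:]
    random.shuffle(shuffled)                      # avoid ordering bias
    base, remainder = divmod(len(shuffled), preset_number)
    subsets, start = [], 0
    for i in range(preset_number):
        # the first `remainder` subsets receive one extra sample each
        size = base + (1 if i < remainder else 0)
        subsets.append(shuffled[start:start + size])
        start += size
    return subsets

# 1002 samples split into 10 subsets: two subsets of 101 samples, eight of 100
subsets = split_into_subsets(list(range(1002)), 10)
assert max(map(len, subsets)) - min(map(len, subsets)) <= 1
```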
As will be appreciated by those skilled in the art, a first recognition model is trained based on the samples in its sub-sample set and the labeling information of each sample. If the sample types differ, or the sample types are the same but the types of labeling information differ, the first recognition models established by training will differ. For example, if the sample pictures contain characters and the labeling information is the character content, the first recognition model established by training is a character recognition model. If the samples are face images and the labeling information is gender, the trained first recognition model recognizes the gender of the person in a face image; if the samples are face images and the labeling information is age, the trained first recognition model recognizes the person's age. As noted in step S101, the labeled samples in a training sample set are of the same type and carry the same type of labeling information, so the first recognition models established by training are of the same type.
Each first recognition model established by training is a model based on a neural network, and further can be a deep convolutional neural network or other neural network models, such as R-CNN, Fast R-CNN, SPP-net, R-FCN, FPN, YOLO, SSD, DenseBox, RetinaNet, and RRC detection combined with RNN algorithm, Deformable CNN combined with DPM, and the like.
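As an illustration of training one first recognition model per sub-sample set, the sketch below uses scikit-learn's MLPClassifier as a simple neural-network stand-in; the patent itself contemplates deep convolutional detectors such as those listed above, so this substitution is an assumption made purely to keep the example short and runnable.

```python
from sklearn.neural_network import MLPClassifier

def train_first_models(subsets):
    """Train one neural-network classifier per sub-sample set (illustrative sketch).
    Each subset is a list of (feature_vector, label) pairs."""
    models = []
    for subset in subsets:
        X = [features for features, _ in subset]
        y = [label for _, label in subset]
        # MLPClassifier stands in for the CNN-based first recognition models
        model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
        model.fit(X, y)
        models.append(model)
    return models
```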
In general, the preset number may be set to 3 or more, and preferably, the number is set to 5 or more. The preset number can be determined according to the number of the labeled samples in the training sample set in practical application.
Step S103, obtaining an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models to obtain an identification result of each first identification model to the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists.
In this embodiment, some of the labeled samples to be audited may be taken to form the recognition sample set used for testing. For example, labeled samples may be randomly extracted from the training sample set at a ratio of 5% to 20%, and the ratio of the number of samples in the recognition sample set to the total number of samples in the training sample set may be adjusted according to the audit results.
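One possible way to draw the recognition sample set, assuming the labeled samples are held in a Python list; the 10% default ratio is only an example within the 5%-20% range mentioned above.

```python
import random

def sample_recognition_set(labeled_samples, ratio=0.1):
    """Randomly extract part of the labeled samples (e.g. 5%-20%) to serve as
    the recognition sample set used for testing."""
    k = max(1, int(len(labeled_samples) * ratio))
    return random.sample(labeled_samples, k)
```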
Each recognition sample is recognized by each of the different first recognition models established in step S102, yielding each first recognition model's recognition result for that sample. For example, suppose 10 first recognition models were established in step S102; the recognition result of each model for three recognition samples X, Y, and Z is shown in Table 1 below:
Table 1

First recognition model | Recognition sample X | Recognition sample Y | Recognition sample Z |
---|---|---|---|
Model 1 | A | A | A |
Model 2 | B | B | A |
Model 3 | C | C | A |
Model 4 | A | A | B |
Model 5 | A | A | B |
Model 6 | A | A | C |
Model 7 | A | A | C |
Model 8 | A | C | C |
Model 9 | A | C | D |
Model 10 | C | C | D |
Counting the occurrences of each recognition result for recognition sample X gives: result A appears 7 times, result B appears 1 time, and result C appears 2 times. If the preset threshold is set to 4, then because result A occurs most often and its 7 occurrences exceed the threshold, result A can be regarded as the correct recognition result for sample X. That is, the first recognition models whose result is A can correctly recognize sample X, while the first recognition models whose result is B or C (models 2, 3, and 10) cannot; the labeling information of the labeled samples in the sub-sample sets used to train models 2, 3, and 10 may therefore be inaccurate, leading to the lower recognition accuracy of these three models. Accordingly, in the processing of recognition sample X, models 2, 3, and 10 are determined as target recognition models.
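The counting and selection just described for recognition sample X can be sketched as follows. The model names, the threshold of 4, and the use of a Python dictionary are assumptions for illustration, and this sketch covers only the basic rule (a single result reaching the threshold).

```python
from collections import Counter

def target_models_for_sample(results, threshold):
    """`results` maps model name -> recognition result for one recognition sample.
    If some result occurs at least `threshold` times, models whose result
    occurred fewer than `threshold` times are flagged as target models."""
    counts = Counter(results.values())
    if max(counts.values()) < threshold:
        return None  # handled separately below: the sample itself must be audited
    return [m for m, r in results.items() if counts[r] < threshold]

# recognition results for sample X from Table 1
results_x = {"model1": "A", "model2": "B", "model3": "C", "model4": "A",
             "model5": "A", "model6": "A", "model7": "A", "model8": "A",
             "model9": "A", "model10": "C"}
print(target_models_for_sample(results_x, threshold=4))  # ['model2', 'model3', 'model10']
```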
Further, when there are multiple recognition results whose occurrence counts are not less than the preset threshold, the first recognition models corresponding to the target recognition results can also be determined as target recognition models. A target recognition result is any of those recognition results other than the one with the largest occurrence count; that is, its occurrence count is not less than the preset threshold but is not the largest.
For example, among the recognition results of the 10 first recognition models for recognition sample Y in Table 1, result A appears 5 times, result B appears 1 time, and result C appears 4 times. If the preset threshold is set to 4, the occurrence counts of results A and C both reach the threshold, but A occurs more often than C, so result A is more likely to be the correct recognition result for sample Y, and the first recognition models whose result is C have lower recognition accuracy; the labeling information of the labeled samples in the sub-sample sets corresponding to those models (models 3, 8, 9, and 10) may therefore be inaccurate. Accordingly, in the processing of recognition sample Y, not only model 2 but also models 3, 8, 9, and 10 may be determined as target recognition models.
In practical applications, if after counting it is found that the occurrence count of every recognition result is smaller than the preset threshold, the recognition sample itself can be audited to obtain an audit result for that sample. It is then judged whether the audit result appears among the recognition results that the different first recognition models produced for the sample: if it does, the first recognition models whose recognition result differs from the audit result are determined as target recognition models; if it does not, all of the first recognition models are determined as target recognition models.
For example, among the recognition results of the 10 first recognition models for recognition sample Z in Table 1, result A appears 3 times, result B appears 2 times, result C appears 3 times, and result D appears 2 times. If the preset threshold is set to 4, every recognition result occurs fewer times than the threshold, indicating that none of the first recognition models can be relied on to correctly recognize sample Z; sample Z therefore needs to be audited to obtain its audit result. If the audit result of sample Z is A, the first recognition models whose result is A are considered able to correctly recognize sample Z, while those whose result is B, C, or D (models 4 to 10) are not, so models 4 to 10 are determined as target recognition models. If instead the audit result of sample Z is E, which does not appear among the recognition results of the 10 first recognition models for sample Z, none of the 10 models can correctly recognize sample Z, and all 10 first recognition models are determined as target recognition models.
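Extending the previous sketch, the following covers all three cases discussed above: a single dominant result, multiple results at or above the threshold (where only the most frequent result is kept), and no result reaching the threshold (where the sample's audit result is used). The audit_result argument and the model names are assumptions for illustration only.

```python
from collections import Counter

def target_models(results, threshold, audit_result=None):
    """Full decision for one recognition sample. `results` maps model name ->
    recognition result; `audit_result` is the reviewed ground truth, needed
    only when every result occurs fewer than `threshold` times."""
    counts = Counter(results.values())
    frequent = {r for r, c in counts.items() if c >= threshold}

    if not frequent:
        # all counts below threshold: the sample is audited first, then every
        # model whose result differs from the audit result becomes a target
        if audit_result is None:
            raise ValueError("sample must be audited before deciding")
        return [m for m, r in results.items() if r != audit_result]

    # keep only the most frequent result; models producing any other result
    # (below the threshold, or at the threshold but not the most frequent)
    # are flagged as target recognition models
    best = max(frequent, key=lambda r: counts[r])
    return [m for m, r in results.items() if r != best]

# sample Y from Table 1: A appears 5 times, C appears 4 times, B once
results_y = {"model1": "A", "model2": "B", "model3": "C", "model4": "A",
             "model5": "A", "model6": "A", "model7": "A", "model8": "C",
             "model9": "C", "model10": "C"}
print(target_models(results_y, threshold=4))   # models 2, 3, 8, 9, 10

# sample Z from Table 1: no result reaches the threshold, audit result is "A"
results_z = {"model1": "A", "model2": "A", "model3": "A", "model4": "B",
             "model5": "B", "model6": "C", "model7": "C", "model8": "C",
             "model9": "D", "model10": "D"}
print(target_models(results_z, threshold=4, audit_result="A"))  # models 4 to 10
```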
Step S104, performing annotation information auditing on the annotated samples in the sub-sample sets corresponding to the target recognition models.
In this embodiment, the labeled samples in the sub-sample set corresponding to the target identification model determined in step S103 may be sent to the verification client, so that the verification client performs labeled information auditing on the received labeled samples.
In this embodiment, the training sample set is divided into a plurality of sub-sample sets and a plurality of first recognition models are obtained through training; the first recognition models with low recognition accuracy are then identified as target recognition models, and only the labeled samples in the sub-sample sets corresponding to the target recognition models need to be audited rather than all samples, which improves auditing efficiency.
In one implementation, the verification client audits the received labeled samples through a second recognition model established by pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold. For example, the recognition accuracy of the second recognition model may exceed 99%, ensuring the accuracy of the verification client's audit of the labeling information; the labeled samples can then be automatically recognized and verified by the verification client running the second recognition model, where the second recognition model is of the same type as the first recognition models. Alternatively, the verification client may be a manual client that performs a manual review of the received labeled samples.
To sum up, in this embodiment, the labeled samples to be audited are first acquired and form a training sample set; the training sample set is then divided into a preset number of sub-sample sets, the different sub-sample sets are trained respectively, and different first recognition models are established; a recognition sample set used for testing is then acquired, each recognition sample in it is recognized by each of the established first recognition models to obtain each model's recognition result for the sample, the occurrences of each recognition result are counted and, when a recognition result whose occurrence count is not smaller than the preset threshold exists, the first recognition models corresponding to recognition results whose occurrence counts are smaller than the preset threshold are determined as target recognition models; finally, the labeling information of the labeled samples in the sub-sample sets corresponding to the determined target recognition models is audited. Compared with manually auditing the labeling information of all samples in a sample set as in the prior art, this embodiment enables quick auditing of sample labeling information and reduces time and labor costs; meanwhile, dividing the training sample set into a plurality of sub-sample sets and training a plurality of first recognition models makes the scheme particularly suitable for scenarios in which the training sample set contains a large number of labeled samples to be audited.
Corresponding to the embodiment of the method for auditing the sample annotation information, an embodiment of the present invention further provides an auditing apparatus for sample annotation information, and fig. 2 is a schematic structural diagram of an auditing apparatus for sample annotation information according to an embodiment of the present invention. Referring to fig. 2, an apparatus for auditing sample annotation information may include:
an obtaining module 201, configured to obtain labeled samples that need to be examined and form a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
the training module 202 is configured to divide the training sample set into a preset number of sub-sample sets, train different sub-sample sets, and establish different first recognition models; the first recognition model is a neural network-based model;
the identification module 203 is configured to obtain an identification sample set used for testing, identify each identification sample in the identification sample set through different established first identification models, obtain an identification result of each first identification model for the identification sample, count the occurrence frequency of each identification result, and determine, as a target identification model, a first identification model corresponding to an identification result whose occurrence frequency is smaller than a preset threshold value when there is an identification result whose occurrence frequency is not smaller than the preset threshold value;
and the auditing module 204 is configured to perform annotation information auditing on the annotated samples in the sub-sample set corresponding to the target identification model.
Optionally, the identifying module 203 is further configured to:
when a plurality of recognition results with the occurrence times not less than a preset threshold exist, determining a first recognition model corresponding to the target recognition result as a target recognition model;
and the target recognition result is a recognition result except the recognition result with the largest occurrence frequency in the plurality of recognition results.
Optionally, the identifying module 203 is further configured to:
and when the occurrence frequency of each identification result is smaller than the preset threshold value, auditing the identification sample to obtain an auditing result of the identification sample.
Optionally, the identifying module 203 is further configured to:
after the identification sample is audited, judging whether the audit result of the identification sample exists in the identification results of the identification sample of different first identification models; if the identification result is different from the auditing result, determining a first identification model with the identification result different from the auditing result as a target identification model; if not, all of the first recognition models are determined to be target recognition models.
Optionally, the preset number is greater than or equal to 3.
Preferably, the preset number is greater than or equal to 5.
Optionally, the training module divides the training sample set into a preset number of sub-sample sets, specifically:
and averagely dividing the training sample set into a preset number of sub-sample sets, wherein the difference between the number of samples in any two sub-sample sets is less than or equal to 1.
Optionally, the identification module 203 acquires an identification sample set used for the test, specifically:
and acquiring part of the labeled samples from the labeled samples needing to be examined to form an identification sample set used for testing.
Optionally, the auditing module 204 performs annotation information auditing on the annotated samples in the sub-sample set corresponding to the target recognition model, specifically:
and sending the labeled samples in the sub-sample set corresponding to the target identification model to a verification client so that the verification client can perform labeled information verification on the received labeled samples.
Optionally, the verification client is a client that audits the received marked sample through a second recognition model established through pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold; or
The verification client is a client for carrying out manual examination on the received labeling sample.
To sum up, in this embodiment, the labeled samples to be audited are first acquired and form a training sample set; the training sample set is then divided into a preset number of sub-sample sets, the different sub-sample sets are trained respectively, and different first recognition models are established; a recognition sample set used for testing is then acquired, each recognition sample in it is recognized by each of the established first recognition models to obtain each model's recognition result for the sample, the occurrences of each recognition result are counted and, when a recognition result whose occurrence count is not smaller than the preset threshold exists, the first recognition models corresponding to recognition results whose occurrence counts are smaller than the preset threshold are determined as target recognition models; finally, the labeling information of the labeled samples in the sub-sample sets corresponding to the determined target recognition models is audited. Compared with manually auditing the labeling information of all samples in a sample set as in the prior art, this embodiment enables quick auditing of sample labeling information and reduces time and labor costs; meanwhile, dividing the training sample set into a plurality of sub-sample sets and training a plurality of first recognition models makes the scheme particularly suitable for scenarios in which the training sample set contains a large number of labeled samples to be audited.
An embodiment of the present invention further provides an electronic device, and fig. 3 is a schematic structural diagram of the electronic device according to the embodiment of the present invention. Referring to fig. 3, an electronic device includes a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with each other via the communication bus 304,
a memory 303 for storing a computer program;
the processor 301, when executing the program stored in the memory 303, implements the following steps:
acquiring a labeled sample to be examined and forming a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
dividing the training sample set into a preset number of sub-sample sets, respectively training different sub-sample sets, and establishing different first recognition models; the first recognition model is a neural network-based model;
acquiring an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models to obtain an identification result of each first identification model on the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists;
and performing annotation information examination on the annotated samples in the sub-sample set corresponding to the target identification model.
For specific implementation and related explanation of each step of the method, reference may be made to the method embodiment shown in fig. 1, which is not described herein again.
Compared with the mode of manually auditing the labeling information of all samples in a sample set in the prior art, the embodiment can realize the quick auditing of the labeling information of the samples, and reduce the time and labor cost; meanwhile, the training sample set is divided into a plurality of sub-sample sets and a plurality of first recognition models are trained and established, and the method is particularly suitable for a scene that the training sample set contains a large number of marked samples needing to be audited.
In addition, other implementation manners of the method for auditing the sample labeling information, which are realized by the processor 301 executing the program stored in the memory 303, are the same as the implementation manners mentioned in the foregoing method embodiment section, and are not described herein again.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the method steps of the method for auditing the sample annotation information.
Compared with the mode of manually auditing the labeling information of all samples in a sample set in the prior art, the embodiment can realize the quick auditing of the labeling information of the samples, and reduce the time and labor cost; meanwhile, the training sample set is divided into a plurality of sub-sample sets and a plurality of first recognition models are trained and established, and the method is particularly suitable for a scene that the training sample set contains a large number of marked samples needing to be audited.
It should be noted that, in the present specification, all the embodiments are described in a related manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, and the computer-readable storage medium, since they are substantially similar to the embodiments of the method, the description is simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the purpose of describing the preferred embodiments of the present invention, and is not intended to limit the scope of the present invention, and any variations and modifications made by those skilled in the art based on the above disclosure are within the scope of the appended claims.
Claims (20)
1. A method for auditing sample labeling information is characterized by comprising the following steps:
acquiring a labeled sample to be examined and forming a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
dividing the training sample set into a preset number of sub-sample sets, respectively training different sub-sample sets, and establishing different first recognition models; the first recognition model is a neural network-based model;
acquiring an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models to obtain an identification result of each first identification model on the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists;
and performing annotation information examination on the annotated samples in the sub-sample set corresponding to the target identification model.
2. The method for auditing sample labeling information according to claim 1, when there are a plurality of recognition results whose occurrence number is not less than a preset threshold, the method further comprising:
determining a first recognition model corresponding to the target recognition result as a target recognition model;
and the target recognition result is a recognition result except the recognition result with the largest occurrence frequency in the plurality of recognition results.
3. The method for auditing sample annotation information according to claim 1, when the number of occurrences of each recognition result is less than the preset threshold, the method further comprising:
and auditing the identification sample to obtain an auditing result of the identification sample.
4. The method for auditing of sample annotation information of claim 3, wherein after auditing the identified sample, the method further comprises:
judging whether the auditing result of the identification sample exists in the identification results of different first identification models for the identification sample;
if the identification result is different from the auditing result, determining a first identification model with the identification result different from the auditing result as a target identification model;
if not, all of the first recognition models are determined to be target recognition models.
5. An auditing method for sample annotation information according to claim 1, in which the predetermined number is 3 or more.
6. An auditing method for sample annotation information according to claim 5, in which the predetermined number is greater than or equal to 5.
7. The method for auditing of sample labeling information of claim 1, where dividing the training sample set into a preset number of sub-sample sets comprises:
and averagely dividing the training sample set into a preset number of sub-sample sets, wherein the difference between the number of samples in any two sub-sample sets is less than or equal to 1.
8. The method for auditing sample labeling information according to claim 1, wherein the obtaining of the identified sample set for use as a test includes:
and acquiring part of the labeled samples from the labeled samples needing to be examined to form an identification sample set used for testing.
9. The method for auditing of sample annotation information according to claim 1, wherein the auditing of annotation information for annotated samples in a set of subsamples corresponding to the target recognition model comprises:
and sending the labeled samples in the sub-sample set corresponding to the target identification model to a verification client so that the verification client can perform labeled information verification on the received labeled samples.
10. The method for auditing sample annotation information of claim 9, wherein the verification client is a client that audits a received annotated sample through a second recognition model established by pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold; or
The verification client is a client for carrying out manual examination on the received labeling sample.
11. An auditing device for sample labeling information, the device comprising:
the acquisition module is used for acquiring the marked samples needing to be audited and forming a training sample set; wherein, the labeling sample is labeled with labeling information in advance;
the training module is used for dividing the training sample set into a preset number of sub sample sets, respectively training different sub sample sets and establishing different first recognition models; the first recognition model is a neural network-based model;
the identification module is used for acquiring an identification sample set used for testing, identifying each identification sample in the identification sample set through different established first identification models respectively to obtain an identification result of each first identification model on the identification sample, counting the occurrence frequency of each identification result, and determining the first identification model corresponding to the identification result with the occurrence frequency smaller than a preset threshold value as a target identification model when the identification result with the occurrence frequency not smaller than the preset threshold value exists;
and the auditing module is used for auditing the labeling information of the labeling samples in the sub-sample set corresponding to the target identification model.
12. An auditing apparatus for sample annotation information according to claim 11, wherein the identification module is further configured to:
when a plurality of recognition results with the occurrence times not less than a preset threshold exist, determining a first recognition model corresponding to the target recognition result as a target recognition model;
and the target recognition result is a recognition result except the recognition result with the largest occurrence frequency in the plurality of recognition results.
13. An auditing apparatus for sample annotation information according to claim 11, wherein the identification module is further configured to:
and when the occurrence frequency of each identification result is smaller than the preset threshold value, auditing the identification sample to obtain an auditing result of the identification sample.
14. An auditing apparatus for sample annotation information according to claim 13, wherein the identification module is further configured to:
after the identification sample is audited, judging whether the audit result of the identification sample exists in the identification results of the identification sample of different first identification models; if the identification result is different from the auditing result, determining a first identification model with the identification result different from the auditing result as a target identification model; if not, all of the first recognition models are determined to be target recognition models.
15. An auditing device for sample annotation information according to claim 11, in which the predetermined number is 3 or more.
16. The apparatus for auditing of sample labeling information according to claim 11, wherein the identification module obtains an identification sample set used for testing, specifically:
and acquiring part of the labeled samples from the labeled samples needing to be examined to form an identification sample set used for testing.
17. The apparatus for auditing of sample annotation information according to claim 11, wherein the auditing module performs annotation information auditing for annotated samples in the sub-sample set corresponding to the target identification model, specifically:
and sending the labeled samples in the sub-sample set corresponding to the target identification model to a verification client so that the verification client can perform labeled information verification on the received labeled samples.
18. The apparatus for auditing of sample annotation information of claim 17, wherein the verification client is a client that audits a received annotated sample through a second recognition model established by pre-training, and the recognition accuracy of the second recognition model is higher than a certain threshold; or
The verification client is a client for carrying out manual examination on the received labeling sample.
19. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor, when executing the program stored in the memory, implementing the method steps of any of claims 1-10.
20. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910538177.XA CN110222791B (en) | 2019-06-20 | 2019-06-20 | Sample labeling information auditing method and device |
PCT/CN2020/095978 WO2020253636A1 (en) | 2019-06-20 | 2020-06-12 | Sample label information verification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910538177.XA CN110222791B (en) | 2019-06-20 | 2019-06-20 | Sample labeling information auditing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110222791A CN110222791A (en) | 2019-09-10 |
CN110222791B true CN110222791B (en) | 2020-12-04 |
Family
ID=67814089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910538177.XA Active CN110222791B (en) | 2019-06-20 | 2019-06-20 | Sample labeling information auditing method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110222791B (en) |
WO (1) | WO2020253636A1 (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222791B (en) * | 2019-06-20 | 2020-12-04 | 杭州睿琪软件有限公司 | Sample labeling information auditing method and device |
CN110705257B (en) * | 2019-09-16 | 2021-06-25 | 腾讯科技(深圳)有限公司 | Media resource identification method and device, storage medium and electronic device |
CN110852376B (en) * | 2019-11-11 | 2023-05-26 | 杭州睿琪软件有限公司 | Method and system for identifying biological species |
CN111160188A (en) * | 2019-12-20 | 2020-05-15 | 中国建设银行股份有限公司 | Financial bill identification method, device, equipment and storage medium |
CN111259980B (en) * | 2020-02-10 | 2023-10-03 | 北京小马慧行科技有限公司 | Method and device for processing annotation data |
CN112070224B (en) * | 2020-08-26 | 2024-02-23 | 成都品果科技有限公司 | Revision system and method of samples for neural network training |
CN112328822B (en) * | 2020-10-15 | 2024-04-02 | 深圳市优必选科技股份有限公司 | Picture pre-marking method and device and terminal equipment |
CN113034025B (en) * | 2021-04-08 | 2023-12-01 | 成都国星宇航科技股份有限公司 | Remote sensing image labeling system and method |
CN113610047A (en) * | 2021-08-24 | 2021-11-05 | 上海发网供应链管理有限公司 | Object detection-based identification method and system for production line articles |
CN113839953A (en) * | 2021-09-27 | 2021-12-24 | 上海商汤科技开发有限公司 | Labeling method and device, electronic equipment and storage medium |
CN114189709A (en) * | 2021-11-12 | 2022-03-15 | 北京天眼查科技有限公司 | Method and device for auditing video, storage medium and electronic equipment |
CN114240101A (en) * | 2021-12-02 | 2022-03-25 | 支付宝(杭州)信息技术有限公司 | Risk identification model verification method, device and equipment |
CN114219501B (en) * | 2022-02-22 | 2022-06-28 | 杭州衡泰技术股份有限公司 | Sample labeling resource allocation method, device and application |
CN114972846A (en) * | 2022-04-29 | 2022-08-30 | 上海深至信息科技有限公司 | Ultrasonic image annotation system |
CN114861820A (en) * | 2022-05-27 | 2022-08-05 | 北京百度网讯科技有限公司 | Sample data screening method, model training device and electronic equipment |
CN118211681B (en) * | 2024-05-22 | 2024-09-13 | 上海斗象信息科技有限公司 | Labeling sample judging method and device and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359372A (en) * | 2008-09-26 | 2009-02-04 | 腾讯科技(深圳)有限公司 | Training method and device of classifier, and method apparatus for recognising sensitization picture |
CN104751188A (en) * | 2015-04-15 | 2015-07-01 | 爱威科技股份有限公司 | Image processing method and system |
CN108806668A (en) * | 2018-06-08 | 2018-11-13 | 国家计算机网络与信息安全管理中心 | A kind of audio and video various dimensions mark and model optimization method |
CN109583468A (en) * | 2018-10-12 | 2019-04-05 | 阿里巴巴集团控股有限公司 | Training sample acquisition methods, sample predictions method and corresponding intrument |
CN109784391A (en) * | 2019-01-04 | 2019-05-21 | 杭州比智科技有限公司 | Sample mask method and device based on multi-model |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070260342A1 (en) * | 2006-05-08 | 2007-11-08 | Standard Aero Limited | Method for inspection process development or improvement and parts inspection process |
US9390112B1 (en) * | 2013-11-22 | 2016-07-12 | Groupon, Inc. | Automated dynamic data quality assessment |
US11379695B2 (en) * | 2016-10-24 | 2022-07-05 | International Business Machines Corporation | Edge-based adaptive machine learning for object recognition |
CN109446369B (en) * | 2018-09-28 | 2021-10-08 | 武汉中海庭数据技术有限公司 | Interaction method and system for semi-automatic image annotation |
CN109284784A (en) * | 2018-09-29 | 2019-01-29 | 北京数美时代科技有限公司 | A kind of content auditing model training method and device for live scene video |
CN110222791B (en) * | 2019-06-20 | 2020-12-04 | 杭州睿琪软件有限公司 | Sample labeling information auditing method and device |
- 2019-06-20: CN application CN201910538177.XA filed; granted as CN110222791B (status: Active)
- 2020-06-12: PCT application PCT/CN2020/095978 filed; published as WO2020253636A1 (status: Application Filing)
Also Published As
Publication number | Publication date |
---|---|
CN110222791A (en) | 2019-09-10 |
WO2020253636A1 (en) | 2020-12-24 |
Similar Documents
Publication | Title
---|---
CN110222791B (en) | Sample labeling information auditing method and device
CN110163300B (en) | Image classification method and device, electronic equipment and storage medium
CN110222170B (en) | Method, device, storage medium and computer equipment for identifying sensitive data
CN109145299B (en) | Text similarity determination method, device, equipment and storage medium
CN110245087B (en) | State checking method and device of manual client for sample auditing
CN110263853B (en) | Method and device for checking state of manual client by using error sample
CN107491536B (en) | Test question checking method, test question checking device and electronic equipment
CN109189895B (en) | Question correcting method and device for oral calculation questions
CN112434556A (en) | Pet nose print recognition method and device, computer equipment and storage medium
US11721229B2 (en) | Question correction method, device, electronic equipment and storage medium for oral calculation questions
CN113688630A (en) | Text content auditing method and device, computer equipment and storage medium
CN114241505B (en) | Method and device for extracting chemical structure image, storage medium and electronic equipment
CN109284700B (en) | Method, storage medium, device and system for detecting multiple faces in image
CN109697267B (en) | CMS (content management system) identification method and device
CN113723467A (en) | Sample collection method, device and equipment for defect detection
CN112434717A (en) | Model training method and device
CN111767390A (en) | Skill word evaluation method and device, electronic equipment and computer readable medium
CN115082659A (en) | Image annotation method and device, electronic equipment and storage medium
CN114463345A (en) | Multi-parameter mammary gland magnetic resonance image segmentation method based on dynamic self-adaptive network
CN114494863A (en) | Animal cub counting method and device based on Blend Mask algorithm
CN111611781A (en) | Data labeling method, question answering method, device and electronic equipment
CN108491451B (en) | English reading article recommendation method and device, electronic equipment and storage medium
CN115830598A (en) | Tracing confirmation method, system, equipment and medium for standard equipment
CN112699886B (en) | Character recognition method and device and electronic equipment
CN114065858A (en) | Model training method and device, electronic equipment and storage medium
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| 2022-04-29 | TR01 | Transfer of patent right | Effective date of registration: 2022-04-29. Patentee before: HANGZHOU GLORITY SOFTWARE Ltd., Room B2019, 2nd floor, building 1 (North), 368 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province, 310053. Patentee after: Hangzhou Ruisheng Software Co.,Ltd., room d3189, North third floor, building 1, 368 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province, 310053. |