CN110222791A - Sample labeling information auditing method and device - Google Patents


Info

Publication number
CN110222791A
Authority
CN
China
Prior art keywords
sample
identification
model
recognition
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910538177.XA
Other languages
Chinese (zh)
Other versions
CN110222791B (en
Inventor
徐青松
李青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Ruisheng Software Co Ltd
Original Assignee
Hangzhou Glority Software Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Glority Software Ltd
Priority to CN201910538177.XA (CN110222791B)
Publication of CN110222791A
Priority to PCT/CN2020/095978 (WO2020253636A1)
Application granted
Publication of CN110222791B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for auditing sample annotation information. The method comprises: obtaining annotated samples to be audited and forming a training sample set; dividing the training sample set into a preset quantity of sub-sample sets, training on the different sub-sample sets separately, and establishing different first recognition models; obtaining a recognition sample set used for testing, recognizing each recognition sample in the set with each of the established first recognition models to obtain each model's recognition result for that sample, counting the occurrences of each recognition result, and, when there is a recognition result whose occurrence count is not less than a preset threshold, determining the first recognition models corresponding to recognition results whose occurrence count is less than the preset threshold to be target recognition models; and auditing the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models. With this scheme, the annotation results of samples can be audited quickly.

Description

Method and device for auditing sample annotation information
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a method and device for auditing sample annotation information, an electronic device, and a computer-readable storage medium.
Background
Model training in the field of artificial intelligence requires annotated samples. For example, samples may be annotated manually, or a pre-established neural network recognition model may recognize and annotate them automatically. To ensure the accuracy of model training, the annotation information of the samples must also be audited for accuracy.
At present, the annotation information of all annotated samples is usually audited manually. However, because a sample set typically contains a large number of samples, auditing the annotation information in this way costs considerable time and labor.
Summary of the invention
The object of the present invention is to provide a method and device for auditing sample annotation information, an electronic device, and a computer-readable storage medium, so that the annotation information of samples can be audited quickly. The specific technical solution is as follows:
In a first aspect, the present invention provides a method for auditing sample annotation information, the method comprising:
obtaining annotated samples to be audited and forming a training sample set from them, wherein each annotated sample has been annotated with annotation information in advance;
dividing the training sample set into a preset quantity of sub-sample sets, training on the different sub-sample sets separately, and establishing different first recognition models, wherein each first recognition model is a neural-network-based model;
obtaining a recognition sample set to be used for testing; for each recognition sample in the recognition sample set, recognizing the sample with each of the established first recognition models to obtain each first recognition model's recognition result for that sample; counting the number of occurrences of each recognition result; and, when there is a recognition result whose occurrence count is not less than a preset threshold, determining the first recognition models corresponding to recognition results whose occurrence count is less than the preset threshold to be target recognition models;
auditing the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models.
Optionally, when there are multiple recognition results whose occurrence count is not less than the preset threshold, the method further comprises:
determining the first recognition models corresponding to a target recognition result to be target recognition models,
wherein the target recognition result is any of the multiple recognition results other than the one with the highest occurrence count.
Optionally, when the occurrence count of every recognition result is less than the preset threshold, the method further comprises:
auditing the recognition sample to obtain an audit result for the recognition sample.
Optionally, after the recognition sample is audited, the method further comprises:
determining whether the audit result of the recognition sample is among the recognition results that the different first recognition models produced for that sample;
if so, determining the first recognition models whose recognition result differs from the audit result to be target recognition models;
if not, determining all of the first recognition models to be target recognition models.
Optionally, the preset quantity is greater than or equal to 3.
Preferably, the preset quantity is greater than or equal to 5.
Optionally, dividing the training sample set into the preset quantity of sub-sample sets comprises:
dividing the training sample set evenly into the preset quantity of sub-sample sets, such that the numbers of samples in any two sub-sample sets differ by at most 1.
Optionally, obtaining the recognition sample set to be used for testing comprises:
selecting some of the annotated samples to be audited to form the recognition sample set used for testing.
Optionally, auditing the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models comprises:
sending the annotated samples in the sub-sample sets corresponding to the target recognition models to a verification client, so that the verification client audits the annotation information of the received annotated samples.
Optionally, the verification client is a client that audits the received annotated samples using a second recognition model established by training in advance, the recognition accuracy of the second recognition model being higher than a certain threshold; or
the verification client is a client at which the received annotated samples are audited manually.
In a second aspect, the present invention further provides a device for auditing sample annotation information, the device comprising:
an obtaining module, configured to obtain annotated samples to be audited and form a training sample set from them, wherein each annotated sample has been annotated with annotation information in advance;
a training module, configured to divide the training sample set into a preset quantity of sub-sample sets, train on the different sub-sample sets separately, and establish different first recognition models, wherein each first recognition model is a neural-network-based model;
a recognition module, configured to obtain a recognition sample set to be used for testing; for each recognition sample in the recognition sample set, recognize the sample with each of the established first recognition models to obtain each first recognition model's recognition result for that sample; count the number of occurrences of each recognition result; and, when there is a recognition result whose occurrence count is not less than a preset threshold, determine the first recognition models corresponding to recognition results whose occurrence count is less than the preset threshold to be target recognition models;
an audit module, configured to audit the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models.
Optionally, the recognition module is further configured to:
when there are multiple recognition results whose occurrence count is not less than the preset threshold, determine the first recognition models corresponding to a target recognition result to be target recognition models,
wherein the target recognition result is any of the multiple recognition results other than the one with the highest occurrence count.
Optionally, the recognition module is further configured to:
when the occurrence count of every recognition result is less than the preset threshold, audit the recognition sample to obtain an audit result for the recognition sample.
Optionally, the recognition module is further configured to:
after the recognition sample is audited, determine whether the audit result of the recognition sample is among the recognition results that the different first recognition models produced for that sample; if so, determine the first recognition models whose recognition result differs from the audit result to be target recognition models; if not, determine all of the first recognition models to be target recognition models.
Optionally, the preset quantity is greater than or equal to 3.
Preferably, the preset quantity is greater than or equal to 5.
Optionally, the training module divides the training sample set into the preset quantity of sub-sample sets specifically by:
dividing the training sample set evenly into the preset quantity of sub-sample sets, such that the numbers of samples in any two sub-sample sets differ by at most 1.
Optionally, the recognition module obtains the recognition sample set to be used for testing specifically by:
selecting some of the annotated samples to be audited to form the recognition sample set used for testing.
Optionally, the audit module audits the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models specifically by:
sending the annotated samples in the sub-sample sets corresponding to the target recognition models to a verification client, so that the verification client audits the annotation information of the received annotated samples.
Optionally, the verification client is a client that audits the received annotated samples using a second recognition model established by training in advance, the recognition accuracy of the second recognition model being higher than a certain threshold; or
the verification client is a client at which the received annotated samples are audited manually.
In a third aspect, the present invention further provides an electronic device comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the steps of the method for auditing sample annotation information described in the first aspect when executing the program stored in the memory.
In a fourth aspect, the present invention further provides a computer-readable storage medium in which a computer program is stored, the computer program implementing the steps of the method for auditing sample annotation information described in the first aspect when executed by a processor.
Compared with the prior art, the present invention first obtains annotated samples to be audited and forms a training sample set, then divides the training sample set into a preset quantity of sub-sample sets, trains on the different sub-sample sets separately, and establishes different first recognition models. It then obtains a recognition sample set to be used for testing and, for each recognition sample in the set, recognizes the sample with each of the established first recognition models to obtain each model's recognition result for that sample. It counts the occurrences of each recognition result and, when there is a recognition result whose occurrence count is not less than a preset threshold, determines the first recognition models corresponding to recognition results whose occurrence count is less than the preset threshold to be target recognition models. Finally, it audits the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models so determined. Whereas the prior art audits the annotation information of every sample in the sample set manually, the present invention audits sample annotation information quickly and reduces time and labor costs. Moreover, because the present invention divides the training sample set into multiple sub-sample sets and trains multiple first recognition models, it is particularly suitable for scenarios in which the training sample set contains a large number of annotated samples that need to be audited.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for auditing sample annotation information provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of the device for auditing sample annotation information provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The method and device for auditing sample annotation information, the electronic device, and the computer-readable storage medium proposed by the present invention are described in further detail below with reference to the drawings and specific embodiments. The advantages and features of the invention will become clearer from the claims and the following description. Note that the drawings are in a highly simplified form and are not drawn to precise scale; they serve only to illustrate the embodiments of the invention conveniently and clearly.
To solve the problems of the prior art, embodiments of the present invention provide a method and device for auditing sample annotation information, an electronic device, and a computer-readable storage medium.
Note that the method for auditing sample annotation information of the embodiments can be applied to the device for auditing sample annotation information of the embodiments, and that the device can be configured on an electronic device. The electronic device may be a personal computer, a mobile terminal, or the like; the mobile terminal may be a hardware device running any of various operating systems, such as a mobile phone or a tablet computer.
Fig. 1 is a flow diagram of a method for auditing sample annotation information provided by an embodiment of the present invention. Referring to Fig. 1, the method may include the following steps:
Step S101: obtain annotated samples to be audited and form a training sample set from them, wherein each annotated sample has been annotated with annotation information in advance.
The annotated samples to be audited may be samples recognized and annotated at a manual client, or samples recognized and annotated automatically by a recognition model established by training in advance; this embodiment does not limit this.
An annotated sample may be a picture of any of various types of object, for example a test paper, an animal or plant, a scenic spot, a vehicle, a face or part of a human body, an article, or a bill. Taking a test paper as an example, the annotation process may be as follows: a region recognition model identifies the region of each question on the test paper, each region is cropped out to form a region sample picture, and a character recognition model then recognizes the character content of each region sample picture and annotates it.
This embodiment does not limit the type of the annotated samples, but all annotated samples forming the same training sample set must be of the same type, and the annotation information of each annotated sample must also be of the same type. For example, each annotated sample in training sample set A is a picture containing characters, and its annotation information is the character content of the picture. As another example, each annotated sample in training sample set B is a face image, and its annotation information is gender. As a further example, each annotated sample in training sample set C is a face image, and its annotation information is age. In practice, training sample sets B and C may contain identical samples, but because their types of annotation information differ, they form two different training sample sets.
Step S102: divide the training sample set into a preset quantity of sub-sample sets, train on the different sub-sample sets separately, and establish different first recognition models; each first recognition model is a neural-network-based model.
In this embodiment, the annotated samples in the training sample set can be distributed as evenly as possible among the preset quantity of sub-sample sets according to the actual number of samples; any remainder is then allocated one sample at a time, in order, to successive sub-sample sets until all samples have been assigned.
That is, the training sample set is divided evenly into the preset quantity of sub-sample sets so that the numbers of samples in any two sub-sample sets differ by at most 1. For example, if the training sample set contains 1002 samples and the number of sub-sample sets is set to 10, then under the above allocation principle 1000 samples are first distributed evenly among the 10 sub-sample sets, and the remaining 2 samples are then allocated to 2 of the sub-sample sets, so that the numbers of samples in any two sub-sample sets differ by no more than 1.
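The allocation described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `split_evenly` and the use of integers as stand-in samples are assumptions made for the example.

```python
def split_evenly(samples, preset_quantity):
    """Split samples into preset_quantity sub-sample sets whose sizes differ by at most 1.

    Hypothetical helper: the first `remainder` sub-sample sets each receive one
    extra sample, mirroring the "allocate the remainder one at a time" rule.
    """
    if preset_quantity <= 0:
        raise ValueError("preset_quantity must be positive")
    base, remainder = divmod(len(samples), preset_quantity)
    subsets = []
    start = 0
    for i in range(preset_quantity):
        size = base + (1 if i < remainder else 0)
        subsets.append(samples[start:start + size])
        start += size
    return subsets


# The example from the text: 1002 samples divided into 10 sub-sample sets.
samples = list(range(1002))  # integers stand in for annotated samples
subsets = split_evenly(samples, 10)
sizes = [len(s) for s in subsets]
print(sizes)  # [101, 101, 100, 100, 100, 100, 100, 100, 100, 100]
```

In practice the samples would probably be shuffled before splitting so that each sub-sample set is representative; the text does not say so, and that step is left out here.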
This allocation makes the number of samples in each sub-sample set roughly the same, so that when the sub-sample sets are trained on separately to establish the multiple first recognition models, the recognition accuracies of the resulting first recognition models do not differ because of differences in the number of training samples.
Those skilled in the art will understand that each first recognition model is established by training on the samples in a sub-sample set and the annotation information of each sample. If the sample types differ, or the sample types are the same but the types of annotation information differ, the first recognition models established by training differ. For example, if the sample pictures contain characters and the annotation information is the character content, the trained first recognition model is a character recognition model. If the samples are face images and the annotation information is gender, the trained first recognition model recognizes the gender of the person in a face image; if the samples are face images and the annotation information is the person's age, the trained first recognition model recognizes the age of the person in the image. Because in step S101 all annotated samples in the training sample set are of the same type and their annotation information is of the same type, all of the first recognition models established by training are of the same type.
Each first recognition model established by training is a neural-network-based model, and may further be a deep convolutional neural network or another neural network model, for example R-CNN, Fast R-CNN, Faster R-CNN, SPP-net, R-FCN, FPN, YOLO, SSD, DenseBox, RetinaNet, RRC detection combined with an RNN algorithm, or Deformable CNN combined with DPM.
In general, the preset quantity can be set to 3 or more, and preferably to 5 or more. In practice, the preset quantity can be determined according to the number of annotated samples in the training sample set.
Step S103: obtain a recognition sample set to be used for testing; for each recognition sample in the recognition sample set, recognize the sample with each of the established first recognition models to obtain each first recognition model's recognition result for that sample; count the number of occurrences of each recognition result; and, when there is a recognition result whose occurrence count is not less than a preset threshold, determine the first recognition models corresponding to recognition results whose occurrence count is less than the preset threshold to be target recognition models.
In this embodiment, some of the annotated samples to be audited can be selected to form the recognition sample set used for testing. For example, a portion of the annotated samples may be drawn at random from the training sample set, at an extraction ratio of 5% to 20%; the ratio of the number of samples in the recognition sample set to the total number of samples in the training sample set can be adjusted according to the audit results.
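Drawing the recognition sample set at random can be sketched as below. The function name `draw_recognition_set`, the default ratio of 10%, and the fixed seed are assumptions for illustration; the text only states that the ratio may lie between 5% and 20%.

```python
import random


def draw_recognition_set(training_set, ratio=0.10, seed=0):
    """Randomly draw a fraction of the annotated samples as the recognition sample set.

    Hypothetical helper: `ratio` is the extraction ratio (the text suggests
    5% to 20%); `seed` only makes the example reproducible.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    k = max(1, round(len(training_set) * ratio))
    # random.sample draws k distinct elements without replacement.
    return rng.sample(training_set, k)


training_set = [f"sample_{i}" for i in range(1002)]  # stand-in annotated samples
recognition_set = draw_recognition_set(training_set, ratio=0.10)
print(len(recognition_set))  # 100
```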
Each recognition sample is recognized by each of the multiple different first recognition models established in step S102, yielding each first recognition model's recognition result for that sample. For example, suppose step S102 establishes 10 first recognition models, and their recognition results for recognition samples X, Y, and Z are as shown in Table 1:
Table 1
First recognition model    Recognition sample X    Recognition sample Y    Recognition sample Z
Model 1                    A                       A                       A
Model 2                    B                       B                       A
Model 3                    C                       C                       A
Model 4                    A                       A                       B
Model 5                    A                       A                       B
Model 6                    A                       A                       C
Model 7                    A                       A                       C
Model 8                    A                       C                       C
Model 9                    A                       C                       D
Model 10                   C                       C                       D
Counting the occurrences of each recognition result for recognition sample X shows that result A occurs 7 times, result B occurs once, and result C occurs twice. If the preset threshold is set to 4, then because result A occurs most often and its count of 7 exceeds the preset threshold, A can be taken as the correct recognition result for sample X. The first recognition models whose result is A can thus be considered to recognize sample X correctly, whereas those whose result is B or C (models 2, 3, and 10) cannot. It can further be inferred that the annotation information of the annotated samples in the sub-sample sets used to train models 2, 3, and 10 may be inaccurate, which lowers the recognition accuracy of these three models. Therefore, in processing recognition sample X, models 2, 3, and 10 are determined to be target recognition models.
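The counting rule for recognition sample X can be sketched as follows. The function name `find_target_models` and the dictionary representation of the recognition results are assumptions for illustration; the data are the sample X column of the table.

```python
from collections import Counter


def find_target_models(results, threshold):
    """Return the models whose result occurs fewer than `threshold` times.

    `results` maps a model id to that model's recognition result for ONE
    recognition sample (hypothetical representation). Returns None when no
    result reaches the threshold, a case the text handles separately.
    """
    counts = Counter(results.values())
    if not any(count >= threshold for count in counts.values()):
        return None
    return sorted(m for m, r in results.items() if counts[r] < threshold)


# Recognition results of models 1..10 for sample X.
sample_x = {1: "A", 2: "B", 3: "C", 4: "A", 5: "A",
            6: "A", 7: "A", 8: "A", 9: "A", 10: "C"}
print(find_target_models(sample_x, threshold=4))  # [2, 3, 10]
```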
Further, when there are multiple recognition results whose occurrence count is not less than the preset threshold, the first recognition models corresponding to a target recognition result can also be determined to be target recognition models. The target recognition result is any of these multiple recognition results other than the one with the highest occurrence count; that is, among the recognition results whose occurrence count is not less than the preset threshold, the target recognition results are those that do not have the highest occurrence count.
For example, among the recognition results of the 10 first recognition models for recognition sample Y in Table 1, result A occurs 5 times, result B occurs once, and result C occurs 4 times. If the preset threshold is set to 4, the occurrence counts of A and C both reach the preset threshold, but A occurs more often than C. A is therefore more likely to be the correct recognition result for sample Y, and the first recognition models whose result is C have lower recognition accuracy; it can also be inferred that the annotation information of the annotated samples in the sub-sample sets corresponding to those models (models 3, 8, 9, and 10) may be inaccurate. Therefore, in processing recognition sample Y, not only model 2 but also models 3, 8, 9, and 10 are determined to be target recognition models.
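The extended rule for recognition sample Y can be sketched as follows. The function name and data layout are assumptions for illustration, and the text does not say how a tie for the highest count should be resolved; this sketch simply keeps whichever tied result Python's `max` returns first.

```python
from collections import Counter


def find_target_models_extended(results, threshold):
    """Flag models below the threshold plus models on a non-top frequent result.

    `results` maps model id -> recognition result for one recognition sample
    (hypothetical representation). Tie-breaking among equally frequent top
    results is an assumption, not something the text specifies.
    """
    counts = Counter(results.values())
    frequent = [r for r, c in counts.items() if c >= threshold]
    if not frequent:
        return None  # no result reaches the threshold; handled separately
    majority = max(frequent, key=lambda r: counts[r])
    below = {m for m, r in results.items() if counts[r] < threshold}
    runner_up = {m for m, r in results.items()
                 if counts[r] >= threshold and r != majority}
    return sorted(below | runner_up)


# Recognition results of models 1..10 for sample Y.
sample_y = {1: "A", 2: "B", 3: "C", 4: "A", 5: "A",
            6: "A", 7: "A", 8: "C", 9: "C", 10: "C"}
print(find_target_models_extended(sample_y, threshold=4))  # [2, 3, 8, 9, 10]
```

Under this rule the flagged models are exactly those whose result differs from the most frequent one, so the two sets could be merged into a single comparison; they are kept separate here to follow the two-stage wording of the text.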
In practice, if counting shows that the occurrence count of every recognition result is less than the preset threshold, the recognition sample can be audited to obtain its audit result. It is then determined whether the audit result is among the recognition results that the different first recognition models produced for that sample: if so, the first recognition models whose recognition result differs from the audit result are determined to be target recognition models; if not, all of the first recognition models are determined to be target recognition models.
For example, among the recognition results of the 10 first recognition models for recognition sample Z in Table 1, result A occurs 3 times, result B occurs twice, result C occurs 3 times, and result D occurs twice. If the preset threshold is set to 4, the occurrence count of every recognition result is less than the preset threshold, which indicates that the first recognition models cannot reliably recognize sample Z, so sample Z must be audited to obtain its audit result. On the one hand, if the audit result of sample Z is A, the first recognition models whose result is A are considered to recognize sample Z correctly, whereas those whose result is B, C, or D (models 4 to 10) do not, so models 4 to 10 are determined to be target recognition models. On the other hand, if the audit result of sample Z is E, and E is not among the 10 first recognition models' recognition results for sample Z, then none of the 10 first recognition models recognizes sample Z correctly, so all 10 first recognition models are determined to be target recognition models.
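The fallback for recognition sample Z can be sketched as follows, assuming the audit result has already been obtained from a manual or automatic audit. The function name `targets_after_audit` and the data layout are assumptions for illustration.

```python
def targets_after_audit(results, audit_result):
    """Pick target models once a recognition sample has been audited.

    `results` maps model id -> recognition result for one recognition sample
    (hypothetical representation); `audit_result` is the audited label.
    """
    if audit_result in results.values():
        # Some model agreed with the audit: flag only the models that did not.
        return sorted(m for m, r in results.items() if r != audit_result)
    # No model produced the audited label: flag every model.
    return sorted(results)


# Recognition results of models 1..10 for sample Z.
sample_z = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B",
            6: "C", 7: "C", 8: "C", 9: "D", 10: "D"}
print(targets_after_audit(sample_z, "A"))  # [4, 5, 6, 7, 8, 9, 10]
print(targets_after_audit(sample_z, "E"))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```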
Step S104: audit the annotation information of the annotated samples in the sub-sample sets corresponding to the target recognition models.
In this embodiment, the annotated samples in the sub-sample sets corresponding to the target recognition models determined in step S103 can be sent to a verification client, so that the verification client audits the annotation information of the received annotated samples.
Because this embodiment first divides the training sample set into multiple sub-sample sets and trains multiple first recognition models, then takes the first recognition models with low recognition accuracy as target recognition models, only the annotated samples in the sub-sample sets corresponding to the target recognition models need to be audited rather than all samples, which improves audit efficiency.
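Gathering the samples to send for audit can be sketched as below. The sketch assumes that the target recognition models found for the individual recognition samples are pooled into one set before their sub-sample sets are collected; the text does not state this aggregation explicitly, and the function name and data layout are hypothetical.

```python
def collect_samples_to_audit(subsets_by_model, targets_per_sample):
    """Gather the annotated samples of every sub-sample set behind a target model.

    `subsets_by_model` maps model id -> the sub-sample set it was trained on;
    `targets_per_sample` holds the target model ids found for each recognition
    sample. Pooling the targets across samples is an assumption of this sketch.
    """
    target_models = set()
    for targets in targets_per_sample:
        target_models.update(targets)
    to_audit = []
    for model_id in sorted(target_models):
        to_audit.extend(subsets_by_model[model_id])
    return to_audit


# Five models, each trained on three stand-in samples.
subsets_by_model = {m: [f"sample_{m}_{i}" for i in range(3)] for m in range(1, 6)}
targets_per_sample = [[2, 3], [3], []]  # targets found for three recognition samples
batch = collect_samples_to_audit(subsets_by_model, targets_per_sample)
total = sum(len(s) for s in subsets_by_model.values())
print(len(batch), "of", total)  # 6 of 15
```

Only the sub-sample sets behind models 2 and 3 are audited in this toy run, which is the source of the efficiency gain described above.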
In one implementation, the verification client is a client that audits the received labeled samples by means of a second recognition model established in advance through training, and the recognition accuracy of the second recognition model is higher than a certain threshold. For example, the recognition accuracy of the second recognition model exceeds 99%, which ensures the accuracy with which the verification client audits the labeling information of the labeled samples. By means of the second recognition model, the verification client can recognize and verify the labeled samples automatically; the second recognition model is of the same type as the first recognition model. Alternatively, the verification client may be a manual client at which the received labeled samples are audited manually.
In summary, this embodiment first obtains the labeled samples to be audited and forms a training sample set, then divides the training sample set into a preset number of sub-sample sets and trains on the different sub-sample sets separately to establish different first recognition models. It then obtains a recognition sample set used for testing and, for each recognition sample in the recognition sample set, recognizes it separately with the different first recognition models to obtain each first recognition model's recognition result for that sample. The number of occurrences of each recognition result is counted; when there is a recognition result whose number of occurrences is not less than the preset threshold, the first recognition models corresponding to recognition results whose number of occurrences is less than the preset threshold are determined as target recognition models, and the labeling information of the labeled samples in the sub-sample sets corresponding to the determined target recognition models is then audited. Compared with the prior-art approach of manually auditing the labeling information of all samples in a sample set, this embodiment achieves fast auditing of sample labeling information and reduces time and labor costs. In addition, because the invention divides the training sample set into multiple sub-sample sets and trains multiple first recognition models, this approach is particularly suitable for scenarios in which the training sample set contains a large number of labeled samples that need to be audited.
Corresponding to the above embodiment of the method for auditing sample labeling information, an embodiment of the invention also provides an apparatus for auditing sample labeling information. Fig. 2 is a schematic structural diagram of an apparatus for auditing sample labeling information provided by an embodiment of the invention. Referring to Fig. 2, an apparatus for auditing sample labeling information may include:
An acquisition module 201, configured to obtain the labeled samples to be audited and form a training sample set, wherein the labeled samples are labeled with labeling information in advance;
A training module 202, configured to divide the training sample set into a preset number of sub-sample sets and to train on the different sub-sample sets separately to establish different first recognition models, the first recognition models being neural-network-based models;
A recognition module 203, configured to obtain a recognition sample set used for testing; for each recognition sample in the recognition sample set, to recognize it separately with the different first recognition models that have been established, obtaining each first recognition model's recognition result for that recognition sample; to count the number of occurrences of each recognition result; and, when there is a recognition result whose number of occurrences is not less than a preset threshold, to determine the first recognition models corresponding to recognition results whose number of occurrences is less than the preset threshold as target recognition models;
An audit module 204, configured to audit the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models.
Optionally, the recognition module 203 is further configured to:
When there are multiple recognition results whose numbers of occurrences are not less than the preset threshold, determine the first recognition models corresponding to the target recognition results as target recognition models;
Wherein the target recognition results are the recognition results, among the multiple recognition results, other than the recognition result with the largest number of occurrences.
Optionally, the recognition module 203 is further configured to:
When the number of occurrences of every recognition result is less than the preset threshold, audit the recognition sample to obtain the audit result of the recognition sample.
Optionally, the recognition module 203 is further configured to:
After the recognition sample has been audited, judge whether the audit result of the recognition sample is present among the recognition results of the different first recognition models for the recognition sample; if it is present, determine the first recognition models whose recognition results differ from the audit result as target recognition models; if it is not present, determine all of the first recognition models as target recognition models.
Optionally, the preset number is greater than or equal to 3.
Preferably, the preset number is greater than or equal to 5.
Optionally, the training module divides the training sample set into a preset number of sub-sample sets specifically by:
Dividing the training sample set evenly into the preset number of sub-sample sets, such that the difference between the numbers of samples in any two sub-sample sets is less than or equal to 1.
Optionally, the recognition module 203 obtains the recognition sample set used for testing specifically by:
Taking part of the labeled samples to be audited to form the recognition sample set used for testing.
Optionally, the audit module 204 audits the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models specifically by:
Sending the labeled samples in the sub-sample sets corresponding to the target recognition models to a verification client, so that the verification client audits the labeling information of the received labeled samples.
Optionally, the verification client is a client that audits the received labeled samples by means of a second recognition model established in advance through training, the recognition accuracy of the second recognition model being higher than a certain threshold; or
The verification client is a client at which the received labeled samples are audited manually.
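The even division mentioned among the optional features above (any two sub-sample sets differing in size by at most 1) can be sketched as follows. This is only one possible way to satisfy that condition; the function name `split_evenly`, the shuffling step, and the `seed` parameter are assumptions that the description does not specify.

```python
import random


def split_evenly(samples, preset_quantity, seed=0):
    """Divide samples into preset_quantity sub-sample sets whose sizes differ by at most 1."""
    shuffled = list(samples)
    # Shuffling is an assumption; the text only requires an even division.
    random.Random(seed).shuffle(shuffled)
    # Dealing samples out round-robin keeps any two subset sizes within 1 of each other.
    return [shuffled[i::preset_quantity] for i in range(preset_quantity)]


# Hypothetical example: 103 labeled samples divided into 10 sub-sample sets.
subsets = split_evenly(range(103), 10)
sizes = [len(s) for s in subsets]
```

Here three sub-sample sets receive 11 samples and the remaining seven receive 10, so the size difference between any two sub-sample sets is at most 1.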
In conclusion the present embodiment obtains the mark sample audited and composition training sample set first, then Training sample set is divided into preset quantity sub- sample set, different subsample collection is trained respectively, establishes different the One identification model, then the identification sample set used as test is obtained, it is logical for each identification sample in identification sample set It crosses the first different identification models established to be identified respectively, obtains identification of every one first identification model to the identification sample As a result, counting the frequency of occurrence of each recognition result, when there are the recognition result that frequency of occurrence is not less than preset threshold, will occur Corresponding first identification model of recognition result that number is less than preset threshold is determined as Model of Target Recognition, and then to identified The mark sample that the corresponding subsample of Model of Target Recognition is concentrated is labeled signal auditing.Compared with the prior art by artificial right The mode that the markup information of all samples is audited in sample set, the present embodiment may be implemented to the quick of sample markup information Audit reduces time and human cost;Meanwhile the present invention by training sample set be divided into multiple subsample collection and training establish it is multiple First identification model, this mode is especially suitable for the scene for marking sample that training sample set includes that a large amount of needs are audited.
An embodiment of the invention also provides an electronic device. Fig. 3 is a schematic structural diagram of an electronic device provided by an embodiment of the invention. Referring to Fig. 3, an electronic device includes a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with one another via the communication bus 304,
The memory 303 is configured to store a computer program;
The processor 301 is configured to implement the following steps when executing the program stored in the memory 303:
Obtaining the labeled samples to be audited and forming a training sample set, wherein the labeled samples are labeled with labeling information in advance;
Dividing the training sample set into a preset number of sub-sample sets and training on the different sub-sample sets separately to establish different first recognition models, the first recognition models being neural-network-based models;
Obtaining a recognition sample set used for testing; for each recognition sample in the recognition sample set, recognizing it separately with the different first recognition models that have been established to obtain each first recognition model's recognition result for that recognition sample; counting the number of occurrences of each recognition result; and, when there is a recognition result whose number of occurrences is not less than a preset threshold, determining the first recognition models corresponding to recognition results whose number of occurrences is less than the preset threshold as target recognition models;
Auditing the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models.
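The steps above can be sketched end to end on toy data. The sketch below is illustrative only and rests on several assumptions: the first recognition models are described as neural-network-based, but here a 1-nearest-neighbour lookup (`train`) stands in for training; the one-dimensional data, the labels "pos" and "neg", the helper `flag_models`, the deliberately mislabeled sub-sample set, and the threshold of 3 are all hypothetical. The case in which no recognition result reaches the threshold (which requires auditing the recognition sample) is omitted.

```python
from collections import Counter


def train(subset):
    """Stand-in for training a neural network: 1-nearest-neighbour over the subset."""
    def predict(x):
        return min(subset, key=lambda s: abs(s[0] - x))[1]
    return predict


def flag_models(models, recognition_samples, threshold):
    """Return indices of models whose result loses the vote on some recognition sample."""
    flagged = set()
    for x in recognition_samples:
        results = [predict(x) for predict in models]
        counts = Counter(results)
        best, best_count = counts.most_common(1)[0]
        if best_count >= threshold:
            # Every model that did not give the most frequent result is flagged.
            flagged.update(i for i, r in enumerate(results) if r != best)
        # Otherwise the sample itself would have to be audited (not shown here).
    return flagged


# Toy training set: label is "pos" for x > 0 and "neg" for x < 0.
xs = [x for x in range(-10, 11) if x != 0]
training_set = [(x, "pos" if x > 0 else "neg") for x in xs]
subsets = [training_set[i::5] for i in range(5)]
# Simulate labeling errors: every label in sub-sample set 2 is flipped.
subsets[2] = [(x, "neg" if label == "pos" else "pos") for x, label in subsets[2]]

models = [train(s) for s in subsets]
flagged = flag_models(models, [-3.2, 6.4], threshold=3)
to_audit = [s for i in sorted(flagged) for s in subsets[i]]
```

In this toy setting only the model trained on the mislabeled sub-sample set disagrees with the majority, so only that sub-sample set's four labeled samples are selected for auditing.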
For the specific implementation of each step of this method and the related explanations, refer to the method embodiment shown in Fig. 1 above; the details are not repeated here.
Compared with the prior-art approach of manually auditing the labeling information of all samples in a sample set, this embodiment achieves fast auditing of sample labeling information and reduces time and labor costs. In addition, because this embodiment divides the training sample set into multiple sub-sample sets and trains multiple first recognition models, this approach is particularly suitable for scenarios in which the training sample set contains a large number of labeled samples that need to be audited.
In addition, the other implementations of the method for auditing sample labeling information that are realized when the processor 301 executes the program stored in the memory 303 are the same as the implementations mentioned in the foregoing method embodiment and are likewise not repeated here.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration it is represented by only one thick line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) and may also include non-volatile memory (Non-Volatile Memory, NVM), for example at least one magnetic disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the invention also provides a computer-readable storage medium in which a computer program is stored; when the computer program is executed by a processor, the method steps of the above method for auditing sample labeling information are implemented.
Compared with the prior-art approach of manually auditing the labeling information of all samples in a sample set, this embodiment achieves fast auditing of sample labeling information and reduces time and labor costs. In addition, because this embodiment divides the training sample set into multiple sub-sample sets and trains multiple first recognition models, this approach is particularly suitable for scenarios in which the training sample set contains a large number of labeled samples that need to be audited.
It should be noted that the embodiments in this specification are described in a related manner; for the parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus, electronic device and computer-readable storage medium embodiments are described relatively briefly because they are substantially similar to the method embodiment; for the relevant details, refer to the corresponding description of the method embodiment.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements that are inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
The above is only a description of preferred embodiments of the invention and does not limit the scope of the invention in any way. Any changes or modifications made by a person of ordinary skill in the field of the invention on the basis of the above disclosure fall within the scope of protection of the claims.

Claims (20)

1. A method for auditing sample labeling information, characterized in that the method comprises:
Obtaining the labeled samples to be audited and forming a training sample set, wherein the labeled samples are labeled with labeling information in advance;
Dividing the training sample set into a preset number of sub-sample sets and training on the different sub-sample sets separately to establish different first recognition models, the first recognition models being neural-network-based models;
Obtaining a recognition sample set used for testing; for each recognition sample in the recognition sample set, recognizing it separately with the different first recognition models that have been established to obtain each first recognition model's recognition result for that recognition sample; counting the number of occurrences of each recognition result; and, when there is a recognition result whose number of occurrences is not less than a preset threshold, determining the first recognition models corresponding to recognition results whose number of occurrences is less than the preset threshold as target recognition models;
Auditing the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models.
2. The method for auditing sample labeling information according to claim 1, characterized in that, when there are multiple recognition results whose numbers of occurrences are not less than the preset threshold, the method further comprises:
Determining the first recognition models corresponding to the target recognition results as target recognition models;
Wherein the target recognition results are the recognition results, among the multiple recognition results, other than the recognition result with the largest number of occurrences.
3. The method for auditing sample labeling information according to claim 1, characterized in that, when the number of occurrences of every recognition result is less than the preset threshold, the method further comprises:
Auditing the recognition sample to obtain the audit result of the recognition sample.
4. The method for auditing sample labeling information according to claim 3, characterized in that, after the recognition sample has been audited, the method further comprises:
Judging whether the audit result of the recognition sample is present among the recognition results of the different first recognition models for the recognition sample;
If it is present, determining the first recognition models whose recognition results differ from the audit result as target recognition models;
If it is not present, determining all of the first recognition models as target recognition models.
5. The method for auditing sample labeling information according to claim 1, characterized in that the preset number is greater than or equal to 3.
6. The method for auditing sample labeling information according to claim 5, characterized in that the preset number is greater than or equal to 5.
7. The method for auditing sample labeling information according to claim 1, characterized in that dividing the training sample set into a preset number of sub-sample sets comprises:
Dividing the training sample set evenly into the preset number of sub-sample sets, such that the difference between the numbers of samples in any two sub-sample sets is less than or equal to 1.
8. The method for auditing sample labeling information according to claim 1, characterized in that obtaining the recognition sample set used for testing comprises:
Taking part of the labeled samples to be audited to form the recognition sample set used for testing.
9. The method for auditing sample labeling information according to claim 1, characterized in that auditing the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models comprises:
Sending the labeled samples in the sub-sample sets corresponding to the target recognition models to a verification client, so that the verification client audits the labeling information of the received labeled samples.
10. The method for auditing sample labeling information according to claim 9, characterized in that the verification client is a client that audits the received labeled samples by means of a second recognition model established in advance through training, the recognition accuracy of the second recognition model being higher than a certain threshold; or
The verification client is a client at which the received labeled samples are audited manually.
11. An apparatus for auditing sample labeling information, characterized in that the apparatus comprises:
An acquisition module, configured to obtain the labeled samples to be audited and form a training sample set, wherein the labeled samples are labeled with labeling information in advance;
A training module, configured to divide the training sample set into a preset number of sub-sample sets and to train on the different sub-sample sets separately to establish different first recognition models, the first recognition models being neural-network-based models;
A recognition module, configured to obtain a recognition sample set used for testing; for each recognition sample in the recognition sample set, to recognize it separately with the different first recognition models that have been established, obtaining each first recognition model's recognition result for that recognition sample; to count the number of occurrences of each recognition result; and, when there is a recognition result whose number of occurrences is not less than a preset threshold, to determine the first recognition models corresponding to recognition results whose number of occurrences is less than the preset threshold as target recognition models;
An audit module, configured to audit the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models.
12. The apparatus for auditing sample labeling information according to claim 11, characterized in that the recognition module is further configured to:
When there are multiple recognition results whose numbers of occurrences are not less than the preset threshold, determine the first recognition models corresponding to the target recognition results as target recognition models;
Wherein the target recognition results are the recognition results, among the multiple recognition results, other than the recognition result with the largest number of occurrences.
13. The apparatus for auditing sample labeling information according to claim 11, characterized in that the recognition module is further configured to:
When the number of occurrences of every recognition result is less than the preset threshold, audit the recognition sample to obtain the audit result of the recognition sample.
14. The apparatus for auditing sample labeling information according to claim 13, characterized in that the recognition module is further configured to:
After the recognition sample has been audited, judge whether the audit result of the recognition sample is present among the recognition results of the different first recognition models for the recognition sample; if it is present, determine the first recognition models whose recognition results differ from the audit result as target recognition models; if it is not present, determine all of the first recognition models as target recognition models.
15. The apparatus for auditing sample labeling information according to claim 11, characterized in that the preset number is greater than or equal to 3.
16. The apparatus for auditing sample labeling information according to claim 11, characterized in that the recognition module obtains the recognition sample set used for testing specifically by:
Taking part of the labeled samples to be audited to form the recognition sample set used for testing.
17. The apparatus for auditing sample labeling information according to claim 11, characterized in that the audit module audits the labeling information of the labeled samples in the sub-sample sets corresponding to the target recognition models specifically by:
Sending the labeled samples in the sub-sample sets corresponding to the target recognition models to a verification client, so that the verification client audits the labeling information of the received labeled samples.
18. The apparatus for auditing sample labeling information according to claim 17, characterized in that the verification client is a client that audits the received labeled samples by means of a second recognition model established in advance through training, the recognition accuracy of the second recognition model being higher than a certain threshold; or
The verification client is a client at which the received labeled samples are audited manually.
19. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another via the communication bus;
The memory is configured to store a computer program;
The processor is configured to implement the method steps of any one of claims 1-10 when executing the program stored in the memory.
20. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the method steps of any one of claims 1-10 are implemented when the computer program is executed by a processor.
CN201910538177.XA 2019-06-20 2019-06-20 Sample labeling information auditing method and device Active CN110222791B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910538177.XA CN110222791B (en) 2019-06-20 2019-06-20 Sample labeling information auditing method and device
PCT/CN2020/095978 WO2020253636A1 (en) 2019-06-20 2020-06-12 Sample label information verification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910538177.XA CN110222791B (en) 2019-06-20 2019-06-20 Sample labeling information auditing method and device

Publications (2)

Publication Number Publication Date
CN110222791A true CN110222791A (en) 2019-09-10
CN110222791B CN110222791B (en) 2020-12-04

Family

ID=67814089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910538177.XA Active CN110222791B (en) 2019-06-20 2019-06-20 Sample labeling information auditing method and device

Country Status (2)

Country Link
CN (1) CN110222791B (en)
WO (1) WO2020253636A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113034025B (en) * 2021-04-08 2023-12-01 成都国星宇航科技股份有限公司 Remote sensing image labeling system and method
CN113839953A (en) * 2021-09-27 2021-12-24 上海商汤科技开发有限公司 Labeling method and device, electronic equipment and storage medium
CN114189709A (en) * 2021-11-12 2022-03-15 北京天眼查科技有限公司 Method and device for auditing video, storage medium and electronic equipment
CN114972846A (en) * 2022-04-29 2022-08-30 上海深至信息科技有限公司 Ultrasonic image annotation system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070260342A1 (en) * 2006-05-08 2007-11-08 Standard Aero Limited Method for inspection process development or improvement and parts inspection process
CN101359372A (en) * 2008-09-26 2009-02-04 腾讯科技(深圳)有限公司 Training method and device of classifier, and method apparatus for recognising sensitization picture
CN104751188A (en) * 2015-04-15 2015-07-01 爱威科技股份有限公司 Image processing method and system
CN108806668A (en) * 2018-06-08 2018-11-13 国家计算机网络与信息安全管理中心 A kind of audio and video various dimensions mark and model optimization method
CN109583468A (en) * 2018-10-12 2019-04-05 阿里巴巴集团控股有限公司 Training sample acquisition methods, sample predictions method and corresponding intrument
CN109784391A (en) * 2019-01-04 2019-05-21 杭州比智科技有限公司 Sample mask method and device based on multi-model

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390112B1 (en) * 2013-11-22 2016-07-12 Groupon, Inc. Automated dynamic data quality assessment
US11288551B2 (en) * 2016-10-24 2022-03-29 International Business Machines Corporation Edge-based adaptive machine learning for object recognition
CN109446369B (en) * 2018-09-28 2021-10-08 武汉中海庭数据技术有限公司 Interaction method and system for semi-automatic image annotation
CN109284784A (en) * 2018-09-29 2019-01-29 北京数美时代科技有限公司 A kind of content auditing model training method and device for live scene video
CN110222791B (en) * 2019-06-20 2020-12-04 杭州睿琪软件有限公司 Sample labeling information auditing method and device

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020253636A1 (en) * 2019-06-20 2020-12-24 杭州睿琪软件有限公司 Sample label information verification method and device
CN110705257A (en) * 2019-09-16 2020-01-17 腾讯科技(深圳)有限公司 Media resource identification method and device, storage medium and electronic device
CN110705257B (en) * 2019-09-16 2021-06-25 腾讯科技(深圳)有限公司 Media resource identification method and device, storage medium and electronic device
CN110852376B (en) * 2019-11-11 2023-05-26 杭州睿琪软件有限公司 Method and system for identifying biological species
CN110852376A (en) * 2019-11-11 2020-02-28 杭州睿琪软件有限公司 Method and system for identifying biological species
CN111160188A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Financial bill identification method, device, equipment and storage medium
CN111259980A (en) * 2020-02-10 2020-06-09 北京小马慧行科技有限公司 Method and device for processing labeled data
CN111259980B (en) * 2020-02-10 2023-10-03 北京小马慧行科技有限公司 Method and device for processing annotation data
CN112070224A (en) * 2020-08-26 2020-12-11 成都品果科技有限公司 Revision system and method of sample for neural network training
CN112070224B (en) * 2020-08-26 2024-02-23 成都品果科技有限公司 Revision system and method of samples for neural network training
CN112328822A (en) * 2020-10-15 2021-02-05 深圳市优必选科技股份有限公司 Picture pre-labeling method and device and terminal equipment
CN112328822B (en) * 2020-10-15 2024-04-02 深圳市优必选科技股份有限公司 Picture pre-marking method and device and terminal equipment
CN114240101A (en) * 2021-12-02 2022-03-25 支付宝(杭州)信息技术有限公司 Risk identification model verification method, device and equipment
CN114219501A (en) * 2022-02-22 2022-03-22 杭州衡泰技术股份有限公司 Sample labeling resource allocation method, device and application

Also Published As

Publication number Publication date
WO2020253636A1 (en) 2020-12-24
CN110222791B (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN110222791A (en) Sample labeling information auditing method and device
CN110222170B (en) Method, device, storage medium and computer equipment for identifying sensitive data
CN109271401B (en) Topic searching and correcting method and device, electronic equipment and storage medium
CN109561322B (en) Video auditing method, device, equipment and storage medium
CN109993112A (en) Method and device for recognizing tables in pictures
CN110245716A (en) Sample labeling auditing method and device
CN109784381A (en) Markup information processing method, device and electronic equipment
CN105975980A (en) Method of monitoring image mark quality and apparatus thereof
CN108491388B (en) Data set acquisition method, classification method, device, equipment and storage medium
CN105303179A (en) Fingerprint identification method and fingerprint identification device
CN109902285A (en) Corpus classification method, device, computer equipment and storage medium
CN105426862B (en) Analysis method and its system based on RFID, location technology and video technique
CN107491536B (en) Test question checking method, test question checking device and electronic equipment
CN110348444A (en) Wrong topic collection method, device and equipment based on deep learning
CN109508373A (en) Calculation method, equipment and the computer readable storage medium of enterprise's public opinion index
CN110503099B (en) Information identification method based on deep learning and related equipment
CN109284355A (en) Method and device for correcting oral calculation questions in test papers
CN110473211B (en) Method and equipment for detecting number of spring pieces
CN107679046A (en) Method and device for detecting fraudulent users
CN110263853A (en) The method and device of artificial client state is checked using error sample
CN110826494A (en) Method and device for evaluating quality of labeled data, computer equipment and storage medium
CN109800309A (en) Classroom Discourse genre classification methods and device
US20210192965A1 (en) Question correction method, device, electronic equipment and storage medium for oral calculation questions
CN110503143A (en) Threshold selection method, equipment, storage medium and device based on intent recognition
CN112926621A (en) Data labeling method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 2022-04-29

Address after: 310053 room d3189, North third floor, building 1, 368 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Hangzhou Ruisheng Software Co.,Ltd.

Address before: Room B2019, 2nd floor, building 1 (North), 368 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province, 310053

Patentee before: HANGZHOU GLORITY SOFTWARE Ltd.