CN112445831B - Data labeling method and device

Data labeling method and device

Info

Publication number
CN112445831B
Authority
CN
China
Prior art keywords
data
sample
label
samples
labeling
Prior art date
Legal status
Active
Application number
CN202110133276.7A
Other languages
Chinese (zh)
Other versions
CN112445831A (en
Inventor
程会云
史明
王西颖
Current Assignee
Nanjing Qiyuan Technology Co.,Ltd.
Original Assignee
Nanjing Iqiyi Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Iqiyi Intelligent Technology Co Ltd filed Critical Nanjing Iqiyi Intelligent Technology Co Ltd
Priority to CN202110133276.7A priority Critical patent/CN112445831B/en
Publication of CN112445831A publication Critical patent/CN112445831A/en
Application granted granted Critical
Publication of CN112445831B publication Critical patent/CN112445831B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data labeling method and device, the method comprising: inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, where the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1; dividing the data corresponding to the labels into samples of different confusion degrees based on the confidence of the labels, the confidence being the degree of agreement among the K labels obtained for each piece of data; and, in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled. In this scheme, the K trained labeling models check and compare against one another, so samples of different confusion degrees are labeled automatically, greatly saving labor and time costs.

Description

Data labeling method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data annotation method and device.
Background
With the rapid development of science and technology, artificial intelligence has become a focus of public attention. Supported by advances such as big data, artificial intelligence has produced fruitful results in fields such as data analysis, image recognition, smart homes, and automatic driving. Driven by massive data and built around deep learning algorithms, artificial intelligence technology gives machines a preliminary version of basic human visual and auditory abilities and may make them competent for relatively complex mental labor. Because deep learning algorithms require large amounts of data, labeling massive data has become an urgent market need.
One existing data labeling approach is manual labeling, but manual labeling is time-consuming and easily affected by the subjective factors of the annotator, so its precision is not high.
There are also methods that label data with a trained model. However, a trained model relies on a large number of samples for training, and its labeling accuracy is determined entirely by the number and quality of those samples. Therefore, finding an automatic labeling method that is less time-consuming and more accurate is an urgent problem to be solved.
Disclosure of Invention
In view of the foregoing problems, an object of the embodiments of the present invention is to provide a data labeling method and device that remedy the deficiencies of the prior art.
According to an embodiment of the present invention, there is provided a data annotation method, including:
inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
dividing the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
and, in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled.
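To make the ensemble-construction step concrete, the following is a minimal Python sketch, assuming list-based training data and a generic train_model function; all names here are illustrative and not part of the patent.

    import random
    from typing import Callable, List, Sequence

    def train_k_labeling_models(total_training_set: Sequence, k: int,
                                sample_size: int,
                                train_model: Callable) -> List:
        # Draw K sub-training sets by random sampling with replacement
        # (a bootstrap), then train one labeling model on each of them.
        sub_training_sets = [random.choices(total_training_set, k=sample_size)
                             for _ in range(k)]
        return [train_model(subset) for subset in sub_training_sets]

Sampling with replacement keeps the K sub-training sets overlapping, so the K models stay comparable, while the randomness still varies each model's view of the data.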
In the above data labeling method, the total training set includes a first predetermined number of labeled samples.
In the above data labeling method, the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
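Read literally, this division compares how many of the K labels agree. A hedged sketch follows; the agreement threshold m and the tie handling are assumptions for illustration, not fixed by the patent.

    from collections import Counter
    from typing import List

    def confusion_category(labels: List[str], k: int, m: int) -> str:
        # Count how often the most common of the K labels occurs:
        # all K agree -> simple; at least m agree -> confusable;
        # otherwise -> difficult.
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count == k:
            return "simple"
        if top_count >= m:
            return "confusable"
        return "difficult"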
In the above data labeling method, the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In the above data labeling method, the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In the above data labeling method, the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
In the above data labeling method, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further comprises:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
In the above data labeling method, the first preset threshold, the second preset threshold, and the third preset threshold are the same.
In the above data labeling method, the K labeling models are the same model.
According to another embodiment of the present invention, there is provided a data annotation apparatus including:
the system comprises an input module, a label module and a label module, wherein the input module is used for respectively inputting each data in a data set to be labeled into K labeling models and obtaining K labels for each data, the K labeling models are respectively obtained by training K sub-training sets, the K sub-training sets are obtained by performing K-time replaced random sampling on samples in a total training set, and K is an integer greater than 1;
the sample determining module is used for dividing the data corresponding to the labels into samples with different confusion degrees based on the consistency degree of the labels;
and the marking module is used for marking the samples with different confusion degrees in sequence in a preset stage to obtain the label of each data in the data set to be marked.
According to still another embodiment of the present invention, an electronic device is provided, comprising a memory for storing a computer program and a processor for running the computer program so that the electronic device executes the above data labeling method.
According to still another embodiment of the present invention, a computer-readable storage medium is provided that stores the computer program used by the above electronic device.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the data labeling method and device, K sub-training sets are obtained by randomly sampling all samples in a total training set for K times and putting back, K labeling models are trained through K sub-training sets, and data are labeled through verification and comparison of the K labeling models; and labeling each data in the data set to be labeled respectively through K labeling models, labeling samples with different confusion degrees according to the confidence degrees of the labels, wherein the confidence degrees are the consistent degrees of the K labels obtained aiming at each data, labeling the labels respectively based on the labels with different confidence degrees according to the mode, automatically labeling each data in the data set to be labeled, improving the labeling precision based on the mode of the confidence degrees, and greatly saving labor and time cost.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart illustrating a data annotation method according to a first embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a first-stage data annotation method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a second-stage data annotation method according to a first embodiment of the present invention;
fig. 4 is a flowchart illustrating a third-stage data annotation method according to a first embodiment of the present invention;
fig. 5 is a schematic structural diagram illustrating a data annotation device according to a second embodiment of the present invention.
Description of the main element symbols:
500 - data annotation device; 510 - input module; 520 - sample determination module; 530 - labeling module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Fig. 1 is a schematic flow chart illustrating a data annotation method according to a first embodiment of the present invention.
The data annotation method comprises the following steps:
in step S110, each data in the data set to be labeled is input into K labeling models, and K labels are obtained for each data.
Specifically, the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1.
Because K labeling models are trained in this embodiment, sampling without replacement would leave the labeling models too dissimilar: each model would effectively run independently, the cross-checking among the K models would give poor results, and each model would generalize poorly, lowering the precision of the final labels. Therefore, in this embodiment the samples in the total training set are sampled randomly with replacement: replacement keeps the sub-training sets related, which makes the models comparable, while the random sampling introduces a certain amount of noise and shifts the sample distribution, which increases each model's generalization ability. Training the labeling models on K sub-training sets obtained by K rounds of random sampling with replacement therefore improves both the accuracy of the data labeling and the generalization ability of each labeling model.
Further, the total training set includes a first predetermined number of labeled samples.
In this embodiment, to reduce the workload of labeling samples, the first preset number may be set smaller than the total number of samples in the total training set. That is, in a semi-supervised manner, the labeling models are trained with only a small number of labeled samples, reducing labor and time costs.
Of course, in some other embodiments, the first preset number may also equal the total number of samples, in which case each labeling model is trained on a fully labeled total training set; this may be decided according to user requirements.
Specifically, the K labeling models are the same model; for example, they may be models with the same network structure, such as K identical classifiers.
In this embodiment, the K identical labeling models are cascaded, and the results of the K models are verified and compared against one another, giving a general scheme with strong labeling capability and improved labeling precision.
In step S120, the label correspondence data is divided into samples of different degrees of confusion based on the degree of agreement of the labels.
Specifically, each piece of data in the data set to be labeled is input into the K labeling models, yielding K labels for that piece of data; the degree of agreement among the K labels is then computed so that the data corresponding to the labels can be divided into samples of different confusion degrees.
In this embodiment, if K is 5, a sample M input into the 5 trained labeling models yields 5 labels: L1, L2, L1, L2, and L1. Three of the models output L1 and two output L2, so 3/5 can be taken as the degree of agreement of the label. Of course, in some other embodiments, the degree of agreement may be represented in other ways, as needed.
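For instance, the 3/5 figure above can be computed as the frequency of the most common label; a small Python illustration, not part of the patent:

    from collections import Counter

    labels = ["L1", "L2", "L1", "L2", "L1"]        # outputs of the K = 5 models
    top_count = Counter(labels).most_common(1)[0][1]
    degree_of_agreement = top_count / len(labels)  # 3 / 5 = 0.6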
Further, the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples, a simple sample being one whose labeling result is reliable; data for which M of the K labels agree to be confusable samples, a confusable sample being one whose labeling result is ambiguous, that is, it may turn out to be reliable or unreliable; and data for which N of the K labels disagree to be difficult samples, that is, samples whose labeling results are unreliable. M and N are both positive integers smaller than K.
In this embodiment, the value of M may be equal to the value of N; in some other embodiments, the value of M may not be equal to the value of N, depending on the requirement.
In step S130, in a preset stage, the samples with different confusion degrees are labeled in sequence to obtain a label of each data in the data set to be labeled.
Specifically, weighing time cost against accuracy, different stages can be set; in each stage, samples of a given confusion degree are sent to the K trained labeling models for labeling, so that the label of each piece of data in the data set to be labeled is finally obtained and the data labeling is completed.
In this embodiment, the samples may be labeled in order of increasing confusion degree, that is, from high to low degree of agreement.
For example, in the first stage, labeling may start with the samples of low confusion (high agreement), that is, the simple samples, and the incorrectly labeled samples are put back into the total training set. In the second stage, the total training set obtained in the first stage is again sampled randomly with replacement to obtain K sub-training sets, K new labeling models are trained on them, the confusable samples are added to the data set to be labeled and are labeled by the new K models, and the samples labeled incorrectly in the second stage are put back into the total training set. In the third stage, the total training set obtained in the second stage is again sampled randomly with replacement to obtain K sub-training sets, K new labeling models are trained on them, the difficult samples are added to the data set to be labeled, and the label of each piece of data in the data set to be labeled is obtained through the K newly trained labeling models.
In addition, after the label of each piece of data in the data set to be labeled is finally obtained, manual verification can determine whether each label is correct; any incorrectly labeled data is relabeled manually, so that all data in the data set to be labeled end up correctly labeled.
Further, the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
Specifically, as shown in fig. 2, the data annotation scheme of the first stage may include the following steps:
step S211, setting an initial small number of labeled data sets N0, a stage iteration count, and an unlabeled data set E.
Specifically, the labeled data set N0 is a labeled sample, and the data set N0 includes a first predetermined number of labeled samples, where the first predetermined number may be N, N is an integer greater than K, and K is the number of labeled models to be trained.
The range of the iteration times of the first stage is (1-I), I is a first preset threshold value, and the first preset threshold value is set according to the user requirement or the mark iteration cost.
And the unmarked data set is the data set to be marked.
In step S212, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
Specifically, the labeled data set N0 contains N labeled samples, on which K rounds of random sampling with replacement are performed, each round drawing M samples; each round thus yields one sub-training set of M labeled samples, and K rounds of sampling with replacement yield K sub-training sets.
Step S213, train the depth models corresponding to the K data sets to obtain K labeling models.
Specifically, each labeling model is a depth model, which may be a classifier such as a neural network model or a decision tree model.
A depth model is trained on each sub-training set, so training on the K sub-training sets finally yields K labeling models, which are used to label the data in the data set to be labeled.
In step S214, the number of iterations i =0 is set.
Specifically, a variable i is set, the variable i is used for representing the iteration number of the first stage, the initial value of the iteration number is 0, and the value of i is increased by one every iteration.
In step S215, it is determined whether i < I holds.
If i < I, the iteration count has not reached the first preset threshold; each iteration repeats the data labeling procedure once, and since the count has not been reached, the process proceeds to step S217. If i >= I, the iteration count has reached the first preset threshold, the iteration terminates, and the process proceeds to step S216.
Step S216, manually calibrating S0 to obtain Sr and Se.
Specifically, S0 denotes the set of samples whose K labels agree and are of high confidence. Confidence indicates how correct a label is; it can be represented by the label's classification score, and the more accurate the classification, the higher the score.
In this embodiment, the confidence of a label may be determined by the probability, output by the labeling model, that the sample belongs to that label. Of course, in some other embodiments, the confidence may be expressed in other ways, which are not limited herein.
In this embodiment, high confidence means that the confidence of the label is above a standard confidence threshold; the standard confidence threshold may be determined statistically, according to the distribution of a preset verification set (the set of samples used to verify the labeling models), from the confidences of the labels of all samples in that verification set.
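The patent does not fix the statistic used for this threshold; one plausible reading, shown here purely as an assumption, is the mean of each verification sample's highest class probability:

    import numpy as np

    def standard_confidence_threshold(validation_probs: np.ndarray) -> float:
        # validation_probs has shape (n_samples, n_classes): the labeling
        # model's class probabilities on the preset verification set.
        # Use the mean per-sample maximum probability as the threshold.
        return float(validation_probs.max(axis=1).mean())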
Sr denotes the first sample set, i.e., the correctly labeled samples in S0; Se denotes the second sample set, i.e., the incorrectly labeled samples in S0.
If the iteration count of the first stage has reached the first preset threshold, each sample in the S0 obtained after the last iteration is checked manually to determine whether its label is correct; samples with correct labels are added to the first sample set Sr, and samples with wrong labels are added to the second sample set Se.
In step S217, a sample S is selected from E.
Specifically, if the iteration count has not reached the first preset threshold, the samples in the unlabeled data set E still need to be labeled, so a sample S is selected from E for subsequent labeling.
In step S218, it is determined whether E is non-empty.
Specifically, if the K labels obtained for a sample S from the K labeling models all agree and are of high confidence, S is removed from E and added to S0; since samples keep being removed, E may eventually become empty, so before labeling a sample S it is necessary to check whether E is empty. If E is empty, go to step S219; if E is not empty, proceed to step S221.
Step S219, E = E1.
Specifically, E1 denotes the data set of samples that have been processed but whose K labels either disagree or are not of high confidence. When the original E becomes empty, E = E1 makes E1 the new data set to be labeled.
After the execution of step S219 ends, the process proceeds to step S225.
Step S220, N0 += Sr, E = Se.
After S0 is manually calibrated to obtain Sr and Se, Sr is added to N0, and Se becomes the data set to be labeled.
Step S221, test S with the K labeling models to obtain K labels.
Specifically, the sample S is fed into the K trained labeling models, which output K labels.
Step S222, determine whether the K labels all agree and are of high confidence.
Specifically, if the K labels all agree and are of high confidence, the process proceeds to step S224; otherwise, the process proceeds to step S223.
In step S223, E1 += {S}.
Specifically, if the K labels do not all agree, or their confidence is low, S is added to E1.
Step S224, S0 += {S}, E -= {S}.
Specifically, if the K labels all agree and are of high confidence, the sample S is removed from E and added to S0.
If the iteration count has not reached the first preset threshold, the process returns to step S217 after step S223 or step S224 finishes.
Step S225, N0 += S0.
S0 is added to N0, and the process returns to step S212 to execute the first-stage data labeling procedure again.
Finally, the remaining unlabeled samples (the confusable samples and the difficult samples) and the samples in the second sample set are labeled manually, so that the label of each piece of data in the data set to be labeled is obtained.
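Putting steps S212 through S225 together, the first-stage loop can be sketched in Python as follows. This is a hedged illustration rather than the patent's implementation: train_model, predict, and is_high_confidence are hypothetical helpers, N0 is taken to be a list of (sample, label) pairs, and the manual calibration of the final S0 into Sr and Se (step S216) is omitted because it is a human step.

    import random
    from typing import Callable, List, Tuple

    def first_stage(n0: list, e: list, k: int, sample_size: int,
                    max_iters: int, train_model: Callable,
                    predict: Callable, is_high_confidence: Callable):
        for _ in range(max_iters):                  # loop while i < I
            # S212-S213: K bootstrap sub-training sets, one model each
            models = [train_model(random.choices(n0, k=sample_size))
                      for _ in range(k)]
            s0: List[Tuple] = []                    # agreeing, high-confidence samples
            e1: list = []                           # everything else
            for sample in e:                        # S217-S224
                results = [predict(model, sample) for model in models]
                labels = [label for label, _ in results]
                confidences = [conf for _, conf in results]
                if len(set(labels)) == 1 and is_high_confidence(confidences):
                    s0.append((sample, labels[0]))  # S0 += {S}, E -= {S}
                else:
                    e1.append(sample)               # E1 += {S}
            n0 = n0 + s0                            # S225: N0 += S0
            e = e1                                  # S219: E = E1
        return n0, e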
Further, the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In this embodiment, labeling proceeds in two stages: the first-stage data labeling and the second-stage data labeling. Only simple samples are processed in the first stage, and over multiple iterations the obtained simple samples are added to the training set, expanding its capacity and richness; the second stage processes simple and confusable samples simultaneously and adds both to the training set, expanding the training set's capacity and richness further.
The first-stage data labeling scheme is described with reference to fig. 2; this part explains only the second-stage scheme.
Specifically, as shown in fig. 3, the data annotation method at the second stage includes the following steps:
it should be noted that the steps in fig. 3 are the same as the steps in fig. 2, and the content of this section only explains the different parts, and the same parts in fig. 2 are explained with reference to fig. 2.
And step S311, labeling the data set N0, performing two-stage iteration times II, and not labeling the data set E.
Specifically, the range of the number of iterations in the two stages is (1-II), and II is a second preset threshold.
In step S312, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
For details of this step, refer to step S212; they are not repeated here.
Step S313, train the depth models corresponding to the K data sets to obtain K labeling models.
For details of this step, refer to step S213; they are not repeated here.
In step S314, the number of iterations i =0 is set.
Specifically, the initial value of the second-stage iteration number i is set to 0.
In step S315, it is determined whether i < II is satisfied.
If so, the process proceeds to step S318, otherwise, the process proceeds to step S316.
Step S316, manually calibrating S0 to obtain Sr and Se.
Specifically, in the second stage, S0 denotes the set of samples for which at least m of the K labels agree and the m agreeing labels are all of high confidence. Sr is the third sample set and Se is the fourth sample set.
Wherein m is an integer greater than 1 and less than K.
Step S317, N0 += Sr, E = Se.
In step S318, a sample S is selected from E.
Step S319, determine whether E is non-empty.
If E is non-empty, the process proceeds to step S321; if it is empty, the process proceeds to step S320.
Step S320, E = E1.
For details of this step, refer to step S219; they are not repeated here.
After step S320, the process proceeds to step S325.
In step S321, the sample S is tested with the K labeling models to obtain K labels.
Specifically, the K labeling models are retrained repeatedly from the first stage onward; in the second stage, therefore, the K models are the K labeling models obtained from the most recent training.
In step S322, it is determined whether at least m of the K labels agree and are all of high confidence.
Specifically, the second stage labels the simple samples and the confusable samples; this step therefore determines whether at least m of the K labels obtained for the sample S agree and are all of high confidence. If so, the process proceeds to step S324; otherwise, the process proceeds to step S323.
Step S324, S0 += {S}, E -= {S}.
For details of this step, refer to step S224; they are not repeated here.
After step S323 and step S324, the process proceeds to step S318.
Step S325, N0 += S0.
For details of this step, refer to step S225; they are not repeated here.
After step S325, the process returns to step S312 to continue the second-stage data labeling procedure.
Finally, the remaining unlabeled samples (the difficult samples) and the samples in the fourth sample set are labeled manually, so that the label of each piece of data in the data set to be labeled is obtained.
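The acceptance test of step S322 relaxes the first stage's full agreement to at least m agreeing labels. A minimal sketch over (label, confidence) pairs, with the threshold handling assumed for illustration:

    from collections import Counter
    from typing import List, Tuple

    def second_stage_accepts(results: List[Tuple[str, float]],
                             m: int, threshold: float) -> bool:
        # Accept a sample into S0 when at least m of the K labels agree
        # and every agreeing label's confidence exceeds the threshold.
        labels = [label for label, _ in results]
        top_label, count = Counter(labels).most_common(1)[0]
        if count < m:
            return False
        return all(conf >= threshold
                   for label, conf in results if label == top_label)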
Further, the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
In this embodiment, labeling proceeds in three stages: the first-stage, second-stage, and third-stage data labeling. Only simple samples are processed in the first stage, and over multiple iterations the obtained simple samples are added to the training set, expanding its capacity and richness; the second stage processes simple and confusable samples simultaneously and adds both to the training set, further expanding its capacity and richness; the third stage begins to process the difficult samples and adds simple, confusable, and difficult samples to the training set.
The first-stage data labeling scheme is described with reference to fig. 2 and the second-stage scheme with reference to fig. 3; this part explains only the third-stage scheme.
Specifically, as shown in fig. 4, the data annotation method at the third stage includes the following steps:
it should be noted that the steps in fig. 4 are the same as those in fig. 2 and fig. 3, and the content of this section only explains different parts, and the same parts in fig. 2 or fig. 3 are explained with reference to fig. 2 or fig. 3.
As shown in fig. 4, the data annotation method at the third stage includes the following steps:
and step S411, labeling a data set N0, performing two-stage iteration times III, and not labeling a data set E.
Specifically, the range of the three-stage iteration times is (1-III), and III is a third preset threshold.
In step S412, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
For details of this step, refer to step S212; they are not repeated here.
Step S413, train the depth models corresponding to the K data sets to obtain K labeling models.
For details of this step, refer to step S213; they are not repeated here.
In step S414, the number of iterations i =0 is set.
Specifically, the initial value of the third-stage iteration number i is set to 0.
In step S415, it is determined whether i < III is satisfied.
If so, proceed to step S418, otherwise proceed to step S416.
Step S416, manually calibrating S0 to obtain Sr and Se.
Specifically, in the third stage, S0 denotes the set of samples each labeled with the highest-confidence label among its K labels. Sr is the third sample set and Se is the fourth sample set.
Step S417, N0 += Sr, E = Se.
For details of this step, refer to step S220; they are not repeated here.
Step S418, a sample S is selected from E.
In step S419, it is determined whether E is non-empty.
If E is non-empty, the process proceeds to step S420; if it is empty, the process proceeds to step S423.
Step S420, the sample S is tested with the K labeling models to obtain K labels.
For details of this step, refer to step S221; they are not repeated here.
In step S421, the label with the highest confidence is selected as the label of S.
Specifically, in the third stage, among the K labels of the sample S, the label with the highest confidence is selected as the label of the sample S.
Step S422, S0 += {S}, E -= {S}.
For details of this step, refer to step S224; they are not repeated here.
After step S422, the process proceeds to step S418.
Step S423, N0 += S0.
For details of this step, refer to step S225; they are not repeated here.
After step S423, the process returns to step S412 to continue the third-stage data labeling procedure.
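In the third stage every remaining sample is assigned a label outright (step S421). A sketch of that selection rule, again over hypothetical (label, confidence) pairs:

    from typing import List, Tuple

    def third_stage_label(results: List[Tuple[str, float]]) -> str:
        # Among the K (label, confidence) pairs produced for a sample,
        # keep the label whose confidence is highest.
        best_label, _ = max(results, key=lambda pair: pair[1])
        return best_label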
Further, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further includes:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
Specifically, since the correctness of the labels in the fourth sample set is determined from confidence alone, the labels of all samples in the fourth sample set may be verified manually to improve labeling precision; any wrong label is changed to the correct one, so that the label of each piece of data in the data set to be labeled is obtained once all samples in the fourth sample set are correctly labeled.
Further, the first preset threshold, the second preset threshold and the third preset threshold are the same.
In this embodiment, the first preset threshold, the second preset threshold and the third preset threshold may be set to the same value. In some other embodiments, the first preset threshold, the second preset threshold and the third preset threshold may also be set to different values, depending on the user's requirements.
Example 2
Fig. 5 is a schematic structural diagram illustrating a data annotation device according to a second embodiment of the present invention. The data annotation device 500 corresponds to the data annotation method of embodiment 1; the description of that method applies equally to the device 500 and is not repeated here.
The data annotation device 500 includes an input module 510, a sample determination module 520, and an annotation module 530.
An input module 510, configured to input each piece of data in a data set to be labeled into K labeling models and obtain K labels for each piece of data, where the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1.
A sample determining module 520, configured to divide the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels.
And a labeling module 530, configured to label, in preset stages, the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled.
Another embodiment of the present invention further provides an electronic device comprising a memory and a processor, the memory storing a computer program and the processor running the computer program so that the electronic device performs the above data labeling method or the functions of the modules of the above data annotation device.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
This embodiment also provides a computer storage medium for storing the computer program used by the above electronic device.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (9)

1. A method for annotating data, the method comprising:
inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
dividing the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled;
wherein the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
and the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
2. The method of claim 1, wherein the total training set comprises a first predetermined number of labeled samples.
3. The data labeling method of claim 1, wherein the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
4. The data labeling method of claim 1, wherein the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
5. The data annotation method of claim 1, wherein the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
6. The data annotation method of claim 5, wherein, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further comprises:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
7. The data annotation method of claim 5, wherein the first predetermined threshold and the second predetermined threshold are the same.
8. The data annotation method of claim 1, wherein the K annotation models are the same model.
9. A data annotation device, the device comprising:
an input module, configured to input each piece of data in a data set to be labeled into K labeling models and obtain K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
a sample determining module, configured to divide the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
and a labeling module, configured to label, in preset stages, the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled;
wherein the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
and the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
CN202110133276.7A 2021-02-01 2021-02-01 Data labeling method and device Active CN112445831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110133276.7A CN112445831B (en) 2021-02-01 2021-02-01 Data labeling method and device


Publications (2)

Publication Number Publication Date
CN112445831A CN112445831A (en) 2021-03-05
CN112445831B true CN112445831B (en) 2021-05-07

Family

ID=74740594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110133276.7A Active CN112445831B (en) 2021-02-01 2021-02-01 Data labeling method and device

Country Status (1)

Country Link
CN (1) CN112445831B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139072A (en) * 2021-04-20 2021-07-20 苏州挚途科技有限公司 Data labeling method and device and electronic equipment
CN115146622B (en) * 2022-07-21 2023-05-05 平安科技(深圳)有限公司 Data annotation error correction method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698905B2 (en) * 2017-09-14 2020-06-30 SparkCognition, Inc. Natural language querying of data in a structured context
CN109242013B (en) * 2018-08-28 2021-06-08 北京九狐时代智能科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN111506776B (en) * 2019-11-08 2021-03-30 马上消费金融股份有限公司 Data labeling method and related device
CN111104479A (en) * 2019-11-13 2020-05-05 中国建设银行股份有限公司 Data labeling method and device
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN112445831A (en) 2021-03-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 1103, building C, Xingzhi science and Technology Park, Nanjing Economic and Technological Development Zone, Nanjing, Jiangsu Province 210038

Patentee after: Nanjing Qiyuan Technology Co.,Ltd.

Address before: Room 1103, building C, Xingzhi science and Technology Park, Nanjing Economic and Technological Development Zone, Nanjing, Jiangsu Province 210038

Patentee before: Nanjing iqiyi Intelligent Technology Co.,Ltd.