CN112445831B - Data labeling method and device

Data labeling method and device

Info

Publication number
CN112445831B
Authority
CN
China
Prior art keywords
data
sample
label
samples
labeling
Prior art date
Legal status
Active
Application number
CN202110133276.7A
Other languages
Chinese (zh)
Other versions
CN112445831A (en
Inventor
程会云
史明
王西颖
Current Assignee
Nanjing Qiyuan Technology Co.,Ltd.
Original Assignee
Nanjing Iqiyi Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Iqiyi Intelligent Technology Co Ltd filed Critical Nanjing Iqiyi Intelligent Technology Co Ltd
Priority to CN202110133276.7A priority Critical patent/CN112445831B/en
Publication of CN112445831A publication Critical patent/CN112445831A/en
Application granted granted Critical
Publication of CN112445831B publication Critical patent/CN112445831B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24573 Query processing with adaptation to user needs using data annotations, e.g. user-defined metadata

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a data labeling method and device, the method comprising: inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, where the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1; dividing the data corresponding to the labels into samples of different confusion degrees based on the confidence of the labels, the confidence being the degree of agreement among the K labels obtained for each piece of data; and, in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled. In this scheme, the K trained labeling models check and compare against one another, so samples of different confusion degrees are labeled automatically, greatly saving labor and time costs.

Description

Data labeling method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a data annotation method and device.
Background
With the rapid development of science and technology, artificial intelligence has become a focus of public attention. Supported by advances such as big data, artificial intelligence has produced fruitful results in fields such as data analysis, image recognition, smart homes, and automatic driving. Driven by massive data and built around deep learning algorithms, artificial intelligence technology gives machines a preliminary version of basic human visual and auditory abilities and may make them competent for relatively complex mental labor. Because deep learning algorithms require large amounts of data, labeling massive data has become an urgent market need.
One existing data labeling approach is manual labeling, but manual labeling is time-consuming and easily affected by the subjective factors of the annotator, so its precision is not high.
There are also methods that label data with a trained model. However, a trained model relies on a large number of samples for training, and its labeling accuracy is determined entirely by the number and quality of those samples. Therefore, finding an automatic labeling method that is less time-consuming and more accurate is an urgent problem to be solved.
Disclosure of Invention
In view of the foregoing problems, an object of the embodiments of the present invention is to provide a data labeling method and device that remedy the deficiencies of the prior art.
According to an embodiment of the present invention, there is provided a data annotation method, including:
inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
dividing the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
and, in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled.
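To make the ensemble-construction step concrete, the following is a minimal Python sketch, assuming list-based training data and a generic train_model function; all names here are illustrative and not part of the patent.

    import random
    from typing import Callable, List, Sequence

    def train_k_labeling_models(total_training_set: Sequence, k: int,
                                sample_size: int,
                                train_model: Callable) -> List:
        # Draw K sub-training sets by random sampling with replacement
        # (a bootstrap), then train one labeling model on each of them.
        sub_training_sets = [random.choices(total_training_set, k=sample_size)
                             for _ in range(k)]
        return [train_model(subset) for subset in sub_training_sets]

Sampling with replacement keeps the K sub-training sets overlapping, so the K models stay comparable, while the randomness still varies each model's view of the data.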
In the above data labeling method, the total training set includes a first predetermined number of labeled samples.
In the above data labeling method, the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
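Read literally, this division compares how many of the K labels agree. A hedged sketch follows; the agreement threshold m and the tie handling are assumptions for illustration, not fixed by the patent.

    from collections import Counter
    from typing import List

    def confusion_category(labels: List[str], k: int, m: int) -> str:
        # Count how often the most common of the K labels occurs:
        # all K agree -> simple; at least m agree -> confusable;
        # otherwise -> difficult.
        top_count = Counter(labels).most_common(1)[0][1]
        if top_count == k:
            return "simple"
        if top_count >= m:
            return "confusable"
        return "difficult"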
In the above data labeling method, the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In the above data labeling method, the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In the above data labeling method, the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
In the above data labeling method, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further comprises:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
In the above data labeling method, the first preset threshold, the second preset threshold, and the third preset threshold are the same.
In the above data labeling method, the K labeling models are the same model.
According to another embodiment of the present invention, there is provided a data annotation apparatus including:
the system comprises an input module, a label module and a label module, wherein the input module is used for respectively inputting each data in a data set to be labeled into K labeling models and obtaining K labels for each data, the K labeling models are respectively obtained by training K sub-training sets, the K sub-training sets are obtained by performing K-time replaced random sampling on samples in a total training set, and K is an integer greater than 1;
the sample determining module is used for dividing the data corresponding to the labels into samples with different confusion degrees based on the consistency degree of the labels;
and the marking module is used for marking the samples with different confusion degrees in sequence in a preset stage to obtain the label of each data in the data set to be marked.
According to still another embodiment of the present invention, an electronic device is provided, comprising a memory for storing a computer program and a processor for running the computer program so that the electronic device executes the above data labeling method.
According to still another embodiment of the present invention, a computer-readable storage medium is provided that stores the computer program used by the above electronic device.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the data labeling method and device, K sub-training sets are obtained by randomly sampling all samples in a total training set for K times and putting back, K labeling models are trained through K sub-training sets, and data are labeled through verification and comparison of the K labeling models; and labeling each data in the data set to be labeled respectively through K labeling models, labeling samples with different confusion degrees according to the confidence degrees of the labels, wherein the confidence degrees are the consistent degrees of the K labels obtained aiming at each data, labeling the labels respectively based on the labels with different confidence degrees according to the mode, automatically labeling each data in the data set to be labeled, improving the labeling precision based on the mode of the confidence degrees, and greatly saving labor and time cost.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention, and therefore should not be considered as limiting the scope of the present invention, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart illustrating a data annotation method according to a first embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a first-stage data annotation method according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a second-stage data annotation method according to a first embodiment of the present invention;
fig. 4 is a flowchart illustrating a third-stage data annotation method according to a first embodiment of the present invention;
fig. 5 is a schematic structural diagram illustrating a data annotation device according to a second embodiment of the present invention.
Description of the main element symbols:
500 - data annotation device; 510 - input module; 520 - sample determination module; 530 - labeling module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
Fig. 1 is a schematic flow chart illustrating a data annotation method according to a first embodiment of the present invention.
The data annotation method comprises the following steps:
in step S110, each data in the data set to be labeled is input into K labeling models, and K labels are obtained for each data.
Specifically, the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1.
Because K labeling models are trained in this embodiment, sampling without replacement would leave the labeling models too dissimilar: each model would effectively run independently, the cross-checking among the K models would give poor results, and each model would generalize poorly, lowering the precision of the final labels. Therefore, in this embodiment the samples in the total training set are sampled randomly with replacement: replacement keeps the sub-training sets related, which makes the models comparable, while the random sampling introduces a certain amount of noise and shifts the sample distribution, which increases each model's generalization ability. Training the labeling models on K sub-training sets obtained by K rounds of random sampling with replacement therefore improves both the accuracy of the data labeling and the generalization ability of each labeling model.
Further, the total training set includes a first predetermined number of labeled samples.
In this embodiment, to reduce the workload of labeling samples, the first preset number may be set smaller than the total number of samples in the total training set. That is, in a semi-supervised manner, the labeling models are trained with only a small number of labeled samples, reducing labor and time costs.
Of course, in some other embodiments, the first preset number may also equal the total number of samples, in which case each labeling model is trained on a fully labeled total training set; this may be decided according to user requirements.
Specifically, the K labeling models are the same model; for example, they may be models with the same network structure, such as K identical classifiers.
In this embodiment, the K identical labeling models are cascaded, and the results of the K models are verified and compared against one another, giving a general scheme with strong labeling capability and improved labeling precision.
In step S120, the label correspondence data is divided into samples of different degrees of confusion based on the degree of agreement of the labels.
Specifically, each piece of data in the data set to be labeled is input into the K labeling models, yielding K labels for that piece of data; the degree of agreement among the K labels is then computed so that the data corresponding to the labels can be divided into samples of different confusion degrees.
In this embodiment, if K is 5, a sample M input into the 5 trained labeling models yields 5 labels: L1, L2, L1, L2, and L1. Three of the models output L1 and two output L2, so 3/5 can be taken as the degree of agreement of the label. Of course, in some other embodiments, the degree of agreement may be represented in other ways, as needed.
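For instance, the 3/5 figure above can be computed as the frequency of the most common label; a small Python illustration, not part of the patent:

    from collections import Counter

    labels = ["L1", "L2", "L1", "L2", "L1"]        # outputs of the K = 5 models
    top_count = Counter(labels).most_common(1)[0][1]
    degree_of_agreement = top_count / len(labels)  # 3 / 5 = 0.6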
Further, the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples, a simple sample being one whose labeling result is reliable; data for which M of the K labels agree to be confusable samples, a confusable sample being one whose labeling result is ambiguous, that is, it may turn out to be reliable or unreliable; and data for which N of the K labels disagree to be difficult samples, that is, samples whose labeling results are unreliable. M and N are both positive integers smaller than K.
In this embodiment, the value of M may be equal to the value of N; in some other embodiments, the value of M may not be equal to the value of N, depending on the requirement.
In step S130, in a preset stage, the samples with different confusion degrees are labeled in sequence to obtain a label of each data in the data set to be labeled.
Specifically, weighing time cost against accuracy, different stages can be set; in each stage, samples of a given confusion degree are sent to the K trained labeling models for labeling, so that the label of each piece of data in the data set to be labeled is finally obtained and the data labeling is completed.
In this embodiment, the samples may be labeled in order of increasing confusion degree, that is, from high to low degree of agreement.
For example, in the first stage, labeling may start with the samples of low confusion (high agreement), that is, the simple samples, and the incorrectly labeled samples are put back into the total training set. In the second stage, the total training set obtained in the first stage is again sampled randomly with replacement to obtain K sub-training sets, K new labeling models are trained on them, the confusable samples are added to the data set to be labeled and are labeled by the new K models, and the samples labeled incorrectly in the second stage are put back into the total training set. In the third stage, the total training set obtained in the second stage is again sampled randomly with replacement to obtain K sub-training sets, K new labeling models are trained on them, the difficult samples are added to the data set to be labeled, and the label of each piece of data in the data set to be labeled is obtained through the K newly trained labeling models.
In addition, after the label of each piece of data in the data set to be labeled is finally obtained, manual verification can determine whether each label is correct; any incorrectly labeled data is relabeled manually, so that all data in the data set to be labeled end up correctly labeled.
Further, the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
Specifically, as shown in fig. 2, the data annotation scheme of the first stage may include the following steps:
step S211, setting an initial small number of labeled data sets N0, a stage iteration count, and an unlabeled data set E.
Specifically, the labeled data set N0 is a labeled sample, and the data set N0 includes a first predetermined number of labeled samples, where the first predetermined number may be N, N is an integer greater than K, and K is the number of labeled models to be trained.
The range of the iteration times of the first stage is (1-I), I is a first preset threshold value, and the first preset threshold value is set according to the user requirement or the mark iteration cost.
And the unmarked data set is the data set to be marked.
In step S212, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
Specifically, the labeled data set N0 contains N labeled samples, on which K rounds of random sampling with replacement are performed, each round drawing M samples; each round thus yields one sub-training set of M labeled samples, and K rounds of sampling with replacement yield K sub-training sets.
Step S213, train the depth models corresponding to the K data sets to obtain K labeling models.
Specifically, each labeling model is a depth model, which may be a classifier such as a neural network model or a decision tree model.
A depth model is trained on each sub-training set, so training on the K sub-training sets finally yields K labeling models, which are used to label the data in the data set to be labeled.
In step S214, the number of iterations i =0 is set.
Specifically, a variable i is set, the variable i is used for representing the iteration number of the first stage, the initial value of the iteration number is 0, and the value of i is increased by one every iteration.
In step S215, it is determined whether i < I holds.
If i < I, the iteration count has not reached the first preset threshold; each iteration repeats the data labeling procedure once, and since the count has not been reached, the process proceeds to step S217. If i >= I, the iteration count has reached the first preset threshold, the iteration terminates, and the process proceeds to step S216.
Step S216, manually calibrating S0 to obtain Sr and Se.
Specifically, S0 denotes the set of samples whose K labels agree and are of high confidence. Confidence indicates how correct a label is; it can be represented by the label's classification score, and the more accurate the classification, the higher the score.
In this embodiment, the confidence of a label may be determined by the probability, output by the labeling model, that the sample belongs to that label. Of course, in some other embodiments, the confidence may be expressed in other ways, which are not limited herein.
In this embodiment, high confidence means that the confidence of the label is above a standard confidence threshold; the standard confidence threshold may be determined statistically, according to the distribution of a preset verification set (the set of samples used to verify the labeling models), from the confidences of the labels of all samples in that verification set.
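The patent does not fix the statistic used for this threshold; one plausible reading, shown here purely as an assumption, is the mean of each verification sample's highest class probability:

    import numpy as np

    def standard_confidence_threshold(validation_probs: np.ndarray) -> float:
        # validation_probs has shape (n_samples, n_classes): the labeling
        # model's class probabilities on the preset verification set.
        # Use the mean per-sample maximum probability as the threshold.
        return float(validation_probs.max(axis=1).mean())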
Sr denotes the first sample set, i.e., the correctly labeled samples in S0; Se denotes the second sample set, i.e., the incorrectly labeled samples in S0.
If the iteration count of the first stage has reached the first preset threshold, each sample in the S0 obtained after the last iteration is checked manually to determine whether its label is correct; samples with correct labels are added to the first sample set Sr, and samples with wrong labels are added to the second sample set Se.
In step S217, a sample S is selected from E.
Specifically, if the iteration count has not reached the first preset threshold, the samples in the unlabeled data set E still need to be labeled, so a sample S is selected from E for subsequent labeling.
In step S218, it is determined whether E is non-empty.
Specifically, if the K labels obtained for a sample S from the K labeling models all agree and are of high confidence, S is removed from E and added to S0; since samples keep being removed, E may eventually become empty, so before labeling a sample S it is necessary to check whether E is empty. If E is empty, go to step S219; if E is not empty, proceed to step S221.
Step S219, E = E1.
Specifically, E1 denotes the data set of samples that have been processed but whose K labels either disagree or are not of high confidence. When the original E becomes empty, E = E1 makes E1 the new data set to be labeled.
After the execution of step S219 ends, the process proceeds to step S225.
Step S220, N0 += Sr, E = Se.
After S0 is manually calibrated to obtain Sr and Se, Sr is added to N0, and Se becomes the data set to be labeled.
Step S221, test S with the K labeling models to obtain K labels.
Specifically, the sample S is fed into the K trained labeling models, which output K labels.
Step S222, determine whether the K labels all agree and are of high confidence.
Specifically, if the K labels all agree and are of high confidence, the process proceeds to step S224; otherwise, the process proceeds to step S223.
In step S223, E1 += {S}.
Specifically, if the K labels do not all agree, or their confidence is low, S is added to E1.
Step S224, S0 += {S}, E -= {S}.
Specifically, if the K labels all agree and are of high confidence, the sample S is removed from E and added to S0.
If the iteration count has not reached the first preset threshold, the process returns to step S217 after step S223 or step S224 finishes.
Step S225, N0 += S0.
S0 is added to N0, and the process returns to step S212 to execute the first-stage data labeling procedure again.
Finally, the remaining unlabeled samples (the confusable samples and the difficult samples) and the samples in the second sample set are labeled manually, so that the label of each piece of data in the data set to be labeled is obtained.
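Putting steps S212 through S225 together, the first-stage loop can be sketched in Python as follows. This is a hedged illustration rather than the patent's implementation: train_model, predict, and is_high_confidence are hypothetical helpers, N0 is taken to be a list of (sample, label) pairs, and the manual calibration of the final S0 into Sr and Se (step S216) is omitted because it is a human step.

    import random
    from typing import Callable, List, Tuple

    def first_stage(n0: list, e: list, k: int, sample_size: int,
                    max_iters: int, train_model: Callable,
                    predict: Callable, is_high_confidence: Callable):
        for _ in range(max_iters):                  # loop while i < I
            # S212-S213: K bootstrap sub-training sets, one model each
            models = [train_model(random.choices(n0, k=sample_size))
                      for _ in range(k)]
            s0: List[Tuple] = []                    # agreeing, high-confidence samples
            e1: list = []                           # everything else
            for sample in e:                        # S217-S224
                results = [predict(model, sample) for model in models]
                labels = [label for label, _ in results]
                confidences = [conf for _, conf in results]
                if len(set(labels)) == 1 and is_high_confidence(confidences):
                    s0.append((sample, labels[0]))  # S0 += {S}, E -= {S}
                else:
                    e1.append(sample)               # E1 += {S}
            n0 = n0 + s0                            # S225: N0 += S0
            e = e1                                  # S219: E = E1
        return n0, e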
Further, the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
In this embodiment, labeling proceeds in two stages: the first-stage data labeling and the second-stage data labeling. Only simple samples are processed in the first stage, and over multiple iterations the obtained simple samples are added to the training set, expanding its capacity and richness; the second stage processes simple and confusable samples simultaneously and adds both to the training set, expanding the training set's capacity and richness further.
The first-stage data labeling scheme is described with reference to fig. 2; this part explains only the second-stage scheme.
Specifically, as shown in fig. 3, the data annotation method at the second stage includes the following steps:
it should be noted that the steps in fig. 3 are the same as the steps in fig. 2, and the content of this section only explains the different parts, and the same parts in fig. 2 are explained with reference to fig. 2.
And step S311, labeling the data set N0, performing two-stage iteration times II, and not labeling the data set E.
Specifically, the range of the number of iterations in the two stages is (1-II), and II is a second preset threshold.
In step S312, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
For details of this step, refer to step S212; they are not repeated here.
Step S313, train the depth models corresponding to the K data sets to obtain K labeling models.
For details of this step, refer to step S213; they are not repeated here.
In step S314, the number of iterations i =0 is set.
Specifically, the initial value of the second-stage iteration number i is set to 0.
In step S315, it is determined whether i < II is satisfied.
If so, the process proceeds to step S318, otherwise, the process proceeds to step S316.
Step S316, manually calibrating S0 to obtain Sr and Se.
Specifically, in the second stage, S0 denotes the set of samples for which at least m of the K labels agree and the m agreeing labels are all of high confidence. Sr is the third sample set and Se is the fourth sample set.
Wherein m is an integer greater than 1 and less than K.
Step S317, N0 += Sr, E = Se.
In step S318, a sample S is selected from E.
Step S319, determine whether E is non-empty.
If E is non-empty, the process proceeds to step S321; if it is empty, the process proceeds to step S320.
Step S320, E = E1.
For details of this step, refer to step S219; they are not repeated here.
After step S320, the process proceeds to step S325.
In step S321, the sample S is tested with the K labeling models to obtain K labels.
Specifically, the K labeling models are retrained repeatedly from the first stage onward; in the second stage, therefore, the K models are the K labeling models obtained from the most recent training.
In step S322, it is determined whether at least m of the K labels agree and are all of high confidence.
Specifically, the second stage labels the simple samples and the confusable samples; this step therefore determines whether at least m of the K labels obtained for the sample S agree and are all of high confidence. If so, the process proceeds to step S324; otherwise, the process proceeds to step S323.
Step S324, S0 += {S}, E -= {S}.
For details of this step, refer to step S224; they are not repeated here.
After step S323 and step S324, the process proceeds to step S318.
Step S325, N0 += S0.
For details of this step, refer to step S225; they are not repeated here.
After step S325, the process returns to step S312 to continue the second-stage data labeling procedure.
Finally, the remaining unlabeled samples (the difficult samples) and the samples in the fourth sample set are labeled manually, so that the label of each piece of data in the data set to be labeled is obtained.
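The acceptance test of step S322 relaxes the first stage's full agreement to at least m agreeing labels. A minimal sketch over (label, confidence) pairs, with the threshold handling assumed for illustration:

    from collections import Counter
    from typing import List, Tuple

    def second_stage_accepts(results: List[Tuple[str, float]],
                             m: int, threshold: float) -> bool:
        # Accept a sample into S0 when at least m of the K labels agree
        # and every agreeing label's confidence exceeds the threshold.
        labels = [label for label, _ in results]
        top_label, count = Counter(labels).most_common(1)[0]
        if count < m:
            return False
        return all(conf >= threshold
                   for label, conf in results if label == top_label)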
Further, the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled includes:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing all the steps up to and including this one until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
In this embodiment, labeling proceeds in three stages: the first-stage, second-stage, and third-stage data labeling. Only simple samples are processed in the first stage, and over multiple iterations the obtained simple samples are added to the training set, expanding its capacity and richness; the second stage processes simple and confusable samples simultaneously and adds both to the training set, further expanding its capacity and richness; the third stage begins to process the difficult samples and adds simple, confusable, and difficult samples to the training set.
The first-stage data labeling scheme is described with reference to fig. 2 and the second-stage scheme with reference to fig. 3; this part explains only the third-stage scheme.
Specifically, as shown in fig. 4, the data annotation method at the third stage includes the following steps:
it should be noted that the steps in fig. 4 are the same as those in fig. 2 and fig. 3, and the content of this section only explains different parts, and the same parts in fig. 2 or fig. 3 are explained with reference to fig. 2 or fig. 3.
As shown in fig. 4, the data annotation method at the third stage includes the following steps:
and step S411, labeling a data set N0, performing two-stage iteration times III, and not labeling a data set E.
Specifically, the range of the three-stage iteration times is (1-III), and III is a third preset threshold.
In step S412, K rounds of random sampling with replacement, each of sample size M, are performed on the N data.
For details of this step, refer to step S212; they are not repeated here.
Step S413, train the depth models corresponding to the K data sets to obtain K labeling models.
For details of this step, refer to step S213; they are not repeated here.
In step S414, the number of iterations i =0 is set.
Specifically, the initial value of the third-stage iteration number i is set to 0.
In step S415, it is determined whether i < III is satisfied.
If so, proceed to step S418, otherwise proceed to step S416.
Step S416, manually calibrating S0 to obtain Sr and Se.
Specifically, in the third stage, S0 denotes the set of samples each labeled with the highest-confidence label among its K labels. Sr is the third sample set and Se is the fourth sample set.
Step S417, N0 += Sr, E = Se.
For details of this step, refer to step S220; they are not repeated here.
Step S418, a sample S is selected from E.
In step S419, it is determined whether E is non-empty.
If E is non-empty, the process proceeds to step S420; if it is empty, the process proceeds to step S423.
Step S420, the sample S is tested with the K labeling models to obtain K labels.
For details of this step, refer to step S221; they are not repeated here.
In step S421, the label with the highest confidence is selected as the label of S.
Specifically, in the third stage, among the K labels of the sample S, the label with the highest confidence is selected as the label of the sample S.
Step S422, S0 += {S}, E -= {S}.
For details of this step, refer to step S224; they are not repeated here.
After step S422, the process proceeds to step S418.
Step S423, N0 += S0.
For details of this step, refer to step S225; they are not repeated here.
After step S423, the process returns to step S412 to continue the third-stage data labeling procedure.
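In the third stage every remaining sample is assigned a label outright (step S421). A sketch of that selection rule, again over hypothetical (label, confidence) pairs:

    from typing import List, Tuple

    def third_stage_label(results: List[Tuple[str, float]]) -> str:
        # Among the K (label, confidence) pairs produced for a sample,
        # keep the label whose confidence is highest.
        best_label, _ = max(results, key=lambda pair: pair[1])
        return best_label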
Further, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further includes:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
Specifically, since the correctness of the labels in the fourth sample set is determined from confidence alone, the labels of all samples in the fourth sample set may be verified manually to improve labeling precision; any wrong label is changed to the correct one, so that the label of each piece of data in the data set to be labeled is obtained once all samples in the fourth sample set are correctly labeled.
Further, the first preset threshold, the second preset threshold and the third preset threshold are the same.
In this embodiment, the first preset threshold, the second preset threshold and the third preset threshold may be set to the same value. In some other embodiments, the first preset threshold, the second preset threshold and the third preset threshold may also be set to different values, depending on the user's requirements.
Example 2
Fig. 5 is a schematic structural diagram illustrating a data annotation device according to a second embodiment of the present invention. The data annotation device 500 corresponds to the data annotation method of embodiment 1; the description of that method applies equally to the device 500 and is not repeated here.
The data annotation device 500 includes an input module 510, a sample determination module 520, and an annotation module 530.
An input module 510, configured to input each piece of data in a data set to be labeled into K labeling models and obtain K labels for each piece of data, where the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1.
A sample determining module 520, configured to divide the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels.
And a labeling module 530, configured to label, in preset stages, the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled.
Another embodiment of the present invention further provides an electronic device comprising a memory and a processor, the memory storing a computer program and the processor running the computer program so that the electronic device performs the above data labeling method or the functions of the modules of the above data annotation device.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to use of the computer device, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
This embodiment also provides a computer storage medium for storing the computer program used by the above electronic device.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, each functional module or unit in each embodiment of the present invention may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention or a part of the technical solution that contributes to the prior art in essence can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (9)

1. A method for annotating data, the method comprising:
inputting each piece of data in a data set to be labeled into K labeling models and obtaining K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
dividing the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
in preset stages, labeling the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled;
wherein the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
and the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
2. The method of claim 1, wherein the total training set comprises a first predetermined number of labeled samples.
3. The data labeling method of claim 1, wherein the preset stages include a first stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
and, in response to user operation, labeling the second sample set obtained after the last execution, the confusable samples, and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
4. The data labeling method of claim 1, wherein the preset stages include a first stage and a second stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
and, in response to user operation, labeling the fourth sample set obtained after the last execution and the difficult samples, to obtain the label of each piece of data in the data set to be labeled.
5. The data annotation method of claim 1, wherein the preset stages include a first stage, a second stage, and a third stage, and the labeling of the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled comprises:
adding the correctly labeled simple samples to a first sample set, and adding the incorrectly labeled ones to a second sample set;
adding the first sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a first preset threshold;
sending each sample in the second sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
for those labels, adding the correctly labeled simple and confusable samples to a third sample set, and adding the incorrectly labeled ones to a fourth sample set;
adding the third sample set to the total training set, and repeatedly executing this step and all steps preceding it until the number of executions reaches a second preset threshold;
sending each sample in the fourth sample set obtained after the last execution into the K labeling models obtained after the last execution, to obtain K labels for each sample;
and, for each sample in the fourth sample set, taking the label with the highest confidence as the label of the sample, thereby obtaining the label of each piece of data in the data set to be labeled.
6. The data annotation method of claim 5, wherein, for each sample in the fourth sample set, after the label with the highest confidence is taken as the label of the sample, the method further comprises:
verifying the label of each sample in the fourth sample set in response to user operation, so that the label of each piece of data in the data set to be labeled is obtained once the labels of all samples in the fourth sample set are correct.
7. The data annotation method of claim 5, wherein the first predetermined threshold and the second predetermined threshold are the same.
8. The data annotation method of claim 1, wherein the K annotation models are the same model.
9. A data annotation device, the device comprising:
an input module, configured to input each piece of data in a data set to be labeled into K labeling models and obtain K labels for each piece of data, wherein the K labeling models are each trained on one of K sub-training sets, the K sub-training sets are obtained by K rounds of random sampling with replacement from the samples in a total training set, and K is an integer greater than 1;
a sample determining module, configured to divide the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels;
and a labeling module, configured to label, in preset stages, the samples of different confusion degrees in sequence to obtain the label of each piece of data in the data set to be labeled;
wherein the samples of different confusion degrees include simple samples, confusable samples, and difficult samples;
and the dividing of the data corresponding to the labels into samples of different confusion degrees based on the degree of agreement among the labels comprises:
determining data for which all K labels agree to be simple samples; data for which M of the K labels agree to be confusable samples; and data for which N of the K labels disagree to be difficult samples, where M and N are both positive integers smaller than K.
CN202110133276.7A 2021-02-01 2021-02-01 Data labeling method and device Active CN112445831B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110133276.7A CN112445831B (en) 2021-02-01 2021-02-01 Data labeling method and device


Publications (2)

Publication Number Publication Date
CN112445831A CN112445831A (en) 2021-03-05
CN112445831B true CN112445831B (en) 2021-05-07

Family

ID=74740594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110133276.7A Active CN112445831B (en) 2021-02-01 2021-02-01 Data labeling method and device

Country Status (1)

Country Link
CN (1) CN112445831B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113139072A (en) * 2021-04-20 2021-07-20 苏州挚途科技有限公司 Data labeling method and device and electronic equipment
CN115146622B (en) * 2022-07-21 2023-05-05 平安科技(深圳)有限公司 Data annotation error correction method and device, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10698905B2 (en) * 2017-09-14 2020-06-30 SparkCognition, Inc. Natural language querying of data in a structured context
CN109242013B (en) * 2018-08-28 2021-06-08 北京九狐时代智能科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN111506776B (en) * 2019-11-08 2021-03-30 马上消费金融股份有限公司 Data labeling method and related device
CN111104479A (en) * 2019-11-13 2020-05-05 中国建设银行股份有限公司 Data labeling method and device
CN112036166A (en) * 2020-07-22 2020-12-04 大箴(杭州)科技有限公司 Data labeling method and device, storage medium and computer equipment

Also Published As

Publication number Publication date
CN112445831A (en) 2021-03-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 1103, building C, Xingzhi science and Technology Park, Nanjing Economic and Technological Development Zone, Nanjing, Jiangsu Province 210038

Patentee after: Nanjing Qiyuan Technology Co.,Ltd.

Address before: Room 1103, building C, Xingzhi science and Technology Park, Nanjing Economic and Technological Development Zone, Nanjing, Jiangsu Province 210038

Patentee before: Nanjing iqiyi Intelligent Technology Co.,Ltd.