CN110288007B - Data labeling method and device and electronic equipment - Google Patents


Info

Publication number
CN110288007B
Authority
CN
China
Prior art keywords
data
labeling
sample
target
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910487643.6A
Other languages
Chinese (zh)
Other versions
CN110288007A (en)
Inventor
刘宇达 (Liu Yuda)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201910487643.6A
Publication of CN110288007A
Priority to PCT/CN2019/123406
Application granted
Publication of CN110288007B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The application provides a data annotation method, apparatus, and electronic device. A specific implementation of the method comprises the following steps: labeling a plurality of items of data to be labeled with a pre-trained target labeling model to obtain a target set formed by the labeling results corresponding to the data to be labeled; selecting untrusted data from the data to be labeled using a pre-trained target classifier, so that the labeling results corresponding to the untrusted data can be verified; and correcting, in the target set, the labeling results corresponding to untrusted data that fails verification. With this implementation, the labeling work no longer depends entirely on manual effort, which saves substantial human resources and improves labeling efficiency. At the same time, the labeling results can be verified in a more targeted manner, improving labeling accuracy.

Description

Data labeling method and device and electronic equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular to a data annotation method and apparatus and an electronic device.
Background
With the continuous development of artificial intelligence technology, it has been widely applied in various fields. Artificial intelligence typically involves machine learning, which requires a large amount of labeled training sample data. Currently, such data is generally labeled manually, so the labeling work depends excessively on people, the workload is huge, substantial human resources are consumed, and labeling efficiency is low.
Disclosure of Invention
In order to solve one of the above technical problems, the present application provides a method, an apparatus and an electronic device for data annotation.
According to a first aspect of embodiments of the present application, there is provided a method for data annotation, including:
labeling a plurality of items of data to be labeled with a pre-trained target labeling model to obtain a target set formed by the labeling results corresponding to the data to be labeled;
selecting untrusted data from the data to be labeled using a pre-trained target classifier, so as to verify the labeling results corresponding to the untrusted data;
and correcting, in the target set, the labeling results corresponding to untrusted data that fails verification.
Optionally, the method further includes:
determining untrusted data that fails verification as first positive samples, and determining untrusted data that passes verification as first negative samples;
updating the target classifier with the first positive samples and the first negative samples.
Optionally, the target labeling model is trained as follows:
iteratively executing an update operation on the labeling model until a stop condition is met, and taking the iteratively updated labeling model as the target labeling model; wherein the update operation comprises:
selecting untrusted samples from the sample data using the current target classifier;
obtaining the manual labeling results corresponding to the untrusted samples as first sample results;
and updating the current labeling model using the first sample results.
Optionally, the update operation further comprises:
selecting trusted samples from the sample data using the current target classifier;
labeling the trusted samples with the current labeling model to obtain second sample results;
determining verification results obtained by verifying the second sample results;
based on the verification results, determining second sample results that fail verification as second negative samples and second sample results that pass verification as second positive samples;
and updating the current target classifier with the second positive samples and the second negative samples.
Optionally, the target labeling model includes a plurality of unit models with different structures, and the data to be labeled is image data to be labeled;
any item of image data to be labeled is labeled as follows:
labeling the image data with each unit model separately;
and, based on the results produced by the unit models on the image data, determining one or more labeling targets in the image data using a non-maximum suppression (NMS) algorithm and determining the label corresponding to each labeling target.
Optionally, after correcting the labeling results corresponding to untrusted data that fails verification in the target set, the method further comprises:
storing the corrected target set in a pre-established labeling database;
and updating the target labeling model with the labeling database.
According to a second aspect of the embodiments of the present application, there is provided an apparatus for data annotation, including:
a labeling module, configured to label a plurality of items of data to be labeled with a pre-trained target labeling model to obtain a target set formed by the labeling results corresponding to the data to be labeled;
a selecting module, configured to select untrusted data from the data to be labeled using a pre-trained target classifier, so as to verify the labeling results corresponding to the untrusted data;
and a correcting module, configured to correct, in the target set, the labeling results corresponding to untrusted data that fails verification.
Optionally, the apparatus further includes:
a determining module, configured to determine untrusted data that fails verification as first positive samples and untrusted data that passes verification as first negative samples;
and a first updating module, configured to update the target classifier with the first positive samples and the first negative samples.
According to a third aspect of embodiments herein, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of the above first aspects.
According to a fourth aspect of embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the first aspect when executing the program.
The technical solutions provided by the embodiments of the present application can have the following beneficial effects:
In the data labeling method and apparatus provided by the embodiments of the present application, a plurality of items of data to be labeled are labeled with a pre-trained target labeling model to obtain a target set formed by the corresponding labeling results; a pre-trained target classifier then selects untrusted data from the data to be labeled so that the corresponding labeling results can be verified, and the labeling results corresponding to untrusted data that fails verification are corrected in the target set. In these embodiments, after the pre-trained target labeling model labels the data to be labeled, the target classifier screens out the items most likely to be mislabeled for spot-checking, and erroneous labeling results are corrected. The labeling work therefore no longer depends entirely on manual effort, which saves substantial human resources and improves labeling efficiency. At the same time, labeling results can be verified in a more targeted manner, improving labeling accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart illustrating a data annotation method according to an exemplary embodiment of the present application;
FIG. 2 is a flowchart illustrating another data annotation method according to an exemplary embodiment of the present application;
FIG. 3 is a flowchart illustrating another data annotation method according to an exemplary embodiment of the present application;
FIG. 4 is a block diagram of a data annotation apparatus according to an exemplary embodiment of the present application;
FIG. 5 is a block diagram of another data annotation apparatus according to an exemplary embodiment of the present application;
FIG. 6 is a block diagram of another data annotation apparatus according to an exemplary embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatus and methods consistent with some aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
As shown in fig. 1, fig. 1 is a flowchart illustrating a data annotation method according to an exemplary embodiment; the method may be applied to a terminal device or a server. The method comprises the following steps:
in step 101, a plurality of data to be labeled are labeled through a pre-trained target labeling model, so as to obtain a target set formed by labeling results corresponding to the data to be labeled.
In this embodiment, a plurality of data to be labeled can be labeled through a pre-trained target labeling model to obtain a target set, where the target set is formed by labeling results corresponding to the data to be labeled. The data to be annotated can be image type data, sound type data, text type data, and the like, and it can be understood that the data to be annotated can also be any other types of data, and the specific type aspect of the data to be annotated is not limited in the present application.
In this embodiment, labeling the data to be labeled may mean marking a labeling target in the data and setting a corresponding label for it (e.g., a category label, an attribute label, an ID label, etc.). Taking image data as an example, labeling may mean marking a target object (i.e., a labeling target) in the image with a bounding box and setting the target object's labels (e.g., object category, object attributes, object ID, etc.).
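To make the notion of a labeling result concrete, the following minimal sketch shows one way an image labeling result with a bounding box and its labels could be represented; all names and fields here are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class BoxLabel:
    """One labeling target: a bounding box plus its labels (hypothetical schema)."""
    box: tuple            # (x_min, y_min, x_max, y_max) in pixels
    category: str         # category label, e.g. object type
    attributes: dict = field(default_factory=dict)  # attribute labels
    object_id: int = -1   # ID label; -1 means not yet assigned

@dataclass
class LabelingResult:
    """Labeling result for one item of data to be labeled."""
    data_id: str
    targets: list         # list of BoxLabel

# A target set is then simply the collection of labeling results.
result = LabelingResult(
    data_id="img_0001",
    targets=[BoxLabel(box=(10, 20, 110, 220), category="pedestrian")],
)
target_set = {result.data_id: result}
```

The target set built in step 101 would hold one such record per item of data to be labeled.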
In this embodiment, the target labeling model may be a single model or a plurality of models with different structures; optionally, if the data to be labeled is image data, the target labeling model may be a neural network model. When the target labeling model is initially trained, the collected training sample data can be labeled manually, and the target labeling model is then trained on the manually labeled training sample data. During subsequent use, the target labeling model can be continuously optimized and updated so that its labeling accuracy keeps improving.
In this embodiment, if the target labeling model includes a plurality of unit models with different structures and the data to be labeled is image data, any item of image data can be labeled as follows. First, each unit model labels the image data separately. Then, based on the results produced by the unit models, one or more labeling targets in the image data are determined using a non-maximum suppression (NMS) algorithm, and the label corresponding to each labeling target is determined, thereby completing the labeling of the image data. Here, a labeling target is an object in the image data that needs to be labeled. Because unit models with different structures are used for labeling, labeling accuracy can be improved.
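As a sketch of how the unit models' outputs could be merged, the hedged example below pools candidate (box, score, label) tuples from several unit models and applies standard greedy non-maximum suppression; the IoU threshold and the toy boxes are illustrative choices, not values from the patent.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms_merge(unit_model_outputs, iou_thresh=0.5):
    """Pool candidates from all unit models, then greedily keep the
    highest-scoring boxes while suppressing overlapping ones."""
    candidates = [c for output in unit_model_outputs for c in output]
    candidates.sort(key=lambda c: c[1], reverse=True)  # by score, descending
    kept = []
    for box, score, label in candidates:
        if all(iou(box, k[0]) < iou_thresh for k in kept):
            kept.append((box, score, label))
    return kept  # the labeling targets and their labels

# Two hypothetical unit models proposing overlapping boxes for one object
model_a = [((10, 10, 100, 100), 0.9, "car")]
model_b = [((12, 11, 102, 99), 0.8, "car"), ((200, 200, 260, 260), 0.7, "person")]
merged = nms_merge([model_a, model_b])
```

The two near-identical "car" proposals collapse into one labeling target, while the non-overlapping "person" proposal is kept as a second target.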
In step 102, a pre-trained target classifier is used to select untrusted data from the data to be labeled, so that the labeling results corresponding to the untrusted data can be verified.
In this embodiment, a pre-trained target classifier may be used to select untrusted data from the data to be labeled, so as to verify the corresponding labeling results. Untrusted data is data that is highly confusable, has indistinct features, and is hard to distinguish and hard to label, making labeling errors likely. Taking image data as an example, untrusted data may be image data in which the image is blurred, the image contains highly interfering objects, or the target object is indistinct. Taking sound data as an example, untrusted data may be sound data with loud environmental noise or a faint target sound. Taking text data as an example, untrusted data may be text data with fuzzy semantic features.
In this embodiment, the target classifier may be any pre-trained classifier; any classifier known in the art, or appearing in the future, that can serve this purpose may be used, and the present application does not limit the specific type of the target classifier. When the target classifier is initially trained, the training sample data can be divided into trusted and untrusted sample data by manual classification and screening, and the target classifier is then trained on the manually screened training sample data. During subsequent use, the target classifier can be continuously optimized and updated so that its classification accuracy keeps improving.
In this embodiment, the target classifier selects, from the data to be labeled, the untrusted data that is prone to labeling errors; the labeling results corresponding to the untrusted data are then verified to determine whether they are correct. Verification can be performed manually or in any other reasonable manner; it is understood that the present application does not limit the specific verification manner. For any item of untrusted data, if its labeling result is determined to be wrong, the item fails verification; if its labeling result is determined to be correct, the item passes verification.
In step 103, the labeling results corresponding to untrusted data that fails verification are corrected in the target set.
In this embodiment, for each selected item of untrusted data, if the item passes verification, no operation is performed. If the item fails verification, it needs to be correctly relabeled to obtain a correct labeling result. The correct labeling result then replaces the wrong one in the target set, thereby correcting the labeling results corresponding to untrusted data that fails verification.
In this embodiment, a labeling database may be pre-established for storing labeled data, and machine learning training is performed with the data stored in the labeling database. The corrected target set can be stored in the labeling database as labeled data.
In the data labeling method provided by this embodiment of the present application, a plurality of items of data to be labeled are labeled with a pre-trained target labeling model to obtain a target set formed by the corresponding labeling results; a pre-trained target classifier selects untrusted data from the data to be labeled so that the corresponding labeling results can be verified, and the labeling results corresponding to untrusted data that fails verification are corrected in the target set. In this embodiment, after the pre-trained target labeling model labels the data, the target classifier screens out the items most likely to be mislabeled for spot-checking, and erroneous labeling results are corrected. The labeling work therefore no longer depends entirely on manual effort, which saves substantial human resources and improves labeling efficiency. At the same time, labeling results can be verified in a more targeted manner, improving labeling accuracy.
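The flow of steps 101 to 103 can be sketched as the following pipeline; the labeling model, classifier, verification step, and relabeling step are stand-in stubs for illustration, not the patent's implementations.

```python
def run_labeling_pipeline(data_items, label_fn, is_untrusted_fn, verify_fn, relabel_fn):
    """Step 101: label everything; step 102: screen untrusted items and
    verify their results; step 103: correct results that fail verification."""
    # Step 101: build the target set of labeling results
    target_set = {item: label_fn(item) for item in data_items}

    # Step 102: the classifier screens items likely to be mislabeled
    untrusted = [item for item in data_items if is_untrusted_fn(item)]

    # Step 103: relabel only the untrusted items that fail verification
    for item in untrusted:
        if not verify_fn(item, target_set[item]):
            target_set[item] = relabel_fn(item)
    return target_set

# Toy stand-ins: the "model" uppercases text, short items count as
# "untrusted", verification spot-checks, relabeling corrects manually.
items = ["cat", "dog", "elephant"]
result = run_labeling_pipeline(
    items,
    label_fn=lambda x: x.upper() if x != "dog" else "D0G",  # one wrong label
    is_untrusted_fn=lambda x: len(x) <= 3,                  # screen short items
    verify_fn=lambda x, y: y == x.upper(),                  # spot-check
    relabel_fn=lambda x: x.upper(),                         # manual correction
)
```

Only the screened items are spot-checked, so verification effort concentrates on the data most likely to be mislabeled, which is the efficiency claim of the method.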
As shown in fig. 2, fig. 2 is a flowchart illustrating another data labeling method according to an exemplary embodiment; it describes the process of updating the target classifier and may be applied to a terminal device or a server. The method comprises the following steps:
in step 201, a plurality of data to be labeled are labeled through a pre-trained target labeling model, so as to obtain a target set formed by labeling results corresponding to the data to be labeled.
In step 202, the pre-trained target classifier is used to select the non-credible data in the data to be labeled, so as to verify the labeling result corresponding to the non-credible data.
In step 203, the labeling result corresponding to the non-trusted data that fails to pass the verification in the target set is corrected.
In step 204, the non-validated untrusted data is determined to be a first positive sample, and the validated untrusted data is determined to be a first negative sample.
In this embodiment, untrusted data is highly confusable, has indistinct features, and is hard to distinguish and hard to label, making labeling errors likely. Therefore, if an item of untrusted data screened out by the target classifier was indeed mislabeled, the target classifier classified that item accurately; if the item was in fact labeled correctly, the classifier's classification of that item was not accurate enough. Accordingly, untrusted data that fails verification (i.e., mislabeled untrusted data) is taken as first positive samples, and untrusted data that passes verification (i.e., correctly labeled untrusted data) is determined as first negative samples.
In step 205, the target classifier is updated with the first positive samples and the first negative samples.
In this embodiment, the target classifier may be retrained with the first positive and first negative samples so as to optimize and update it, enabling the target classifier to screen out more appropriate untrusted data.
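As a minimal illustration of step 205, the sketch below retrains a simple classifier on the first positive samples (mislabeled untrusted data) and first negative samples (correctly labeled untrusted data). The perceptron-style learner and the two-dimensional feature representation are assumptions made for illustration; the patent does not fix the classifier type.

```python
def train_classifier(positives, negatives, epochs=20, lr=0.1):
    """Retrain a tiny linear classifier: positives are feature vectors of
    items that should be flagged as untrusted, negatives should not be."""
    dim = len(positives[0])
    w, b = [0.0] * dim, 0.0
    samples = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: perceptron update
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def is_untrusted(w, b, x):
    """Flag an item as untrusted if it falls on the positive side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Hypothetical 2-d features, e.g. (blur score, model confidence deficit)
first_positives = [(0.9, 0.1), (0.8, 0.2)]  # failed verification
first_negatives = [(0.1, 0.9), (0.2, 0.8)]  # passed verification
w, b = train_classifier(first_positives, first_negatives)
```

After retraining, items resembling the mislabeled samples are flagged for spot-checking while items resembling the correctly labeled samples are not, which is the sense in which the screened untrusted data becomes "more targeted".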
It should be noted that steps identical to those in the embodiment of fig. 1 are not repeated in the embodiment of fig. 2; for related content, refer to the embodiment of fig. 1.
In the data labeling method provided by this embodiment of the present application, a plurality of items of data to be labeled are labeled with a pre-trained target labeling model to obtain a target set formed by the corresponding labeling results; a pre-trained target classifier selects untrusted data from the data to be labeled so that the corresponding labeling results can be verified, and the labeling results corresponding to untrusted data that fails verification are corrected in the target set. Untrusted data that fails verification is determined as first positive samples, untrusted data that passes verification is determined as first negative samples, and the target classifier is updated with these samples. This not only saves substantial human resources and improves labeling efficiency and accuracy, but also, because the target classifier is continuously optimized and updated during labeling, makes the untrusted data it screens out more targeted.
As shown in fig. 3, fig. 3 is a flowchart illustrating another data annotation method according to an exemplary embodiment; it describes in detail the process of updating the target labeling model and may be applied to a terminal device or a server. The method comprises the following steps:
in step 301, a plurality of data to be labeled are labeled through a pre-trained target labeling model, so as to obtain a target set formed by labeling results corresponding to the data to be labeled.
In step 302, a pre-trained target classifier is used to select the non-credible data in the data to be labeled, so as to verify the labeling result corresponding to the non-credible data.
In step 303, the labeled result corresponding to the non-trusted data that fails to pass the verification in the target set is corrected.
In step 304, the corrected target set is stored in a pre-established annotation database.
In step 305, the target annotation model is updated with the annotation database.
In this embodiment, a label database may be pre-established, where the label database is used to store labeled data, and the machine learning training is performed by using the data stored in the label database. The corrected target set can be stored in the label database as labeled data.
In this embodiment, the data in the annotation database may be used as a training sample to retrain the target annotation model, so as to perform optimization updating on the target annotation model, thereby making the result of the target annotation model for annotating the data to be annotated more accurate.
It should be noted that steps identical to those in the embodiments of fig. 1 and fig. 2 are not repeated in the embodiment of fig. 3; for related content, refer to the embodiments of fig. 1 and fig. 2.
In the data labeling method provided by the above embodiment of the present application, a plurality of items of data to be labeled are labeled with a pre-trained target labeling model to obtain a target set formed by the corresponding labeling results; a pre-trained target classifier selects untrusted data from the data to be labeled so that the corresponding labeling results can be verified; the labeling results corresponding to untrusted data that fails verification are corrected in the target set; the corrected target set is stored in a pre-established labeling database; and the labeling database is used to update the target labeling model. This not only saves substantial human resources and improves labeling efficiency, but also, because the target labeling model is optimized and updated with the labeling database after each round of labeling, continuously improves the model and further increases labeling accuracy.
In some alternative embodiments, the target labeling model may be trained by iteratively executing an update operation on the labeling model until a stop condition is met, and taking the iteratively updated labeling model as the target labeling model.
In this embodiment, when the labeling model is initially trained, the collected training sample data may be labeled manually, and the labeling model is trained entirely on the manually labeled training sample data. Once the trained labeling model reaches a certain accuracy on the data to be labeled, the method of this embodiment can be used to continue training it. Specifically, the update operation on the labeling model may be executed iteratively until a stop condition is satisfied (for example, an objective function converges, or the number of iterations exceeds a preset number), and the iteratively updated labeling model is taken as the target labeling model. The update operation may include the following steps:
step a: and selecting an untrusted sample from the sample data by using the current target classifier.
In this embodiment, firstly, an untrusted sample in sample data may be selected by using a currently trained target classifier, where the untrusted sample may be sample data that is relatively confusing, has insignificant characteristics, is difficult to distinguish, is difficult to label, and is easy to be labeled incorrectly.
Step b: and acquiring a manual marking result corresponding to the non-credible sample as a first sample result.
Step c: and updating the current annotation model by using the first sample result.
In this embodiment, since the untrusted sample has strong confusability and insignificant features, it is more likely that an error occurs when the untrusted sample is labeled by using the labeling model. Therefore, the non-trusted sample can be labeled manually, so that a manual labeling result corresponding to the non-trusted sample is obtained and serves as the first sample result. The first sample result has higher accuracy, and the current annotation model can be continuously updated by using the first sample result, so that the annotation model is continuously perfected until a stopping condition is met, and the target annotation model is obtained. Therefore, the accuracy of the target labeling model is improved.
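The iterative update operation of steps a to c can be sketched as the following training loop; the stop condition (an iteration budget plus "no untrusted samples left"), the classifier, and the model-update routine are placeholder stubs chosen for illustration, not the patent's implementation.

```python
def train_target_labeling_model(sample_data, classifier, model,
                                get_manual_labels, update_model, max_iters=10):
    """Iteratively refine the labeling model on the hard (untrusted) samples."""
    for _ in range(max_iters):  # stop condition: iteration budget
        # Step a: the classifier screens out untrusted (hard) samples
        untrusted = [s for s in sample_data if classifier(s)]
        if not untrusted:
            break  # nothing hard left to learn from
        # Step b: manual labels for the hard samples (first sample results)
        first_results = {s: get_manual_labels(s) for s in untrusted}
        # Step c: update the current labeling model with those results
        model = update_model(model, first_results)
    return model  # the target labeling model

# Toy run: the "model" is a dict of learned labels; a sample counts as
# untrusted until the model knows its label.
samples = ["a", "b", "c"]
model_state = {}
trained = train_target_labeling_model(
    samples,
    classifier=lambda s: s not in model_state,
    model=model_state,
    get_manual_labels=lambda s: s.upper(),      # stands in for manual labeling
    update_model=lambda m, r: (m.update(r) or m),
)
```

In the toy run the first iteration flags all three samples as untrusted, the "manual" labels are absorbed into the model, and the second iteration finds nothing left to learn, so the loop stops early.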
In further alternative embodiments, the update operation may further include the following steps:
step d: and selecting a credible sample in the sample data by using the current target classifier.
In this embodiment, the current target classifier may further be used to select trusted samples from the sample data, the counterpart of the untrusted samples. A trusted sample is sample data whose characteristics are distinct and easy to distinguish, making it easy to label and unlikely to be labeled incorrectly.
Step e: label the trusted sample with the current labeling model to obtain a second sample result.
Step f: determine a verification result by verifying the second sample result.
In this embodiment, because the characteristics of a trusted sample are distinct and easy to distinguish, the labeling model is unlikely to make errors when labeling it. The current labeling model can therefore be used to label the trusted sample, yielding a second sample result, which can then be verified to determine whether the current labeling model labeled the trusted sample correctly.
Step g: based on the verification result, determine a second sample result that fails verification as a second negative sample, and a second sample result that passes verification as a second positive sample.
In this embodiment, if trusted data screened out by the current target classifier is labeled incorrectly by the labeling model, the target classifier's classification of that data was not accurate enough; if the data is labeled correctly, the classification was accurate. Therefore, a second sample result that fails verification (i.e., trusted data that was labeled incorrectly) may be determined as a second negative sample, and a second sample result that passes verification (i.e., trusted data that was labeled correctly) may be determined as a second positive sample.
Step h: update the current target classifier with the second positive sample and the second negative sample.
In this embodiment, the second positive sample and the second negative sample can be used to optimize and update the current target classifier, so that it screens trusted and untrusted data more accurately.
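Steps e-h amount to a small refinement loop. The sketch below is a minimal illustration under assumed interfaces: `ToyClassifier`, `label_fn`, and `verify_fn` are hypothetical stand-ins, not the patent's actual models.

```python
class ToyClassifier:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.updates = 0

    def is_trusted(self, sample):
        # stand-in trust rule: high-confidence samples count as "trusted"
        return sample["conf"] >= self.threshold

    def update(self, positives, negatives):
        # a real classifier would refit on the new samples here
        self.updates += 1


def refine_classifier(classifier, label_fn, verify_fn, samples):
    """Steps e-h: label trusted samples, verify the results, split them into
    second positive/negative samples, and update the classifier."""
    trusted = [s for s in samples if classifier.is_trusted(s)]
    positives, negatives = [], []
    for s in trusted:
        result = label_fn(s)                  # step e: label with current model
        if verify_fn(s, result):              # step f: verify the result
            positives.append((s, result))     # step g: second positive sample
        else:
            negatives.append((s, result))     # step g: second negative sample
    classifier.update(positives, negatives)   # step h: update the classifier
    return positives, negatives
```

Only the trusted subset enters this loop; untrusted samples follow the separate manual-labeling path of steps a-d.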
It should be noted that although the above embodiments describe the operations of the methods of the present application in a particular order, this does not require or imply that the operations must be performed in that order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be executed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be broken down into multiple steps.
Corresponding to the foregoing embodiments of the data labeling method, the present application also provides embodiments of a data labeling apparatus.
As shown in fig. 4, fig. 4 is a block diagram of a data labeling apparatus according to an exemplary embodiment of the present application. The apparatus may include: a labeling module 401, a selecting module 402, and a correcting module 403.
The labeling module 401 is configured to label multiple pieces of data to be labeled using a pre-trained target labeling model, obtaining a target set formed by the labeling results corresponding to the data to be labeled.
The selecting module 402 is configured to select untrusted data from the data to be labeled using a pre-trained target classifier, so that the labeling results corresponding to the untrusted data can be verified.
The correcting module 403 is configured to correct, in the target set, the labeling results corresponding to untrusted data that fails verification.
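The interaction of these three modules (label, screen for untrusted data, verify, correct) can be sketched as a single pass over the data. All interfaces below (`label_fn`, `is_untrusted`, `verify_fn`, `correct_fn`) are illustrative assumptions, not the apparatus's actual components.

```python
def annotate(data, label_fn, is_untrusted, verify_fn, correct_fn):
    """One pass of the label -> screen -> verify -> correct pipeline."""
    # labeling module 401: label everything, forming the target set
    target_set = {i: label_fn(d) for i, d in enumerate(data)}
    for i, d in enumerate(data):
        # selecting module 402: only untrusted data is verified
        if is_untrusted(d) and not verify_fn(d, target_set[i]):
            # correcting module 403: fix results that fail verification
            target_set[i] = correct_fn(d)
    return target_set
```

The point of the screening step is cost: trusted data keeps its model-produced label unchecked, so only the untrusted minority needs verification or manual correction.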
As shown in fig. 5, fig. 5 is a block diagram of another data labeling apparatus according to an exemplary embodiment of the present application. On the basis of the embodiment shown in fig. 4, the apparatus may further include: a determining module 404 and a first updating module 405.
The determining module 404 is configured to determine untrusted data that fails verification as a first positive sample, and untrusted data that passes verification as a first negative sample.
The first updating module 405 is configured to update the target classifier with the first positive sample and the first negative sample.
In some alternative embodiments, the target labeling model may be trained by iteratively executing an update operation on the labeling model until a stopping condition is met, and taking the iteratively updated labeling model as the target labeling model. The update operation may include: selecting an untrusted sample from the sample data using the current target classifier, obtaining a manual labeling result corresponding to the untrusted sample as a first sample result, and updating the current labeling model with the first sample result.
In still other optional embodiments, the update operation may further include: selecting a trusted sample from the sample data using the current target classifier; labeling the trusted sample with the current labeling model to obtain a second sample result; determining a verification result by verifying the second sample result; based on the verification result, determining a second sample result that fails verification as a second negative sample and a second sample result that passes verification as a second positive sample; and updating the current target classifier with the second positive sample and the second negative sample.
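Putting the two halves of the update operation together, the iterative training loop described above might look like the following sketch. The stub classes, the trust rule, and the iteration-cap stopping condition are all assumptions for illustration, not the patent's actual models.

```python
class StubModel:
    """Stand-in labeling model: predicts parity and records its updates."""
    def __init__(self):
        self.seen = []
    def label(self, sample):
        return sample % 2
    def update(self, first_results):
        self.seen.extend(first_results)

class StubClassifier:
    """Stand-in target classifier with a fixed trust rule."""
    def is_trusted(self, sample):
        return sample >= 5
    def update(self, positives, negatives):
        pass  # a real classifier would refit here

def train_target_model(model, classifier, samples, manual_label, verify,
                       max_iters=3):
    """Iterate the update operation until the stopping condition is met."""
    for _ in range(max_iters):               # stopping condition: iteration cap
        # part 1: untrusted samples get manual labels (first sample results)
        untrusted = [s for s in samples if not classifier.is_trusted(s)]
        model.update([(s, manual_label(s)) for s in untrusted])
        # part 2: trusted samples are labeled by the model, then verified
        trusted = [s for s in samples if classifier.is_trusted(s)]
        second = [(s, model.label(s)) for s in trusted]
        pos = [r for r in second if verify(*r)]       # second positive samples
        neg = [r for r in second if not verify(*r)]   # second negative samples
        classifier.update(pos, neg)
    return model
```

Each iteration thus improves both components: manual labels on hard (untrusted) samples improve the labeling model, while verification outcomes on easy (trusted) samples improve the classifier.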
In other alternative embodiments, the target labeling model may include multiple unit models with different structures, and the data to be labeled is image data to be labeled.
Any piece of image data to be labeled can be labeled as follows: label the image data with each unit model separately; then, based on each unit model's labeling results for the image data, determine one or more labeling targets in the image data using a non-maximum suppression (NMS) algorithm, and determine the label corresponding to each labeling target.
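A minimal sketch of that fusion step: detections from all unit models are pooled and deduplicated with NMS. The box tuple layout `(x1, y1, x2, y2, score, label)` and the IoU threshold are assumptions for illustration, not the patent's actual format.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, ...) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms_merge(detections, iou_thresh=0.5):
    """Pool detections from all unit models; keep the highest-scoring box of
    each overlapping cluster and suppress the rest."""
    kept = []
    for d in sorted(detections, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept
```

Because different unit models tend to fire on the same object with slightly different boxes, greedy NMS over the pooled detections collapses each object to the single most confident box, whose class becomes the labeling target's label.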
As shown in fig. 6, fig. 6 is a block diagram of another data labeling apparatus according to an exemplary embodiment of the present application. On the basis of the embodiment shown in fig. 4, the apparatus may further include: a storing module 406 and a second updating module 407.
The storing module 406 is configured to store the corrected target set into a pre-established labeling database.
The second updating module 407 is configured to update the target labeling model using the labeling database.
It should be understood that the above apparatus may be preset in a terminal device or a server, or may be loaded into the terminal device or server by downloading or the like. The corresponding modules in the apparatus can cooperate with modules in the terminal device or server to implement the data labeling scheme.
For the apparatus embodiments, since they substantially correspond to the method embodiments, reference may be made to the relevant parts of the method embodiment descriptions. The apparatus embodiments described above are merely illustrative: units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the present application, which one of ordinary skill in the art can understand and implement without inventive effort.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed, performs the data labeling method provided in any one of the embodiments of figs. 1 to 3.
Corresponding to the data labeling method above, an embodiment of the present application further provides a schematic structural diagram of an electronic device, shown in fig. 7 according to an exemplary embodiment of the present application. Referring to fig. 7, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and runs it, forming the data labeling apparatus at the logical level. Of course, besides software implementations, the present application does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the processing flow is not limited to logic units and may also be hardware or logic devices.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (7)

1. A method of data labeling, the method comprising:
labeling multiple pieces of data to be labeled through a pre-trained target labeling model to obtain a target set formed by the labeling results corresponding to the data to be labeled;
selecting untrusted data from the data to be labeled using a pre-trained target classifier, so as to verify the labeling results corresponding to the untrusted data;
correcting, in the target set, the labeling results corresponding to untrusted data that fails verification, storing the corrected target set into a pre-established labeling database, and updating the target labeling model with the labeling database;
wherein the data to be labeled serves as training sample data for machine learning, and the labeling results serve as labels of the training sample data;
the method further comprising:
determining untrusted data that fails verification as a first positive sample, and determining untrusted data that passes verification as a first negative sample; and
updating the target classifier with the first positive sample and the first negative sample.
2. The method of claim 1, wherein the target labeling model is trained by:
iteratively executing an update operation on the labeling model until a stopping condition is met, and taking the iteratively updated labeling model as the target labeling model; wherein the update operation comprises:
selecting an untrusted sample from the sample data using the current target classifier;
obtaining a manual labeling result corresponding to the untrusted sample as a first sample result; and
updating the current labeling model with the first sample result.
3. The method of claim 2, wherein the update operation further comprises:
selecting a trusted sample from the sample data using the current target classifier;
labeling the trusted sample with the current labeling model to obtain a second sample result;
determining a verification result by verifying the second sample result;
based on the verification result, determining a second sample result that fails verification as a second negative sample, and determining a second sample result that passes verification as a second positive sample; and
updating the current target classifier with the second positive sample and the second negative sample.
4. The method of claim 1, wherein the target labeling model comprises multiple unit models with different structures, and the data to be labeled is image data to be labeled;
wherein any piece of image data to be labeled is labeled by:
labeling the image data with each unit model separately; and
based on each unit model's labeling results for the image data, determining one or more labeling targets in the image data using a non-maximum suppression (NMS) algorithm, and determining the label corresponding to each labeling target.
5. An apparatus for data labeling, the apparatus comprising:
a labeling module, configured to label multiple pieces of data to be labeled through a pre-trained target labeling model to obtain a target set formed by the labeling results corresponding to the data to be labeled;
a selecting module, configured to select untrusted data from the data to be labeled using a pre-trained target classifier, so as to verify the labeling results corresponding to the untrusted data;
a correcting module, configured to correct, in the target set, the labeling results corresponding to untrusted data that fails verification;
a storing module, configured to store the corrected target set into a pre-established labeling database; and
a second updating module, configured to update the target labeling model with the labeling database;
wherein the data to be labeled serves as training sample data for machine learning, and the labeling results serve as labels of the training sample data;
the apparatus further comprising:
a determining module, configured to determine untrusted data that fails verification as a first positive sample, and untrusted data that passes verification as a first negative sample; and
a first updating module, configured to update the target classifier with the first positive sample and the first negative sample.
6. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-4.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-4 when executing the program.
CN201910487643.6A 2019-06-05 2019-06-05 Data labeling method and device and electronic equipment Active CN110288007B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910487643.6A CN110288007B (en) 2019-06-05 2019-06-05 Data labeling method and device and electronic equipment
PCT/CN2019/123406 WO2020244183A1 (en) 2019-06-05 2019-12-05 Data annotation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910487643.6A CN110288007B (en) 2019-06-05 2019-06-05 Data labeling method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110288007A CN110288007A (en) 2019-09-27
CN110288007B true CN110288007B (en) 2021-02-02

Family

ID=68003424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910487643.6A Active CN110288007B (en) 2019-06-05 2019-06-05 Data labeling method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN110288007B (en)
WO (1) WO2020244183A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288007B (en) * 2019-06-05 2021-02-02 北京三快在线科技有限公司 Data labeling method and device and electronic equipment
CN110797101B (en) * 2019-10-28 2023-11-03 腾讯医疗健康(深圳)有限公司 Medical data processing method, medical data processing device, readable storage medium and computer equipment
CN113469205B (en) * 2020-03-31 2023-01-17 阿里巴巴集团控股有限公司 Data processing method and system, network model and training method thereof, and electronic device
CN111897991B (en) * 2020-06-19 2022-08-26 济南信通达电气科技有限公司 Image annotation method and device
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium
CN112861962B (en) * 2021-02-03 2024-04-09 北京百度网讯科技有限公司 Sample processing method, device, electronic equipment and storage medium
CN112884060B (en) * 2021-03-09 2024-04-26 联仁健康医疗大数据科技股份有限公司 Image labeling method, device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980210A (en) * 2010-11-12 2011-02-23 百度在线网络技术(北京)有限公司 Marked word classifying and grading method and system
CN102541838A (en) * 2010-12-24 2012-07-04 日电(中国)有限公司 Method and equipment for optimizing emotional classifier
CN104281569A (en) * 2013-07-01 2015-01-14 富士通株式会社 Building device and method, classifying device and method and electronic device
CN105117429A (en) * 2015-08-05 2015-12-02 广东工业大学 Scenario image annotation method based on active learning and multi-label multi-instance learning
CN105224947A (en) * 2014-06-06 2016-01-06 株式会社理光 Sorter training method and system

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027664A1 (en) * 2003-07-31 2005-02-03 Johnson David E. Interactive machine learning system for automated annotation of information in text
US20100023319A1 (en) * 2008-07-28 2010-01-28 International Business Machines Corporation Model-driven feedback for annotation
US11074495B2 (en) * 2013-02-28 2021-07-27 Z Advanced Computing, Inc. (Zac) System and method for extremely efficient image and pattern recognition and artificial intelligence platform
CN103324937B (en) * 2012-03-21 2016-08-03 日电(中国)有限公司 The method and apparatus of label target
US8855430B1 (en) * 2012-05-30 2014-10-07 Google Inc. Refining image annotations
CN108875768A (en) * 2018-01-23 2018-11-23 北京迈格威科技有限公司 Data mask method, device and system and storage medium
CN108197668A (en) * 2018-01-31 2018-06-22 达闼科技(北京)有限公司 The method for building up and cloud system of model data collection
CN109241997B (en) * 2018-08-03 2022-03-22 硕橙(厦门)科技有限公司 Method and device for generating training set
CN109242013B (en) * 2018-08-28 2021-06-08 北京九狐时代智能科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN109543713B (en) * 2018-10-16 2021-03-26 北京奇艺世纪科技有限公司 Training set correction method and device
CN109446961B (en) * 2018-10-19 2020-10-30 北京达佳互联信息技术有限公司 Gesture detection method, device, equipment and storage medium
CN109635838B (en) * 2018-11-12 2023-07-11 平安科技(深圳)有限公司 Face sample picture labeling method and device, computer equipment and storage medium
CN109460795A (en) * 2018-12-17 2019-03-12 北京三快在线科技有限公司 Classifier training method, apparatus, electronic equipment and computer-readable medium
CN109784391B (en) * 2019-01-04 2021-01-05 杭州比智科技有限公司 Multi-model-based sample labeling method and device
CN110288007B (en) * 2019-06-05 2021-02-02 北京三快在线科技有限公司 Data labeling method and device and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101980210A (en) * 2010-11-12 2011-02-23 百度在线网络技术(北京)有限公司 Marked word classifying and grading method and system
CN102541838A (en) * 2010-12-24 2012-07-04 日电(中国)有限公司 Method and equipment for optimizing emotional classifier
CN104281569A (en) * 2013-07-01 2015-01-14 富士通株式会社 Building device and method, classifying device and method and electronic device
CN105224947A (en) * 2014-06-06 2016-01-06 株式会社理光 Sorter training method and system
CN105117429A (en) * 2015-08-05 2015-12-02 广东工业大学 Scenario image annotation method based on active learning and multi-label multi-instance learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Baselines for Image Annotation"; A. Makadia; International Journal of Computer Vision; 20101231; full text *
"Multiple Bernoulli relevance models for image and video annotation"; Feng SL; Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition; 20041231; full text *
"Improved Faster RCNN training method based on hard negative sample mining" (基于难负样本挖掘的改进Faster RCNN训练方法); Ai Tuo; Computer Science (《计算机科学》); 20180531; vol. 45, no. 5; full text *
"Research on automatic image annotation and annotation refinement algorithms" (自动图像标注及标注改善算法的研究); Song Haiyu; China Doctoral Dissertations Full-text Database (electronic journal), Information Science and Technology; 20121215; full text *

Also Published As

Publication number Publication date
CN110288007A (en) 2019-09-27
WO2020244183A1 (en) 2020-12-10

Similar Documents

Publication Publication Date Title
CN110288007B (en) Data labeling method and device and electronic equipment
US10671511B2 (en) Automated bug fixing
CN108351986B (en) Learning system, learning apparatus, training data generating method, and computer readable medium
AU2020200909A1 (en) Evaluation control
CN110852983B (en) Method for detecting defect in semiconductor device
US10803398B2 (en) Apparatus and method for information processing
US11307975B2 (en) Machine code analysis for identifying software defects
CN115641443B (en) Method for training image segmentation network model, method for processing image and product
CN109616101B (en) Acoustic model training method and device, computer equipment and readable storage medium
US20210311729A1 (en) Code review system
CN115168868B (en) Business vulnerability analysis method and server applied to artificial intelligence
CN114787831B (en) Improving accuracy of classification models
CN113220883A (en) Text classification model performance optimization method and device and storage medium
CN112597124A (en) Data field mapping method and device and storage medium
US20140325490A1 (en) Classifying Source Code Using an Expertise Model
CN111428858A (en) Method and device for determining number of samples, electronic equipment and storage medium
CN112149698A (en) Method and device for screening difficult sample data
CN110705689B (en) Continuous learning method and device capable of distinguishing features
CN112152968B (en) Network threat detection method and device
CN114238598A (en) Question-answering system and labeling, auditing and model training method thereof
CN111814949B (en) Data labeling method and device and electronic equipment
CN112364990A (en) Method and system for realizing grammar error correction and less sample field adaptation through meta-learning
CN109672781B (en) Safety protection method and device for electronic equipment
CN111225297A (en) Broadband passive optical network port resource remediation method and system
JP7456289B2 (en) Judgment program, judgment method, and information processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant