CN112906349A

CN112906349A - Data annotation method, system, equipment and readable storage medium

Info

Publication number: CN112906349A
Application number: CN202110342499.4A
Authority: CN
Inventors: 李正华; 周明月; 龚晨; 张民
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-06-04
Also published as: WO2022205585A1

Abstract

The application discloses a data annotation method comprising the following steps: determining data to be annotated according to an input data annotation task; calling a data annotation model to annotate the data to be annotated to obtain a corresponding data annotation result; judging whether the data annotation result is consistent with an input manual annotation result; and, if they are consistent, confirming that the data annotation result is correct. Compared with human annotation followed by human proofreading and machine annotation followed by human proofreading, the method fundamentally solves the problem of the annotator's acceptance bias, fully exposes differences in how the problem is understood, and promotes refinement of the annotation guidelines and improvement of the annotation level. Because it incorporates automatic machine annotation, it effectively reduces the time and monetary cost of annotation compared with independent annotation by multiple people, greatly reducing cost while ensuring annotation quality. The application also provides a system, a device and a readable storage medium for data annotation, which have the same beneficial effects.

Description

Data annotation method, system, equipment and readable storage medium

Technical Field

The present application relates to the field of computer data processing, and in particular, to a method, a system, a device, and a readable storage medium for data annotation.

Background

Data annotation means that a person attaches additional information to data; this information embodies further knowledge about the data and makes the data easier for both people and computers to process. High-quality annotated data makes it possible to mine the information in the data effectively and promotes technical progress in related disciplines. At present, the common data annotation methods are human annotation with human proofreading, machine annotation with human proofreading, and independent annotation by multiple people. Each of these methods has its own strengths and weaknesses with respect to the two main considerations in data annotation, quality and cost.

Human annotation with human proofreading is performed entirely by hand, so its cost is high, and it suffers from the proofreader's acceptance bias, that is, the tendency to accept an answer that has already been presented. Machine annotation with human proofreading is relatively fast and cheap, but it still suffers from acceptance bias and its overall quality is lower. Independent annotation by multiple people fundamentally solves the acceptance bias problem and effectively improves quality, but its annotation and review costs are high. If the review process is omitted, the result is multiply annotated data containing a great deal of noise, and unless the inconsistencies are further eliminated the data is of little use.

Therefore, how to reduce the cost while ensuring the quality of data annotation is a technical problem that needs to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a data annotation method, a system, equipment and a readable storage medium, which are used for reducing the cost while ensuring the quality of data annotation.

In order to solve the above technical problem, the present application provides a data annotation method, including:

determining data to be annotated according to an input data annotation task;

calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;

judging whether the data annotation result is consistent with an input manual annotation result or not;

and if they are consistent, confirming that the data annotation result is correct.

Optionally, the method further includes:

if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;

and outputting the data to be audited so that an auditor can manually audit the data to be audited.
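The comparison and routing steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the application's implementation: the names `Sample`, `route_samples` and `toy_model` are hypothetical, and annotation results are assumed to be values that can be compared with `==`.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, List, Tuple


@dataclass
class Sample:
    """One item to annotate, with the label entered independently by a human."""
    text: str
    manual_label: Hashable


Routed = List[Tuple[Sample, Hashable]]


def route_samples(
    samples: Iterable[Sample],
    model: Callable[[str], Hashable],
) -> Tuple[Routed, Routed]:
    """Split samples into (confirmed, to_audit).

    A sample is confirmed when the model label equals the manual label;
    otherwise it is queued for manual audit together with the model label.
    """
    confirmed: Routed = []
    to_audit: Routed = []
    for sample in samples:
        model_label = model(sample.text)  # machine annotation
        if model_label == sample.manual_label:  # consistency check
            confirmed.append((sample, model_label))
        else:
            to_audit.append((sample, model_label))
    return confirmed, to_audit


# Hypothetical toy model: labels a sentence by whether it ends with "?".
def toy_model(text: str) -> str:
    return "question" if text.endswith("?") else "statement"


samples = [
    Sample("Is it raining?", "question"),
    Sample("It is raining.", "statement"),
    Sample("You are coming?", "statement"),  # human and model disagree
]
confirmed, to_audit = route_samples(samples, toy_model)
```

In this sketch the human label is stored before the model is called, which mirrors the requirement that the annotator does not see the machine's answer.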

Optionally, the method further includes:

if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be voted;

calling a preset number of voting models to vote on the data annotation result and the manual annotation result of the data to be voted, to obtain a corresponding voting result;

and determining the annotation result with the most votes as the final data annotation result of the data to be voted.
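One way to picture the voting steps above is the sketch below. It is illustrative only: `vote_between` and the voter callables are hypothetical names, each voter is assumed to return whichever of the two candidate labels it prefers, and the tie-breaking rule (prefer the manual label) is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

Voter = Callable[[str, Hashable, Hashable], Hashable]


def vote_between(
    text: str,
    model_label: Hashable,
    manual_label: Hashable,
    voters: Sequence[Voter],
) -> Hashable:
    """Return whichever of the two candidate labels receives more votes.

    Votes for anything other than the two candidates are ignored.
    Ties go to the manual label (an assumption made for this sketch).
    """
    candidates = (manual_label, model_label)
    ballots = (voter(text, model_label, manual_label) for voter in voters)
    votes = Counter(b for b in ballots if b in candidates)
    if votes[model_label] > votes[manual_label]:
        return model_label
    return manual_label


# Three hypothetical voting models: two side with the machine, one with the human.
voters = [
    lambda text, machine, human: machine,
    lambda text, machine, human: human,
    lambda text, machine, human: machine,
]
final_label = vote_between("You are coming?", "question", "statement", voters)
tie_label = vote_between("x", "a", "b", voters[:2])
```

Using an odd preset number of voting models avoids ties between the two candidates altogether.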

Optionally, the data annotation model includes a syntax annotation model, and the invoking of the data annotation model performs data annotation on the data to be annotated to obtain a corresponding data annotation result, including:

calling the syntactic annotation model to determine the relation between words in the data to be annotated;

and labeling the data to be labeled as a corresponding dependency syntax tree according to the relation between words in the data to be labeled.

Optionally, the data annotation model includes a semantic annotation model, and the invoking of the data annotation model performs data annotation on the data to be annotated to obtain a corresponding data annotation result, including:

performing word segmentation processing on the data to be labeled to obtain at least one word;

and calling the semantic annotation model to perform semantic annotation on the words according to a word meaning database to obtain corresponding semantic annotation results.
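The two semantic annotation steps above (word segmentation, then sense annotation against a word meaning database) can be sketched as below. Everything here is a stand-in chosen for illustration: `SENSE_DB`, `whitespace_segment` and `annotate_senses` are hypothetical, the database is a toy dictionary, and a plain lookup replaces the semantic annotation model, which in practice would choose among several senses using context.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical word meaning database: word -> sense label.
SENSE_DB: Dict[str, str] = {
    "bank": "financial_institution",
    "river": "body_of_water",
}


def whitespace_segment(text: str) -> List[str]:
    """Stand-in segmenter; unsegmented text would need a real word segmenter."""
    return text.lower().split()


def annotate_senses(
    text: str,
    segment: Callable[[str], List[str]] = whitespace_segment,
    sense_db: Dict[str, str] = SENSE_DB,
    unknown: str = "UNK",
) -> List[Tuple[str, str]]:
    """Segment the text, then attach a sense label to every word.

    Words missing from the database receive the `unknown` label.
    """
    words = segment(text)  # step 1: word segmentation
    return [(word, sense_db.get(word, unknown)) for word in words]  # step 2


senses = annotate_senses("River bank erosion")
```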

Optionally, the data annotation task includes at least one of a sequence annotation task, a tree annotation task, and a classification annotation task.

The present application further provides a system for data annotation, the system comprising:

the first determining module is used for determining data to be annotated according to the input data annotation task;

the data annotation module is used for calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;

the judging module is used for judging whether the data marking result is consistent with the input manual marking result or not;

and the confirming module is used for confirming that the data labeling result is correct when the data labeling result is consistent with the input manual labeling result.

Optionally, the system further includes:

the first marking module is used for marking the data to be annotated as data to be audited if the data annotation result is inconsistent with the input manual annotation result;

and the output module is used for outputting the data to be audited so as to ensure that auditors can carry out manual audit on the data to be audited.

The present application further provides a data annotation device, which includes:

a memory for storing a computer program;

a processor for implementing the steps of the method of data annotation as described in any one of the above when said computer program is executed.

The present application also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of data annotation as described in any one of the above.

The data labeling method provided by the application comprises the following steps: determining data to be annotated according to an input data annotation task; calling a data annotation model to perform data annotation on data to be annotated to obtain a corresponding data annotation result; judging whether the data annotation result is consistent with the input manual annotation result or not; and if the data are consistent, the data marking result is confirmed to be correct.

In this technical solution, the data annotation result produced by the data annotation model is compared with the input manual annotation result, and if the two are consistent the data annotation result is confirmed to be correct. Because the annotator who produces the manual annotation result cannot see the machine's answer, the annotator's acceptance bias is avoided. Because the method incorporates automatic machine annotation, it effectively reduces the time and monetary cost of annotation compared with independent annotation by multiple people, greatly reducing cost while ensuring annotation quality. The application also provides a system, a device and a readable storage medium for data annotation, which have the same beneficial effects and are not described again here.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart of a method for data annotation according to an embodiment of the present application;

FIG. 2 is a flowchart of one implementation of S102 in the method of data annotation shown in FIG. 1;

FIG. 3 is a summary diagram of dependency relation types provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a dependency syntax tree according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another dependency syntax tree according to an embodiment of the present application;

FIG. 6 is a block diagram of a system for data annotation provided in an embodiment of the present application;

FIG. 7 is a structural diagram of a data annotation device according to an embodiment of the present application.

Detailed Description

The core of the application is to provide a method, a system, equipment and a readable storage medium for data annotation, which are used for reducing the cost while ensuring the quality of data annotation.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The annotation methods in common use today can be summarized as human annotation with human proofreading, machine annotation with human proofreading, and independent annotation by multiple people, wherein:

Human annotation with human proofreading means that after one annotator annotates the data, another annotator proofreads it, and the proofread result is taken as the answer. The proofreader may be another annotator of the same level, or a senior annotator with a more careful attitude, more experience or stronger ability.

Machine annotation with human proofreading means that the data is first annotated by a machine and each sample is then proofread manually. The machine is a machine learning model trained on existing annotated data.

Independent annotation by multiple people means that the same task is annotated independently by several annotators. Independent means that the annotators do not discuss the task and do not know one another's answers, so that inconsistencies are exposed to the greatest extent. If the annotators' results are consistent, the answer is generally considered correct and the task is complete. If the results are inconsistent, the final answer must be determined by review, by a vote among several people, or by similar means.

However, human annotation with human proofreading suffers from the proofreader's acceptance bias. Out of inertia, people tend to approve of, or go along with, an answer that is presented to them. Thus, although proofreading catches some obvious errors, many inconspicuous errors are missed.

Machine annotation with human proofreading suffers from the same acceptance bias: the proofreader tends to regard the machine's result as correct, so the error correction rate is low. Because the annotated corpus is continually used to retrain the model, fewer and fewer errors are corrected, which leads to the more serious problem of model-oriented convergence: the goal of the whole annotation project shifts to rapidly raising model accuracy rather than producing high-quality annotated data. A good annotation project should instead converge toward the problem, that is, characterize the problem accurately through the annotation guidelines and distinguish different data strictly according to those guidelines. In addition, machine annotation with human proofreading yields less feedback during annotation, which hinders refinement of the annotation guidelines and makes it hard to discover the important, interesting and valuable phenomena that deepen understanding of the problem.

Although independent annotation by multiple people is free of acceptance bias and gives markedly better quality than the two proofreading-based methods, it has the notable drawback of higher annotation cost, in both time and money. Annotation cost grows in proportion to the number of annotators, and more annotators also make the review process more complex, so review cost rises significantly.

The present application thus provides a method of data annotation that addresses the above-mentioned problems.

Referring to fig. 1, fig. 1 is a flowchart illustrating a data annotation method according to an embodiment of the present disclosure.

The method specifically comprises the following steps:

S101: determining data to be annotated according to an input data annotation task;

Data annotation means that a person attaches additional information to data; this information embodies further knowledge about the data and makes the data easier for both people and computers to process. High-quality annotated data makes it possible to mine the information in the data effectively and promotes technical progress in related disciplines.

In a specific embodiment, the data annotation task mentioned herein may include an image annotation task, a semantic annotation task, a syntax annotation task, and the like, and may also include at least one of a sequence annotation task, a tree annotation task, and a classification annotation task.

S102: calling a data annotation model to perform data annotation on data to be annotated to obtain a corresponding data annotation result;

The data annotation model mentioned here is used to annotate the data to be annotated. In a specific embodiment, the data annotation model may be trained in advance and then input into the system, or the system may connect to a specified location and download it, which is not specifically limited in this application.

In a specific embodiment, the data annotation model mentioned herein may include a semantic annotation model, and on this basis, the calling data annotation model mentioned herein performs data annotation on data to be annotated to obtain a corresponding data annotation result, which may specifically be:

performing word segmentation processing on data to be labeled to obtain at least one word;

and calling a semantic annotation model to perform semantic annotation on the words according to the word meaning database to obtain corresponding semantic annotation results.

S103: judging whether the data annotation result is consistent with the input manual annotation result or not;

if yes, go to step S104;

When the data annotation result is consistent with the input manual annotation result, the data annotation model and the human annotator have arrived at the same annotation, so the current data annotation result can be confirmed to be correct, and researchers can use it directly in subsequent work.

In a specific embodiment, when the data annotation result is inconsistent with the input manual annotation result, the data annotation model and the human annotator have arrived at different annotations. In this case a final data annotation result can be obtained by manual audit, that is, the following steps can be further performed:

if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;

and outputting the data to be audited so that an auditor can manually audit the data to be audited.

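In a specific embodiment, when the data annotation result is inconsistent with the input manual annotation result, a final data annotation result can also be obtained by model voting, that is, the following steps can be further performed:

if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be voted;

calling a preset number of voting models to vote on the data annotation result and the manual annotation result of the data to be voted, to obtain a corresponding voting result;

and determining the annotation result with the most votes as the final data annotation result of the data to be voted.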

In a specific embodiment, the final data annotation result may alternatively be obtained by having several models annotate the data to be audited again, in which case the following steps may be further performed:

calling a preset number of audit annotation models to annotate the data to be audited, to obtain corresponding audit annotation results;

and merging and counting the audit annotation results, and determining the audit annotation result that occurs most frequently as the final data annotation result of the data to be audited.

S104: and confirming that the data labeling result is correct.

Based on the above technical solution, the data annotation method provided by the application compares the data annotation result produced by the data annotation model with the input manual annotation result and, if the two are consistent, confirms that the data annotation result is correct. Because the annotator who produces the manual annotation result cannot see the machine's answer, the annotator's acceptance bias is avoided. Because the method incorporates automatic machine annotation, it effectively reduces the time and monetary cost of annotation compared with independent annotation by multiple people, greatly reducing cost while ensuring annotation quality.
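For the embodiment under S103 in which a preset number of audit annotation models annotate the disputed data again and the most frequent result is kept, a minimal sketch might look like the following. The function name `majority_relabel` and the toy models are hypothetical, and the tie behaviour (the label seen first among the most frequent ones wins) is an assumption of this sketch rather than something the text specifies.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_relabel(
    text: str,
    audit_models: Sequence[Callable[[str], Hashable]],
) -> Hashable:
    """Annotate `text` with every audit model and return the most frequent label.

    Raises ValueError when no audit model is supplied.
    """
    if not audit_models:
        raise ValueError("at least one audit annotation model is required")
    labels = [model(text) for model in audit_models]
    # most_common(1) returns [(label, count)] for the top label; on a tie
    # the label encountered first is returned.
    return Counter(labels).most_common(1)[0][0]


# Three hypothetical audit annotation models with fixed answers.
audit_models = [
    lambda text: "statement",
    lambda text: "question",
    lambda text: "question",
]
relabelled = majority_relabel("You are coming?", audit_models)
```

Unlike voting between the machine and manual answers, this variant can return a label that neither of the two original answers proposed.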

For step S102 in the previous embodiment, the data annotation model may also include a syntactic annotation model. On this basis, calling the data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result may be implemented by the steps shown in FIG. 2, which is described below.

Referring to FIG. 2, FIG. 2 is a flowchart of one implementation of S102 in the data annotation method shown in FIG. 1.

The method specifically comprises the following steps:

S201: calling a syntactic annotation model to determine the relations between words in the data to be annotated;

S202: annotating the data to be annotated as a corresponding dependency syntax tree according to the relations between words in the data to be annotated.

Referring to FIG. 3, FIG. 3 is a summary diagram of dependency relation types provided in an embodiment of the present application. In an embodiment, the syntactic annotation model may use the relation types summarized in FIG. 3 to annotate the data to be annotated as a corresponding dependency syntax tree according to the relations between its words.

Taking the data to be annotated "I eat fish with fork" as an example, the syntactic annotation model captures the modification relations between the words in the sentence. Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a dependency syntax tree according to an embodiment of the present application. As shown in FIG. 4, $w_0$ is a pseudo node that points to the word serving as the root node of the sentence. A dependency arc consists of three elements, $w_i \xrightarrow{r} w_j$, where $w_i$ is the head word, $w_j$ is the modifier, and $r$ is the relation type, meaning that $w_j$ modifies $w_i$ in the syntactic role $r$. Here root denotes the root node, subj the subject, obj the object, and sasubj the same subject.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of another dependency syntax tree provided in an embodiment of the present application. In FIG. 5, adv denotes an adverbial and pobj denotes a prepositional object. In a specific implementation, if the data annotation model and the manual annotation respectively give the two answers shown in FIG. 5, further audit or voting is required to determine the final data annotation result.
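To make the consistency check concrete for tree-structured annotations, the sketch below stores each dependency arc as a (head, relation, modifier) triple and reports the words on which two trees disagree. The names `Arc`, `tree_from_arcs` and `disagreements` are hypothetical, and the two example trees are illustrative attachments for "I eat fish with fork" chosen for this sketch; they are not transcriptions of FIG. 4 or FIG. 5, and the relation label on the second attachment of "with" is only a placeholder.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Arc(NamedTuple):
    head: int      # index of the head word w_i (0 is the pseudo node w_0)
    relation: str  # relation type r
    modifier: int  # index of the modifier w_j


Tree = Dict[int, Tuple[int, str]]


def tree_from_arcs(arcs: Iterable[Arc]) -> Tree:
    """Map each modifier to its (head, relation); every word has one head."""
    tree: Tree = {}
    for arc in arcs:
        if arc.modifier in tree:
            raise ValueError(f"word {arc.modifier} has more than one head")
        tree[arc.modifier] = (arc.head, arc.relation)
    return tree


def disagreements(tree_a: Tree, tree_b: Tree) -> List[int]:
    """Return the word indices whose head or relation differs between trees."""
    words = tree_a.keys() | tree_b.keys()
    return sorted(w for w in words if tree_a.get(w) != tree_b.get(w))


# Word indices: 1=I, 2=eat, 3=fish, 4=with, 5=fork.
machine_tree = tree_from_arcs([
    Arc(0, "root", 2), Arc(2, "subj", 1), Arc(2, "obj", 3),
    Arc(2, "adv", 4),   # "with" attached to "eat"
    Arc(4, "pobj", 5),
])
manual_tree = tree_from_arcs([
    Arc(0, "root", 2), Arc(2, "subj", 1), Arc(2, "obj", 3),
    Arc(3, "adv", 4),   # "with" attached to "fish" (placeholder label)
    Arc(4, "pobj", 5),
])
differing_words = disagreements(machine_tree, manual_tree)
is_consistent = not differing_words
```

Comparing per word rather than per sentence lets an auditor or voting model focus on the single disputed attachment instead of re-examining the whole tree.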

Referring to fig. 6, fig. 6 is a block diagram of a data annotation system according to an embodiment of the present disclosure.

The system may include:

a first determining module 100, configured to determine data to be annotated according to an input data annotation task;

the data annotation module 200 is configured to invoke a data annotation model to perform data annotation on data to be annotated, so as to obtain a corresponding data annotation result;

the judging module 300 is configured to judge whether the data annotation result is consistent with the input manual annotation result;

and the confirming module 400 is configured to confirm that the data annotation result is correct when the data annotation result is consistent with the input manual annotation result.

On the basis of the above embodiment, in a specific embodiment, the system may further include:

the first marking module is used for marking the data to be annotated as data to be audited if the data annotation result is inconsistent with the input manual annotation result;

the output module is used for outputting the data to be audited so that an auditor can manually audit the data to be audited;

the second marking module is used for marking the data to be annotated as data to be voted if the data annotation result is inconsistent with the input manual annotation result;

the calling module is used for calling a preset number of voting models to vote on the data annotation result and the manual annotation result of the data to be voted, to obtain a corresponding voting result;

and the second determining module is used for determining the annotation result with the most votes as the final data annotation result of the data to be voted.

Based on the above embodiments, in a specific embodiment, the data annotation model may include a syntactic annotation model, and the data annotation module 200 may include:

the first calling submodule is used for calling a syntax annotation model to determine the relation between words in the data to be annotated;

and the labeling submodule is used for labeling the data to be labeled as the corresponding dependency syntax tree according to the relation between the words in the data to be labeled.

Based on the above embodiments, in a specific embodiment, the data annotation model may include a semantic annotation model, and the data annotation module 200 may include:

the word segmentation sub-module is used for carrying out word segmentation processing on the data to be labeled to obtain at least one word;

and the second calling submodule is used for calling the semantic annotation model to carry out semantic annotation on the words according to the word meaning database so as to obtain a corresponding semantic annotation result.

Since the embodiment of the system part corresponds to the embodiment of the method part, the embodiment of the system part is described with reference to the embodiment of the method part, and is not repeated here.

Referring to fig. 7, fig. 7 is a structural diagram of a data annotation device according to an embodiment of the present application.

The data annotation device 700 may vary considerably with configuration or performance, and may include one or more central processing units (CPUs) 722, a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) that store applications 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a sequence of instruction operations for the device. Further, the processor 722 may be configured to communicate with the storage medium 730 and to execute, on the data annotation device 700, the series of instruction operations in the storage medium 730.

The data annotation device 700 may also include one or more power supplies 727, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.

The steps in the data annotation method described in fig. 1 to 5 above are implemented by the data annotation device based on the structure shown in fig. 7.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

If implemented in the form of a software functional module and sold or used as a stand-alone product, the integrated module may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a function calling device, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

A method, a system, a device and a readable storage medium for data annotation provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.

It is further noted that, in the present specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A method of data annotation, comprising:

determining data to be annotated according to an input data annotation task;

2. The method of claim 1, further comprising:

3. The method of claim 1, further comprising:

4. The method of claim 1, wherein the data annotation model comprises a syntactic annotation model, and the invoking of the data annotation model for data annotation of the data to be annotated to obtain a corresponding data annotation result comprises:

5. The method of claim 1, wherein the data annotation model comprises a semantic annotation model, and the invoking of the data annotation model for data annotation of the data to be annotated to obtain a corresponding data annotation result comprises:

6. The method of claim 1, wherein the data annotation task comprises at least one of a sequence annotation task, a tree annotation task, and a category annotation task.

7. A system for annotating data, comprising:

8. The system of claim 7, further comprising:

9. A data annotation apparatus, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the method of data annotation according to any one of claims 1 to 6 when executing said computer program.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of data annotation according to any one of claims 1 to 6.