CN112906349A - Data annotation method, system, equipment and readable storage medium - Google Patents

Data annotation method, system, equipment and readable storage medium

Info

Publication number
CN112906349A
Authority
CN
China
Prior art keywords
data
annotation
result
labeling
marking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110342499.4A
Other languages
Chinese (zh)
Inventor
李正华
周明月
龚晨
张民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202110342499.4A (patent CN112906349A)
Priority to PCT/CN2021/095157 (patent WO2022205585A1)
Publication of CN112906349A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/14 Tree-structured documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a data annotation method comprising: determining data to be annotated according to an input data annotation task; calling a data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result; judging whether the data annotation result is consistent with an input manual annotation result; and, if they are consistent, confirming that the data annotation result is correct. Compared with human annotation followed by human proofreading and machine annotation followed by human proofreading, the method fundamentally avoids the annotator's acceptance tendency, fully exposes differences in how the problem is understood, and helps refine the annotation guideline and raise the level of annotation. Because it draws on automatic machine annotation, it reduces the time and monetary cost of annotation compared with independent annotation by multiple people, and thus substantially lowers cost while preserving annotation quality. The application also provides a data annotation system, a device and a readable storage medium, which have the same beneficial effects.

Description

Data annotation method, system, equipment and readable storage medium
Technical Field
The present application relates to the field of computer data processing, and in particular, to a method, a system, a device, and a readable storage medium for data annotation.
Background
Data annotation means that a person attaches additional information to data; this information captures further knowledge about the data and makes the data easier for both people and computers to process. High-quality annotated data makes it possible to mine the information in the data effectively and advances the related disciplines. At present, common data annotation methods include human annotation with human proofreading, machine annotation with human proofreading, and independent annotation by multiple people. Each of these methods has its own strengths and weaknesses with respect to the two key considerations of data annotation, quality and cost.
Human annotation with human proofreading is entirely manual, so it is costly, and it suffers from the proofreader's acceptance tendency. Machine annotation with human proofreading is relatively fast and cheap, but it still suffers from the annotator's acceptance tendency, and its overall quality is lower. Independent annotation by multiple people fundamentally avoids the acceptance tendency and effectively improves quality, but its annotation and auditing costs are high; if the auditing step is omitted, the result is multiply annotated data with a great deal of noise, which is of little use unless the inconsistencies are further eliminated.
Therefore, how to reduce the cost while ensuring the quality of data annotation is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a data annotation method, a system, equipment and a readable storage medium, which are used for reducing the cost while ensuring the quality of data annotation.
In order to solve the above technical problem, the present application provides a data annotation method, including:
determining data to be annotated according to an input data annotation task;
calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;
judging whether the data annotation result is consistent with an input manual annotation result or not;
and if they are consistent, confirming that the data annotation result is correct.
Optionally, the method further includes:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;
and outputting the data to be audited so that an auditor can audit it manually.
Optionally, the method further includes:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be voted;
calling a preset number of voting models to vote the data marking result and the manual marking result of the data to be voted to obtain a corresponding voting result;
and determining the annotation result with the most votes as the final data annotation result of the data to be voted.
Optionally, the data annotation model includes a syntactic annotation model, and calling the data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result includes:
calling the syntactic annotation model to determine the relation between words in the data to be annotated;
and labeling the data to be labeled as a corresponding dependency syntax tree according to the relation between words in the data to be labeled.
Optionally, the data annotation model includes a semantic annotation model, and calling the data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result includes:
performing word segmentation processing on the data to be labeled to obtain at least one word;
and calling the semantic annotation model to perform semantic annotation on the words according to a word meaning database to obtain corresponding semantic annotation results.
Optionally, the data annotation task includes at least one of a sequence annotation task, a tree annotation task, and a classification annotation task.
The present application further provides a system for data annotation, the system comprising:
the first determining module is used for determining data to be annotated according to the input data annotation task;
the data annotation module is used for calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;
the judging module is used for judging whether the data marking result is consistent with the input manual marking result or not;
and the confirming module is used for confirming that the data labeling result is correct when the data labeling result is consistent with the input manual labeling result.
Optionally, the system further includes:
the first marking module is used for marking the data to be annotated as data to be audited if the data annotation result is inconsistent with the input manual annotation result;
and the output module is used for outputting the data to be audited so that an auditor can audit it manually.
The present application further provides a data annotation device, which includes:
a memory for storing a computer program;
a processor for implementing the steps of the method of data annotation as described in any one of the above when said computer program is executed.
The present application also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of data annotation as described in any one of the above.
The data annotation method provided by the application comprises the following steps: determining data to be annotated according to an input data annotation task; calling a data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result; judging whether the data annotation result is consistent with the input manual annotation result; and, if they are consistent, confirming that the data annotation result is correct.
In this technical solution, the data annotation result obtained by the data annotation model is compared with the input manual annotation result, and if the two are consistent, the data annotation result is confirmed to be correct. The annotator who produces the manual annotation result cannot see the machine's answer. Because the method draws on automatic machine annotation, it reduces the time and monetary cost of annotation compared with independent annotation by multiple people, and thus substantially lowers cost while preserving annotation quality. The application also provides a data annotation system, a device and a readable storage medium, which have the same beneficial effects and are not described again here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a flowchart of a data annotation method according to an embodiment of the present application;
FIG. 2 is a flowchart of one implementation of S102 in the data annotation method of FIG. 1;
FIG. 3 is a summary diagram of dependency relation types provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a dependency syntax tree according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of another dependency syntax tree according to an embodiment of the present application;
FIG. 6 is a block diagram of a data annotation system provided in an embodiment of the present application;
FIG. 7 is a structural diagram of a data annotation device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a method, a system, equipment and a readable storage medium for data annotation, which are used for reducing the cost while ensuring the quality of data annotation.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The annotation methods in common use today can be summarized as human annotation with human proofreading, machine annotation with human proofreading, and independent annotation by multiple people, where:
Human annotation with human proofreading means that after one annotator annotates the data, another annotator proofreads it, and the proofread result is taken as the answer. The proofreader may be another annotator of the same level, or a senior annotator with a better attitude, more experience or greater ability.
Machine annotation with human proofreading means that the data is first annotated by a machine and each sample is then proofread manually. The machine is a machine learning model trained on existing annotated data.
Independent annotation by multiple people means that the same task is annotated independently by several annotators. Independent means that the annotators do not confer and do not know each other's answers, so that inconsistencies are exposed to the greatest possible extent. If the annotators' results are consistent, the annotation is generally considered correct and the task is complete. If the results are inconsistent, the final answer must be determined by auditing, by a vote among several people, or by similar means.
However, human annotation with human proofreading suffers from the acceptance tendency. Out of inertia, people tend to approve of, or be steered by, an answer that is already presented to them. Thus, although proofreading may catch some obvious errors, many less noticeable errors are missed.
Machine annotation with human proofreading suffers from the same acceptance tendency: the proofreader tends to regard the machine's result as correct, so the error correction rate is low. Because the annotated corpus is continually used to retrain the model, fewer and fewer errors are corrected, which leads to the more serious problem of model-oriented convergence: the goal of the whole annotation project shifts from producing high-quality data to rapidly raising the model's accuracy. A good annotation project should instead converge toward the problem, that is, it should characterize the problem accurately through the annotation guideline and distinguish different data strictly according to that guideline. In addition, machine annotation with human proofreading yields less feedback during annotation, which hinders refinement of the annotation guideline and makes it hard to discover important, interesting and valuable phenomena that would deepen understanding of the problem.
Independent annotation by multiple people does not suffer from the acceptance tendency, and its annotation quality is clearly better than that of the two proofreading methods, but it has one notable drawback: it raises the annotation cost, which comprises time cost and monetary cost. The annotation cost grows in proportion to the number of annotators, and the larger number of annotators also makes auditing more complex, so the auditing cost rises markedly as well.
The present application thus provides a method of data annotation that addresses the above-mentioned problems.
Referring to FIG. 1, FIG. 1 is a flowchart illustrating a data annotation method according to an embodiment of the present disclosure.
The method specifically comprises the following steps:
S101: determining data to be annotated according to an input data annotation task;
Data annotation means that a person attaches additional information to data; this information captures further knowledge about the data and makes the data easier for both people and computers to process. High-quality annotated data makes it possible to mine the information in the data effectively and advances the related disciplines.
In a specific embodiment, the data annotation task mentioned herein may include an image annotation task, a semantic annotation task, a syntax annotation task, and the like, and may also include at least one of a sequence annotation task, a tree annotation task, and a classification annotation task.
S102: calling a data annotation model to perform data annotation on data to be annotated to obtain a corresponding data annotation result;
The data annotation model mentioned here is used to annotate the data to be annotated. In a specific embodiment, the model may be trained in advance and then input into the system, or the system may connect to a specified location and download it; this application does not specifically limit this.
In a specific embodiment, the data annotation model may include a semantic annotation model. On this basis, calling the data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result may specifically be:
performing word segmentation processing on data to be labeled to obtain at least one word;
and calling a semantic annotation model to perform semantic annotation on the words according to the word meaning database to obtain corresponding semantic annotation results.
S103: judging whether the data annotation result is consistent with the input manual annotation result or not;
if yes, go to step S104;
When the data annotation result is consistent with the input manual annotation result, the data annotation model and the human annotator have arrived at the same annotation, so the current data annotation result can be taken as correct, and researchers can use it directly in subsequent work.
In a specific embodiment, when the data annotation result is inconsistent with the input manual annotation result, the data annotation model and the human annotator have arrived at different annotations. In this case the final data annotation result can be obtained by manual audit, that is, the following steps can be further performed:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;
and outputting the data to be audited so that an auditor can audit it manually.
In a specific embodiment, when the data annotation result is inconsistent with the input manual annotation result, a final data annotation result can be obtained in a model voting manner, that is, the following steps can be further performed:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be voted;
calling a preset number of voting models to vote for the data marking result and the manual marking result of the data to be voted to obtain a corresponding voting result;
and determining the annotation result with the most votes as the final data annotation result of the data to be voted.
In a specific embodiment, the final data annotation result may also be obtained through another form of model voting, in which audit annotation models re-annotate the data, that is, the following steps may be further performed:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;
calling a preset number of audit annotation models to annotate the data to be audited and obtain corresponding audit annotation results;
and tallying the audit annotation results and determining the audit annotation result that occurs most often as the final data annotation result of the data to be audited.
S104: and confirming that the data labeling result is correct.
Based on the above technical solution, the data annotation method provided by the present application compares the data annotation result obtained by the data annotation model with the input manual annotation result and, if the two are consistent, confirms that the data annotation result is correct. The annotator who produces the manual annotation result cannot see the machine's answer. Because the method draws on automatic machine annotation, it reduces the time and monetary cost of annotation compared with independent annotation by multiple people, and thus substantially lowers cost while preserving annotation quality.
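Steps S101 to S104, together with the manual-audit branch, can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the names `annotate_with_agreement_check`, `AnnotationOutcome` and `toy_sentiment_model` are hypothetical, the task is assumed to be a simple classification task, and the manual results are assumed to have been collected without the annotator seeing the model output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# Hypothetical type: a model maps one raw item to a hashable annotation.
Annotator = Callable[[str], Hashable]


@dataclass
class AnnotationOutcome:
    # (item, annotation) pairs on which machine and human agreed
    confirmed: List[Tuple[str, Hashable]] = field(default_factory=list)
    # (item, machine annotation, manual annotation) triples that disagree
    to_audit: List[Tuple[str, Hashable, Hashable]] = field(default_factory=list)


def annotate_with_agreement_check(
    items: Iterable[str],
    model: Annotator,
    manual_results: Dict[str, Hashable],
) -> AnnotationOutcome:
    """Confirm items where machine and human agree; queue the rest for audit."""
    outcome = AnnotationOutcome()
    for item in items:                       # S101: data to be annotated
        machine = model(item)                # S102: machine annotation
        manual = manual_results[item]        # collected independently of `machine`
        if machine == manual:                # S103: consistency check
            outcome.confirmed.append((item, machine))       # S104: confirm
        else:
            outcome.to_audit.append((item, machine, manual))
    return outcome


# Toy stand-in for a trained classification model (an assumption for the demo).
def toy_sentiment_model(text: str) -> str:
    return "positive" if "good" in text else "negative"


items = ["good fish", "bad fork", "not good"]
manual = {"good fish": "positive", "bad fork": "negative", "not good": "negative"}
result = annotate_with_agreement_check(items, toy_sentiment_model, manual)
print(result.confirmed)
print(result.to_audit)
```

In this toy run only the item on which the two sources disagree reaches the audit queue, which is where the cost saving over fully independent annotation by multiple people would come from.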
For step S102 in the previous embodiment, the data annotation model may also include a syntactic annotation model. On this basis, calling the data annotation model to annotate the data to be annotated and obtain a corresponding data annotation result may be implemented by the steps shown in FIG. 2, which are described below.
Referring to FIG. 2, FIG. 2 is a flowchart of one implementation of S102 in the data annotation method of FIG. 1.
The method specifically comprises the following steps:
S201: calling a syntactic annotation model to determine the relations between words in the data to be annotated;
S202: annotating the data to be annotated as a corresponding dependency syntax tree according to the relations between words in the data to be annotated.
Referring to FIG. 3, FIG. 3 is a summary diagram of dependency relation types provided in an embodiment of the present application. In a specific embodiment, the syntactic annotation model may use the relation types summarized in FIG. 3 to annotate the data to be annotated as a corresponding dependency syntax tree according to the relations between its words.
Take the sentence "I eat fish with fork" as the data to be annotated. The syntactic annotation model captures the modification relations between the words in the sentence. Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a dependency syntax tree according to an embodiment of the present application. As shown in FIG. 4, $w_0$ is a pseudo node that points to the word serving as the root node of the sentence. A dependency arc consists of three elements, $w_i \xrightarrow{r} w_j$, where $w_i$ is the core (head) word, $w_j$ is the modifier, and $r$ is the relation type, meaning that $w_j$ modifies $w_i$ in the syntactic role $r$. Here root denotes the root node, subj the subject, obj the object, and sasubj a shared (same) subject.
Referring to FIG. 5, FIG. 5 is a schematic structural diagram of another dependency syntax tree provided in this embodiment. As shown in FIG. 5, adv denotes an adverbial and pobj a prepositional object. In a specific implementation, if the data annotation model and the manual annotation respectively produce the two answers shown in FIG. 5, further audit or voting is required to determine the final data annotation result.
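The arc notation above maps naturally onto a small data structure, and comparing two trees arc by arc shows which words would need audit or voting. The following Python sketch is illustrative only: `Arc`, `arcs_by_dependent` and `diff_trees` are hypothetical names, and the two example trees are an assumed disagreement about where "with" attaches, not a reproduction of FIG. 4 or FIG. 5.

```python
from typing import Dict, List, NamedTuple, Tuple


class Arc(NamedTuple):
    head: int   # index of the core word w_i (0 is the pseudo node w_0)
    rel: str    # relation type r
    dep: int    # index of the modifier w_j


def arcs_by_dependent(arcs: List[Arc]) -> Dict[int, Tuple[int, str]]:
    """In a dependency tree each word has exactly one head, so key by modifier."""
    return {arc.dep: (arc.head, arc.rel) for arc in arcs}


def diff_trees(machine: List[Arc], manual: List[Arc]) -> List[int]:
    """Return the indices of words whose (head, relation) differ between trees."""
    m, h = arcs_by_dependent(machine), arcs_by_dependent(manual)
    return sorted(dep for dep in m.keys() | h.keys() if m.get(dep) != h.get(dep))


words = ["<w0>", "I", "eat", "fish", "with", "fork"]

# Assumed machine analysis: "with" modifies the verb "eat".
machine_tree = [
    Arc(0, "root", 2), Arc(2, "subj", 1), Arc(2, "obj", 3),
    Arc(2, "adv", 4), Arc(4, "pobj", 5),
]
# Assumed manual analysis: identical except that "with" attaches to "fish".
manual_tree = [
    Arc(0, "root", 2), Arc(2, "subj", 1), Arc(2, "obj", 3),
    Arc(3, "adv", 4), Arc(4, "pobj", 5),
]

for index in diff_trees(machine_tree, manual_tree):
    print("disagreement on:", words[index])
```

Comparing per word rather than per sentence is a design choice of this sketch; it would let an auditor or a voting model look only at the arcs in dispute, whereas the method as described compares the annotation results as a whole.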
Referring to FIG. 6, FIG. 6 is a block diagram of a data annotation system according to an embodiment of the present disclosure.
The system may include:
a first determining module 100, configured to determine data to be annotated according to an input data annotation task;
the data annotation module 200 is configured to invoke a data annotation model to perform data annotation on data to be annotated, so as to obtain a corresponding data annotation result;
the judging module 300 is configured to judge whether the data annotation result is consistent with the input manual annotation result;
and the confirming module 400 is configured to confirm that the data annotation result is correct when the data annotation result is consistent with the input manual annotation result.
On the basis of the above embodiment, in a specific embodiment, the system may further include:
the first marking module is used for marking the data to be annotated as data to be audited if the data annotation result is inconsistent with the input manual annotation result;
and the output module is used for outputting the data to be audited so that an auditor can audit it manually.
On the basis of the above embodiment, in a specific embodiment, the system may further include:
the second marking module is used for marking the data to be marked as the data to be voted if the data marking result is inconsistent with the input manual marking result;
the calling module is used for calling the voting models with preset number to vote for the data marking results and the manual marking results of the data to be voted to obtain corresponding voting results;
and the second determining module is used for determining the annotation result with the most votes as the final data annotation result of the data to be voted.
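The voting branch handled by the second marking module, the calling module and the second determining module can be sketched as a majority vote over the two candidate annotations. This is a sketch under stated assumptions: `resolve_by_vote` and the `VotingModel` signature are hypothetical, each voting model is assumed to pick one of the two candidates, and the tie handling (raising an error) is a choice made here that the source does not specify.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical signature: a voting model sees the item and both candidate
# annotations and returns the candidate it prefers.
VotingModel = Callable[[str, Sequence[Hashable]], Hashable]


def resolve_by_vote(
    item: str,
    machine_result: Hashable,
    manual_result: Hashable,
    voters: Sequence[VotingModel],
) -> Hashable:
    """Return whichever of the two candidate annotations receives the most votes."""
    if not voters:
        raise ValueError("at least one voting model is required")
    candidates = (machine_result, manual_result)
    votes: Counter = Counter()
    for voter in voters:
        choice = voter(item, candidates)
        if choice not in candidates:
            raise ValueError(f"vote {choice!r} is not one of the candidates")
        votes[choice] += 1
    ranked = votes.most_common()
    # An odd preset number of voting models avoids ties between two candidates.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise ValueError("tie: use an odd number of voting models")
    return ranked[0][0]


# Three toy voters standing in for trained voting models (assumption).
toy_voters = [
    lambda item, cands: cands[0],   # prefers the machine result
    lambda item, cands: cands[1],   # prefers the manual result
    lambda item, cands: cands[1],   # prefers the manual result
]
final = resolve_by_vote("not good", "positive", "negative", toy_voters)
print(final)
```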
Based on the above embodiments, in a specific embodiment, the data annotation model may include a syntactic annotation model, and the data annotation module 200 may include:
the first calling submodule is used for calling a syntax annotation model to determine the relation between words in the data to be annotated;
and the labeling submodule is used for labeling the data to be labeled as the corresponding dependency syntax tree according to the relation between the words in the data to be labeled.
Based on the above embodiments, in a specific embodiment, the data annotation model may include a semantic annotation model, and the data annotation module 200 may include:
the word segmentation sub-module is used for carrying out word segmentation processing on the data to be labeled to obtain at least one word;
and the second calling submodule is used for calling the semantic annotation model to carry out semantic annotation on the words according to the word meaning database so as to obtain a corresponding semantic annotation result.
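The two semantic annotation submodules (word segmentation, then sense annotation against a word sense database) can be sketched as follows. Everything here is illustrative: `SENSE_DB`, its sense identifiers, `segment`, `annotate_senses` and `first_sense` are hypothetical, and whitespace splitting merely stands in for a real word segmenter, which Chinese text in particular would require.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical word sense database: word -> candidate sense identifiers.
SENSE_DB: Dict[str, List[str]] = {
    "bank": ["bank.finance", "bank.river"],
    "fish": ["fish.animal"],
}

# Hypothetical signature for the semantic annotation model: given a word and
# its candidate senses, return the chosen sense.
SenseChooser = Callable[[str, List[str]], str]


def segment(text: str) -> List[str]:
    """Placeholder word segmentation: lowercase and split on whitespace."""
    return text.lower().split()


def annotate_senses(
    text: str,
    sense_db: Dict[str, List[str]],
    choose_sense: SenseChooser,
) -> List[Tuple[str, Optional[str]]]:
    """Segment the text, then tag each word with a sense from the database."""
    annotated: List[Tuple[str, Optional[str]]] = []
    for word in segment(text):
        candidates = sense_db.get(word, [])
        # Words missing from the database are left unannotated (None).
        sense = choose_sense(word, candidates) if candidates else None
        annotated.append((word, sense))
    return annotated


# Toy model that always picks the first listed sense (assumption).
def first_sense(word: str, candidates: List[str]) -> str:
    return candidates[0]


senses = annotate_senses("Fish near bank", SENSE_DB, first_sense)
print(senses)
```

The resulting list of (word, sense) pairs is the kind of data annotation result that would then be compared against the manual annotation in the judging module.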
Since the embodiment of the system part corresponds to the embodiment of the method part, the embodiment of the system part is described with reference to the embodiment of the method part, and is not repeated here.
Referring to FIG. 7, FIG. 7 is a structural diagram of a data annotation device according to an embodiment of the present application.
The data annotation device 700 may vary considerably in configuration or performance. It may include one or more central processing units (CPUs) 722, a memory 732, and one or more storage media 730 (e.g., one or more mass storage devices) that store application programs 742 or data 744. The memory 732 and the storage medium 730 may provide transient or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a sequence of instruction operations for the device. Further, the processor 722 may be configured to communicate with the storage medium 730 and to execute, on the data annotation device 700, the series of instruction operations stored in the storage medium 730.
The data annotation device 700 may also include one or more power supplies 727, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps in the data annotation method described in fig. 1 to 5 above are implemented by the data annotation device based on the structure shown in fig. 7.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of modules is merely a division of logical functions, and an actual implementation may have another division, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
Modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a function calling device, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
A method, a system, a device and a readable storage medium for data annotation provided by the present application are described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
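For illustration only, the workflow set out in the method (annotate with a model, compare against the manual annotation, then confirm, audit, or vote) might be sketched in Python as follows. This is a minimal sketch, not the applicant's implementation: the name `check_annotation`, the representation of a model as a callable returning a string label, the rule that a voting model's own output counts as its vote, and the fallback to manual audit on a tie are all assumptions that the application does not specify.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical type: a "model" is any callable mapping text to a label.
Model = Callable[[str], str]


def check_annotation(
    text: str,
    model: Model,
    manual_label: str,
    voting_models: Sequence[Model] = (),
) -> Tuple[str, Optional[str]]:
    """Return (status, label) for one item of data to be annotated."""
    model_label = model(text)

    # Consistent results: the model annotation is confirmed as correct.
    if model_label == manual_label:
        return "confirmed", model_label

    # Inconsistent and no voting models: route the item to manual audit.
    if not voting_models:
        return "to_audit", None

    # Inconsistent with voting models: count votes for the two candidates.
    # Assumption: a vote is the voter's own output; outputs that match
    # neither candidate are ignored.
    votes = {model_label: 0, manual_label: 0}
    for voter in voting_models:
        choice = voter(text)
        if choice in votes:
            votes[choice] += 1

    # Assumption: a tie is left unresolved and falls back to manual audit.
    if votes[model_label] == votes[manual_label]:
        return "to_audit", None
    return "voted", max(votes, key=votes.get)
```

With an odd number of voting models that each return one of the two candidate results, a tie cannot occur, which may be one reason to prefer an odd preset number.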

Claims (10)

1. A method of data annotation, comprising:
determining data to be annotated according to an input data annotation task;
calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;
judging whether the data annotation result is consistent with an input manual annotation result or not;
and if they are consistent, confirming that the data annotation result is correct.
2. The method of claim 1, further comprising:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be audited;
and outputting the data to be audited so that auditors can manually audit the data to be audited.
3. The method of claim 1, further comprising:
if the data annotation result is inconsistent with the input manual annotation result, marking the data to be annotated as data to be voted;
calling a preset number of voting models to vote on the data annotation result and the manual annotation result of the data to be voted, to obtain a corresponding voting result;
and determining the annotation result with the most votes as the final data annotation result of the data to be voted.
4. The method of claim 1, wherein the data annotation model comprises a syntactic annotation model, and the invoking of the data annotation model for data annotation of the data to be annotated to obtain a corresponding data annotation result comprises:
calling the syntactic annotation model to determine the relation between words in the data to be annotated;
and annotating the data to be annotated as a corresponding dependency syntax tree according to the relation between words in the data to be annotated.
5. The method of claim 1, wherein the data annotation model comprises a semantic annotation model, and the invoking of the data annotation model for data annotation of the data to be annotated to obtain a corresponding data annotation result comprises:
performing word segmentation processing on the data to be labeled to obtain at least one word;
and calling the semantic annotation model to perform semantic annotation on the words according to a word meaning database to obtain corresponding semantic annotation results.
6. The method of claim 1, wherein the data annotation task comprises at least one of a sequence annotation task, a tree annotation task, and a category annotation task.
7. A system for annotating data, comprising:
the first determining module is used for determining data to be annotated according to the input data annotation task;
the data annotation module is used for calling a data annotation model to perform data annotation on the data to be annotated to obtain a corresponding data annotation result;
the judging module is used for judging whether the data annotation result is consistent with the input manual annotation result or not;
and the confirming module is used for confirming that the data annotation result is correct when the data annotation result is consistent with the input manual annotation result.
8. The system of claim 7, further comprising:
the first marking module is used for marking the data to be annotated as data to be audited if the data annotation result is inconsistent with the input manual annotation result;
and the output module is used for outputting the data to be audited so that auditors can manually audit the data to be audited.
9. A data annotation apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of data annotation according to any one of claims 1 to 6 when executing said computer program.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of data annotation according to any one of claims 1 to 6.
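
As an illustration of the dependency syntax tree named in claim 4, one common way to hold such a tree is one head index and one relation label per word. The sketch below is hypothetical and not taken from the application: the `Arc` class, the convention that head 0 denotes an artificial root, and the relation labels in the usage note are assumptions. It only checks that a set of arcs forms a single-rooted tree without cycles, which is a precondition for comparing a model tree with a manual one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Arc:
    head: int      # 1-based index of the head word; 0 means the artificial root
    relation: str  # dependency label (label set is an assumption)


def is_valid_tree(arcs: List[Arc]) -> bool:
    """Check that the per-word arcs form a single-rooted tree with no cycles."""
    n = len(arcs)
    if n == 0:
        return False
    # Every head must be the root (0) or an existing word index.
    if any(not 0 <= arc.head <= n for arc in arcs):
        return False
    # Exactly one word may attach to the artificial root.
    if sum(arc.head == 0 for arc in arcs) != 1:
        return False
    # Following heads from any word must reach the root within n steps;
    # taking more steps than there are words means a cycle.
    for start in range(1, n + 1):
        node, steps = start, 0
        while node != 0:
            node = arcs[node - 1].head
            steps += 1
            if steps > n:
                return False
    return True
```

For the three-word sentence "She reads books", a tree could be written as `[Arc(2, "nsubj"), Arc(0, "root"), Arc(2, "obj")]`. Because `Arc` is a dataclass, two such lists compare equal exactly when every head and relation matches, which gives one possible reading of "consistent" for tree annotation tasks.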
CN202110342499.4A 2021-03-30 2021-03-30 Data annotation method, system, equipment and readable storage medium Pending CN112906349A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110342499.4A CN112906349A (en) 2021-03-30 2021-03-30 Data annotation method, system, equipment and readable storage medium
PCT/CN2021/095157 WO2022205585A1 (en) 2021-03-30 2021-05-21 Data labeling method, system, and device, and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110342499.4A CN112906349A (en) 2021-03-30 2021-03-30 Data annotation method, system, equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112906349A true CN112906349A (en) 2021-06-04

Family

ID=76109514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110342499.4A Pending CN112906349A (en) 2021-03-30 2021-03-30 Data annotation method, system, equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN112906349A (en)
WO (1) WO2022205585A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114219501A (en) * 2022-02-22 2022-03-22 杭州衡泰技术股份有限公司 Sample labeling resource allocation method, device and application

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618810A (en) * 2022-12-20 2023-01-17 中化现代农业有限公司 Method and device for improving data labeling accuracy

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750523A (en) * 2019-09-12 2020-02-04 苏宁云计算有限公司 Data annotation method, system, computer equipment and storage medium
CN110880021A (en) * 2019-11-06 2020-03-13 创新奇智(北京)科技有限公司 Model-assisted data annotation system and annotation method
CN111368902A (en) * 2020-02-28 2020-07-03 北京三快在线科技有限公司 Data labeling method and device
CN111651271A (en) * 2020-05-19 2020-09-11 南京擎盾信息科技有限公司 Multi-task learning semantic annotation method and device based on legal data
US20200320171A1 (en) * 2019-04-02 2020-10-08 International Business Machines Corporation Cross-subject model-generated training data for relation extraction modeling
CN111881657A (en) * 2020-08-04 2020-11-03 厦门渊亭信息科技有限公司 Intelligent marking method, terminal equipment and storage medium
CN112381526A (en) * 2020-11-29 2021-02-19 杭州知衣科技有限公司 Data labeling system and method based on automatic verification

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162644B (en) * 2018-10-10 2022-12-20 腾讯科技(深圳)有限公司 Image set establishing method, device and storage medium
CN110069602B (en) * 2019-04-15 2021-11-19 网宿科技股份有限公司 Corpus labeling method, apparatus, server and storage medium
CN110245716B (en) * 2019-06-20 2021-05-14 杭州睿琪软件有限公司 Sample labeling auditing method and device
CN110704633B (en) * 2019-09-04 2023-07-21 平安科技(深圳)有限公司 Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
CN111259980B (en) * 2020-02-10 2023-10-03 北京小马慧行科技有限公司 Method and device for processing annotation data
CN112163424A (en) * 2020-09-17 2021-01-01 中国建设银行股份有限公司 Data labeling method, device, equipment and medium

Also Published As

Publication number Publication date
WO2022205585A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
US9230025B2 (en) Searching for information based on generic attributes of the query
US9501467B2 (en) Systems, methods, software and interfaces for entity extraction and resolution and tagging
CN112631997B (en) Data processing method, device, terminal and storage medium
CN109446341A (en) The construction method and device of knowledge mapping
US20110295864A1 (en) Iterative fact-extraction
CN107221328B (en) Method and device for positioning modification source, computer equipment and readable medium
CN107798123B (en) Knowledge base and establishing, modifying and intelligent question and answer methods, devices and equipment thereof
CN109376202B (en) NLP-based enterprise supply relationship automatic extraction and analysis method
CN112163424A (en) Data labeling method, device, equipment and medium
CN112906349A (en) Data annotation method, system, equipment and readable storage medium
CN116860949B (en) Question-answering processing method, device, system, computing equipment and computer storage medium
CN110399488A (en) File classification method and device
US11393232B2 (en) Extracting values from images of documents
CN112115252A (en) Intelligent auxiliary writing processing method and device, electronic equipment and storage medium
CN115687563A (en) Interpretable intelligent judgment method and device, electronic equipment and storage medium
CN114240672A (en) Method for identifying green asset proportion and related product
CN111597302B (en) Text event acquisition method and device, electronic equipment and storage medium
US20230244878A1 (en) Extracting conversational relationships based on speaker prediction and trigger word prediction
CN110489740A (en) Semantic analytic method and Related product
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium
CN114067343A (en) Data set construction method, model training method and corresponding device
CN114139543A (en) Entity link corpus labeling method and device
CN109144564B (en) Modification influence analysis recommendation method and system based on historical modification mode
CN114218431A (en) Video searching method and device, electronic equipment and storage medium
CN112819622A (en) Information entity relationship joint extraction method and device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604