CN114818679A - Intelligent auxiliary labeling method and system for text data - Google Patents

Intelligent auxiliary labeling method and system for text data

Info

Publication number
CN114818679A
Authority
CN
China
Prior art keywords
data
labeling
data set
pseudo
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210591077.5A
Other languages
Chinese (zh)
Inventor
杨万征
蔡超
武学敏
董乐乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinatranslate Information Technology Shanghai Co ltd
Original Assignee
Chinatranslate Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinatranslate Information Technology Shanghai Co ltd
Priority to CN202210591077.5A
Publication of CN114818679A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Abstract

The invention discloses an intelligent auxiliary labeling method and system for text data. The method comprises the following steps: acquiring a data set to be labeled; automatically pre-labeling the text data in the data set according to a data labeling rule to obtain a pseudo-labeled data set; acquiring manual modification behavior information for the pseudo-labeled data set, and performing binary classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result for the pseudo-labeled data set; and filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data. The technical scheme addresses the high labor cost, poor reliability, and low labeling efficiency and accuracy of prior-art approaches to text data labeling.

Description

Intelligent auxiliary labeling method and system for text data
Technical Field
The invention relates to the field of computers, in particular to an intelligent auxiliary labeling method and system for text data.
Background
Data annotation is one of the necessary steps in building a computer algorithm model, and it determines the upper limit of that model's performance. Existing data labeling is usually manual, and its cost is high. To reduce this cost, the prior art labels data with a machine learning model that is constructed manually in combination with a neural network.
Specifically, existing data labeling methods generally label a preprocessed original file automatically with a pre-constructed machine learning model. Constructing that model requires manually extracting text features from a large amount of data and manually labeling them to obtain a text corpus, which is then manually sorted into a training set and a verification set. The training set is input into an initialized neural network to obtain a prediction output, the prediction output is checked against the verification set, and the neural network is optimized continuously until the updated machine learning model is obtained. Finally, the machine learning model is used to label the preprocessed original file.
However, this approach requires manual identification and extraction of text features to form the training set and the verification set and to train the neural network. The labor cost is high, the reliability is poor, and the neural network model labels data with low efficiency and accuracy.
Disclosure of Invention
The invention provides an intelligent auxiliary labeling method and system for text data, and aims to solve the problems that in the prior art, a data labeling mode is high in labor cost investment, poor in reliability and low in efficiency and accuracy of data labeling.
In order to achieve the above object, according to a first aspect of the present invention, the present invention provides an intelligent auxiliary annotation method for text data, including:
acquiring a data set to be labeled;
automatically pre-labeling the text data in the data set according to a data labeling rule to obtain a pseudo-labeled data set;
acquiring manual modification behavior information of the pseudo-labeled data set, and performing binary classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result of the pseudo-labeled data set;
and filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
Preferably, the method for intelligently assisting annotation of data further includes, after the step of filtering the pseudo-annotated data set according to the evaluation result:
judging whether the high-quality marking data meet the modification standard of the manual modification behavior information;
if the high-quality labeling data do not accord with the modification standard of the manual modification behavior information, performing two-classification training on the high-quality labeling data by using a sample classifier according to the manual modification behavior information to obtain an evaluation result of the high-quality labeling data;
filtering the high-quality labeled data according to the evaluation result;
and iterating the steps until the obtained high-quality marking data meets the modification standard of the manual modification behavior information.
Preferably, the intelligent auxiliary labeling method for data further includes, before the step of automatically pre-labeling the data in the data set according to the data labeling rule:
setting part-of-speech relation rules and syntax dependence rules of data in a plurality of projects as data tagging rules according to preset data requirements of the plurality of projects;
and combining the part-of-speech relation rule and the syntactic dependency rule to generate a corresponding part-of-speech relation template and a corresponding syntactic dependency template.
Preferably, in the above intelligent auxiliary labeling method for data, the step of automatically pre-labeling the data in the data set according to the data labeling rule includes:
automatically pre-labeling the part of speech of the text data in the data set using a part-of-speech relation rule or a part-of-speech relation template; and
automatically pre-labeling the syntax of the text data in the data set using a syntactic dependency rule or a syntactic dependency template;
and combining the automatically pre-labeled part of speech and syntax to obtain a pseudo-labeled data set.
Preferably, in the above method for intelligently and auxiliarily labeling data, the step of filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data includes:
filtering error labels in the pseudo label data set according to the evaluation result to obtain secondary label data;
and automatically pre-labeling the secondary labeled data again by using a data labeling rule to obtain high-quality labeled data.
Preferably, the method for intelligently and auxiliarily labeling data further includes, after the step of obtaining high-quality labeled data:
analyzing to obtain a part-of-speech relationship and a syntactic dependency relationship corresponding to the high-quality label data;
and generating a corresponding part-of-speech relation rule and a corresponding syntactic dependency rule by using the part-of-speech relation and the syntactic dependency relation, and adding the part-of-speech relation rule and the syntactic dependency rule to the data tagging rule.
According to a second aspect of the present invention, the present invention further provides an intelligent auxiliary annotation system for data, comprising:
the data set acquisition module is used for acquiring a data set to be marked;
the pre-labeling module is used for automatically pre-labeling the text data in the data set according to the data labeling rule to obtain a pseudo-labeled data set;
the characteristic acquisition module is used for acquiring the manual modification behavior information of the pseudo-labeled data set;
the first classification training module is used for performing classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result of the pseudo-labeled data set;
and the first filtering module is used for filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
Preferably, the system for intelligently and auxiliarily labeling the data further comprises:
the standard judgment module is used for judging whether the high-quality marking data meet the modification standard of the manual modification behavior information;
the second classification training module is used for performing classification training on the high-quality labeled data by using the sample classifier according to the manual modification behavior information to obtain an evaluation result of the high-quality labeled data if the high-quality labeled data does not meet the modification standard of the manual modification behavior information;
and the second filtering module is used for filtering the high-quality labeling data according to the evaluation result until the obtained high-quality labeling data meets the modification standard of the manual modification behavior information.
Preferably, the system for intelligently and auxiliarily labeling the data further includes:
the rule setting module is used for setting part-of-speech relation rules and syntax dependence rules of data in a plurality of projects as data tagging rules according to preset data requirements of the plurality of projects;
and the template combination module is used for carrying out template combination on the part of speech relation rule and the syntactic dependency rule to generate a corresponding part of speech relation template and a corresponding syntactic dependency template.
According to a third aspect of the present invention, the present invention further provides an intelligent auxiliary annotation system for data, comprising:
a memory, a processor, and an intelligent auxiliary annotation program for data that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the intelligent auxiliary annotation method for data according to any one of the above technical schemes.
In summary, the intelligent auxiliary labeling scheme for text data provided by the present invention acquires a data set to be labeled, automatically pre-labels the text data in the data set according to the predefined data labeling rule corresponding to that text data to obtain a pseudo-labeled data set, acquires manual modification behavior information for the pseudo-labeled data set, and performs binary classification training on the pseudo-labeled data set according to that information to obtain an evaluation result. The pseudo-labeled data set can then be evaluated according to the evaluation result, and its low-quality labels filtered out, to obtain high-quality labeled data. Because the data is labeled automatically according to the predefined data labeling rule corresponding to the project, and the pseudo-labeled data set undergoes binary classification training with a sample classifier, text features do not need to be identified and extracted manually and no neural network needs to be trained, so the labor cost is low. The manual modification behavior information serves as auxiliary verification, which is more reliable than the verification set of a neural network. The pseudo-labeled data set is filtered by online learning to obtain high-quality labeled data, which reduces labeling complexity. In this way, the low efficiency and accuracy of data labeling in the prior art can be addressed.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from the structures shown in them without creative effort.
FIG. 1 is a schematic flowchart of a first method for intelligently assisting annotation of data according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a second method for intelligently assisting annotation of data according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a third method for intelligently assisting annotation of data according to an embodiment of the present invention;
FIG. 4 is a schematic flowchart of a method for automatically pre-labeling data in a data set according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a method for filtering a pseudo-labeled data set according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of a fourth method for intelligently assisting annotation of data according to an embodiment of the present invention;
FIG. 7 is a schematic flowchart of a fifth method for intelligently assisting annotation of data according to an embodiment of the present invention;
FIG. 8-a is a schematic structural diagram of a part-of-speech rule template and a syntactic dependency template according to an embodiment of the present invention;
FIG. 8-b is a schematic diagram of a pseudo-labeled data set according to an embodiment of the present invention;
FIG. 8-c is a schematic diagram of a manually verified pseudo-labeled data set according to an embodiment of the present invention;
FIG. 8-d is a schematic diagram of first high-quality labeled data according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a first intelligent auxiliary annotation system for data according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a second intelligent auxiliary annotation system for data according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a third intelligent auxiliary annotation system for data according to an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of a fourth intelligent auxiliary annotation system for data according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiments of the present invention mainly address the following technical problem:
Current data labeling is usually done manually. This requires manually extracting text features from a large amount of data and manually labeling those features to obtain a training set and a verification set; only after a machine learning model has been trained can the existing preprocessed original files be labeled. The process has a high labor cost and poor reliability, and the neural network model labels data with relatively low efficiency and accuracy.
To solve the above problem, the following embodiments of the present invention provide an intelligent auxiliary annotation scheme for data. A data set for the relevant project is acquired; the text data in the data set is automatically pre-labeled according to the predefined data labeling rule corresponding to that project to obtain a pseudo-labeled data set; and binary classification training is then performed on the pseudo-labeled data set according to manual modification behavior information to obtain an evaluation result. The pseudo-labeled data set can thus be evaluated according to the evaluation result and its low-quality labels filtered out to obtain high-quality labeled data. Text features do not need to be identified and extracted manually and no neural network needs to be trained, so the labor cost is low, and verification with the manual modification behavior information gives higher efficiency and accuracy than a neural network.
Referring to FIG. 1, which is a schematic flowchart of a first method for intelligently assisting annotation of data according to an embodiment of the present invention, the method includes the following steps:
S110: Acquire a data set to be labeled. In this embodiment, the relevant data set is first obtained according to the needs of a project. The data set is a collection of data to be labeled and generally belongs to a specific type of project. Because this embodiment labels data by online learning, no neural network needs to be trained, and labeling can be performed directly on the data set to obtain high-quality labeled data.
S120: Automatically pre-label the text data in the data set according to the data labeling rule to obtain a pseudo-labeled data set. The data labeling rule comprises part-of-speech relation rules and syntactic dependency rules, which capture part-of-speech combination relations and syntactic dependency relations; designing a template for the relevant project from these rules unifies part of speech and syntax. Because the data set to be labeled is usually tied to a project, each data labeling rule corresponds to a type of project, and the rule matching the data set can be selected at labeling time.
Within the data labeling rule, the part-of-speech relation rules and the syntactic dependency rules can be combined into templates according to project type, giving a part-of-speech relation template and a syntactic dependency template. Specifically, as a preferred embodiment shown in FIG. 4, the step of automatically pre-labeling the text data in the data set according to the predefined data labeling rule includes:
S121: Automatically pre-label the part of speech of the text data in the data set using a part-of-speech relation rule or a part-of-speech relation template; and
S122: Automatically pre-label the syntax of the text data in the data set using a syntactic dependency rule or a syntactic dependency template;
S123: Combine the automatically pre-labeled part of speech and syntax to obtain a pseudo-labeled data set.
In this technical scheme, automatically pre-labeling the data in the data set with the part-of-speech relation template labels the part-of-speech information of the data and avoids the inefficiency of manual labeling. Compared with the existing scheme, in which a neural network model must first be trained on data before it can label the data set, this is easier to implement and therefore more efficient. Similarly, automatically pre-labeling the syntax of the data with the syntactic dependency template labels the syntactic relations of the data, from which the relevant information in the data can be identified. Combining the automatically pre-labeled part of speech and syntax, for example by labeling both for specific data in the data set, yields the pseudo-labeled data set. Because the labels follow definite part-of-speech relations and syntactic dependency relations, they can be templated to obtain the part-of-speech relation template and the syntactic dependency template.
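Steps S121 to S123 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not specified by the source: the `Token` structure, the tag names, the two template tables, and the premise that part-of-speech tags and dependency arcs are already supplied by an upstream parser.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    text: str
    pos: str    # part-of-speech tag, assumed to come from an upstream tagger
    head: int   # index of the syntactic head, -1 for the root
    dep: str    # dependency relation to the head

# Hypothetical part-of-speech relation template: POS sequence -> label.
POS_TEMPLATES = {("ADJ", "NOUN"): "MODIFIED_ENTITY"}
# Hypothetical syntactic dependency template:
# (dependent POS, relation, head POS) -> label.
DEP_TEMPLATES = {("NOUN", "nsubj", "VERB"): "SUBJECT_OF"}

def pre_label(tokens):
    """Return pseudo labels as (label, start, end) spans, end exclusive."""
    labels = []
    # S121: match part-of-speech templates over contiguous token windows.
    for pattern, label in POS_TEMPLATES.items():
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            if tuple(t.pos for t in tokens[i:i + n]) == pattern:
                labels.append((label, i, i + n))
    # S122: match dependency templates over dependent/head arcs.
    for i, tok in enumerate(tokens):
        if tok.head < 0:
            continue
        key = (tok.pos, tok.dep, tokens[tok.head].pos)
        if key in DEP_TEMPLATES:
            lo, hi = sorted((i, tok.head))
            labels.append((DEP_TEMPLATES[key], lo, hi + 1))
    # S123: combine both kinds of label into one pseudo-labeled record.
    return sorted(labels, key=lambda span: (span[1], span[2], span[0]))

sentence = [
    Token("new", "ADJ", 1, "amod"),
    Token("system", "NOUN", 2, "nsubj"),
    Token("labels", "VERB", -1, "root"),
    Token("text", "NOUN", 2, "obj"),
]
pseudo_labels = pre_label(sentence)
```

Because the templates are plain lookup tables, adding a rule for a new project type changes data rather than code, which matches the rule-driven character of the pre-labeling step.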
After the data in the data set is automatically pre-labeled to obtain a pseudo-labeled data set, the intelligent auxiliary labeling method for data shown in fig. 1 further includes:
S130: Acquire the manual modification behavior information of the pseudo-labeled data set, and perform binary classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result for the pseudo-labeled data set.
The manual modification behavior information comprises the results of manually modifying the pseudo-labeled data set, including deleting erroneous labels, adding missing labels, and correcting erroneous label content. Based on this information, the sample classifier evaluates the labels in the pseudo-labeled data set and divides the set into high-quality pre-labeled data and low-quality pre-labeled data.
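The source does not say which model the sample classifier uses. As a stand-in, the sketch below scores each label by how often annotators kept labels produced by the same rule, and updates incrementally as new edits arrive. The class name, the smoothing, the 0.5 threshold, and the `(rule, start, end)` label format are all assumptions.

```python
from collections import defaultdict

class AcceptanceRateClassifier:
    """Stand-in binary sample classifier driven by manual edits."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.kept = defaultdict(int)   # labels per rule that survived review
        self.seen = defaultdict(int)   # labels per rule shown to annotators

    def update(self, pseudo, corrected):
        # A pseudo label counts as accepted if it survives manual review
        # unchanged; deleted or modified labels count as rejected.
        survivors = set(corrected)
        for label in pseudo:
            rule = label[0]
            self.seen[rule] += 1
            if label in survivors:
                self.kept[rule] += 1

    def score(self, label):
        rule = label[0]
        # Laplace smoothing, so a rule with no history scores 0.5.
        return (self.kept[rule] + 1) / (self.seen[rule] + 2)

    def is_high_quality(self, label):
        return self.score(label) >= self.threshold

pseudo = [("MODIFIED_ENTITY", 0, 2), ("SUBJECT_OF", 1, 3),
          ("MODIFIED_ENTITY", 4, 6), ("SUBJECT_OF", 5, 9)]
# After review: both SUBJECT_OF labels deleted, one missing label added.
corrected = [("MODIFIED_ENTITY", 0, 2), ("MODIFIED_ENTITY", 4, 6),
             ("OBJECT_OF", 2, 4)]

clf = AcceptanceRateClassifier()
clf.update(pseudo, corrected)
evaluation = {label: clf.is_high_quality(label) for label in pseudo}
```

Calling `update` again with later batches of edits refines the scores without retraining from scratch, which is one simple way to read the online learning the description mentions.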
S140: Filter the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
The evaluation result classifies labels as high quality or low quality, so filtering the whole pseudo-labeled data set yields the corresponding high-quality labeled data. This in turn lets the sample classifier display high-quality labeled data preferentially.
As a preferred embodiment, as shown in fig. 5, the step of filtering the pseudo-labeled data set according to the evaluation result to obtain the high-quality labeled data specifically includes:
S141: Filter the erroneous labels in the pseudo-labeled data set according to the evaluation result to obtain secondary labeled data.
S142: Automatically pre-label the secondary labeled data again using the data labeling rule to obtain high-quality labeled data.
The evaluation result identifies the erroneous labels in the pseudo-labeled data set and includes prompts to modify, delete, or add labels. Filtering the erroneous labels, which includes deleting or adding labels, yields the secondary labeled data. The secondary labeled data is then automatically pre-labeled again with the data labeling rule until the pre-labeled result agrees with the evaluation result, at which point the high-quality labeled data is obtained.
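One possible reading of steps S141 and S142 is sketched below in Python. The function name and its inputs are hypothetical: `evaluation` maps each pseudo label to True (high quality) or False (erroneous), and `rule_labels` stands for whatever the data labeling rules produce on the second pass. The source does not specify how the second pass is reconciled with the evaluation result; here, rejected labels are simply never reintroduced.

```python
def filter_and_relabel(pseudo, evaluation, rule_labels):
    """Sketch of S141 (filter) followed by S142 (pre-label again)."""
    # S141: drop labels judged erroneous to get the secondary labeled data.
    secondary = [label for label in pseudo if evaluation.get(label, True)]
    rejected = {label for label, ok in evaluation.items() if not ok}
    # S142: apply the rules again, adding only labels that are neither
    # already present nor rejected by the evaluation result.
    for label in rule_labels:
        if label not in rejected and label not in secondary:
            secondary.append(label)
    return secondary

pseudo = [("MODIFIED_ENTITY", 0, 2), ("SUBJECT_OF", 1, 3)]
evaluation = {("MODIFIED_ENTITY", 0, 2): True, ("SUBJECT_OF", 1, 3): False}
rule_labels = [("MODIFIED_ENTITY", 0, 2), ("SUBJECT_OF", 1, 3),
               ("OBJECT_OF", 2, 4)]
high_quality = filter_and_relabel(pseudo, evaluation, rule_labels)
```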
To sum up, the intelligent auxiliary labeling method for data provided in this embodiment acquires a data set to be labeled, automatically pre-labels the text data in the data set according to the predefined data labeling rule corresponding to the data set to obtain a pseudo-labeled data set, acquires manual modification behavior information for the pseudo-labeled data set, and performs binary classification training on the pseudo-labeled data set according to that information to obtain an evaluation result. The pseudo-labeled data set can then be evaluated according to the evaluation result, and its low-quality labels filtered out, to obtain high-quality labeled data. Because the text data is labeled automatically according to the predefined data labeling rule, and the pseudo-labeled data set undergoes binary classification training with a sample classifier, text features do not need to be identified and extracted manually and no neural network needs to be trained, so the labor cost is low. Verification uses the manual modification behavior information, which is more reliable than the verification set of a neural network. The pseudo-labeled data set is filtered by online learning to obtain high-quality labeled data, which reduces labeling complexity. In this way, the low efficiency and accuracy of data labeling in the prior art can be addressed.
In addition, "high quality" here means that the labeled data approaches and meets the modification standard of the manual modification behavior information. Specifically, the labels of the labeled data can be compared with the labels as revised by manual modification; if the agreement reaches a certain proportion (for example, 85%), the labeled data can be determined to be high-quality labeled data. Meanwhile, so that the high-quality labeled data can fully meet the modification standard of the manual modification behavior information, the labeling can be refined iteratively, which reduces the manual modification cost and gradually turns the original labeling behavior into sample confirmation behavior.
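The source gives the 85% figure but does not define the measure behind it. The Python sketch below computes two plausible readings, the share of machine labels confirmed by the manual revision and the share of manually revised labels reproduced by the machine, and applies the threshold to the first. The function names and the choice of measure are assumptions.

```python
def agreement(machine_labels, human_labels):
    """Return (precision, coverage) of machine labels against human labels."""
    machine, human = set(machine_labels), set(human_labels)
    matched = len(machine & human)
    precision = matched / len(machine) if machine else 0.0
    coverage = matched / len(human) if human else 0.0
    return precision, coverage

def meets_standard(machine_labels, human_labels, threshold=0.85):
    # Assumption: the standard is applied to the share of machine labels
    # that match the manually revised labels.
    precision, _ = agreement(machine_labels, human_labels)
    return precision >= threshold

# 20 manually revised labels; the machine reproduces 17 and adds 3 wrong ones.
human = [("ENTITY", i, i + 1) for i in range(20)]
machine = human[:17] + [("ENTITY", i, i + 2) for i in range(3)]
ok = meets_standard(machine, human)
```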
As a preferred embodiment, as shown in fig. 2, compared with the annotation method shown in fig. 1, the intelligent assisted annotation method for data provided in the embodiment of the present application further includes the following steps after the step of filtering the pseudo-annotated data set according to the evaluation result:
S150: Judge whether the high-quality labeled data meets the modification standard of the manual modification behavior information; if it does not, execute step S160. The labeled content of the high-quality labeled data is compared with the labeled content in the manual modification behavior information; when its accuracy reaches a certain proportion (for example, 90%) relative to the manual modifications, the modification standard is deemed to be met.
S160: Perform binary classification training on the high-quality labeled data using the sample classifier according to the manual modification behavior information to obtain an evaluation result for the high-quality labeled data. The original manual modification behavior information can be used for comparison with the high-quality labeled data, or the high-quality labeled data can be manually modified again to obtain new manual modification behavior information.
S170: Filter the high-quality labeled data according to the evaluation result. The filtered high-quality labeled data is judged again in step S150.
These steps are iterated until the resulting high-quality labeled data meets the modification standard of the manual modification behavior information. Repeating them continuously revises the filtered high-quality labeled data and improves the quality of the sample classifier. This reduces the manual modification cost, gradually turns the original labeling behavior into sample confirmation behavior, reduces manual intervention in labeling subsequent data sets, and improves the efficiency and accuracy of data labeling.
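The loop over S150, S160, and S170 might look like the following Python sketch. It is illustrative only: the per-rule acceptance rate stands in for the sample classifier, each round drops the labels of the least reliable rule, and the `max_rounds` guard and the `(rule, start, end)` label format are assumptions not taken from the source.

```python
def refine_until_standard(labels, human_labels, threshold=0.9, max_rounds=10):
    """Iterate S150-S170 until the modification standard is met."""
    human = set(human_labels)
    current = list(labels)
    rounds = 0
    while current and rounds < max_rounds:
        # S150: compare the current labels with the manually revised labels.
        accuracy = sum(label in human for label in current) / len(current)
        if accuracy >= threshold:
            break
        # S160: re-evaluate; acceptance rate per rule stands in for the
        # binary sample classifier.
        stats = {}
        for label in current:
            kept, seen = stats.get(label[0], (0, 0))
            stats[label[0]] = (kept + (label in human), seen + 1)
        worst = min(stats, key=lambda rule: stats[rule][0] / stats[rule][1])
        # S170: filter out the labels produced by the least reliable rule.
        current = [label for label in current if label[0] != worst]
        rounds += 1
    return current, rounds

human = [("A", i, i + 1) for i in range(9)] + [("B", 0, 1), ("B", 1, 2)]
labels = ([("A", i, i + 1) for i in range(9)]
          + [("B", i, i + 1) for i in range(4)]
          + [("C", i, i + 1) for i in range(3)])
refined, rounds = refine_until_standard(labels, human)
```

In this toy run the starting accuracy is 11 of 16; dropping rule C and then rule B leaves only confirmed labels, so the loop stops after two filtering rounds.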
As a preferred embodiment, as shown in FIG. 3, before step S120 of automatically pre-labeling the text data in the data set according to the predefined data labeling rule, the intelligent auxiliary labeling method for data further includes:
S210: Set the part-of-speech relation rules and syntactic dependency rules for the data in a plurality of projects as the data labeling rules, according to the preset data requirements of those projects.
S220: Combine the part-of-speech relation rules and the syntactic dependency rules into templates to generate the corresponding part-of-speech relation template and syntactic dependency template.
That is, the part-of-speech relation rules and syntactic dependency rules for the data in each project are set according to the data requirements preset for that project.
With reference to the embodiment shown in FIG. 7, the data labeling rules can be set in a single open extraction module. The open extraction module allows the data labeling rules and the data requirements of a plurality of projects to be configured openly, and the part-of-speech relation rules and syntactic dependency rules for the data in those projects are set according to the data requirements. Combining the part-of-speech relation rules yields the corresponding part-of-speech relation template, and combining the syntactic dependency rules in the same way yields the corresponding syntactic dependency template. When the predefined data labeling rules are used to label a data set, the data can be automatically pre-labeled with the part-of-speech relation template and syntactic dependency template of the project to which the data set belongs, giving the pseudo-labeled data set.
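A per-project rule store of this kind could be sketched in Python as below. The class and method names, the project names, and the representation of a template as a lookup table built from individual rules are all hypothetical; the source names the open extraction module but does not describe its interface.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectRules:
    pos_rules: list = field(default_factory=list)  # (POS pattern, label)
    dep_rules: list = field(default_factory=list)  # (dependency arc, label)

class OpenExtractionModule:
    """Holds data labeling rules for several projects (S210)."""

    def __init__(self):
        self._projects = {}

    def add_pos_rule(self, project, pattern, label):
        rules = self._projects.setdefault(project, ProjectRules())
        rules.pos_rules.append((tuple(pattern), label))

    def add_dep_rule(self, project, arc, label):
        rules = self._projects.setdefault(project, ProjectRules())
        rules.dep_rules.append((tuple(arc), label))

    def templates(self, project):
        # S220: combine a project's rules into two lookup-table templates.
        rules = self._projects[project]
        return dict(rules.pos_rules), dict(rules.dep_rules)

module = OpenExtractionModule()
module.add_pos_rule("news", ["ADJ", "NOUN"], "MODIFIED_ENTITY")
module.add_pos_rule("news", ["PROPN", "PROPN"], "NAME")
module.add_dep_rule("news", ["NOUN", "nsubj", "VERB"], "SUBJECT_OF")
module.add_pos_rule("medical", ["NUM", "NOUN"], "DOSAGE")
pos_template, dep_template = module.templates("news")
```

Keeping rules keyed by project means the template used for pre-labeling is selected by the project a data set belongs to, as the description requires.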
In addition, as a preferred embodiment, as shown in fig. 6, the method for intelligently assisting annotation of data further includes, after the step of obtaining high-quality annotation data:
s310: and analyzing to obtain a part-of-speech relationship and a syntactic dependency relationship corresponding to the high-quality labeling data.
S320: and generating a corresponding part-of-speech relation rule and a corresponding syntactic dependency rule by using the part-of-speech relation and the syntactic dependency relation, and adding the part-of-speech relation rule and the syntactic dependency rule to the data tagging rule.
Because the high-quality labeled data is obtained by filtering according to the evaluation result of the sample classifier and has been revised according to the manual modification behavior information, it has a certain degree of accuracy. Analyzing it to obtain the corresponding part-of-speech relation rules and syntactic dependency rules, and adding these to the data labeling rules, allows a small-sample classification model to be incorporated into the labeling process, which reduces the workload of manual labeling and improves the accuracy of model labeling.
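Steps S310 and S320 can be sketched as follows, under the assumption that a new rule is simply a part-of-speech tag sequence that recurs in confirmed spans. The function name `induce_pos_rules` and the `min_count` threshold are hypothetical; the embodiment does not specify how rules are generated from the analyzed relations.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def induce_pos_rules(
    confirmed_spans: Iterable[List[Tuple[str, str]]],
    existing_rules: Set[Tuple[str, ...]],
    min_count: int = 2,  # assumed threshold, not from the source
) -> Set[Tuple[str, ...]]:
    """Add tag sequences seen at least `min_count` times in confirmed spans."""
    counts = Counter(tuple(tag for _, tag in span) for span in confirmed_spans)
    new_rules = {pattern for pattern, n in counts.items() if n >= min_count}
    return existing_rules | new_rules


rules = {("n",)}
confirmed = [
    [("fullerene", "n"), ("liposome", "n")],
    [("soybean", "n"), ("lecithin", "n")],
    [("purified", "v"), ("water", "n")],
]
rules = induce_pos_rules(confirmed, rules)
print(sorted(rules))
```

Here the pattern noun-noun occurs twice and is added to the rule set, while verb-noun occurs only once and is left out.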
In addition, as a preferred embodiment, as shown in Fig. 7, an embodiment of the present application further provides an intelligent auxiliary labeling method for data, including the following steps:
S401: constructing a data set.
S402: predefining data labeling rules in the open extraction module. The rules include part-of-speech template selection, part-of-speech template addition, syntactic template selection, and syntactic template addition.
S403: pre-labeling. Pre-labeling is performed according to the data labeling rules predefined by the open extraction module to obtain a pseudo-labeled data set.
S404: manual labeling. The pseudo-labeled data set is modified manually, for example by deleting wrong labels and adding missing labels.
S405: obtaining modification behavior information. The modification behavior information is obtained from the manual labeling behavior.
S406: training a sample classifier. A binary sample classifier is trained on the user's labeling behavior, that is, the modification behavior information, so as to evaluate the labeling results.
S407: screening the data to obtain high-quality labeled data. Low-quality samples are filtered out according to the evaluation result of the sample classifier, and high-quality pre-labeled data is displayed preferentially.
Steps S405 to S407 are repeated to continuously improve the quality of the sample classifier and preferentially display high-quality pre-labeled data, which reduces the cost of manual modification and gradually turns the original labeling behavior into sample confirmation behavior.
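The repeated cycle of steps S404 to S407 can be sketched as an online-learning loop. The sketch below substitutes a deliberately simple frequency-based binary classifier for the sample classifier, whose actual form the embodiment does not specify; the class `AcceptanceClassifier`, the function `annotate`, the batch size, and the score threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


class AcceptanceClassifier:
    """Toy binary classifier: smoothed acceptance rate per feature value."""

    def __init__(self) -> None:
        # feature -> [number rejected, number accepted]
        self.stats: Dict[Hashable, List[int]] = defaultdict(lambda: [0, 0])

    def update(self, feature: Hashable, accepted: bool) -> None:
        self.stats[feature][int(accepted)] += 1

    def score(self, feature: Hashable) -> float:
        rejected, accepted = self.stats.get(feature, (0, 0))
        return (accepted + 1) / (accepted + rejected + 2)  # add-one smoothing


def annotate(
    candidates: List[Tuple[str, Hashable]],
    review: Callable[[str], bool],
    batch_size: int = 2,
    threshold: float = 0.4,
) -> List[str]:
    clf = AcceptanceClassifier()
    confirmed: List[str] = []
    pending = list(candidates)
    while pending:
        # S407: drop low-scoring candidates and show the best ones first
        pending = [c for c in pending if clf.score(c[1]) >= threshold]
        pending.sort(key=lambda c: clf.score(c[1]), reverse=True)
        batch, pending = pending[:batch_size], pending[batch_size:]
        for text, feature in batch:
            accepted = review(text)        # S404/S405: manual check yields behavior info
            clf.update(feature, accepted)  # S406: online update of the binary classifier
            if accepted:
                confirmed.append(text)
    return confirmed


candidates = [
    ("cholesterol", "n"),
    ("beta-sitosterol and Tween", "ncn"),
    ("lecithin", "n"),
    ("A and B", "ncn"),
    ("C and D", "ncn"),
]
print(annotate(candidates, review=lambda text: " and " not in text))
```

In this toy run the reviewer rejects one merged span, after which the remaining spans with the same feature fall below the threshold and are never shown, so only three of the five candidates require manual attention.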
Specifically, the labeling of patent data is taken as an example, as shown in Figs. 8-a to 8-d.
Firstly, patent sample data is collected, for example Chinese invention patents from a specific year (such as 2019).
Secondly, part-of-speech rule templates and syntactic dependency templates are constructed, such as [ vn ] + n, mn, ATTn, CMP_VOB, etc.; see Fig. 8-a for details.
Thirdly, the patent sample data is pre-labeled according to these rules, and the result is shown in Fig. 8-b. Fullerene liposome, hydrogenated soybean lecithin, cholesterol, and "β-sitosterol and Tween" are labeled as products.
As can be seen in Fig. 8-b, the pre-labeling results are reasonably good; however, "β-sitosterol" and "Tween" are two separate entities, and the pre-labeling merged them into one.
Fourthly, the results are checked manually and the labeling errors are corrected, as shown in Fig. 8-c.
Fifthly, a sample classifier is trained according to the manual modification behaviors, collecting features such as feature = [ncn, VOB, 8, wp, wp], label = 0 and feature = [n, VOB, 5, wp, and, c], label = 1, including but not limited to these features, as shown in Fig. 8-d.
Sixthly, the remaining pre-labeled data is screened with the sample classifier, and specific target samples are selected and submitted for manual confirmation.
Finally, the procedure is repeated from the third step until the number of samples meets the requirement or another termination condition is reached.
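How a training sample for the sample classifier might be built from one pre-labeled span and the corresponding manual action can be sketched as follows. The meaning assigned to each feature position (part-of-speech pattern, dependency relation, character length, neighboring tags) is an assumption inferred from the example features above, and the names `Candidate`, `to_feature`, and `to_label` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    tokens: List[Tuple[str, str]]  # (word, part-of-speech tag) of the labeled span
    dep_relation: str              # dependency relation of the span, e.g. "VOB"
    left_tag: str                  # tag of the token before the span
    right_tag: str                 # tag of the token after the span


def to_feature(c: Candidate) -> list:
    """Assumed layout: [POS pattern, dependency relation, length, left tag, right tag]."""
    pos_pattern = "".join(tag for _, tag in c.tokens)
    length = sum(len(word) for word, _ in c.tokens)
    return [pos_pattern, c.dep_relation, length, c.left_tag, c.right_tag]


def to_label(action: str) -> int:
    """1 if the annotator kept the pre-label unchanged, 0 if it was edited or deleted."""
    return 1 if action == "keep" else 0


merged = Candidate([("sitosterol", "n"), ("and", "c"), ("Tween", "n")], "VOB", "wp", "wp")
single = Candidate([("cholesterol", "n")], "VOB", "wp", "c")

training_set = [
    (to_feature(merged), to_label("split")),  # annotator split the merged span
    (to_feature(single), to_label("keep")),   # annotator confirmed the span
]
for feature, label in training_set:
    print(feature, label)
```

The label is derived purely from the annotator's behavior, so no separate gold-standard set is needed to train the classifier.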
Based on the same concept as the method embodiments above, an embodiment of the invention further provides an intelligent auxiliary labeling system for text data, which is used to implement the method of the invention.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an intelligent auxiliary labeling system for text data according to an embodiment of the present invention. As shown in fig. 9, the intelligent auxiliary annotation system for text data includes:
a data set acquisition module 101, configured to acquire a data set to be labeled;
a pre-labeling module 102, configured to automatically pre-label the text data in the data set according to a data labeling rule to obtain a pseudo-labeled data set;
a feature acquisition module 103, configured to acquire the manual modification behavior information of the pseudo-labeled data set;
a first classification training module 104, configured to perform binary classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result of the pseudo-labeled data set; and
a first filtering module 105, configured to filter the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
To sum up, the intelligent auxiliary labeling system for text data provided in the embodiment of the present application obtains a data set of related items and automatically pre-labels the text data in the data set according to the predefined data labeling rules corresponding to the data set, obtaining a pseudo-labeled data set. It then obtains the manual modification behavior information of the pseudo-labeled data set and performs binary classification training on the pseudo-labeled data set according to that information to obtain an evaluation result, so that the pseudo-labeled data set can be evaluated, low-quality labels can be filtered out, and high-quality labeled data can be obtained. Because the data is labeled automatically according to the predefined data labeling rules corresponding to the item, and the pseudo-labeled data set is subjected to binary classification training with the sample classifier, text features do not need to be identified and extracted manually, no neural network needs to be trained, and the labor cost is low. Verification based on the manual modification behavior information is also more reliable than verification against the validation set of a neural network. The pseudo-labeled data set is thus filtered in an online learning manner to obtain high-quality labeled data, and the labeling complexity is reduced. In this way, the problem of low data labeling efficiency and accuracy in the prior art can be solved.
As a preferred embodiment, as shown in fig. 10, the system for intelligently assisting annotation of the data further includes:
a standard judgment module 106, configured to judge whether the high-quality labeled data meets the modification standard of the manual modification behavior information;
a second classification training module 107, configured to, if the high-quality labeled data does not meet the modification standard of the manual modification behavior information, perform binary classification training on the high-quality labeled data with the sample classifier according to the manual modification behavior information to obtain an evaluation result of the high-quality labeled data; and
a second filtering module 108, configured to filter the high-quality labeled data according to the evaluation result until the obtained high-quality labeled data meets the modification standard of the manual modification behavior information.
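The interplay of modules 106 to 108 can be sketched as a loop that stops once the labeled data meets a modification standard. The embodiment does not define the standard; the sketch assumes it is an upper bound on the share of items that an annotator still has to modify. All names (`meets_modification_standard`, `refine`, `simulated_review`, `drop_modified`) and the values of `max_modified_ratio` and `max_rounds` are hypothetical.

```python
from typing import Callable, List, Sequence


def meets_modification_standard(actions: Sequence[str], max_modified_ratio: float = 0.1) -> bool:
    """Assumed standard: at most `max_modified_ratio` of the items were modified."""
    if not actions:
        return True
    modified = sum(1 for action in actions if action != "keep")
    return modified / len(actions) <= max_modified_ratio


def refine(
    data: List[str],
    review_round: Callable[[List[str]], List[str]],
    filter_round: Callable[[List[str], List[str]], List[str]],
    max_rounds: int = 10,
) -> List[str]:
    """Repeat evaluation and filtering until the standard is met or rounds run out."""
    for _ in range(max_rounds):
        actions = review_round(data)              # module 106: gather modification behavior
        if meets_modification_standard(actions):
            return data
        data = filter_round(data, actions)        # modules 107/108: evaluate and filter
    return data


def simulated_review(items: List[str]) -> List[str]:
    # Stand-in for a human annotator: merged spans get edited, others are kept.
    return ["edit" if " and " in item else "keep" for item in items]


def drop_modified(items: List[str], actions: List[str]) -> List[str]:
    return [item for item, action in zip(items, actions) if action == "keep"]


print(refine(["a", "b and c", "d"], simulated_review, drop_modified))
```

The `max_rounds` cap is only a safeguard so that the sketch terminates even if the standard is never reached.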
As a preferred embodiment, as shown in fig. 11, the system for intelligently assisting annotation of the data further includes:
a rule setting module 110, configured to set part-of-speech relation rules and syntactic dependency rules for the data in a plurality of items as the data labeling rules, according to preset data requirements of the plurality of items; and
a template combination module 111, configured to perform template combination on the part-of-speech relation rules and the syntactic dependency rules to generate corresponding part-of-speech relation templates and syntactic dependency templates.
In addition, as a preferred embodiment, as shown in fig. 12, an embodiment of the present invention further provides an intelligent auxiliary annotation system for text data, including:
a communication line 1002, a communication module 1003, a memory 1004, a processor 1001, and an intelligent auxiliary labeling program for text data that is stored in the memory 1004 and can be run on the processor 1001; when the intelligent auxiliary labeling program is executed by the processor 1001, the steps of the intelligent auxiliary labeling method according to any one of the above technical solutions are implemented.
In summary, the intelligent auxiliary labeling system for data provided by the embodiment of the application does not require manual identification and extraction of text features or the training of a neural network, and its labor cost is low. Verification based on the manual modification behavior information is more reliable than verification against the validation set of a neural network, so the pseudo-labeled data set is filtered in an online learning manner to obtain high-quality labeled data, and the labeling complexity is reduced. In this way, the problem of low data labeling efficiency and accuracy in the prior art can be solved.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. An intelligent auxiliary labeling method for text data is characterized by comprising the following steps:
acquiring a data set to be labeled;
automatically pre-labeling the text data in the data set according to a data labeling rule to obtain a pseudo-labeled data set;
acquiring manual modification behavior information of the pseudo-labeled data set, and performing binary classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result of the pseudo-labeled data set;
and filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
2. The intelligent assisted annotation method of claim 1, wherein after the step of filtering the pseudo-annotated data set according to the evaluation result, the method further comprises:
judging whether the high-quality labeled data meets the modification standard of the manual modification behavior information;
if the high-quality labeled data does not meet the modification standard of the manual modification behavior information, performing binary classification training on the high-quality labeled data by using a sample classifier according to the manual modification behavior information to obtain an evaluation result of the high-quality labeled data;
filtering the high-quality labeled data according to the evaluation result;
and iterating the above steps until the obtained high-quality labeled data meets the modification standard of the manual modification behavior information.
3. The intelligent auxiliary labeling method according to claim 1 or 2, wherein before the step of automatically pre-labeling the text data in the data set according to the data labeling rule, the method further comprises:
setting part-of-speech relation rules and syntax dependency rules of data in a plurality of items as the data tagging rules according to preset data requirements of the plurality of items;
and performing template combination on the part of speech relation rule and the syntactic dependency rule to generate a corresponding part of speech relation template and a corresponding syntactic dependency template.
4. The intelligent auxiliary labeling method according to claim 3, wherein the step of automatically pre-labeling the text data in the data set according to the data labeling rule comprises:
automatically pre-labeling the parts of speech of the text data in the data set by using the part-of-speech relation rule or the part-of-speech relation template; and
automatically pre-labeling the text data in the data set syntactically using the syntactical dependency rules or the syntactical dependency templates;
and synthesizing the part of speech and the syntax of the automatic pre-labeling to obtain the pseudo-labeled data set.
5. The intelligent auxiliary labeling method of claim 1, wherein the step of filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data comprises:
filtering the false annotations in the pseudo-annotation data set according to the evaluation result to obtain secondary annotation data;
and using the data labeling rule to perform automatic pre-labeling on the secondary labeling data again to obtain the high-quality labeling data.
6. The intelligent assisted annotation method of claim 1, wherein after the step of obtaining the high quality annotation data, the method further comprises:
analyzing to obtain a part-of-speech relationship and a syntactic dependency relationship corresponding to the high-quality labeling data;
and generating a corresponding part-of-speech relation rule and a corresponding syntactic dependency rule by using the part-of-speech relation and the syntactic dependency relation, and adding the part-of-speech relation rule and the syntactic dependency rule to the data tagging rule.
7. An intelligent auxiliary labeling system for text data, comprising:
the data set acquisition module is used for acquiring a data set to be marked;
the pre-labeling module is used for automatically pre-labeling the text data in the data set according to a data labeling rule to obtain a pseudo-labeled data set;
the feature acquisition module is used for acquiring the manual modification behavior information of the pseudo-labeled data set;
the first classification training module is used for performing classification training on the pseudo-labeled data set according to the manual modification behavior information to obtain an evaluation result of the pseudo-labeled data set;
and the first filtering module is used for filtering the pseudo-labeled data set according to the evaluation result to obtain high-quality labeled data.
8. The intelligent assisted labeling system of claim 7, further comprising:
the standard judgment module is used for judging whether the high-quality marking data meet the modification standard of the manual modification behavior information;
the second classification training module is used for performing classification training on the high-quality labeled data by using a sample classifier according to the manual modification behavior information to obtain an evaluation result of the high-quality labeled data if the high-quality labeled data does not meet the modification standard of the manual modification behavior information;
and the second filtering module is used for filtering the high-quality labeling data according to the evaluation result until the obtained high-quality labeling data meets the modification standard of the manual modification behavior information.
9. The intelligent assisted labeling system of claim 7 or 8, further comprising:
the rule setting module is used for setting part-of-speech relation rules and syntax dependence rules of data in a plurality of items as the data tagging rules according to preset data requirements of the plurality of items;
and the template combination module is used for carrying out template combination on the part of speech relation rule and the syntactic dependency rule to generate a corresponding part of speech relation template and a corresponding syntactic dependency template.
10. An intelligent auxiliary labeling system for text data, comprising:
a memory, a processor, and an intelligent auxiliary labeling program for data that is stored on said memory and executable on said processor, said intelligent auxiliary labeling program, when executed by said processor, implementing the steps of the intelligent auxiliary labeling method according to any one of claims 1 to 6.
CN202210591077.5A 2022-05-27 2022-05-27 Intelligent auxiliary labeling method and system for text data Pending CN114818679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210591077.5A CN114818679A (en) 2022-05-27 2022-05-27 Intelligent auxiliary labeling method and system for text data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210591077.5A CN114818679A (en) 2022-05-27 2022-05-27 Intelligent auxiliary labeling method and system for text data

Publications (1)

Publication Number Publication Date
CN114818679A true CN114818679A (en) 2022-07-29

Family

ID=82518728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210591077.5A Pending CN114818679A (en) 2022-05-27 2022-05-27 Intelligent auxiliary labeling method and system for text data

Country Status (1)

Country Link
CN (1) CN114818679A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116860979A (en) * 2023-09-04 2023-10-10 上海柯林布瑞信息技术有限公司 Medical text labeling method and device based on label knowledge base
CN116860979B (en) * 2023-09-04 2023-12-08 上海柯林布瑞信息技术有限公司 Medical text labeling method and device based on label knowledge base

Similar Documents

Publication Publication Date Title
CN109032949A (en) A kind of front-end code quality determining method and device
CN107861954A (en) Information output method and device based on artificial intelligence
CN110287098A (en) Automatically create test script method, server and computer readable storage medium
CN108710571B (en) Method and device for generating automatic test code
WO2020136959A1 (en) Cartoon generation system and cartoon generation method
CN106294186A (en) Intelligence software automated testing method
CN112764784A (en) Automatic software defect repairing method and device based on neural machine translation
CN110705283A (en) Deep learning method and system based on matching of text laws and regulations and judicial interpretations
CN114818679A (en) Intelligent auxiliary labeling method and system for text data
CN115509485A (en) Filling-in method and device of business form, electronic equipment and storage medium
CN106815253A (en) A kind of method for digging based on mixed data type data
CN108287819A (en) A method of realizing that financial and economic news is automatically associated to stock
CN110197175A (en) A kind of method and system of books title positioning and part-of-speech tagging
CN116185853A (en) Code verification method and device
CN115438655A (en) Person gender identification method and device, electronic equipment and storage medium
CN115512696A (en) Simulation training method and vehicle
CN113435213A (en) Method and device for returning answers aiming at user questions and knowledge base
CN113408253A (en) Job review system and method
CN112199372A (en) Mapping relation matching method and device and computer readable medium
CN114066402B (en) Automatic flow implementation method and system based on character recognition
CN109800405A (en) A kind of online correction processing method and processing device of technical paper document
CN112101026A (en) Corpus sample set construction method, computing device and computer storage medium
KR20190025188A (en) Document translation server and translation method for generating original and translation files individually
CN114637849B (en) Legal relation cognition method and system based on artificial intelligence
CN110209831A (en) Model generation, the method for semantics recognition, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination