CN115438645A - Text data enhancement method and system for sequence labeling task - Google Patents
Text data enhancement method and system for sequence labeling task
- Publication number
- CN115438645A
- Authority
- CN
- China
- Prior art keywords
- entity
- text
- task
- replaced
- enhanced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Machine Translation (AREA)
Abstract
A text data enhancement method and system for a sequence labeling task are provided. The method comprises the following steps: dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7; extracting entities and entity types from the training set of the sequence labeling task; combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form an entity dictionary; performing data enhancement on the training set of the sequence labeling task to generate enhanced texts; and performing deduplication on the generated enhanced texts, merging the training set with the enhanced texts to obtain an enhanced text set, and performing deep learning model training. The sequence labeling task of the application includes a named entity recognition task or a relation extraction task. Performing data enhancement through entity replacement effectively preserves the context semantics among entities and improves the generalization capability of the model.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a text data enhancement method and system for a sequence labeling task.
Background
With the development of deep learning techniques, various deep learning models have been applied to the named entity recognition task and the relation extraction task. Constructing a deep learning model often requires a large number of labeled samples; however, such labeled samples are often unavailable in a specific field, and labeling data from a specific field not only requires the professional knowledge of relevant experts but also consumes a lot of time.
The data enhancement method is a method for expanding training data: the training set is transformed to enlarge the number of training samples, thereby improving the generalization capability of the model. Data enhancement methods were first applied to image data and were later extended to text data. Traditional text data enhancement methods work well when applied to the named entity recognition task, but a data enhancement method for the relation extraction task is lacking.
The named entity recognition task and the relation extraction task are both sequence labeling tasks, and at present no effective solution has been proposed for text data enhancement of sequence labeling tasks.
Disclosure of Invention
In order to meet the requirement of a sequence labeling task on data enhancement, the invention aims to provide a text data enhancement method and a text data enhancement system for the sequence labeling task.
In order to realize the purpose, the technical scheme of the invention is as follows:
a text data enhancement method for a sequence labeling task comprises the following steps:
step 1: dividing a data set, namely dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7;
step 2: acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;
step 3: constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;
step 4: data enhancement, namely performing data enhancement on the training set of the sequence labeling task to generate an enhanced text;
step 5: deduplicating the enhanced texts, namely performing deduplication processing on the plurality of generated enhanced texts to obtain an enhanced text set;
step 6: model training, namely merging the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:
in step 4, performing data enhancement on the training set of the sequence labeling task specifically comprises the following steps:
selecting a target text from a training set of the sequence labeling task, and determining an entity to be replaced of the target text;
for each entity to be replaced, randomly deciding whether the entity undergoes entity replacement according to a binomial distribution with probability P;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
According to the text data enhancement method for the sequence labeling task, the sequence labeling task comprises a named entity recognition task or a relation extraction task.
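The per-entity decision in the method can be read as a Bernoulli trial with success probability P. As a short worked formula, assuming the trials for different entities are independent (the text does not say so explicitly) and writing n for the number of entities to be replaced in the target text and K for the number actually selected (both symbols are introduced here for illustration):

```latex
K \sim \mathrm{Binomial}(n, P), \qquad
\Pr(K = k) = \binom{n}{k} P^{k} (1 - P)^{n-k}, \qquad
\mathbb{E}[K] = nP, \qquad
\Pr(K = 0) = (1 - P)^{n}.
```

Under these assumptions, a fraction of about (1 - P)^n of the generated texts leaves every entity untouched, which is one reason a deduplication step is useful after generation.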
A system for text data enhancement of a sequence annotation task, comprising:
the data acquisition module is used for acquiring and establishing a text data set of the sequence labeling task, and dividing the text data set into a training set, a verification set and a test set according to the proportion of (7);
the entity dictionary generating module is used for generating an entity dictionary by utilizing the training set of the sequence labeling task;
the data enhancement module is used for enhancing data of the training set of the sequence labeling task to generate an enhanced text;
the enhanced text deduplication module is used for performing deduplication processing on a plurality of generated enhanced texts to obtain an enhanced text set;
and the model training module is used for merging the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set and evaluating the model effect through the test set.
In the text data enhancement system, the sequence labeling task comprises a named entity recognition task or a relation extraction task.
The entity dictionary generation module further comprises:
the entity acquisition unit is used for extracting entities and entity types from a training set of the sequence labeling task;
and the entity dictionary construction unit is used for combining different entities of each entity type into an entity list, each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form an entity dictionary.
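The entity dictionary described here maps each entity type to the list of different entities of that type. A minimal Python sketch follows; the function name `build_entity_dictionary` and the input format (a list of `(entity, entity_type)` pairs) are assumptions made for illustration and are not specified by the patent.

```python
from collections import defaultdict


def build_entity_dictionary(annotated_entities):
    """Group distinct entities by type: {entity_type: [entity, ...]}."""
    grouped = defaultdict(list)
    for entity, entity_type in annotated_entities:
        # Store each different entity once per type, in first-seen order.
        if entity not in grouped[entity_type]:
            grouped[entity_type].append(entity)
    return dict(grouped)


# Hypothetical (entity, type) pairs, as they might come out of a training set.
pairs = [
    ("BRONZE PRESIDENT", "Attacker"),
    ("Wmiexec", "Tool"),
    ("APT41", "Attacker"),
    ("Nmap", "Tool"),
    ("Nmap", "Tool"),  # repeated mention, stored only once
]
entity_dict = build_entity_dictionary(pairs)
```

Each key-value pair of `entity_dict` corresponds to one entity type and its entity list.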
The data enhancement module further comprises:
the target text selection unit selects a target text from the training set of the sequence labeling task and determines an entity to be replaced of the target text;
the entity replacement unit is used for randomly deciding, according to a binomial distribution with probability P, whether the entity to be replaced undergoes entity replacement;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and the enhanced text generation unit is used for carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a text data enhancement method and a text data enhancement system for a sequence labeling task. The data enhancement method designed by the invention can be applied to sequence labeling tasks containing entity types, such as named entity recognition tasks or relation extraction tasks. Data enhancement by entity replacement largely preserves the context semantics among entities, and a model trained with the enhanced texts has better generalization capability.
Drawings
FIG. 1 is a flow chart of a text data enhancement method of the present invention;
FIG. 2 is a schematic view of example 1 of the present invention;
FIG. 3 is a schematic view of example 2 of the present invention;
FIG. 4 is a block diagram schematically illustrating the structure of the text data enhancement system of the present invention.
Detailed Description
The invention is further explained with reference to the drawings and the embodiments.
Referring to fig. 1, the present invention provides a text data enhancement method for a sequence annotation task, which can be applied to a sequence annotation task containing an entity type. And for sequence labeling tasks containing entity types, such as named entity recognition tasks or relationship extraction tasks, constructing an enhanced text by using an entity replacement method so as to improve the generalization capability of the deep learning model. The specific method comprises the following steps:
1. dividing a data set, namely dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7;
2. acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;
3. constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;
4. data enhancement, namely performing data enhancement on the training set of the sequence labeling task to generate an enhanced text;
5. deduplicating the enhanced texts, namely performing deduplication processing on the plurality of generated enhanced texts to obtain an enhanced text set;
6. model training, combining the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:
the data enhancement of the training set of the sequence labeling task specifically comprises the following steps:
selecting a target text from a training set of the sequence labeling task, and determining an entity to be replaced of the target text;
for each entity to be replaced, randomly deciding whether the entity undergoes entity replacement according to a binomial distribution with probability P;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
The sequence labeling task comprises a named entity recognition task or a relation extraction task.
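The replacement procedure of the data enhancement step can be sketched in Python as below. This is only an illustration under stated assumptions: samples are whitespace tokens with BIO labels, a multi-word replacement is split on spaces, and the names `find_spans`, `replace_entities`, `p` and `rng` are hypothetical rather than taken from the patent.

```python
import random


def find_spans(labels):
    """Return (start, end, type) for each BIO-labeled entity; end is exclusive."""
    spans, start, current = [], None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing span
        inside = label.startswith("I-") and current == label[2:]
        if start is not None and not inside:
            spans.append((start, i, current))
            start, current = None, None
        if label.startswith("B-"):
            start, current = i, label[2:]
    return spans


def replace_entities(tokens, labels, entity_dict, p, rng=random):
    """Replace each entity, with probability p, by a random entity of the same type."""
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, etype in find_spans(labels):
        # Copy the non-entity context unchanged.
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        candidates = entity_dict.get(etype, [])
        if candidates and rng.random() < p:
            words = rng.choice(candidates).split()  # replace the entity
        else:
            words = tokens[start:end]  # keep the entity
        new_tokens += words
        new_labels += ["B-" + etype] + ["I-" + etype] * (len(words) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels
```

Calling `replace_entities` repeatedly on the same target text yields different enhanced texts; labels are regenerated for each replaced entity so that the token and label sequences stay aligned when the replacement has a different number of words.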
Table 1 gives a data enhancement sample produced according to the present invention:
TABLE 1
For a target text "BRONZE PRESIDENT leverages Wmiexec" in the training set, the corresponding sequence label is "B-Attacker I-Attacker O B-Tool", wherein the "B-Attacker I-Attacker" sequence label indicates that the entity type of the entity "BRONZE PRESIDENT" is Attacker, and the "B-Tool" sequence label indicates that the entity type of the entity "Wmiexec" is Tool. Assume that the entity list of the Tool entity type in the entity dictionary contains the entity Nmap, and that the entity list of the Attacker entity type contains the entity APT41. The data enhancement mode of the invention can then obtain an enhanced text s1, an enhanced text s2 and an enhanced text s3, wherein the enhanced text s1 replaces the entity whose entity type is Tool, the enhanced text s2 replaces the entity whose entity type is Attacker, and the enhanced text s3 replaces the entities of both the Attacker and Tool entity types. This is only one sample of entity replacement using the entity dictionary; the number of enhanced texts obtainable from one original sentence can be far more than 3. The data enhancement mode of the invention largely preserves the context semantics among the entities and can be applied to sequence labeling tasks, such as named entity recognition tasks or relation extraction tasks.
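The three enhanced texts described in this sample can be reproduced by enumerating every keep-or-replace combination. The sketch below is illustrative only: the segment representation and the function name `enumerate_enhanced_texts` are assumptions, the non-entity word is written as `leverages`, and the entity dictionary is restricted to the two replacement entities named in the description.

```python
from itertools import product

# Target text split into (text, entity_type) segments; None marks non-entity context.
segments = [("BRONZE PRESIDENT", "Attacker"), ("leverages", None), ("Wmiexec", "Tool")]
# Entity dictionary limited to the replacements named in the description.
entity_dict = {"Attacker": ["APT41"], "Tool": ["Nmap"]}


def enumerate_enhanced_texts(segments, entity_dict):
    """List every text reachable by replacing at least one entity."""
    options = []
    for text, etype in segments:
        # An entity may stay as-is or take any other same-type dictionary entry.
        alternatives = [text] + [e for e in entity_dict.get(etype, []) if e != text]
        options.append(alternatives)
    original = " ".join(text for text, _ in segments)
    texts = [" ".join(choice) for choice in product(*options)]
    return [t for t in texts if t != original]


for text in enumerate_enhanced_texts(segments, entity_dict):
    print(text)
# BRONZE PRESIDENT leverages Nmap   (s1: Tool replaced)
# APT41 leverages Wmiexec           (s2: Attacker replaced)
# APT41 leverages Nmap              (s3: both replaced)
```

With larger entity lists the number of combinations grows as the product of the list sizes, which is consistent with the remark that far more than 3 enhanced texts can be obtained from one sentence.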
Example 1
In this embodiment, a named entity recognition method is provided, and fig. 2 is a flowchart of a named entity recognition task implemented according to the present invention, and as shown in fig. 2, the flowchart includes the steps of the text data enhancement method, which specifically include:
1. dividing a text data set of the named entity recognition task into a training set, a verification set and a test set according to the proportion of 7;
2. extracting entities and entity types from a training set of a named entity recognition task;
3. combining different entities of each entity type into an entity list, wherein each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form an entity dictionary;
4. performing data enhancement on the training set of the named entity recognition task to generate an enhanced text;
5. carrying out deduplication processing on the generated plurality of enhancement texts;
6. combining the training set and the enhanced texts, training with the named entity recognition model BERT-BiLSTM-ATTN-CRF, testing the generalization error of the model through the verification set, and evaluating the effect of the model through the test set.
The method for enhancing the data of the training set of the named entity recognition task comprises the following steps:
selecting a target text from a training set of a named entity recognition task, and determining an entity to be replaced of the target text;
for each entity to be replaced, randomly deciding whether the entity undergoes entity replacement according to a binomial distribution with probability P;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and carrying out replacement operation on all entity types to be replaced in the target text to obtain an enhanced text.
The named entity recognition model may alternatively be another named entity recognition model, such as HMM, CRF, BiLSTM-CRF, or BERT-BiLSTM-CRF.
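For the named entity recognition case, extracting entities and entity types from the training set (step 2 above) could look like the following Python sketch. It assumes BIO-style labels of the form `B-Type` / `I-Type` / `O`, as in the sample sequence label given earlier; the function name `extract_entities` is hypothetical.

```python
def extract_entities(tokens, labels):
    """Collect (entity_text, entity_type) pairs from one BIO-labeled sentence."""
    entities, words, current = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("I-") and current == label[2:]:
            words.append(token)  # continue the currently open entity
            continue
        if words:  # any other label closes the open entity
            entities.append((" ".join(words), current))
            words, current = [], None
        if label.startswith("B-"):
            words, current = [token], label[2:]  # open a new entity
    if words:  # close an entity that runs to the end of the sentence
        entities.append((" ".join(words), current))
    return entities
```

The resulting pairs from all training sentences can then be grouped by type to form the entity dictionary of step 3.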
Example 2
In this embodiment, a relationship extraction method is provided, fig. 3 is a flowchart of a relationship extraction task implemented according to the present invention, and as shown in fig. 3, the flowchart includes the steps of the text data enhancement method, which specifically include:
1. dividing a text data set of the relation extraction task into a training set, a verification set and a test set according to the proportion of 7;
2. extracting entities and entity types from a training set of the relationship extraction task;
3. classifying different entities according to entity types to construct an entity dictionary;
4. performing data enhancement on the training set of the relation extraction task to generate an enhanced text;
5. carrying out deduplication processing on the generated plurality of enhancement texts;
6. and combining the training set and the enhanced text, training by using a relation extraction model RIFREE, testing the generalization error of the model through a verification set, and evaluating the effect of the model through the test set.
The method for enhancing the data of the training set of the relation extraction task comprises the following steps:
selecting a target text from a training set of the relation extraction task, and determining an entity to be replaced of the target text;
for each entity to be replaced, randomly deciding whether the entity undergoes entity replacement according to a binomial distribution with probability P;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
The relation extraction model may alternatively be another relation extraction model, such as CNN, Attention-BLSTM, or R-BERT.
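Step 1 of both embodiments divides the data set into a training set, a verification set and a test set. A minimal Python sketch of such a split is given below. Because only the leading "7" of the proportion survives in the text above, the sketch takes the three ratio parts as a caller-supplied argument instead of fixing them; `split_dataset`, `ratios` and `seed` are hypothetical names.

```python
import random


def split_dataset(samples, ratios, seed=0):
    """Shuffle samples and split them into train/validation/test parts by ratios."""
    if len(ratios) != 3 or min(ratios) < 0 or sum(ratios) <= 0:
        raise ValueError("ratios must be three non-negative numbers with a positive sum")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    total = sum(ratios)
    n_train = int(len(shuffled) * ratios[0] / total)
    n_val = int(len(shuffled) * ratios[1] / total)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

Only the training part would be passed on to entity extraction and data enhancement, so that the verification and test sets remain untouched by the enhanced texts.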
FIG. 4 is a block diagram schematically illustrating the structure of a text data enhancement system for a sequence annotation task implemented according to the present invention, as shown in FIG. 4, the system includes:
the data acquisition module is used for acquiring and establishing a text data set of the sequence labeling task, and dividing the text data set into a training set, a verification set and a test set according to the proportion of (7);
the entity dictionary generating module is used for generating an entity dictionary by utilizing the training set of the sequence labeling task;
the data enhancement module is used for enhancing data of the training set of the sequence labeling task to generate an enhanced text;
the enhanced text deduplication module is used for performing deduplication processing on a plurality of generated enhanced texts to obtain an enhanced text set;
and the model training module combines the training set and the enhanced text set, performs deep learning model training, tests the generalization error of the model through the verification set, and evaluates the model effect through the test set.
In the text data enhancement system, the sequence labeling task comprises a named entity recognition task or a relation extraction task.
The entity dictionary generation module further comprises:
the entity acquisition unit is used for extracting entities and entity types from a training set of the sequence labeling task;
and the entity dictionary constructing unit is used for combining different entities of each entity type into an entity list, each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form the entity dictionary.
The data enhancement module further comprises:
the target text selection unit is used for selecting a target text from the training set of the sequence labeling task and determining an entity to be replaced of the target text;
the entity replacement unit is used for randomly deciding, according to a binomial distribution with probability P, whether the entity to be replaced undergoes entity replacement;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and the enhanced text generation unit is used for carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
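The enhanced text deduplication module and the merging part of the model training module could be sketched as follows. The sample representation (a `(tokens, labels)` pair of lists) and the function names are assumptions. Dropping enhanced samples that are identical to an existing training sample goes slightly beyond what the text states, which only requires deduplicating the enhanced texts themselves.

```python
def deduplicate(enhanced, training=()):
    """Keep the first copy of each enhanced sample, skipping copies of training samples."""
    seen = {(tuple(tokens), tuple(labels)) for tokens, labels in training}
    unique = []
    for tokens, labels in enhanced:
        key = (tuple(tokens), tuple(labels))  # lists are unhashable, so compare as tuples
        if key not in seen:
            seen.add(key)
            unique.append((tokens, labels))
    return unique


def merge_for_training(training, enhanced):
    """Return the training set followed by the deduplicated enhanced text set."""
    return list(training) + deduplicate(enhanced, training)
```

The merged list is what would be fed to the deep learning model, while the verification and test sets are used unchanged.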
It should be understood that the above-described examples of the present invention are merely illustrative for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. It will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention as claimed.
Claims (6)
1. A text data enhancement method for a sequence labeling task is characterized by comprising the following steps:
step 1: dividing a data set, namely dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7;
step 2: acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;
step 3: constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;
step 4: data enhancement, namely performing data enhancement on the training set of the sequence labeling task to generate an enhanced text;
step 5: deduplicating the enhanced texts, namely performing deduplication processing on the plurality of generated enhanced texts to obtain an enhanced text set;
step 6: model training, namely merging the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:
in step 4, performing data enhancement on the training set of the sequence labeling task specifically comprises the following steps:
selecting a target text from a training set of the sequence labeling task, and determining an entity to be replaced of the target text;
for each entity to be replaced, randomly deciding whether the entity undergoes entity replacement according to a binomial distribution with probability P;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and carrying out replacement operation on all entity types to be replaced in the target text to obtain an enhanced text.
2. The method of claim 1, wherein the sequence annotation task comprises a named entity recognition task or a relationship extraction task.
3. A system for text data enhancement for a sequence annotation task, comprising:
the data acquisition module is used for acquiring and establishing a text data set of the sequence labeling task, and dividing the text data set into a training set, a verification set and a test set according to the proportion of 7;
the entity dictionary generating module is used for generating an entity dictionary by utilizing the training set of the sequence labeling task;
the data enhancement module is used for enhancing data of the training set of the sequence labeling task to generate an enhanced text;
the enhanced text deduplication module is used for performing deduplication processing on the generated enhanced texts to obtain an enhanced text set;
and the model training module combines the training set and the enhanced text set, performs deep learning model training, tests the generalization error of the model through the verification set, and evaluates the model effect through the test set.
4. The system of claim 3, wherein the sequence annotation task comprises a named entity recognition task or a relationship extraction task.
5. The system of claim 3, wherein the entity dictionary generation module further comprises:
the entity acquisition unit is used for extracting entities and entity types from a training set of the sequence labeling task;
and the entity dictionary constructing unit is used for combining different entities of each entity type into an entity list, each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form the entity dictionary.
6. The system of claim 3, wherein the data enhancement module further comprises:
the target text selection unit is used for selecting a target text from the training set of the sequence labeling task and determining an entity to be replaced of the target text;
the entity replacement unit is used for randomly deciding, according to a binomial distribution with probability P, whether the entity to be replaced undergoes entity replacement;
if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;
if the entity does not need to be replaced, the entity remains unchanged;
and the enhanced text generation unit is used for carrying out replacement operation on all entity types to be replaced in the target text to obtain an enhanced text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211158611.XA CN115438645A (en) | 2022-09-22 | 2022-09-22 | Text data enhancement method and system for sequence labeling task |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211158611.XA CN115438645A (en) | 2022-09-22 | 2022-09-22 | Text data enhancement method and system for sequence labeling task |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115438645A (en) | 2022-12-06 |
Family
ID=84248785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211158611.XA Pending CN115438645A (en) | 2022-09-22 | 2022-09-22 | Text data enhancement method and system for sequence labeling task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115438645A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116776884A (en) * | 2023-06-26 | 2023-09-19 | 中山大学 | Data enhancement method and system for medical named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||