CN115438645A - Text data enhancement method and system for sequence labeling task

Info

Publication number: CN115438645A
Application number: CN202211158611.XA
Authority: CN
Inventors: 何道敬; 成青园; 顾鸿杰
Original assignee: Shanghai Jingshan Technology Co ltd; East China Normal University
Current assignee: Shanghai Jingshan Technology Co ltd; East China Normal University
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2022-12-06

Abstract

A text data enhancement method and system for a sequence labeling task are provided. The method comprises the following steps: dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7; extracting entities and entity types from the training set; combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair and a plurality of key-value pairs form an entity dictionary; performing data enhancement on the training set to generate enhanced texts; and deduplicating the generated enhanced texts, merging the training set with the enhanced texts to obtain an enhanced text set, and training a deep learning model. The sequence labeling task of the application comprises a named entity recognition task or a relation extraction task. Data enhancement is performed by entity replacement, so that the context semantics among entities are effectively preserved and the generalization capability of the model is improved.

Description

Text data enhancement method and system for sequence labeling task

Technical Field

The invention relates to the technical field of natural language processing, in particular to a text data enhancement method and system for a sequence labeling task.

Background

With the development of deep learning techniques, a variety of deep learning models have been applied to the named entity recognition task and the relation extraction task. Building a deep learning model usually requires a large number of labeled samples. However, such labeled samples are lacking in specific fields, and labeling field-specific data not only requires the professional knowledge of relevant experts but also consumes a great deal of time.

Data enhancement is a method for expanding training data: the training set is transformed to increase its size, which improves the generalization capability of the model. Data enhancement methods were first applied to image data and were later extended to text data. Traditional text data enhancement methods work well on the named entity recognition task, but a data enhancement method for the relation extraction task is lacking.

The named entity recognition task and the relation extraction task are both sequence labeling tasks, and no effective solution is provided aiming at text data enhancement of the sequence labeling tasks at present.

Disclosure of Invention

In order to meet the requirement of a sequence labeling task on data enhancement, the invention aims to provide a text data enhancement method and a text data enhancement system for the sequence labeling task.

In order to realize the purpose, the technical scheme of the invention is as follows:

A text data enhancement method for a sequence labeling task comprises the following steps:

step 1: dividing a data set, namely dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7;

step 2: acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;

step 3: constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;

step 4: data enhancement, namely performing data enhancement on the training set of the sequence labeling task to generate an enhanced text;

step 5: deduplicating the enhanced texts, namely performing deduplication on the generated enhanced texts to obtain an enhanced text set;

step 6: model training, namely combining the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:

in step 4, performing data enhancement on the training set of the sequence labeling task specifically comprises the following steps:

selecting a target text from a training set of the sequence labeling task, and determining an entity to be replaced of the target text;

for each entity to be replaced, randomly deciding, by a draw from a binomial distribution with probability P, whether the entity is replaced;

if the entity to be replaced needs to be replaced, obtaining an entity list according to the entity type of the entity to be replaced and the entity dictionary, randomly selecting an entity from the entity list, and replacing the original entity;

if the entity does not need to be replaced, the entity remains unchanged;

and carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
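The replacement steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes whitespace-tokenized text with BIO labels (as in the sample given later in the description), and the names `find_entities` and `augment` and the `rng` parameter are hypothetical. Each entity is drawn independently, which is one reading of "binomial distribution of the probability P".

```python
import random


def find_entities(labels):
    """Return (start, end, entity_type) spans from BIO labels; end is exclusive."""
    spans, start, etype = [], None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        continues = label.startswith("I-") and label[2:] == etype
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans


def augment(tokens, labels, entity_dict, p, rng=random):
    """Build one enhanced text by replacing each entity with probability p.

    entity_dict maps an entity type to a list of entity strings
    (assumption: multi-word entities are stored space-separated).
    """
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, etype in find_entities(labels):
        # Copy the untouched context before this entity.
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        candidates = entity_dict.get(etype, [])
        if candidates and rng.random() < p:
            replacement = rng.choice(candidates).split()
        else:
            replacement = tokens[start:end]  # entity remains unchanged
        # Re-emit BIO labels so they match the replacement's length.
        new_tokens += replacement
        new_labels += ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels
```

Because only entity spans are rewritten and the labels are regenerated for each span, the surrounding context tokens and their labels pass through unchanged, which is how the method preserves the context between entities.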

In the text data enhancement method for the sequence labeling task, the sequence labeling task comprises a named entity recognition task or a relation extraction task.

A system for text data enhancement of a sequence annotation task, comprising:

the data acquisition module is used for acquiring and establishing a text data set of the sequence labeling task, and dividing the text data set into a training set, a verification set and a test set according to the proportion of (7);

the entity dictionary generating module is used for generating an entity dictionary by utilizing the training set of the sequence labeling task;

the data enhancement module is used for enhancing data of the training set of the sequence labeling task to generate an enhanced text;

the enhanced text deduplication module is used for performing deduplication processing on a plurality of generated enhanced texts to obtain an enhanced text set;

and the model training module is used for merging the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set and evaluating the model effect through the test set.

In the text data enhancement system for the sequence labeling task, the sequence labeling task comprises a named entity recognition task or a relation extraction task.

The entity dictionary generation module further comprises:

the entity acquisition unit is used for extracting entities and entity types from a training set of the sequence labeling task;

and the entity dictionary construction unit is used for combining different entities of each entity type into an entity list, each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form an entity dictionary.
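As a rough sketch of what the entity acquisition unit and the entity dictionary construction unit do, the following builds the dictionary from BIO-labeled training samples. The function name `build_entity_dict` and the `(tokens, labels)` sample layout are assumptions made for illustration; the source does not specify a data format.

```python
from collections import defaultdict


def build_entity_dict(training_set):
    """Map each entity type to the list of distinct entities seen in training.

    training_set: iterable of (tokens, labels) pairs, both lists, BIO labels.
    """
    seen = defaultdict(dict)  # type -> {entity: None}; keeps order, drops repeats
    for tokens, labels in training_set:
        current, etype = [], None
        # A sentinel "O" label flushes an entity that ends the sentence.
        for token, label in zip(tokens + [""], labels + ["O"]):
            continues = label.startswith("I-") and label[2:] == etype
            if current and not continues:
                seen[etype][" ".join(current)] = None
                current, etype = [], None
            if label.startswith("B-"):
                current, etype = [token], label[2:]
            elif label.startswith("I-") and current:
                current.append(token)
    # Each (entity type, entity list) is one key-value pair of the dictionary.
    return {etype: list(entities) for etype, entities in seen.items()}
```

Only the training set is scanned, so entities that occur solely in the verification or test set never enter the dictionary.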

The data enhancement module further comprises:

the target text selection unit selects a target text from the training set of the sequence labeling task and determines an entity to be replaced of the target text;

the entity replacement unit randomly selects whether the entity to be replaced carries out entity replacement or not under the binomial distribution of the probability P;

if the entity does not need to be replaced, the entity remains unchanged;

and the enhanced text generation unit is used for carrying out replacement operation on all entity types to be replaced in the target text to obtain the enhanced text.
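To make the role of the probability P concrete, suppose (as an assumption; the source only says "binomial distribution of the probability P") that each of the n entities to be replaced in a target text is drawn independently. The number K of entities that are actually replaced then follows:

```latex
\Pr(K = k) = \binom{n}{k} P^{k} (1 - P)^{n - k}, \qquad
\mathbb{E}[K] = nP, \qquad
\Pr(K = 0) = (1 - P)^{n}
```

For an illustrative case of n = 2 and P = 0.5, the probability that no entity is replaced is 0.25, in which case the generated text is identical to the target text. Repeated draws can also produce the same enhanced text more than once, which is why the enhanced texts are deduplicated.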

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a text data enhancement method and a text data enhancement system for a sequence labeling task. The data enhancement method designed by the invention can be applied to sequence labeling tasks containing entity types, such as named entity recognition tasks or relation extraction tasks. By enhancing data through entity replacement, it largely preserves the context semantics among entities, and a model built with the enhanced texts has better generalization capability.

Drawings

FIG. 1 is a flow chart of a text data enhancement method of the present invention;

FIG. 2 is a schematic view of example 1 of the present invention;

FIG. 3 is a schematic view of example 2 of the present invention;

FIG. 4 is a schematic block diagram of the structure of the text data enhancement system of the present invention.

Detailed Description

The invention is further explained with reference to the drawings and the embodiments.

Referring to FIG. 1, the present invention provides a text data enhancement method for a sequence labeling task, which can be applied to any sequence labeling task containing entity types. For such tasks, for example named entity recognition tasks or relation extraction tasks, enhanced texts are constructed by entity replacement so as to improve the generalization capability of the deep learning model. The specific method comprises the following steps:

1. dividing a data set, namely dividing a text data set of a sequence labeling task into a training set, a verification set and a test set according to the proportion of 7;

2. acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;

3. constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;

4. data enhancement, namely performing data enhancement on the training set of the sequence labeling task to generate an enhanced text;

5. deduplicating the enhanced texts, namely performing deduplication on the generated enhanced texts to obtain an enhanced text set;

6. model training, combining the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:

the data enhancement of the training set of the sequence labeling task specifically comprises the following steps:

if the entity does not need to be replaced, the entity remains unchanged;

The sequence labeling task comprises a named entity recognition task or a relation extraction task.

Table 1 gives a data enhancement sample produced according to the present invention:

TABLE 1

For a target text "BRONZE PRESIDENT leverages Wmiexec" in the training set, the corresponding sequence label is "B-Attacker I-Attacker O B-Tool", wherein the "B-Attacker I-Attacker" labels indicate that the entity type of the entity "BRONZE PRESIDENT" is Attacker, and the "B-Tool" label indicates that the entity type of the entity "Wmiexec" is Tool. Assume that the entity list of the Tool entity type in the entity dictionary contains the entity Nmap, and that the entity list of the Attacker entity type contains the entity APT41. The data enhancement mode of the invention can then produce an enhanced text s1, an enhanced text s2 and an enhanced text s3, wherein s1 replaces the entity of type Tool, s2 replaces the entity of type Attacker, and s3 replaces the entities of both types Attacker and Tool. This is only one sample of entity replacement using the entity dictionary; the number of enhanced texts obtainable from one original sentence is far more than 3. The data enhancement mode of the invention largely preserves the context semantics among the entities and can be applied to sequence labeling tasks such as named entity recognition tasks or relation extraction tasks.
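The sample can be reproduced with a short sketch. Note that it enumerates every keep-or-replace combination exhaustively instead of sampling with probability P as the method does, purely to make the three enhanced texts visible; the `segments` layout and the function name `enumerate_variants` are hypothetical, and the verb is written here as "leverages".

```python
from itertools import product


def enumerate_variants(segments, entity_dict):
    """Yield every text obtainable by keeping or swapping each entity.

    segments: list of (text, entity_type) pairs; entity_type is None for context.
    """
    options = []
    for text, etype in segments:
        if etype is None:
            options.append([text])  # context is never altered
        else:
            alternatives = [e for e in entity_dict.get(etype, []) if e != text]
            options.append([text] + alternatives)
    for combination in product(*options):
        yield " ".join(combination)


segments = [("BRONZE PRESIDENT", "Attacker"), ("leverages", None), ("Wmiexec", "Tool")]
entity_dict = {"Attacker": ["BRONZE PRESIDENT", "APT41"], "Tool": ["Wmiexec", "Nmap"]}

original, s1, s2, s3 = enumerate_variants(segments, entity_dict)
# s1 == "BRONZE PRESIDENT leverages Nmap"   (Tool replaced)
# s2 == "APT41 leverages Wmiexec"           (Attacker replaced)
# s3 == "APT41 leverages Nmap"              (both replaced)
```

The number of variants is the product of the option counts per entity, so with larger entity lists a single sentence yields far more than three enhanced texts, consistent with the remark above.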

Example 1

In this embodiment, a named entity recognition method is provided, and fig. 2 is a flowchart of a named entity recognition task implemented according to the present invention, and as shown in fig. 2, the flowchart includes the steps of the text data enhancement method, which specifically include:

1. dividing a text data set of the named entity recognition task into a training set, a verification set and a test set according to the proportion of 7;

2. extracting entities and entity types from a training set of a named entity recognition task;

3. combining different entities of each entity type into an entity list, wherein each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form an entity dictionary;

4. performing data enhancement on the training set of the named entity recognition task to generate an enhanced text;

5. carrying out deduplication processing on the generated plurality of enhancement texts;

6. combining the training set and the enhanced text, training with the named entity recognition model BERT-BiLSTM-ATTN-CRF, testing the generalization error of the model through the verification set, and evaluating the effect of the model through the test set.

The method for enhancing the data of the training set of the named entity recognition task comprises the following steps:

selecting a target text from a training set of a named entity recognition task, and determining an entity to be replaced of the target text;

if the entity does not need to be replaced, the entity remains unchanged;

and carrying out replacement operation on all entity types to be replaced in the target text to obtain an enhanced text.
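Once enhanced texts have been generated, the deduplication and merging in steps 5 and 6 might look like the sketch below. The function names and the `exclude` parameter are hypothetical. By default only repeats among the enhanced texts are dropped, which is what the source describes; additionally dropping enhanced texts that equal an original training sample is an assumption, offered as an option.

```python
def deduplicate(enhanced, exclude=()):
    """Drop repeated (tokens, labels) samples while preserving first-seen order.

    exclude: optional samples (e.g. the training set) whose exact copies
    should also be dropped; this goes beyond what the source states.
    """
    seen = {(tuple(tokens), tuple(labels)) for tokens, labels in exclude}
    unique = []
    for tokens, labels in enhanced:
        key = (tuple(tokens), tuple(labels))  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append((tokens, labels))
    return unique


def merge(training_set, enhanced):
    """Combine the training set with the deduplicated enhanced text set."""
    training_set = list(training_set)
    return training_set + deduplicate(enhanced, exclude=training_set)
```

The merged list is what would be handed to the named entity recognition model for training; the verification and test sets are left untouched so that they still measure generalization.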

The named entity recognition model can also be another named entity recognition model such as HMM, CRF, BiLSTM-CRF or BERT-BiLSTM-CRF.

Example 2

In this embodiment, a relationship extraction method is provided, fig. 3 is a flowchart of a relationship extraction task implemented according to the present invention, and as shown in fig. 3, the flowchart includes the steps of the text data enhancement method, which specifically include:

1. dividing a text data set of the relation extraction task into a training set, a verification set and a test set according to the proportion of 7;

2. extracting entities and entity types from a training set of the relationship extraction task;

3. classifying different entities according to entity types to construct an entity dictionary;

4. performing data enhancement on the training set of the relation extraction task to generate an enhanced text;

6. and combining the training set and the enhanced text, training by using a relation extraction model RIFREE, testing the generalization error of the model through a verification set, and evaluating the effect of the model through the test set.

The method for enhancing the data of the training set of the relation extraction task comprises the following steps:

selecting a target text from a training set of the relation extraction task, and determining an entity to be replaced of the target text;

if the entity does not need to be replaced, the entity remains unchanged;

The relation extraction model can also be another relation extraction model such as CNN, Attention-BLSTM or R-BERT.

FIG. 4 is a schematic block diagram of the structure of a text data enhancement system for a sequence labeling task implemented according to the present invention. As shown in FIG. 4, the system includes:

and the model training module combines the training set and the enhanced text set, performs deep learning model training, tests the generalization error of the model through the verification set, and evaluates the model effect through the test set.

The entity dictionary generation module further comprises:

and the entity dictionary constructing unit is used for combining different entities of each entity type into an entity list, each entity type and the corresponding entity list are a key value pair, and a plurality of key value pairs form the entity dictionary.

The data enhancement module further comprises:

the target text selection unit is used for selecting a target text from the training set of the sequence labeling task and determining an entity to be replaced of the target text;

if the entity does not need to be replaced, the entity remains unchanged;

It should be understood that the above-described examples of the present invention are merely illustrative for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. It will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention as claimed.

Claims

1. A text data enhancement method for a sequence labeling task is characterized by comprising the following steps:

step 2: acquiring an entity, namely extracting the entity and an entity type from a training set of a sequence labeling task;

step 3: constructing an entity dictionary, namely combining the different entities of each entity type into an entity list, wherein each entity type and its corresponding entity list form a key-value pair, and a plurality of key-value pairs form the entity dictionary;

step 6: model training, combining the training set and the enhanced text set, performing deep learning model training, testing the generalization error of the model through the verification set, and evaluating the model effect through the test set; wherein:

if the entity does not need to be replaced, the entity remains unchanged;

2. The method of claim 1, wherein the sequence annotation task comprises a named entity recognition task or a relationship extraction task.

3. A system for text data enhancement for a sequence annotation task, comprising:

the data acquisition module is used for acquiring and establishing a text data set of the sequence labeling task, and dividing the text data set into a training set, a verification set and a test set according to the proportion of 7;

the enhanced text deduplication module is used for performing deduplication processing on the generated enhanced texts to obtain an enhanced text set;

4. The system of claim 3, wherein the sequence annotation task comprises a named entity recognition task or a relationship extraction task.

5. The system of claim 3, wherein the entity dictionary generation module further comprises:

6. The system of claim 3, wherein the data enhancement module further comprises:

if the entity does not need to be replaced, the entity remains unchanged;

and the enhanced text generation unit is used for carrying out replacement operation on all entity types to be replaced in the target text to obtain an enhanced text.