CN111738007B - Chinese named entity recognition data enhancement algorithm based on sequence generative adversarial network - Google Patents

Chinese named entity recognition data enhancement algorithm based on sequence generative adversarial network

Info

Publication number
CN111738007B
CN111738007B (application CN202010635292.1A)
Authority
CN
China
Prior art keywords
entity
sentence
generator
dictionary
data
Prior art date
Legal status
Active
Application number
CN202010635292.1A
Other languages
Chinese (zh)
Other versions
CN111738007A (en)
Inventor
李思
王蓬辉
李明正
孙忆南
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202010635292.1A
Publication of CN111738007A
Application granted
Publication of CN111738007B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/295 Named entity recognition
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/216 Parsing using statistical methods
    • G06F16/316 Indexing structures
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention provides a method for expanding the training data of a target domain by selecting positive-sample data from source-domain data, fusing the semantic differences and label differences of sentences in the source and target domains, so as to enhance named entity recognition performance in the target domain. On the basis of a conventional Bi-LSTM + CRF model, the semantic and label differences are introduced through the state representation and reward settings of reinforcement learning, so that the trained decision network can select those sentences in the source-domain data that positively influence target-domain named entity recognition. The training data of the target domain is thereby expanded, alleviating the shortage of target-domain training data and improving the named entity recognition performance of the target domain.

Description

Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network
Technical Field
The invention relates to the field of internet technology, in particular to a method that uses a sequence generative adversarial network for data enhancement so as to improve the performance of Chinese named entity recognition.
Background
In recent years, deep learning has made great progress in image, speech, and natural language processing. Deep learning is an emerging class of machine learning algorithms whose motivation is to build neural networks that simulate the human brain for analysis and learning. In the image field, deep neural networks are used for target detection, for example combining a convolutional neural network with candidate windows to detect pedestrians in images; in the speech field, deep learning is used for speech synthesis and recognition, providing intelligent speech systems; in natural language processing, deep learning is applied to many everyday scenarios, such as analyzing users' browsing records and consumption behavior with neural networks to recommend products they may like, or training translation systems on large parallel corpora to give machines a high level of translation ability. With the rapid development of the internet, more and more user information is generated, and automatically extracting useful information from it to serve users is of great significance. Chinese named entity recognition is an upstream task of information extraction, and its development is critical to information extraction technology.
Chinese named entity recognition means identifying predefined entity categories in a given Chinese text; these categories generally include person names, place names, and time expressions. Recognizing the entities in Chinese text helps extract key entity information and understand the entities and entity relations in the text, which plays an important role in machine translation, dialogue systems, and knowledge graph construction.
Chinese named entity recognition typically involves two subtasks: (1) determining entity boundaries; (2) identifying entity types. Generally, named entity recognition is treated as a sequence labeling problem, with a labeling scheme that encodes both the type and the boundaries of each entity. Traditional methods for named entity recognition include maximum entropy models, support vector machines, and conditional random fields. In recent years, deep learning methods such as recurrent neural networks and convolutional neural networks have also been widely applied to Chinese named entity recognition, achieving high accuracy on several large corpora.
An advantage of deep learning is that it allows a neural network to capture features of the data automatically, but a large amount of data is typically required to reach high accuracy. For Chinese named entity recognition, however, the existing large corpora cover only the news domain; labeled corpora in domains such as microblogs and medicine are scarce, so a trained neural network cannot achieve good accuracy in those domains. In recent years, work on improving the accuracy of named entity recognition in such domains has mainly proceeded along two lines.
One line modifies the structure of the deep model. Multi-task learning jointly considers the recognition results of different entity types in Chinese text; attention mechanisms assign different weights to different characters so as to extract the more important information in the text. The other line introduces external resources, which can also help improve the accuracy of named entity recognition. External resources include additional entity dictionaries and large unlabeled or weakly labeled corpora, and are typically integrated into the model as additional features.
In the article "An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records", the authors use an attention mechanism to capture key information and textual features in medical text.
Firstly, a bidirectional long-short term memory network is used to obtain the context-dependent hidden-layer features of each character in the text, and then the attention weight of each character's hidden-layer feature is calculated according to the following formulas:

e_tj = tanh(W_a[x_t : x_j])

α_tj = exp(e_tj) / Σ_k exp(e_tk)

c_t = Σ_j α_tj · x_j

where x_t is the word vector of the current character, x_j is the word vector of the character at an arbitrary position j, α_tj is the weight of the character vector at position j with respect to position t, W_a is a trainable parameter, and c_t is the output of the attention layer.
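For illustration, the attention computation above can be sketched in plain Python (a toy example with made-up two-dimensional vectors; a plain dot product stands in for the trainable projection W_a, which is an assumption made for brevity and not the cited article's implementation):

```python
import math

def attention(xs, t):
    """Toy attention: score each position j against position t, softmax the
    scores, and return the weights and the weighted context vector c_t."""
    x_t = xs[t]
    # e_tj: tanh of a dot product standing in for tanh(W_a[x_t : x_j])
    es = [math.tanh(sum(a * b for a, b in zip(x_t, x_j))) for x_j in xs]
    m = max(es)
    exps = [math.exp(e - m) for e in es]      # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]            # alpha_tj, sums to 1 over j
    c_t = [sum(al * x_j[d] for al, x_j in zip(alphas, xs))
           for d in range(len(x_t))]          # c_t = sum_j alpha_tj * x_j
    return alphas, c_t

alphas, c_t = attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], t=0)
```

Positions 0 and 2 score identically against position 0 here, so they receive equal weight.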
In the article "Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition", the authors enhance, to some extent, the model's ability to recognize entities that do not appear in the training set by introducing an additional entity dictionary and hand-crafted features. The entity dictionary is incorporated as follows: if an n-gram containing a character appears in the entity dictionary, a vector of value 1 is concatenated after that character's word vector; otherwise a vector of value 0 is concatenated.
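That dictionary-feature scheme can be sketched as follows (a hypothetical minimal version with a single 0/1 indicator per character and bigram matching only; the actual feature templates in the cited article may differ):

```python
def add_dict_feature(sentence, char_vecs, entity_dict, n=2):
    """Append a 1-valued feature to a character's vector if any n-gram
    covering it appears in the entity dictionary, else a 0-valued feature."""
    out = []
    for i in range(len(sentence)):
        hit = 0.0
        for start in range(max(0, i - n + 1), i + 1):
            gram = sentence[start:start + n]
            if len(gram) == n and gram in entity_dict:
                hit = 1.0
                break
        out.append(list(char_vecs[i]) + [hit])
    return out

# Made-up 2-d character vectors; "北京" is the only dictionary entry.
feats = add_dict_feature("北京大学",
                         [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
                         {"北京"})
```

The first two characters are covered by the dictionary bigram and get the indicator 1; the last two do not.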
The inventor found in the course of research that the prior art above ("An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records" and "Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition") has the following limitations:
1. Modifying the structure of the deep model may enhance the semantic representation of the text, but it does not address the lack of a large amount of annotated data.
2. Introducing external resources requires substantial manpower and time to collect them, and effective rules must be designed to add them to the model.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for performing data enhancement on a named entity recognition task with a sequence generative adversarial network.
The invention provides a Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network, in which a generator learns the relation between entities and non-entities in the training set, a discriminator judges the quality of the data produced by the generator, and finally the training set is expanded with the high-quality generated data so as to improve named entity recognition performance.
Step one, processing sentences in a corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary.
And step two, mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence.
And step three, randomly initializing a mapping dictionary indexed to the vector, and mapping each sentence into a numerical matrix formed by connecting vectors corresponding to the entity and the non-entity.
And step four, the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting the characteristic information of the output unit related to the previous moment, and the feedforward neural network maps the characteristic information into the probability of all possible output units.
And step five, considering the influence of the current output on the whole output sequence, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps.
And step six, judging the complete sequence formed after sampling by the discriminator, giving out corresponding scores and guiding the data generation of the generator.
And step seven, according to the discriminator score obtained in step six, calculating the reward of the current sentence and the objective function of the generator; the generator model trained through back propagation and gradient updates is then used to automatically generate a large amount of data.
And step eight, performing character string matching on the data generated in the step seven and the dictionary in the step one to obtain an entity label corresponding to the generated data.
And step nine, using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary.
Step ten, extracting feature vector representation related to each character context in the input sentence by adopting a Bidirectional Long-Short Term Memory neural network (Bi-LSTM).
And step eleven, adopting conditional random field decoding to obtain the prediction label corresponding to each character, calculating the loss function, and updating the parameters of the model by back propagation.
And step twelve, continuously repeating the step nine to the step eleven, testing the trained named entity recognition model on the development set, selecting the model with the maximum F value on the development set, and storing.
Further, in the non-training case, the steps one to twelve are replaced by:
mapping sentences in a test set into a corresponding vector matrix through a word vector dictionary;
inputting the vector representation of each sentence into a bidirectional long-short term memory neural network to obtain the feature representation of each sentence related to the context;
and step three, decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set, wherein the optimal tag sequence is used as a result of named entity recognition.
Further, in the step one, the sentences in the corpus are processed, each sentence is divided into an entity part and a non-entity part according to entity marking information of the sentence, and the entity part and the non-entity part are added into a dictionary, and the specific process is as follows:
assume a text sequence c1,c2,c3,c4,c5,c6The label of { O, O, B-PER, I-PER, O, O }, c can be replaced with1c2,c5c6Classified as a non-entity part, c3c4Grouped into entity parts, which are then added to the dictionary along with their corresponding tags.
Further, in the fourth step, the bidirectional long-short term memory neural network is used to extract the feature information of the output unit related to the previous time, and the feedforward neural network maps the feature information into the probabilities of all possible output units, and the calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})

p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)

where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the LSTM in the generator at time i-1, W and b are the trainable parameter weights of the feedforward network, h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step, and softmax is used for normalization.
Further, in step five, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps:

{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)

where l_i is the output unit at the current time, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
Further, the discriminator in step six judges the complete sequence formed after sampling and gives the corresponding score. The specific calculation process is as follows:

x_i = [e(c_i); e(t_i)]

h_i = f(W_c·x_{i:i+window-1} + b_c)

o = max{h_1, h_2, …, h_m}

p = sigmoid(W_o·o + b_o)

where e(c_i) is the word vector of character c_i, t_i is the tag corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, f is the activation of the convolution layer, max is the max-pooling operation used to obtain the final text feature representation, W_o and b_o map the text feature to the probability score that the input text is real text, and sigmoid limits the output probability to between 0 and 1.
Further, the reward of the current sentence in step seven and the objective function of the generator are calculated as follows:

R_i = (1/K)·Σ_{k=1..K} Discriminator((l_1, l_2, …, l_m)_k)

J(θ) = Σ_i R_i·log G_θ(l_i | l_0, …, l_{i-1})

where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the computation performed by the discriminator, K is the number of samples of the Monte Carlo search, and θ denotes the parameters of the generator G_θ.
The invention provides a Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network. Entities and non-entities in the training set are used as the basic generation units of the generator, which avoids the problem that generated text has no labels. Meanwhile, reinforcement learning is used to connect the generator and the discriminator, solving the problem that gradient updates cannot be propagated between them. The generator learns the relation between entities and non-entities and automatically produces labeled data, which reduces the impact of lacking a large amount of annotated data and improves named entity recognition performance.
Drawings
FIG. 1 is a flow chart of a first embodiment;
FIG. 2 is a network structure diagram of the Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network according to the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. Wherein, the abbreviations and key terms appearing in this embodiment are defined as follows:
CRF, Conditional Random Field;
Bi-LSTM, a Bidirectional Long Short-Term Memory neural network;
example one
Referring to fig. 1 and 2, the present invention provides a method for applying a data enhancement algorithm based on a sequence generative adversarial network to a named entity recognition task. Specifically, during training, the method includes:
step one, processing sentences in a corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary. Assume a text sequence c1,c2,c3,c4,c5,c6The label of { O, O, B-PER, I-PER, O, O }, c can be replaced with1c2,c5c6Classified as a non-entity part, c3c4Grouped into entity parts, which are then added to the dictionary along with their corresponding tags.
And step two, mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence.
And step three, randomly initializing a mapping dictionary indexed to the vector, and mapping each sentence into a numerical matrix formed by connecting the entity vector and the non-entity vector.
And step four, the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting the characteristic information of the output unit related to the previous moment, and the feedforward neural network maps the characteristic information into the probability of all possible output units. The calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})

p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)

where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the LSTM in the generator at time i-1, W and b are the trainable parameter weights of the feedforward network, and h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step.
And step five, considering the influence of the current output on the whole output sequence, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps. The sampling process is as follows:

{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)

where l_i is the output unit at the current time, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
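The roll-out sampling can be illustrated with a toy sketch (uniform random choice stands in for sampling from the generator's output distribution, an assumption made to keep the example small; the prefix and vocabulary are made up):

```python
import random

def mc_rollout(prefix, vocab, m, K, seed=0):
    """Complete the partial sequence l_1..l_i to full length m, K times,
    producing the K Monte Carlo samples scored by the discriminator."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [tuple(prefix) + tuple(rng.choice(vocab)
                                  for _ in range(m - len(prefix)))
            for _ in range(K)]

# K = 5 completions of a 2-unit prefix up to maximum length m = 4.
samples = mc_rollout(("李明", "今天"), ["回家", "上班", "去学校"], m=4, K=5)
```

Every sample keeps the fixed prefix and differs only in the sampled suffix.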
And step six, judging the complete sequence formed after sampling by the discriminator, giving out corresponding scores and guiding the data generation of the generator. The specific calculation process is as follows:
x_i = [e(c_i); e(t_i)]

h_i = f(W_c·x_{i:i+window-1} + b_c)

o = max{h_1, h_2, …, h_m}

p = sigmoid(W_o·o + b_o)

where e(c_i) is the word vector of character c_i, t_i is the tag corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, f is the activation of the convolution layer, max is the max-pooling operation used to obtain the final text feature representation, W_o and b_o map the text feature to the probability score that the input text is real text, and sigmoid limits the output probability to between 0 and 1.
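The discriminator's scoring can be sketched in plain Python (a toy version with a single convolution kernel, ReLU as an assumed activation, and made-up embeddings and weights; a real discriminator would learn these parameters):

```python
import math

def discriminator_score(units, embed, Wc, bc, Wo, bo, window=2):
    """CNN-style discriminator sketch: embed each generated unit, slide a
    convolution window over the sequence, max-pool over positions, and
    squash the result to a real/fake probability with a sigmoid."""
    xs = [embed[u] for u in units]       # stands in for x_i = [e(c_i); e(t_i)]
    hs = []
    for i in range(len(xs) - window + 1):
        win = [v for x in xs[i:i + window] for v in x]   # concatenate window
        hs.append(max(0.0, sum(w * v for w, v in zip(Wc, win)) + bc))  # ReLU
    o = max(hs)                                          # max pooling
    return 1.0 / (1.0 + math.exp(-(Wo * o + bo)))        # sigmoid to (0, 1)

embed = {"李明/PER": [1.0, 0.0], "今天/O": [0.0, 1.0]}   # made-up embeddings
score = discriminator_score(["李明/PER", "今天/O", "李明/PER"], embed,
                            Wc=[0.5, 0.5, 0.5, 0.5], bc=0.0, Wo=1.0, bo=0.0)
```

With these weights every window scores 1.0, so the output is sigmoid(1.0).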
And step seven, according to the discriminator score obtained in step six, the reward of the current sentence and the objective function of the generator are calculated; the generator model trained through back propagation and gradient updates is then used to automatically generate a large amount of data. The specific calculation process is as follows:

R_i = (1/K)·Σ_{k=1..K} Discriminator((l_1, l_2, …, l_m)_k)

J(θ) = Σ_i R_i·log G_θ(l_i | l_0, …, l_{i-1})

where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the computation performed by the discriminator, and K is the number of samples of the Monte Carlo search.
And step eight, performing string matching between the data generated in step seven and the dictionary in step one to obtain the entity labels corresponding to the generated data.
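Step eight's label recovery can be sketched as follows (a hypothetical helper that maps each generated unit back to its tag via the entity/non-entity dictionary of step one):

```python
def label_generated(parts, part_dict):
    """Recover BIO labels for a generated sentence: units found in the
    dictionary with an entity tag X get B-X/I-X labels, others get O."""
    chars, labels = "", []
    for part in parts:
        tag = part_dict.get(part, "O")            # unmatched parts default to O
        for j in range(len(part)):
            labels.append("O" if tag == "O"
                          else ("B-" if j == 0 else "I-") + tag)
        chars += part
    return chars, labels

# Hypothetical dictionary built in step one, mapping units to tags.
sent, labels = label_generated(["李明", "今天"], {"李明": "PER", "今天": "O"})
```

The generated sentence thus arrives already annotated, ready to expand the training set.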
And step nine, using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary.
Step ten, extracting feature vector representation related to each character context in the input sentence by adopting a Bidirectional Long-Short Term Memory neural network (Bi-LSTM).
Step eleven, decoding by adopting a Conditional Random Field (CRF) to obtain the prediction label corresponding to each character, calculating the loss function, and updating the parameters of the model by back propagation.
And step twelve, continuously repeating the step nine to the step eleven, testing the trained named entity recognition model on the development set, selecting the model with the maximum F value on the development set, and storing.
In the non-training case, step one to step twelve are replaced by:
mapping sentences in a test set into a corresponding vector matrix through a word vector dictionary;
inputting the vector representation of each sentence into a saved named entity recognition model, and obtaining the feature representation of each sentence related to the context through a bidirectional long-short term memory neural network;
and step three, decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set, wherein the optimal tag sequence is used as a result of named entity recognition.
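The conditional random field decoding used in step three can be illustrated with a self-contained Viterbi sketch (toy emission and transition scores made up for the example; a real CRF learns these during training):

```python
def viterbi(emissions, transitions, tags):
    """Viterbi decoding for a linear-chain CRF: find the highest-scoring tag
    sequence under per-position emission scores and tag-transition scores."""
    score = {t: emissions[0][t] for t in tags}   # best score ending in tag t
    back = []                                    # back-pointers per position
    for i in range(1, len(emissions)):
        new, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new[t] = score[prev] + transitions[(prev, t)] + emissions[i][t]
            ptr[t] = prev
        score = new
        back.append(ptr)
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):                   # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# Toy scores: forbid the invalid transition O -> I-PER; emissions favour
# B-PER, I-PER, O at positions 0, 1, 2 respectively.
transitions = {(p, t): (-5.0 if (p == "O" and t == "I-PER") else 0.0)
               for p in tags for t in tags}
emissions = [
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 2.0},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.0},
]
path = viterbi(emissions, transitions, tags)
```

The transition penalty keeps the decoded sequence BIO-consistent even when emissions alone are ambiguous.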
In a preferred embodiment, sentences in the training set are converted into sequences of entities and non-entities according to their entity labels. The entities and non-entities in each sentence are then mapped to dense vectors of dimension n through a vector dictionary, and a bidirectional long-short term memory neural network extracts features relating the output unit at the current time to the units at previous times. The extracted features are mapped by a feedforward neural network to the probabilities of all possible output units, and the unit with the highest probability is taken as the output at the current time. Meanwhile, to solve the problem of propagating gradient updates between the generator and the discriminator, a reinforcement learning mechanism is introduced: the output of the discriminator serves as the reward and guides the generator's data generation. The data produced by the generator is then used together with the original training set to train the named entity recognition model.
The invention provides a data enhancement algorithm based on a sequence generative adversarial network for a named entity recognition model. Entities and non-entities in the training set are used as the basic generation units of the generator, so that, through string matching, the problem of generated text having no labels is avoided. Meanwhile, reinforcement learning connects the generator and the discriminator, solving the problem that gradient updates cannot be propagated between them. Finally, the generator learns the relation between entities and non-entities and automatically produces labeled data, reducing the impact of lacking a large amount of annotated data and improving named entity recognition performance.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (7)

1. A Chinese named entity recognition data enhancement method based on a sequence generative adversarial network, characterized in that a sequence generative adversarial network is adopted to learn the relation between entities and non-entities in a training set so as to enhance data and improve named entity recognition performance, the method comprising the following steps:
(1) processing sentences in the corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary;
(2) mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence;
(3) randomly initializing a mapping dictionary indexed to vectors, and mapping each sentence into a numerical matrix formed by connecting vectors corresponding to entities and non-entities;
(4) the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting characteristic information of an output unit related to previous time, and the feedforward neural network maps the characteristic information into probabilities of all possible output units;
(5) considering the influence of the output of the current unit on the whole output sequence, adopting a roll-out strategy of Monte Carlo search to sample the output unit at the later moment;
(6) the discriminator judges the complete sequence formed after sampling, gives corresponding scores and guides the data generation of the generator;
(7) calculating the reward of the current sentence and the objective function of the generator according to the fraction of the discriminator obtained in the step (6), obtaining a good generator model by utilizing back propagation and gradient updating, and automatically generating a large amount of data;
(8) performing character string matching on the data generated in the step (7) and the dictionary in the step (1) to obtain an entity label corresponding to the generated data;
(9) using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary;
(10) extracting context-dependent feature vector representations of each character in an input sentence by adopting a bidirectional long-short term memory neural network;
(11) adopting conditional random field decoding to obtain the predicted label for each character, calculating the loss function, and updating the parameters of the model by back propagation;
(12) repeating steps (9) to (11), testing the trained named entity recognition model on the development set, selecting the model with the highest F value on the development set, and saving it.
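The label-recovery idea of step (8) can be sketched as follows; `recover_labels`, the dictionary layout, and the BIO scheme shown are illustrative assumptions, not the patent's exact implementation:

```python
# Hypothetical sketch of step (8): recover entity labels for a generated
# sentence by string-matching its segments against the entity dictionary
# built in step (1).

def recover_labels(segments, entity_dict):
    """segments: list of text chunks emitted by the generator.
    entity_dict: maps an entity segment to its type (e.g. 'PER');
    non-entity segments are absent from the dictionary."""
    labels = []
    for seg in segments:
        ent_type = entity_dict.get(seg)
        if ent_type is None:                      # non-entity part: all 'O'
            labels.extend(['O'] * len(seg))
        else:                                     # entity part: B-X then I-X
            labels.append('B-' + ent_type)
            labels.extend(['I-' + ent_type] * (len(seg) - 1))
    return labels

entity_dict = {'李明': 'PER'}
print(recover_labels(['我是', '李明', '啊'], entity_dict))
# ['O', 'O', 'B-PER', 'I-PER', 'O']
```

Because the generator emits whole dictionary segments rather than single characters, every generated sentence can be labeled deterministically this way, with no manual annotation.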
2. The method of claim 1, wherein the process of entity recognition at test time comprises:
(2.1) mapping sentences in the test set into corresponding vector matrixes through a word vector dictionary;
(2.2) inputting the vector representation of each sentence into the bidirectional long-short term memory neural network to obtain the feature representation related to each sentence and context;
and (2.3) decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set as a result of named entity recognition.
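The conditional-random-field decoding of step (2.3) is conventionally done with the Viterbi algorithm. A minimal sketch, assuming the Bi-LSTM has already produced per-character emission scores (the toy scores and tag set below are illustrative, not trained values):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) per-character tag scores;
    transitions[i, j]: score of moving from tag i to tag j.
    Returns the indices of the highest-scoring tag sequence."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # total[i, j]: best score ending in tag j at time t via tag i at t-1
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

tags = ['O', 'B-PER', 'I-PER']
emissions = np.array([[2., 1., 0.],    # char 1: emission prefers O
                      [0., 0., 2.],    # char 2: emission prefers I-PER
                      [2., 0., 0.]])   # char 3: emission prefers O
transitions = np.array([[0., 0., -10.],  # O -> I-PER is heavily penalized
                        [0., 0., 0.],
                        [0., 0., 0.]])
path = viterbi_decode(emissions, transitions)
# the penalized O -> I-PER transition forces B-PER before I-PER
assert [tags[i] for i in path] == ['B-PER', 'I-PER', 'O']
```

The transition scores are what let the CRF overrule locally attractive but structurally invalid tags, which is why claim 1 decodes with a CRF instead of taking the per-character argmax.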
3. The method according to claim 1, wherein in the step (1), the sentences in the corpus are processed, each sentence is divided into entity and non-entity parts according to the entity tagging information of the sentence, and the entity and non-entity parts are added into the dictionary at the same time, and the specific process comprises the following steps:
Assume a text sequence {c_1, c_2, c_3, c_4, c_5, c_6} with labels {O, O, B-PER, I-PER, O, O}; then c_1c_2 and c_5c_6 are classified as non-entity parts and c_3c_4 as an entity part, and these segments are added to the dictionary together with their corresponding labels.
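The segmentation described in this claim can be sketched as follows; the function name and the 'ENT'/'NON' markers are illustrative:

```python
# Split a BIO-labeled character sequence into entity / non-entity segments,
# as in claim 3 (step (1) of claim 1).

def split_by_bio(chars, labels):
    segments = []          # list of (text, 'ENT' or 'NON', tags)
    buf, tags = [], []

    def flush():
        if buf:
            kind = 'NON' if tags[0] == 'O' else 'ENT'
            segments.append((''.join(buf), kind, list(tags)))
            buf.clear()
            tags.clear()

    for c, t in zip(chars, labels):
        # a B- tag, or a switch between O and entity tags, starts a new segment
        if buf and (t.startswith('B-') or (t == 'O') != (tags[0] == 'O')):
            flush()
        buf.append(c)
        tags.append(t)
    flush()
    return segments

segs = split_by_bio(list('我是李明先生'), ['O', 'O', 'B-PER', 'I-PER', 'O', 'O'])
print([(s[0], s[1]) for s in segs])
# [('我是', 'NON'), ('李明', 'ENT'), ('先生', 'NON')]
```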
4. The method of claim 1, wherein in step (4), the bidirectional long short-term memory neural network is used to extract the feature information relating the output unit to previous time steps, and the feedforward neural network maps the feature information to the probabilities of all possible output units; the calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})
p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)
where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the generator's LSTM at time i-1, and W and b are trainable parameter weights of the feedforward network; h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step, and softmax is used for normalization.
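A toy sketch of this generator step, with the LSTM cell abstracted to a simple recurrent update (dimensions and `generator_step` are illustrative stand-ins, not the patent's trained model):

```python
import numpy as np

# Toy generator step: recurrent state update, then a feedforward layer
# with softmax over the unit vocabulary, as in claim 4.
rng = np.random.default_rng(0)
hidden, vocab = 8, 5
W = rng.normal(size=(vocab, hidden))   # feedforward weights
b = np.zeros(vocab)                    # feedforward bias

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def generator_step(h_prev, l_prev_embedding):
    # stand-in for h_i = LSTM(h_{i-1}, l_{i-1}); a real model would apply
    # gated LSTM updates here instead of a plain tanh
    h_i = np.tanh(h_prev + l_prev_embedding)
    p = softmax(W @ h_i + b)           # p(l_i | l_0, ..., l_{i-1})
    return h_i, p

h, p = generator_step(np.zeros(hidden), rng.normal(size=hidden))
assert abs(p.sum() - 1.0) < 1e-9 and p.shape == (vocab,)
```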
5. The method as claimed in claim 1, wherein in the step (5), the roll-out strategy of the Monte Carlo search is adopted to sample the output units at subsequent time steps:
{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)
where l_i is the output unit at the current time step, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
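The roll-out can be sketched as repeatedly completing the current prefix to full length m; `mc_rollout` and the toy `sample_next` sampler below are illustrative assumptions:

```python
import random

# Monte Carlo roll-out (claim 5): complete the prefix (l_1, ..., l_i)
# K times up to length m by sampling from the generator's conditional
# distribution, here abstracted as sample_next.

def mc_rollout(prefix, m, K, sample_next, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(K):
        seq = list(prefix)
        while len(seq) < m:
            seq.append(sample_next(seq, rng))
        samples.append(tuple(seq))
    return samples

vocab = ['a', 'b', 'c']
rollouts = mc_rollout(['a', 'b'], m=5, K=4,
                      sample_next=lambda seq, rng: rng.choice(vocab))
assert len(rollouts) == 4 and all(len(s) == 5 for s in rollouts)
assert all(s[:2] == ('a', 'b') for s in rollouts)
```

Averaging the discriminator's scores over these K completions (step (7)) estimates the expected future quality of the partial sequence.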
6. The method as claimed in claim 1, wherein the discriminator in step (6) judges the complete sequence formed after sampling and gives corresponding scores; the specific calculation process is as follows:
x_i = [e(c_i); e(t_i)]
h_j = f(W_c·x_{j:j+window-1} + b_c)
o = max{h_1, h_2, …, h_m}
p = sigmoid(W_o·o + b_o)
where e(c_i) is the word vector of character c_i, t_i is the label corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, max is the maximum-pooling operation used to obtain the final text feature representation, and W_o and b_o map the text features to a probability score that the input text is real text; sigmoid is used to limit the output probability to between 0 and 1.
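A minimal numerical sketch of this CNN discriminator; all dimensions and the random parameters are illustrative, and the rectifier f is assumed to be ReLU:

```python
import numpy as np

# Toy claim-6 discriminator: concatenated character + label vectors,
# a 1-D convolution of the given window size, max pooling over time,
# then a sigmoid-scored real/fake probability.
rng = np.random.default_rng(1)
m, char_dim, label_dim, filters, window = 6, 4, 2, 3, 2
x = rng.normal(size=(m, char_dim + label_dim))   # x_i = [e(c_i); e(t_i)]
Wc = rng.normal(size=(filters, window * (char_dim + label_dim)))
bc = np.zeros(filters)                           # convolution parameters
Wo = rng.normal(size=filters)                    # output-layer parameters
bo = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(x):
    # ReLU convolution features over each window of the sequence
    h = np.array([np.maximum(Wc @ x[j:j + window].ravel() + bc, 0.0)
                  for j in range(m - window + 1)])
    o = h.max(axis=0)                            # max pooling over time
    return sigmoid(Wo @ o + bo)                  # real-text probability

p = discriminator(x)
assert 0.0 < p < 1.0
```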
7. The method as claimed in claim 1, wherein the reward of the current sentence and the objective function of the generator in step (7) are calculated as follows:
R_i = (1/K)·Σ_{k=1}^{K} Discriminator((l_1, l_2, …, l_m)_k)
J(θ) = Σ_{i=1}^{m} R_i·log G_θ(l_i | l_1, …, l_{i-1})
where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the calculation performed by the discriminator, K is the number of samples of the Monte Carlo search, G_θ is the generator with parameters θ, and m is the set maximum output sequence length.
CN202010635292.1A 2020-07-03 2020-07-03 Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network Active CN111738007B (en)


Publications (2)

Publication Number Publication Date
CN111738007A CN111738007A (en) 2020-10-02
CN111738007B true CN111738007B (en) 2021-04-13





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant