CN111738007B - Chinese named entity recognition data enhancement algorithm based on sequence generative adversarial network - Google Patents

Chinese named entity recognition data enhancement algorithm based on sequence generative adversarial network

Info

Publication number
CN111738007B
CN111738007B (application CN202010635292.1A)
Authority
CN
China
Prior art keywords
entity
sentence
generator
dictionary
data
Prior art date
Legal status
Active
Application number
CN202010635292.1A
Other languages
Chinese (zh)
Other versions
CN111738007A (en)
Inventor
李思
王蓬辉
李明正
孙忆南
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202010635292.1A
Publication of CN111738007A
Application granted
Publication of CN111738007B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/295 Named entity recognition
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/216 Parsing using statistical methods
    • G06F16/316 Indexing structures
    • G06F16/35 Clustering; Classification
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention provides a method for expanding the training data of a target domain by selecting positive-sample data from source-domain data, fusing the semantic differences and label differences of sentences in the source and target domains, so as to enhance named entity recognition performance in the target domain. On the basis of a conventional Bi-LSTM + CRF model, the semantic and label differences are introduced through the state representation and reward settings of reinforcement learning, so that the trained decision network can select those sentences in the source-domain data that positively influence target-domain named entity recognition. The training data of the target domain is thereby expanded, alleviating the shortage of target-domain training data and improving the named entity recognition performance of the target domain.

Description

Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network
Technical Field
The invention relates to the field of internet technology, in particular to a method that uses a sequence generative adversarial network for data enhancement so as to improve the performance of Chinese named entity recognition.
Background
In recent years, deep learning has made great progress in image, speech, and natural language processing. Deep learning is an emerging class of machine learning algorithms whose motivation is to build neural networks that simulate the human brain for analysis and learning. In the image field, deep neural networks are used for target detection, for example combining a convolutional neural network with candidate windows to detect pedestrians in images; in the speech field, deep learning is used for speech synthesis and recognition, providing intelligent speech systems; in natural language processing, deep learning is applied to many everyday scenarios, such as analyzing users' browsing records and consumption behavior with neural networks to recommend products they may like, or training translation systems on large parallel corpora to give machines a high level of translation ability. With the rapid development of the internet, more and more user information is generated, and automatically extracting useful information from it to serve users is of great significance. Chinese named entity recognition is an upstream task of information extraction, and its development is critical to information extraction technology.
Chinese named entity recognition means identifying predefined entity categories in a given Chinese text; these categories generally include person names, place names, and time expressions. Recognizing the entities in Chinese text helps extract key entity information and understand the entities and entity relations in the text, which plays an important role in machine translation, dialogue systems, and knowledge graph construction.
Chinese named entity recognition typically involves two subtasks: (1) determining entity boundaries; (2) identifying entity types. Generally, named entity recognition is treated as a sequence labeling problem, with a labeling scheme that encodes both the type and the boundaries of each entity. Traditional methods for named entity recognition include maximum entropy models, support vector machines, and conditional random fields. In recent years, deep learning methods such as recurrent neural networks and convolutional neural networks have also been widely applied to Chinese named entity recognition, achieving high accuracy on several large corpora.
An advantage of deep learning is that it allows a neural network to capture features of the data automatically, but a large amount of data is typically required to reach high accuracy. For Chinese named entity recognition, however, the existing large corpora cover only the news domain; labeled corpora in domains such as microblogs and medicine are scarce, so a trained neural network cannot achieve good accuracy in those domains. In recent years, work on improving the accuracy of named entity recognition in such domains has mainly proceeded along two lines.
One line modifies the structure of the deep model. Multi-task learning jointly considers the recognition results of different entity types in Chinese text; attention mechanisms assign different weights to different characters so as to extract the more important information in the text. The other line introduces external resources, which can also help improve the accuracy of named entity recognition. External resources include additional entity dictionaries and large unlabeled or weakly labeled corpora, and are typically integrated into the model as additional features.
In the article "An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records", the authors use an attention mechanism to capture key information and textual features in medical text.
Firstly, a bidirectional long-short term memory network is used to obtain the context-dependent hidden-layer features of each character in the text, and then the attention weight of each character's hidden-layer feature is calculated according to the following formulas:

e_tj = tanh(W_a[x_t : x_j])

α_tj = exp(e_tj) / Σ_k exp(e_tk)

c_t = Σ_j α_tj · x_j

where x_t is the word vector of the current character, x_j is the word vector of the character at an arbitrary position j, α_tj is the weight of the character vector at position j with respect to position t, W_a is a trainable parameter, and c_t is the output of the attention layer.
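For illustration, the attention computation above can be sketched in plain Python (a toy example with made-up two-dimensional vectors; a plain dot product stands in for the trainable projection W_a, which is an assumption made for brevity and not the cited article's implementation):

```python
import math

def attention(xs, t):
    """Toy attention: score each position j against position t, softmax the
    scores, and return the weights and the weighted context vector c_t."""
    x_t = xs[t]
    # e_tj: tanh of a dot product standing in for tanh(W_a[x_t : x_j])
    es = [math.tanh(sum(a * b for a, b in zip(x_t, x_j))) for x_j in xs]
    m = max(es)
    exps = [math.exp(e - m) for e in es]      # numerically stable softmax
    z = sum(exps)
    alphas = [e / z for e in exps]            # alpha_tj, sums to 1 over j
    c_t = [sum(al * x_j[d] for al, x_j in zip(alphas, xs))
           for d in range(len(x_t))]          # c_t = sum_j alpha_tj * x_j
    return alphas, c_t

alphas, c_t = attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], t=0)
```

Positions 0 and 2 score identically against position 0 here, so they receive equal weight.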
In the article "Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition", the authors enhance, to some extent, the model's ability to recognize entities that do not appear in the training set by introducing an additional entity dictionary and hand-crafted features. The entity dictionary is incorporated as follows: if an n-gram containing a character appears in the entity dictionary, a vector of value 1 is concatenated after that character's word vector; otherwise a vector of value 0 is concatenated.
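That dictionary-feature scheme can be sketched as follows (a hypothetical minimal version with a single 0/1 indicator per character and bigram matching only; the actual feature templates in the cited article may differ):

```python
def add_dict_feature(sentence, char_vecs, entity_dict, n=2):
    """Append a 1-valued feature to a character's vector if any n-gram
    covering it appears in the entity dictionary, else a 0-valued feature."""
    out = []
    for i in range(len(sentence)):
        hit = 0.0
        for start in range(max(0, i - n + 1), i + 1):
            gram = sentence[start:start + n]
            if len(gram) == n and gram in entity_dict:
                hit = 1.0
                break
        out.append(list(char_vecs[i]) + [hit])
    return out

# Made-up 2-d character vectors; "北京" is the only dictionary entry.
feats = add_dict_feature("北京大学",
                         [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
                         {"北京"})
```

The first two characters are covered by the dictionary bigram and get the indicator 1; the last two do not.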
The inventor found in the course of research that the prior art above ("An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records" and "Incorporating dictionaries into deep neural networks for the Chinese clinical named entity recognition") has the following limitations:
1. Modifying the structure of the deep model may enhance the semantic representation of the text, but it does not address the lack of a large amount of annotated data.
2. Introducing external resources requires substantial manpower and time to collect them, and effective rules must be designed to add them to the model.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for performing data enhancement on a named entity recognition task with a sequence generative adversarial network.
The invention provides a Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network, in which a generator learns the relation between entities and non-entities in the training set, a discriminator judges the quality of the data produced by the generator, and finally the training set is expanded with the high-quality generated data so as to improve named entity recognition performance.
Step one, processing sentences in a corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary.
And step two, mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence.
And step three, randomly initializing a mapping dictionary indexed to the vector, and mapping each sentence into a numerical matrix formed by connecting vectors corresponding to the entity and the non-entity.
And step four, the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting the characteristic information of the output unit related to the previous moment, and the feedforward neural network maps the characteristic information into the probability of all possible output units.
And step five, considering the influence of the current output on the whole output sequence, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps.
And step six, judging the complete sequence formed after sampling by the discriminator, giving out corresponding scores and guiding the data generation of the generator.
And step seven, according to the discriminator score obtained in step six, calculating the reward of the current sentence and the objective function of the generator; the generator model trained through back propagation and gradient updates is then used to automatically generate a large amount of data.
And step eight, performing character string matching on the data generated in the step seven and the dictionary in the step one to obtain an entity label corresponding to the generated data.
And step nine, using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary.
Step ten, extracting feature vector representation related to each character context in the input sentence by adopting a Bidirectional Long-Short Term Memory neural network (Bi-LSTM).
And step eleven, adopting conditional random field decoding to obtain the prediction label corresponding to each character, calculating the loss function, and updating the parameters of the model by back propagation.
And step twelve, continuously repeating the step nine to the step eleven, testing the trained named entity recognition model on the development set, selecting the model with the maximum F value on the development set, and storing.
Further, in the non-training case, the steps one to twelve are replaced by:
mapping sentences in a test set into a corresponding vector matrix through a word vector dictionary;
inputting the vector representation of each sentence into a bidirectional long-short term memory neural network to obtain the feature representation of each sentence related to the context;
and step three, decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set, wherein the optimal tag sequence is used as a result of named entity recognition.
Further, in the step one, the sentences in the corpus are processed, each sentence is divided into an entity part and a non-entity part according to entity marking information of the sentence, and the entity part and the non-entity part are added into a dictionary, and the specific process is as follows:
assume a text sequence c1,c2,c3,c4,c5,c6The label of { O, O, B-PER, I-PER, O, O }, c can be replaced with1c2,c5c6Classified as a non-entity part, c3c4Grouped into entity parts, which are then added to the dictionary along with their corresponding tags.
Further, in the fourth step, the bidirectional long-short term memory neural network is used to extract the feature information of the output unit related to the previous time, and the feedforward neural network maps the feature information into the probabilities of all possible output units, and the calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})

p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)

where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the LSTM in the generator at time i-1, W and b are the trainable parameter weights of the feedforward network, h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step, and softmax is used for normalization.
Further, in step five, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps:

{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)

where l_i is the output unit at the current time, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
Further, the discriminator in step six judges the complete sequence formed after sampling and gives the corresponding score. The specific calculation process is as follows:

x_i = [e(c_i); e(t_i)]

h_i = f(W_c·x_{i:i+window-1} + b_c)

o = max{h_1, h_2, …, h_m}

p = sigmoid(W_o·o + b_o)

where e(c_i) is the word vector of character c_i, t_i is the tag corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, f is the activation of the convolution layer, max is the max-pooling operation used to obtain the final text feature representation, W_o and b_o map the text feature to the probability score that the input text is real text, and sigmoid limits the output probability to between 0 and 1.
Further, the reward of the current sentence in step seven and the objective function of the generator are calculated as follows:

R_i = (1/K)·Σ_{k=1..K} Discriminator((l_1, l_2, …, l_m)_k)

J(θ) = Σ_i R_i·log G_θ(l_i | l_0, …, l_{i-1})

where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the computation performed by the discriminator, K is the number of samples of the Monte Carlo search, and θ denotes the parameters of the generator G_θ.
The invention provides a Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network. Entities and non-entities in the training set are used as the basic generation units of the generator, which avoids the problem that generated text has no labels. Meanwhile, reinforcement learning is used to connect the generator and the discriminator, solving the problem that gradient updates cannot be propagated between them. The generator learns the relation between entities and non-entities and automatically produces labeled data, which reduces the impact of lacking a large amount of annotated data and improves named entity recognition performance.
Drawings
FIG. 1 is a flow chart of a first embodiment;
FIG. 2 is a network structure diagram of the Chinese named entity recognition data enhancement algorithm based on a sequence generative adversarial network according to the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. Wherein, the abbreviations and key terms appearing in this embodiment are defined as follows:
CRF, Conditional Random Field;
Bi-LSTM, a Bidirectional Long Short-Term Memory neural network;
example one
Referring to fig. 1 and 2, the present invention provides a method for applying a data enhancement algorithm based on a sequence generative adversarial network to a named entity recognition task. Specifically, during training, the method includes:
step one, processing sentences in a corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary. Assume a text sequence c1,c2,c3,c4,c5,c6The label of { O, O, B-PER, I-PER, O, O }, c can be replaced with1c2,c5c6Classified as a non-entity part, c3c4Grouped into entity parts, which are then added to the dictionary along with their corresponding tags.
And step two, mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence.
And step three, randomly initializing a mapping dictionary indexed to the vector, and mapping each sentence into a numerical matrix formed by connecting the entity vector and the non-entity vector.
And step four, the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting the characteristic information of the output unit related to the previous moment, and the feedforward neural network maps the characteristic information into the probability of all possible output units. The calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})

p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)

where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the LSTM in the generator at time i-1, W and b are the trainable parameter weights of the feedforward network, and h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step.
And step five, considering the influence of the current output on the whole output sequence, a roll-out strategy based on Monte Carlo search is adopted to sample the output units at subsequent time steps. The sampling process is as follows:

{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)

where l_i is the output unit at the current time, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
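The roll-out sampling can be illustrated with a toy sketch (uniform random choice stands in for sampling from the generator's output distribution, an assumption made to keep the example small; the prefix and vocabulary are made up):

```python
import random

def mc_rollout(prefix, vocab, m, K, seed=0):
    """Complete the partial sequence l_1..l_i to full length m, K times,
    producing the K Monte Carlo samples scored by the discriminator."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [tuple(prefix) + tuple(rng.choice(vocab)
                                  for _ in range(m - len(prefix)))
            for _ in range(K)]

# K = 5 completions of a 2-unit prefix up to maximum length m = 4.
samples = mc_rollout(("李明", "今天"), ["回家", "上班", "去学校"], m=4, K=5)
```

Every sample keeps the fixed prefix and differs only in the sampled suffix.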
And step six, judging the complete sequence formed after sampling by the discriminator, giving out corresponding scores and guiding the data generation of the generator. The specific calculation process is as follows:
x_i = [e(c_i); e(t_i)]

h_i = f(W_c·x_{i:i+window-1} + b_c)

o = max{h_1, h_2, …, h_m}

p = sigmoid(W_o·o + b_o)

where e(c_i) is the word vector of character c_i, t_i is the tag corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, f is the activation of the convolution layer, max is the max-pooling operation used to obtain the final text feature representation, W_o and b_o map the text feature to the probability score that the input text is real text, and sigmoid limits the output probability to between 0 and 1.
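The discriminator's scoring can be sketched in plain Python (a toy version with a single convolution kernel, ReLU as an assumed activation, and made-up embeddings and weights; a real discriminator would learn these parameters):

```python
import math

def discriminator_score(units, embed, Wc, bc, Wo, bo, window=2):
    """CNN-style discriminator sketch: embed each generated unit, slide a
    convolution window over the sequence, max-pool over positions, and
    squash the result to a real/fake probability with a sigmoid."""
    xs = [embed[u] for u in units]       # stands in for x_i = [e(c_i); e(t_i)]
    hs = []
    for i in range(len(xs) - window + 1):
        win = [v for x in xs[i:i + window] for v in x]   # concatenate window
        hs.append(max(0.0, sum(w * v for w, v in zip(Wc, win)) + bc))  # ReLU
    o = max(hs)                                          # max pooling
    return 1.0 / (1.0 + math.exp(-(Wo * o + bo)))        # sigmoid to (0, 1)

embed = {"李明/PER": [1.0, 0.0], "今天/O": [0.0, 1.0]}   # made-up embeddings
score = discriminator_score(["李明/PER", "今天/O", "李明/PER"], embed,
                            Wc=[0.5, 0.5, 0.5, 0.5], bc=0.0, Wo=1.0, bo=0.0)
```

With these weights every window scores 1.0, so the output is sigmoid(1.0).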
And step seven, according to the discriminator score obtained in step six, the reward of the current sentence and the objective function of the generator are calculated; the generator model trained through back propagation and gradient updates is then used to automatically generate a large amount of data. The specific calculation process is as follows:

R_i = (1/K)·Σ_{k=1..K} Discriminator((l_1, l_2, …, l_m)_k)

J(θ) = Σ_i R_i·log G_θ(l_i | l_0, …, l_{i-1})

where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the computation performed by the discriminator, and K is the number of samples of the Monte Carlo search.
And step eight, performing string matching between the data generated in step seven and the dictionary in step one to obtain the entity labels corresponding to the generated data.
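Step eight's label recovery can be sketched as follows (a hypothetical helper that maps each generated unit back to its tag via the entity/non-entity dictionary of step one):

```python
def label_generated(parts, part_dict):
    """Recover BIO labels for a generated sentence: units found in the
    dictionary with an entity tag X get B-X/I-X labels, others get O."""
    chars, labels = "", []
    for part in parts:
        tag = part_dict.get(part, "O")            # unmatched parts default to O
        for j in range(len(part)):
            labels.append("O" if tag == "O"
                          else ("B-" if j == 0 else "I-") + tag)
        chars += part
    return chars, labels

# Hypothetical dictionary built in step one, mapping units to tags.
sent, labels = label_generated(["李明", "今天"], {"李明": "PER", "今天": "O"})
```

The generated sentence thus arrives already annotated, ready to expand the training set.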
And step nine, using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary.
Step ten, extracting feature vector representation related to each character context in the input sentence by adopting a Bidirectional Long-Short Term Memory neural network (Bi-LSTM).
Step eleven, decoding by adopting a Conditional Random Field (CRF) to obtain the prediction label corresponding to each character, calculating the loss function, and updating the parameters of the model by back propagation.
And step twelve, continuously repeating the step nine to the step eleven, testing the trained named entity recognition model on the development set, selecting the model with the maximum F value on the development set, and storing.
In the non-training case, step one to step twelve are replaced by:
mapping sentences in a test set into a corresponding vector matrix through a word vector dictionary;
inputting the vector representation of each sentence into a saved named entity recognition model, and obtaining the feature representation of each sentence related to the context through a bidirectional long-short term memory neural network;
and step three, decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set, wherein the optimal tag sequence is used as a result of named entity recognition.
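The conditional random field decoding used in step three can be illustrated with a self-contained Viterbi sketch (toy emission and transition scores made up for the example; a real CRF learns these during training):

```python
def viterbi(emissions, transitions, tags):
    """Viterbi decoding for a linear-chain CRF: find the highest-scoring tag
    sequence under per-position emission scores and tag-transition scores."""
    score = {t: emissions[0][t] for t in tags}   # best score ending in tag t
    back = []                                    # back-pointers per position
    for i in range(1, len(emissions)):
        new, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new[t] = score[prev] + transitions[(prev, t)] + emissions[i][t]
            ptr[t] = prev
        score = new
        back.append(ptr)
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):                   # follow back-pointers
        path.append(ptr[path[-1]])
    return path[::-1]

tags = ["O", "B-PER", "I-PER"]
# Toy scores: forbid the invalid transition O -> I-PER; emissions favour
# B-PER, I-PER, O at positions 0, 1, 2 respectively.
transitions = {(p, t): (-5.0 if (p == "O" and t == "I-PER") else 0.0)
               for p in tags for t in tags}
emissions = [
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 2.0},
    {"O": 2.0, "B-PER": 0.0, "I-PER": 0.0},
]
path = viterbi(emissions, transitions, tags)
```

The transition penalty keeps the decoded sequence BIO-consistent even when emissions alone are ambiguous.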
In a preferred embodiment, sentences in the training set are converted into sequences of entities and non-entities according to their entity labels. The entities and non-entities in each sentence are then mapped to dense vectors of dimension n through a vector dictionary, and a bidirectional long-short term memory neural network extracts features relating the output unit at the current time to the units at previous times. The extracted features are mapped by a feedforward neural network to the probabilities of all possible output units, and the unit with the highest probability is taken as the output at the current time. Meanwhile, to solve the problem of propagating gradient updates between the generator and the discriminator, a reinforcement learning mechanism is introduced: the output of the discriminator serves as the reward and guides the generator's data generation. The data produced by the generator is then used together with the original training set to train the named entity recognition model.
The invention provides a data enhancement algorithm based on a sequence generative adversarial network for a named entity recognition model. Entities and non-entities in the training set are used as the basic generation units of the generator, so that, through string matching, the problem of generated text having no labels is avoided. Meanwhile, reinforcement learning connects the generator and the discriminator, solving the problem that gradient updates cannot be propagated between them. Finally, the generator learns the relation between entities and non-entities and automatically produces labeled data, reducing the impact of lacking a large amount of annotated data and improving named entity recognition performance.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (7)

1. A Chinese named entity recognition data enhancement method based on a sequence generative adversarial network, characterized in that a sequence generative adversarial network is adopted to learn the relation between entities and non-entities in a training set so as to enhance data and improve named entity recognition performance, the method comprising the following steps:
(1) processing sentences in the corpus, dividing each sentence into an entity part and a non-entity part according to entity marking information of the sentences, and adding the entity part and the non-entity part into a dictionary;
(2) mapping the entity and non-entity parts in each sentence into corresponding indexes in the dictionary according to the dictionary formed by the entity and the non-entity parts to form an index sequence;
(3) randomly initializing a mapping dictionary indexed to vectors, and mapping each sentence into a numerical matrix formed by connecting vectors corresponding to entities and non-entities;
(4) the generator generates a text by adopting a left-to-right strategy, a Bidirectional Long-Short Term Memory neural network (Bi-LSTM) is used for extracting characteristic information of an output unit related to previous time, and the feedforward neural network maps the characteristic information into probabilities of all possible output units;
(5) considering the influence of the output of the current unit on the whole output sequence, adopting a roll-out strategy of Monte Carlo search to sample the output unit at the later moment;
(6) the discriminator judges the complete sequence formed after sampling, gives corresponding scores and guides the data generation of the generator;
(7) calculating the reward of the current sentence and the objective function of the generator according to the fraction of the discriminator obtained in the step (6), obtaining a good generator model by utilizing back propagation and gradient updating, and automatically generating a large amount of data;
(8) performing character string matching on the data generated in the step (7) and the dictionary in the step (1) to obtain an entity label corresponding to the generated data;
(9) using the generated text data to expand a training set, and numerically converting sentences in the training set into a vector matrix through a word vector dictionary;
(10) extracting context-dependent feature vector representations of each character in an input sentence by adopting a bidirectional long-short term memory neural network;
(11) adopting conditional random field decoding to obtain the predicted label for each character, calculating the loss function, and updating the parameters of the model by back propagation;
(12) repeating steps (9) to (11), testing the trained named entity recognition model on the development set, selecting the model with the highest F value on the development set, and saving it.
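The label-recovery idea of step (8) can be sketched as follows; `recover_labels`, the dictionary layout, and the BIO scheme shown are illustrative assumptions, not the patent's exact implementation:

```python
# Hypothetical sketch of step (8): recover entity labels for a generated
# sentence by string-matching its segments against the entity dictionary
# built in step (1).

def recover_labels(segments, entity_dict):
    """segments: list of text chunks emitted by the generator.
    entity_dict: maps an entity segment to its type (e.g. 'PER');
    non-entity segments are absent from the dictionary."""
    labels = []
    for seg in segments:
        ent_type = entity_dict.get(seg)
        if ent_type is None:                      # non-entity part: all 'O'
            labels.extend(['O'] * len(seg))
        else:                                     # entity part: B-X then I-X
            labels.append('B-' + ent_type)
            labels.extend(['I-' + ent_type] * (len(seg) - 1))
    return labels

entity_dict = {'李明': 'PER'}
print(recover_labels(['我是', '李明', '啊'], entity_dict))
# ['O', 'O', 'B-PER', 'I-PER', 'O']
```

Because the generator emits whole dictionary segments rather than single characters, every generated sentence can be labeled deterministically this way, with no manual annotation.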
2. The method of claim 1, wherein the process of entity recognition at test time comprises:
(2.1) mapping sentences in the test set into corresponding vector matrixes through a word vector dictionary;
(2.2) inputting the vector representation of each sentence into the bidirectional long-short term memory neural network to obtain the feature representation related to each sentence and context;
and (2.3) decoding by adopting a conditional random field to obtain an optimal tag sequence of each sentence in the test set as a result of named entity recognition.
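The conditional-random-field decoding of step (2.3) is conventionally done with the Viterbi algorithm. A minimal sketch, assuming the Bi-LSTM has already produced per-character emission scores (the toy scores and tag set below are illustrative, not trained values):

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (seq_len, num_tags) per-character tag scores;
    transitions[i, j]: score of moving from tag i to tag j.
    Returns the indices of the highest-scoring tag sequence."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # total[i, j]: best score ending in tag j at time t via tag i at t-1
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

tags = ['O', 'B-PER', 'I-PER']
emissions = np.array([[2., 1., 0.],    # char 1: emission prefers O
                      [0., 0., 2.],    # char 2: emission prefers I-PER
                      [2., 0., 0.]])   # char 3: emission prefers O
transitions = np.array([[0., 0., -10.],  # O -> I-PER is heavily penalized
                        [0., 0., 0.],
                        [0., 0., 0.]])
path = viterbi_decode(emissions, transitions)
# the penalized O -> I-PER transition forces B-PER before I-PER
assert [tags[i] for i in path] == ['B-PER', 'I-PER', 'O']
```

The transition scores are what let the CRF overrule locally attractive but structurally invalid tags, which is why claim 1 decodes with a CRF instead of taking the per-character argmax.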
3. The method according to claim 1, wherein in the step (1), the sentences in the corpus are processed, each sentence is divided into entity and non-entity parts according to the entity tagging information of the sentence, and the entity and non-entity parts are added into the dictionary at the same time, and the specific process comprises the following steps:
Assume a text sequence {c_1, c_2, c_3, c_4, c_5, c_6} with labels {O, O, B-PER, I-PER, O, O}; then c_1c_2 and c_5c_6 are classified as non-entity parts and c_3c_4 as an entity part, and these segments are added to the dictionary together with their corresponding labels.
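The segmentation described in this claim can be sketched as follows; the function name and the 'ENT'/'NON' markers are illustrative:

```python
# Split a BIO-labeled character sequence into entity / non-entity segments,
# as in claim 3 (step (1) of claim 1).

def split_by_bio(chars, labels):
    segments = []          # list of (text, 'ENT' or 'NON', tags)
    buf, tags = [], []

    def flush():
        if buf:
            kind = 'NON' if tags[0] == 'O' else 'ENT'
            segments.append((''.join(buf), kind, list(tags)))
            buf.clear()
            tags.clear()

    for c, t in zip(chars, labels):
        # a B- tag, or a switch between O and entity tags, starts a new segment
        if buf and (t.startswith('B-') or (t == 'O') != (tags[0] == 'O')):
            flush()
        buf.append(c)
        tags.append(t)
    flush()
    return segments

segs = split_by_bio(list('我是李明先生'), ['O', 'O', 'B-PER', 'I-PER', 'O', 'O'])
print([(s[0], s[1]) for s in segs])
# [('我是', 'NON'), ('李明', 'ENT'), ('先生', 'NON')]
```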
4. The method of claim 1, wherein in step (4), the bidirectional long short-term memory neural network is used to extract the feature information relating the output unit to previous time steps, and the feedforward neural network maps the feature information to the probabilities of all possible output units; the calculation process is as follows:
h_i = LSTM(h_{i-1}, l_{i-1})
p(l_i | l_0, l_1, …, l_{i-1}) = softmax(W·h_i + b)
where LSTM denotes an LSTM cell, h_{i-1} is the hidden-layer output of the generator's LSTM at time i-1, and W and b are trainable parameter weights of the feedforward network; h_{i-1} is used to initialize the LSTM at time i so as to introduce information from the previous time step, and softmax is used for normalization.
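A toy sketch of this generator step, with the LSTM cell abstracted to a simple recurrent update (dimensions and `generator_step` are illustrative stand-ins, not the patent's trained model):

```python
import numpy as np

# Toy generator step: recurrent state update, then a feedforward layer
# with softmax over the unit vocabulary, as in claim 4.
rng = np.random.default_rng(0)
hidden, vocab = 8, 5
W = rng.normal(size=(vocab, hidden))   # feedforward weights
b = np.zeros(vocab)                    # feedforward bias

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def generator_step(h_prev, l_prev_embedding):
    # stand-in for h_i = LSTM(h_{i-1}, l_{i-1}); a real model would apply
    # gated LSTM updates here instead of a plain tanh
    h_i = np.tanh(h_prev + l_prev_embedding)
    p = softmax(W @ h_i + b)           # p(l_i | l_0, ..., l_{i-1})
    return h_i, p

h, p = generator_step(np.zeros(hidden), rng.normal(size=hidden))
assert abs(p.sum() - 1.0) < 1e-9 and p.shape == (vocab,)
```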
5. The method as claimed in claim 1, wherein in the step (5), the roll-out strategy of the Monte Carlo search is adopted to sample the output units at subsequent time steps:
{(l_1, l_2, …, l_m)_1, (l_1, l_2, …, l_m)_2, …, (l_1, l_2, …, l_m)_K} = MC((l_1, …, l_i), K)
where l_i is the output unit at the current time step, m is the set maximum output sequence length, K is the number of samples of the Monte Carlo search, and MC denotes the Monte Carlo search method.
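The roll-out can be sketched as repeatedly completing the current prefix to full length m; `mc_rollout` and the toy `sample_next` sampler below are illustrative assumptions:

```python
import random

# Monte Carlo roll-out (claim 5): complete the prefix (l_1, ..., l_i)
# K times up to length m by sampling from the generator's conditional
# distribution, here abstracted as sample_next.

def mc_rollout(prefix, m, K, sample_next, seed=0):
    rng = random.Random(seed)
    samples = []
    for _ in range(K):
        seq = list(prefix)
        while len(seq) < m:
            seq.append(sample_next(seq, rng))
        samples.append(tuple(seq))
    return samples

vocab = ['a', 'b', 'c']
rollouts = mc_rollout(['a', 'b'], m=5, K=4,
                      sample_next=lambda seq, rng: rng.choice(vocab))
assert len(rollouts) == 4 and all(len(s) == 5 for s in rollouts)
assert all(s[:2] == ('a', 'b') for s in rollouts)
```

Averaging the discriminator's scores over these K completions (step (7)) estimates the expected future quality of the partial sequence.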
6. The method as claimed in claim 1, wherein the discriminator in step (6) judges the complete sequence formed after sampling and gives corresponding scores; the specific calculation process is as follows:
x_i = [e(c_i); e(t_i)]
h_j = f(W_c·x_{j:j+window-1} + b_c)
o = max{h_1, h_2, …, h_m}
p = sigmoid(W_o·o + b_o)
where e(c_i) is the word vector of character c_i, t_i is the label corresponding to character c_i, W_c and b_c are the parameters of the convolution kernel, window is the window size of the convolution kernel, max is the maximum-pooling operation used to obtain the final text feature representation, and W_o and b_o map the text features to a probability score that the input text is real text; sigmoid is used to limit the output probability to between 0 and 1.
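A minimal numerical sketch of this CNN discriminator; all dimensions and the random parameters are illustrative, and the rectifier f is assumed to be ReLU:

```python
import numpy as np

# Toy claim-6 discriminator: concatenated character + label vectors,
# a 1-D convolution of the given window size, max pooling over time,
# then a sigmoid-scored real/fake probability.
rng = np.random.default_rng(1)
m, char_dim, label_dim, filters, window = 6, 4, 2, 3, 2
x = rng.normal(size=(m, char_dim + label_dim))   # x_i = [e(c_i); e(t_i)]
Wc = rng.normal(size=(filters, window * (char_dim + label_dim)))
bc = np.zeros(filters)                           # convolution parameters
Wo = rng.normal(size=filters)                    # output-layer parameters
bo = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator(x):
    # ReLU convolution features over each window of the sequence
    h = np.array([np.maximum(Wc @ x[j:j + window].ravel() + bc, 0.0)
                  for j in range(m - window + 1)])
    o = h.max(axis=0)                            # max pooling over time
    return sigmoid(Wo @ o + bo)                  # real-text probability

p = discriminator(x)
assert 0.0 < p < 1.0
```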
7. The method as claimed in claim 1, wherein the reward of the current sentence and the objective function of the generator in step (7) are calculated as follows:
R_i = (1/K)·Σ_{k=1}^{K} Discriminator((l_1, l_2, …, l_m)_k)
J(θ) = Σ_{i=1}^{m} R_i·log G_θ(l_i | l_1, …, l_{i-1})
where R_i is the reward obtained by the generator at time i, J(θ) is the objective function of the generator, Discriminator denotes the calculation performed by the discriminator, K is the number of samples of the Monte Carlo search, G_θ is the generator with parameters θ, and m is the set maximum output sequence length.
CN202010635292.1A 2020-07-03 2020-07-03 Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network Active CN111738007B (en)


Publications (2)

Publication Number Publication Date
CN111738007A CN111738007A (en) 2020-10-02
CN111738007B true CN111738007B (en) 2021-04-13





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant