CN111222318B - Trigger word recognition method based on double-channel bidirectional LSTM-CRF network - Google Patents

Trigger word recognition method based on double-channel bidirectional LSTM-CRF network

Info

Publication number
CN111222318B
Authority
CN
China
Prior art keywords: input, word, vector, embedded, layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911130490.6A
Other languages
Chinese (zh)
Other versions
CN111222318A (en)
Inventor
陈一飞
孙玉星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING AUDIT UNIVERSITY
Original Assignee
NANJING AUDIT UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING AUDIT UNIVERSITY filed Critical NANJING AUDIT UNIVERSITY
Priority to CN201911130490.6A priority Critical patent/CN111222318B/en
Publication of CN111222318A publication Critical patent/CN111222318A/en
Application granted granted Critical
Publication of CN111222318B publication Critical patent/CN111222318B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks

Abstract

The application discloses a trigger word recognition method based on a dual-channel bidirectional LSTM-CRF network. Linear and nonlinear embedded vector features are first fed separately into the two channels of the input layer, the subsequent bidirectional LSTM layers extract higher-level abstract features, the resulting linear and nonlinear abstract features are selected and fused in a pooling layer, and the fused features finally train a CRF layer that performs the sequence labeling. For a biomedical text, dependency-tree parses and part-of-speech tags are obtained, the nonlinear context derived from the dependency tree and the linear context are used together as the input sequence for training the dual-channel bidirectional LSTM-CRF network model, the optimal label sequence of the input sequence is obtained, and the event trigger words in the biomedical text are labeled according to this optimal label sequence. The method can effectively identify event trigger words in biomedical text.

Description

Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
Technical Field
The application relates to a trigger word recognition method based on a double-channel bidirectional LSTM-CRF network, and belongs to the technical field of data mining.
Background
In recent years, as interest in biomedical research has grown, a large number of papers have been published on the internet. Accordingly, biomedical text mining is increasingly being used to automatically track new findings and theories in these papers. Typical biomedical text-mining tasks include named entity recognition (e.g., the genes and proteins mentioned), relation extraction between entities (e.g., protein interactions), and event extraction (e.g., gene transcription and regulation).
Biomedical event extraction refers to automatically extracting structured representations of biomedical relations, functions, and processes from text. Event extraction has been a research hotspot since the BioNLP'09 and BioNLP'11 shared tasks. These tasks define the structure of each event as having any number of participants, indicating functions and processes at the molecular level, for example regulation and phosphorylation: the two events occur when a given protein can regulate the expression of a gene whose products are in turn involved in some phosphorylation process. In these two shared tasks, 9 common biomolecular event types were selected, involving proteins and genes as an important component of the biological-system landscape. In addition, in the MLEE corpus, events from the molecular level up to the whole organism are annotated, and the event types have been extended to 19. In the future, a comprehensive understanding of biological systems will require extracting more and more biomedical events across multiple levels of biological organization, so automatically identifying biomedical events in text is highly desirable.
The event extraction task typically comprises two main steps: identifying event trigger words, and then identifying event arguments. Trigger word recognition is the first, and most critical, step in event extraction; it aims to recognize the text spans that indicate events and act as predicates. Event extraction performance depends heavily on the identified trigger words: studies have shown that over 60% of event extraction errors are attributable to the trigger word identification phase.
Event trigger recognition methods can be classified into rule-based, dictionary-matching, and machine learning (ML) based methods. Among ML-based approaches, Conditional Random Field (CRF), Support Vector Machine (SVM), and Deep Neural Network (DNN) models have been used most successfully to build trigger recognition systems. When CRF and SVM methods are used to construct the trigger word recognition model, features are generally summarized and extracted by hand, which is costly and gives the system poor generalization ability. Such features include morphological characteristics, stems, parts of speech, sentence features, grammatical features, positional features, and other information about the trigger words. To avoid the tedious process of manually designing features for trigger word recognition, deep learning methods based on neural networks have recently become a research hotspot. DNNs typically use word embedding vectors as the model input, which avoids many artificial feature design problems; during training the network automatically learns abstract features and acquires semantic information between words. This advantage has made DNNs widely used in event trigger recognition. An LSTM (Long Short-Term Memory) network treats the trigger word recognition process as a sequence labeling problem; by adding a memory cell to each neural unit of the hidden layer, the information remembered over the sequence becomes controllable, giving the network a long-term memory capability, and LSTM has therefore become a current research focus.
Currently, the word embedding vectors commonly used as LSTM network input can only reflect linear contextual semantic relationships between words. However, biomedical event triggers also need information from the dependency-tree based context. Dependency parsing analyzes a sentence into a syntactic tree describing the dependency relationships between words, i.e. it indicates syntactic collocation relationships between words that are also semantically associated. The context depicted by the dependency tree can capture nonlinear inter-word relationships that are difficult to obtain from the linear context. The nonlinear dependency-tree context can therefore provide richer linguistic information for trigger word recognition than linear context information, and thereby better recognition performance. Dependency-based word embedding vectors have been successfully learned from text with skip-gram models, but there is still no method that integrates them well with linear context semantic information in an LSTM network to obtain a better trigger word recognition effect.
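As an illustration of how such dependency-based embeddings can be learned (a minimal sketch assuming a CoNLL-style parse; the pair format, field names, and example sentence are illustrative assumptions, only the general idea of taking skip-gram contexts from dependency arcs rather than linear neighbours comes from the text above):

    # Sketch: build (word, context) pairs from a dependency parse so that a
    # skip-gram model learns dependency-based ("nonlinear") word embeddings.
    def dependency_contexts(tokens):
        pairs = []
        for tok in tokens:
            if tok["head"] == 0:                      # skip the root, it has no head
                continue
            head = tokens[tok["head"] - 1]
            pairs.append((tok["form"], head["form"] + "/" + tok["deprel"]))         # child sees its head
            pairs.append((head["form"], tok["form"] + "/" + tok["deprel"] + "-1"))  # head sees the child (inverse)
        return pairs

    sentence = [
        {"id": 1, "form": "protein", "head": 2, "deprel": "nsubj"},
        {"id": 2, "form": "regulates", "head": 0, "deprel": "root"},
        {"id": 3, "form": "expression", "head": 2, "deprel": "dobj"},
    ]
    print(dependency_contexts(sentence))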
Disclosure of Invention
The application aims to: address the shortcoming of existing biomedical event trigger word recognition methods, which generate features only from the linear context relation, by fully using the information provided by the dependency tree obtained through syntactic analysis. The application provides a trigger word recognition method based on a dual-channel bidirectional LSTM-CRF network and achieves a better feature representation by fusing the nonlinear dependency-tree context information.
The application mainly solves the following technical problems:
(1) How to construct nonlinear long-distance feature vectors from the context information of the dependency syntax tree obtained by parsing the sentence, and use them as input to a deep learning network model so as to obtain a better feature representation; and how to build the LSTM recognition network with a two-channel technique so that this input is fed in parallel and simultaneously with the linear context information. This is a key factor in the good results achieved by the recognition method of the present application.
(2) How to use a pooling technique to combine the linear and nonlinear abstract features extracted in the dual-channel structure. This is another key factor in the good results of the recognition method of the present application.
(3) The bidirectional LSTM network is a further improvement over the traditional LSTM deep learning model; by adding a CRF layer at the output, it learns the strong dependencies between output labels and produces the most probable predicted sequence, which contributes to improved recognition performance.
Experiments show that the method effectively improves the recognition performance for event trigger words in biomedical text.
The technical scheme is as follows: a trigger word recognition method based on a dual-channel bidirectional LSTM-CRF network, applied in particular to the recognition of biomedical event trigger words; the method comprises the following steps:
step 1, performing text preprocessing on the biomedical text training set to obtain dependency-tree parses and part-of-speech tags, and using the nonlinear context relations obtained from the dependency tree together with the linear context as input for trigger word recognition;
step 2, applying the same text preprocessing method as in step 1 to the test set;
step 3, preparing the pre-trained embedding vector lookup table X_1 learned from PubMed abstracts and the pre-trained dependency-based embedding vector lookup table Y_1;
step 4, constructing the linear context embedding channel layer, nonlinear context embedding channel layer, bidirectional LSTM layer, max pooling layer, fully connected layer and CRF layer of the dual-channel bidirectional LSTM-CRF network model with the training data preprocessed in step 1;
iteratively optimizing the connection weights between the layers of the neural network on the training dataset;
step 5, after training, inputting the test data into the trained dual-channel bidirectional LSTM-CRF network model to obtain the optimal label sequence of the input sequence;
step 6, labeling the event trigger words in the biomedical text with the optimal label sequence output by the model.
In step 4, a standard mini-batch gradient descent forward and backward training procedure is used. For each iteration, the whole training data is divided into a number of batches and one batch is processed at a time; each batch contains the sentences determined by a batch-size parameter. For each batch, the forward pass of the bidirectional LSTM-CRF model is run first, propagating both the forward and backward LSTM states, which yields the output scores of all tags (the tag sequence y = (t_1, t_2, ..., t_n) over all positions 1 to n). Then the forward and backward passes of the CRF layer are run to compute the gradients of the network output and of the state-transition edges. After that, errors are back-propagated from the output to the input and the network parameters are updated, including the parameters of all LSTM forward and backward states, the randomly initialized lookup tables of the linear and nonlinear context embedding channel layers, and the transition parameters of the CRF.
The linear context embedding channel layer: for an input sentence s, each input word w_i is converted at the linear channel input layer into corresponding real-valued feature vectors through a series of lookup tables, namely into the concatenation of the following vectors:
(1) Embedded word feature vector E_w_L: using the embedded word lookup table X_1, each word w_i of the input sentence is mapped to an embedded word vector; this vector carries the semantic information of the linear context contained in X_1;
(2) Character embedding feature vector E_c_L: an LSTM network learns the spelling pattern of each word at the character level; its parameters are randomly initialized, the input is the character sequence x_t composing word w_i, where t is the number of characters contained in the word, and the output is the character-level embedded vector sequence h_t; the character embedding feature vector E_c_L thus extracts spelling information from the character sequence of the input word w_i;
(3) Part-of-speech embedded feature vector E_p_L: word embedding features are extended with the part of speech (POS); using the embedded feature lookup table X_2, the POS tag of each word w_i of the input sentence is mapped to an embedded vector; this vector carries the POS correlation information of the linear context contained in X_2 and extracts contextual sentence information from the input word; the lookup table matrix X_2 is randomly initialized;
(4) Named entity type embedding feature vector E_e_L: through an embedded feature lookup table, the named entity type of each word w_i in the linear context of the input sentence is mapped to an embedded vector; the lookup table is randomly initialized, where r_6 is the number of all named entity categories and r_7 is the dimension of the embedded vector;
Thus, through the linear context embedding channel layer, each input word w_i is converted into a linear vector string x_L = [E_w_L; E_c_L; E_p_L; E_e_L].
The nonlinear context embedding channel layer (nonlinear channel input layer): for an input sentence s, the embedding layer converts each input feature into a real-valued representation vector through a series of lookup tables; each input word w_i is thus converted into the concatenation of the following vectors:
(1) Dependency-tree based word embedding feature vector E_w_NL: each word of the input sentence is mapped to a dependency-tree based word embedding vector; using the embedded word lookup table Y_1, each word w_i of the input sentence is mapped to a nonlinear embedded word vector;
(2) Dependency-tree based part-of-speech embedding feature vector E_p_NL: word embedding is extended with the part of speech (POS) by mapping the POS tag of each word in the dependency-tree context corresponding to the input sentence to a POS embedding vector that extracts syntactic information from the input word; using the embedded feature lookup table Y_2, the POS tag of each word w_i of the input sentence is mapped to a nonlinear POS embedded vector carrying the POS correlation information of the nonlinear context contained in Y_2; the lookup table matrix Y_2 is randomly initialized;
(3) Dependency-tree based named entity type embedding feature vector E_e_NL: the named entity type of each word in the nonlinear context corresponding to the input sentence's dependency tree is mapped to an embedded vector that extracts domain-related nonlinear information from the input; using the embedded feature lookup table Y_3, the named entity type of each word w_i in the sentence's dependency-tree context is mapped to an embedded vector; the lookup table Y_3 is randomly initialized;
Thus, through the nonlinear context embedding channel layer, each input word w_i is converted into a nonlinear vector string x_NL = [E_w_NL; E_p_NL; E_e_NL].
The bidirectional LSTM layer:
(1) Bidirectional LSTM layer of the linear channel (linear channel abstraction layer): this bidirectional LSTM layer takes the output x_L of the linear context embedding channel layer as input. Let the input sequence of one LSTM cell be the vectors x_1, x_2, ..., x_t of length t; by applying the nonlinear transformation learned during training (formula 1), it produces an output sequence h_1, h_2, ..., h_t of the same length.
At each LSTM time step t, i_t is the input gate, f_t the forget gate, o_t the output gate, c_t the memory cell, c̃_t the candidate memory cell, and h_t the hidden state; all W and b are trainable LSTM parameters, σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions, and ⊙ denotes the element-wise product. (The LSTM update that formula 1 refers to is given after item (2) below.)
When the input vector sequence x_L is processed in the forward direction, the linear forward LSTM network output h_F_L is obtained; when it is processed in the backward direction, the linear backward LSTM network output h_B_L is obtained. The outputs of the forward and backward LSTM networks are concatenated to give the output of the bidirectional LSTM layer of the linear channel, h_L = [h_F_L; h_B_L].
(2) Bidirectional LSTM layer of the nonlinear channel (nonlinear channel abstraction layer): likewise, this LSTM layer takes the output x_NL of the nonlinear context embedding channel layer as input and produces output sequences of the same length by applying the nonlinear transformation (formula 1) learned during training; by concatenating the outputs h_F_NL and h_B_NL of the forward and backward LSTM networks, the final bidirectional LSTM layer output of the nonlinear channel is obtained, h_NL = [h_F_NL; h_B_NL].
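A reconstruction of the LSTM update that formula (1) refers to, consistent with the symbols defined above (the original formula was published as an image and is not reproduced verbatim here):

    i_t = \sigma(W_i [x_t; h_{t-1}] + b_i)
    f_t = \sigma(W_f [x_t; h_{t-1}] + b_f)
    o_t = \sigma(W_o [x_t; h_{t-1}] + b_o)
    \tilde{c}_t = \tanh(W_c [x_t; h_{t-1}] + b_c)
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
    h_t = o_t \odot \tanh(c_t)                                   (1)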
The max pooling layer (feature fusion layer): the bidirectional LSTM layers of the two different channels above extract abstract features from both the forward and backward directions, based on linear and nonlinear context information respectively. A max-pooling technique captures the most useful features, and feature selection is made dynamically by taking the maximum value for each dimension j; F_max is the max-pooled dynamic feature output:
F_max = max(h_j^F_L, h_j^B_L, h_j^F_NL, h_j^B_NL)   (2)
F = (F_1, F_2, ..., F_n)   (3);
the CRF layer (sequence notation): based on the input sequence s= (w 1 ,w 2 ,…w n ) The pooling layer outputs its fused abstract feature sequence F, assuming the tag sequence y= (t) 1 ,t 2 ,…t n ) Is the final output of the CRF layer; given the fusion abstract feature sequence F and tag sequence y for each training instance, the CRF layer defines the maximization function of the target:
where F is a function of assigning a score to each pair F and y,representing the tag sequence space of F. The cost function cost (F, y ') is based on the principle of maximum profit, i.e. the high cost label y' should be subjected to a greater penalty. The CRF layer can learn the strong dependencies F (F, y) between the output tags, resulting in the most likely output tag sequences.
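A cost-augmented max-margin objective of this kind can be written, purely as an illustrative reconstruction (the application's own formula is not reproduced in this text), as:

    \max_{\theta} \sum_{(F, y)} \Big( f(F, y) - \max_{y' \in \mathcal{Y}(F)} \big[ f(F, y') + \mathrm{cost}(F, y') \big] \Big)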
Beneficial effects: compared with the prior art, the trigger word recognition method based on the dual-channel bidirectional LSTM-CRF network provided by the application fully fuses the nonlinear dependency-tree context information, constructs a dual-channel bidirectional LSTM-CRF network, and effectively recognizes event trigger words in biomedical text. The advantages of the model are mainly embodied in the following three aspects:
(1) Nonlinear, long-distance feature vectors are constructed from the context information of the dependency syntax tree obtained by parsing the sentence, and are input in parallel with the linear context information through the two-channel technique to build the LSTM recognition network, so that multi-faceted abstract features of text sentences are captured effectively and simultaneously.
(2) The linear and nonlinear abstract features extracted in the dual-channel structure are fused with a max-pooling technique, which dynamically selects according to the values of the abstract features and thereby achieves a better feature representation.
(3) The bidirectional LSTM network, combined with the CRF layer at the output, can learn the strong dependencies between output labels and form the most probable predicted sequence.
Drawings
FIG. 1 is a diagram of a dual channel bi-directional LSTM-CRF network model in accordance with an embodiment of the present application;
FIG. 2 is a flow chart of event trigger word annotation in biomedical text in accordance with an embodiment of the present application.
Detailed Description
The present application is further illustrated below with specific embodiments. It should be understood that these embodiments merely illustrate the application and do not limit its scope; after reading the application, modifications of equivalent forms by those skilled in the art fall within the scope defined by the appended claims.
The trigger word recognition method based on the dual-channel bidirectional LSTM-CRF network is applied in particular to the recognition of biomedical event trigger words. A dual-channel bidirectional LSTM-CRF network model is trained: the linear and nonlinear embedded vector features are first fed separately into the two channels of the input layer, the subsequent bidirectional LSTM layers extract higher-level abstract features, the resulting linear and nonlinear abstract features are selected and fused in a pooling layer, and the fused features finally train the CRF layer, which performs the final sequence labeling. For a biomedical text, dependency-tree parses and part-of-speech tags are obtained, the nonlinear context derived from the dependency tree and the linear context are used together as the input sequence for training the dual-channel bidirectional LSTM-CRF network model, the optimal label sequence of the input sequence is obtained, and the event trigger words in the biomedical text are labeled according to this optimal label sequence.
For a given preprocessed input sentence s = (w_1, w_2, ..., w_n), the output of trigger word recognition is a tag sequence y = (t_1, t_2, ..., t_n), where w_i is a word of the sentence, t_i is its corresponding type tag, and n is the sentence length.
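For illustration only (the tokens and tags below are hypothetical; the actual tag inventory is determined by the corpus), an input/output pair has the following shape:

    # Hypothetical example of an input sentence s and its output tag sequence y (n = 5);
    # non-trigger tokens receive the negative (non-trigger) tag.
    s = ["IL-4", "gene", "expression", "was", "upregulated"]
    y = ["None", "None", "Gene_expression", "None", "Positive_regulation"]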
As shown in fig. 1, the specific structure of the dual-channel bidirectional LSTM-CRF network model used in the present application is as follows:
1) Linear context embedding channel layer (linear channel input layer): to represent the linear-context syntactic and semantic features of the input sentence s, besides each word w_i itself we extract three further, richer features: the character composition of the word, its part of speech (POS), and its named entity type. Each input word w_i is converted at the linear channel input layer into corresponding real-valued feature vectors through a series of lookup tables, namely into the concatenation of the following vectors:
(1) Embedded word feature vector E_w_L: using the embedded word lookup table X_1, each word w_i of the input sentence is mapped to an embedded word vector carrying the semantic information of the linear context contained in X_1. X_1 is a pre-trained embedded word lookup table of size r_1 × r_2 learned from PubMed abstracts with the Word2Vec model, where r_1 is the number of words and r_2, the embedded word vector dimension, is set to 200.
(2) Character embedding feature vector E_c_L: an additional LSTM network learns the spelling pattern of each word at the character level (sketched in code after this list). Its parameters are randomly initialized; the input is the character sequence x_t composing word w_i, where t is the number of characters in the word, and the output is the character-level embedded vector sequence h_t. The network parameters follow formula (1) and are updated and optimized during LSTM training. The character embedding feature vector E_c_L thus extracts spelling-pattern information from the character sequence of the input word w_i; r_3, the dimension of this character embedding, is set to 100.
(3) Part-of-speech embedded feature vector E_p_L: word embedding features are extended with the part of speech (POS). Using the embedded feature lookup table X_2, the POS tag of each word w_i of the input sentence is mapped to an embedded vector carrying the POS correlation information of the linear context contained in X_2 and extracting contextual sentence information from the input word. The lookup table matrix X_2 is randomly initialized and updated and optimized during network training, where r_4 is the number of POS categories and r_5, the embedding dimension, is set to 50.
(4) Named entity type embedding feature vector E_e_L: through an embedded feature lookup table, the named entity type of each word w_i in the linear context of the input sentence is mapped to an embedded vector, which extracts domain-related information from the input. Named entities are provided by the task data, and the type of all other non-entity words is set to "NONE". The lookup table is randomly initialized and updated and optimized during network training, where r_6 is the number of named entity categories and r_7, the embedding dimension, is set to 10.
Thus, through the linear context embedding channel layer, each input word w_i is converted into a linear vector string x_L = [E_w_L; E_c_L; E_p_L; E_e_L].
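A minimal sketch of this linear-channel input construction, using the dimensions stated above (the character vocabulary size, per-character embedding size, and layer names are assumptions made only for illustration):

    import tensorflow as tf

    NUM_CHARS = 128      # assumed character vocabulary size
    CHAR_EMB_DIM = 30    # assumed per-character embedding size
    R3 = 100             # character feature dimension r_3 from the embodiment

    # Character-level LSTM: the character sequence of one word -> spelling feature E_c_L.
    char_ids = tf.keras.Input(shape=(None,), dtype="int32")
    chars = tf.keras.layers.Embedding(NUM_CHARS, CHAR_EMB_DIM)(char_ids)  # randomly initialized
    e_c = tf.keras.layers.LSTM(R3)(chars)                                 # final hidden state
    char_encoder = tf.keras.Model(char_ids, e_c)

    # Per-word linear-channel vector x_L: word (200) + character (100) + POS (50) + entity (10) = 360 dims.
    def linear_channel_vector(e_w, e_c, e_p, e_e):
        return tf.concat([e_w, e_c, e_p, e_e], axis=-1)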
2) Nonlinear context embedding channel layer (nonlinear channel input layer): to represent the nonlinear-context syntactic and semantic information of the input sentence s, besides each word w_i two further features are extracted: the part of speech (POS) and the named entity type. The embedding layer converts each input feature into a real-valued representation vector through a series of lookup tables, so that each input word w_i is converted into the concatenation of the following vectors:
(1) Dependency-tree based word embedding feature vector E_w_NL: to extend from the linear context to the nonlinear context derived from the dependency tree, each word of the input sentence is mapped to a dependency-tree based word embedding vector containing rich syntactic-function information. Using the embedded word lookup table Y_1, each word w_i of the input sentence is mapped to a nonlinear embedded word vector carrying the nonlinear-context semantic information contained in Y_1. In the application, the PubMed abstract text is split into sentences and parsed into dependency trees, and Y_1 is a pre-trained embedded word lookup table of size r_1 × r_12 learned with the skip-gram word vector training model, where r_1 is the number of words and r_12, the embedding dimension, is set to 300.
(2) Dependency-tree based part-of-speech embedding feature vector E_p_NL: word embedding is extended with the part of speech (POS). The POS tag of each word in the dependency-tree context corresponding to the input sentence is mapped to a POS embedding vector that extracts syntactic information from the input word. Using the embedded feature lookup table Y_2, the POS tag of each word w_i of the input sentence is mapped to a nonlinear POS embedded vector carrying the POS correlation information of the nonlinear context contained in Y_2. The lookup table matrix Y_2 is randomly initialized and updated and optimized during network training, where r_4 is the number of POS categories and r_5, the embedding dimension, is set to 50.
(3) Dependency-tree based named entity type embedding feature vector E_e_NL: the named entity type of each word in the nonlinear context corresponding to the input sentence's dependency tree is mapped to an embedded vector that extracts domain-related nonlinear information from the input. Using the embedded feature lookup table Y_3, the named entity type of each word w_i in the sentence's dependency-tree context is mapped to an embedded vector. The lookup table Y_3 is randomly initialized and updated and optimized during network training (parameters are optimized by gradient descent), where r_6 is the number of named entity categories and r_7, the embedding dimension, is set to 10.
Thus, through the nonlinear context embedding channel layer, each input word w_i is converted into a nonlinear vector string x_NL = [E_w_NL; E_p_NL; E_e_NL].
3) Bidirectional LSTM layer:
(1) Bidirectional LSTM layer of the linear channel (linear channel abstraction layer): this bidirectional LSTM layer takes the output x_L of the linear context embedding channel layer as input. A recurrent neural network (RNN) is a powerful tool for sequence labeling tasks because it can process the current input using the previous dependency information of the sequence. LSTM is a practical variant of the RNN in natural language processing applications: it introduces a memory cell that gathers previous information from the input sequence and then learns long-distance dependencies in a given order. Without loss of generality, assume the input sequence of one LSTM cell is the vectors x_1, x_2, ..., x_t of length t; by applying the nonlinear transformation learned during training (formula 1), it produces an output sequence h_1, h_2, ..., h_t of the same length.
At each LSTM time step t, i_t is the input gate, f_t the forget gate, o_t the output gate, c_t the memory cell, c̃_t the candidate memory cell, and h_t the hidden state. All W and b are trainable LSTM parameters, σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions, and ⊙ denotes the element-wise product.
When the input vector sequence x_L is processed in the forward direction, the linear forward LSTM network output h_F_L is obtained; when it is processed in the backward direction, the linear backward LSTM network output h_B_L is obtained. The outputs of the forward and backward LSTM networks are concatenated to give the output of the bidirectional LSTM layer of the linear channel, h_L = [h_F_L; h_B_L].
(2) Bidirectional LSTM layer of the nonlinear channel (nonlinear channel abstraction layer): likewise, this LSTM layer takes the output x_NL of the nonlinear context embedding channel layer as input and produces output sequences of the same length by applying the nonlinear transformation (formula 1) learned during training. By concatenating the outputs h_F_NL and h_B_NL of the forward and backward LSTM networks, the final bidirectional LSTM layer output of the nonlinear channel is obtained, h_NL = [h_F_NL; h_B_NL].
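The two abstraction layers can be sketched as follows (the hidden size, input dimensions, and layer wiring are assumptions made only for illustration):

    import tensorflow as tf

    HIDDEN = 128  # assumed LSTM hidden size per direction

    x_L  = tf.keras.Input(shape=(None, 360))  # linear-channel token vectors
    x_NL = tf.keras.Input(shape=(None, 360))  # nonlinear-channel token vectors

    # One bidirectional LSTM per channel; merge_mode=None keeps the forward and
    # backward outputs separate so they can also be fed to the pooling layer.
    h_F_L, h_B_L = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN, return_sequences=True), merge_mode=None)(x_L)
    h_F_NL, h_B_NL = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(HIDDEN, return_sequences=True), merge_mode=None)(x_NL)

    # The per-token concatenations h_L = [h_F_L; h_B_L] and h_NL = [h_F_NL; h_B_NL] described above.
    h_L  = tf.keras.layers.Concatenate()([h_F_L, h_B_L])
    h_NL = tf.keras.layers.Concatenate()([h_F_NL, h_B_NL])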
4) Max pooling layer (feature fusion layer): the bidirectional LSTM layers of the two different channels above extract abstract features from both the forward and backward directions based on linear and nonlinear context information, i.e. [h_F_L; h_B_L; h_F_NL; h_B_NL]. Here h_j^F_L, h_j^B_L, h_j^F_NL, h_j^B_NL denote, for dimension j, the forward and backward abstract features of a word based on linear and nonlinear context information respectively, but these context semantic features of different origin are not all equally important for the subsequent recognition.
Therefore, a max-pooling technique captures the most useful features, and feature selection is made dynamically by taking the maximum value for each dimension j. F_max is the max-pooled dynamic feature output:
F_max = max(h_j^F_L, h_j^B_L, h_j^F_NL, h_j^B_NL)   (2)
F = (F_1, F_2, ..., F_n)   (3);
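A sketch of the fusion step of formula (2), taking the element-wise maximum over the four directional feature sequences of every token (array shapes are assumptions):

    import numpy as np

    def max_pool_fusion(h_F_L, h_B_L, h_F_NL, h_B_NL):
        # Each input: (sentence_length, hidden_dim). For every token and every
        # dimension j, keep the largest of the four directional feature values.
        stacked = np.stack([h_F_L, h_B_L, h_F_NL, h_B_NL], axis=0)
        return stacked.max(axis=0)  # fused feature sequence F, same shape as one input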
5) CRF layer (sequence labeling layer): for the input sequence s = (w_1, w_2, ..., w_n), the pooling layer outputs the fused abstract feature sequence F, and the tag sequence y = (t_1, t_2, ..., t_n) is the final output of the CRF layer. Given the fused abstract feature sequence F and tag sequence y of each training instance, the CRF layer defines a max-margin objective to be maximized, in which f is a function that assigns a score to each pair (F, y) and Y(F) denotes the tag sequence space of F. The cost function cost(F, y') follows the maximum-margin principle, i.e. a high-cost label sequence y' should receive a greater penalty. The CRF layer can learn the strong dependencies f(F, y) between the output tags, yielding the most likely output tag sequence.
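At prediction time the CRF layer outputs the highest-scoring tag sequence; a plain Viterbi decoder over per-token emission scores and a learned tag-transition matrix (names and shapes are assumptions for illustration) looks like this:

    import numpy as np

    def viterbi_decode(emissions, transitions):
        # emissions:   (n_tokens, n_tags) scores computed from the fused features F
        # transitions: (n_tags, n_tags) learned CRF transition scores between tags
        n, k = emissions.shape
        score = emissions[0].copy()
        backptr = np.zeros((n, k), dtype=int)
        for t in range(1, n):
            cand = score[:, None] + transitions + emissions[t][None, :]
            backptr[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        best = [int(score.argmax())]
        for t in range(n - 1, 0, -1):
            best.append(int(backptr[t, best[-1]]))
        return best[::-1]  # most likely tag index per token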
The specific process of event trigger word labeling in biomedical texts is as follows:
(1) Text preprocessing is performed on the training set T_r (including sentence splitting, punctuation removal, and the Gdep parsing tool to obtain the dependency-tree parsing results and part-of-speech tagging results).
Since more than 95% of biomedical events described in biological texts occur within a single sentence, biomedical trigger words are extracted at sentence level. During preprocessing, the biomedical-domain sentence splitter Genia Sentence Splitter is used to segment the corpus texts into sentences, and punctuation is then removed from each sentence.
Then, dependency-tree parses and part-of-speech tags are obtained with the Gdep parsing tool. Gdep is a dependency parser specialized for biomedical text and can parse the biomedical texts of the corpus with high accuracy. The dependency context can capture long-distance nonlinear inter-word relations that are difficult to obtain from the linear context; the nonlinear context obtained from the dependency tree is used, together with the linear context, as the input for trigger word recognition.
(2) The same processing method is adopted for the test set Te.
(3) In addition, the pre-trained embedding vector lookup table X_1 is learned from PubMed abstracts, split into sentences, using the Word2Vec model; meanwhile, the PubMed abstract text is processed by sentence splitting and dependency-tree parsing, and the pre-trained embedding vector lookup table Y_1 is learned using the skip-gram word vector training model.
(4) The linear context embedding channel layer, nonlinear context embedding channel layer, bidirectional LSTM layer, max pooling layer, fully connected layer and CRF layer of the dual-channel bidirectional LSTM-CRF network model (shown in figure 1) are constructed from the preprocessed training data.
(5) On the training dataset, the connection weights between the layers of the neural network are iteratively optimized.
The present application uses a standard mini-batch gradient descent forward and backward training procedure. For each iteration, the whole training data is divided into a number of batches and one batch is processed at a time. Each batch contains the sentences determined by a batch-size parameter; in our experiments the batch size was 20, i.e. each batch contains no more than 20 sentences. For each batch, the forward pass of the bidirectional LSTM-CRF model is run first, propagating both the forward and the backward LSTM states, which yields the output f_θ of all tags at all positions. The forward and backward passes of the CRF layer are then run to compute the gradients of the network output and of the state-transition edges. After that, errors are back-propagated from the output to the input and the network parameters are updated, including the parameters of all LSTM forward and backward states, the randomly initialized lookup tables of the linear and nonlinear context embedding channel layers, and the transition parameters of the CRF. (An illustrative training-loop sketch is given after step (7) below.)
(6) After training, inputting the test data into a trained two-channel bidirectional LSTM-CRF network model to obtain the optimal labeling sequence of the input sequence.
(7) Finally, the optimal labeling sequence output by the model is utilized to label event trigger words in the biomedical text.
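An illustrative sketch of the training procedure described in step (5) (the optimizer choice, loss function, and model interfaces are assumptions; crf_loss_fn stands in for the CRF objective described above):

    import tensorflow as tf

    def train(model, crf_loss_fn, batches, epochs=10, lr=0.01):
        # model maps [x_linear, x_nonlinear] to per-token tag scores;
        # crf_loss_fn scores the gold tag sequences against those outputs.
        opt = tf.keras.optimizers.SGD(learning_rate=lr)       # mini-batch gradient descent
        for _ in range(epochs):
            for x_linear, x_nonlinear, y_tags in batches:     # one batch of sentences
                with tf.GradientTape() as tape:
                    scores = model([x_linear, x_nonlinear], training=True)
                    loss = crf_loss_fn(y_tags, scores)
                grads = tape.gradient(loss, model.trainable_variables)
                # updates the embedding lookup tables, both BiLSTMs and the CRF transitions
                opt.apply_gradients(zip(grads, model.trainable_variables))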
The technical scheme provided by this embodiment is applied to the actual recognition of biomedical text event trigger words. We use the corpus DataBioNLP09, which comes from the BioNLP GENIA challenge (2009). In the BioNLP'09 shared task, trigger word recognition is only an intermediate step of the biomedical event extraction task, so no recognition results are available for the official test set; we therefore use the validation dataset as the test dataset.
(I) Data set DataBioNLP09
The corpus comes from the 2009 BioNLP challenge, and Table 1 gives detailed statistics of the training and validation datasets in DataBioNLP09. For the trigger recognition task there are 9 trigger types (corresponding to 9 biomedical event types), which can be divided into three categories: simple event triggers, binding event triggers, and complex event triggers. Together with the negative category of non-trigger words, a 10-class labeling problem has to be handled.
All experiments used TensorFlow to train the network parameters; hyper-parameters were tuned by cross-validation, and the final model was then trained with the optimal combination of settings.
TABLE 1 list of types and numbers of trigger and non-trigger words in DataBioNLP09
(II) results of experiments
(1) Biological trigger word recognition effect comparison
First, based on the same dataset DataBioNLP09, the performance of the biological trigger word recognition method based on the dual-channel bidirectional LSTM-CRF network is compared with other existing recognition methods. Table 2 compares the method of the present application with four other leading feature-based recognition methods, using the following three metrics: P (precision), R (recall), and F1 (F-score, the harmonic mean of precision and recall).
Table 2 Comparison of biological trigger word recognition system performance
System                                                          P      R      F1
Dual-channel bidirectional LSTM-CRF (the method of the application)  70.7   70.9   70.8
System CRF                                                      65.0   30.2   41.2
System SVM                                                      70.2   52.6   60.1
Turku                                                           70.5   60.6   65.2
TrigNER                                                         69.3   57.3   62.7
System CRF is a biological event trigger word recognition system based on the Conditional Random Field (CRF) algorithm and a neighborhood-hash feature extraction method; it was reported to reach state-of-the-art performance in comparable evaluations on the BioNLP dataset. System SVM is a biological event trigger word recognition system based on a Support Vector Machine (SVM) and rich feature extraction. Turku is the best system of the BioNLP'09 biological event recognition shared task. TrigNER is a CRF-based biomedical event trigger word recognition system.
The results in Table 2 show that the method of the present application achieves the best overall performance and differs markedly from the other systems. Moreover, the improvement of the method is driven by an improvement in recall, meaning that more of the possible trigger words are identified, which can further improve the performance of the next-stage event recognition. The application constructs nonlinear long-distance feature vectors from the context information of the dependency syntax tree obtained by parsing the sentence and uses them as input to the deep learning network model; with the two-channel technique they are input in parallel with the linear context information, and the linear and nonlinear abstract features extracted in the two-channel structure are further dynamically fused through the pooling technique. A better feature representation is thereby obtained, the trigger word recognition performance is effectively improved, and a better foundation is provided for improving the event recognition performance of the next stage.
(2) Dual channel performance analysis
The following compares the performance of the bidirectional LSTM-CRF network model using two channels with that of bidirectional LSTM-CRF models using only a single channel. Two single-channel models are considered: (1) a linear single-channel bidirectional LSTM-CRF model; (2) a nonlinear single-channel bidirectional LSTM-CRF model. The performance comparison is shown in Table 3.
Table 3 Two-channel performance comparison
Biological trigger word recognition system           P      R      F1
Dual-channel bidirectional LSTM-CRF                   70.7   70.9   70.8
Linear single-channel bidirectional LSTM-CRF          66.5   67.6   67.0
Nonlinear single-channel bidirectional LSTM-CRF       70.2   66.6   68.3
As can be seen from the results in Table 3, the performance of the two-channel model is higher than that of either single-channel model. This indicates that the two-channel model can effectively acquire context information from different layers of the sentence and, through max pooling, can effectively and dynamically select and fuse the abstract features provided by this information, so that the trigger word recognition performance is effectively improved.

Claims (2)

1. A trigger word recognition method based on a dual-channel bidirectional LSTM-CRF network is characterized in that,
the method is applied to the identification of the biological medicine event trigger words; the method comprises the following steps:
step 1, performing text preprocessing on a biomedical text training set to obtain dependency tree analysis and part-of-speech tagging, and taking nonlinear context relations and linear context acquired from the dependency tree as input for trigger word recognition;
step 2, adopting the text preprocessing method in the same step 1 for the test set;
step 3, pre-trained embedded vector X learned from pubMed abstract article 1 And a pre-training embedded vector lookup table Y 1
Step 4, constructing a linear context embedded channel layer, a nonlinear context embedded channel layer, a bidirectional LSTM layer, a maximum pooling layer, a full connection layer and a CRF layer of the two-channel bidirectional LSTM-CRF network model by using the preprocessed training data in the step 1;
iteratively optimizing the connection weights among layers of the neural network on the training data set;
step 5, after training, inputting test data into the trained double-channel bidirectional LSTM-CRF network model to obtain an optimal labeling sequence of the input sequence;
step 6, marking event trigger words in the biomedical text by utilizing the optimal marking sequence output by the model;
in step 4, the linear context embedding channel layer: for the input sentence s, each input word w_i is converted at the linear channel input layer into corresponding real-valued feature vectors through a series of lookup tables, i.e. into the concatenation of the following vectors:
(1) embedded word feature vector E_w_L: using the embedded word lookup table X_1, each word w_i of the input sentence is mapped to an embedded word vector carrying the semantic information of the linear context contained in X_1;
(2) character embedding feature vector E_c_L: an LSTM network learns the spelling pattern of each word at the character level; the LSTM network parameters are randomly initialized, the input is the character sequence x_t composing word w_i, where t is the number of characters contained in the word, and the output is the character-level embedded vector sequence h_t; the character embedding feature vector E_c_L extracts spelling information from the character sequence of the input word w_i;
(3) part-of-speech embedded feature vector E_p_L: word embedding features are extended with the part of speech; using the embedded feature lookup table X_2, the part-of-speech tag of each word w_i of the input sentence is mapped to an embedded vector carrying the part-of-speech correlation information of the linear context contained in X_2 and extracting contextual sentence information from the input word; the lookup table matrix X_2 is randomly initialized;
(4) named entity type embedding feature vector E_e_L: through an embedded feature lookup table, the named entity type of each word w_i in the linear context of the input sentence is mapped to an embedded vector; the embedded feature lookup table is randomly initialized, where r_6 is the number of all named entity categories and r_7 is the dimension of the embedded vector;
thus, through the linear context embedding channel layer, each input word w_i is converted into a linear vector string x_L = [E_w_L; E_c_L; E_p_L; E_e_L];
in step 4, the nonlinear context embedding channel layer: for the input sentence s, the embedding layer converts each input feature into a real-valued representation vector through a series of lookup tables; each input word w_i is thus converted into the concatenation of the following vectors:
(1) dependency-tree based word embedding feature vector E_w_NL: each word of the input sentence is mapped to a dependency-tree based word embedding vector; using the embedded word lookup table Y_1, each word w_i of the input sentence is mapped to a nonlinear embedded word vector;
(2) dependency-tree based part-of-speech embedding feature vector E_p_NL: word embedding is extended with the part of speech, mapping the pos tag of each word in the dependency-tree context corresponding to the input sentence's dependency tree to a pos embedding vector that extracts syntactic information from the input word; using the embedded feature lookup table Y_2, the part-of-speech tag of each word w_i of the input sentence is mapped to a nonlinear part-of-speech embedded vector carrying the part-of-speech correlation information of the nonlinear context contained in Y_2; the lookup table matrix Y_2 is randomly initialized;
(3) dependency-tree based named entity type embedding feature vector E_e_NL: the named entity type of each word in the nonlinear context corresponding to the input sentence's dependency tree is mapped to an embedded vector that extracts domain-related nonlinear information from the input; using the embedded feature lookup table Y_3, the named entity type of each word w_i in the sentence's dependency-tree context is mapped to an embedded vector; the embedded feature lookup table Y_3 is randomly initialized;
thus, through the nonlinear context embedding channel layer, each input word w_i is converted into a nonlinear vector string x_NL = [E_w_NL; E_p_NL; E_e_NL];
in step 4, the bidirectional LSTM layer:
(1) bidirectional LSTM layer of the linear channel: this bidirectional LSTM layer takes the output x_L of the linear context embedding channel layer as input; let the input sequence of one LSTM cell be the vectors x_1, x_2, ..., x_t of length t; by applying the nonlinear transformation learned during training, as shown in formula 1, it produces an output sequence h_1, h_2, ..., h_t of the same length;
at each LSTM time step t, i_t is the input gate, f_t the forget gate, o_t the output gate, c_t the memory cell, c̃_t the candidate memory cell, and h_t the hidden state; all W and b are trainable parameters of the LSTM, σ(·) and tanh(·) denote the sigmoid and hyperbolic tangent activation functions, and ⊙ denotes the element-wise product;
when the input vector sequence x_L is processed in the forward direction, the linear forward LSTM network output h_F_L is obtained; when the input vector sequence x_L is processed in the backward direction, the linear backward LSTM network output h_B_L is obtained; the outputs of the forward and backward LSTM networks are concatenated to obtain the output of the bidirectional LSTM layer of the linear channel, h_L = [h_F_L; h_B_L];
(2) bidirectional LSTM layer of the nonlinear channel: likewise, this LSTM layer takes the output x_NL of the nonlinear context embedding channel layer as input and produces output sequences of the same length by applying the nonlinear transformation learned during training as shown in formula 1; by concatenating the outputs h_F_NL and h_B_NL of the forward and backward LSTM networks, the final bidirectional LSTM layer output of the nonlinear channel is obtained, h_NL = [h_F_NL; h_B_NL];
in step 4, the max pooling layer: the bidirectional LSTM layers of the two different channels above extract abstract features from both the forward and backward directions, based on linear and nonlinear context information respectively; a max-pooling technique captures the most useful features, and feature selection is made dynamically by taking the maximum value for each dimension j; F_max is the max-pooled dynamic feature output:
F_max = max(h_j^F_L, h_j^B_L, h_j^F_NL, h_j^B_NL)   (2)
F = (F_1, F_2, ..., F_n)   (3);
in step 4, the CRF layer: for the input sequence s = (w_1, w_2, ..., w_n), the pooling layer outputs the fused abstract feature sequence F, and the tag sequence y = (t_1, t_2, ..., t_n) is the final output of the CRF layer; given the fused abstract feature sequence F and tag sequence y of each training instance, the CRF layer defines a max-margin objective to be maximized, where f is a function that assigns a score to each pair (F, y) and Y(F) denotes the tag sequence space of F; the cost function cost(F, y') follows the maximum-margin principle, i.e. a high-cost label sequence y' should receive a greater penalty; the CRF layer learns the strong dependencies f(F, y) between the output tags, producing the most likely output tag sequence.
2. The trigger word recognition method based on the dual-channel bidirectional LSTM-CRF network according to claim 1, characterized in that in step 4 a standard mini-batch gradient descent forward and backward training procedure is used; for each iteration the whole training data is divided into a number of batches and one batch is processed at a time; each batch contains the sentences determined by a batch-size parameter; for each batch, the forward pass of the bidirectional LSTM-CRF model is run first, propagating both the forward and backward LSTM states, yielding the outputs of all tags at all positions; then the forward and backward passes of the CRF layer are run to compute the gradients of the network output and of the state-transition edges; after that, errors are back-propagated from the output to the input and the network parameters are updated, including the parameters of all LSTM forward and backward states, the randomly initialized lookup tables of the linear and nonlinear context embedding channel layers, and the transition parameters of the CRF.
CN201911130490.6A 2019-11-19 2019-11-19 Trigger word recognition method based on double-channel bidirectional LSTM-CRF network Active CN111222318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911130490.6A CN111222318B (en) 2019-11-19 2019-11-19 Trigger word recognition method based on double-channel bidirectional LSTM-CRF network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911130490.6A CN111222318B (en) 2019-11-19 2019-11-19 Trigger word recognition method based on double-channel bidirectional LSTM-CRF network

Publications (2)

Publication Number Publication Date
CN111222318A CN111222318A (en) 2020-06-02
CN111222318B true CN111222318B (en) 2023-09-12

Family

ID=70808044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911130490.6A Active CN111222318B (en) 2019-11-19 2019-11-19 Trigger word recognition method based on double-channel bidirectional LSTM-CRF network

Country Status (1)

Country Link
CN (1) CN111222318B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347783B (en) * 2020-11-11 2023-10-31 湖南数定智能科技有限公司 Alarm condition and stroke data event type identification method without trigger words
CN112541364A (en) * 2020-12-03 2021-03-23 昆明理工大学 Chinese-transcendental neural machine translation method fusing multilevel language feature knowledge
CN112800762A (en) * 2021-01-25 2021-05-14 上海犀语科技有限公司 Element content extraction method for processing text with format style
CN113360667B (en) * 2021-05-31 2022-07-26 安徽大学 Biomedical trigger word detection and named entity identification method based on multi-task learning
CN114201970A (en) * 2021-11-23 2022-03-18 国家电网有限公司华东分部 Method and device for capturing power grid scheduling event detection based on semantic features

Citations (4)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090089058A1 (en) * 2007-10-02 2009-04-02 Jerome Bellegarda Part-of-speech tagging using latent analogy
CN108920445A (en) * 2018-04-23 2018-11-30 华中科技大学鄂州工业技术研究院 A kind of name entity recognition method and device based on Bi-LSTM-CRF model
CN108829801A (en) * 2018-06-06 2018-11-16 大连理工大学 A kind of event trigger word abstracting method based on documentation level attention mechanism
CN108897989A (en) * 2018-06-06 2018-11-27 大连理工大学 A kind of biological event abstracting method based on candidate events element attention mechanism

Also Published As

Publication number Publication date
CN111222318A (en) 2020-06-02


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230810

Address after: No. 86, Yushan West Road, Jiangpu street, Pukou District, Nanjing, Jiangsu 210012

Applicant after: NANJING AUDIT University

Address before: No. 86 Yushan West Road, Pukou District, Nanjing City, Jiangsu Province, 210012

Applicant before: Chen Yifei

GR01 Patent grant
GR01 Patent grant