CN115238690A - Military field composite named entity identification method based on BERT - Google Patents

Military field composite named entity identification method based on BERT

Info

Publication number
CN115238690A
Authority
CN
China
Prior art keywords
word
model
layer
attention
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111408527.4A
Other languages
Chinese (zh)
Inventor
周焕来
张博阳
乔磊崖
高源�
郭健煜
唐小龙
贾海涛
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yituo Communications Group Co ltd
Original Assignee
Yituo Communications Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yituo Communications Group Co ltd filed Critical Yituo Communications Group Co ltd
Priority to CN202111408527.4A priority Critical patent/CN115238690A/en
Publication of CN115238690A publication Critical patent/CN115238690A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/268Morphological analysis

Abstract

The invention provides a BERT-based method for recognizing composite named entities in the military field, which comprises the following steps: at the input layer, character vectors are represented with a BERT pre-training model and word vectors with Word2Vec, the two are used jointly for word embedding, and a data-enhancement operation is introduced; at the vector representation layer the character and word vectors are concatenated to enrich the original input and construct the initial sentence vector; at the encoding layer, Bi-On-LSTM captures the global semantic information of the text; an attention layer is introduced to update the semantic weights; at the decoding layer, an LSTM-Unit long short-term memory network extracts nested named entities, so the number of output labels is not limited to a single label while the strengths of the traditional On-LSTM in extracting sentence-level information are inherited; softmax produces the predicted output, and a CRF layer ensures that the sequence-labeling result satisfies the dependencies between labels, completing the domain nested-entity extraction task. Because the decoding layer uses the LSTM-Unit decoding scheme and introduces a CRF layer to guarantee the sequence dependencies, problems such as entity nesting are handled well and the accuracy of nested-entity extraction in the military field is improved.

Description

Military field composite named entity identification method based on BERT
Technical Field
The invention belongs to the field of natural language processing.
Background
In recent years, increasingly mature computers and related information-processing technologies have provided effective means for further improving military command efficiency, and the informatization and intelligentization of our military is advancing steadily. As an emerging technology, the knowledge graph can integrate complex, massive data and link it through mined relations, giving it strong data-description capability and rich semantic relations. Entities are the key language units that carry information in text and the core elements of a knowledge graph. Entity recognition and extraction underpin subsequent work such as attribute and relation extraction, event extraction and knowledge-graph construction; the main task is to identify and classify entities with specific meaning in text (for example, the general domain usually takes names of people, places and organizations as extraction targets). The subject of the present invention is target entities in the military field, that is, high-value named entities in unstructured military text such as military figures, weaponry and military events, which often carry rich military knowledge. Correctly and efficiently recognizing military named entities supports subsequent operations such as battlefield information acquisition, information retrieval, information filtering, information association and semantic search, improves the efficiency of reconnaissance, command decision-making, and organization and implementation, and thereby raises the automation and intelligence of military combat command.
Named entity recognition is an important research direction in the field of information extraction: it extracts the entities mentioned in unstructured text. Composite named entity recognition is an optimization of named entity recognition for a specific situation; its core objective is to extract from the text all elements that may be entities and to distinguish the nesting relationships among them.
At present, researchers pay more attention to ordinary named entity recognition, and research on composite named entity recognition is limited. Most existing methods can therefore only perform ordinary named entity recognition and do not perform well on composite entities, so the models lack portability and can hardly complete open-domain entity extraction. Unlike most scholars, the present invention adopts, at the entity-extraction stage, a model entirely different from ordinary named entity recognition. Given the good performance of the encoder-decoder framework and the attention mechanism on other natural language processing tasks, and because the task studied here is fundamentally a sequence-labeling task, the invention builds a model for extracting composite entities among event elements by combining the attention mechanism with an encoder-decoder framework.
Disclosure of Invention
The invention provides a composite-entity extraction method based on an encoding-decoding model, with the aim of improving the accuracy and efficiency of composite-entity extraction. The method comprises the following steps:
(1) Select features at the input layer to construct the initial sentence vector.
(2) Capture hierarchical structure information and sequence information at the coding layer.
(3) Capture word-to-word information within the sentence at the attention layer and compute the corresponding weights.
(4) Combine the features from the preceding layers at the decoding layer and extract further abstract features.
(5) Obtain the element recognition result at the output layer using a softmax function.
Drawings
FIG. 1 is a block diagram of a codec framework employed in the present invention.
FIG. 2 is a schematic diagram of the features of the constructed text vector employed in the present invention.
FIG. 3 defines military nested entity types for the present invention.
FIG. 4 shows the joint character-word embedding representation at the input layer of the present invention.
Fig. 5 is a schematic diagram of the EDA data enhancement added after the representation layer.
FIG. 6 is a view showing the structure of On-LSTM.
Fig. 7 illustrates the core concept of the attention mechanism used in the present invention.
Fig. 8 is a schematic diagram of a composite entity tag.
Fig. 9 is a general configuration diagram of a decoding layer.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention.
As shown in fig. 1, the event-element extraction method is based on an encoder-decoder framework combined with an attention mechanism and consists of five parts: an input layer (Input Layer), an encoding layer (Encoder Layer), an attention layer (Attention Layer), a decoding layer (Decoder Layer) and an output layer (Output Layer). The specific implementation is as follows:
the method comprises the following steps: input layer
In other named-entity extraction models, the entity-boundary and part-of-speech features obtained by Chinese word segmentation are the two features most commonly used. In practice, however, when nested entities are the target, obtaining segmentation and part-of-speech information with a natural language processing tool (such as HIT's LTP) during text preprocessing can propagate errors and degrade model performance. Therefore, to guarantee the accuracy and efficiency of the entity recognition model, only raw features that can be extracted directly from the text are selected.
In the invention, word embedding maps the text sequence into a high-dimensional vector space. Strategies that have proved effective in existing research are fused: Chinese character embeddings containing full-text information are obtained from the pre-trained semantic model BERT (Bidirectional Encoder Representations from Transformers), and the language rules and semantic knowledge carried in these vectors assist the recognition of nested entities in the military field; at the same time, the Word2Vec technique is fused in to introduce the boundary and part-of-speech features of Chinese words and improve entity-recognition performance.
The invention adds data enhancement between the output of the vector representation layer and the encoder to generate negative examples of the sentence-vector representation and strengthen model training.
As shown in FIG. 2, for the military-field nested entity recognition task the invention selects BERT character vectors (position, sentence and character features) and W2V word vectors (part-of-speech and boundary features). In FIG. 2 the nested noun "cross-border business contract negotiation" belongs to the target event entity category. A sentence is denoted by L = {w_1, w_2, ..., w_n}, where n is the sentence length and w_i is the i-th word; each word w_i is converted into a vector x_i composed of three parts.
The input layer of the invention comprises 3 steps as follows:
1. BERT character vector representation
The input to the BERT layer is the superposition of the word embedding (token embedding), sentence embedding (segment embedding) and position embedding of each character. Since the experimental corpus is fed in as single sentences, the sentence (segment) embedding represents the sentence information, and the sentence sequence is marked by inserting the symbols [CLS] and [SEP] at its head and tail. The BERT layer maps each character in the sentence to a low-dimensional dense character vector.
In the invention, BERT is trained with bidirectional Transformer computation and maps each character in the sentence into a low-dimensional dense character vector.
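For illustration only, the following is a minimal sketch of obtaining per-character BERT vectors for a Chinese sentence; the use of the HuggingFace transformers library and the public bert-base-chinese checkpoint is an assumption made for this example, not part of the patented implementation.

```python
# Illustrative sketch only (not the patented implementation): obtain per-character
# BERT vectors for a Chinese sentence. The HuggingFace transformers library and the
# public "bert-base-chinese" checkpoint are assumptions made for this example.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

sentence = "跨境商务合同谈判"  # nested-entity example phrase mentioned for Fig. 2
# Chinese BERT tokenizes essentially character by character;
# [CLS] and [SEP] are inserted at the head and tail automatically.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (1, seq_len, 768); dropping [CLS]/[SEP]
# leaves one low-dimensional dense vector per character.
char_vectors = outputs.last_hidden_state[0, 1:-1]
print(char_vectors.shape)  # torch.Size([8, 768]) for this 8-character sentence
```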
2. Word2Vec word vector representation
Let e(w_i) denote the word vector of the word w_i. Word2Vec from the Python Gensim topic-model package is used to train the word vectors, and the Skip-gram model is adopted.
On top of the character features extracted by BERT, part-of-speech and boundary semantic features are introduced as one of the representation features for the subsequent neural network. Some parts of speech of the original general domain are removed and a part-of-speech tag set with military characteristics is introduced, covering the parts of speech and named-entity types relevant to the military field. The military-domain nested entity types defined by the model are shown in fig. 3.
For each part of speech in the segmented and POS-tagged corpus, a table lookup yields the corresponding part-of-speech vector P_i (POS embedding); this is then spliced and fused in a suitable way with the character vector C_i obtained by passing the character sequence through the BERT pre-training model, giving the mixed feature vector representation H_i.
This exploits the latent information of Chinese text, makes better use of Chinese part-of-speech rules, and introduces the word-boundary information ignored by a character-level model, fusing lexical information and entity-boundary information in the model's embedding layer. It thereby helps the model judge entity type and boundary during recognition and improves its recognition ability. After the two vector representations are obtained from BERT and W2V respectively, they are combined by concatenation to fuse the joint text features, as shown in fig. 4.
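As a minimal sketch (not the patent's implementation), the following trains a Skip-gram Word2Vec model with Gensim and concatenates a word's vector with a BERT character vector; the toy corpus, the vector dimensions and the way a word vector is attached to each of its characters are illustrative assumptions.

```python
# Illustrative sketch (assumes Gensim 4.x; not the patent's code): train a Skip-gram
# Word2Vec model on pre-segmented text and concatenate a word's vector with a BERT
# character vector. The toy corpus and dimensions are placeholders.
import numpy as np
from gensim.models import Word2Vec

segmented_corpus = [
    ["跨境", "商务", "合同", "谈判"],
    ["军事", "演习", "区域"],
]  # each sentence is a list of already-segmented words

w2v = Word2Vec(sentences=segmented_corpus, vector_size=100, sg=1, window=5, min_count=1)

def joint_feature(char_vec: np.ndarray, containing_word: str) -> np.ndarray:
    """Concatenate a BERT character vector with the Word2Vec vector of the word
    that contains this character (the splicing/fusion described above)."""
    if containing_word in w2v.wv:
        word_vec = w2v.wv[containing_word]
    else:
        word_vec = np.zeros(w2v.wv.vector_size, dtype=np.float32)
    return np.concatenate([char_vec, word_vec])

# e.g. a 768-d BERT character vector + 100-d word vector -> 868-d joint vector
print(joint_feature(np.zeros(768, dtype=np.float32), "合同").shape)  # (868,)
```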
3. Data enhancement
Data enhancement techniques (EDA, Easy Data Augmentation) were first proposed in image processing and have since become standard in that field, where flipping, rotation, mirroring, Gaussian white noise and similar operations augment the data and improve the robustness of classifiers in image recognition. In natural language processing, data-enhancement techniques have produced variants for different tasks, such as text classification and part-of-speech tagging. So-called adversarial training can in fact be regarded as a way of improving a model's generalization by expanding limited data: data-enhancement techniques improve model performance by generating perturbations that the classifier easily mistakes for false instances.
The model's training data come from a limited set of cases; to improve the performance of the military-field nested entity recognition model, prevent overfitting and improve generalization, four data-enhancement operations are introduced.
1) Synonym replacement (SR, Synonym Replace): ignoring stopwords, randomly select n words from the sentence and replace each with a synonym drawn at random from a synonym dictionary.
2) Random insertion (RI, Random Insert): ignoring stopwords, randomly pick a word, randomly choose one of its synonyms and insert it at a random position in the original sentence; this process may be repeated n times.
3) Random swap (RS, Random Swap): randomly choose two words in the sentence and exchange their positions; this process may be repeated n times.
4) Random deletion (RD, Random Delete): delete each word in the sentence with probability p. The original input information is enhanced by adding the data-enhancement operation on top of the concatenated word-vector representation layer, as shown in fig. 5.
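A simplified sketch of these four operations on a word-segmented sentence is given below; the stopword list and synonym dictionary are hypothetical placeholders, and a real system would plug in a domain lexicon.

```python
# Simplified sketch of the four EDA operations above on a word-segmented sentence.
# The stopword list and synonym dictionary are hypothetical placeholders.
import random

STOPWORDS = {"的", "了", "在"}
SYNONYMS = {"谈判": ["磋商", "会谈"], "区域": ["地域"]}

def synonym_replace(words, n=1):                      # SR
    candidates = [i for i, w in enumerate(words) if w not in STOPWORDS and w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        words[i] = random.choice(SYNONYMS[words[i]])
    return words

def random_insert(words, n=1):                        # RI
    for _ in range(n):
        candidates = [w for w in words if w not in STOPWORDS and w in SYNONYMS]
        if not candidates:
            break
        synonym = random.choice(SYNONYMS[random.choice(candidates)])
        words.insert(random.randrange(len(words) + 1), synonym)
    return words

def random_swap(words, n=1):                          # RS
    for _ in range(n):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_delete(words, p=0.1):                      # RD
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

sentence = ["跨境", "商务", "合同", "谈判"]
print(random_swap(sentence[:], n=1))
```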
The input representation layer therefore comprises the word embedding that fuses BERT character vectors with W2V word vectors, together with the data-enhancement operation realized by adding a small perturbation to the training data, as shown in Equation 1.
η_adv = argmax_{η, ‖η‖≤ε} L(ω + η; θ̂)    (1)

That is, the worst-case enhancement perturbation η_adv is added to the original embedding vector ω so as to maximize the loss function, where θ̂ is a copy of the current model parameters. The original examples and the generated enhanced-data sentences are then trained jointly, so the final loss is shown in Equation 2:

L_final = L(ω; θ) + L(ω + η_adv; θ)    (2)

In summary, the real-valued vector x_i of the word w_i can be expressed as in Equation 3:

x_i = C_i ⊕ P_i ⊕ e(w_i)    (3)

where x_i ∈ R^d is of dimension d and ⊕ denotes concatenation of vectors. X = {x_1, x_2, ..., x_n} then represents an event sentence L of length n, where X ∈ R^{n×d} is of dimension n × d and x_i is the real-valued vector of the i-th word w_i.
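By way of illustration, one common way to realize the worst-case perturbation of Equation 1 and the joint objective of Equation 2 is an FGM-style step on the embedded input; the patent text does not name a specific algorithm, so the sketch below (PyTorch) is an assumption.

```python
# Illustrative sketch of an FGM-style realization of Eq. 1/Eq. 2 in PyTorch.
# The patent does not name a specific algorithm, so this is an assumption.
import torch

def adversarial_joint_loss(model, embeddings, labels, loss_fn, epsilon=1.0):
    """embeddings: embedded input with requires_grad=True; model consumes embeddings
    directly (an assumption). Returns clean loss + adversarial loss (Eq. 2)."""
    loss_clean = loss_fn(model(embeddings), labels)
    grad = torch.autograd.grad(loss_clean, embeddings, retain_graph=True)[0]
    # approximate worst-case perturbation within an L2 ball of radius epsilon (Eq. 1)
    eta_adv = epsilon * grad / (grad.norm() + 1e-12)
    loss_adv = loss_fn(model(embeddings + eta_adv.detach()), labels)
    return loss_clean + loss_adv
```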
Step two: coding layer
Different combinations of encoding and decoding layers can be chosen for different tasks; on image-processing tasks a convolutional neural network usually forms the encoding layer, whereas for a natural-language task such as event-element extraction a recurrent neural network is usually chosen. Because a sentence can be represented as a hierarchical structure, and the neurons of a conventional recurrent network such as LSTM are unordered, such networks cannot extract the hierarchical information of a sentence. The invention therefore selects a bidirectional ordered long short-term memory network (Bi-On-LSTM) as the basic structure of the coding layer. The forward computation of On-LSTM is given in Equation 4, and FIG. 6 is a schematic structural diagram of the On-LSTM unit.
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
ĉ_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
f̃_t = cumax(W_f̃ x_t + U_f̃ h_{t-1} + b_f̃)
ĩ_t = 1 − cumax(W_ĩ x_t + U_ĩ h_{t-1} + b_ĩ)
ω_t = f̃_t ∘ ĩ_t
c_t = (f_t ∘ ω_t + f̃_t − ω_t) ∘ c_{t-1} + (i_t ∘ ω_t + ĩ_t − ω_t) ∘ ĉ_t
h_t = o_t ∘ tanh(c_t)    (4)

The main additions of ON-LSTM are the new master forget gate f̃_t and master input gate ĩ_t, which are obtained with the rightward and leftward cumulative-softmax (cumsum) operations respectively. The forward On-LSTM computes the left-context state h_t^L of the word x_t at time t, and the backward On-LSTM computes its right-context state h_t^R; the output of the coding layer at time t is the concatenation h_t = [h_t^L ; h_t^R].
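As an illustration of the master gates mentioned above, the following minimal sketch implements the cumulative-softmax (cumax) operation and the master forget/input gates; the weight shapes and the surrounding LSTM plumbing are omitted, and this is an assumption rather than the patent's code.

```python
# Minimal sketch of the ON-LSTM master gates via cumulative softmax (cumax);
# weight shapes and the rest of the LSTM cell are omitted and assumed.
import torch
import torch.nn.functional as F

def cumax(x, dim=-1, reverse=False):
    """Cumulative sum of a softmax; the rightward/leftward 'cumsum' mentioned above."""
    probs = F.softmax(x, dim=dim)
    if reverse:
        probs = torch.flip(probs, dims=[dim])
        return torch.flip(probs.cumsum(dim), dims=[dim])
    return probs.cumsum(dim)

def master_gates(pre_forget, pre_input):
    """pre_forget / pre_input: pre-activations of the master forget / input gates."""
    f_master = cumax(pre_forget)           # monotonically increasing in [0, 1]
    i_master = 1.0 - cumax(pre_input)      # monotonically decreasing in [0, 1]
    omega = f_master * i_master            # overlap region used to mix the plain gates
    return f_master, i_master, omega

f_m, i_m, w = master_gates(torch.randn(1, 8), torch.randn(1, 8))
```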
Step three: attention layer
In short, the attention mechanism ignores unimportant features among a large number of features while strengthening the focus on useful ones. Attention mechanisms are divided into the Soft-Attention model and the Self-Attention model; FIG. 7 shows the core idea of the attention mechanism used in the present invention.
In the Soft-Attention model, the items of the input sequence S are first abstracted into data pairs of the form <Key, Value>. For a given item Query in the target sequence T, the weight coefficient of each Key in the input sequence is obtained by computing the correlation between that Key and the Query, and the final attention value (Attention Value) is obtained by weighting and summing the Values corresponding to all Keys, as shown in Equation 5:

Attention(Query, S) = Σ_{j=1}^{L_x} Similarity(Query, Key_j) · Value_j    (5)

where L_x is the length of the input sequence S.
The Self-Attention model, adopted by Google's latest machine translation models, is also called the self-attention mechanism. In the Soft-Attention model, attention mainly acts between each word of the input sequence S and the Query of the target sequence T; in the Self-Attention model, attention mainly acts among the words inside the input sequence S or the target sequence T itself. The computation of attention is similar to that of the Soft-Attention model; only the objects of the computation differ.
For natural language processing tasks, the Self-Attention model is strong at capturing semantic features between the words of a sentence. For recurrent neural networks and gated recurrent unit networks, adding the Self-Attention model lets related information between words, which would otherwise require many computation steps, be linked directly in a single step, so word dependencies can be captured well even in long sentences.
Unlike most previous natural language processing work, the model of the invention does not take the weighted sum of the encoder hidden states h_t directly as the context vector c_t of the attention layer. Instead, the hidden state h_t is ignored when computing c_t, and when predicting the final result y_t the decoding-layer output s_t and the coding-layer h_t are used together as features. This is because, at time t, the coding-layer h_t represents the semantic information of the current candidate event element and is the most effective information for predicting y_t, whereas c_t represents the influence of the other words of the sentence on the candidate event element. The context vector c_t is therefore computed as in Equation 6:

c_t = Σ_{j=1, j≠t}^{n} a_{t,j} h_j    (6)

where the hidden state of the coding layer is denoted h_j and the attention weight is denoted a_{t,j}, computed as in Equation 7:

a_{t,j} = exp(e_{t,j}) / Σ_k exp(e_{t,k})    (7)

where the attention score is denoted e_{t,j}; as e_{t,j} increases the attention weight a_{t,j} grows, the influence of the coding-layer encoding h_j on the context vector c_t increases, and so does its influence on the final element type.

For different deep-learning tasks, the attention score e_{t,j} can be computed by dot product (dot), multiplication (general) or addition (concat), where s_t^T is the transpose of s_t and W_a is the weight matrix of the attention layer. Extensive experiments show that for natural language processing tasks the general form yields better final results, so it is chosen for the event-element extraction task studied by the invention, as shown in Equation 8:

e_{t,j} = s_t^T W_a h_j    (8)
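For illustration, the following sketch computes the general attention score of Equation 8, the softmax weights of Equation 7, and a context vector that skips the candidate position itself as in Equation 6; the tensor dimensions are illustrative assumptions.

```python
# Illustrative sketch of Eqs. 6-8: the "general" score e_{t,j} = s_t^T W_a h_j,
# softmax weights a_{t,j}, and a context vector that skips position t itself.
# Dimensions are illustrative assumptions.
import torch

def general_attention(s_t, H, W_a, t):
    """s_t: (d_s,) decoding-layer state; H: (n, d_h) coding-layer states; W_a: (d_s, d_h)."""
    scores = H @ (W_a.T @ s_t)                 # e_{t,j} for every position j
    scores[t] = float("-inf")                  # exclude the candidate word itself
    weights = torch.softmax(scores, dim=0)     # a_{t,j}
    return weights @ H                         # context vector c_t of shape (d_h,)

n, d_s, d_h = 6, 128, 128
c_t = general_attention(torch.randn(d_s), torch.randn(n, d_h), torch.randn(d_s, d_h), t=2)
```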
The core of the model designed by the invention is the attention mechanism, and the core of the attention mechanism is how the attention weights are allocated, which greatly affects the final result of the event-element model. When judging the role type of an event element, the attention mechanism assigns each word in the sentence a weight, and during recognition and classification the model pays more attention to words with larger weights.
Step four: decoding layer
The function of the decoding layer of the invention is to assign each token labels, not limited to a single one, based on the token and the other information output by the coding layer and the attention layer.
The decoding layer adopts an LSTM-Unit long short-term memory network. As in other models that use the encoder-decoder framework, the first hidden state of the decoding layer is computed from the last hidden state of the coding layer; this state can be initialized and then updated during training. Each label l_i ∈ L corresponds to a representation g_i. At each step the layer also has the hidden output h_{t-1} of the unit at the previous time step, the model output o_{t-1} of the previous time step, and the representation x_t of the token input at the current time step; the task of this layer is to compute, from these quantities, the current hidden output h_t and the current model output o_t. The procedure is as follows:
1) The model output o_{t-1} of the previous time step is mapped onto the interval [0, 1] by softmax, and all labels whose probability exceeds a preset threshold T are collected. Taking fig. 8 as an example, if the model behaves as expected, then for the current token "his" it should find two labels that meet the threshold requirement, U_PER and B_PER. Note that the label O itself is not being predicted at this point; rather, to effectively model the probability of each new entity beginning, every token must carry at least one O label, i.e. the model ultimately needs to predict O, U_PER and B_PER.
2) Since the model predicted three possible labels at the previous time step, at the current step we must examine which label each of them may lead to. The current x_t is therefore copied three times, and each copy computes a current hidden output through the LSTM unit. Specifically, for a possible label k of the previous time step, the hidden result of the current time step is given by Equation 9:

h_t^k = LSTM(x_t ⊕ g_k, h_{t-1})    (9)
3) Thus, for each prediction of the previous time step, a hidden representation of the corresponding current time step is obtained. The three hidden representations are then averaged, as shown in Equation 10:

h_t = (1 / |G_{t-1}|) Σ_{k ∈ G_{t-1}} h_t^k    (10)
where |G_{t-1}| is the number of labels meeting the threshold at the previous time step, equal to 3 in the present example;
4) The output of the current time step is obtained as o_t = U h_t + b, where U and b are the weight matrix and bias term of the feed-forward network (FFN).
The above describes, in a Decoder-like fashion, how all possible labels of each token are output one by one; if the description is hard to follow directly, it can be understood with the help of the model structure in fig. 9.
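As an illustrative sketch of steps 1) to 4) above (the shapes, the label-embedding table and the fallback behaviour are assumptions, not the patent's code):

```python
# Illustrative sketch of one decoding time step following 1)-4) above: keep every
# label above threshold T from the previous step, run the LSTM cell once per kept
# label (token representation concatenated with the label embedding), average the
# hidden states (Eq. 10) and project to the current output. Shapes, the label
# embedding table and the fallback to the best label are assumptions.
import torch
import torch.nn as nn

def decode_step(cell: nn.LSTMCell, o_prev, label_emb, x_t, h_prev, c_prev, U, b, T=0.5):
    probs = torch.softmax(o_prev, dim=-1)                 # map o_{t-1} into [0, 1]
    kept = (probs > T).nonzero(as_tuple=True)[0]
    if kept.numel() == 0:
        kept = probs.argmax().unsqueeze(0)
    h_list = []
    for k in kept:                                        # one LSTM step per kept label
        inp = torch.cat([x_t, label_emb[k]]).unsqueeze(0)
        h_k, _ = cell(inp, (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
        h_list.append(h_k.squeeze(0))
    h_t = torch.stack(h_list).mean(dim=0)                 # average over |G_{t-1}| labels
    o_t = U @ h_t + b                                     # o_t = U h_t + b
    return h_t, o_t
```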
The loss function based on the above idea is given in Equation 11:

Loss = − Σ_{t=1}^{n} Σ_{l ∈ L} y_{t,l} log ŷ_{t,l}    (11)

where y_{t,l} is the gold indicator of label l at position t and ŷ_{t,l} is the predicted probability.
This is essentially a multi-class cross-entropy loss; the output of each training pass is corrected according to it, and training can end once the model's performance reaches the required standard. Because the number of labels output by the model is not limited to a single one, while the strengths of the traditional On-LSTM in extracting sentence-level information are inherited, this is the core procedure of the composite named entity recognition decoding structure.
Step five: output layer
The output layer processes the output of the decoding layer with a softmax function to obtain the classification result. In the task studied by the invention, what most affects the classification result is whether the current semantics (the candidate event element) is a simple entity or a nested entity. Therefore, when computing the attention weights for c_t, only the influence of the other words in the sentence on the candidate event element is considered, while the semantic information h_t obtained after the candidate event element passes through the coding layer is left out of c_t, so as to preserve as much information about the word itself as possible. The output y_t of the output layer contains both c_t and h_t, as shown in Equation 12:
y t =softmax(w h h t +w c c t +b) (12)
where the weight matrices w_h and w_c are both randomly initialized, b is a bias vector, and y_t is the prediction result for the current word.
Let the number of training samples be T, let the i-th sample be (x_i, y_i), and let θ denote the model parameters; the target loss function of the model is then given by Equation 13:

J(θ) = − (1/T) Σ_{i=1}^{T} log p(y_i | x_i; θ)    (13)
During training, the target loss function is optimized with Adam, while Dropout is used to prevent overfitting.
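A generic sketch of this training configuration (Adam plus Dropout) applied to the output head of Equation 12 is shown below; the layer sizes, dropout rate and learning rate are illustrative assumptions, not values given in the patent.

```python
# Generic sketch of the training configuration mentioned above (Adam + Dropout)
# applied to the output head of Eq. 12; sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

class OutputHead(nn.Module):
    def __init__(self, d_h, d_c, num_labels, p_drop=0.5):
        super().__init__()
        self.drop = nn.Dropout(p_drop)
        self.w_h = nn.Linear(d_h, num_labels, bias=False)
        self.w_c = nn.Linear(d_c, num_labels, bias=True)  # its bias plays the role of b

    def forward(self, h_t, c_t):
        # y_t = softmax(w_h h_t + w_c c_t + b), with Dropout for regularization
        return torch.softmax(self.w_h(self.drop(h_t)) + self.w_c(self.drop(c_t)), dim=-1)

head = OutputHead(d_h=256, d_c=256, num_labels=9)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
```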
Step six: CRF layer
y_i is the probability matrix of the labels corresponding to x_i, and the final output is the label with the largest value in y_i. However, the chosen label may violate the constraint rules, for example the label PER-B being followed by the label ORG-E. To ensure that the labeling result of the whole sequence satisfies the dependencies between labels, a transition matrix T is introduced whose element T_ij denotes the probability of transitioning from label i to label j. This layer applies the Viterbi algorithm; the result is computed jointly from the output matrix Y of the upper layer and the transition matrix T, and the predicted output of the whole sequence is given by Equation 14:

y* = argmax_y Σ_{i=1}^{n} ( Y_{i, y_i} + T_{y_{i−1}, y_i} )    (14)
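For illustration, a minimal NumPy sketch of Viterbi decoding over the emission matrix Y and the transition matrix T of Equation 14 (not the patent's implementation):

```python
# Minimal NumPy sketch (not the patent's implementation) of Viterbi decoding over
# the emission matrix Y and the label transition matrix T, as in Eq. 14.
import numpy as np

def viterbi(Y, T):
    """Y: (n, L) per-position label scores; T: (L, L) transition scores from label i to j."""
    n, L = Y.shape
    score = Y[0].copy()
    back = np.zeros((n, L), dtype=int)
    for t in range(1, n):
        total = score[:, None] + T + Y[t][None, :]   # rows: previous label, cols: current label
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):                    # backtrack the best label sequence
        path.append(int(back[t][path[-1]]))
    return path[::-1]

print(viterbi(np.random.rand(5, 4), np.random.rand(4, 4)))
```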
although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited in scope to the specific embodiments. Such variations are obvious and all inventions utilizing the inventive concept are intended to be protected.

Claims (6)

1. A military field composite named entity recognition method based on BERT, characterized in that the method recognizes and classifies simple entities and nested entities in military-field event sentences and comprises the following steps:
Step 1: select features at the input layer to construct character vectors and word vectors, and perform data enhancement;
Step 2: capture hierarchical structure information and sequence information at the coding layer;
Step 3: capture word-to-word information within the sentence at the attention layer and compute the corresponding weights;
Step 4: combine the features from the preceding layers at the decoding layer and further extract abstract features;
Step 5: obtain the element recognition result at the output layer using the softmax function and the CRF algorithm.
2. The BERT-based military field composite named entity recognition method of claim 1, wherein in step 1 selecting features at the input layer to construct character vectors and word vectors and performing data enhancement specifically means: in the military-field composite named entity recognition task, two kinds of features, character vectors and word vectors, are selected, and a sentence is represented in the form L = {w_1, w_2, ..., w_n};
where the sentence length is n and the i-th word in the sentence is w_i; each word w_i is converted into a vector x_i;
Step 1.1: training the character vectors
The input to the BERT layer is the superposition of the word embedding (token embedding), sentence embedding (segment embedding) and position embedding of each character. Since the experimental corpus is fed in as single sentences, the sentence (segment) embedding represents the sentence information, and the sentence sequence is marked by inserting the symbols [CLS] and [SEP] at its head and tail. The BERT layer maps each character in the sentence to a low-dimensional dense character vector.
In the invention, BERT is trained with bidirectional Transformer computation and maps each character in the sentence into a low-dimensional dense character vector.
Step 1.2: training the word vectors
Let e(w_i) denote the word vector of the word w_i. Word2Vec from the Python Gensim topic-model package is used to train the word vectors, and the Skip-gram model is adopted.
On top of the character features extracted by BERT, part-of-speech and boundary semantic features are introduced as one of the representation features for the subsequent neural network. Some parts of speech of the original general domain are removed and a part-of-speech tag set with military characteristics is introduced, covering the parts of speech and named-entity types relevant to the military field. The military-domain nested entity types defined by the model are shown in fig. 3.
For each part of speech in the segmented and POS-tagged corpus, a table lookup yields the corresponding part-of-speech vector P_i (POS embedding); this is then spliced and fused in a suitable way with the character vector C_i obtained by passing the character sequence through the BERT pre-training model, giving the mixed feature vector representation H_i.
This exploits the latent information of Chinese text, makes better use of Chinese part-of-speech rules, and introduces the word-boundary information ignored by a character-level model, fusing lexical information and entity-boundary information in the model's embedding layer. It thereby helps the model judge entity type and boundary during recognition and improves its recognition ability. After the two vector representations are obtained from BERT and W2V respectively, they are combined by concatenation to fuse the joint text features, as shown in fig. 4.
Step 1.3: data enhancement
The model's training data come from a limited set of cases; to improve the performance of the military-field nested entity recognition model, prevent overfitting and improve generalization, four data-enhancement operations are introduced.
1) Synonym replacement (SR, Synonym Replace): ignoring stopwords, randomly select n words from the sentence and replace each with a synonym drawn at random from a synonym dictionary.
2) Random insertion (RI, Random Insert): ignoring stopwords, randomly pick a word, randomly choose one of its synonyms and insert it at a random position in the original sentence; this process may be repeated n times.
3) Random swap (RS, Random Swap): randomly choose two words in the sentence and exchange their positions; this process may be repeated n times.
4) Random deletion (RD, Random Delete): delete each word in the sentence with probability p.
The original input information is enhanced by adding the data-enhancement operation on top of the concatenated word-vector representation layer, as shown in fig. 5.
The input representation layer comprises the word embedding that fuses BERT character vectors with W2V word vectors, together with the data-enhancement operation realized by adding a small perturbation to the training data, as shown in Equation 1:

η_adv = argmax_{η, ‖η‖≤ε} L(ω + η; θ̂)    (1)

That is, the worst-case enhancement perturbation η_adv is added to the original embedding vector ω so as to maximize the loss function, where θ̂ is a copy of the current model parameters. The original examples and the generated enhanced-data sentences are then trained jointly, so the final loss is shown in Equation 2:

L_final = L(ω; θ) + L(ω + η_adv; θ)    (2)

In summary, the real-valued vector x_i of the word w_i can be expressed as in Equation 3:

x_i = C_i ⊕ P_i ⊕ e(w_i)    (3)

where x_i ∈ R^d is of dimension d and ⊕ denotes concatenation of vectors. X = {x_1, x_2, ..., x_n} then represents an event sentence L of length n, where X ∈ R^{n×d} is of dimension n × d and x_i is the real-valued vector of the i-th word w_i.
3. The method for extracting event element entity relationship based on coding and decoding model as claimed in claim 2, wherein the capturing hierarchical structure information and sequence information at the coding layer in step 2 specifically refers to:
Different combinations of encoding and decoding layers can be chosen for different tasks; on image-processing tasks a convolutional neural network usually forms the encoding layer, whereas for a natural-language task such as event-element extraction a recurrent neural network is usually chosen; because a sentence can be represented as a hierarchical structure, and the neurons of a conventional recurrent network such as LSTM are unordered, such networks cannot extract the hierarchical information of a sentence; the invention therefore selects a bidirectional ordered long short-term memory network (Bi-On-LSTM) as the basic structure of the coding layer; the forward computation of On-LSTM is shown in Equation 4:

f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
ĉ_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
f̃_t = cumax(W_f̃ x_t + U_f̃ h_{t-1} + b_f̃)
ĩ_t = 1 − cumax(W_ĩ x_t + U_ĩ h_{t-1} + b_ĩ)
ω_t = f̃_t ∘ ĩ_t
c_t = (f_t ∘ ω_t + f̃_t − ω_t) ∘ c_{t-1} + (i_t ∘ ω_t + ĩ_t − ω_t) ∘ ĉ_t
h_t = o_t ∘ tanh(c_t)    (4)

where the main additions of ON-LSTM are the new master forget gate f̃_t and master input gate ĩ_t, obtained with the rightward and leftward cumulative-softmax (cumsum) operations respectively; the forward On-LSTM computes the left-context state h_t^L of the word x_t at time t, the backward On-LSTM computes its right-context state h_t^R, and the output of the coding layer at time t is the concatenation h_t = [h_t^L ; h_t^R].
4. The method for extracting event element entity relationship based on coding and decoding model as claimed in claim 3, wherein said capturing word-word information in sentence at attention level and calculating corresponding weight in step 3 specifically means:
The attention layer is a core part of the model of the present invention. In short, the attention mechanism ignores unimportant features among a large number of features while strengthening the focus on useful ones. Attention mechanisms are divided into the Soft-Attention model and the Self-Attention model; FIG. 7 shows the core idea of the attention mechanism used in the present invention.
In the Soft-Attention model, the items of the input sequence S are first abstracted into data pairs of the form <Key, Value>. For a given item Query in the target sequence T, the weight coefficient of each Key in the input sequence is obtained by computing the correlation between that Key and the Query, and the final attention value (Attention Value) is obtained by weighting and summing the Values corresponding to all Keys, as shown in Equation 5:

Attention(Query, S) = Σ_{j=1}^{L_x} Similarity(Query, Key_j) · Value_j    (5)

where L_x is the length of the input sequence S.
The Self-Attention model, adopted by Google's latest machine translation models, is also called the self-attention mechanism. In the Soft-Attention model, attention mainly acts between each word of the input sequence S and the Query of the target sequence T; in the Self-Attention model, attention mainly acts among the words inside the input sequence S or the target sequence T itself. The computation of attention is similar to that of the Soft-Attention model; only the objects of the computation differ.
For natural language processing tasks, the Self-Attention model is strong at capturing semantic features between the words of a sentence. For recurrent neural networks and gated recurrent unit networks, adding the Self-Attention model lets related information between words, which would otherwise require many computation steps, be linked directly in a single step, so word dependencies can be captured well even in long sentences.
Unlike most previous natural language processing work, the model of the invention does not take the weighted sum of the encoder hidden states h_t directly as the context vector c_t of the attention layer. Instead, the hidden state h_t is ignored when computing c_t, and when predicting the final result y_t the decoding-layer output s_t and the coding-layer h_t are used together as features. This is because, at time t, the coding-layer h_t represents the semantic information of the current candidate event element and is the most effective information for predicting y_t, whereas c_t represents the influence of the other words of the sentence on the candidate event element. The context vector c_t is therefore computed as in Equation 6:

c_t = Σ_{j=1, j≠t}^{n} a_{t,j} h_j    (6)

where the hidden state of the coding layer is denoted h_j and the attention weight is denoted a_{t,j}, computed as in Equation 7:

a_{t,j} = exp(e_{t,j}) / Σ_k exp(e_{t,k})    (7)

where the attention score is denoted e_{t,j}; as e_{t,j} increases the attention weight a_{t,j} grows, the influence of the coding-layer encoding h_j on the context vector c_t increases, and so does its influence on the final element type.
For different deep-learning tasks, the attention score e_{t,j} can be computed by dot product (dot), multiplication (general) or addition (concat), where s_t^T is the transpose of s_t and W_a is the weight matrix of the attention layer. Extensive experiments show that for natural language processing tasks the general form yields better final results, so it is chosen for the event-element extraction task studied by the invention, as shown in Equation 8:

e_{t,j} = s_t^T W_a h_j    (8)

The core of the model designed by the invention is the attention mechanism, and the core of the attention mechanism is how the attention weights are allocated, which greatly affects the final result of the event-element model. When judging the role type of an event element, the attention mechanism assigns each word in the sentence a weight, and during recognition and classification the model pays more attention to words with larger weights.
5. The method for extracting event element entity relationship based on coding and decoding model as claimed in claim 4, wherein combining the features from the preceding layers at the decoding layer and further extracting abstract features in step 4 specifically means:
1) The model output o_{t-1} of the previous time step is mapped onto the interval [0, 1] by softmax, and all labels whose probability exceeds a preset threshold T are collected. Taking fig. 8 as an example, if the model behaves as expected, then for the current token "his" it should find two labels that meet the threshold requirement, U_PER and B_PER. Note that the label O itself is not being predicted at this point; rather, to effectively model the probability of each new entity beginning, every token must carry at least one O label, i.e. the model ultimately needs to predict O, U_PER and B_PER.
2) Since the model predicted three possible labels at the previous time step, at the current step we must examine which label each of them may lead to. The current x_t is therefore copied three times, and each copy computes a current hidden output through the LSTM unit. Specifically, for a possible label k of the previous time step, the hidden result of the current time step is given by Equation 9:

h_t^k = LSTM(x_t ⊕ g_k, h_{t-1})    (9)

3) Thus, for each prediction of the previous time step, a hidden representation of the corresponding current time step is obtained. The three hidden representations are then averaged, as shown in Equation 10:

h_t = (1 / |G_{t-1}|) Σ_{k ∈ G_{t-1}} h_t^k    (10)

where |G_{t-1}| is the number of labels meeting the threshold at the previous time step, equal to 3 in the present example;
4) The output of the current time step is obtained as o_t = U h_t + b, where U and b are the weight matrix and bias term of the feed-forward network (FFN).
The above describes, in a Decoder-like fashion, how all possible labels of each token are output one by one; if the description is hard to follow directly, it can be understood with the help of the model structure in fig. 9.
The loss function based on the above idea is given in Equation 11:

Loss = − Σ_{t=1}^{n} Σ_{l ∈ L} y_{t,l} log ŷ_{t,l}    (11)

where y_{t,l} is the gold indicator of label l at position t and ŷ_{t,l} is the predicted probability. This is essentially a multi-class cross-entropy loss; the output of each training pass is corrected according to it, and training can end once the model's performance reaches the required standard. Because the number of labels output by the model is not limited to a single one, while the strengths of the traditional On-LSTM in extracting sentence-level information are inherited, this is the core procedure of the composite named entity recognition decoding structure.
6. The method for extracting event element entity relationship based on coding and decoding model as claimed in claim 5, wherein obtaining the element recognition result at the output layer using the softmax function in step 5 specifically means: the extracted features are passed through a classifier to identify the event elements in the sentence and classify their roles;
step 6.1: softmax function
The output layer processes the output of the decoding layer with a softmax function to obtain the classification result; in the task studied by the invention, what most affects the classification result is the semantic information of the current word (the candidate event element) itself, and therefore, when computing the attention weights for c_t, only the influence of the other words in the sentence on the candidate event element is considered, while the semantic information h_t obtained after the candidate event element passes through the coding layer is left out of c_t, so as to preserve as much information about the word itself as possible; the output y_t of the output layer contains both c_t and h_t, as shown in Equation 12;
y t =softmax(w h h t +w c c t +b) (12)
where the weight matrices w_h and w_c are both randomly initialized, b is a bias vector, and y_t is the prediction result for the current word.
Let the number of training samples be T, let the i-th sample be (x_i, y_i), and let θ denote the model parameters; the target loss function of the model is then given by Equation 13;

J(θ) = − (1/T) Σ_{i=1}^{T} log p(y_i | x_i; θ)    (13)
During training, the target loss function is optimized with Adam, while Dropout is used to prevent overfitting.
Step 6.2: CRF prediction
y_i is the probability matrix of the labels corresponding to x_i, and the final output is the label with the largest value in y_i. However, the chosen label may violate the constraint rules, for example the label PER-B being followed by the label ORG-E. To ensure that the labeling result of the whole sequence satisfies the dependencies between labels, a transition matrix T is introduced whose element T_ij denotes the probability of transitioning from label i to label j. This layer applies the Viterbi algorithm; the result is computed jointly from the output matrix Y of the upper layer and the transition matrix T, and the prediction output of the whole sequence is:

y* = argmax_y Σ_{i=1}^{n} ( Y_{i, y_i} + T_{y_{i−1}, y_i} )
CN202111408527.4A 2021-11-26 2021-11-26 Military field composite named entity identification method based on BERT Pending CN115238690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111408527.4A CN115238690A (en) 2021-11-26 2021-11-26 Military field composite named entity identification method based on BERT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111408527.4A CN115238690A (en) 2021-11-26 2021-11-26 Military field composite named entity identification method based on BERT

Publications (1)

Publication Number Publication Date
CN115238690A true CN115238690A (en) 2022-10-25

Family

ID=83665821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111408527.4A Pending CN115238690A (en) 2021-11-26 2021-11-26 Military field composite named entity identification method based on BERT

Country Status (1)

Country Link
CN (1) CN115238690A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115860002A (en) * 2022-12-27 2023-03-28 中国人民解放军国防科技大学 Combat task generation method and system based on event extraction
CN115860002B (en) * 2022-12-27 2024-04-05 中国人民解放军国防科技大学 Combat task generation method and system based on event extraction
CN115879421A (en) * 2023-02-16 2023-03-31 之江实验室 Sentence ordering method and device for enhancing BART pre-training task
CN115879421B (en) * 2023-02-16 2024-01-09 之江实验室 Sentence ordering method and device for enhancing BART pre-training task
CN117236338A (en) * 2023-08-29 2023-12-15 北京工商大学 Named entity recognition model of dense entity text and training method thereof
CN117669574A (en) * 2024-02-01 2024-03-08 浙江大学 Artificial intelligence field entity identification method and system based on multi-semantic feature fusion
CN117669574B (en) * 2024-02-01 2024-05-17 浙江大学 Artificial intelligence field entity identification method and system based on multi-semantic feature fusion
CN117786092A (en) * 2024-02-27 2024-03-29 成都晓多科技有限公司 Commodity comment key phrase extraction method and system
CN117786092B (en) * 2024-02-27 2024-05-14 成都晓多科技有限公司 Commodity comment key phrase extraction method and system

Similar Documents

Publication Publication Date Title
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN109753566B (en) Model training method for cross-domain emotion analysis based on convolutional neural network
CN109800437B (en) Named entity recognition method based on feature fusion
CN112100351A (en) Method and equipment for constructing intelligent question-answering system through question generation data set
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN115238690A (en) Military field composite named entity identification method based on BERT
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
Shi et al. Deep adaptively-enhanced hashing with discriminative similarity guidance for unsupervised cross-modal retrieval
CN112417097B (en) Multi-modal data feature extraction and association method for public opinion analysis
CN108509521B (en) Image retrieval method for automatically generating text index
CN113377897B (en) Multi-language medical term standard standardization system and method based on deep confrontation learning
CN117076653B (en) Knowledge base question-answering method based on thinking chain and visual lifting context learning
Xiao et al. A new attention-based LSTM for image captioning
CN114218922A (en) Aspect emotion analysis method based on dual-channel graph convolution network
CN116010553A (en) Viewpoint retrieval system based on two-way coding and accurate matching signals
CN111985548A (en) Label-guided cross-modal deep hashing method
CN116662591A (en) Robust visual question-answering model training method based on contrast learning
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN115169429A (en) Lightweight aspect-level text emotion analysis method
Zhu et al. ZH-NER: Chinese named entity recognition with adversarial multi-task learning and self-attentions
Liao et al. The sg-cim entity linking method based on bert and entity name embeddings
Pingili et al. Target-based sentiment analysis using a bert embedded model
CN113392649A (en) Identification method, device, equipment and storage medium
CN111737507A (en) Single-mode image Hash retrieval method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination