CN113807079A

CN113807079A - End-to-end entity and relation combined extraction method based on sequence-to-sequence

Info

Publication number: CN113807079A
Application number: CN202010531196.2A
Authority: CN
Inventors: 何小海; 刘露平; 卿粼波; 罗晓东; 吴晓红; 任超; 吴小强
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2021-12-17
Anticipated expiration: 2040-06-11
Also published as: CN113807079B

Abstract

The invention discloses a method for jointly extracting end-to-end entities and relations based on sequences. The method adopts a sequence-to-sequence network structure to generate a triple sequence, and the network consists of an encoding network, a relation decoding network and a pointer network which are integrated with syntax dependence information. The coding network fusing the syntax dependence information is realized based on a Transformer network, and the syntax dependence tree information of the sentence is fused to obtain better coding representation during coding so as to reduce noise information. The relational decoding outputs a relational sequence based on a transform decoding network. The pointer network is composed of two networks with the same structure and is respectively used for extracting a head entity and a tail entity. The pointer network is implemented in a multi-headed attention mechanism, where the attention matrix is used as a pointer for selecting the starting position of an entity from an input sentence. The network provided by the invention realizes the parallel output of the relationship and the entity by adopting multi-head attention, thereby enhancing the dependence of the entity and the relationship on one hand and accelerating the decoding speed on the other hand.

Description

End-to-end entity and relation combined extraction method based on sequence-to-sequence

Technical Field

The invention designs an end-to-end information extraction method based on sequence-to-sequence, belonging to the technical field of natural language processing.

Background

Information extraction is a basic and important task in natural language processing, which is the basis for constructing a knowledge graph and is also an important step for realizing the conversion from unstructured data to structured data. In information extraction, entity and relationship joint extraction refers to directly extracting entity pairs and corresponding relationships from an original text to form effective triples, and the triples are widely applied to tasks such as knowledge graph construction, internet data structured extraction and structured extraction of data in the fields of judicial science, medical treatment and the like. Due to the wide application prospect in reality, the triple extraction has been widely concerned by researchers in academia and industry. In recent years, with the development of deep learning technology, the rapid development of the technology is promoted. But in real-world scenarios, due to the complexity and diversity of text expressions, the extraction of triplet information still faces some challenges. Among these challenges, the overlap relationship abstraction is a more complex challenge. In the extraction of the overlapping relationship, different relationships exist between an entity and a plurality of entities, and even different relationships exist between an entity pair. These widely existing overlapping relationships do not allow current methods to handle very efficiently.

In the traditional method, a triple extraction task is finished as two independent tasks, namely named entity identification and relationship classification. In these methods, all entities in a sentence are first identified by a named entity identification method, and then their relationships are predicted for different pairs of entities. The method is simple to implement, but has the defects of error propagation and the like, namely, errors generated in the named entity identification stage can influence the subsequent relation classification. Furthermore, such methods fail to model the relevance of the two tasks. To solve this problem, the Joint learning method was proposed by researchers, which models the correlation between two networks by a way of sharing the network, thus achieving a better effect in this task, which has been proven in many existing studies (Zheng S, Hao Y, Lu D, Bao H, Xu J, Hao H, et al. Joint entry and Relation expression based on a hybrid network. neural expression.2017; 257: 59-66; Katiyer A, card C, edition. Gout a symbol: Joint Extraction of Entity and Relations with Dependency Trees2017: Association for use in computers; Miwa M, ball M, object Entity and Relation of Relations between Trees and Relations between models about the Relations of Relations between models and Relations of Relations between models 2016).

Although the progress of joint extraction is promoted to a certain extent by joint learning, the problem of extraction of overlapping relationships cannot be solved well by the joint extraction method. The joint learning uses a classification method to classify the relationship between two entity pairs, so if all triples need to be extracted, all entity pairs need to be enumerated and classified, which brings a large calculation cost. Furthermore, many entity pairs do not contain any relationship, so in the training phase, there will be a large number of entity pairs assigned 'None' labels, which will make it difficult for the neural network model to learn the true relationship. Furthermore, classification models cannot efficiently handle scenarios where one entity pair contains two relationships.

In order to solve the problems, the invention provides a combined information extraction method based on sequence-to-sequence. The method models triple generation as a sequence generation task, so that an entity or relationship can be generated multiple times to meet the need for overlapping triple generation. In conventional sequence generation tasks, it is based on word-level generation, i.e. only one word is generated at a time. In the joint extraction of triples, it is necessary to extract a sequence of triples, and therefore the extraction needs to be modeled at the sequence level, i.e. one triplet is generated at a time. Furthermore, in the joint extraction of triples, entity pairs come from input sentences, not from vocabularies. In order to meet the two requirements, the invention constructs a joint information extraction method based on the combination of a self-attention mechanism and a pointer network.

Disclosure of Invention

The invention provides a joint information extraction method based on the combination of a self-attention mechanism network and a pointer network, aiming at the problem of overlapping relation in triple extraction. The network is an end-to-end network structure, which is composed of an encoding network, a relation decoding network and a pointer network. Wherein the coding network and the decoding network are mainly implemented based on a Transformer network structure. The Transformer network is a network structure based on the autofocusing mechanism proposed by google, and its specific attentive mechanism enables the network to compute in parallel, thereby increasing training and reasoning speed (a. vaswani, n. shazer, n.parmar, j.uszkoreot, l.jones, a.n.gomez, l.u.kaiser, i.polosukhin, Attention is all you needed, in: i.guyon, u.v.luxburg, s.bengio, h.wallach, r.ferus, s.viswa, r.garnetet (Eds)), advance in Neural Processing Information, Curran Associates, system 2017, pp.5998{6008 }.6008 }. In the self-attention mechanism of the Transformer network, attention is paid to all words in a sentence when attention calculation is performed. In the task of extracting the triples, one triplet can be generally determined only by a part of a sentence, and certain dependency relationship exists among the words. Thus, when performing attention calculations, noise may be introduced if attention is calculated for all words of a sentence. In order to solve the problem, in the method, the dependency tree relationship of the sentences is taken as the prior knowledge of display to be merged into the self-attention network, so that the model can give more attention to some important related words, and the performance of the model is improved.

The invention realizes the purpose through the following technical scheme:

1. the invention discloses a sequence-to-sequence-based information extraction model which is shown in figure 1 and comprises an encoding network, a relation decoding network and a pointer network. The entity and relationship extraction of the invention comprises two processes of training and testing, and is carried out according to the following method:

(1) for a sentence to be trained, firstly, the sentence is encoded by using a Transformer encoder, and hidden layer vector output of a model is obtained and is used for representing a semantic vector of the sentence.

(2) And sending the encoded semantic vector and the right-shifted target sequence into a relational decoding network, outputting a decoding vector, and inputting the decoding vector into a full connection layer and a Softmax classification network to obtain relational category output.

(3) And sending the hidden layer vector output by the coding network and the vector output by the decoding layer into a pointer network, and decoding and outputting a corresponding entity pair.

(4) Calculating loss of the output relation class probability of the step 2 and the step 3, the probability of the entity to the initial position and the real sequence label, sending the loss to an optimizer to optimize network parameters, and storing the model after the model is converged.

(5) During testing, the stored model is loaded, the newly input sentences are subjected to relationship and corresponding entity recognition by using a network, and then corresponding triple information is output.

Specifically, in step (1), the sentence is encoded into two steps, first, the sentence is encoded by using a transform encoder, then the encoded result is sent to an encoding network based on the syntax tree guidance, and finally the outputs of the two networks are weighted and summed to obtain the output of the encoder, which is shown in fig. 2. The whole coding network consists of 4 layers, namely a word embedding layer, a multi-head attention layer, a forward feedback layer and a grammar-guided self-attention layer, and the layers are described in detail below.

1) And a word embedding layer, wherein each sentence is represented as an N multiplied by D word vector, wherein N is the number of words in the sentence, and D is the dimension of the corresponding word vector. In this invention, the word vector is composed of three parts: word-level based word-embedding, character-level based word-embedding, and position-embedding vector information. The word embedding is obtained by adopting a random vector initialization mode, and the dimensionality of the word embedding is 300. The character-level vector is obtained by encoding each character vector in the word through a one-dimensional Convolutional Neural Network (CNN), wherein the size of a filter window is 3 and the number of Convolutional kernels is 212 in the Convolutional Neural network. In character vector encoding, the maximum number of characters of a word is 10. The calculation process is as follows:

V_w＝Conv1D(V_char)

wherein V_charIs a word-embedded representation of the word characters. In the case of the character-level word-embedding, the resulting word-embedding size is 212 dimensions. Finally, the word-level embedded word representations and the character-level embedded word representations are combined to form 512-dimensional embedded word representations. The vector is further added with position vector information to obtain the coding vector representation of the sentence, and the position vector information adopts a transform position coding mode.

2) In the invention, a total of 8 attention heads are adopted, and in each attention head, an encoded representation of a word is obtained through the attention mechanism firstly, and the calculation process is as follows:

and after obtaining the attention mechanism of a single head, splicing and fusing all multi-head attention, and then obtaining the coding vector representation of the sentence through a full connection layer. The process is represented as follows:

m＝MultiHead(Q，K，V)＝concat(head₁，...，head_h)W^o

in order to prevent gradient disappearance, the output of the multi-head attention network further passes through a residual connection and normalization layer to obtain the output of the coding network.

h_m＝LayerNorm(m+x)

Where x is the encoded vector of the word-embedding and m is the output of the multi-head attention layer.

3) And the forward feedback network layer is used for further fusing the vectors passing through the multi-head attention mechanism layer. The network consists of two full connections and a ReLU activation unit, and the calculation process is as follows:

h＝FFN(m)＝max(0，mW₁+b₁)W₂+b₂

the output of the forward feedback vector also passes through a residual connecting and normalizing layer to obtain the output of the coding layer.

h_e＝LayerNorm(m+h_m)

4) The main role of the syntactic tree-guided attention layer is to blend the syntactic dependency information of sentences into the network, so that the sentences can be expressed in a better semantic mode. In specific operation, a matrix based on the syntactic dependency tree is firstly constructed, and then the matrix is fused into a network when attention mechanism calculation is carried out, so that sentences can be better represented semantically. When the syntactic dependency tree matrix is constructed, the sentence is analyzed by the syntactic dependency tree to extract the dependency relationship, and then the syntactic dependency tree matrix is constructed according to the following rules.

a) First a matrix of size N x N is constructed, where N is the number of sentence words, and then the matrix is all initialized to 0.

b) The matrix is assigned according to the syntactic dependency information of the sentence, and the assignment rule is as follows, aiming at a certain node M in the matrix_ij，

If node j is an "ancestor" of node i in the syntactic dependency tree, then the position is assigned a value of 1, otherwise the position is assigned a value of 0,

the process is represented as follows.

An example of building a syntactic dependency tree is shown in FIG. 2. After the syntax dependency tree matrix is obtained, the output of the coding network is combined with the syntax dependency tree matrix to perform further multi-head attention mechanism calculation, and the calculation process is as follows:

h′_i＝A′_iV′

after the dependency tree matrix is merged and the attention mechanism representation is carried out, the sentence only carries out self-attention calculation on the words with the dependency relationship, and therefore irrelevant noise information in the sentence can be removed. After the output is obtained through calculation based on the syntactic dependency network, the output is weighted and summed with the output of the original transform coding network to obtain a complete sentence semantic representation, and the formula is represented as follows:

wherein

Is a hyperparameter, and in the present invention, has a value of 0.5.

In step (2), the decoder is used to encode the encoded vector, which also includes three layers, namely a multi-head attention layer based on Mask operation, an encoding-decoding multi-head attention layer, and a forward feedback layer.

a) And a multi-head attention layer based on Mask operation, wherein the multi-head attention layer is used for coding sequence information output before the current decoding moment. In performing the attention mechanism calculation, the future information of the output sequence is prevented from being seen by multiplying by a Mask matrix M', and the process is expressed as follows. Where M' is N × N, where N is the number of words in the input sentence, and the structure is shown in fig. 2. The calculation process for this layer is represented as follows:

b) encoding-decoding a multi-head attention layer for mutual attention between the output with mask layer and the output vector of the encoded net, the calculation process is as follows:

h_{e_d}＝LayerNorm(h_m+MultiHead(h_m，H，H)

where H is the output of the coding network and H_mIs based on the output of the mask attention network.

c) And the forward feedback network layer is used for performing characteristic fusion on the output of the second layer so as to obtain better characteristic representation. The network computation process is as follows.

h′_{e_d}＝FFN(h_{e_d})

The forward feedback network layer is connected with a residual connecting and normalizing network layer, and the calculation process is as follows:

H_{e_d}＝LayerNorm(h_m+MultiHead(h_m，h′_{e_d}，h′_{e_d})

the output of the coding network finally passes through a full connection layer and then passes through a softmax classifier to obtain the probability of the relation class, and the process is represented as follows:

P_r＝softmax(H_{e_d}W_o+b_o)

in step (3), two identical decoders are designed, each of which is based on the structure of the multi-headed attention mechanism. The input of the decoding network is divided into the output of the coding network and the output of the relation decoding network, and the attention matrix is obtained after the two are calculated through the multi-head attention mechanism.

After the attention mechanism, the attention matrix is directly used as a pointer to select the boundary of the corresponding entity from the sentence. In the present invention, an entity contains a start position and an end position, because we split the multi-head attention sentence into two parts, in which the sum of the attention matrixes of the heads of the former part is used to indicate the start position of the entity, and the sum of the attention matrixes of the heads of the latter part is used to indicate the end position of the entity, which is expressed as follows.

In step (4), the training objective function is designed as follows:

where B is the number of each Batch, which is 64 in the present invention, and T is the maximum number of triples in a sentence, which is set to 10 in the present invention. And r_tSoftmax score information expressed as true category, e1_sProbability score of softmax expressed as the start position of the real subject (subject), e1_eProbability score of softmax expressed as the end position of the true subject (subject), e2_sProbability score of softmax expressed as start position of real object (object), e2_eThe softmax probability score is expressed as the start position of the real object (object). And optimizing the network by using an Adam optimizer during model training, wherein the learning rate is 1 e-5. Meanwhile, in order to prevent the overfitting of the model, an "early stop" mechanism is also adopted, namely if in 10 consecutive epochs, if the F1 value of the network is not promoted any more, the training of the network is stopped.

Drawings

FIG. 1 is a main framework of the network model proposed by the present invention

FIG. 2 is a network structure of a coding layer

FIG. 3 is an example of syntax tree dependency matrix generation

FIG. 4 is a schematic diagram of the shape of a mask matrix

Detailed Description

The invention will be further described with reference to the accompanying drawings in which:

fig. 1 is a structure of an entire network, which is composed of a syntax guidance-based encoding network, a relationship decoding network, and a pointer network. In the coding network, an input sentence firstly passes through word-embedding with word strength and word-embedding with character strength, then the word-embedding with the character strength and the word-embedding with the character strength are combined to obtain a word vector with the dimension of 512 dimensions, and then the word vector is further added with a position coding vector to obtain the input of the network. The input vector first passes through a Transformer encoder consisting of 4 stacked layers, where each layer consists of a multi-headed attention subnetwork and a forward feedback subnetwork, each with one residual connection and a normalized connection. After Transformer coding output, the coding vector is further input into network coding which integrates syntax dependence information to obtain syntax dependence enhanced vector coding representation, and then the original output of the Transformer coding network and the output of the syntax guide network are weighted and summed to obtain the output of the network. The output will be further sent to a decoding network for decoding. The decoding network is also made up of 4 stacked layers, where each layer contains three sublayers: a multi-head attention mechanism sublayer with mask operation, an encoding-decoding multi-head attention mechanism sublayer and a forward feedback network sublayer. Firstly, the target vector after shifting is coded through a multi-head attention layer with a mask, then the target vector and the coding network output are subjected to coding-decoding multi-head attention mechanism operation to obtain output vector information, and the output vector information is further subjected to a forward feedback network to obtain decoding output. And finally, decoding and outputting the relation class probability information after passing through a full connection layer and a softmax classifier. And finally, the output of the coding network and the output of the decoding network are sent to the pointer decoding network together to obtain the boundary information of the entity, and finally, corresponding triple information is extracted according to the outputs of the relation decoding network and the pointer decoding network.

Fig. 2 shows two steps of a coding network, where an input sentence vector is first output through a transform encoder, then further coded through a coding network guided by syntax to obtain a coded output with enhanced syntax, and finally the two are weighted and summed to obtain a final coded output.

FIG. 3 is a schematic diagram of the process of generating a syntactic dependency tree matrix, first constructing an N matrix, where N is the number of words in a sentence and all values are set to 0, and then setting the dependent positions to 1 according to the syntactic dependency tree, with the specific rule that for position (i, j), if the word j is an "ancestor" of the word i in the syntactic dependency tree, then the position is set to 1, otherwise the position is set to 0.

Fig. 4 is a schematic diagram of a mask matrix, which has a size of M × M, where M is the length of the target sequence, and in this matrix, a diagonal line, i.e., a part below the diagonal line, is 1, and the other parts have values of 0.

Table 1 shows the experimental results of the invention on NYT24, NYT29 and WebNLG data sets, and experiments show that the comprehensive evaluation index F of the proposed model is compared with the best existing model₁The values gave the best results.

TABLE 1 Experimental comparison of the network model of the present invention on NYT24, NYT29 and WebNLG datasets with other existing models

Table 2 is some examples of experimental procedures in which the inventive method can also output overlapping triplet sequence information for some complex scenarios.

Table 2 some practical results of the invention on validation of data sets

The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the technical solutions of the present invention, so long as the technical solutions can be realized on the basis of the above embodiments without creative efforts, which should be considered to fall within the protection scope of the patent of the present invention.

Claims

1. An end-to-end entity and relation joint extraction method based on sequence-to-sequence is characterized by comprising the following steps:

(1) for a sentence to be trained, firstly, coding the sentence by using a coder with grammar dependence enhancement to obtain hidden layer vector output of a model for representing a semantic vector of the sentence;

(2) sending the encoded semantic vector and the right-shifted target sequence into a relational decoding network, outputting a decoding vector, and inputting the decoding vector into a full connection layer and a Softmax classification network to obtain relational category output;

(3) sending the hidden layer vector output by the coding network and the vector output by the decoding layer into a pointer network, and decoding and outputting a corresponding entity pair;

(4) calculating loss of the relation category probability and the probability of the entity to the initial position output in the step 2 and the step 3 and the real sequence label, sending the loss into an optimizer to optimize network parameters, and storing the model after the model is converged;

(5) and during application, loading the stored model, then carrying out relationship and corresponding entity identification on the newly input sentence, and outputting corresponding triple information.

2. The entity and relationship joint extraction method of claim 1, wherein: each sentence is represented as an N × D word vector, where N is the number of words in the sentence and D is the dimension of the corresponding word vector; in this invention, the word vector is composed of three parts: word embedding based on word level, word embedding based on character level and position vector information, wherein the word embedding is obtained by adopting a random vector initialization mode, and the dimensionality of the word embedding is 300; the vector of the character level is obtained by encoding each character vector in the word through a one-dimensional Convolutional Neural Network (CNN), wherein in the one-dimensional Convolutional Neural network, the size of a filtering window is 3, and the number of Convolutional kernels is 212; in character vector encoding, the maximum number of characters of a word is 10, and the calculation process is as follows:

V_w＝Conv1D(V_char)

wherein V_charIs the word-embedded expression of the characters in the words, and the size of the obtained word-embedded expression is 212 dimensions in the word-level word-embedded expression; finally, word and word embedding of characters are connected to form 512-dimensional vector representation, the vector representation is further added with a position coding vector to obtain final vector representation, and the position vector coding uses a position coding method of an original Transformer coder.

3. The entity and relationship joint extraction method of claim 1, wherein: after a sentence is coded by a transform, the sentence is input into a coding network of a merged syntax dependency tree to obtain a representation of the merged syntax dependency, wherein a syntax dependency tree matrix construction process is as follows:

a) firstly, constructing a matrix with the size of N multiplied by N, wherein N is the number of words in a sentence, and then initializing all the matrixes to be 0;

b) the matrix is assigned according to the syntactic dependency information of the sentence, and the assignment rule is as follows, aiming at a certain node M in the matrix_ijIf node j is an "ancestor" of node i on the syntactic dependency tree, then the position is assigned a value of 1, otherwise the position is assigned a value of 0, and the process is represented as follows:

after obtaining the syntactic dependency tree matrix, the output of the coding network is further calculated by a multi-head attention mechanism, and the calculation process is as follows:

h′_i＝A′_iV'

after the dependency tree matrix is merged into the output of the forward feedback network, when the attention mechanism calculation is carried out, the sentence only carries out self-attention calculation on words with dependency relationship with the dependency tree matrix, so that irrelevant noise information in the sentence can be removed; after the output is obtained through calculation based on a syntactic dependency network, the output is weighted and summed with the output of an original Transformer coding network to obtain a complete sentence semantic representation, and the process is represented as follows:

wherein

Is a hyperparameter, and in the present invention, has a value of 0.5.

4. The entity and relationship joint extraction method of claim 3, wherein: two identical pointer decoding networks are designed, wherein each pointer decoding network is realized based on a multi-head attention mechanism, the input of the decoding networks is the output of the coding network and the output of the relation decoding network, and the attention moment array calculation process comprises the following steps:

after the attention mechanism, the attention moment array is directly used as a pointer to select a corresponding entity boundary from the sentence; in the present invention, an entity contains two bit positions, so we split the multi-head attention average into two parts, where the sum of the attention matrixes of the head of the former part is used to indicate the start position of the entity, and the sum of the attention matrixes of the head of the latter part is used to indicate the end position of the entity, and the process is shown as follows.

。

5. In training the joint information extraction network, the objective function is defined as follows:

where B is the number of each Batch, which in the present invention is 6; t is the maximum value of the number of the triples in a sentence, and the value is set to be 10 in the invention; and r_tProbability score of softmax expressed as a relation category, e1_sProbability score of softmax expressed as subject's (subject) start position, e1_eProbability score of softmax expressed as subject (subject) end position, e2_sProbability score of softmax expressed as object start position, e2_eA softmax probability score expressed as an object (object) end position; and optimizing the network by using an Adam optimizer during model training, wherein the learning rate is 1 e-5.