CN112860904B - External knowledge-integrated biomedical relation extraction method - Google Patents

External knowledge-integrated biomedical relation extraction method

Info

Publication number
CN112860904B
CN112860904B (application CN202110367973.9A)
Authority
CN
China
Prior art keywords
sentence
vector
entity
head
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110367973.9A
Other languages
Chinese (zh)
Other versions
CN112860904A (en)
Inventor
王春宇
张浩
梁天铭
刘晓燕
刘国军
郭茂祖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202110367973.9A
Publication of CN112860904A
Application granted
Publication of CN112860904B
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
        • G06 — COMPUTING; CALCULATING OR COUNTING
            • G06F — ELECTRIC DIGITAL DATA PROCESSING
                • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
                    • G06F16/30 — Information retrieval of unstructured textual data
                        • G06F16/35 — Clustering; Classification
                        • G06F16/36 — Creation of semantic tools, e.g. ontology or thesauri
                            • G06F16/367 — Ontology
                • G06F40/00 — Handling natural language data
                    • G06F40/20 — Natural language analysis
                        • G06F40/205 — Parsing
                    • G06F40/30 — Semantic analysis

Abstract

A biomedical relation extraction method integrating external knowledge relates to the technical field of natural language processing and addresses the problem of noise interference in training data generated by the remote supervision technique.

Description

External knowledge-integrated biomedical relation extraction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a biomedical relation extraction method integrated with external knowledge.
Background
With the rapid development of society, we have entered an era of information explosion. In the biomedical field, thousands of papers are published every day, and they contain a massive number of biomedical entity relations. Physicians and domain experts therefore have an increasingly urgent need to screen and summarize useful information, and extracting such information from massive data has become a difficult problem.
Relation extraction currently relies mostly on supervised learning, which requires a large amount of manually labeled training data and is therefore time-consuming and labor-intensive. Automatically and efficiently extracting the biomedical entity relations hidden in the literature can thus save considerable manpower and resources.
With the development of deep learning, researchers have begun to use neural network models to automatically extract entity relations from biomedical literature. The most common approach is to automatically generate large amounts of training data by remote supervision; however, a serious problem of remote supervision is that the generated training data contain a large amount of noise, which is especially pronounced in biomedical data. Neural network models currently lack an effective way to handle the noise in remotely supervised biomedical data, so using neural network methods to handle this noise is a significant research direction.
Disclosure of Invention
The purpose of the invention is to provide a biomedical relation extraction method that integrates external knowledge, aiming at the problem of noise interference in the training data generated by remote supervision.
The technical solution adopted by the invention to solve this technical problem is as follows:
a biomedical relation extraction method integrated with external knowledge comprises the following steps:
Step one, performing word embedding and position embedding on each word of each sentence in the biomedical data set to obtain a word vector and a position vector, concatenating the word vector and the position vector to obtain the vector representation of each word, and finally concatenating the vector representations of all words in a sentence to obtain the matrix representation of each sentence;
Step two, inputting the matrix representation of each sentence obtained in step one into a PCNN neural network to obtain the vector representation of each sentence in the biomedical data set;
Step three, obtaining the head entity and the tail entity of each sentence in the biomedical data set, extracting from an external knowledge graph the entities related to the head entity and the entities related to the tail entity to obtain a relation graph centered on the head entity and a relation graph centered on the tail entity, and inputting the two relation graphs into a graph encoder to construct comprehensive vector representations of the external knowledge graph for the head and tail entities;
Step four, combining the vector representation of each sentence in the biomedical data set with the comprehensive vectors of the external knowledge graph of the head and tail entities to obtain a sentence vector containing external information;
Step five, for each entity pair, selecting all sentences containing the entity pair to form a set, calculating the attention weight of each sentence in the set with a sentence-level attention mechanism based on the sentence vectors containing external information, taking the attention-weighted sum of all sentence representations containing external information in the set as the vector representation of the set, and making a prediction from the vector representation of the set to obtain the predicted relation of the entity pair.
Further, the graph encoder in step three adopts KG-Transformer.
Further, the KG-Transformer encoding process is as follows:
The KG-Transformer takes the vector representations X = {x_1, x_2, ..., x_N} of the node sequences of the two input relation graphs and feeds them into the Multi-head Attention Layer and the Add & Norm Layer.
The Multi-head Attention Layer is computed as follows:
e_ij^h = Masking((x_i W_h^Q)(x_j W_h^K)^T / sqrt(d), A)
α_ij^h = exp(e_ij^h) / Σ_{k=1}^{N} exp(e_ik^h)
x'_i = ||_{h=1}^{H} Σ_{j=1}^{N} α_ij^h (x_j W_h^V)
where ||_{h=1}^{H} denotes the concatenation of the H attention heads of this layer, x'_i is the output representation of node i, A is the adjacency matrix, i indexes the i-th row and j the j-th column, d is the dimension of the node embedding, W_h^Q, W_h^K and W_h^V are weight matrices, N is the length of the node sequence, and Masking(X, A) masks the values at the corresponding positions of matrix X according to the positions where the value in matrix A is 1;
The Add & Norm Layer is computed as follows:
O = LayerNorm(X + X')
where X = {x_1, x_2, ..., x_N} is the vector representation of the node sequence, X' = {x'_1, x'_2, ..., x'_N} is the output of the Multi-head Attention Layer, LayerNorm(·) is a layer normalization function, and the output O serves as the input to the next Multi-head Attention Layer;
this computation is repeated L times, L being a positive integer, to obtain the vector representations of all nodes; finally, the vector representations of all nodes of the head-entity relation graph and of the tail-entity relation graph are summed separately to obtain the comprehensive vector representations of the head and tail entities.
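As a concrete illustration of the layer described above, the following is a minimal PyTorch sketch of one masked multi-head attention step followed by the Add & Norm step. It assumes the reconstructed formulation given here (scores masked by the adjacency matrix A and scaled by sqrt(d)); the function name and tensor shapes are illustrative, not taken from the patent.

```python
import torch
import torch.nn.functional as F

def kg_transformer_layer(X, A, Wq, Wk, Wv, n_heads):
    """One masked multi-head attention layer over a node sequence, plus Add & Norm.

    X          : (N, d) node embeddings
    A          : (N, N) adjacency matrix; 1 where attention is allowed
                 (rows are assumed to contain self-loops so no row is all zero)
    Wq, Wk, Wv : (n_heads, d, d // n_heads) per-head projection weights
    """
    N, d = X.shape
    heads = []
    for h in range(n_heads):
        Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]            # (N, d // n_heads) each
        scores = (Q @ K.T) / d ** 0.5                        # (N, N) attention scores
        scores = scores.masked_fill(A == 0, float("-inf"))   # Masking(., A)
        alpha = F.softmax(scores, dim=-1)                    # row-wise normalization
        heads.append(alpha @ V)                              # weighted sum of values
    X_prime = torch.cat(heads, dim=-1)                       # concat the H heads -> (N, d)
    return F.layer_norm(X + X_prime, normalized_shape=(d,))  # O = LayerNorm(X + X')
```

Stacking L such layers and then summing the node vectors of the head-entity and tail-entity graphs would give the comprehensive head and tail representations described above.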
Further, L is 8, 12, 16 or 24.
Further, in the second step, the PCNN neural network obtains a vector representation of each sentence in the biomedical data set through convolution, pooling and nonlinear operations.
Further, the PCNN neural network in the second step specifically executes the following steps:
first, convolution kernels with a sliding-window size of 3 extract local features from the matrix representation of the sentence, and a max-pooling operation combines the local features into a vector for the sentence;
the feature map produced by each convolution kernel is then divided into three segments according to the positions of the head entity and the tail entity, and max pooling is applied to each segment, yielding a three-dimensional vector;
finally, the three-dimensional vectors corresponding to all convolution kernels are concatenated and passed through an activation function to obtain the final vector representation of the sentence.
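The following is a minimal PyTorch sketch of the piecewise convolution and pooling just described, assuming a 1-D convolution with window size 3 over the sentence matrix; the function name, the `padding=1` choice and the guard for empty segments are illustrative assumptions rather than details from the patent.

```python
import torch
import torch.nn as nn

def pcnn_sentence_vector(S, head_pos, tail_pos, conv):
    """Piecewise CNN encoding of one sentence.

    S        : (seq_len, emb_dim) matrix representation of the sentence
    head_pos : token index of the head entity
    tail_pos : token index of the tail entity
    conv     : nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
    """
    C = conv(S.T.unsqueeze(0)).squeeze(0)        # (n_filters, seq_len) feature maps
    a, b = sorted((head_pos, tail_pos))
    segments = [C[:, :a + 1], C[:, a + 1:b + 1], C[:, b + 1:]]
    # max-pool each non-empty segment -> up to 3 values per filter
    pooled = [seg.max(dim=1).values for seg in segments if seg.size(1) > 0]
    x = torch.cat(pooled)                        # concatenate over filters and segments
    return torch.tanh(x)                         # nonlinearity -> final sentence vector
```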
Further, the activation function is a tanh function.
Further, the vector of the node sequence in the relational graph is represented as:
(e, r_1, e_1, r_2, e_2, ..., r_n, e_n)
where entity e is associated with entities e_1, e_2, ..., e_n, and the relations between entity e and entities e_1, e_2, ..., e_n are r_1, r_2, ..., r_n, respectively.
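For illustration, here is a small sketch of how the node sequence (e, r_1, e_1, ..., r_n, e_n) and a matching adjacency matrix for the graph encoder might be built from triples retrieved from the external knowledge graph; the triple format, the self-loops and the helper name are assumptions, not specified by the patent.

```python
import torch

def build_node_sequence(entity, triples):
    """Build the node sequence (e, r1, e1, ..., rn, en) and its adjacency matrix.

    entity  : the head or tail entity of the sentence
    triples : iterable of (head, relation, tail) facts from the external KG
              whose head matches `entity`
    """
    seq = [entity]
    for _, rel, tail in triples:
        seq.extend([rel, tail])
    N = len(seq)
    A = torch.eye(N)                       # self-loops (assumed)
    for i in range(1, N, 2):               # positions of r_1, r_2, ...
        A[0, i] = A[i, 0] = 1.0            # e   -- r_k
        A[i, i + 1] = A[i + 1, i] = 1.0    # r_k -- e_k
    return seq, A

# e.g. build_node_sequence("aspirin",
#         [("aspirin", "treats", "fever"),
#          ("aspirin", "interacts_with", "warfarin")])
# -> (["aspirin", "treats", "fever", "interacts_with", "warfarin"], 5x5 adjacency)
```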
Further, in step four, the vector representation of each sentence in the biomedical data set and the comprehensive vector of the external knowledge graph of the head and tail entities are combined as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax((Q W_i^Q)(K W_i^K)^T / sqrt(d_k)) (V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
In the above formulas, Q is the matrix representation of each sentence, K and V are the vector representation of each sentence, and K and V are equal; W_i^Q, W_i^K, W_i^V and W^O are weight matrices within the neural network; head_i denotes one attention head of the computation, different heads corresponding to different expressions of the biomedical sentence; and Concat(head_1, ..., head_h) is the vector obtained by concatenating the different heads.
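A minimal sketch of this fusion step in PyTorch, using the library's standard multi-head attention with the sentence-side representation as the query and the head/tail knowledge vectors as keys and values (K = V). In practice the attention module would be a trained component of the model rather than constructed on the fly; the function name, shapes and head count are illustrative.

```python
import torch
import torch.nn as nn

def fuse_with_knowledge(sent_repr, kg_repr, n_heads=4):
    """Knowledge-Attention fusion of sentence and KG representations.

    sent_repr : (L_s, d) sentence-side representation (query)
    kg_repr   : (L_k, d) stacked KG vectors of the head and tail entities (key = value)
    d must be divisible by n_heads.
    """
    d = sent_repr.size(-1)
    attn = nn.MultiheadAttention(embed_dim=d, num_heads=n_heads, batch_first=True)
    q = sent_repr.unsqueeze(0)         # (1, L_s, d)
    kv = kg_repr.unsqueeze(0)          # (1, L_k, d)
    fused, _ = attn(q, kv, kv)         # MultiHead(Q, K, V) with K = V
    return fused.squeeze(0)            # (L_s, d) knowledge-enriched sentence vectors
```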
Further, the specific steps of step five are as follows:
the attention weight of each sentence in the sentence set is calculated using a sentence-level attention mechanism, and the attention-weighted sum of all sentence representations is taken as the vector representation of the sentence set, i.e.
s = Σ_i α_i x_i
α_i = exp(e_i) / Σ_k exp(e_k)
e_i = x_i A r
where α_i is the weight of the sentence vector x_i, x_i is the vector representation of the i-th sentence, A is a diagonal weight matrix, r is the vector representation of the relation r, and s is the vector representation of the sentence set;
finally, from the vector representation s of the sentence set, a softmax classifier computes the probability that the sentence set expresses the relation r:
P(r | S; θ) = softmax(W s + b)
where W is a weight matrix, s is the vector representation of the sentence set, b is a bias term, θ denotes the model parameters, and S is the sentence set.
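The selective attention and final classifier above can be sketched directly from these formulas; the following PyTorch function is such a sketch, with A stored as the diagonal of the weight matrix and all names illustrative.

```python
import torch
import torch.nn.functional as F

def bag_prediction(X, A_diag, r, W, b):
    """Sentence-level attention over a bag of sentences, then softmax classification.

    X      : (n, d)      sentence vectors containing external information
    A_diag : (d,)        diagonal of the weight matrix A
    r      : (d,)        query vector of the candidate relation
    W, b   : (n_rel, d), (n_rel,) softmax classifier parameters
    """
    e = (X * A_diag) @ r                 # e_i = x_i A r
    alpha = F.softmax(e, dim=0)          # alpha_i = exp(e_i) / sum_k exp(e_k)
    s = alpha @ X                        # s = sum_i alpha_i x_i
    return F.softmax(W @ s + b, dim=0)   # P(r | S; theta) = softmax(W s + b)
```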
The invention has the beneficial effects that:
The biomedical entity relation extraction method of the invention makes fuller use of the rich semantic and structural information within and between sentences, together with information about the biomedical entities in an external knowledge base; it reduces the noise interference present in the data set, makes the model more stable, and yields more accurate relation predictions.
Drawings
FIG. 1 is an overall flow chart of the present application;
FIG. 2 is a schematic diagram of the model for relation extraction on a single sentence bag according to the present application.
Detailed Description
It should be noted that, in the present invention, the embodiments disclosed in the present application may be combined with each other without conflict.
Embodiment one: this embodiment is described with reference to FIG. 1. The biomedical relation extraction method incorporating external knowledge according to this embodiment comprises the following steps:
Step one, performing word embedding and position embedding on each word of each sentence in the biomedical data set to obtain a word vector and a position vector, concatenating the word vector and the position vector to obtain the vector representation of each word, and finally concatenating the vector representations of all words in a sentence to obtain the matrix representation of each sentence;
Step two, inputting the matrix representation of each sentence obtained in step one into a PCNN neural network to obtain the vector representation of each sentence in the biomedical data set;
Step three, obtaining the head entity and the tail entity of each sentence in the biomedical data set, extracting from an external knowledge graph the entities related to the head entity and the entities related to the tail entity to obtain a relation graph centered on the head entity and a relation graph centered on the tail entity, and inputting the two relation graphs into a graph encoder to construct comprehensive vector representations of the external knowledge graph for the head and tail entities;
Step four, combining the vector representation of each sentence in the biomedical data set with the comprehensive vectors of the external knowledge graph of the head and tail entities to obtain a sentence vector containing external information;
Step five, for each entity pair, selecting all sentences containing the entity pair to form a set, calculating the attention weight of each sentence in the set with a sentence-level attention mechanism based on the sentence vectors containing external information, taking the attention-weighted sum of all sentence representations containing external information in the set as the vector representation of the set, and making a prediction from the vector representation of the set to obtain the predicted relation of the entity pair.
Embodiment two: this embodiment further describes embodiment one; the difference is that the graph encoder in step three is a KG-Transformer.
Embodiment three: this embodiment further describes embodiment two; the difference is that the KG-Transformer encoding process is as follows:
The KG-Transformer takes the vector representations X = {x_1, x_2, ..., x_N} of the node sequences of the two input relation graphs and feeds them into the Multi-head Attention Layer and the Add & Norm Layer.
The Multi-head Attention Layer is computed as follows:
e_ij^h = Masking((x_i W_h^Q)(x_j W_h^K)^T / sqrt(d), A)
α_ij^h = exp(e_ij^h) / Σ_{k=1}^{N} exp(e_ik^h)
x'_i = ||_{h=1}^{H} Σ_{j=1}^{N} α_ij^h (x_j W_h^V)
where ||_{h=1}^{H} denotes the concatenation of the H attention heads of this layer, x'_i is the output representation of node i, A is the adjacency matrix, i indexes the i-th row and j the j-th column, d is the dimension of the node embedding, W_h^Q, W_h^K and W_h^V are weight matrices, N is the length of the node sequence, and Masking(X, A) masks the values at the corresponding positions of matrix X according to the positions where the value in matrix A is 1.
The Add & Norm Layer is computed as follows:
O = LayerNorm(X + X')
where X = {x_1, x_2, ..., x_N} is the vector representation of the node sequence, X' = {x'_1, x'_2, ..., x'_N} is the output of the Multi-head Attention Layer, LayerNorm(·) is a layer normalization function, and the output O serves as the input to the next Multi-head Attention Layer.
This computation is repeated L times, where L is a hyperparameter (a positive integer, generally 8, 12, 16 or 24), to obtain the vector representations of all nodes; finally, the vector representations of all nodes of the head-entity relation graph and of the tail-entity relation graph are summed separately to obtain the comprehensive vector representations of the head and tail entities.
Embodiment four: this embodiment further describes embodiment one; the difference is that L is 8, 12, 16 or 24.
Embodiment five: this embodiment further describes embodiment one; the difference is that the PCNN in step two obtains the vector representation of each sentence in the biomedical data set through convolution, pooling and nonlinear operations.
Embodiment six: this embodiment further describes embodiment one; the difference is that the PCNN in step two specifically performs the following steps:
first, convolution kernels with a sliding-window size of 3 extract local features from the matrix representation of the sentence, and a max-pooling operation combines the local features into a vector for the sentence;
the feature map produced by each convolution kernel is then divided into three segments according to the positions of the head entity and the tail entity, and max pooling is applied to each segment, yielding a three-dimensional vector;
finally, the three-dimensional vectors corresponding to all convolution kernels are concatenated and passed through an activation function to obtain the final vector representation of the sentence.
Embodiment seven: this embodiment further describes embodiment six; the difference is that the activation function is the tanh function.
Embodiment eight: this embodiment further describes embodiment six; the difference is that the vector of a node sequence in the relation graph is represented as:
(e, r_1, e_1, r_2, e_2, ..., r_n, e_n)
where entity e is associated with entities e_1, e_2, ..., e_n, and the relations between entity e and entities e_1, e_2, ..., e_n are r_1, r_2, ..., r_n, respectively.
Embodiment nine: this embodiment further describes embodiment eight; the difference is that in step four the vector representation of each sentence in the biomedical data set and the comprehensive vector of the external knowledge graph of the head and tail entities are combined as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax((Q W_i^Q)(K W_i^K)^T / sqrt(d_k)) (V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
In the above formulas, Q is the matrix representation of each sentence, K and V are the vector representation of each sentence, and K and V are equal; W_i^Q, W_i^K, W_i^V and W^O are weight matrices within the neural network; head_i denotes one attention head of the computation, different heads corresponding to different expressions of the biomedical sentence; and Concat(head_1, ..., head_h) is the vector obtained by concatenating the different heads.
Embodiment ten: this embodiment further describes embodiment nine; the difference is that the specific steps of step five are:
the attention weight of each sentence in the sentence set is calculated using a sentence-level attention mechanism, and the attention-weighted sum of all sentence representations is taken as the vector representation of the sentence set, i.e.
s = Σ_i α_i x_i
α_i = exp(e_i) / Σ_k exp(e_k)
e_i = x_i A r
where α_i is the weight of the sentence vector x_i, x_i is the vector representation of the i-th sentence, A is a diagonal weight matrix, r is the vector representation of the relation r, and s is the vector representation of the sentence set;
finally, from the vector representation s of the sentence set, a softmax classifier computes the probability that the sentence set expresses the relation r:
P(r | S; θ) = softmax(W s + b)
where W is a weight matrix, s is the vector representation of the sentence set, b is a bias term, θ denotes the model parameters, and S is the sentence set.
Example: a biomedical relation extraction method incorporating external knowledge, comprising the following steps.
Step one: embed the words of each sentence in a sentence bag; concatenate the word embedding vector and the position embedding vector of each word to obtain the vector representation of each word, and then concatenate all word vectors in each sentence to obtain the matrix representation of the sentence.
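A minimal sketch of this embedding step in PyTorch. The maximum relative distance, the shared position-embedding table for the two distances, and the function name are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def sentence_matrix(token_ids, head_pos, tail_pos, word_emb, pos_emb, max_dist=60):
    """Concatenate word embeddings with two relative-position embeddings.

    token_ids          : (seq_len,) word indices of the sentence
    head_pos, tail_pos : positions of the head and tail entities in the sentence
    word_emb           : nn.Embedding(vocab_size, d_w)
    pos_emb            : nn.Embedding(2 * max_dist + 1, d_p), shared for both distances
    """
    idx = torch.arange(token_ids.size(0))
    d_head = (idx - head_pos).clamp(-max_dist, max_dist) + max_dist  # distance to head entity
    d_tail = (idx - tail_pos).clamp(-max_dist, max_dist) + max_dist  # distance to tail entity
    # each row: [word vector ; position-to-head vector ; position-to-tail vector]
    return torch.cat([word_emb(token_ids), pos_emb(d_head), pos_emb(d_tail)], dim=-1)
```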
The matrix representation of the sentences is then input into a CNN layer and a piecewise max-pooling layer, and the vector representation of each sentence is obtained through convolution, pooling and nonlinear operations.
The CNN layer is a convolutional layer. It first extracts local features over the sentence with a sliding window of length 3, and then combines all local features by a max-pooling operation to obtain a fixed-size vector for the input sentence. Here, convolution is defined as an operation between a sequence of vectors and a convolution matrix W ∈ R^{d_c×(l×d)}, where d_c is the embedding dimension of the sentence, d is the dimension of the word representation, and the convolution extracts local features through a sliding window of length l.
In the invention, taking the positions of the two entities into account, the pooling operation is further refined into piecewise pooling: each feature map p_i obtained by convolution is divided by the head entity and the tail entity into three segments (p_i1, p_i2, p_i3), and pooling is then performed separately on each segment:
[x_ij] = max(p_ij)
[x_i] is then defined as the concatenation of the [x_ij].
At the end of this step, a nonlinearity such as the tanh function is applied to the vector x to obtain the final vector representation of the sentence.
Step two: each sentence in the biomedical data set contains its head entity and tail entity, and these entities are related to other entities in an external knowledge base; the relations are expressed as triples, i.e. <entity 1, relation, entity 2>. Multiple triples are represented as a graph in which a relation node connects the two corresponding entity nodes, so that the head entity and the tail entity are each associated with several entities of the external knowledge base. The entity nodes and relation nodes in the graph are converted into the sequence form (e, r_1, e_1, r_2, e_2, ..., r_n, e_n), where entity e is related to entities e_1, e_2, ..., e_n and the corresponding relations are r_1, r_2, ..., r_n.
The head-entity and tail-entity relation sequences obtained in this way (each of the form (e, r_1, e_1, ..., r_n, e_n) above) are embedded as node vectors and input into the KG-Transformer model for feature extraction, yielding the KG representations of the head entity and the tail entity.
The KG-Transformer model takes the comprehensive vector representation X = {x_1, x_2, ..., x_N} of the input node sequence and feeds it into the Multi-head Attention Layer and the Add & Norm Layer:
e_ij^h = Masking((x_i W_h^Q)(x_j W_h^K)^T / sqrt(d), A)
α_ij^h = exp(e_ij^h) / Σ_{k=1}^{N} exp(e_ik^h)
x'_i = ||_{h=1}^{H} Σ_{j=1}^{N} α_ij^h (x_j W_h^V)
where ||_{h=1}^{H} denotes the concatenation of the H attention heads of this layer, and W_h^Q, W_h^K and W_h^V respectively denote the weights of the linear transformations applied, for the h-th attention head, to node x_j and to the node embedding X.
The Transformer blocks are stacked L times, finally yielding the KG representations of the head and tail entities.
Step three: perform the Knowledge-Attention operation on the results of the two preceding steps, using a multi-head attention mechanism to capture the internal correlations of the data and features in the biomedical text and to fuse the obtained feature vectors with the external knowledge. The calculation is as follows:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax((Q W_i^Q)(K W_i^K)^T / sqrt(d_k)) (V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
In the above formulas, Q is the comprehensive vector representation obtained in step one, K and V are the comprehensive vector representation described in step two, and K and V are equal; W_i^Q, W_i^K, W_i^V and W^O are weight matrices within the neural network; head_i denotes one attention head of the computation, different heads being understood as different expressions of the biomedical sentence; and Concat(head_1, ..., head_h) is the vector obtained by concatenating the different heads.
Step four: define the weight of each sentence vector representation with a sentence-level attention mechanism over the sentence set.
In this step, a query-based function measures how strongly the vector representation x_i of each sentence is associated with the relation r of the entity pair to be predicted.
Because the information of the relation r between the entity pair to be predicted is taken into account, the sentence-level attention mechanism over the set reduces the influence of noise by assigning smaller weights to noisy sentences.
Finally, given the set of all sentences and the entity pair, the probability of the predicted relation r is defined as:
p(r | S, θ) = exp(o_r) / Σ_{k=1}^{n_r} exp(o_k)
where n_r is the number of relation types and o is the input of the final layer of the neural network, o = M s + d, with d the offset (bias) vector and M the representation matrix of all relations.
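A small sketch of this final prediction layer, directly following o = M s + d; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def relation_probabilities(s, M, d):
    """Final prediction layer: o = M s + d, followed by a softmax over relations.

    s : (dim,)      bag representation from the sentence-level attention
    M : (n_r, dim)  representation matrix of all n_r relation types
    d : (n_r,)      offset (bias) vector
    """
    o = M @ s + d
    return F.softmax(o, dim=0)   # probability of each of the n_r candidate relations
```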
It should be noted that the detailed description serves only to explain and illustrate the technical solution of the present invention and does not limit the scope of protection of the claims; all modifications and variations are intended to fall within the scope of the invention as defined by the following claims and the description.

Claims (10)

1. A biomedical relation extraction method integrating external knowledge, characterized by comprising the following steps:
step one, performing word embedding and position embedding on each word of each sentence in a biomedical data set to obtain a word vector and a position vector, concatenating the word vector and the position vector to obtain the vector representation of each word, and finally concatenating the vector representations of all words in a sentence to obtain the matrix representation of each sentence;
step two, inputting the matrix representation of each sentence obtained in step one into a PCNN neural network to obtain the vector representation of each sentence in the biomedical data set;
step three, obtaining the head entity and the tail entity of each sentence in the biomedical data set, extracting from an external knowledge graph the entities related to the head entity and the entities related to the tail entity to obtain a relation graph centered on the head entity and a relation graph centered on the tail entity, and inputting the two relation graphs into a graph encoder to construct comprehensive vector representations of the external knowledge graph for the head and tail entities;
step four, combining the vector representation of each sentence in the biomedical data set with the comprehensive vectors of the external knowledge graph of the head and tail entities to obtain a sentence vector containing external information;
step five, for each entity pair, selecting all sentences containing the entity pair to form a set, calculating the attention weight of each sentence in the set with a sentence-level attention mechanism based on the sentence vectors containing external information, taking the attention-weighted sum of all sentence representations containing external information in the set as the vector representation of the set, and making a prediction from the vector representation of the set to obtain the predicted relation of the entity pair.
2. The method of claim 1, wherein the graph encoder employs a KG-Transformer.
3. The method for extracting biomedical relations fused with external knowledge according to claim 2, wherein the KG-Transformer encoding process is as follows:
the KG-Transformer takes the vector representations X = {x_1, x_2, ..., x_N} of the node sequences of the two input relation graphs and feeds them into the Multi-head Attention Layer and the Add & Norm Layer;
the Multi-head Attention Layer is computed as follows:
e_ij^h = Masking((x_i W_h^Q)(x_j W_h^K)^T / sqrt(d), A)
α_ij^h = exp(e_ij^h) / Σ_{k=1}^{N} exp(e_ik^h)
x'_i = ||_{h=1}^{H} Σ_{j=1}^{N} α_ij^h (x_j W_h^V)
where ||_{h=1}^{H} denotes the concatenation of the H attention heads of this layer, x'_i is the output representation of node i, A is the adjacency matrix, i indexes the i-th row and j the j-th column, d is the dimension of the node embedding, W_h^Q, W_h^K and W_h^V are weight matrices, N is the length of the node sequence, Masking(X, A) masks the values at the corresponding positions of matrix X according to the positions where the value in matrix A is 1, α_ij^h denotes the attention of the h-th attention head with respect to node x_j, h denotes the h-th attention head, the denominator of α_ij^h sums over all row elements of the masked score matrix, and I is an identity matrix;
the Add & Norm Layer is computed as follows:
O = LayerNorm(X + X')
where X = {x_1, x_2, ..., x_N} is the vector representation of the node sequence, X' = {x'_1, x'_2, ..., x'_N} is the output of the Multi-head Attention Layer, LayerNorm(·) is a layer normalization function, and the output O serves as the input to the next Multi-head Attention Layer;
the above computation is repeated L times, L being a positive integer, to obtain the vector representations of all nodes, and finally the vector representations of all nodes of the head-entity relation graph and of the tail-entity relation graph are summed separately to obtain the comprehensive vector representations of the head and tail entities.
4. The method of claim 3, wherein L is 8, 12, 16 or 24.
5. The method of claim 1, wherein the PCNN neural network in the second step obtains the vector representation of each sentence in the biomedical data set through convolution, pooling and nonlinear operations.
6. The method as claimed in claim 1, wherein the PCNN neural network in step two specifically performs the following steps:
first, convolution kernels with a sliding-window size of 3 extract local features from the matrix representation of the sentence, and a max-pooling operation combines the local features into a vector for the sentence;
the feature map produced by each convolution kernel is then divided into three segments according to the positions of the head entity and the tail entity, and max pooling is applied to each segment, yielding a three-dimensional vector;
finally, the three-dimensional vectors corresponding to all convolution kernels are concatenated and passed through an activation function to obtain the final vector representation of the sentence.
7. The method of claim 6, wherein the activation function is a tanh function.
8. The method of claim 6, wherein the vector of the node sequence in the relation graph is represented as:
(e, r_1, e_1, r_2, e_2, ..., r_n, e_n)
where entity e is associated with entities e_1, e_2, ..., e_n, and the relations between entity e and entities e_1, e_2, ..., e_n are r_1, r_2, ..., r_n, respectively.
9. The method for extracting biomedical relations fused with external knowledge according to claim 8, wherein step four combines the vector representation of each sentence in the biomedical data set and the comprehensive vector of the external knowledge graph of the head and tail entities as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) = softmax((Q W_i^Q)(K W_i^K)^T / sqrt(d_k)) (V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
In the above formulas, Q is the matrix representation of each sentence, K and V are the vector representation of each sentence, and K and V are equal; W_i^Q, W_i^K, W_i^V and W^O are weight matrices within the neural network; head_i denotes one attention head of the computation, different heads corresponding to different expressions of the biomedical sentence; Concat(head_1, ..., head_h) is the vector obtained by concatenating the different heads; and h denotes the h-th attention head.
10. The method for extracting biomedical relations integrated with external knowledge according to claim 9, wherein the specific steps of step five are as follows:
the attention weight of each sentence in the sentence set is calculated using a sentence-level attention mechanism, and the attention-weighted sum of all sentence representations is taken as the vector representation of the sentence set, i.e.
s = Σ_i α_i x_i
α_i = exp(e_i) / Σ_k exp(e_k)
e_i = x_i A r
where α_i is the weight of the sentence vector x_i, x_i is the vector representation of the i-th sentence, A is a diagonal weight matrix, r is the vector representation of the relation r, and s is the vector representation of the sentence set;
finally, from the vector representation s of the sentence set, a softmax classifier computes the probability that the sentence set belongs to the relation r:
P(r | S; θ) = softmax(W s + b)
where W is a weight matrix, s is the vector representation of the sentence set, b is a bias term, θ denotes the model parameters, S is the sentence set, e_i is the score of the degree of match between the i-th sentence x_i and its relation r, and e_k is the score of the degree of match between the k-th sentence x_k and its relation r.
CN202110367973.9A 2021-04-06 2021-04-06 External knowledge-integrated biomedical relation extraction method Active CN112860904B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110367973.9A CN112860904B (en) 2021-04-06 2021-04-06 External knowledge-integrated biomedical relation extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110367973.9A CN112860904B (en) 2021-04-06 2021-04-06 External knowledge-integrated biomedical relation extraction method

Publications (2)

Publication Number Publication Date
CN112860904A (en) 2021-05-28
CN112860904B (en) 2022-02-22

Family

ID=75992228

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110367973.9A Active CN112860904B (en) 2021-04-06 2021-04-06 External knowledge-integrated biomedical relation extraction method

Country Status (1)

Country Link
CN (1) CN112860904B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113536795B (en) * 2021-07-05 2022-02-15 杭州远传新业科技有限公司 Method, system, electronic device and storage medium for entity relation extraction
CN114064938B (en) * 2022-01-17 2022-04-22 中国人民解放军总医院 Medical literature relation extraction method and device, electronic equipment and storage medium
CN114579755A (en) * 2022-01-26 2022-06-03 北京博瑞彤芸科技股份有限公司 Method and device for constructing traditional Chinese medicine knowledge map

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959328B (en) * 2017-05-27 2021-12-21 株式会社理光 Knowledge graph processing method and device and electronic equipment
CN107391623B (en) * 2017-07-07 2020-03-31 中国人民大学 Knowledge graph embedding method fusing multi-background knowledge
CN108984745B (en) * 2018-07-16 2021-11-02 福州大学 Neural network text classification method fusing multiple knowledge maps
CN109635124B (en) * 2018-11-30 2021-04-23 北京大学 Remote supervision relation extraction method combined with background knowledge
CN109710932A (en) * 2018-12-22 2019-05-03 北京工业大学 A kind of medical bodies Relation extraction method based on Fusion Features
CN109902171B (en) * 2019-01-30 2020-12-25 中国地质大学(武汉) Text relation extraction method and system based on hierarchical knowledge graph attention model
CA3076638A1 (en) * 2019-03-22 2020-09-22 Royal Bank Of Canada Systems and methods for learning user representations for open vocabulary data sets
CN111291139B (en) * 2020-03-17 2023-08-22 中国科学院自动化研究所 Knowledge graph long-tail relation completion method based on attention mechanism
CN111260064A (en) * 2020-04-15 2020-06-09 中国人民解放军国防科技大学 Knowledge inference method, system and medium based on knowledge graph of meta knowledge
CN111931506B (en) * 2020-05-22 2023-01-10 北京理工大学 Entity relationship extraction method based on graph information enhancement

Also Published As

Publication number Publication date
CN112860904A (en) 2021-05-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant