CN111222338A - Biomedical relation extraction method based on pre-training model and self-attention mechanism - Google Patents
- Publication number
- CN111222338A (application CN202010017867.3A)
- Authority
- CN
- China
- Prior art keywords
- biomedical
- text
- sentence
- self
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G—PHYSICS; G06—COMPUTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/02—Neural networks
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/048 — Activation functions
- G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
Abstract
The invention belongs to the technical field of natural language processing and relates to a biomedical relation extraction method based on a pre-training model and a self-attention mechanism. The ELMO pre-training model is used to extract more complex information from biomedical sentences, improving the representation of sentence features in biomedical text and thereby better extracting the relationships between biomedical entities. Position features are added to learn the internal structure of the sentence and the positional relationships of the biomedical entities, and a self-attention mechanism better captures the internal correlations of the data and features within the sentence, so that the biomedical relation extraction task is accomplished more effectively. The method addresses the problem that current biomedical relation extraction mostly attends only to the surface semantics of sentence sequences; it enhances relation extraction on well-formed biomedical text and performs even better on the irregular biomedical text found on social media.
Description
Technical Field
The invention belongs to the technical field of natural language processing, and relates to a biomedical relation extraction method based on a pre-training model and a self-attention mechanism.
Background
With the rapid development of the internet, network information has grown explosively. In the biomedical field, 2,000 to 4,000 biomedical documents are published every day. These documents contain a massive number of biomedical entity relationships, such as drug-drug interactions and protein-protein interactions, and are an important resource for biomedical research.
Manually labeling entity relationships in the biomedical literature is time-consuming, labor-intensive, and requires expert biomedical knowledge. Automatically and efficiently extracting the entity relationships hidden in biomedical documents can save considerable manpower and resources.
Traditional template- and rule-based extraction methods have low recall and require manually constructed templates. As deep-learning theory and methods have matured, researchers have begun to use deep neural network models to automatically extract biomedical entity relationships from the biomedical literature. However, most of these methods attend only to the surface semantics of the sentence sequence, while the dependency between two entities in a sentence in fact rests on more complex semantic information.
Disclosure of Invention
The aim of the invention is to overcome the defects of the prior art by providing a natural-language relation extraction method based on a pre-training model and a self-attention mechanism. The ELMO pre-training model can extract more complex information from a biomedical sentence, including grammar, semantics, and even word-sense ambiguity, thereby improving the representation of sentence features in biomedical text and better extracting the relationships between biomedical entities such as protein-protein, disease-drug, and drug-side-effect pairs. Position features are added to learn the internal structure of the sentence and the positional relationships of the biomedical entities, and a self-attention mechanism is used to better capture the internal correlations of the data and features within the sentence, so that the biomedical relation extraction task is accomplished more effectively.
The technical scheme of the invention is as follows:
A biomedical relation extraction method based on a pre-training model and a self-attention mechanism comprises the following specific steps:
S1) Preprocess the labeled biomedical corpus: write a data preprocessing program that converts the original corpus into input acceptable to the deep-learning network model and constructs the position features of the biomedical entities, providing information about the relationships among them. The biomedical entities include diseases, drugs, and side effects.
S2) Construct sentence vectors of the biomedical text: input the corpus preprocessed in step S1) into the ELMO pre-training model, extract the features of the whole biomedical sentence, and output a vector; meanwhile, process the position features of the biomedical entities generated in step S1) by word embedding to form position vectors. All the resulting vectors are then concatenated into one long vector that represents the sentence in the biomedical text.
S3) Input to the BILSTM neural network model to extract features: the long vector obtained in step S2) is passed through a dropout layer and input into a BILSTM neural network, which learns the biomedical context information, understands a single sentence of the biomedical text from both directions, and outputs a feature vector.
S4) Extract the key features of the biomedical text with a multi-head self-attention mechanism: after the BILSTM neural network outputs the feature vectors, a multi-head self-attention mechanism captures the internal correlations of the data and features in the biomedical text, and the key features are then extracted through a fully connected layer.
S5) Biomedical entity relation prediction: after the key features are extracted by the fully connected layer, the result is input into a fully connected neural network again to predict the relationship of the two biomedical entities in the sentence, finally yielding the probability distribution of the biomedical relation and thereby extracting it.
The technical features of the invention are:
(1) representing the word vector by using a pre-training model ELMO;
(2) adding location features of the entity;
(3) a multi-head self-attention mechanism is added behind the BILSTM neural network model.
(4) The ELMO pre-training model learns complex word features, including grammar and semantics, as well as word-sense ambiguity across contexts; the position features of the entities provide information to the subsequent multi-head self-attention mechanism; the context information of the words is acquired through the BILSTM; and the multi-head attention mechanism then processes the biomedical sentences output by the previous layer with several attention patterns from several representation spaces, so that the relationships among entities are better extracted.
Compared with the prior art, the invention has the beneficial effects that:
1) The ELMO pre-training model takes word context into account, so that words receive vector representations closer to their actual semantics, resolving word-sense ambiguity in the biomedical text.
2) The position features and the BILSTM cooperate with a multi-head self-attention mechanism to understand a sentence from multiple angles and obtain the more complex semantics among biomedical entities.
The invention addresses the problem that current biomedical relation extraction mostly attends only to the surface semantics of sentence sequences. It enhances relation extraction on well-formed biomedical text and, because it mines the textual characteristics of irregular biomedical text more thoroughly, performs even better on the irregular biomedical text found on social media.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
FIG. 2 is a schematic diagram of a BILSTM neural network.
FIG. 3 is a schematic diagram of the ELMO pre-training model.
Detailed Description
The following further describes a specific embodiment of the present invention with reference to the drawings and technical solutions.
The invention relates to a biomedical relation extraction method based on a pre-training model and a self-attention mechanism. First, the labeled corpus is preprocessed and the position features between two biomedical entities (e.g., proteins, diseases, side effects) are constructed; the original corpus is then converted into input acceptable to the deep-learning network model. Next, word vectors are generated with the ELMO pre-training model and word embedding and connected into long vectors that represent the sentences of the biomedical text. The long vector is passed through a dropout layer and input into a BILSTM neural network to learn the context information of the biomedical text. A multi-head self-attention mechanism is then realized through a Self-Attention layer to understand the biomedical text from several angles and emphases, a fully connected layer extracts the key features, and finally an output layer outputs the probability distribution of the relation. The flow is shown in FIG. 1, and the steps are further described below.
S1) Preprocess the labeled biomedical text corpus and convert the original corpus into input acceptable to the deep-learning network model: first, a dictionary is built from the corpus, with each word assigned a sequence number, so that each sentence is represented as a sequence of numbers. The position features of the biomedical entities are constructed by building, for each of the two entities, a position sequence according to its position in the sentence.
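As a toy illustration of this preprocessing step, the sketch below builds a word-index dictionary and relative-distance position sequences for two entity mentions. The sentence, function names, and vocabulary scheme are invented for illustration and are not taken from the patent:

```python
# Hypothetical sketch of step S1: vocabulary construction plus
# entity position features; all names here are assumptions.

def build_vocab(sentences):
    """Map every word to an integer index (0 is reserved for padding)."""
    vocab = {}
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(sentence, vocab):
    """Turn a sentence into its sequence of word indices."""
    return [vocab[w] for w in sentence.split()]

def position_features(sentence, e1, e2):
    """Relative distance of every token to the two entity mentions."""
    tokens = sentence.split()
    i1, i2 = tokens.index(e1), tokens.index(e2)
    return ([i - i1 for i in range(len(tokens))],
            [i - i2 for i in range(len(tokens))])

sent = "aspirin may cause bleeding"
vocab = build_vocab([sent])
ids = encode(sent, vocab)
p1, p2 = position_features(sent, "aspirin", "bleeding")
print(ids)   # [1, 2, 3, 4]
print(p1)    # [0, 1, 2, 3]
print(p2)    # [-3, -2, -1, 0]
```

The two position sequences give every token a signed distance to each entity, which is what the later embedding layer consumes.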
S2) Construct sentence vectors of the biomedical text. The digitized sentence sequences produced in step S1), together with the dictionary (the mapping between sequence numbers and words), are input into the ELMO pre-training model to extract the features of the whole biomedical sentence. The biomedical text is learned through the multi-layer neural network inside ELMO: the states of the higher layers capture context-dependent aspects of the biomedical entities, enabling semantic disambiguation, while the lower layers find grammatical features, enabling part-of-speech tagging and yielding the grammatical characteristics of the biomedical entity-relation-entity structure. Combining the two is advantageous for extracting the relationships between entities. The generated word vectors are output, and the position features of the biomedical entities generated in step S1) are processed by word embedding to form position vectors. The position vectors and the word vectors generated by the ELMO pre-training model are then connected with the concatenate function in Keras to form a long vector, so that a large number of key features are extracted from the biomedical text and the text is better represented.
The ELMO pre-training model is a dynamic model whose internal structure is shown in FIG. 3; it can update the word embeddings. In essence, a language model first learns the word embeddings on a large corpus, and the parameters of the model are then fine-tuned with the biomedical text data (fine-tuning the pre-trained ELMO model), so that the grammatical and semantic characteristics of the biomedical text can be better represented by drawing on the huge corpus.
The ELMO pre-training model is a bidirectional LSTM language model composed of a forward and a backward language model; its objective function is the joint maximum likelihood of the language models in the two directions. The forward and backward model formulas are, respectively:

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \dots, t_{k-1})

p(t_1, t_2, \dots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \dots, t_N)

The log-likelihood function is expressed as:

\sum_{k=1}^{N} \left[ \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right]

where t_1, t_2, \dots, t_N are the words in a biomedical text sentence, p(t_1, t_2, \dots, t_N) is the probability of the biomedical text composed of these words, \Theta_x denotes the word-embedding parameters, and \Theta_s denotes the parameters of the softmax layer.
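The bidirectional objective can be illustrated with a toy bigram language model standing in for the LSTM language model. The corpus and sentences below are invented; the only point is that the joint objective sums forward and backward log-probabilities:

```python
import math

# Toy bidirectional language-model objective: forward and backward
# bigram probabilities over a tiny invented corpus.

corpus = [["the", "drug", "causes", "nausea"],
          ["the", "drug", "treats", "pain"]]

def bigram_counts(sents, reverse=False):
    """Count bigrams, optionally over the reversed sentences."""
    counts = {}
    for s in sents:
        seq = list(reversed(s)) if reverse else s
        for prev, cur in zip(["<s>"] + seq, seq):
            counts.setdefault(prev, {}).setdefault(cur, 0)
            counts[prev][cur] += 1
    return counts

def log_likelihood(sent, counts, reverse=False):
    """Sum of log p(t_k | previous token) under the bigram counts."""
    seq = list(reversed(sent)) if reverse else sent
    ll = 0.0
    for prev, cur in zip(["<s>"] + seq, seq):
        total = sum(counts[prev].values())
        ll += math.log(counts[prev][cur] / total)
    return ll

fwd = bigram_counts(corpus)
bwd = bigram_counts(corpus, reverse=True)
sent = ["the", "drug", "causes", "nausea"]
joint = log_likelihood(sent, fwd) + log_likelihood(sent, bwd, reverse=True)
print(round(joint, 4))   # -1.3863  (= ln 0.5 + ln 0.5)
```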
The ELMO model differs from previous models in that they use only the output values of the last layer as the word embedding, whereas ELMO represents the word embedding as a linear combination of the output values of all layers. For each token in a sentence, an L-layer BiLSTM computes 2L + 1 representations:

R_k = \{ x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \dots, L \} = \{ h_{k,j}^{LM} \mid j = 0, \dots, L \}

where t_k is the word at the k-th position; at each position k, each LSTM layer outputs a corresponding representation h_{k,j}^{LM}, and j denotes the layer of the neural network. In a downstream task, R_k is then compressed into one vector:

ELMO_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

where s_j^{task} are the softmax-normalized weights and \gamma^{task} is a scaling factor that scales the entire ELMO vector. The learning range of the ELMO pre-training model is wide: compared with Word2vec, it learns the complex characteristics of word usage and how those usages change across contexts, and it can be called directly within a specific network model. In this embodiment, a pre-trained ELMO language model is called directly from https://tfhub.
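A minimal sketch of the layer-combination formula above, with random stand-ins for the per-layer representations; the dimensions, weights, and γ value are assumptions for illustration:

```python
import numpy as np

def elmo_combine(layer_reprs, s_logits, gamma=1.0):
    """ELMO_k = gamma * sum_j softmax(s)_j * h_{k,j} over L+1 layer vectors."""
    s = np.exp(s_logits - s_logits.max())   # numerically stable softmax
    s = s / s.sum()
    return gamma * np.tensordot(s, np.stack(layer_reprs), axes=1)

L, dim = 2, 4
rng = np.random.default_rng(0)
# x_k plus the outputs of L BiLSTM layers -> L + 1 layer representations
layer_reprs = [rng.standard_normal(dim) for _ in range(L + 1)]
s_logits = np.zeros(L + 1)          # equal logits -> uniform weights 1/(L+1)
vec = elmo_combine(layer_reprs, s_logits, gamma=0.5)
print(vec.shape)   # (4,)
```

With uniform weights the result is simply γ times the mean of the layer vectors, which makes the scaling role of γ easy to see.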
The position features of the biomedical text are then turned into position vectors by calling the embedding layer in Keras; adding the position features allows learning the internal structure of the biomedical text and the positional relationships of the biomedical entities.
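A sketch of how the position vectors and word vectors might be joined into the long sentence vector; NumPy concatenation stands in for the Keras concatenate call, and all sizes and the embedding table are illustrative assumptions:

```python
import numpy as np

# Stand-in for step S2's long vector: ELMO word vectors (random here)
# concatenated with embedded position features for both entities.

n, word_dim, pos_dim, max_dist = 4, 8, 3, 10
rng = np.random.default_rng(1)
word_vecs = rng.standard_normal((n, word_dim))                # stand-in for ELMO output
pos_table = rng.standard_normal((2 * max_dist + 1, pos_dim))  # embedding lookup table

p1 = np.array([0, 1, 2, 3])      # distances to entity 1
p2 = np.array([-3, -2, -1, 0])   # distances to entity 2
pos1 = pos_table[p1 + max_dist]  # shift so negative distances index validly
pos2 = pos_table[p2 + max_dist]

long_vec = np.concatenate([word_vecs, pos1, pos2], axis=1)
print(long_vec.shape)   # (4, 14)  = word_dim + 2 * pos_dim per token
```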
S3) Input to the BILSTM network to extract features. Overfitting is prevented by a dropout layer, after which the features are input into the BILSTM network. BILSTM is chosen because it is more powerful than a plain RNN or a unidirectional LSTM: it can learn context information and mitigates problems such as exploding and vanishing gradients.
BILSTM essentially combines a forward LSTM and a backward LSTM, learning information in both directions. LSTM itself is a variant of the RNN that combines short-term and long-term memory through subtle gating. With a plain RNN, if a biomedical sentence is too long, the gradient of the neural network vanishes and the relationships between biomedical entities cannot be extracted effectively; using BILSTM effectively addresses this problem.
LSTM can remove or add information to the cell state through three gate structures: an input gate, a forget gate, and an output gate. A gate is essentially a way of selectively passing information, consisting of a sigmoid neural-network layer and a pointwise multiplication; the sigmoid outputs a value between 0 and 1 that indicates how much of each component may pass. The principle of LSTM is shown in FIG. 2, which contains the corresponding parameters, and the formulas implementing the LSTM functions are as follows:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t])

i_t = \sigma(W_i \cdot [h_{t-1}, x_t])

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t])

C_t = f_t * C_{t-1} + i_t * \tilde{C}_t

o_t = \sigma(W_o \cdot [h_{t-1}, x_t])

h_t = o_t * \tanh(C_t)

The first step of LSTM is to decide what information to discard from the cell state; this is done by the forget gate, which reads h_{t-1} and x_t and outputs, for each element of the cell state C_{t-1}, a value between 0 and 1, where 1 means keep entirely and 0 means discard entirely. The next step determines what new information the cell stores: a sigmoid layer (the input gate) decides which values to update, and a tanh layer creates a vector of candidate values \tilde{C}_t to be added to the state. The state is then updated: multiplying the old state C_{t-1} by f_t discards information, and adding i_t * \tilde{C}_t incorporates the new candidates; that is, the sigmoid functions select what to forget and remember, and the tanh function produces the candidate values. Finally, the output gate's sigmoid decides which parts of the cell state to output; the cell state is passed through a tanh and multiplied by the sigmoid output to give the determined output.
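The gate equations can be sketched as a single LSTM step in NumPy; bias terms are omitted, as in the formulas above, and the weights are random stand-ins:

```python
import numpy as np

# One LSTM step following the gate equations (no bias terms);
# weight matrices are random stand-ins, not trained parameters.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo):
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f = sigmoid(Wf @ z)                     # forget gate
    i = sigmoid(Wi @ z)                     # input gate
    c_tilde = np.tanh(Wc @ z)               # candidate values
    c = f * c_prev + i * c_tilde            # new cell state
    o = sigmoid(Wo @ z)                     # output gate
    h = o * np.tanh(c)                      # new hidden state
    return h, c

d_in, d_h = 3, 5
rng = np.random.default_rng(2)
Wf, Wi, Wc, Wo = (rng.standard_normal((d_h, d_h + d_in)) for _ in range(4))
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h),
                 Wf, Wi, Wc, Wo)
print(h.shape, c.shape)   # (5,) (5,)
```

Because h is the product of a sigmoid output and a tanh, every component of the hidden state stays strictly inside (-1, 1).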
The output of the BILSTM at time step t is h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t], where \overrightarrow{h}_t is the output of the forward LSTM and \overleftarrow{h}_t is the output of the backward LSTM. The newly generated matrix H = (h_1, h_2, \dots, h_n) reflects the semantics of a sentence better and more deeply at a higher level, with H \in R^{n \times 2d}, where n is the length of the sentence sequence and d is the dimension of the LSTM.
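A shape-level sketch of forming H from the two directions; the per-direction outputs are random stand-ins for the actual LSTM computations:

```python
import numpy as np

# BILSTM output H: forward and backward hidden states concatenated
# per time step, giving H in R^{n x 2d}.

n, d = 6, 5                           # sentence length, LSTM dimension
rng = np.random.default_rng(3)
h_fwd = rng.standard_normal((n, d))   # stand-in for forward LSTM outputs
h_bwd = rng.standard_normal((n, d))   # stand-in for backward LSTM outputs

H = np.concatenate([h_fwd, h_bwd], axis=1)
print(H.shape)   # (6, 10)
```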
S4) Extract the features of the biomedical text with a multi-head self-attention mechanism. After the BILSTM output, a multi-head self-attention mechanism is used to better capture the internal correlations of the data and features in the biomedical sentence.
Given an input sentence from the biomedical domain, attention is computed between each word in the sentence and every word in the sentence; the self-attention mechanism is adopted to better learn the dependency relationships and internal structure within the biomedical sentence.
As mentioned above, the invention introduces position features: the dependency relationships at relative distances within the sentence are learned through the self-attention mechanism, the attention value of each word in the sentence can be computed, and the internal structure of a sentence can be learned further.
Multi-Head Self-Attention is not computed just once; it captures relevant information in different representation subspaces and dimensions by computing attention many times. The multi-head self-attention mechanism provides several different representation subspaces for the biomedical text, with multiple sets of Q, K, and V weight matrices, which strengthens the model's ability to attend to different positions.
The calculation process is as follows:

Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V

head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)

MultiHead(Q, K, V) = Concat(head_1, \dots, head_h)W^O

where d_k is the dimension of the key vectors; Q, K, and V are all the long vectors described in step S2), and the three are equal because of the self-attention mechanism; W_i^Q, W_i^K, W_i^V, and W^O are weight matrices inside the neural network; head_i is one computed head, and different heads can be understood as different interpretations of the biomedical sentence; Concat(head_1, \dots, head_h) is the vector obtained by concatenating the different heads.
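A minimal NumPy sketch of the multi-head self-attention computation, with Q = K = V and randomly initialised projection matrices; the head count and dimensions are illustrative assumptions:

```python
import numpy as np

# Scaled dot-product multi-head self-attention over a sentence matrix H.

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(H, heads, W_Q, W_K, W_V, W_O):
    # Self-attention: Q = K = V = H, projected per head, then concatenated.
    outs = [attention(H @ W_Q[i], H @ W_K[i], H @ W_V[i]) for i in range(heads)]
    return np.concatenate(outs, axis=-1) @ W_O

n, d_model, h = 6, 8, 2
d_k = d_model // h
rng = np.random.default_rng(4)
H = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
W_O = rng.standard_normal((h * d_k, d_model))
out = multi_head(H, h, W_Q, W_K, W_V, W_O)
print(out.shape)   # (6, 8)
```

Each head attends over the whole sentence in its own subspace, and the final W^O projection mixes the heads back into a single representation per token.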
Finally, the key features are extracted through a fully connected layer.
S5) biomedical entity relationship prediction
After the key features are extracted by the fully connected layer, the result is input into the fully connected neural network again to predict the relationship of the two biomedical entities in the sentence, and the final relation prediction result is calculated with a Softmax function.
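A sketch of this final prediction step; the relation inventory shown is a hypothetical DDI-style label set, and the features and weights are random stand-ins:

```python
import numpy as np

# Fully connected layer + Softmax turning a key-feature vector into a
# probability distribution over relation types (all values invented).

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

relations = ["advise", "effect", "mechanism", "int", "none"]  # hypothetical labels
rng = np.random.default_rng(5)
features = rng.standard_normal(16)                 # key features from the dense layer
W = rng.standard_normal((len(relations), 16))
b = np.zeros(len(relations))

probs = softmax(W @ features + b)
pred = relations[int(np.argmax(probs))]
print(np.round(probs.sum(), 6))   # 1.0
```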
Claims (2)
1. A biomedical relation extraction method based on a pre-training model and a self-attention mechanism is characterized by comprising the following specific steps:
s1) preprocessing the labeled biomedical corpus: writing a data preprocessing program, converting the original corpus into input which can be accepted by a deep learning network model, constructing the position characteristics of the biomedical entities, and providing information for the relationship among the biomedical entities;
s2) constructing a sentence vector of the biomedical text: inputting the corpus preprocessed in the step S1) into an ELMO pre-training model, extracting the characteristics of the whole biomedical text sentence, and outputting a vector; meanwhile, processing the position characteristics of the biomedical entity generated in the step S1) by adopting a word embedding mode to form a position vector of the biomedical entity; then all the obtained vectors are connected to form a long vector to represent sentences in the biomedical text;
s3) inputting the BILSTM neural network model to extract features: after the long vector obtained in the step S2) passes through a dropout layer, the long vector is input into a BILSTM neural network to learn the context information of the biomedicine, so that a single statement of the biomedicine text is understood from two directions, and the BILSTM outputs a feature vector;
s4) extracting key features of the biomedical text using a multi-headed self-attentiveness mechanism: after the BILSTM neural network outputs the feature vectors, capturing data in the biomedical text and the internal correlation of the features by using a multi-head self-attention mechanism, and extracting key features through a full connection layer;
s5) biomedical entity relationship prediction: after key features are extracted through the full-connection layer, the results are input into the full-connection neural network again, the relation of the two biomedical entities in sentences is predicted, and finally probability distribution of the biomedical relation is obtained, so that the biomedical relation is extracted.
2. The method of claim 1, wherein the biomedical entities comprise diseases, drugs and side effects.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010017867.3A CN111222338A (en) | 2020-01-08 | 2020-01-08 | Biomedical relation extraction method based on pre-training model and self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010017867.3A CN111222338A (en) | 2020-01-08 | 2020-01-08 | Biomedical relation extraction method based on pre-training model and self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111222338A true CN111222338A (en) | 2020-06-02 |
Family
ID=70829371
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010017867.3A Pending CN111222338A (en) | 2020-01-08 | 2020-01-08 | Biomedical relation extraction method based on pre-training model and self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111222338A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814460A (en) * | 2020-07-06 | 2020-10-23 | 四川大学 | External knowledge-based drug interaction relation extraction method and system |
CN111859912A (en) * | 2020-07-28 | 2020-10-30 | 广西师范大学 | PCNN model-based remote supervision relationship extraction method with entity perception |
CN112435720A (en) * | 2020-12-04 | 2021-03-02 | 上海蠡图信息科技有限公司 | Prediction method based on self-attention mechanism and multi-drug characteristic combination |
CN112597366A (en) * | 2020-11-25 | 2021-04-02 | 中国电子科技网络信息安全有限公司 | Encoder-Decoder-based event extraction method |
CN112883732A (en) * | 2020-11-26 | 2021-06-01 | 中国电子科技网络信息安全有限公司 | Method and device for identifying Chinese fine-grained named entities based on associative memory network |
CN113468872A (en) * | 2021-06-09 | 2021-10-01 | 大连理工大学 | Biomedical relation extraction method and system based on sentence level graph convolution |
CN113868374A (en) * | 2021-09-15 | 2021-12-31 | 西安交通大学 | Graph convolution network biomedical information extraction method based on multi-head attention mechanism |
CN114373512A (en) * | 2021-12-28 | 2022-04-19 | 大连海事大学 | Protein interaction relation extraction method based on Gaussian enhancement and auxiliary task |
CN114881038A (en) * | 2022-07-12 | 2022-08-09 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519890A (en) * | 2018-04-08 | 2018-09-11 | 武汉大学 | A kind of robustness code abstraction generating method based on from attention mechanism |
CN108536754A (en) * | 2018-03-14 | 2018-09-14 | 四川大学 | Electronic health record entity relation extraction method based on BLSTM and attention mechanism |
CN108875809A (en) * | 2018-06-01 | 2018-11-23 | 大连理工大学 | The biomedical entity relationship classification method of joint attention mechanism and neural network |
CN109492227A (en) * | 2018-11-16 | 2019-03-19 | 大连理工大学 | It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations |
CN109558576A (en) * | 2018-11-05 | 2019-04-02 | 中山大学 | A kind of punctuation mark prediction technique based on from attention mechanism |
CN109783618A (en) * | 2018-12-11 | 2019-05-21 | 北京大学 | Pharmaceutical entities Relation extraction method and system based on attention mechanism neural network |
CN109871451A (en) * | 2019-01-25 | 2019-06-11 | 中译语通科技股份有限公司 | A kind of Relation extraction method and system incorporating dynamic term vector |
CN109992783A (en) * | 2019-04-03 | 2019-07-09 | 同济大学 | Chinese term vector modeling method |
CN110134771A (en) * | 2019-04-09 | 2019-08-16 | 广东工业大学 | A kind of implementation method based on more attention mechanism converged network question answering systems |
CN110147551A (en) * | 2019-05-14 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Multi-class entity recognition model training, entity recognition method, server and terminal |
CN110321566A (en) * | 2019-07-10 | 2019-10-11 | 北京邮电大学 | Chinese name entity recognition method, device, computer equipment and storage medium |
CN110334196A (en) * | 2019-06-28 | 2019-10-15 | 同济大学 | Neural network Chinese charater problem based on stroke and from attention mechanism generates system |
CN110597970A (en) * | 2019-08-19 | 2019-12-20 | 华东理工大学 | Multi-granularity medical entity joint identification method and device |
- 2020-01-08 — Application CN202010017867.3A filed; publication CN111222338A, status Pending
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536754A (en) * | 2018-03-14 | 2018-09-14 | 四川大学 | Electronic health record entity relation extraction method based on BLSTM and attention mechanism |
CN108519890A (en) * | 2018-04-08 | 2018-09-11 | 武汉大学 | A kind of robustness code abstraction generating method based on from attention mechanism |
CN108875809A (en) * | 2018-06-01 | 2018-11-23 | 大连理工大学 | The biomedical entity relationship classification method of joint attention mechanism and neural network |
CN109558576A (en) * | 2018-11-05 | 2019-04-02 | 中山大学 | A kind of punctuation mark prediction technique based on from attention mechanism |
CN109492227A (en) * | 2018-11-16 | 2019-03-19 | 大连理工大学 | It is a kind of that understanding method is read based on the machine of bull attention mechanism and Dynamic iterations |
CN109783618A (en) * | 2018-12-11 | 2019-05-21 | 北京大学 | Pharmaceutical entities Relation extraction method and system based on attention mechanism neural network |
CN109871451A (en) * | 2019-01-25 | 2019-06-11 | 中译语通科技股份有限公司 | A kind of Relation extraction method and system incorporating dynamic term vector |
CN109992783A (en) * | 2019-04-03 | 2019-07-09 | 同济大学 | Chinese term vector modeling method |
CN110134771A (en) * | 2019-04-09 | 2019-08-16 | 广东工业大学 | A kind of implementation method based on more attention mechanism converged network question answering systems |
CN110147551A (en) * | 2019-05-14 | 2019-08-20 | 腾讯科技(深圳)有限公司 | Multi-class entity recognition model training, entity recognition method, server and terminal |
CN110334196A (en) * | 2019-06-28 | 2019-10-15 | 同济大学 | Neural network Chinese charater problem based on stroke and from attention mechanism generates system |
CN110321566A (en) * | 2019-07-10 | 2019-10-11 | 北京邮电大学 | Chinese named entity recognition method, device, computer equipment and storage medium |
CN110597970A (en) * | 2019-08-19 | 2019-12-20 | 华东理工大学 | Multi-granularity medical entity joint identification method and device |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111814460A (en) * | 2020-07-06 | 2020-10-23 | 四川大学 | External knowledge-based drug interaction relation extraction method and system |
CN111859912A (en) * | 2020-07-28 | 2020-10-30 | 广西师范大学 | Entity-aware distant-supervision relation extraction method based on PCNN model |
CN112597366B (en) * | 2020-11-25 | 2022-03-18 | 中国电子科技网络信息安全有限公司 | Encoder-Decoder-based event extraction method |
CN112597366A (en) * | 2020-11-25 | 2021-04-02 | 中国电子科技网络信息安全有限公司 | Encoder-Decoder-based event extraction method |
CN112883732A (en) * | 2020-11-26 | 2021-06-01 | 中国电子科技网络信息安全有限公司 | Method and device for identifying Chinese fine-grained named entities based on associative memory network |
CN112435720B (en) * | 2020-12-04 | 2021-10-26 | 上海蠡图信息科技有限公司 | Prediction method based on self-attention mechanism and multi-drug characteristic combination |
CN112435720A (en) * | 2020-12-04 | 2021-03-02 | 上海蠡图信息科技有限公司 | Prediction method based on self-attention mechanism and multi-drug characteristic combination |
CN113468872A (en) * | 2021-06-09 | 2021-10-01 | 大连理工大学 | Biomedical relation extraction method and system based on sentence level graph convolution |
CN113468872B (en) * | 2021-06-09 | 2024-04-16 | 大连理工大学 | Biomedical relation extraction method and system based on sentence level graph convolution |
CN113868374A (en) * | 2021-09-15 | 2021-12-31 | 西安交通大学 | Graph convolution network biomedical information extraction method based on multi-head attention mechanism |
CN113868374B (en) * | 2021-09-15 | 2024-04-12 | 西安交通大学 | Graph convolution network biomedical information extraction method based on multi-head attention mechanism |
CN114373512A (en) * | 2021-12-28 | 2022-04-19 | 大连海事大学 | Protein interaction relation extraction method based on Gaussian enhancement and auxiliary task |
CN114881038A (en) * | 2022-07-12 | 2022-08-09 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
CN114881038B (en) * | 2022-07-12 | 2022-11-11 | 之江实验室 | Chinese entity and relation extraction method and device based on span and attention mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111222338A (en) | Biomedical relation extraction method based on pre-training model and self-attention mechanism | |
CN107832400B (en) | Relation classification method using a position-based LSTM and CNN joint model | |
KR102008845B1 (en) | Automatic classification method of unstructured data | |
CN111914085B (en) | Text fine granularity emotion classification method, system, device and storage medium | |
Boopathi | Deep Learning Techniques Applied for Automatic Sentence Generation | |
CN110502753A (en) | Semantic-enhancement-based deep learning sentiment analysis model and analysis method | |
CN110750635B (en) | Statute recommendation method based on joint deep learning model | |
CN111966812B (en) | Automatic question answering method based on dynamic word vector and storage medium | |
CN110598005A (en) | Public safety event-oriented multi-source heterogeneous data knowledge graph construction method | |
CN110362819A (en) | Text emotion analysis method based on convolutional neural networks | |
CN112784041B (en) | Chinese short text sentiment orientation analysis method | |
CN112199503B (en) | Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method | |
CN112561718A (en) | Sentiment tendency analysis method for evaluation objects in case-related microblogs based on BiLSTM weight sharing | |
CN114462420A (en) | False news detection method based on feature fusion model | |
CN110232127A (en) | Text classification method and device | |
CN112287106A (en) | Online comment emotion classification method based on dual-channel hybrid neural network | |
CN113157919A (en) | Sentence text aspect level emotion classification method and system | |
CN115906816A (en) | Text sentiment analysis method using a two-channel attention model based on Bert | |
Khan et al. | A deep neural framework for image caption generation using gru-based attention mechanism | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN114547303A (en) | Text multi-feature classification method and device based on Bert-LSTM | |
CN114416969A (en) | LSTM-CNN online comment sentiment classification method and system based on background enhancement | |
Gao et al. | Attention-based BiLSTM network with lexical feature for emotion classification | |
CN116843175A (en) | Contract term risk checking method, system, equipment and storage medium | |
Ermatita et al. | Sentiment Analysis of COVID-19 using Multimodal Fusion Neural Networks. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200602 |