CN111078889A - Method for extracting relationships among medicines based on attention of various entities and improved pre-training language model - Google Patents


Info

Publication number
CN111078889A
CN111078889A
Authority
CN
China
Prior art keywords: drug, sentence, attention, entity, vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911330114.1A
Other languages
Chinese (zh)
Other versions
CN111078889B (en)
Inventor
李丽双 (Li Lishuang)
朱燏 (Zhu Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201911330114.1A priority Critical patent/CN111078889B/en
Publication of CN111078889A publication Critical patent/CN111078889A/en
Application granted granted Critical
Publication of CN111078889B publication Critical patent/CN111078889B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/40 ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Abstract

The invention belongs to the technical field of computer natural language processing and provides a method for extracting drug-drug relationships based on multiple entity attention mechanisms and an improved pre-trained language model. Several different entity attention mechanisms are used within a neural network to strengthen the network's understanding of complex drug names: drug-entity attention, attention over the difference between the two entity vectors, and an attention mechanism based on entity description documents. At the same time, the input of the pre-trained language model is improved so that its output better suits the drug-drug relationship extraction task. The method addresses the problem that overly complex drug names prevent a deep learning model from understanding drug-relationship descriptions, and improves the level of drug-relationship recognition.

Description

Method for extracting relationships among medicines based on attention of various entities and improved pre-training language model
Technical Field
The invention belongs to the technical field of computer natural language processing, relates to a method for extracting relationships between medicines from biomedical texts, and particularly relates to a method for extracting relationships between medicines based on an improved pre-training language model and multiple entity attention mechanisms.
Background
Drug-drug interactions (DDIs) refer to the combined effects that occur when two or more drugs are taken simultaneously or within a certain period. As medical researchers continue to study drug-drug interactions in depth, a great deal of valuable information lies buried in the exponentially growing unstructured biomedical literature. Much drug information is currently found in drug-related open databases such as DrugBank (Wishart, D.S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2017, 46, D1074-D1082). How to automatically extract structured drug-drug relationships from massive unstructured biomedical documents is a problem that researchers urgently need to solve.
Relation extraction is one of the common tasks in natural language processing: mining the relationship between two specific entities in text with a machine learning model. Drug-drug relationship extraction is a very typical relation extraction task and one of the most closely watched tasks in the biomedical field. In recent years, the DDIExtraction2011 (Segura-Bedmar, I. et al. The 1st DDIExtraction-2011 challenge task: extraction of drug-drug interactions from biomedical texts. In Proceedings of the 1st Challenge Task on Drug-Drug Interaction Extraction 2011, Huelva, Spain, 7 September 2011) and DDIExtraction2013 (Segura-Bedmar, I. et al. SemEval-2013 Task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013). In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013)) shared tasks have greatly promoted research on drug-drug relationship extraction.
At present, researchers mainly use the corpus of the DDIExtraction2013 task to evaluate DDI extraction models. The difficulty of this task is to classify the relationships between drugs described in biomedical text into 5 classes: Mechanism, Effect, Advice, Int and Negative. The Mechanism type describes a pharmacokinetic relationship between two drugs. The Effect type indicates that the two drugs influence each other's efficacy. The Advice type describes a recommendation or suggestion about using the two drugs together. The Int type states that two drugs have a specific relationship that the literature does not further describe. The Negative type indicates that there is no relationship between the two drugs. For example, in the illustrative sentence "Pantoprazole had a much weaker effect on clopidogrel's pharmacokinetics and on platelet reactivity during concomitant use", the Mechanism relationship holds between the drugs "pantoprazole" and "clopidogrel". In the example sentence "Codeine in combination with other narcotic analgesics, general anesthetics, phenothiazines, tranquilizers, sedative-hypnotics, or other CNS depressants (including alcohol) has additive depressant effects", the Effect relationship holds between "codeine" and "narcotic analgesics". The second example also shows that, besides the two related drugs, the sentence mentions other drugs such as "anesthetics", "phenothiazines" and "alcohol". Drugs in the sentence that are not part of the candidate relationship can interfere with judging the current drug pair and make the model's decision harder. In addition, drug names are often quite complex, which also makes it difficult for a model to understand the meaning of a drug entity in a sentence from its name.
Currently, two types of methods are mainly used for this task. The first is traditional machine learning; the second is deep learning (LeCun Y et al. Deep learning [J]. Nature, 2015, 521(7553): 436-444). Traditional machine learning methods extract a large number of lexical and syntactic features from the raw text and feed them to a classifier such as an SVM or a random forest. Chowdhury et al. (Chowdhury M et al. FBK-irst: A multi-phase kernel based approach for drug-drug interaction detection and classification that exploits linguistic information [C]// 7th International Workshop on Semantic Evaluation, Atlanta, Georgia, USA, 2013: 351-) proposed a multi-phase kernel-based approach exploiting linguistic information. Björne et al. (Björne J, Kaewphan S, Salakoski T. UTurku: drug named entity recognition and drug-drug interaction extraction using SVM classification and domain knowledge [C]// Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 2013: 651-659) adopted shortest dependency path information as the input of an SVM model and fused knowledge of related fields. Thomas et al. (Thomas P, Neves M, Rocktäschel T, et al. WBI-DDI: drug-drug interaction extraction using majority voting [C]// Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). 2013: 628-635) used a majority-voting-based kernel method for classification. In general, traditional machine-learning methods need large, carefully designed feature sets to perform well, but designing and extracting those feature sets requires much manpower.
In recent years, more and more deep models have been applied to natural language processing tasks with good results. Quan et al. (Quan C, Hua L, Sun X, et al. Multichannel convolutional neural network for biological relation extraction [J]. BioMed Research International, 2016, 2016: 1-10) proposed a multichannel CNN model that takes word vectors obtained by several pre-training methods as input. Asada et al. (Asada M, Miwa M, Sasaki Y. Enhancing drug-drug interaction extraction from texts by molecular structure information [J]. Proceedings of the 56th Annual Meeting of the ACL, 2018: 680-685) proposed fusing molecular information into a CNN and a graph convolutional neural network (GCNN) to extract DDIs. Recurrent neural networks (RNNs) are better suited than CNNs to processing time-series data and are better at capturing the sequence features of sentences. Zhang et al. (Zhang Y, Zheng W, Lin H, et al. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths [J]. Bioinformatics, 2017, 34(5): 828-835) proposed a hierarchical RNN method combining shortest dependency paths (SDPs) and sentence sequences for DDI extraction. Some researchers have also combined the two model families to extract DDIs. Sun et al. (Sun X, Dong K, Ma L, et al. Drug-drug interaction extraction via recurrent hybrid convolutional neural networks with an improved focal loss [J]. Entropy, 2019, 21(1): 37) proposed a recurrent hybrid convolutional neural network (RHCNN) for DDI extraction.
Although various approaches have been proposed, there is still much room to improve DDI extraction performance. To prevent complex drug names from hurting model performance, previous work often replaced the drug names in a sentence with specific placeholder words, which loses part of the useful information. Furthermore, previous work mostly relies on syntactic features such as dependency paths to improve performance, and since these features depend on specific tools to generate, model performance is also limited by those tools.
Disclosure of Invention
The invention does not depend on any lexical or syntactic information; it simplifies the model input through improved BioBERT pre-trained word vectors and multiple entity attention mechanisms, makes better use of drug-name information, and reaches the current leading performance.
The technical scheme of the invention is as follows:
a method for extracting relationships among medicines based on multi-entity attention and an improved pre-training language model comprises the following steps:
(I) Text preprocessing
Preprocessing the corpus: (1) first convert all text to lower case, then remove punctuation marks and non-English characters; (2) because drug-drug relationship extraction does not involve quantitative analysis, replace all numbers in the text with the word "num"; (3) a sentence may contain several drug entities, and one instance is generated for each pair of drug entities, giving a total of n(n-1)/2 instances, where n is the number of drug entities in the sentence; (4) replace the two target entities in each instance with "drug1" and "drug2", and replace the non-target entities with "drug0"; (5) set the maximum sentence length the model can handle, and if a sentence in an instance does not reach the maximum length, pad it with the character "0".
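The five preprocessing steps can be sketched as follows; the function and variable names are illustrative, not from the patent, and the example sentence is a toy:

```python
import itertools
import re

def make_instances(tokens, entity_spans, max_len=250):
    """One classification instance per drug-entity pair, per steps (1)-(5)."""
    # (1)-(2): lowercase, drop punctuation/non-English chars, map numbers to "num"
    def normalize(tok):
        tok = re.sub(r"[^a-z0-9]", "", tok.lower())
        return "num" if tok.isdigit() else tok

    instances = []
    # (3): one instance per unordered entity pair -> n*(n-1)/2 instances
    for i, j in itertools.combinations(range(len(entity_spans)), 2):
        inst = [normalize(t) for t in tokens]
        # (4): mark the target pair, blind every other drug mention
        for k, pos in enumerate(entity_spans):
            inst[pos] = "drug1" if k == i else "drug2" if k == j else "drug0"
        inst = [t for t in inst if t]            # drop tokens emptied by cleaning
        inst = (inst + ["0"] * max_len)[:max_len]  # (5): pad to the fixed length
        instances.append(inst)
    return instances

sent = "Pantoprazole had a weaker effect on clopidogrel than omeprazole .".split()
insts = make_instances(sent, entity_spans=[0, 6, 8])
print(len(insts))  # 3 drug entities -> 3 pairs
```

With three drug entities the sentence yields 3·2/2 = 3 instances, each padded to the maximum length.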
(II) obtaining sentence preliminary coding by using improved BioBERT model
The improved BioBERT is adopted as the word-vector encoding so that the word vectors generalize better. As shown in FIG. 2, the BioBERT model, like BERT, consists of 12 layers of Transformer blocks, and the output of each Transformer layer is fed to the next layer. In the improved model, the output vectors of the last four Transformer layers are averaged, and this average replaces BioBERT's original output. For a preprocessed sentence X = {x1, x2, ..., xm} (m is the sentence length), encoding with the modified BioBERT yields the sentence's vector representation V = BioBERT(X);
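The layer-averaging modification can be sketched as follows, assuming the per-layer hidden states are already available (for instance from a BERT implementation that exposes all hidden states); the toy values below are only to make the averaging visible:

```python
import numpy as np

def improved_biobert_output(layer_outputs):
    """Average the outputs of the last four Transformer layers, replacing the
    usual top-layer output, per the improved BioBERT described above.
    `layer_outputs`: list of 12 arrays, each (sentence_len, hidden_dim)."""
    assert len(layer_outputs) == 12
    return np.mean(np.stack(layer_outputs[-4:]), axis=0)

# toy example: 12 layers, a 5-word sentence, BioBERT hidden size 768
layers = [np.full((5, 768), float(i)) for i in range(12)]
V = improved_biobert_output(layers)
print(V.shape, V[0, 0])  # (5, 768) 9.5 -> mean of layers 8, 9, 10, 11
```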
(III) Obtaining the semantic representation of the sentence with a bidirectional gated recurrent unit
To incorporate context information into the sentence encoding, a Bi-GRU is used to further encode the sentence. For each word vi in V, forward and backward GRU encoding yield its forward and backward representations, which are concatenated to give each word's final representation hi, with hi of dimension 2·dh, where dh is the dimension of the GRU unit output; the sentence encoding vector is then H = {h1, h2, ..., hm};
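A toy, pure-NumPy sketch of the bidirectional GRU encoding (untrained random weights, biases omitted for brevity; the small dimensions are illustrative rather than the paper's 768):

```python
import numpy as np

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU state update (biases omitted)."""
    z = 1 / (1 + np.exp(-(x @ Wz + h @ Uz)))   # update gate
    r = 1 / (1 + np.exp(-(x @ Wr + h @ Ur)))   # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_tilde

def bi_gru(V, d_h, rng):
    """Encode V (m x d_in) forward and backward, concatenate per word."""
    d_in = V.shape[1]
    def run(seq):
        params = [rng.standard_normal(s) * 0.1
                  for s in [(d_in, d_h), (d_h, d_h)] * 3]  # Wz,Uz,Wr,Ur,Wh,Uh
        h, out = np.zeros(d_h), []
        for x in seq:
            h = gru_step(x, h, *params)
            out.append(h)
        return out
    fwd = run(V)
    bwd = run(V[::-1])[::-1]
    # each h_i concatenates the forward and backward states: dim 2*d_h
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 8))   # toy sentence: m=5 words, input dim 8
H = bi_gru(V, d_h=4, rng=rng)
print(H.shape)  # (5, 8)
```

In the paper's setting the GRU output dimension equals the BioBERT dimension (768), so each hi has dimension 1536, matching Table 4.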
(IV) enhancing the weight of an entity in a sentence by using a plurality of entity attention mechanisms
The sentence encoding vector H is processed through three different entity attention mechanisms to enhance the model's understanding of drug entities; all three use the same underlying attention model but are fed different drug-entity information, so that the neural network exploits drug-entity information from different angles;
these three attention mechanisms are described separately below;
(4.1) drug description document attention
Wikipedia and DrugBank are selected as the sources of drug-entity description documents. For the set E = {e1, e2, ..., ek} of all drug entities in the corpus, where k is the total number of drug entities, the drug description documents are converted by a Doc2Vec model into the set of document vectors K = Doc2Vec(E), where de is the length of each document vector;
(4.2) attention of drug entities
The drug-entity word vectors are fed to the attention mechanism as features; the drug-entity information consists of the vectors he1 and he2 in the sentence encoding H that correspond to the two drug entities in the candidate relationship;
(4.3) attention between drug entities
The difference between the two drug entities serves as mutual information between the two drugs and is fed to the attention mechanism; the inter-drug information is the difference of the two entity vectors, i.e. he12 = he1 - he2;
Each of the three kinds of entity information is fed, together with the sentence encoding vector H, into the attention mechanism to obtain an entity-information-weighted sentence representation; the attention mechanism is shown in equations (1)-(3):
M = tanh([HWs, RWp] + b)   (1)
α = softmax(M)   (2)
r = Hα^T   (3)
where R is the sequence obtained by expanding the entity feature to the same length as the sentence; Ws and Wp are the parameter matrices of the attention mechanism, with da the matrix dimension; b is a bias; the output r of the attention mechanism has the same dimension as the GRU unit output;
With the above attention mechanism, entity-weighted sentence vector representations based on the three features are obtained, as shown in equations (4)-(8):
rk1 = attention(H, k1)   (4)
rk2 = attention(H, k2)   (5)
re1=attention(H,he1) (6)
re2=attention(H,he2) (7)
re12=attention(H,he12) (8)
where k1 and k2 are the two drug-description document vectors from the set K, rk1 and rk2 are the attention results obtained from the two drug-entity document description vectors, re1 and re2 are the attention results obtained from the two drug entities, and re12 is the attention result obtained from the difference of the two entities. These attention results are concatenated with the last element hm of the sentence encoding vector H to obtain the final sentence representation vector O, as shown in equation (9):

O = [rk1; rk2; re1; re2; re12; hm]   (9)
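Equations (1)-(9) can be sketched as below. The patent's equation images omit the exact tensor shapes, so collapsing M to one weight per word by averaging over the attention dimension before the softmax is an assumption of this sketch, as are the toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def entity_attention(H, feat, Ws, Wp, b):
    """Equations (1)-(3): weight sentence encoding H (m x d) by an entity
    feature vector `feat`, expanded to the sentence length."""
    m = H.shape[0]
    R = np.tile(feat, (m, 1))                                  # expand to length m
    M = np.tanh(np.concatenate([H @ Ws, R @ Wp], axis=1) + b)  # (1)
    alpha = softmax(M.mean(axis=1))                            # (2), one weight/word
    return H.T @ alpha                                         # (3): r = H alpha^T

rng = np.random.default_rng(1)
m, d, d_a = 6, 8, 4                      # toy sizes: sentence 6, hidden 8
H = rng.standard_normal((m, d))
h_e1, h_e2 = H[1], H[4]                  # the two target drug-entity vectors
Ws, Wp = rng.standard_normal((d, d_a)), rng.standard_normal((d, d_a))
b = np.zeros(2 * d_a)

# equations (6)-(8), then (9): concatenate with the last sentence element
r_e1 = entity_attention(H, h_e1, Ws, Wp, b)
r_e2 = entity_attention(H, h_e2, Ws, Wp, b)
r_e12 = entity_attention(H, h_e1 - h_e2, Ws, Wp, b)
O = np.concatenate([r_e1, r_e2, r_e12, H[-1]])
print(O.shape)  # (32,)
```

The document-attention results rk1 and rk2 of equations (4)-(5) would be produced the same way, with the Doc2Vec vectors k1 and k2 as the feature input.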
(V) obtaining the final medicine relation classification by utilizing a Softmax classifier
After the entity-information-weighted sentence representation is obtained, the dimension of the sentence representation vector is compressed by a feed-forward neural network layer, and the result is finally fed to a Softmax layer to obtain the final classification result;
the model output layer sends the output O of the multi-entity attention layer as the final classification feature to the full-connection layer for classification, and the probability P (y ═ C) of y belonging to the DDI type of the C (C ∈ C) of the candidate drug-drug relation is shown as the formula (10):
P(y) = Softmax(OWO + b)   (10)
where WO and b are a weight matrix and a bias, the activation function of the fully connected layer is Softmax, and C is the set of DDI type labels. Finally, equation (11) selects the category label ŷ with the highest probability, i.e. the relationship type of the candidate drug-drug pair:

ŷ = argmax P(y = c), c ∈ C   (11)
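A sketch of equations (10)-(11); the weights here are random placeholders standing in for the trained fully connected layer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(O, W_O, b, labels):
    """Equations (10)-(11): project the sentence vector O to DDI-type
    probabilities and pick the most probable label."""
    P = softmax(O @ W_O + b)                 # (10)
    return labels[int(np.argmax(P))], P      # (11): y_hat = argmax_c P(y=c)

labels = ["negative", "effect", "mechanism", "advice", "int"]
rng = np.random.default_rng(2)
O = rng.standard_normal(32)                  # placeholder attention-layer output
W_O, b = rng.standard_normal((32, 5)), np.zeros(5)
y_hat, P = classify(O, W_O, b, labels)
print(y_hat in labels, round(float(P.sum()), 6))  # True 1.0
```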
Beneficial effects of the invention: Table 1 compares this extraction method with other DDI extraction methods; all methods are evaluated on the DDIExtraction2013 corpus. The F1 score of the invention is 80.9%, a 5.4% improvement over the previous best result. The system also achieves the highest precision and recall, raised to 81.0% and 80.9% respectively.
TABLE 1 comparison of the Effect of the invention with other DDI extraction methods
(Table 1 appears as an image in the original publication.)
Drawings
FIG. 1 is a diagram of a neural network model architecture employed by the present invention.
FIG. 2 is a schematic representation of the improvement of the BioBERT model according to the present invention.
Detailed Description
The following describes in detail embodiments of the present invention in conjunction with the constructed neural network model of the present invention.
The overall model structure of the invention is shown in FIG. 1. The DDI corpus to be processed is preprocessed, the descriptions of the drugs mentioned in the text are retrieved from DrugBank and Wikipedia, and these descriptions are converted into vectors with a Doc2Vec tool. For sentences in the DDI corpus, the invention obtains vector representations through the modified BioBERT model and a bidirectional GRU network. Finally, the discrimination result is obtained through a feed-forward neural network and a softmax layer. The specific implementation flow is described below.
First, preprocessing the corpus
The pretreatment work comprises the following steps:
(1) removing punctuation marks and non-English characters in the corpus, and separating each word by a blank;
(2) uniformly converting the text into lower case characters;
(3) uniformly replacing related numbers in the corpus into num;
(4) when a sentence in the corpus contains several drug entities, all drug entities are combined pairwise; if the sentence contains n drug entities, a total of n(n-1)/2 instances are generated. In addition, the invention replaces the two drug entities of the candidate relationship in each instance with "drug1" and "drug2", and the other drugs in the instance with "drug0".
(5) The maximum sentence length the model can handle is set; if a sentence in an instance does not reach the maximum length, it is padded with the character "0".
Second, encoding the sentences
The sentence coding is divided into the following two steps:
(1) preliminary coding of sentences by modified BioBERT
The present invention encodes each word in a sentence as a word vector through the modified BioBERT. For a preprocessed sentence X = {x1, x2, ..., xn} (n is the sentence length), this yields the sentence's vector representation V = BioBERT(X). BioBERT was pre-trained on two biomedical databases, PMC and PubMed.
(2) Context semantic coding of sentences by Bi-GRU
For each word vi in V, the invention obtains its forward and backward representations by forward and backward GRU encoding, then concatenates the two to obtain each word's final representation hi of dimension 2·dh, where dh is the dimension of the GRU unit output. The sentence encoding is then H = {h1, h2, ..., hn}. The output dimension of the GRU units equals the output dimension of the BioBERT model.
Third, encoding the drug description documents
The invention uses the browser automation framework Selenium as a crawler to dynamically crawl the abstract of each entity from Wikipedia and DrugBank. During crawling, not every entity has a definite corresponding abstract: for example, "anticonvulsant drugs" is not a specific drug but the general name of a class of drugs, so no entry can be found for that entity; in such cases the class name of the entity is used in place of the whole entity, i.e. the abstract retrieved with "anticonvulsant drugs" as the keyword serves as the abstract of the whole entity. If a small number of entities still have no corresponding abstract after this processing, the entity name itself is used as its abstract.
For the set E = {e1, e2, ..., ek} of all drug entities in the corpus, the drug description documents of the corpus are converted by a Doc2Vec model into the set of document vectors K = Doc2Vec(E), where de is the length of each document vector.
Fourth, multiple entity attention mechanisms
The three kinds of entity information used by the method are drug description information, drug-entity information, and inter-drug information. The drug description information is the set K of drug-description document vectors; the drug-entity information consists of the vectors he1 and he2 corresponding to the two related drug entities in the sentence encoding H; the inter-drug information is the difference of the two entity vectors, i.e. he12 = he1 - he2. The dimensions of all three kinds of entity information equal the output dimension of the GRU unit.
The three kinds of entity information are each fed, together with the sentence representation H, into the attention mechanism (equations (1)-(3)) to obtain entity-information-weighted sentence representations, as shown in equations (4)-(8), where rk1 and rk2 are the attention results obtained from the two drug-entity document description vectors, re1 and re2 the attention results obtained from the two drug entities, and re12 the attention result obtained from their difference. These attention results are concatenated with the last element hn of the sentence vector sequence H to obtain the final sentence representation vector O, as shown in equation (9). The output of the attention mechanism has the same dimension as the output of the GRU unit.
Five, output
The model output layer feeds the output O of the multi-entity attention layer, as the final classification feature, to a fully connected layer for classification; the probability P(y = c) that a candidate drug-drug relation belongs to DDI type c (c ∈ C) is given by equation (10).
where WO and b are a weight matrix and a bias, the activation function of the fully connected layer is Softmax, and C = {negative, effect, mechanism, advice, int} is the set of DDI type labels. Finally, equation (11) selects the category label ŷ with the highest probability, i.e. the relationship type of the candidate drug-drug pair.
After the model is built through the five steps above, the invention trains it and tests its performance on the DDIExtraction2013 corpus, with a 9:1 split between training and test sets. Table 2 summarizes the DDIExtraction2013 corpus: it consists of 792 texts from the DrugBank database and 233 abstracts from the MedLine database, with 5 drug relationship types in total: Negative, Effect, Mechanism, Advice and Int.
TABLE 2 number of relationships in DDIExtraction2013 corpus
Type        DDI-DrugBank   DDI-MedLine   Total
Effect      1855 (39.4%)   214 (65.4%)   2069 (41.1%)
Mechanism   1539 (32.7%)   86 (26.3%)    1625 (32.3%)
Advice      1035 (22.0%)   15 (4.6%)     1050 (20.9%)
Int         272 (5.8%)     12 (3.7%)     284 (5.6%)
Total       4701           327           5028
The invention generates additional instances by pairing the drugs in the corpus pairwise. In the training instances thus obtained, however, the number of Negative instances is extremely large, and this class imbalance can greatly hurt model performance. To address the imbalance among the drug-relationship instances in the corpus, the invention removes negative instances according to the following three rules:
1. if two drugs in a drug pair appear in the same relationship, the corresponding instance is filtered out.
2. If two drugs in a drug pair have the same name, or one is an abbreviation for the other, the corresponding instance is filtered out.
3. If one drug in a drug pair is a special case of the other drug, the corresponding instance is filtered out.
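As a sketch, the three filtering rules above might be implemented as follows; rule 1's condition is supplied as a precomputed flag, and the abbreviation and special-case checks are crude illustrative stand-ins for whatever matching the authors actually used:

```python
def filter_negative(pairs):
    """Rule-based negative-instance filtering sketched from rules 1-3 above.
    Each pair is (drug1, drug2, rule1_match), where rule1_match is assumed
    to be detected upstream."""
    def initials(s):
        return "".join(w[0] for w in s.split())

    kept = []
    for d1, d2, rule1_match in pairs:
        a, b = d1.lower(), d2.lower()
        if rule1_match:                            # rule 1
            continue
        if a == b:                                 # rule 2: same name
            continue
        if a == initials(b) or b == initials(a):   # rule 2: crude initialism check
            continue
        if a in b or b in a:                       # rule 3: one is a special case
            continue
        kept.append((d1, d2))
    return kept

pairs = [
    ("aspirin", "aspirin", False),               # dropped: rule 2 (same name)
    ("human growth hormone", "hgh", False),      # dropped: rule 2 (abbreviation)
    ("insulin", "insulin glargine", False),      # dropped: rule 3 (special case)
    ("ibuprofen", "naproxen", True),             # dropped: rule 1
    ("warfarin", "aspirin", False),              # kept
]
print(filter_negative(pairs))  # [('warfarin', 'aspirin')]
```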
The corpus instance statistics after deleting negative instances are shown in Table 3. This rule-based negative-instance removal alleviates the imbalance among instances to some extent.
TABLE 3 data set by example Generation and negative deletion
(Table 3 appears as an image in the original publication.)
The evaluation metric adopted by the invention is the F1 score, as shown in equation (12):

F1 = 2PR / (P + R)   (12)

where P denotes precision and R denotes recall, computed by equations (13)-(14):

P = TP / (TP + FP)   (13)

R = TP / (TP + FN)   (14)
where TP represents the number of predicted positive and actual positive instances, FP represents the number of predicted positive and actual negative instances, FN represents the number of predicted negative and actual positive instances, and TN represents the number of predicted negative and actual negative instances.
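A minimal sketch of computing these metrics, micro-averaged over the four positive DDI types with Negative treated as the null class (this aggregation choice is an assumption about how equations (12)-(14) are applied to the 5-class task):

```python
def prf1(y_true, y_pred, positive_types):
    """Micro-averaged precision, recall and F1 per equations (12)-(14)."""
    tp = sum(t == p and t in positive_types for t, p in zip(y_true, y_pred))
    fp = sum(p in positive_types and t != p for t, p in zip(y_true, y_pred))
    fn = sum(t in positive_types and t != p for t, p in zip(y_true, y_pred))
    P = tp / (tp + fp) if tp + fp else 0.0    # (13)
    R = tp / (tp + fn) if tp + fn else 0.0    # (14)
    F1 = 2 * P * R / (P + R) if P + R else 0.0  # (12)
    return P, R, F1

pos = {"effect", "mechanism", "advice", "int"}
y_true = ["effect", "negative", "advice", "int", "mechanism"]
y_pred = ["effect", "advice", "advice", "negative", "mechanism"]
print(prf1(y_true, y_pred, pos))  # tp=3, fp=1, fn=1 -> (0.75, 0.75, 0.75)
```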
The invention implements the model with the Keras library on a TensorFlow backend. The model parameter settings are shown in Table 4.
TABLE 4 parameter set of inventive model
Parameter                                Value
Doc2Vec vector dimension                 200
BioBERT vector dimension                 768
BiGRU layer output dimension             1536
Maximum sentence length                  250
Attention layer output dimension         1536
Multilayer perceptron output dimension   256
In the training phase, the invention uses early stopping: if model performance on the validation set does not improve for 10 consecutive epochs, training stops, and the model that performed best on the validation set is selected as the final model to predict the test set. All hyperparameters are tuned on the validation set via grid search. The learning rate during training is set to 0.001, and the model processes 128 instances per batch.
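The early-stopping procedure described above can be sketched framework-independently; `train_epoch` and `val_score` are hypothetical callables standing in for one Keras training epoch (learning rate 0.001, batch size 128) and a validation-set evaluation:

```python
def train_with_early_stopping(train_epoch, val_score, patience=10, max_epochs=200):
    """Stop when the validation score has not improved for `patience`
    consecutive epochs; keep and return the best-scoring model state."""
    best_state, best_score, stale = None, float("-inf"), 0
    for epoch in range(max_epochs):
        state = train_epoch(epoch)
        score = val_score(state)
        if score > best_score:
            best_state, best_score, stale = state, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_score

# toy run: validation F1 rises then plateaus, so training stops early
scores = [0.5, 0.6, 0.7, 0.7] + [0.65] * 50
best, f1 = train_with_early_stopping(lambda e: e, lambda s: scores[s], patience=10)
print(best, f1)  # 2 0.7
```

In Keras the same behavior is usually obtained with the `EarlyStopping` callback; the explicit loop here only makes the stopping rule visible.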

Claims (1)

1. A method for extracting relationships among medicines based on attention of various entities and an improved pre-training language model is characterized by comprising the following steps:
(I) Text preprocessing
Preprocessing the corpus: (1) first convert all text to lower case, then remove punctuation marks and non-English characters; (2) replace all numbers in the text with the word "num"; (3) a sentence may contain several drug entities, and one instance is generated for each pair of drug entities, giving a total of n(n-1)/2 instances, where n is the number of drug entities in the sentence; (4) replace the two target entities in each instance with "drug1" and "drug2", and replace the non-target entities with "drug0"; (5) set the maximum sentence length the model can handle, and if a sentence in an instance does not reach the maximum length, pad it with the character "0";
(II) obtaining sentence preliminary coding by using improved BioBERT model
The improved BioBERT is adopted as the word-vector encoding. The BioBERT model, like BERT, consists of 12 layers of Transformer blocks, and the output of each Transformer layer is fed to the next layer; in the improved BioBERT model, the output vectors of the last four Transformer layers are averaged, and this average replaces BioBERT's original output; for a preprocessed sentence X = {x1, x2, ..., xm}, where m is the sentence length, encoding with the modified BioBERT yields the sentence's vector representation V = BioBERT(X);
(III) Obtaining the semantic representation of the sentence with a bidirectional gated recurrent unit
To incorporate context information into the sentence encoding, a Bi-GRU is used to further encode the sentence; for each word vi in V, forward and backward GRU encoding yield its forward and backward representations, which are concatenated to give each word's final representation hi of dimension 2·dh, where dh is the dimension of the GRU unit output; the sentence encoding vector is then H = {h1, h2, ..., hm};
(IV) enhancing the weight of an entity in a sentence by using a plurality of entity attention mechanisms
The sentence encoding vector H is processed through three different entity attention mechanisms to enhance the model's understanding of drug entities;
(4.1) drug description document attention
Wikipedia and DrugBank are selected as the sources of drug-entity description documents; for the set E = {e1, e2, ..., ek} of all drug entities in the corpus, where k is the total number of drug entities, the drug description documents are converted by a Doc2Vec model into the set of document vectors K = Doc2Vec(E), where de is the length of each document vector;
(4.2) drug entity attention
The drug entity word vectors are sent to the attention mechanism as features; the drug entity information consists of the vectors he1 and he2 corresponding to the two related drug entities in the sentence encoding vector H;
(4.3) attention between drug entities
The difference between the two drug entity vectors is used as mutual information between the two drugs and sent to the attention mechanism; the inter-drug information is thus he12 = he1 − he2;
The three kinds of entity information are each sent, together with the sentence encoding vector H, into the attention mechanism to obtain entity-information-weighted sentence representations; the attention mechanism is shown in equations (1)-(3):
M = tanh([HWs, RWp] + b)    (1)
α = softmax(M)    (2)
r = Hα^T    (3)
wherein R is the sequence obtained by expanding the given feature to the length of the sentence; Ws and Wp are parameter matrices of the attention mechanism, with da the matrix dimension; b is a bias; the output of the attention mechanism is the weighted sentence representation r;
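Equations (1)-(3) can be sketched as follows; the score vector w that collapses the matrix M to one weight per token is an assumption, since the claim does not spell out how M is reduced before the softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, feat, Ws, Wp, w, b):
    """Eqs (1)-(3): the feature vector is expanded (tiled) to the
    sentence length, scored jointly with H, and the resulting weights
    alpha are used to pool H into a single vector r."""
    m = H.shape[0]
    R = np.tile(feat, (m, 1))                                  # expand feature to sentence length
    M = np.tanh(np.concatenate([H @ Ws, R @ Wp], axis=1) + b)  # eq (1)
    alpha = softmax(M @ w)                                     # eq (2), via assumed score vector w
    return H.T @ alpha                                         # eq (3): r = H alpha^T

# toy dimensions: m tokens, d = 2*dh encoding size, de feature size
rng = np.random.default_rng(2)
m, d, de, da = 5, 8, 6, 4
H = rng.standard_normal((m, d))        # sentence encoding from the Bi-GRU
feat = rng.standard_normal(de)         # one entity feature (e.g. a document vector)
Ws = rng.standard_normal((d, da))
Wp = rng.standard_normal((de, da))
w = rng.standard_normal(2 * da)
b = rng.standard_normal(2 * da)
r = attention(H, feat, Ws, Wp, w, b)   # entity-weighted sentence vector
```

The same function would be called once per feature (k1, k2, he1, he2, he12) to produce the five attention results used later.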
Using the above attention mechanism, entity-weighted sentence vector representations based on the three features are obtained, as shown in equations (4)-(8):
rk1 = attention(H, k1)    (4)
rk2 = attention(H, k2)    (5)
re1 = attention(H, he1)    (6)
re2 = attention(H, he2)    (7)
re12 = attention(H, he12)    (8)
wherein k1 and k2 are the two drug description document vectors taken from the set K of drug description document vectors; rk1 and rk2 are the attention results obtained from the two drug entity document description vectors; re1 and re2 are the attention results obtained from the two drug entities; re12 is the attention result obtained from the difference of the two drug entities; these attention results are concatenated with the last element hm of the sentence encoding vector H to obtain the final sentence representation vector O, as shown in equation (9):
O = [rk1; rk2; re1; re2; re12; hm]    (9)
(V) obtaining the final drug relation classification by using a Softmax classifier
After the entity-information-weighted sentence representation is obtained, the dimension of the sentence representation vector is compressed by a feedforward neural network layer and finally sent to a Softmax layer to obtain the final classification result;
the model output layer sends the output O of the multi-entity attention layer, as the final classification feature, to the fully connected layer for classification; the probability P(y = c) that the candidate drug-drug relation y belongs to DDI type c (c ∈ C) is given by equation (10):
P(y = c) = softmax(OWO + b)    (10)
wherein WO and b are the weight matrix and bias, the activation function of the fully connected layer is Softmax, and C is the set of DDI type labels; finally, the category label with the highest probability is computed using equation (11), and this label is the relation type of the candidate drug-drug pair:
ŷ = argmax(c∈C) P(y = c)    (11)
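A toy sketch of equations (10)-(11); the DDIExtraction-style label set and the zero/constant parameters are illustrative assumptions chosen so the predicted label is easy to verify by hand:

```python
import numpy as np

def classify(O, WO, b, labels):
    """Eq (10): fully connected layer followed by a softmax over DDI
    types; eq (11): return the highest-probability label."""
    logits = O @ WO + b
    e = np.exp(logits - logits.max())   # numerically stable softmax
    P = e / e.sum()
    return labels[int(np.argmax(P))], P

# assumed DDI label set for illustration (DDIExtraction 2013 style)
labels = ["false", "advise", "effect", "mechanism", "int"]
O = np.zeros(10)                          # entity-weighted sentence vector
WO = np.zeros((10, len(labels)))          # output-layer weight matrix
b = np.array([0.0, 2.0, 0.0, 0.0, 0.0])   # bias steers this toy example
pred, P = classify(O, WO, b, labels)      # pred is the argmax label
```

With these constants the logits equal the bias, so the softmax puts the largest probability on the second label.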
CN201911330114.1A 2019-12-20 2019-12-20 Method for extracting relationship between medicines based on various attentions and improved pre-training Active CN111078889B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911330114.1A CN111078889B (en) 2019-12-20 2019-12-20 Method for extracting relationship between medicines based on various attentions and improved pre-training

Publications (2)

Publication Number Publication Date
CN111078889A true CN111078889A (en) 2020-04-28
CN111078889B CN111078889B (en) 2021-01-05

Family

ID=70316460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911330114.1A Active CN111078889B (en) 2019-12-20 2019-12-20 Method for extracting relationship between medicines based on various attentions and improved pre-training

Country Status (1)

Country Link
CN (1) CN111078889B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160275250A1 (en) * 2015-03-17 2016-09-22 Biopolicy Innovations Inc. Drug formulary document parsing and comparison system and method
CN108733792A (en) * 2018-05-14 2018-11-02 北京大学深圳研究生院 A kind of entity relation extraction method
CN109902171A (en) * 2019-01-30 2019-06-18 中国地质大学(武汉) Text Relation extraction method and system based on layering knowledge mapping attention model
CN110580340A (en) * 2019-08-29 2019-12-17 桂林电子科技大学 neural network relation extraction method based on multi-attention machine system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LING LUO et al.: "An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition", Data and Text Mining *
LI Lishuang et al.: "Drug relation extraction incorporating dependency information and an attention mechanism", Journal of Chinese Information Processing *
JIANG Zhenchao: "Biomedical relation extraction based on word representation and deep learning", Wanfang Data *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798954A (en) * 2020-06-11 2020-10-20 西北工业大学 Drug combination recommendation method based on time attention mechanism and graph convolution network
CN111814460A (en) * 2020-07-06 2020-10-23 四川大学 External knowledge-based drug interaction relation extraction method and system
CN111949792A (en) * 2020-08-13 2020-11-17 电子科技大学 Medicine relation extraction method based on deep learning
CN111949792B (en) * 2020-08-13 2022-05-31 电子科技大学 Medicine relation extraction method based on deep learning
CN112256939A (en) * 2020-09-17 2021-01-22 青岛科技大学 Text entity relation extraction method for chemical field
CN112256939B (en) * 2020-09-17 2022-09-16 青岛科技大学 Text entity relation extraction method for chemical field
CN112667808A (en) * 2020-12-23 2021-04-16 沈阳新松机器人自动化股份有限公司 BERT model-based relationship extraction method and system
CN112820375A (en) * 2021-02-04 2021-05-18 闽江学院 Traditional Chinese medicine recommendation method based on multi-graph convolution neural network
CN112528621A (en) * 2021-02-10 2021-03-19 腾讯科技(深圳)有限公司 Text processing method, text processing model training device and storage medium
CN112860816A (en) * 2021-03-01 2021-05-28 三维通信股份有限公司 Construction method and detection method of interaction relation detection model of drug entity pair
CN113241128A (en) * 2021-04-29 2021-08-10 天津大学 Molecular property prediction method based on molecular space position coding attention neural network model
CN113241128B (en) * 2021-04-29 2022-05-13 天津大学 Molecular property prediction method based on molecular space position coding attention neural network model
CN113642319A (en) * 2021-07-29 2021-11-12 北京百度网讯科技有限公司 Text processing method and device, electronic equipment and storage medium
CN113806531A (en) * 2021-08-26 2021-12-17 西北大学 Drug relationship classification model construction method, drug relationship classification method and system
CN113806531B (en) * 2021-08-26 2024-02-27 西北大学 Drug relationship classification model construction method, drug relationship classification method and system
CN114048727B (en) * 2021-11-22 2022-07-29 北京富通东方科技有限公司 Medical field-oriented relationship extraction method
CN114048727A (en) * 2021-11-22 2022-02-15 北京富通东方科技有限公司 Medical field-oriented relation extraction method
CN114925678A (en) * 2022-04-21 2022-08-19 电子科技大学 Drug entity and relationship combined extraction method based on high-level interaction mechanism
CN114925678B (en) * 2022-04-21 2023-05-26 电子科技大学 Pharmaceutical entity and relationship joint extraction method based on high-level interaction mechanism
CN117408247A (en) * 2023-12-15 2024-01-16 南京邮电大学 Intelligent manufacturing triplet extraction method based on relational pointer network
CN117408247B (en) * 2023-12-15 2024-03-29 南京邮电大学 Intelligent manufacturing triplet extraction method based on relational pointer network

Also Published As

Publication number Publication date
CN111078889B (en) 2021-01-05

Similar Documents

Publication Publication Date Title
CN111078889B (en) Method for extracting relationship between medicines based on various attentions and improved pre-training
CN112199511B (en) Cross-language multi-source vertical domain knowledge graph construction method
CN110298037B (en) Convolutional neural network matching text recognition method based on enhanced attention mechanism
CN111159223B (en) Interactive code searching method and device based on structured embedding
CN109446338B (en) Neural network-based drug disease relation classification method
US20220147836A1 (en) Method and device for text-enhanced knowledge graph joint representation learning
CN110825721B (en) Method for constructing and integrating hypertension knowledge base and system in big data environment
CN111950285B (en) Medical knowledge graph intelligent automatic construction system and method with multi-mode data fusion
CN110532328B (en) Text concept graph construction method
CN106844658A (en) A kind of Chinese text knowledge mapping method for auto constructing and system
CN110287323B (en) Target-oriented emotion classification method
CN105512209A (en) Biomedicine event trigger word identification method based on characteristic automatic learning
CN111625659A (en) Knowledge graph processing method, device, server and storage medium
CN110189831A (en) A kind of case history knowledge mapping construction method and system based on dynamic diagram sequences
CN113806563A (en) Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material
Gao et al. Detecting disaster-related tweets via multimodal adversarial neural network
CN115019906B (en) Drug entity and interaction combined extraction method for multi-task sequence labeling
CN115269865A (en) Knowledge graph construction method for auxiliary diagnosis
WO2020074017A1 (en) Deep learning-based method and device for screening for keywords in medical document
CN113988075A (en) Network security field text data entity relation extraction method based on multi-task learning
CN113742493A (en) Method and device for constructing pathological knowledge map
CN113707339A (en) Method and system for concept alignment and content inter-translation among multi-source heterogeneous databases
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
Frisoni et al. Unsupervised Descriptive Text Mining for Knowledge Graph Learning.
Tianxiong et al. Identifying chinese event factuality with convolutional neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant