CN111949792B

CN111949792B - Medicine relation extraction method based on deep learning

Info

Publication number: CN111949792B
Application number: CN202010811218.0A
Authority: CN
Inventors: 刘勇国; 何家欢; 杨尚明; 李巧勤
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2022-05-31
Anticipated expiration: 2040-08-13
Also published as: CN111949792A

Abstract

The invention discloses a deep learning-based drug relation extraction method. The RDKit tool converts the molecular formula of each drug into a molecular graph structure, from which the drug molecule features are derived; the text features of the sample are extracted at the same time. The drug molecule features and the text features are combined, and the drug relations are then classified by a fully connected layer followed by softmax. Because the physicochemical properties of the drugs in a sentence are taken into account, extraction accuracy can be improved, and the problems that existing methods struggle to cover all text scenarios and depend excessively on external natural language processing tools are addressed.

Description

Medicine relation extraction method based on deep learning

Technical Field

The invention relates to the field of extraction of pharmaceutical chemical entity relations, in particular to a deep learning-based extraction method of a pharmaceutical relation.

Background

Pharmaceutical chemical entity relation extraction refers to automatically extracting the relations between drug entities from text. It can assist pharmaceutical researchers in developing new drugs, help doctors devise reasonable treatment plans for patients, and serves as the basis for constructing pharmaceutical chemical knowledge bases. Existing methods for extracting interactions between drug entities fall into two main types: rule-based methods and supervised machine learning methods.

Early research mostly employed rule-based approaches because early drug relation extraction lacked an authoritative annotated corpus. These approaches assume that the sentence structures used to express an interaction are fixed and limited, that is, most sentences describing an interaction share the same or a similar structure. The method parses each sentence, detects its syntactic structure, extracts the interacting drug pairs from short clauses according to description rules formulated by pharmacists, and classifies the relation of each drug pair.

Since the DDIExtraction 2011 and DDIExtraction 2013 evaluations, supervised machine learning methods have been used to extract interactions between drug entities. The most important of these are feature-based methods, which treat relation extraction as a classification problem: each candidate relation instance is explicitly represented as a feature vector with various types of features and is then classified by a supervised machine learning model.

The rule-based method works well only for simple short sentences, because it is difficult to formulate suitable rules for complex long sentences. However, sentences in the pharmaceutical literature are typically complex long sentences: many descriptive sentences contain more than two drugs, together with a large number of appositives, coordinate structures and other complex constructions. The rule-based approach is therefore less accurate on the large volumes of data now available. Formulating the rules is time-consuming and labor-intensive and requires the participation of domain professionals; furthermore, manually written rules can hardly cover all text scenarios. Supervised machine learning methods offer better performance and portability, but they depend on external natural language processing tools, and errors made by those tools propagate and degrade performance.

Disclosure of Invention

Aiming at the defects in the prior art, the medicine relation extraction method based on deep learning provided by the invention solves the problems that the existing method is difficult to cover all text scenes and excessively depends on an external natural language processing tool.

In order to achieve the purpose of the invention, the invention adopts the technical scheme that:

the deep learning-based medicine relation extraction method comprises the following steps:

S1, acquiring documents related to drugs, splitting their text content into sentences, and taking each sentence as an initial sample;

S2, retaining the initial samples that contain two or more drug names, and labeling the retained samples to obtain labeled samples;

S3, adding to each word a position attribute relative to each drug, according to the positional relation between the word and the drugs in the labeled sample, to obtain a position feature vector for each word;

S4, obtaining the SMILES expressions of all drug molecules, converting them into graph structures, and obtaining the drug molecule feature vector of each drug from its graph structure;

S5, representing the words in the text as vectors and replacing each word with its vector, thereby vectorizing each sentence;

S6, inputting the vectorized sentence into a deep learning network to obtain the text feature vector corresponding to the sentence;

S7, concatenating the text feature vector and the drug molecule feature vectors corresponding to each sentence to obtain the overall feature vector of the sentence;

S8, inputting the overall feature vector of each sentence into a fully connected layer to obtain a nonlinearly represented vector;

S9, classifying the nonlinearly represented vector with a softmax function to obtain the probability of each class, and taking the class with the highest probability as the identified drug-pair relation, completing the drug relation extraction.
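Steps S1 and S2 can be illustrated with a minimal Python sketch. The text does not say how sentences are split or how drug names are recognized, so the punctuation-based splitter, the `DRUG_LEXICON` set and the choice to count distinct drug names are all assumptions made for illustration only.

```python
import re

# Hypothetical drug lexicon; the source does not specify how drug names are recognized.
DRUG_LEXICON = {"aspirin", "warfarin", "ibuprofen"}

def split_sentences(text):
    """S1: split document text into sentences (naive punctuation-based splitter)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def drugs_in(sentence):
    """Return the distinct lexicon drugs mentioned in the sentence, in order of appearance."""
    tokens = re.findall(r"[A-Za-z]+", sentence.lower())
    seen = []
    for tok in tokens:
        if tok in DRUG_LEXICON and tok not in seen:
            seen.append(tok)
    return seen

def candidate_samples(text):
    """S2: keep only sentences that mention two or more distinct drugs (assumed reading)."""
    return [s for s in split_sentences(text) if len(drugs_in(s)) >= 2]

text = ("Aspirin may increase the anticoagulant effect of warfarin. "
        "Ibuprofen is a common analgesic. "
        "Patients should avoid ibuprofen while taking aspirin.")
samples = candidate_samples(text)
```

Here the second sentence is dropped because it mentions only one drug; the two retained sentences would then be labeled by hand or from an annotated corpus.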

Further, in step S2, the specific method of labeling the retained samples to obtain labeled samples is:

according to the DDIExtraction 2013 challenge rules, the labels are divided into 5 classes: advice, action, drug mechanism, positive and irrelevant.

Further, the specific method of step S3 is:

acquiring the positional relation between each word and each drug in the labeled sample, and establishing a vector whose number of elements equals the number of drugs; if the word is m positions before the nth drug, setting the nth element of the vector to m; if the word is m positions after the nth drug, setting the nth element to -m; traversing every drug to obtain the position feature vector of the word, and thereby obtaining the position feature vector of each word.
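The sign convention of step S3 is easy to get backwards, so a small sketch may help. It assumes the token positions of the drugs are already known from labeling, and it assigns 0 to a drug's own position, which the text does not specify.

```python
def position_features(tokens, drug_indices):
    """For each token, return one signed offset per drug.

    Convention from the text: a word m positions before the n-th drug gets +m
    in element n; a word m positions after it gets -m.
    A drug's own position gets 0 (an assumption, not stated in the source).
    """
    return [[d - i for d in drug_indices] for i in range(len(tokens))]

tokens = ["aspirin", "may", "potentiate", "warfarin", "effects"]
drug_indices = [0, 3]  # token positions of the two drugs, assumed known from labeling
features = position_features(tokens, drug_indices)
```

For the word "may" this yields `[-1, 2]`: it lies one position after the first drug and two positions before the second.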

Further, the specific method of step S4 includes the following sub-steps:

S4-1, obtaining the SMILES expressions of all drug molecules from the DrugBank database;

S4-2, converting each drug molecule's SMILES expression into a graph structure with the RDKit tool, taking each atom of the drug as a node and each bond between atoms as an edge;

S4-3, randomly initializing every bond and atom in the graph structure as a vector, and obtaining, according to a formula, the vector representation of the v-th atom and bond after the t-th iteration; where σ(·) is the sigmoid activation function; $H_{t-1}$ is a parameter matrix; the remaining terms of the formula are the vector representation of the v-th atom and bond after the (t-1)-th iteration and the vector representation of the w-th atom and bond after the (t-1)-th iteration; N(v) denotes the set of atoms and bonds adjacent to the v-th atom in the graph structure;

S4-4, obtaining, according to a formula, the drug molecule feature vector of the drug to which the v-th atom belongs, and thereby the drug molecule feature vector of each drug in the graph structure; where softmax(·) is the softmax function and $W^{t}$ is a parameter matrix.
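The two formulas of step S4 are not legible in this text, so the following is only a sketch of one common scheme that matches the listed symbols (a sigmoid update with a per-iteration parameter matrix over an atom and its neighbors, followed by a softmax readout). The exact aggregation, the readout sum, the use of atom vectors only, the untrained random weights and all function names are assumptions, not the patented formulas. The graph is written by hand; in practice it would come from RDKit.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(v):
    mx = max(v)
    exps = [math.exp(x - mx) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def molecule_features(neighbors, dim=4, out_dim=6, iterations=2, seed=0):
    """neighbors[v] lists the atoms bonded to atom v; returns a vector of length out_dim."""
    rng = random.Random(seed)
    # S4-3: random initialization of the atom vectors.
    h = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in neighbors]
    fingerprint = [0.0] * out_dim
    for _ in range(iterations):
        H = random_matrix(dim, dim, rng)      # per-iteration parameter matrix (untrained here)
        W = random_matrix(out_dim, dim, rng)  # per-iteration readout matrix (untrained here)
        new_h = []
        for v, nbrs in enumerate(neighbors):
            # Assumed aggregation: the atom's own vector plus its neighbors' vectors.
            agg = list(h[v])
            for w in nbrs:
                agg = [a + b for a, b in zip(agg, h[w])]
            new_h.append([sigmoid(x) for x in matvec(H, agg)])
        h = new_h
        # Assumed readout (S4-4): accumulate softmax(W h_v) over all atoms.
        for hv in h:
            fingerprint = [f + s for f, s in zip(fingerprint, softmax(matvec(W, hv)))]
    return fingerprint

# Ethanol (SMILES "CCO") as a hand-written adjacency list: C0-C1, C1-O2.
ethanol = [[1], [0, 2], [1]]
vec = molecule_features(ethanol)
```

Because every softmax term sums to 1, the entries of `vec` add up to the number of atoms times the number of iterations, which gives a quick sanity check on the readout.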

Further, the specific method of step S5 is:

training a word2vec model on the text content, representing each word in the text as a vector, and arranging the word vectors in the order in which the words appear to obtain the vector representation of each sentence, thereby vectorizing each sentence.
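The lookup part of step S5 can be sketched as follows. Training word2vec itself is out of scope here, so `build_toy_embeddings` is a hypothetical stand-in that assigns random vectors; mapping unknown words to a zero vector is also an assumption.

```python
import random

def build_toy_embeddings(vocab, dim=5, seed=0):
    """Stand-in for a trained word2vec table: one random vector per word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def vectorize(sentence_tokens, embeddings, dim=5):
    """Replace each token with its vector, keeping word order; unknown words map to zeros."""
    zero = [0.0] * dim
    return [embeddings.get(tok, zero) for tok in sentence_tokens]

tokens = ["aspirin", "may", "potentiate", "warfarin"]
embeddings = build_toy_embeddings(["aspirin", "may", "warfarin"])
matrix = vectorize(tokens, embeddings)
```

The result is one row per word in sentence order, which is the form the sequence model in step S6 consumes.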

Further, in step S6, the deep learning network is a bidirectional long short-term memory (BiLSTM) model, whose expression is:

$i_p = \sigma(W_{xi} x_p + W_{hi} h_{p-1} + b_i)$

$f_p = \sigma(W_{xf} x_p + W_{hf} h_{p-1} + b_f)$

$c_p = f_p c_{p-1} + i_p \tanh(W_{xc} x_p + W_{hc} h_{p-1} + b_c)$

$o_p = \sigma(W_{xo} x_p + W_{ho} h_{p-1} + b_o)$

$h_p = o_p \tanh(c_p)$

wherein $i_p$ represents the output of the input gate; $\sigma(\cdot)$ is the sigmoid activation function; $W_{xi}$ is the parameter matrix between the input and the input gate; $x_p$ is the input of the model; $W_{hi}$ is the parameter matrix between the hidden layer and the input gate; $h_{p-1}$ is the hidden-layer output corresponding to the previous input word in the sentence; $b_i$ is the bias vector of the input gate; $f_p$ represents the output of the forget gate; $W_{xf}$ is the parameter matrix between the input and the forget gate; $W_{hf}$ is the parameter matrix between the hidden layer and the forget gate; $b_f$ is the bias vector of the forget gate; $c_p$ represents the output of the memory cell; $c_{p-1}$ is the memory cell output corresponding to the previous word; $\tanh(\cdot)$ is the tanh activation function; $W_{xc}$ is the parameter matrix between the input and the memory cell; $W_{hc}$ is the parameter matrix between the hidden layer and the memory cell; $b_c$ is the bias vector of the memory cell; $o_p$ represents the output of the output gate; $W_{xo}$ is the parameter matrix between the output gate and the input; $W_{ho}$ is the parameter matrix between the output gate and the hidden layer; $b_o$ is the bias vector of the output gate; $h_p$ represents the output of the bidirectional long short-term memory model.
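The gate equations can be traced in a dependency-free Python sketch. The parameter names mirror the symbols above; products between vectors are taken elementwise. The random untrained weights, the helper names and the choice to concatenate the final hidden state of each direction into one sentence vector are assumptions for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(Wx, x, Wh, h, b):
    """Compute Wx @ x + Wh @ h + b for list-based matrices and vectors."""
    return [
        sum(wx * xi for wx, xi in zip(Wx[r], x))
        + sum(wh * hi for wh, hi in zip(Wh[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(p, x, h_prev, c_prev):
    """One step of the gate equations; vector products are elementwise."""
    i = [sigmoid(z) for z in affine(p["W_xi"], x, p["W_hi"], h_prev, p["b_i"])]
    f = [sigmoid(z) for z in affine(p["W_xf"], x, p["W_hf"], h_prev, p["b_f"])]
    g = [math.tanh(z) for z in affine(p["W_xc"], x, p["W_hc"], h_prev, p["b_c"])]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    o = [sigmoid(z) for z in affine(p["W_xo"], x, p["W_ho"], h_prev, p["b_o"])]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def init_params(in_dim, hid_dim, rng):
    """Random, untrained parameters for the four gates (i, f, c, o)."""
    def mat(r, c):
        return [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    params = {}
    for gate in "ifco":
        params["W_x" + gate] = mat(hid_dim, in_dim)
        params["W_h" + gate] = mat(hid_dim, hid_dim)
        params["b_" + gate] = [0.0] * hid_dim
    return params

def run_lstm(params, sequence, hid_dim):
    """Run the cell over a sequence of word vectors and return the last hidden state."""
    h, c = [0.0] * hid_dim, [0.0] * hid_dim
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

def bilstm_sentence_vector(sequence, in_dim, hid_dim, seed=0):
    """Concatenate the last hidden states of a forward and a backward pass."""
    rng = random.Random(seed)
    fwd = init_params(in_dim, hid_dim, rng)
    bwd = init_params(in_dim, hid_dim, rng)
    h_fwd = run_lstm(fwd, sequence, hid_dim)
    h_bwd = run_lstm(bwd, list(reversed(sequence)), hid_dim)
    return h_fwd + h_bwd

sentence = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
H_S = bilstm_sentence_vector(sentence, in_dim=3, hid_dim=4)
```

With a hidden size of 4, `H_S` has 8 entries, each strictly between -1 and 1 because every hidden value is a sigmoid output multiplied by a tanh output.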

Further, the specific method of step S7 is:

taking the text feature vector of each sentence as the first element, and arranging after it the drug molecule feature vectors of the drugs in the sentence, in the order in which the drugs appear in the sentence, to obtain the overall feature vector of the sentence.

Further, the specific method of step S8 is:

inputting the overall feature vector of each sentence into the fully connected layer and, according to the formula:

$X' = \tanh(W'X + b')$

obtaining the nonlinearly represented vector $X'$; where $\tanh(\cdot)$ is the tanh activation function; $W'$ is the fully connected layer parameter; $b'$ is the bias of the fully connected layer; $X$ is the input.

Further, the specific method of step S9 is:

according to the formula, classifying the nonlinearly represented vector $X'$ to obtain an output containing the probability of each class;

the class with the highest probability is taken as the identified drug-pair relation, completing the drug relation extraction; where softmax(·) is the softmax function; $W''$ is the classification parameter matrix; $b''$ is the classification bias.
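Steps S7 to S9 form a small classification head, sketched below in plain Python. The softmax formula is not legible in this text, so the form softmax(W''X' + b'') is assumed from the listed symbols; the weights are random and untrained, the hidden size is arbitrary, and the class names are copied from the text as given.

```python
import math
import random

# Class names as they appear in the text for step S2.
LABELS = ["advice", "action", "drug mechanism", "positive", "irrelevant"]

def matvec_add(W, x, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(v):
    mx = max(v)
    exps = [math.exp(x - mx) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def classify(text_vec, drug_vecs, hidden_dim=8, seed=0):
    rng = random.Random(seed)
    # S7: text features first, then each drug's features in sentence order.
    X = list(text_vec)
    for dv in drug_vecs:
        X.extend(dv)

    def mat(r, c):
        return [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]

    # Untrained placeholder weights; a real model would learn these.
    W1, b1 = mat(hidden_dim, len(X)), [0.0] * hidden_dim
    W2, b2 = mat(len(LABELS), hidden_dim), [0.0] * len(LABELS)
    # S8: X' = tanh(W'X + b')
    X_prime = [math.tanh(z) for z in matvec_add(W1, X, b1)]
    # S9 (assumed form): probabilities = softmax(W''X' + b''), then pick the largest.
    probs = softmax(matvec_add(W2, X_prime, b2))
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

label, probs = classify([0.2, -0.1, 0.4], [[0.3, 0.7], [0.9, 0.1]])
```

Since the weights are random, the predicted label carries no meaning here; the sketch only shows the data flow from the concatenated vector to a probability per class.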

The invention has the beneficial effects that:

1. The method uses the RDKit tool to convert the molecular formula of each drug into a molecular graph structure, derives the drug molecule features from it, extracts the text features of the sample, combines the drug molecule features with the text features, and classifies the drug relations with a fully connected layer and softmax.

2. The data of the invention comes from the medicine literature, so that the lag of updating the medicine relation data can be effectively reduced, the acquisition speed of the medicine relation information is accelerated, the learning cost and the learning burden of medical workers are reduced, the cognitive level of the medical workers on medicine knowledge is improved, and the potential risk of various adverse drug reactions to patients in the medicine taking process is reduced.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The following description of embodiments is provided to help those skilled in the art understand the present invention. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, various changes are apparent as long as they fall within the spirit and scope of the invention as defined and determined by the appended claims, and everything created using the inventive concept is protected.

As shown in fig. 1, the deep learning-based drug relationship extraction method includes the following steps:

In step S2, the specific method of labeling the retained samples to obtain labeled samples is: according to the DDIExtraction 2013 challenge rules, the labels are divided into 5 classes: advice, action, drug mechanism, positive and irrelevant.

The specific method of step S3 is: acquiring the positional relation between each word and each drug in the labeled sample, and establishing a vector whose number of elements equals the number of drugs; if the word is m positions before the nth drug, setting the nth element of the vector to m; if the word is m positions after the nth drug, setting the nth element to -m; traversing every drug to obtain the position feature vector of the word, and thereby obtaining the position feature vector of each word.

The specific method of step S4 includes the following substeps:

S4-1, obtaining the SMILES expressions of all drug molecules from the DrugBank database;

S4-2, converting each drug molecule's SMILES expression into a graph structure with the RDKit tool, taking each atom of the drug as a node and each bond between atoms as an edge;

S4-3, randomly initializing every bond and atom in the graph structure as a vector, according to a formula:

where σ(·) is the sigmoid activation function; $H_{t-1}$ is a parameter matrix;

S4-4, according to the formula:

The specific method of step S5 is: training a word2vec model on the text content, representing each word in the text as a vector, and arranging the word vectors in the order in which the words appear to obtain the vector representation of each sentence, thereby vectorizing each sentence.

In step S6, the deep learning network is a bidirectional long short-term memory (BiLSTM) model, whose expression is:

$i_p = \sigma(W_{xi} x_p + W_{hi} h_{p-1} + b_i)$

$f_p = \sigma(W_{xf} x_p + W_{hf} h_{p-1} + b_f)$

$c_p = f_p c_{p-1} + i_p \tanh(W_{xc} x_p + W_{hc} h_{p-1} + b_c)$

$o_p = \sigma(W_{xo} x_p + W_{ho} h_{p-1} + b_o)$

$h_p = o_p \tanh(c_p)$

wherein $i_p$ represents the output of the input gate; $\sigma(\cdot)$ is the sigmoid activation function; $W_{xi}$ is the parameter matrix between the input and the input gate; $x_p$ is the input of the model; $W_{hi}$ is the parameter matrix between the hidden layer and the input gate; $h_{p-1}$ is the hidden-layer output corresponding to the previous input word in the sentence; $b_i$ is the bias vector of the input gate; $f_p$ represents the output of the forget gate; $W_{xf}$ is the parameter matrix between the input and the forget gate; $W_{hf}$ is the parameter matrix between the hidden layer and the forget gate; $b_f$ is the bias vector of the forget gate; $c_p$ represents the output of the memory cell; $c_{p-1}$ is the memory cell output corresponding to the previous word; $\tanh(\cdot)$ is the tanh activation function; $W_{xc}$ is the parameter matrix between the input and the memory cell; $W_{hc}$ is the parameter matrix between the hidden layer and the memory cell; $b_c$ is the bias vector of the memory cell; $o_p$ represents the output of the output gate; $W_{xo}$ is the parameter matrix between the output gate and the input; $W_{ho}$ is the parameter matrix between the output gate and the hidden layer; $b_o$ is the bias vector of the output gate; $h_p$ represents the output of the bidirectional long short-term memory model.

The specific method of step S7 is: taking the text feature vector of each sentence as the first element, and arranging after it the drug molecule feature vectors of the drugs in the sentence, in the order in which the drugs appear in the sentence, to obtain the overall feature vector of the sentence.

The specific method of step S8 is: inputting the overall feature vector of each sentence into the fully connected layer, according to the formula:

$X' = \tanh(W'X + b')$

The specific method of step S9 is: according to the formula:

The class with the highest probability is taken as the identified drug-pair relation, completing the drug relation extraction; where softmax(·) is the softmax function; $W''$ is the classification parameter matrix; $b''$ is the classification bias.

In one embodiment of the present invention, the drug literature is obtained from PubMed. The bidirectional long short-term memory model computes the hidden vectors of the text sentence from front to back and, separately, from back to front; the last outputs in the two directions are concatenated to obtain the text feature vector $H_S$ of the sentence.

In summary, the invention uses the RDKit tool to convert the molecular formula of each drug into a molecular graph structure, derives the drug molecule features from it, extracts the text features of the sample, combines the two, and classifies the drug relations with a fully connected layer and softmax. Because the physicochemical properties of the drugs in the sentence are taken into account, extraction accuracy can be improved, and the problems that existing methods struggle to cover all text scenarios and depend excessively on external natural language processing tools are addressed.

Claims

1. A medicine relation extraction method based on deep learning is characterized by comprising the following steps:

S1, acquiring documents related to drugs, splitting their text content into sentences, and taking each sentence as an initial sample;

S4, obtaining the SMILES expressions of all drug molecules, converting them into graph structures, and obtaining the drug molecule feature vector of each drug from its graph structure;

S5, representing the words in the text as vectors and replacing each word with its vector, thereby vectorizing each sentence;

2. The deep learning-based drug relation extraction method of claim 1, wherein in step S2 the specific method of labeling the retained samples to obtain labeled samples is:

according to the DDIExtraction 2013 challenge rules, the labels are divided into 5 classes: advice, action, drug mechanism, positive and irrelevant.

3. The method for extracting drug relationship based on deep learning of claim 1, wherein the specific method of step S3 is as follows:

4. The deep learning-based drug relationship extraction method as claimed in claim 1, wherein the specific method of step S4 includes the following sub-steps:

where σ(·) is the sigmoid activation function; $H_{t-1}$ is a parameter matrix;

S4-4, according to the formula:

5. The deep learning-based drug relationship extraction method according to claim 1, wherein the specific method of step S5 is:

6. The deep learning-based drug relationship extraction method as claimed in claim 1, wherein the deep learning network in step S6 is a bidirectional long short-term memory (BiLSTM) model, whose expression is:

$i_p = \sigma(W_{xi} x_p + W_{hi} h_{p-1} + b_i)$

$f_p = \sigma(W_{xf} x_p + W_{hf} h_{p-1} + b_f)$

$c_p = f_p c_{p-1} + i_p \tanh(W_{xc} x_p + W_{hc} h_{p-1} + b_c)$

$o_p = \sigma(W_{xo} x_p + W_{ho} h_{p-1} + b_o)$

$h_p = o_p \tanh(c_p)$

wherein $i_p$ represents the output of the input gate; $\sigma(\cdot)$ is the sigmoid activation function; $W_{xi}$ is the parameter matrix between the input and the input gate; $x_p$ is the input of the model; $W_{hi}$ is the parameter matrix between the hidden layer and the input gate; $h_{p-1}$ is the hidden-layer output corresponding to the previous input word in the sentence; $b_i$ is the bias vector of the input gate; $f_p$ represents the output of the forget gate; $W_{xf}$ is the parameter matrix between the input and the forget gate; $W_{hf}$ is the parameter matrix between the hidden layer and the forget gate; $b_f$ is the bias vector of the forget gate; $c_p$ represents the output of the memory cell; $c_{p-1}$ is the memory cell output corresponding to the previous word; $\tanh(\cdot)$ is the tanh activation function; $W_{xc}$ is the parameter matrix between the input and the memory cell; $W_{hc}$ is the parameter matrix between the hidden layer and the memory cell; $b_c$ is the bias vector of the memory cell; $o_p$ represents the output of the output gate; $W_{xo}$ is the parameter matrix between the output gate and the input; $W_{ho}$ is the parameter matrix between the output gate and the hidden layer; $b_o$ is the bias vector of the output gate; $h_p$ represents the output of the bidirectional long short-term memory model.

7. The method for extracting drug relationship based on deep learning of claim 1, wherein the specific method of step S7 is as follows:

8. The method for extracting drug relationship based on deep learning of claim 1, wherein the specific method of step S8 is as follows:

inputting the overall feature vector of each sentence into a fully connected layer, and according to the formula:

$X' = \tanh(W'X + b')$

9. The method for extracting drug relationship based on deep learning of claim 1, wherein the specific method of step S9 is as follows:

according to the formula:

The class with the highest probability is taken as the identified drug-pair relation, completing the drug relation extraction; where softmax(·) is the softmax function; $W''$ is the classification parameter matrix; $b''$ is the classification bias.