CN110020671B

CN110020671B - Drug relationship classification model construction and classification method based on dual-channel CNN-LSTM network

Info

Publication number: CN110020671B
Application number: CN201910174269.4A
Authority: CN
Inventors: 孙霞; 马龙; 张蕾; 冯筠; 吴楠楠
Original assignee: Northwest University
Current assignee: Northwest University
Priority date: 2019-03-08
Filing date: 2019-03-08
Publication date: 2023-04-18
Anticipated expiration: 2039-03-08
Also published as: CN110020671A

Abstract

The invention discloses a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network. The method comprises: preprocessing an original drug text set; performing a reverse order operation on each preprocessed drug text in the preprocessed drug text set to obtain a reverse order text set, and taking the preprocessed drug text set as a positive sequence text set; and training a neural network to obtain a drug relationship classification model. The neural network comprises a forward-order text feature extraction layer and a reverse-order text feature extraction layer arranged in parallel, a feature fusion layer and a classification layer. Each of the two text feature extraction layers comprises a convolution block and a long short-term memory neural network block arranged in sequence. By constructing a dual-channel CNN-LSTM network, using the CNN to extract the local features of the drug texts and the LSTM to extract their global features, the invention obtains richer drug relationship features and improves classification accuracy.

Description

Drug relationship classification model construction and classification method based on dual-channel CNN-LSTM network

Technical Field

The invention relates to a method for constructing and classifying a drug relationship classification model, in particular to a method for constructing and classifying a drug relationship classification model based on a dual-channel CNN-LSTM network.

Background

Drug relationship refers to the combined effect of two or more drugs administered simultaneously or over a period of time. Such effects can be classified as synergistic, antagonistic, and non-interactive. Antagonistic effects between drugs pose serious health risks to the patient. The drug relationship extraction task, also known as drug-drug interaction extraction (DDIE), is a typical relationship extraction task in the field of natural language processing. It aims to detect and identify the semantic relationship of drug pairs, and is of great significance for reducing drug safety accidents and promoting the development of biomedical technology.

In recent years, experts in biomedicine and text mining have made great efforts on the DDIE task and have created many methods, which fall mainly into three categories: rule pattern based methods, statistical machine learning based methods, and deep learning based methods. Although rule pattern based methods can identify entity relations in the target text in a targeted way, they have three serious disadvantages: (1) a large amount of manpower and material resources must be spent studying the target text, otherwise the information extraction quality of the formulated rules cannot be guaranteed; (2) when the rules are formulated, domain experts must provide a large amount of prior knowledge, and different experts may formulate rule sets with inconsistent standards because of subjective judgment; (3) because the rules are strongly tied to the domain knowledge, they are only suitable for information extraction in that domain and generalize poorly. For these reasons rule pattern based methods have not attracted extensive attention from researchers. Statistical machine learning based methods work well, but they require elaborate and cumbersome feature engineering to extract an appropriate feature set. Moreover, the quality of the extracted features depends on existing natural language processing tools; because of the noise and cost of these tools, the extracted features are disordered and their quality is difficult to guarantee, so the classification accuracy is not high.

Disclosure of Invention

The invention aims to provide a method for constructing and classifying a drug relationship classification model based on a dual-channel CNN-LSTM network, which is used for solving the problem of low accuracy of drug relationship classification caused by disorder of features extracted by a drug relationship classification method in the prior art.

In order to realize the task, the invention adopts the following technical scheme:

a drug relationship classification model construction method based on a dual-channel CNN-LSTM network is implemented according to the following steps:

step 1, obtaining an original medicine text set;

labeling the medicine relation in each original medicine text in the original medicine text set to obtain a medicine relation label set;

step 2, preprocessing the original medicine text set to obtain a preprocessed medicine text set;

the preprocessing comprises text normalization, text length fixation and text vector mapping;

step 3, performing reverse order operation on each preprocessed drug text in the preprocessed drug text set to obtain a reverse order text set;

taking the preprocessed medicine text set as a positive sequence text set;

step 4, taking the positive sequence text set and the reverse order text set as input, taking the drug relationship label set as output, training a neural network, and obtaining a drug relationship classification model;

the neural network comprises a forward sequence text feature extraction layer and a reverse sequence text feature extraction layer arranged in parallel, followed in sequence by a feature fusion layer and a classification layer;

the forward sequence text feature extraction layer and the reverse sequence text feature extraction layer each comprise a convolution block and a long short-term memory (LSTM) neural network block which are sequentially arranged.

Further, the number of the convolution blocks is set to 4.

Furthermore, each convolution block comprises a batch regularization sublayer, a convolution sublayer, an activation function sublayer and a pooling sublayer which are arranged in sequence.

Further, the activation function in the activation function sublayer is a ReLU function.

Further, the feature fusion layer comprises a full connection layer.

Further, the classification layer comprises a Softmax function layer.

A drug relation classification method based on a dual-channel CNN-LSTM network is characterized in that a drug text to be classified is executed according to the following steps:

step A, preprocessing the text of the medicine to be classified by adopting the method in the step 2 in the claim 1 to obtain a preprocessed medicine text;

step B, inputting the preprocessed drug text into the drug relation classification model of any one of claims 1 to 6 to obtain a classification result.

Compared with the prior art, the invention has the following technical characteristics:

1. according to the method for constructing and classifying the drug relationship classification model based on the dual-channel CNN-LSTM network, the dual-channel CNN-LSTM network is constructed, the CNN is used for extracting the local features of the drug texts, the LSTM is used for extracting the global features of the drug texts respectively, the extracted drug relationship features are richer, and the classification accuracy is improved;

2. the invention provides a method for constructing and classifying a medicine relation classification model based on a double-channel CNN-LSTM network, which is characterized in that a positive sequence text and a negative sequence text of a medicine relation text are respectively sent to the CNN-LSTM network to complete a characteristic extraction process, compared with a single-channel LSTM network, the extracted medicine relation characteristics are more comprehensive, and the classification accuracy is improved;

3. according to the method for constructing and classifying the drug relationship classification model based on the dual-channel CNN-LSTM network, the process of extracting the drug feature vectors is simplified and the accuracy of drug relationship classification is improved by extracting the drug text feature vectors;

4. the invention provides a method for constructing and classifying a drug relationship classification model based on a dual-channel CNN-LSTM network, which takes an original drug relationship text containing a plurality of drug entities as input, does not need manual intervention and related field knowledge, does not need to manually extract complex text features, and has strong generalization capability.

Drawings

FIG. 1 is a diagram of a drug classification model provided in one embodiment of the present invention;

fig. 2 is a diagram of the internal structure of a convolution block provided in one embodiment of the present invention.

Detailed Description

The terms appearing in the detailed description are explained first:

Long short-term memory neural network (LSTM): an LSTM network consists of an input gate, a forget gate, an output gate and a memory cell. Through this gating mechanism the LSTM can effectively learn long-term dependencies in the input data, and it is widely used for processing sequential information such as text data and trajectory data.

Convolutional neural network (CNN): a feed-forward neural network that includes convolution calculations and has a deep structure.

Example one

As shown in fig. 1, in this embodiment, a method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network is disclosed, where the method is performed according to the following steps:

step 1, obtaining an original medicine text set;

labeling the drug relationship in each original drug text in the original drug text set to obtain a drug relationship label set;

the biomedical texts acquired in the embodiment can be acquired through biomedical documents, papers and other modes, and the acquired texts can be local or integral parts of the documents and the papers, but the semantic expression of the texts needs to be ensured to be complete.

The original drug text must include at least two target drug name words, which are the drug words involved in the drug relationship classification; all remaining words are other words. For example, in this embodiment the original drug text is: "Some quinolones, including ciprofloxacin, have been associated with transient elevations in serum creatinine in patients receiving cyclosporine concomitantly", wherein "quinolones", "ciprofloxacin" and "cyclosporine" are drug name words and the remaining words are other words.

In the data set used in this embodiment, text length ranges from 0 to 150 words, with most texts between 20 and 60 words, and backward dependence phenomena (grammatical phenomena such as postposed modifying clauses) account for 46% of the data set.

There are 5 drug relationship labels: advice, effect, mechanism, int, and false (no relationship).

in this embodiment, the original drug text set is preprocessed with the drug text set processing method of the patent "drug relation classification method based on multilayer convolutional neural network". The original medicine texts in the set differ in format and length, and medicine name words are complex and uncommon, which easily introduces errors when a neural network is used for classification, so the acquired original medicine texts need to be preprocessed. The preprocessing comprises performing word shape normalization on all words in the original medicine text, that is, unifying the word forms of all words, and naming the target drug name words in a unified form and replacing the original target drug name words with the named forms. The specific operation comprises the following steps:

step 2.1, performing morphology normalization on all words in the original medicine text set, naming the target medicine name words in a unified form, and replacing the original target medicine name words with the uniformly named ones, to obtain a normalized medicine text set;

wherein, the normalization comprises morphology normalization and naming normalization;

in order to make the classification of the medicine texts more accurate and reduce the introduction of errors, morphology normalization is performed on each word in each original medicine text, converting the words into a uniform format. This is repeated until every original medicine text in the original medicine text set has been normalized, yielding a normalized original medicine text set.

In order to improve generalization of the neural network, all target drug name words in the drug text are named in a unified form, "X + serial number", where X may be any English word, such as "day" or "interaction", and the serial number is an ordinal written as an English word, such as "one", "two", "three". The unified names replace the original target drug name words; in this embodiment the replaced target drug name words consist of "drug" followed by the serial number. Numbering is independent within each drug text, so the drug texts do not affect one another. This yields the normalized drug text set.
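The naming normalization described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `unify_drug_names`, the prefix `"drug"`, the joined form `drugone`, numbering by order of first appearance, lowercasing as a stand-in for morphology normalization, and the shortened example sentence are all hypothetical and are not specified by the patent.

```python
import re

# Hypothetical ordinal words; the patent only says the serial number is an
# ordinal written as an English word.
ORDINALS = ["one", "two", "three", "four", "five"]

def unify_drug_names(text, drug_names, prefix="drug"):
    """Replace each drug name with prefix + ordinal, numbered by first appearance."""
    lowered = text.lower()  # crude stand-in for morphology normalization
    # Keep only the names that occur, ordered by where they first occur.
    present = sorted(
        (name.lower() for name in drug_names if name.lower() in lowered),
        key=lowered.index,
    )
    for ordinal, name in zip(ORDINALS, present):
        lowered = re.sub(r"\b" + re.escape(name) + r"\b", prefix + ordinal, lowered)
    return lowered

# Hypothetical, shortened example sentence.
sentence = "Some quinolones, including ciprofloxacin, interact with cyclosporine."
normalized = unify_drug_names(sentence, ["cyclosporine", "quinolones", "ciprofloxacin"])
print(normalized)
```

Because numbering restarts for every text, the same placeholder can stand for different drugs in different texts, which is what keeps the texts independent of one another.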

Step 2.2, unifying the length of each drug text in the normalized drug text set to obtain a drug text set with a fixed length;

the length of each drug text in the normalized drug text set is fixed to n, and texts shorter than n are padded; the padding may be all zeros or random numbers. The drug text can then be represented as:

$S = w_1 w_2 w_3 \ldots w_n$
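A minimal sketch of the length fixation step follows. The patent only describes padding texts shorter than n, so truncating longer texts, the function name `fix_length` and the `<pad>` token are assumptions made for illustration.

```python
def fix_length(tokens, n, pad_token="<pad>"):
    """Return exactly n tokens: truncate long texts, right-pad short ones."""
    # Truncation of texts longer than n is an assumption, not stated in the patent.
    return tokens[:n] + [pad_token] * max(0, n - len(tokens))

short_text = ["drugone", "increases", "drugtwo"]
print(fix_length(short_text, 5))
print(fix_length(short_text, 2))
```

If the padding token is absent from the word vector table, a later lookup that maps unknown words to zeros produces the all-zero padding mentioned in the text.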

step 2.3, carrying out vector mapping on each fixed-length medicine text in the fixed-length medicine text set to obtain a preprocessed medicine text set;

since the neural network cannot directly process the text in the natural language form as an input, the method for mapping the drug text into the text vector in the digital form comprises the following steps:

a. constructing a word vector table, wherein the word vector table is composed of words and corresponding digital word vectors;

each word in the word vector table corresponds to a unique numeric word vector, and the table should contain as many words as possible so that it covers more of the vocabulary.

In order to obtain more meaningful word vectors, this embodiment uses the GloVe (Global Vectors for Word Representation) word vector table provided by the NLP research group at Stanford University, which includes 2196016 word vectors, each of dimension 300. If a word in the input original text is not in this word vector table, every dimension of the word vector for that word is initialized to 0.

b. And mapping each medicine text with fixed length in the medicine text set in a table look-up mode to obtain a preprocessed medicine text set.

For each word in a drug text of length n, the word is mapped into a d-dimensional vector by looking up the word vector table; mapping every word in this way turns an original drug text S of length n into an (n × d)-dimensional text vector.

A set containing m original drug texts S of length n is accordingly mapped into an m × (n × d) array, that is, the drug text set contains m text vectors of dimension (n × d).
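The table lookup can be sketched as follows. The toy table with d = 3 stands in for the 300-dimensional GloVe table; the function names and the values in the table are hypothetical.

```python
def text_to_matrix(tokens, table, d):
    """Map n tokens to an n x d list of word vectors; unknown words become zeros."""
    zero = [0.0] * d
    return [list(table.get(token, zero)) for token in tokens]

def texts_to_tensor(texts, table, d):
    """Map m fixed-length texts to an m x n x d nested list."""
    return [text_to_matrix(tokens, table, d) for tokens in texts]

# Toy word vector table (d = 3); values are made up for illustration.
toy_table = {
    "drugone": [0.1, 0.2, 0.3],
    "increases": [0.4, 0.5, 0.6],
    "drugtwo": [0.7, 0.8, 0.9],
}
texts = [
    ["drugone", "increases", "drugtwo", "<pad>"],
    ["drugtwo", "blocks", "drugone", "<pad>"],
]
tensor = texts_to_tensor(texts, toy_table, d=3)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Here m = 2, n = 4 and d = 3, so the result has shape 2 × 4 × 3; the words "blocks" and "<pad>" are not in the table and map to zero vectors.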

taking the preprocessed medicine text set as a positive sequence text set;

in the scheme, in order to enable the extracted features to be more comprehensive, the positive sequence medicine text and the negative sequence medicine text are used for respectively training the neural network to obtain a classification model.

When the reverse order operation is applied to a medicine text, the order of the elements in the text vector is reversed. For example, the 1-dimensional vector [0.21 0.35 0.62 0.85 0.96] becomes, in reverse order, [0.96 0.85 0.62 0.35 0.21].
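The reverse order operation can be sketched in a few lines. The sketch assumes that for an (n × d) text vector the reversal reorders the n word vectors and leaves each d-dimensional word vector untouched; the function name `reverse_order` is hypothetical.

```python
def reverse_order(sequence):
    """Return a new list with the elements of sequence in reverse order."""
    return list(reversed(sequence))

# The 1-dimensional example from the text.
forward = [0.21, 0.35, 0.62, 0.85, 0.96]
reverse = reverse_order(forward)
print(reverse)

# For an n x d text matrix only the word order (rows) is reversed.
text_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
reversed_matrix = reverse_order(text_matrix)
print(reversed_matrix)
```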

the neural network comprises a parallel forward-sequence text feature extraction layer, a parallel reverse-sequence text feature extraction layer, a feature fusion layer and a classification layer;

the positive sequence text feature extraction layer and the negative sequence text feature extraction layer have the same structure and respectively comprise a convolution block and a long-term and short-term memory neural network block which are sequentially arranged.

In this embodiment, the structure of the neural network is redesigned to improve the accuracy of drug relationship classification. As shown in fig. 1, the convolution blocks extract the local features of the text, and the local features are then sent to the LSTM model, which supplements them with the global features and time sequence features of the text. Such a structure, however, only processes the text in forward order, and its ability to handle backward-modified text, such as a phrase that modifies a word preceding it, remains weak. Therefore two identical feature extraction layers process the forward order and the reverse order of the input text respectively, the extracted forward-order and reverse-order features are combined into the final text features, and the text features are then output to the classification layer for classification.

In this embodiment, if the number of convolution blocks is less than 4, the extracted local features are not accurate enough; if the number is more than 4, overfitting occurs and feature extraction fails. Therefore, as a preferred embodiment, 4 convolution blocks are provided.

Optionally, each convolution block includes a batch regularization sublayer, a convolution sublayer, an activation function sublayer, and a pooling sublayer that are sequentially arranged.

In this embodiment, as shown in fig. 2, the forward-order and reverse-order drug texts sent into a convolution block first enter the batch regularization layer, which makes the input data follow a normal distribution; training on samples that follow a normal distribution is considerably faster, and the accuracy also improves.
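The patent does not give a formula for the batch regularization sublayer. The sketch below assumes it behaves like standard batch normalization without the learnable scale and shift: each feature is standardized over the batch to zero mean and unit variance. The function name `batch_normalize` and the `eps` constant are assumptions.

```python
import math

def batch_normalize(batch, eps=1e-5):
    """Standardize each feature column of a batch to zero mean and unit variance."""
    m = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / m for col in columns]
    variances = [
        sum((v - mu) ** 2 for v in col) / m for col, mu in zip(columns, means)
    ]
    return [
        [(v - mu) / math.sqrt(var + eps) for v, mu, var in zip(row, means, variances)]
        for row in batch
    ]

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized_batch = batch_normalize(batch)
print(normalized_batch)
```

Both columns end up on the same scale even though the second started ten times larger, which is the property that makes subsequent training better conditioned.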

In this embodiment, the normalized data is sent to the convolution layer for the convolution operation, and the parameter of the convolution layer is set as follows: the number of convolution filters is 128.

The data then enter the activation function, which removes meaningless data after convolution; as a preferred embodiment, the activation function is the ReLU function.
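The convolution and activation sublayers can be sketched for a single filter. The patent states only the number of filters (128) and the ReLU activation; the kernel size, the "valid" (no padding) convolution over word positions, and the names used here are assumptions for illustration.

```python
def relu(x):
    """ReLU activation: negative values are replaced by zero."""
    return max(0.0, x)

def conv1d_relu(matrix, kernel, bias=0.0):
    """Slide one filter over word positions (no padding), then apply ReLU.

    matrix: n rows of d-dimensional word vectors.
    kernel: k rows of d weights, covering k consecutive words.
    """
    k = len(kernel)
    outputs = []
    for start in range(len(matrix) - k + 1):
        total = bias
        for row, weights in zip(matrix[start:start + k], kernel):
            total += sum(x * w for x, w in zip(row, weights))
        outputs.append(relu(total))
    return outputs

matrix = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0], [2.0, 2.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]  # one filter spanning 2 consecutive words
feature_map = conv1d_relu(matrix, kernel)
print(feature_map)
```

The middle window sums to a negative value and is zeroed by ReLU, which corresponds to the removal of meaningless data described above. A layer with 128 filters would produce 128 such feature maps.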

The convolution and activation operations are repeated, and the resulting data are sent to the pooling layer, which uses max pooling. For example, with a pooling window of size 2×2, the window is slid over the convolved and activated data, and at each position the largest number in the window is selected as the representative; one representative is produced per slide, and the representatives together stand in for the original data. The benefit is that the amount of data is reduced without losing the text features, which accelerates network training.
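A minimal sketch of 2×2 max pooling follows. It assumes non-overlapping windows (stride 2) and input with an even number of rows and columns; the patent states only the window size.

```python
def max_pool_2x2(grid):
    """Keep the largest value of every non-overlapping 2x2 window."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 9, 8],
    [2, 6, 3, 7],
]
pooled = max_pool_2x2(grid)
print(pooled)
```

The 4×4 input shrinks to 2×2, a fourfold reduction in data, while the strongest response in each window is kept.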

After 4 identical convolution blocks are passed through, the positive-order medicine text and the negative-order medicine text enter the long-short term memory neural network block to obtain the global characteristics and the time sequence characteristics existing in the medicine relation text.

In this embodiment, the number of nodes in the long-short term memory neural network block is set to 64.

Optionally, the feature fusion layer comprises a fully-connected layer.

In this embodiment, after the forward-order drug text and the reverse-order drug text are sent to the CNN-LSTM network, the forward-order text feature and the reverse-order text feature are obtained, and the two features are sent to the full connection layer at the same time. For example, if 100 forward text features and 100 reverse text features exist, a fully connected layer with 200 nodes in the first layer and 100 nodes in the second layer is constructed, the forward text features are sent into the first 100 nodes in the first layer, the reverse text features are sent into the last 100 nodes in the first layer, and then the 200 features are fused together in this way.
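The fusion step can be sketched at a smaller scale: 2 forward and 2 reverse features are concatenated into 4 inputs and mapped to 2 outputs, mirroring the 100 + 100 → 200 → 100 example above. The weights, biases and the absence of an activation function are assumptions; in the real model the weights are learned.

```python
def fuse(forward_feats, reverse_feats, weights, biases):
    """Concatenate both feature vectors and apply one fully connected layer."""
    joined = forward_feats + reverse_feats  # forward features first, then reverse
    return [
        sum(w * x for w, x in zip(weight_row, joined)) + b
        for weight_row, b in zip(weights, biases)
    ]

forward_feats = [1.0, 2.0]
reverse_feats = [3.0, 4.0]
weights = [
    [1.0, 0.0, 0.0, 1.0],  # mixes forward feature 0 with reverse feature 1
    [0.0, 1.0, 1.0, 0.0],  # mixes forward feature 1 with reverse feature 0
]
biases = [0.0, 0.5]
fused = fuse(forward_feats, reverse_feats, weights, biases)
print(fused)
```

Each output node sees features from both channels, which is what allows the layer to combine forward-order and reverse-order evidence.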

Optionally, the classification layer comprises a Softmax function layer.

In this embodiment, the fully connected layer and the Softmax function layer form the last part of the drug relationship classification algorithm. They output the drug relationship label as a numeric vector whose length equals the number of classes, which determines the final drug relationship classification result. Each output node of the Softmax function layer represents one drug relationship class; the label finally output by the classifier gives the probability that a given drug entity pair belongs to each class, and each probability value lies in [0, 1]. For example, if there are 2 drug relationships, representing related (positive) and unrelated (negative), the Softmax function layer has 2 output nodes. If it outputs the label p[positive, negative] = [0.1, 0.9], the probability of positive is 0.1 and the probability of negative is 0.9, and the determination is made from these probability values. In this example there are 5 drug relationships: advice, effect, mechanism, int and false.
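The Softmax function itself is standard and can be sketched as follows. The input scores are made up for illustration; the label order follows the output vector used later in this description.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that lie in [0, 1] and sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["mechanism", "advice", "effect", "int", "false"]
scores = [0.1, 0.5, 0.3, 2.0, 0.4]  # hypothetical outputs of the last layer
probs = softmax(scores)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.3f}")
```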

The hierarchical convolutional recurrent neural network is trained with the above input and output to obtain the drug relationship classification hierarchical convolutional recurrent neural network, wherein the drug relationship texts and the drug relationship labels are all in numeric vector form. The network is trained N times, where N ≥ 1, and the network with the best performance among the N training runs is taken as the drug relationship classification hierarchical convolutional recurrent neural network.
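The best-of-N selection can be sketched independently of any particular network. The callables `train_once` and `evaluate` are hypothetical placeholders for the actual training and evaluation procedures, which the patent does not specify in code form; the scores below are made up.

```python
def best_of_n(train_once, evaluate, n):
    """Train n times and return the best model together with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    best_model, best_score = None, float("-inf")
    for run in range(n):
        model = train_once(run)   # hypothetical: returns a trained model
        score = evaluate(model)   # hypothetical: higher is better
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score

# Stand-ins so the sketch runs: a "model" is just a record of its run index.
fake_scores = [0.61, 0.70, 0.66]
model, score = best_of_n(
    train_once=lambda run: {"run": run},
    evaluate=lambda m: fake_scores[m["run"]],
    n=3,
)
print(model, score)
```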

The training set of the classification hierarchical convolutional recurrent neural network comprises two parts: the preprocessed drug text set, which is the input of the network, and the drug relationship labels between the target drug name words in the original drug text corresponding to each preprocessed drug text, which form the drug relationship label set used as the target output of the network. The test set likewise comprises two parts; the difference is that during testing only the preprocessed drug text set is input into the trained network, which produces a set of predicted drug classification results from the input drug text data and the trained model parameters. The predictions are then compared with the true drug relationship labels, and the performance of the network is evaluated from the comparison.

In this example, the DDIExtraction 2013 drug relationship data set is used as the drug relationship text to train and test the classification hierarchical convolutional recurrent neural network, with 80% of the whole data set used as the training set and 20% as the test set; that is, the training set consists of 27792 drug relationship text samples and the test set consists of 6409 drug relationship text samples. The hierarchical convolutional recurrent neural network is then trained 10 times on the training set, and the model that performs best among the 10 training runs is selected as the final drug relationship hierarchical convolutional recurrent neural network.

Example two

step A, preprocessing the text of the medicine to be classified by adopting the method in the step 2 in the embodiment 1 to obtain a preprocessed medicine text;

and step B, inputting the preprocessed drug text into the drug relation classification model in the embodiment 1 to obtain a classification result.

After the final drug relationship hierarchical convolutional recurrent neural network is trained, the model can predict the drug relationship involved in any drug relationship text: a drug text with an unknown drug relationship is input into the network, and the drug relationship with the highest probability in the numeric vector output by the network is taken as the drug relationship classification result for that text.

In this embodiment, the drug text to be classified is "Some quinolones, including ciprofloxacin, have been associated with transient elevations in serum creatinine in patients receiving cyclosporine concomitantly", in which the first target drug is quinolones and the second target drug is cyclosporine, each replaced by its unified name during preprocessing. The drug relationship is classified by the trained drug relationship hierarchical convolutional recurrent neural network, and the output drug relationship numeric vector label is:

P[mechanism,advice,effect,int,false]＝[0.02,0.09,0.1,0.67,0.12]

That is, for the two target drugs quinolones and cyclosporine, the probability of a mechanism relationship is 2%, the probability of an advice relationship is 9%, the probability of an effect relationship is 10%, the probability of an int relationship is 67%, and the probability of false (no relationship) is 12%. The int relationship has the highest probability, 67%, so the drug relationship hierarchical convolutional recurrent neural network classifies the relationship between quinolones and cyclosporine as int.
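The selection of the final label from the output vector above amounts to taking the entry with the highest probability, as the short sketch below shows. The function name `classify` is hypothetical; the labels and probabilities are those of the example.

```python
def classify(labels, probabilities):
    """Return the label whose probability is highest, with that probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best_index], probabilities[best_index]

labels = ["mechanism", "advice", "effect", "int", "false"]
probabilities = [0.02, 0.09, 0.1, 0.67, 0.12]
predicted_label, predicted_prob = classify(labels, probabilities)
print(predicted_label, predicted_prob)
```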

The performance of the drug relationship classification method based on the dual-channel CNN-LSTM network provided by this scheme is compared with that of prior-art drug relationship classification algorithms. When a drug relationship classification method is evaluated, the higher the accuracy, the recall rate and the F value, the better the performance of the drug relationship classification model. As can be seen from Table 1, the drug relationship hierarchical convolutional recurrent neural network provided by the invention is clearly superior to the other methods on all three indexes of accuracy, recall rate and F value, which shows that the drug relationship classification method based on the hierarchical bidirectional convolutional recurrent neural network provided by the invention achieves the best classification performance among the compared methods on the drug relationship classification problem.
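The three indexes can be computed as in the sketch below. It rests on assumptions the patent does not state: that "accuracy" denotes precision, that scores are micro-averaged over the four relationship types with "false" treated as the negative class, and that the F value is the balanced F1 score. The gold and predicted labels are made up for illustration.

```python
def precision_recall_f(gold, predicted, negative="false"):
    """Micro-averaged precision, recall and F1 over the non-negative classes."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p and g != negative)
    predicted_pos = sum(1 for p in predicted if p != negative)
    gold_pos = sum(1 for g in gold if g != negative)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_value

gold = ["int", "false", "effect", "advice"]
predicted = ["int", "effect", "false", "advice"]
precision, recall, f_value = precision_recall_f(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f_value, 3))
```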

TABLE 1 comparison of the drug relationship Classification methods provided by the present invention with other drug relationship Classification methods


Claims

1. A method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network is characterized by comprising the following steps:

step 1, obtaining an original medicine text set;

the preprocessing comprises text normalization, text length fixing and text vector mapping;

taking the preprocessed medicine text set as a positive sequence text set;

the forward sequence text feature extraction layer and the reverse sequence text feature extraction layer respectively comprise a convolution block and a long-term and short-term memory neural network block which are sequentially arranged;

the number of the convolution blocks is 4;

each convolution block comprises a batch regularization sublayer, a convolution sublayer, an activation function sublayer and a pooling sublayer which are arranged in sequence.

2. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network as claimed in claim 1, wherein the activation function in the activation function sub-layer is a ReLU function.

3. The method for constructing a drug relationship classification model based on a dual-channel CNN-LSTM network as claimed in claim 1, wherein said feature fusion layer comprises a full link layer.

4. The method for constructing the drug relationship classification model based on the dual-channel CNN-LSTM network as claimed in claim 1, wherein the classification layer comprises a Softmax function layer.

5. A drug relationship classification method based on a dual-channel CNN-LSTM network is characterized in that a drug text to be classified is executed according to the following steps:

step B, inputting the preprocessed drug text into the drug relation classification model of any one of claims 1 to 4 to obtain a classification result.