CN114841261A - Increment width and deep learning drug response prediction method, medium, and apparatus - Google Patents


Info

Publication number
CN114841261A
Authority
CN
China
Prior art keywords
drug
width
learning
sequence
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210464986.2A
Other languages
Chinese (zh)
Other versions
CN114841261B (en)
Inventor
陈俊龙
詹永康
孟献兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210464986.2A priority Critical patent/CN114841261B/en
Publication of CN114841261A publication Critical patent/CN114841261A/en
Application granted granted Critical
Publication of CN114841261B publication Critical patent/CN114841261B/en
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/251 Fusion techniques of input or preprocessed data
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/10 ICT specially adapted for therapies or health-improving plans relating to drugs or medications, e.g. for ensuring correct administration to patients
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention provides a method, medium, and apparatus for drug response prediction with incremental width (broad) learning and deep learning. The method comprises the following steps: performing text encoding and position encoding on the drug's SMILES sequence to construct a drug information code; inputting the drug information code into a Transformer encoder to mine the drug's structural features, inputting gene expression data into a multilayer perceptron to learn a gene feature representation, and concatenating the drug features and gene features into a drug-gene feature pair; and inputting the feature pair into a width learning system to obtain a predicted drug sensitivity regression value. The method addresses the problem of poor drug representation; the width learning system fuses the drug and gene feature representations, improving the accuracy of drug sensitivity prediction; and the network weights are updated through an incremental learning algorithm, improving model performance without retraining the whole model.

Description

Increment width and deep learning drug response prediction method, medium, and apparatus
Technical Field
The invention relates to the technical field of drug response prediction, and in particular to a drug response prediction method, medium, and apparatus based on incremental width and deep learning.
Background
Cancer is a major disease threatening human health and life, and realizing personalized treatment for cancer patients is one of the most prominent research fields of precision medicine. In recent years, with the rapid development of pharmacogenomics and computational models, drug response prediction technology has gradually brought more convenience to personalized treatment research. Drug response prediction aims to extract and integrate information about drugs and the gene expression of cell lines in order to predict the sensitivity of cell lines to drugs. The half-maximal inhibitory concentration (IC50) reflects the drug response sensitivity of cell lines and is a commonly used drug response prediction index. Most traditional drug response prediction methods are based on machine learning, such as support vector machines, Bayesian multi-task multi-kernel learning, random forests, and simple neural network models. These methods rely on prior knowledge and feature engineering to obtain drug and gene features, which are then combined into new features to predict the drug sensitivity of cell lines. Faced with complex, high-dimensional, and noisy data, the prediction and generalization performance of these methods is not competitive.
With the development of artificial intelligence, deep learning has made remarkable breakthroughs in problems such as drug response prediction and drug development. Methods for predicting drug response with deep learning can be broadly classified into two types: unsupervised or semi-supervised methods, and end-to-end supervised methods. Unsupervised or semi-supervised drug response prediction models generally use an autoencoder to perform dimension-reducing feature learning on data such as drug text sequences and the methylation, copy number variation, and transcriptome data of cell lines; the learned features are used to train a classifier that predicts the sensitivity of cell lines to drugs. End-to-end supervised methods exploit the modularity of deep learning networks, adopting models such as convolutional neural networks, deep encoders, ensemble deep neural networks, and graph neural networks to extract features from different types of data, and feeding the learned drug and gene representations into a prediction classifier for training. Compared with traditional machine learning methods, deep learning methods improve prediction performance and generalization, but still cannot meet the requirements of clinical trials. Most current deep-learning-based drug response prediction algorithms have considerable limitations.
First, feature engineering methods that adopt only chemical descriptors or molecular fingerprints give insufficient consideration to the representation of drug structure: they cannot distinguish different atoms in a drug molecule or the different interaction information between their associated chemical bonds, and easily lose hidden drug structure information. Second, the fusion of drug features and gene features is simplistic, and the performance gain of a classifier built from a multilayer neural network model is limited. Third, existing modeled systems must retrain the whole model when facing newly added data, greatly increasing the time cost. Fourth, in real clinical experimental environments, all training samples cannot be acquired at once, owing to differing privacy and property protections and data collection periods; current drug response prediction models also lack the ability to learn incrementally from multiple batches of data.
In conclusion, current deep-learning-based drug response prediction methods leave room for improvement.
Disclosure of Invention
To overcome the disadvantages and shortcomings of the prior art, it is an object of the present invention to provide a drug response prediction method, medium, and apparatus based on incremental width and deep learning. The method learns structural features of the drug's SMILES sequence through a Transformer encoder, solving the problem of being unable to distinguish different atoms in drug molecules and the different interaction information between their associated chemical bonds; a width learning system fuses the drug and gene feature representations, improving the accuracy of drug sensitivity prediction; and the network weights are updated through an incremental learning algorithm, improving model performance without retraining the whole model.
To achieve this purpose, the invention is realized by the following technical scheme: a drug response prediction method with incremental width and deep learning, comprising the following steps:
S1. Perform text encoding and position encoding on the drug's SMILES sequence to obtain a text code T_i and a position code P_i, thereby constructing a drug information code E_i, where i = 1, 2, …, L and L denotes the maximum drug string sequence length;
S2. Input the drug information code E_i into an IBDT model; the IBDT model comprises a Transformer encoder, a multilayer perceptron, and a width learning system;
Input the drug information code E_i into the Transformer encoder to mine the drug features D_F; simultaneously input the gene expression data G_o into the multilayer perceptron to learn the gene features G_F; concatenate the drug features D_F and gene features G_F into a drug-gene feature pair X_DG; and input the feature pair X_DG into the width learning system to obtain the predicted drug sensitivity regression value;
After initial training, the parameters of the IBDT model are fixed; subsequently, newly added samples are used to add feature nodes and enhancement nodes to the width learning system, and the output weight W_DG of the width learning system is dynamically updated through an incremental learning algorithm.
Preferably, in step S1, text encoding of the drug's SMILES sequence means: decomposing the SMILES sequence into single-atom symbols and small-molecule sequences according to chemical prior knowledge, where the atom symbols and small-molecule sequences are expressed in word-vector form.

Position encoding of the drug's SMILES sequence means: encoding the drug's position information with a dictionary-lookup matrix.

The text code T_i and the position code P_i are added to obtain the drug information code E_i:

E_i = T_i + P_i
Preferably, text encoding the SMILES sequence of the drug comprises: decomposing the SMILES sequence into single-atom symbols and small-molecule sequences according to chemical prior knowledge; treating each single-atom symbol and small-molecule sequence as a string word and constructing a vocabulary set D containing strings of different granularities; then using the Torchtext tool library to count and annotate a corpus containing the SMILES sequences of all drugs. Each SMILES sequence is expressed as a sequence string S = {S_1, …, S_L} represented by one-hot vectors, where S_i denotes a word in the vocabulary set D. The text code for each drug is expressed as:

T_i = W_T · O_i^S

where W_T denotes a trainable word-vector matrix and O_i^S denotes the one-hot vector of the i-th string of the sequence string S.
Position encoding the SMILES sequence of the drug comprises:

P_i = W_P · O_i^P

where W_P denotes a weight matrix and O_i^P denotes the one-hot vector of the i-th position of the sequence string S.
Preferably, the drug information code E_i passes through the multi-head self-attention layer of the Transformer encoder, where it is mapped by linear transformation into a Query matrix, a Key matrix, and a Value matrix, denoted Q, K, and V respectively:

Q = E·W_q, K = E·W_k, V = E·W_v

where W_q, W_k, W_v denote learnable weight matrices. The output capturing the attention relationships between the subsequences S_i is obtained with the attention formula:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V

where d_k denotes the dimension of the drug information code E_i. The output A of the multi-head self-attention layer is then fed into a fully connected feedforward neural network:

F = σ(A·W_1 + β_1)·W_2 + β_2

where F denotes the output of the feedforward neural network, W_1 and W_2 denote learnable weight matrices, and β_1 and β_2 denote biases. Finally, the output F is input into a multilayer perceptron to obtain the drug features D_F:

D_F = σ_3(σ_2(σ_1(F·W'_1 + b'_1)·W'_2 + b'_2)·W'_3 + b'_3)

where σ_1, σ_2, σ_3 denote nonlinear activation functions, W'_1, W'_2, W'_3 denote learnable weight matrices, and b'_1, b'_2, b'_3 denote the corresponding biases.
The gene expression data G_o are input into a multilayer perceptron to obtain the gene features G_F; the drug features D_F and gene features G_F are concatenated and integrated into the drug-gene feature pair X_DG = [D_F | G_F]. The raw gene expression data are taken from the Cancer Cell Line Encyclopedia (CCLE) dataset.
Preferably, inputting the gene expression data G_o into the multilayer perceptron means: the multilayer perceptron comprises three hidden layers and three activation layers; the gene features G_F are:

G_F = σ_6(σ_5(σ_4(G_o·W_4 + b_4)·W_5 + b_5)·W_6 + b_6)

where σ_4, σ_5, σ_6 denote nonlinear activation functions, W_4, W_5, W_6 denote learnable weight matrices, and b_4, b_5, b_6 denote the corresponding biases.
Preferably, during initial training of the IBDT model, the feature pairs X_DG formed from the samples are input into the width learning system and mapped into n groups of feature nodes Z^n = [Z_1, …, Z_n] and m groups of enhancement nodes H^m = [H_1, …, H_m]; all feature nodes and enhancement nodes are combined into the input matrix A_DG:

A_DG = [Z_1, …, Z_n | H_1, …, H_m]

The weights W_DG between the feature pairs X_DG and the output Y, i.e., between the features and the true drug sensitivity, are computed with the pseudo-inverse and ridge-regression learning algorithm:

W_DG = A_DG^+ · Y

A_DG^+ = (λI + A_DG^T·A_DG)^{-1}·A_DG^T

where A_DG^+ denotes the pseudo-inverse of the input matrix A_DG, λ denotes a non-negative ridge-regression parameter tending to 0, and I denotes the identity matrix.
Preferably, in IBDT model training, subsequently using added samples to add feature nodes and enhancement nodes to the width learning system and dynamically updating the output weight W_DG through the incremental learning algorithm comprises:

For a newly added sample X_a, first generate its feature pair X_DG^a using the fixed IBDT model parameters; then, within the width learning system, separately map the feature pair X_DG^a of the added sample X_a into new feature nodes and enhancement nodes, and combine all feature nodes and enhancement nodes into the input matrix A_DG^a corresponding to the newly added sample. The input matrix of the model is updated as:

A'_DG = [A_DG ; A_DG^a]

The pseudo-inverse between the newly added feature pairs and the newly added sample output is computed, and the weight information of the newly added samples is obtained with the incremental learning algorithm:

(A'_DG)^+ = [A_DG^+ - B·D^T | B]

where

D^T = A_DG^a·A_DG^+

B = C^+ if C ≠ 0, and B = A_DG^+·D·(I + D^T·D)^{-1} if C = 0,

with C = A_DG^a - D^T·A_DG.

The weight information of the newly added samples is merged into the output weight W_DG of the width learning system, dynamically updating the model output weight:

W'_DG = W_DG + B·(Y_a - A_DG^a·W_DG)

where Y_a denotes the label values of the newly added samples.
A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the above drug response prediction method with incremental width and deep learning.

A computing device comprising a processor and a memory for storing a program executable by the processor, wherein, when executing the program stored in the memory, the processor implements the above drug response prediction method with incremental width and deep learning.
Compared with the prior art, the invention has the following advantages and beneficial effects:
according to the method, when a newly added sample is faced, the network weight can be updated through the incremental learning algorithm without retraining the whole model, and the performance of the model is improved. The model learns the structural characteristics of the SMILES sequence of the drug through a Transformer encoder, and solves the problem that different atoms in drug molecules and different action information among related chemical bonds of the atoms cannot be distinguished; a width learning system is adopted to fuse the drug expression and gene expression characteristics, and the accuracy of the drug sensitivity prediction result is improved.
Drawings
FIG. 1 is a schematic flow diagram of a method for incremental width and deep learning drug response prediction in accordance with the present invention;
FIG. 2 is an architecture diagram of the IBDT model in the drug response prediction method of incremental Width and deep learning according to the present invention;
FIG. 3 is a schematic flow chart of incremental learning in the method for predicting drug response by incremental width and deep learning according to the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Example one
The invention provides a drug response prediction method based on an incremental width learning system and a Transformer model. First, text encoding and position encoding are performed on the drug's SMILES sequence to construct a drug information code. The drug information code is input into a Transformer encoder to mine the drug's structural features, while gene expression data are input into a multilayer perceptron to learn a gene feature representation; the drug features and gene features are concatenated into a drug-gene feature pair, the feature pair is input into the width learning system for training to obtain the final model, and the trained model is used for drug response prediction. The method learns structural features of the drug's SMILES sequence through the Transformer encoder, solving the problem of being unable to distinguish different atoms in drug molecules and the different interaction information between their associated chemical bonds; the width learning system fuses the drug and gene feature representations, improving the accuracy of drug sensitivity prediction. For newly added samples, the whole model does not need to be retrained: the network weights are updated through an incremental learning algorithm, improving model performance.
The flow of the method for predicting the drug response of increment width and deep learning in the embodiment is shown in fig. 1, and comprises the following steps:
S1. Perform text encoding and position encoding on the drug's SMILES sequence to obtain a text code T_i and a position code P_i, thereby constructing a drug information code E_i, where i = 1, 2, …, L and L denotes the maximum drug string sequence length.
Text encoding the drug's SMILES sequence means decomposing it, according to chemical prior knowledge, into single-atom symbols and small-molecule sequences, which are expressed in word-vector form; these symbols and sequences, as word vectors, represent the SMILES sequence, i.e., the drug text encoding.
the text encoding of the SMILES sequence of the drug comprises:
decomposing the SMILES sequence of the drug into a single atomic symbol and a small molecule sequence according to chemical prior knowledge; regarding a single atomic symbol and a small molecule sequence as a character string word, constructing a vocabulary set D containing character strings with different granularity, and then using a Torchtext tool library to perform statistics and labeling on a corpus containing all medicine SMILES sequencesNote that the SMILES sequence is expressed as a sequence string S ═ S 1 ,...,S L And are represented by a one-hot vector, where S i Represents a word in the vocabulary set D; the text code for each drug is expressed as:
Figure BDA0003623542430000081
wherein, W T Representing a matrix of word vectors that can be trained,
Figure BDA0003623542430000082
a one-hot vector representing the ith character string of the sequence string S;
To capture the drugs' position information, the invention generates a position code for each drug, encoding the position information with a dictionary-lookup matrix:

P_i = W_P · O_i^P

where W_P denotes a weight matrix and O_i^P denotes the one-hot vector of the i-th position of the sequence string S.
The text code T_i and the position code P_i are added to obtain the drug information code E_i:

E_i = T_i + P_i
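As an illustration of step S1, the sketch below builds a drug information code E = T + P from a SMILES string. The toy vocabulary, simplified tokenizer, and random embedding matrices are stand-in assumptions, not the patent's actual Torchtext-based pipeline or trained W_T.

```python
import numpy as np

rng = np.random.default_rng(0)

def tokenize_smiles(smiles, multi_char=("Cl", "Br")):
    """Split a SMILES string into atom/fragment tokens (chemical priors heavily simplified)."""
    tokens, i = [], 0
    while i < len(smiles):
        if smiles[i:i + 2] in multi_char:   # two-character atom symbols first
            tokens.append(smiles[i:i + 2]); i += 2
        else:
            tokens.append(smiles[i]); i += 1
    return tokens

def encode_drug(tokens, vocab, W_T, W_P):
    """E_i = T_i + P_i: word-vector lookup plus dictionary position lookup."""
    ids = [vocab[t] for t in tokens]
    T = W_T[ids]                 # text encoding T, one row per token
    P = W_P[:len(ids)]           # position encoding P by lookup
    return T + P                 # drug information code E

vocab = {"C": 0, "N": 1, "O": 2, "=": 3, "(": 4, ")": 5, "Cl": 6}
L, d = 16, 8                              # max sequence length, embedding dimension
W_T = rng.normal(size=(len(vocab), d))    # word-vector matrix (trainable; random here)
W_P = rng.normal(size=(L, d))             # dictionary-lookup position matrix
E = encode_drug(tokenize_smiles("CC(=O)Cl"), vocab, W_T, W_P)  # acetyl chloride
```

In a trained model W_T and W_P would be learned jointly with the encoder; here they only illustrate the lookup-and-add construction of E.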
S2. Input the drug information code E_i into the IBDT model; the IBDT model comprises a Transformer encoder, a multilayer perceptron, and a width learning system, as shown in FIG. 2. The drug information code E_i is input into the Transformer encoder to mine the drug features D_F; simultaneously, the gene expression data G_o are input into the multilayer perceptron to learn the gene features G_F; the drug features D_F and gene features G_F are concatenated into the drug-gene feature pair X_DG. The Transformer encoder module can learn the different interaction information between different atoms in the drug SMILES sequence and their associated chemical bonds, generating a drug feature representation with structural information. The multilayer perceptron module learns a feature representation of the genes. The width learning system module integrates the drug features and gene features, reduces training time cost, and improves the prediction performance of the model.
In particular, the drug information code E_i passes through the multi-head self-attention layer of the Transformer encoder, where it is mapped by linear transformation into a Query matrix, a Key matrix, and a Value matrix, denoted Q, K, and V respectively:

Q = E·W_q, K = E·W_k, V = E·W_v

where W_q, W_k, W_v denote learnable weight matrices. The output capturing the attention relationships between the subsequences S_i is obtained with the attention formula:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k))·V

where d_k denotes the dimension of the drug information code E_i. The output A of the multi-head self-attention layer is then fed into a fully connected feedforward neural network:

F = σ(A·W_1 + β_1)·W_2 + β_2

where F denotes the output of the feedforward neural network, W_1 and W_2 denote learnable weight matrices, and β_1 and β_2 denote biases. Finally, the output F is input into a multilayer perceptron to obtain the output of the Transformer encoder, namely the drug features D_F:

D_F = σ_3(σ_2(σ_1(F·W'_1 + b'_1)·W'_2 + b'_2)·W'_3 + b'_3)

where σ_1, σ_2, σ_3 denote nonlinear activation functions, W'_1, W'_2, W'_3 denote learnable weight matrices, and b'_1, b'_2, b'_3 denote the corresponding biases.
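A minimal numerical sketch of the encoder computations above, using single-head attention, ReLU in the feedforward network, random weights, and toy dimensions; the patent's multi-head layout and trained parameters are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(E, W_q, W_k, W_v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V over the code E (L x d_k)."""
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    d_k = E.shape[1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def feed_forward(A, W1, b1, W2, b2):
    """F = ReLU(A W1 + b1) W2 + b2, the fully connected feedforward network."""
    return np.maximum(A @ W1 + b1, 0.0) @ W2 + b2

L, d = 7, 8                                    # toy sequence length and model dimension
E = rng.normal(size=(L, d))                    # drug information codes E_i stacked as rows
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
A = self_attention(E, W_q, W_k, W_v)           # self-attention output
F = feed_forward(A, rng.normal(size=(d, 16)), np.zeros(16),
                 rng.normal(size=(16, d)), np.zeros(d))
```

Each row of the attention weights sums to 1, so A mixes the value vectors of all positions; F keeps the (L, d) shape expected by the following perceptron.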
The gene expression data G_o are input into a multilayer perceptron, which comprises three hidden layers and three activation layers, to obtain the gene features G_F:

G_F = σ_6(σ_5(σ_4(G_o·W_4 + b_4)·W_5 + b_5)·W_6 + b_6)

where σ_4, σ_5, σ_6 denote nonlinear activation functions, W_4, W_5, W_6 denote learnable weight matrices, and b_4, b_5, b_6 denote the corresponding biases. The raw gene expression data are taken from the Cancer Cell Line Encyclopedia (CCLE) dataset.
The drug features D_F and gene features G_F are concatenated and integrated into the drug-gene feature pair X_DG = [D_F | G_F].
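The gene branch and the concatenation into X_DG can be sketched as follows. The layer widths, tanh activations, random weights, and the stand-in drug feature are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp3(x, Ws, bs, act=np.tanh):
    """Three hidden/activation layers, as in the gene branch (activations assumed tanh)."""
    for W, b in zip(Ws, bs):
        x = act(x @ W + b)
    return x

G_o = rng.normal(size=(1, 512))                 # one gene expression profile (dim. illustrative)
dims = [512, 256, 128, 64]                      # assumed layer widths
Ws = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(dims, dims[1:])]
bs = [np.zeros(b) for b in dims[1:]]
G_F = mlp3(G_o, Ws, bs)                         # gene features G_F

D_F = rng.normal(size=(1, 64))                  # stand-in for the encoder's drug features
X_DG = np.concatenate([D_F, G_F], axis=1)       # feature pair X_DG = [D_F | G_F]
```

The concatenation keeps both modalities side by side, leaving their fusion to the width learning system rather than to another dense layer.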
The feature pairs X_DG are input into the width learning system to obtain predicted drug sensitivity regression values.
After initial training, the parameters of the IBDT model are fixed; subsequently, newly added samples are used to add feature nodes and enhancement nodes to the width learning system, and the output weight W_DG of the width learning system is dynamically updated through the incremental learning algorithm.
During initial training of the IBDT model, the feature pairs X_DG formed from the samples are input into the width learning system and mapped into n groups of feature nodes Z^n = [Z_1, …, Z_n] and m groups of enhancement nodes H^m = [H_1, …, H_m]; all feature nodes and enhancement nodes are combined into the input matrix A_DG:

A_DG = [Z_1, …, Z_n | H_1, …, H_m]

The weights W_DG between the feature pairs X_DG and the output Y, i.e., between the features and the true drug sensitivity, are computed with the pseudo-inverse and ridge-regression learning algorithm:

W_DG = A_DG^+ · Y

A_DG^+ = (λI + A_DG^T·A_DG)^{-1}·A_DG^T

where A_DG^+ denotes the pseudo-inverse of the input matrix A_DG, λ denotes a non-negative ridge-regression parameter tending to 0, and I denotes the identity matrix.
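Under the simplifying assumptions of tanh node mappings and a single collapsed group each of feature and enhancement nodes (the patent uses n and m groups), the width learning system's training step reduces to the ridge pseudo-inverse above; all dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

def map_nodes(X, We, be, Wh, bh):
    """Feature nodes Z = phi(X We + be), enhancement nodes H = xi(Z Wh + bh), A = [Z | H]."""
    Z = np.tanh(X @ We + be)                 # feature nodes (n groups collapsed into one)
    H = np.tanh(Z @ Wh + bh)                 # enhancement nodes (m groups collapsed)
    return np.concatenate([Z, H], axis=1)

def ridge_weights(A, Y, lam=1e-6):
    """W = A^+ Y with A^+ = (lam*I + A^T A)^{-1} A^T (ridge-regression pseudo-inverse)."""
    return np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)

N, d_x, d_z, d_h = 200, 128, 32, 16
X_DG = rng.normal(size=(N, d_x))             # drug-gene feature pairs
Y = rng.normal(size=(N, 1))                  # IC50-style regression targets
We, be = rng.normal(size=(d_x, d_z)), rng.normal(size=d_z)
Wh, bh = rng.normal(size=(d_z, d_h)), rng.normal(size=d_h)
A_DG = map_nodes(X_DG, We, be, Wh, bh)
W_DG = ridge_weights(A_DG, Y)
Y_hat = A_DG @ W_DG                          # predicted drug-sensitivity values
```

Because W_DG comes from a single linear solve rather than gradient descent, training the output layer is fast, which is the property the incremental update below exploits.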
After the IBDT model is initially trained, its parameters are fixed, and newly added samples generate drug-gene feature pairs through the Transformer encoder and multilayer perceptron of the IBDT model. As shown in FIG. 3, for a newly added sample X_a, its feature pair X_DG^a is first generated using the fixed IBDT model parameters; then, within the width learning system, the feature pair X_DG^a of the added sample X_a is separately mapped into new feature nodes and enhancement nodes, enriching the feature space of the original model; all feature nodes and enhancement nodes are combined into the input matrix A_DG^a corresponding to the newly added sample. The input matrix of the model is updated as:

A'_DG = [A_DG ; A_DG^a]
the output weights of the network are dynamically updated through an incremental learning algorithm, new knowledge is learned, a knowledge base is updated, and the whole network does not need to be retrained.
The pseudo-inverse between the newly added feature pairs and the newly added sample output is computed, and the weight information of the newly added samples is obtained with the incremental learning algorithm:

(A'_DG)^+ = [A_DG^+ - B·D^T | B]

where

D^T = A_DG^a·A_DG^+

B = C^+ if C ≠ 0, and B = A_DG^+·D·(I + D^T·D)^{-1} if C = 0,

with C = A_DG^a - D^T·A_DG.

The weight information of the newly added samples is merged into the output weight W_DG of the width learning system, dynamically updating the model output weight:

W'_DG = W_DG + B·(Y_a - A_DG^a·W_DG)

where Y_a denotes the label values of the newly added samples.
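A sketch of this incremental update, following the standard broad-learning-system formulas for newly added inputs; the patent's own equation images are not legible here, so the notation and the tiny dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def ridge_pinv(A, lam=1e-8):
    """Ridge pseudo-inverse: (lam*I + A^T A)^{-1} A^T, with lam tending to 0."""
    return np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T)

# --- initial training on the first batch of mapped nodes ---
A = rng.normal(size=(50, 12))        # A_DG: combined feature/enhancement nodes
Y = rng.normal(size=(50, 1))         # drug-sensitivity labels
A_pinv = ridge_pinv(A)
W = A_pinv @ Y                       # W_DG = A_DG^+ Y

# --- a new batch arrives: update W without retraining on all data ---
A_a = rng.normal(size=(5, 12))       # nodes mapped from the newly added samples
Y_a = rng.normal(size=(5, 1))        # their label values
D_T = A_a @ A_pinv                   # D^T = A_a A^+
C = A_a - D_T @ A                    # part of A_a outside the old row space
if np.linalg.norm(C) > 1e-6:
    B = ridge_pinv(C)                # B = C^+
else:                                # new rows already lie in the old row space
    B = A_pinv @ D_T.T @ np.linalg.inv(np.eye(len(A_a)) + D_T @ D_T.T)
W_new = W + B @ (Y_a - A_a @ W)      # dynamically updated output weight
```

With exact pseudo-inverses, W_new coincides with retraining on the stacked matrix [A; A_a], which is what makes the update attractive when sample batches arrive over time.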
Drug response prediction models based purely on deep learning, on deep learning with a Transformer, on width learning with a deep Transformer, and on the incremental width and deep Transformer were each trained and tested. The results show that introducing the Transformer model better extracts the interaction information between different atoms and their associated chemical bonds in drug molecules; the width learning system better fuses the drug and gene features and improves the prediction performance of the model; and introducing incremental learning further improves the performance of the prediction model.
The method is effectively based on an incremental width and deep learning model: through the Transformer encoder's learning of the structured drug information codes, it solves the problem of being unable to distinguish different atoms in drug molecules and the different interaction information between their associated chemical bonds; the width learning system fuses the drug and gene features, improving the accuracy of the model's predictions; and, exploiting the dynamic extensibility of the width learning system, new knowledge is learned from new samples without retraining the whole network, improving model performance. Reasonable drug response prediction with this method helps biologists conduct in vitro clinical tests, aids the design and research of new drugs, and greatly benefits medical scientists in designing personalized cancer treatment plans.
Embodiment 2
This embodiment is a storage medium storing a computer program which, when executed by a processor, causes the processor to perform the incremental width and deep learning drug response prediction method of Embodiment 1.
Embodiment 3
This embodiment is a computing device comprising a processor and a memory storing a program executable by the processor; when the processor executes the program stored in the memory, it implements the incremental width and deep learning drug response prediction method of Embodiment 1.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention should be construed as an equivalent and is intended to fall within the scope of the present invention.

Claims (9)

1. A method for predicting drug response with incremental width and deep learning, comprising the following steps:
S1, text-coding and position-coding the SMILES sequence of the drug to obtain a text code T_i and a position code P_i, thereby constructing a drug information code E_i, where i = 1, 2, …, L and L denotes the maximum drug string sequence length;
S2, inputting the drug information codes E_i into an IBDT model, the IBDT model comprising a Transformer encoder, a multilayer perceptron, and a width learning system;
inputting the drug information codes E_i into the Transformer encoder to mine the drug features D_F, while inputting the gene expression data G_o into a multilayer perceptron to learn the gene features G_F; splicing the drug features D_F and the gene features G_F together to form drug-gene feature pairs X_DG; and inputting the feature pairs X_DG into the width learning system to obtain predicted drug sensitivity regression values;
wherein the parameters of the IBDT model are fixed after initial training, feature nodes and enhancement nodes are subsequently added to the width learning system using newly added samples, and the output weight W_DG of the width learning system is dynamically updated through an incremental learning algorithm.
2. The incremental width and deep learning drug response prediction method of claim 1, wherein in step S1, text-coding the SMILES sequence of the drug means: decomposing the SMILES sequence of the drug into single atom symbols and small-molecule sequences according to chemical prior knowledge, the atom symbols and small-molecule sequences being expressed in word-vector form;
position-coding the SMILES sequence of the drug means: encoding the position information of the drug using a dictionary lookup matrix;
the text code T_i and the position code P_i are added to obtain the drug information code E_i:
E_i = T_i + P_i
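As an illustrative sketch (not part of the claims), the additive text-plus-position encoding of step S1 can be written in a few lines of numpy; the vocabulary size, embedding dimension, and token ids below are invented for the example, and a row lookup into the embedding matrix is equivalent to multiplying it by a one-hot vector:

```python
import numpy as np

rng = np.random.default_rng(0)

L = 5        # max SMILES sequence length (padded)
V = 64       # vocabulary size (atom symbols + small-molecule substrings)
d = 8        # embedding dimension

W_T = rng.normal(size=(V, d))    # trainable word-vector matrix (text encoding)
W_P = rng.normal(size=(L, d))    # position lookup matrix (position encoding)

token_ids = np.array([3, 17, 17, 5, 0])   # toy tokenised SMILES sequence

T = W_T[token_ids]               # text codes T_i (row lookup == one-hot product)
P = W_P[np.arange(L)]            # position codes P_i
E = T + P                        # drug information codes E_i = T_i + P_i
```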
3. The incremental width and deep learning drug response prediction method of claim 2, wherein the text coding of the SMILES sequence of the drug comprises:
decomposing the SMILES sequence of the drug into single atom symbols and small-molecule sequences according to chemical prior knowledge; treating each single atom symbol and small-molecule sequence as a character-string word, constructing a word set D containing character strings of different granularities, then using the Torchtext tool library to count and label a corpus containing the SMILES sequences of all drugs, and expressing each SMILES sequence as a sequence string S = {S_1, …, S_L} represented by one-hot vectors, where S_i denotes a word in the word set D; the text code of each drug is expressed as:
T_i = W_T O_i^S
where W_T denotes a trainable word-vector matrix and O_i^S denotes the one-hot vector of the i-th character string of the sequence string S;
the position coding of the SMILES sequence of the drug comprises:
P_i = W_P O_i^P
where W_P denotes a weight matrix and O_i^P denotes the one-hot vector of the i-th position of the sequence string S.
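The decomposition into single atom symbols and small-molecule substrings can be sketched with a regular expression; the pattern below is a common community tokenisation heuristic (two-letter elements and bracket atoms kept whole), not the exact rule or word set used by the patent:

```python
import re

# Keep bracket atoms ([nH], [O-]) and two-letter elements (Br, Cl, ...) whole
# instead of splitting them into single characters.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]|[0-9]|[=#\-\+\(\)/\\%@\.])"
)

def tokenize_smiles(smiles: str):
    tokens = SMILES_PATTERN.findall(smiles)
    # The tokens must reassemble the input, or the pattern missed a symbol.
    assert "".join(tokens) == smiles, "tokeniser dropped characters"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Each resulting token would then be looked up in the word set D and replaced by its one-hot index before embedding.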
4. The incremental width and deep learning drug response prediction method of claim 1, wherein the drug information codes E_i, passing through the multi-head self-attention layer of the Transformer encoder, are mapped by linear transformations into a Query matrix, a Key matrix, and a Value matrix, denoted Q, K, and V respectively:
Q = E W_q, K = E W_k, V = E W_v
where W_q, W_k, W_v denote learnable weight matrices; the output expressing the attention relationships between the subsequences S_i is obtained with the attention calculation formula:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where d_k denotes the dimension of the drug information codes E_i; the output A of the multi-head self-attention layer is then fed into a fully connected feedforward neural network:
F = σ(A W_1 + β_1) W_2 + β_2
where F denotes the output of the feedforward neural network, W_1 and W_2 denote learnable weight matrices, and β_1, β_2 denote the biases; finally, the output F is input into a multilayer perceptron to obtain the drug features D_F:
D_F = σ_3(σ_2(σ_1(F W_1^D + β_1^D) W_2^D + β_2^D) W_3^D + β_3^D)
where σ_1, σ_2, σ_3 denote nonlinear activation functions, W_1^D, W_2^D, W_3^D denote learnable weight matrices, and β_1^D, β_2^D, β_3^D denote the corresponding biases.
The gene expression data G_o are input into a multilayer perceptron to obtain the gene features G_F; the drug features D_F and gene features G_F are spliced and integrated to form the drug-gene feature pair X_DG = [D_F | G_F].
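The single-head core of the attention computation in claim 4 can be sketched in numpy (random matrices stand in for the learned weights W_q, W_k, W_v; multi-head splitting and the feedforward sub-layer are omitted for brevity):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
L, d_k = 5, 8                          # sequence length, code dimension

E = rng.normal(size=(L, d_k))          # drug information codes E_i
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))

Q, K, V = E @ W_q, E @ W_k, E @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_k))   # L x L attention weights
attn = weights @ V                          # Attention(Q, K, V)
```

Row i of `weights` expresses how strongly position i attends to every other token of the SMILES sequence, which is what lets the model relate an atom to its associated chemical bonds.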
5. The incremental width and deep learning drug response prediction method of claim 4, wherein inputting the gene expression data G_o into the multilayer perceptron means: the multilayer perceptron comprises three hidden layers and three activation layers, and the gene features G_F are:
G_F = σ_6(σ_5(σ_4(G_o W_1^G + β_1^G) W_2^G + β_2^G) W_3^G + β_3^G)
where σ_4, σ_5, σ_6 denote activation functions, W_1^G, W_2^G, W_3^G denote learnable weight matrices, and β_1^G, β_2^G, β_3^G denote the corresponding biases.
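A three-hidden-layer perceptron of the kind described in claim 5 is a short composition of affine maps and activations; the layer widths and ReLU activations below are illustrative assumptions, not the patented configuration:

```python
import numpy as np

def mlp3(x, Ws, bs, acts):
    """Three hidden layers with activations sigma_4..sigma_6 applied in turn."""
    h = x
    for W, b, act in zip(Ws, bs, acts):
        h = act(h @ W + b)
    return h

rng = np.random.default_rng(0)
g, h1, h2, h3 = 32, 64, 32, 16        # gene-expression dim and hidden widths
Ws = [rng.normal(size=s) for s in [(g, h1), (h1, h2), (h2, h3)]]
bs = [np.zeros(h1), np.zeros(h2), np.zeros(h3)]
relu = lambda x: np.maximum(x, 0)

G_o = rng.normal(size=g)              # one cell line's gene expression vector
G_F = mlp3(G_o, Ws, bs, [relu, relu, relu])
```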
6. The incremental width and deep learning drug response prediction method of claim 1, wherein during initial training of the IBDT model, the feature pairs X_DG formed from the samples are input into the width learning system and mapped into n groups of feature nodes
Z^n = [Z_1, …, Z_n]
and m groups of enhancement nodes
H^m = [H_1, …, H_m];
all the feature nodes and enhancement nodes are combined to obtain the input matrix A_DG:
A_DG = [Z^n | H^m]
The weights W_DG between the feature pairs X_DG and the outputs Y are computed by a pseudo-inverse and ridge-regression learning algorithm:
W_DG = A_DG^+ Y
A_DG^+ = (λI + A_DG^T A_DG)^{-1} A_DG^T
where A_DG^+ denotes the pseudo-inverse of the input matrix A_DG; λ denotes a non-negative number tending to 0 in the ridge regression; and I denotes an identity matrix.
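A minimal numpy sketch of this initial width-learning step follows; the node counts, tanh activations, and random mapping weights are illustrative assumptions standing in for the trained IBDT front end, and with λ → 0 the ridge solution coincides with the Moore-Penrose pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Fdim, n, m, k = 200, 16, 4, 3, 8   # samples, feature dim, node groups, nodes/group

X_DG = rng.normal(size=(N, Fdim))     # drug-gene feature pairs
Y = rng.normal(size=(N, 1))           # drug-sensitivity regression targets

# n groups of feature nodes Z_i and m groups of enhancement nodes H_j
Z = [np.tanh(X_DG @ rng.normal(size=(Fdim, k)) + rng.normal(size=k))
     for _ in range(n)]
Zn = np.hstack(Z)
H = [np.tanh(Zn @ rng.normal(size=(n * k, k)) + rng.normal(size=k))
     for _ in range(m)]
A_DG = np.hstack([Zn] + H)            # input matrix A_DG = [Z^n | H^m]

lam = 1e-8                            # ridge parameter, tending to 0
A_pinv = np.linalg.solve(lam * np.eye(A_DG.shape[1]) + A_DG.T @ A_DG, A_DG.T)
W_DG = A_pinv @ Y                     # output weights W_DG = A_DG^+ Y
```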
7. The incremental width and deep learning drug response prediction method of claim 6, wherein, in the IBDT model training, subsequently using the newly added samples to add feature nodes and enhancement nodes to the width learning system and dynamically updating the output weight W_DG of the learning system through the incremental learning algorithm comprises:
for a newly added sample X_a, first generating its feature pair X_DG^a with the fixed IBDT model parameters; then, in the width learning system, separately mapping the feature pair X_DG^a of the newly added sample X_a into new feature nodes and enhancement nodes, and combining all these feature nodes and enhancement nodes to obtain the input matrix A_DG^a corresponding to the newly added sample; the input matrix of the model is updated as:
A_DG^new = [A_DG ; A_DG^a]
The pseudo-inverse between the newly added feature pairs and the newly added sample outputs is calculated, and the weight information of the newly added samples is obtained with the incremental learning algorithm:
(A_DG^new)^+ = [A_DG^+ − B D^T | B]
where
B = C^+, if C ≠ 0; B = A_DG^+ D (I + D^T D)^{-1}, if C = 0
C = A_DG^a − D^T A_DG
and
D^T = A_DG^a A_DG^+
The weight information of the newly added samples is merged into the output weight W_DG of the width learning system, dynamically updating the model output weights:
W_DG^new = W_DG + B (Y_a − A_DG^a W_DG)
where Y_a denotes the label values of the newly added samples.
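As a numerical sanity check (illustrative only, with random matrices standing in for the nodes mapped from old and new samples), this Greville-style incremental update can be verified against retraining on the stacked data from scratch; when the old input matrix has full column rank, C vanishes and the second branch applies:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, Na = 120, 24, 8             # old samples, total node count, new samples

A = rng.normal(size=(N, M))       # existing input matrix A_DG
Y = rng.normal(size=(N, 1))       # drug-sensitivity labels
A_pinv = np.linalg.pinv(A)
W = A_pinv @ Y                    # initial output weights W_DG = A_DG^+ Y

Aa = rng.normal(size=(Na, M))     # nodes mapped from the newly added samples
Ya = rng.normal(size=(Na, 1))     # their label values Y_a

# Incremental update: only the output weights change, the front end is fixed.
Dt = Aa @ A_pinv                  # D^T
C = Aa - Dt @ A
if np.linalg.norm(C) > 1e-8:
    B = np.linalg.pinv(C)         # B = C^+ when C != 0
else:
    D = Dt.T                      # full-column-rank case: C = 0
    B = A_pinv @ D @ np.linalg.inv(np.eye(Na) + Dt @ D)
W_new = W + B @ (Ya - Aa @ W)     # merged output weights W_DG^new
```

The update reproduces the weights obtained by recomputing the pseudo-inverse on the stacked matrix [A_DG ; A_DG^a], which is exactly why the whole network need not be retrained.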
8. A storage medium storing a computer program which, when executed by a processor, causes the processor to perform the incremental width and deep learning drug response prediction method of any one of claims 1-7.
9. A computing device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the incremental width and deep learning drug response prediction method of any one of claims 1-7.
CN202210464986.2A 2022-04-29 2022-04-29 Incremental width and depth learning drug response prediction methods, media and devices Active CN114841261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210464986.2A CN114841261B (en) 2022-04-29 2022-04-29 Incremental width and depth learning drug response prediction methods, media and devices


Publications (2)

Publication Number Publication Date
CN114841261A true CN114841261A (en) 2022-08-02
CN114841261B CN114841261B (en) 2024-08-02


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115761250A (en) * 2022-11-21 2023-03-07 北京科技大学 Compound inverse synthesis method and device
CN116403657A (en) * 2023-03-20 2023-07-07 本源量子计算科技(合肥)股份有限公司 Drug response prediction method and device, storage medium and electronic device
CN117275608A (en) * 2023-09-08 2023-12-22 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190272468A1 (en) * 2018-03-05 2019-09-05 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Spatial Graph Convolutions with Applications to Drug Discovery and Molecular Simulation
CN113764038A (en) * 2021-08-31 2021-12-07 华南理工大学 Method for constructing myelodysplastic syndrome whitening gene prediction model
CN114220496A (en) * 2021-11-30 2022-03-22 华南理工大学 Deep learning-based inverse synthesis prediction method, device, medium and equipment
WO2022087540A1 (en) * 2020-10-23 2022-04-28 The Regents Of The University Of California Visible neural network framework





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant