WO2023038501A1 - System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix - Google Patents

System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix

Info

Publication number
WO2023038501A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
neural network
similarity matrix
convolutional neural network model
Prior art date
Application number
PCT/KR2022/013647
Other languages
French (fr)
Korean (ko)
Inventor
심주용
황창하
손인석
Original Assignee
주식회사 아론티어
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 아론티어 filed Critical 주식회사 아론티어
Publication of WO2023038501A1 publication Critical patent/WO2023038501A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30Prediction of properties of chemical compounds, compositions or mixtures
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60In silico combinatorial chemistry
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Definitions

  • Precision medicine aims to elaborately select cancer treatments based on the genetic information of each patient. One of the most important problems in precision medicine is predicting the anticancer drug response for each patient. Because of tumor heterogeneity, patients with the same type of cancer may have different responses to similar drugs. Therefore, it is very important to provide a predictive method that reveals the relationship between genomic information and drug response, which can be helpful for precision medicine.
  • Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines treated with multiple anticancer drugs. These large datasets allow the development of methods for predicting patient-specific drug response. Generally, methods for predicting drug response fall into two categories: classification approaches that predict sensitive drug-cell line pairs, and regression approaches that predict a quantitative measure of a cell line's response to a drug.
  • Various regression analysis approaches have been proposed to predict drug response using gene expression profiles or other molecular information of a cell line. Some prediction methods have improved drug response prediction by incorporating drug information, such as the chemical substructure of the drug, together with cell line information. In addition, numerous machine learning methods have been applied to the drug response prediction problem, including lasso (least absolute shrinkage and selection operator), elastic nets, random forests, kernel-based methods, neural networks, and deep learning. Ali and Aittokallio provide a comprehensive recent review (Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31-39 (2019)). Recent advances in deep learning have opened a new avenue for finding regression models for predicting drug response, ultimately providing a more accurate tool for predicting treatment response.
  • CaDRReS is a matrix factorization-based recommender system that can predict drug responses for new drugs and new cell lines by learning representations of drugs and cell lines in a latent space (Suphavilai, C., Bertrand, D. & Nagarajan, N. Predicting cancer drug response using a recommender system. Bioinformatics 34, 3907-3914 (2018)).
  • CDRscan is an ensemble model that includes five convolutional neural networks (CNNs). It uses the mutation profile of the cell line and the chemical substructure of the drug as input features for the CNNs.
  • CDCN (cell line-drug complex network) predicts drug response by inferring information from a simple network composed of cell lines and drugs.
  • ADRML (Anticancer Drug Response Prediction using Manifold Learning) maps drug response values to a low-dimensional latent space and computes drug response values for new cell line-drug pairs from that latent space (Moughari, F. A. & Eslahchi, C. ADRML: anticancer drug response prediction using manifold learning. Sci. Rep. 10, 14245 (2020)). It takes into account several types of cell line similarity and drug similarity and utilizes them in the manifold learning procedure. ADRML has been shown to provide accurate and robust predictions.
  • One embodiment provides a method of predicting the drug response of a drug and a cell line using a convolutional neural network model, the method comprising: preparing a first similarity matrix between drugs; preparing a second similarity matrix between cell lines; calculating the cross product between the i-th column vector (i = 1, 2, ..., m, where m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n, where n is an integer) of the second similarity matrix; learning the convolutional neural network model by taking the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value; and predicting and outputting the drug response value of a new drug and cell line using the learned convolutional neural network model.
  • In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction method is thereby provided.
  • RBF radial basis function
  • In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line.
  • In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
  • In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction method is thereby provided.
  • In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction method is thereby provided.
  • the convolutional neural network model may be a 2-dimensional model.
  • the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  • In another embodiment, the convolutional neural network model is trained using the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and the learned convolutional neural network model is used to predict the drug response value of a new drug and cell line; a drug response prediction system is thereby provided.
  • In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction system is thereby provided.
  • RBF radial basis function
  • In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction system is thereby provided.
  • In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
  • IC 50 half-maximal inhibitory concentration
  • In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction system is thereby provided.
  • In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction system is thereby provided.
  • the convolutional neural network model may be a 2-dimensional model.
  • the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  • One embodiment discloses the Dr.CNN model to solve the problem of predicting drug response values.
  • Dr.CNN simply divides the RBF kernel matrix associated with the cell lines into two sub-matrices and computes the cross product of each column vector of the Tanimoto similarity matrix with each column vector of each RBF kernel sub-matrix. These cross products are then used as the input values of the two respective CNN models, which allows faster computation and improved prediction performance through ensemble learning.
  • the RBF kernel matrix may be randomly divided and may be divided into two or more sub-matrices according to the number of cell lines.
  • Dr.CNN is the first non-linear method that applies 2D CNN to the cross product between the column vectors of the drug's Tanimoto similarity matrix and the column vectors of the cell line's RBF kernel matrix for drug response prediction.
  • Experimental results show that Dr.CNN exceeds the performance of existing models such as elastic net, RF, SVR, and the 1D CNN ensemble. Dr.CNN can be further improved by adjusting the CNN architecture according to the data structure.
  • The main idea behind Dr.CNN is to integrate the two modalities using the cross product and apply a CNN to the resulting matrix.
  • Dr.CNN is a very effective approach for drug response prediction and can play a huge role in the drug development process.
  • Figure 1 shows the overall workflow of an embodiment proposed for the prediction of drug response values.
  • FIG. 2 shows a summary of the GDSC1 and GDSC2 datasets used in one example.
  • Figure 3 shows scatterplots of the drug response values predicted by one embodiment versus the measured values of the GDSC2 and GDSC1 datasets.
  • Figure 4 shows the workflow of an ensemble 1D CNN model for prediction of drug response values.
  • FIG. 5 shows the architecture of a CNN submodel based on similarity according to an embodiment.
  • FIG. 6 is a block diagram showing an example of a system for implementing a drug response prediction method according to an embodiment.
  • One embodiment of the present invention provides a similarity-based ensemble deep learning model that predicts drug response values by applying a two-dimensional CNN to the outer product of a column vector of the drug Tanimoto similarity matrix and a column vector of the cell line radial basis function (RBF) kernel matrix.
  • In one embodiment, this model is referred to as Dr.CNN.
  • Figure 1 shows the overall workflow of Dr.CNN proposed for predicting drug response values.
  • The Dr.CNN workflow predicts the final drug response value based on the drug response values obtained from the two sub-networks running inside the ensemble deep learning architecture, for example, as their average.
  • Using the cross product of the drug similarity vector and the cell line similarity vector as an input value, a 2D CNN is used to learn features.
  • Each CNN may be composed of, for example, two convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and three fully connected (FC) layers. A detailed description of the CNN structure is given later with reference to FIG. 5.
  • In one embodiment, the RBF kernel matrix of the cell lines is divided into two sub-matrices, the cross product of the column vectors of the Tanimoto similarity matrix and the column vectors of each RBF kernel sub-matrix is built, and it is used as the input value of the CNN.
  • the RBF kernel matrix of cell lines may be divided into two or more sub-matrices according to the number of cell lines.
  • An embodiment may consist of two steps. In the first step, the Tanimoto similarity matrix of the drugs and the RBF kernel matrix of the cell lines are calculated, and then the cross product between the column vectors of the Tanimoto similarity matrix and the column vectors of each RBF kernel sub-matrix is computed.
  • In the second step, a 2D CNN model is applied to extract features from the outer product and to predict drug response values in each of the two sub-networks.
  • In one embodiment, the final prediction of the drug response value may be obtained by combining the drug response values from the two learned sub-networks, for example, by taking their average, as in the sketch below.
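  • As an illustration only, the following Python sketch shows this ensemble step with NumPy, assuming two already-trained sub-models that expose a predict method; the helper names (split_rbf_kernel, ensemble_predict) are hypothetical and not taken from the original.

```python
import numpy as np

def split_rbf_kernel(C, n_parts=2, seed=0):
    """Randomly split the columns of the n x n cell line RBF kernel matrix C
    into n_parts sub-matrices (two by default), one per CNN sub-network."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(C.shape[1])
    return [C[:, idx] for idx in np.array_split(order, n_parts)]

def ensemble_predict(sub_models, X):
    """Final drug response value = average of the predictions of the trained
    sub-networks for the same cross-product inputs X."""
    preds = np.stack([m.predict(X).ravel() for m in sub_models])
    return preds.mean(axis=0)
```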
  • Dr.CNN is validated by comparing it to other machine learning and deep learning models in terms of root mean squared error (RMSE), concordance index (CI), Pearson's correlation coefficient r, and the modified squared correlation coefficient r_m^2.
  • CI is the rank correlation between observed data and predicted data.
  • Embodiments of the present invention may help users predict drug response values.
  • SMILES Simple Molecular-Input Line-Entry System
  • In one embodiment, a publicly available database, GDSC, is used for the drug responses observed in all pairs of cell lines and drugs.
  • The SMILES of the GDSC drugs are obtained from PubChem.
  • GDSC consists of the GDSC1 (Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740-754 (2016)) and GDSC2 (Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019)) datasets.
  • One embodiment uses the GDSC1 and GDSC2 datasets for evaluation of drug response prediction.
  • the GDSC1 dataset tested 681 cell lines across 234 compounds using Resazurin or Syto60 assays.
  • the GDSC2 dataset tested 588 cell lines across 147 compounds with the CellTitreGlo assay.
  • The GDSC1 dataset contains 131,894 drug response values actually observed for pairs of the 234 drugs and 681 cell lines, measured as IC50 values.
  • The GDSC2 dataset includes 72,393 drug response values observed for pairs of the 147 drugs and 588 cell lines. Table 1 shows these two datasets in the form used in the actual experiments. In one embodiment, the IC50 values converted to log space are used.
  • Figure 2 shows a summary of the GDSC1 dataset and the GDSC2 dataset.
  • FIG. 2(a) and (b) show the distributions of drug response values in the GDSC1 and GDSC2 datasets, respectively, and FIG. 2(c) and (d) show the distributions of SMILES string lengths of drugs in the GDSC1 and GDSC2 datasets, respectively. For the drug response values of the GDSC1 dataset, the mean and standard deviation are -0.9032 and 1.1777, respectively; for the GDSC2 dataset, they are -1.2472 and 1.2182, respectively.
  • In the GDSC1 dataset, the maximum SMILES length of the drugs is 133 and the average is 62.
  • the maximum SMILES length for drugs in the GDSC2 dataset is 126, and the average is 62.
  • For Dr.CNN, a drug-drug similarity matrix and a cell line-cell line similarity matrix are used for the GDSC1 and GDSC2 datasets. These two matrices are denoted D (of size m x m) and C (of size n x n), respectively. To create an ensemble model, one embodiment divides the matrix C into two sub-matrices, i.e., C = [C_1, C_2], where C_1 and C_2 together contain the n columns of C. The input value of Dr.CNN for each drug-cell line pair is the cross product of d_i and c_j, where d_i is the i-th column of the drug similarity matrix D, c_j is the j-th column of the cell line similarity matrix C, and the superscript t denotes the transpose of a vector. The cross product d_i ⊗ c_j is defined as Equation (1) below:

$$d_i \otimes c_j = d_i c_j^{t} = \begin{pmatrix} d_{1i}c_{1j} & d_{1i}c_{2j} & \cdots & d_{1i}c_{nj} \\ d_{2i}c_{1j} & d_{2i}c_{2j} & \cdots & d_{2i}c_{nj} \\ \vdots & \vdots & \ddots & \vdots \\ d_{mi}c_{1j} & d_{mi}c_{2j} & \cdots & d_{mi}c_{nj} \end{pmatrix} \qquad (1)$$

  • This cross product yields two kinds of information: the bimodal interactions and the primitive unimodal representations of the individual modalities. Therefore, d_i ⊗ c_j includes all combinations of the information in d_i and c_j, which indicates that it can be a more effective input value for predicting the drug response of the i-th drug and the j-th cell line than a simple concatenation of d_i and c_j. A minimal construction of this input is sketched below.
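  • As an illustration only, a minimal NumPy sketch of building the cross-product input of Equation (1) for one drug-cell line pair; the matrix names D and C follow the notation introduced above, and the random matrices are placeholders.

```python
import numpy as np

def cross_product_input(D, C, i, j):
    """Outer product of the i-th column of the m x m drug similarity matrix D
    and the j-th column of the n x n cell line similarity matrix C
    (Equation (1)); the result is an m x n matrix d_i c_j^t."""
    d_i = D[:, i]              # Tanimoto similarities of drug i to all m drugs
    c_j = C[:, j]              # RBF similarities of cell line j to all n cell lines
    return np.outer(d_i, c_j)

# toy usage with random symmetric stand-ins for the similarity matrices
m, n = 4, 6
D = np.random.rand(m, m); D = (D + D.T) / 2
C = np.random.rand(n, n); C = (C + C.T) / 2
X_ij = cross_product_input(D, C, i=1, j=2)   # shape (4, 6)
```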
  • the drug-drug similarity is calculated using the Tanimoto coefficient T, which is the most popular similarity measure for comparing chemical structures represented by fingerprints.
  • T Tanimoto coefficient
  • One embodiment uses RDKit's topological fingerprint.
  • the Tanimoto similarity scale has a value from 0 to 1, and can be interpreted as a percentage of a property shared by two drugs.
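  • As an illustration only, a sketch of computing such a drug-drug Tanimoto similarity matrix with RDKit (assumed installed); the document mentions both RDKit's topological fingerprint and a 2048-bit ECFP, so the fingerprint choice below is illustrative.

```python
import numpy as np
from rdkit import Chem, DataStructs

def tanimoto_similarity_matrix(smiles_list):
    """m x m drug-drug similarity matrix from SMILES strings, using RDKit's
    topological fingerprint and the Tanimoto coefficient."""
    fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    m = len(fps)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            D[i, j] = DataStructs.TanimotoSimilarity(fps[i], fps[j])
    return D

# toy usage: three small molecules
D = tanimoto_similarity_matrix(["CCO", "CCN", "c1ccccc1"])   # values in [0, 1]
```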
  • In one embodiment, the cell line-cell line similarity can be calculated from the gene expression vectors using the RBF kernel described by Equation (2):

$$K(x_j, x_k) = \exp\!\left(-\frac{\lVert x_j - x_k \rVert^2}{2\sigma^2}\right) \qquad (2)$$

where x_j and x_k are the gene expression vectors of two cell lines and σ is the kernel bandwidth.
  • The RBF kernel is a popular kernel function used in various kernel learning algorithms.
  • The RBF kernel is a measure of the degree of similarity between vectors.
  • The value of the RBF kernel decreases with the distance between the two vectors and lies in the range from 0 (in the limit of large distance) to 1 (when the two vectors coincide). That is, if two vectors are close to each other, the distance is small and the kernel value is close to 1; as the distance increases, the kernel value gradually decreases toward 0. Thus, nearby vectors have larger RBF kernel values than distant vectors.
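  • As an illustration only, a sketch of the cell line-cell line similarity computation with scikit-learn's rbf_kernel; the kernel bandwidth is not specified in the document, so the library default (1 divided by the number of genes) is used as a placeholder.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def cell_line_similarity_matrix(expr, gamma=None):
    """n x n cell line-cell line similarity matrix from gene expression vectors
    using the RBF kernel of Equation (2); expr has shape (n_cell_lines, n_genes)."""
    return rbf_kernel(expr, gamma=gamma)

# toy usage: 5 cell lines with 20 gene expression values each
expr = np.random.rand(5, 20)
C = cell_line_similarity_matrix(expr)   # values in (0, 1], diagonal equal to 1
```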
  • In one embodiment, the output value for each drug-cell line pair corresponds to the log-transformed IC50 value.
  • the predictive performance of the GDSC2 dataset was evaluated by a 5-fold cross validation experiment. This technique randomly partitions the dataset into 5 folds of approximately equal size. One fold is treated as a validation set, and learning is applied to the remaining 4 folds. This procedure is repeated 5 times, in each procedure a different instance group is treated as a validation set.
  • One embodiment also evaluated the predictive performance of the GDSC1 dataset after training five models using the GDSC2 dataset as a training dataset.
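  • As an illustration only, a sketch of the 5-fold partitioning with scikit-learn; the shuffling and random seed are assumptions, not taken from the original.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_indices(n_samples, seed=0):
    """Randomly partition the drug-cell line pairs into 5 folds of roughly
    equal size; each fold serves once as the validation set while the other
    four folds are used for training."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return list(kf.split(np.arange(n_samples)))

# toy usage over 1,000 pairs
for fold, (train_idx, val_idx) in enumerate(five_fold_indices(1000)):
    print(fold, len(train_idx), len(val_idx))
```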
  • One embodiment used four indicators to evaluate the performance of the regression models: RMSE, CI, Pearson's correlation coefficient r, and the modified squared correlation coefficient r_m^2.
  • Because a regression technique is used, one embodiment uses RMSE, a commonly used metric for the error of continuous predictions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k - \hat{y}_k\right)^2}$$

where y_k is the actual output value, ŷ_k is the corresponding prediction, and n is the number of samples.
  • CI can be used as an evaluation index for prediction accuracy (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)).
  • The intuition behind CI is as follows: the CI for a set of paired data is the probability that the predictions for two randomly drawn drug-cell line pairs with different label values are in the correct order, i.e., that the prediction for the pair with the larger affinity value is larger than the prediction for the pair with the smaller affinity value.
  • the CI ranges from 0.5 to 1.0, where 0.5 corresponds to random prediction and 1.0 corresponds to perfect prediction accuracy.
  • Roy and Roy's modified squared correlation coefficient is used (Roy, P. & Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27, 302-313 (2008)):

$$r_m^2 = r^2\left(1 - \sqrt{\left|\,r^2 - r_0^2\,\right|}\right)$$

  • r^2 and r_0^2 are the squared correlation coefficients between observed and predicted values with and without the intercept, respectively.
  • A model with r_m^2 greater than 0.5 is judged to be an acceptable model.
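  • As an illustration only, a sketch of the four evaluation indicators in Python; the convention used below for r_0^2 (a regression through the origin of the observed on the predicted values) is an assumption, since the document does not spell it out.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def concordance_index(y, y_hat):
    """Probability that two randomly drawn pairs with different observed values
    are predicted in the correct order (ties in the prediction count as 0.5)."""
    num, den = 0.0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            den += 1
            s = (y_hat[i] - y_hat[j]) * (y[i] - y[j])
            num += 1.0 if s > 0 else (0.5 if s == 0 else 0.0)
    return num / den

def r_m_squared(y, y_hat):
    """Roy and Roy's modified squared correlation coefficient."""
    r2 = pearsonr(y, y_hat)[0] ** 2
    k = np.sum(y * y_hat) / np.sum(y_hat ** 2)                    # slope without intercept
    r0_2 = 1 - np.sum((y - k * y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2 * (1 - np.sqrt(abs(r2 - r0_2)))

# toy usage
y, y_hat = np.random.randn(50), np.random.randn(50)
print(rmse(y, y_hat), concordance_index(y, y_hat), pearsonr(y, y_hat)[0], r_m_squared(y, y_hat))
```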
  • The following describes the predictive performance of Dr.CNN, which is one embodiment, for the prediction of drug response values on the GDSC1 and GDSC2 datasets. The performance of the model of one embodiment was evaluated against these two benchmark datasets.
  • the predictive performance of the GDSC2 dataset was evaluated through 5-fold cross-validation.
  • Table 2 shows the performance results of five models through overlapping 5-fold cross-validation on the GDSC2 dataset. Values in bold represent the best performance results. Standard errors are given in parentheses.
  • Dr.CNN of one embodiment shows the best performance in all indicators of the GDSC2 dataset.
  • To statistically evaluate the significant improvement of Dr.CNN, a one-sided t-test was performed.
  • Dr.CNN, which showed the best performance results, was compared with each of the other models. The null hypotheses related to Table 2 therefore state, for each of the four indicators, that the compared model performs at least as well as Dr.CNN.
  • All relevant P-values in the above hypothesis tests are calculated to be less than 0.01. Therefore, Dr.CNN shows significantly better performance than other models for all four indicators.
  • Dr.CNN is the most acceptable model because it yields a significantly greater r_m^2 than the other models in the 5-fold cross-validation on the GDSC2 dataset.
  • Table 3 shows the performance results of the five models on the GDSC1 dataset. Values shown in bold represent the best performance results, and standard errors are listed in parentheses. Since only GDSC1 is used to calculate the evaluation index of each model, it is not possible to specify a statistically significant model through the evaluation index. Therefore, a bootstrap method was used to estimate the mean and standard error of each evaluation index (Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman Hall, 1993)). This makes it possible to specify a statistically significant model based on the mean and standard error estimated for each indicator.
  • Bootstrap methods are known to be useful for estimating the sampling distribution of evaluation indicators without using normal theory.
  • The bootstrap method involves repeatedly resampling the GDSC1 dataset by sampling with replacement.
  • When applying bootstrap sampling to the GDSC1 dataset, one embodiment sets the bootstrap sample size to 131,894 and repeats the sampling process 20 times, as sketched below.
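  • As an illustration only, a sketch of this bootstrap estimation with NumPy; using the standard deviation of the bootstrap replicates as the standard-error estimate is an assumption, not stated in the original.

```python
import numpy as np

def bootstrap_metric(y, y_hat, metric, n_boot=20, sample_size=None, seed=0):
    """Estimate the mean and standard error of an evaluation metric by
    resampling the (observed, predicted) pairs with replacement, e.g. 20
    resamples of size 131,894 for the GDSC1 dataset."""
    rng = np.random.default_rng(seed)
    size = len(y) if sample_size is None else sample_size
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=size)     # sampling with replacement
        scores.append(metric(y[idx], y_hat[idx]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1)         # mean and standard-error estimate

# toy usage with RMSE as the metric
y, y_hat = np.random.randn(500), np.random.randn(500)
mean_rmse, se_rmse = bootstrap_metric(y, y_hat, lambda a, b: np.sqrt(np.mean((a - b) ** 2)))
```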
  • Model | RMSE | CI | r | r_m^2
  • RF | 1.1992 (0.0006) | 0.6180 (0.0002) | 0.4210 (0.0008) | 0.1420 (0.0007)
  • SVR | 1.0948 (0.0004) | 0.6176 (0.0002) | 0.4606 (0.0005) | 0.1756 (0.0005)
  • Elastic Net | 1.0890 (0.0004) | 0.6398 (0.0002) | 0.4829 (0.0006) | 0.2034 (0.0006)
  • 1D CNNs | 1.0652 (0.0006) | 0.6321 (0.0001) | 0.5080 (0.0005) | 0.2536 (0.0003)
  • Dr.CNN | 1.0524 (0.0005) | 0.6531 (0.0002) | 0.5597 (0.0005) | 0.3024 (0.0007)
  • Dr.CNN shows the best performance in all 4 indicators for 20 bootstrap samples of the GDSC1 dataset.
  • a one-tailed t-test was performed to statistically evaluate the significant improvement of Dr.CNN.
  • The best-performing Dr.CNN model was compared with each of the other models. The null hypotheses related to Table 3 therefore state, for each of the four indicators, that the compared model performs at least as well as Dr.CNN. All relevant P-values in these hypothesis tests are calculated to be less than 0.01; therefore, Dr.CNN shows significantly better performance than the other models for all four indicators.
  • Although Dr.CNN did not obtain a sufficiently large r_m^2 on the GDSC1 dataset, it is the most acceptable model, because it shows a larger r_m^2 value than the other models for the 20 bootstrap samples of the GDSC1 dataset.
  • Figure 3 shows scatterplots of the values predicted by Dr.CNN versus the measured values of the GDSC2 and GDSC1 datasets.
  • the prediction value is obtained by alternately using 4 folds of the GDSC2 dataset as a training dataset and the remaining one fold as a test dataset.
  • the predicted values are obtained using the GDSC2 dataset as a training dataset and the GDSC1 dataset as a test dataset.
  • For an ideal regression model, the predicted values are expected to equal the measured values, i.e., the points are expected to lie on the identity line. In particular, for the GDSC2 dataset, the points show high density around this straight line.
  • In the random forest (RF), the split at each node is determined by considering only a subset of all the features.
  • This algorithm is simple and fast and is resistant to overfitting; in general, it shows better performance than using a single good regression model.
  • SVR provides the flexibility to define an acceptable error for the model and can find a high-dimensional hyperplane that fits the data.
  • The objective of SVR is to minimize the coefficients, specifically the L2-norm of the coefficient vector, rather than the squared error. The error term is instead handled by a constraint that requires the absolute error to be less than or equal to a specified margin, called the maximum error ε. In one embodiment, ε can be tuned to obtain the required accuracy of the model. SVR has proven to be an effective tool in real-valued function estimation.
  • In one embodiment, the input of the elastic net, RF, and SVR is a gene expression vector consisting of 172 values selected from 1,444 values, concatenated with the 2048-bit-long ECFP vector (see the sketch below).
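  • As an illustration only, a sketch of the three baseline regressors with scikit-learn; the hyperparameters are not given in the document, so the values below are placeholders, and the random inputs stand in for the concatenated feature vectors described above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# stand-in features: 2048 ECFP bits plus 172 selected gene expression values
X = np.random.rand(200, 2048 + 172)
y = np.random.randn(200)                 # drug response values

baselines = {
    "Elastic Net": ElasticNet(),
    "RF": RandomForestRegressor(n_estimators=100),
    "SVR": SVR(),                        # epsilon (the error margin) can be tuned
}
for name, model in baselines.items():
    model.fit(X, y)
    print(name, model.predict(X[:3]))
```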
  • One of the deep learning models is an ensemble 1D CNN-based prediction model that uses a 2048-bit ECFP vector and a gene expression vector consisting of 172 values selected from 19,144 values as input to each individual 1D CNN.
  • FIG 4 shows the workflow of an ensemble 1D CNN model for prediction of drug response values.
  • In a 1D CNN, the kernel and the pooling move along one dimension.
  • In one embodiment, the stride is set to 1 for all convolutional layers and to 2 for all max-pooling layers.
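  • As an illustration only, a Keras sketch of one such 1D CNN with a drug branch and a cell line branch; only the strides (1 for the convolutions, 2 for the max-pooling) and the input sizes come from the document, while the filter counts, kernel sizes, and the way the two branches are combined are assumptions.

```python
from tensorflow.keras import layers, Model

def branch(length, name):
    """One 1D CNN branch; inputs are reshaped to (length, 1) so that the kernel
    and the pooling move along a single dimension."""
    inp = layers.Input(shape=(length, 1), name=name)
    x = layers.Conv1D(16, 7, strides=1, activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    x = layers.Conv1D(32, 5, strides=1, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    return inp, layers.Flatten()(x)

drug_in, drug_feat = branch(2048, "ecfp")        # 2048-bit ECFP vector
cell_in, cell_feat = branch(172, "expression")   # 172 selected gene expression values
x = layers.Concatenate()([drug_feat, cell_feat])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(1, activation="linear")(x)
model = Model([drug_in, cell_in], out)
model.compile(optimizer="adam", loss="mse")
```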
  • Dr.CNN is a 2D CNN-based prediction model that uses the cross product of a drug similarity vector and a cell line similarity vector as an input.
  • In one embodiment, an m x m drug-drug similarity matrix based on the Tanimoto coefficient and an n x n cell line-cell line similarity matrix based on gene expression values through the RBF kernel function are calculated. RDKit's 2048-bit-long ECFP fingerprint is used to compute the Tanimoto coefficients.
  • m and n represent the number of drugs and cell lines in the training dataset, respectively.
  • Then, the cross product of the m x 1 drug similarity vector d_i and the n x 1 cell line similarity vector c_j is calculated, where d_i is the i-th column of the drug similarity matrix and c_j is the j-th column of the cell line similarity matrix.
  • In other words, d_i is composed of the Tanimoto similarities between the i-th drug and the other drugs including itself, and c_j is composed of the gene expression similarities between the j-th cell line and the other cell lines including itself.
  • Parameters related to Dr.CNN can be obtained by taking the cross product as an input and taking the drug response as an output.
  • In one embodiment, the CNN submodel may consist of two 2D convolution layers, each followed by max-pooling, one flatten layer, and FC(128), FC(64), and FC(1) layers. The numbers in parentheses indicate the number of nodes.
  • The FC(128) and FC(64) layers may use a Rectified Linear Unit (ReLU) activation, and the FC(1) layer may use a linear activation function.
  • ReLU Rectified Linear Unit
  • To reduce overfitting, a dropout layer with a drop ratio of 0.1 may be included between the flatten layer and the FC(128) layer, between the FC(128) layer and the FC(64) layer, and between the FC(64) layer and the FC(1) layer.
  • the number of filters for each convolution layer is 18 and 24, respectively.
  • filters having kernel sizes of 5 x 5 and 3 x 3 may be used for the convolution layer, respectively.
  • the max pooling layer has size 2 and stride 2.
  • the batch size and the number of epochs may be set to 32 and 20, respectively, for the learning algorithm.
  • An embodiment may use an Adam optimizer with a learning rate of 0.001.
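  • As an illustration only, a Keras sketch of the 2D CNN submodel described above; the convolutional activations and the loss function are assumptions, since the document does not state them.

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn_submodel(m, n):
    """Two Conv2D layers (18 and 24 filters, 5x5 and 3x3 kernels), each followed
    by 2x2 max-pooling with stride 2, a flatten layer, dropout (rate 0.1) around
    the dense layers, and FC(128), FC(64), FC(1) layers."""
    model = models.Sequential([
        layers.Conv2D(18, (5, 5), activation="relu", input_shape=(m, n, 1)),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(24, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dropout(0.1),
        layers.Dense(128, activation="relu"),   # FC(128), ReLU
        layers.Dropout(0.1),
        layers.Dense(64, activation="relu"),    # FC(64), ReLU
        layers.Dropout(0.1),
        layers.Dense(1, activation="linear"),   # FC(1): predicted drug response
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

# training would use, for example, batch_size=32 and epochs=20:
# model = build_cnn_submodel(m=147, n=588)    # e.g. GDSC2: 147 drugs, 588 cell lines
# model.fit(X_train, y_train, batch_size=32, epochs=20)
```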
  • FIG. 6 is a block diagram showing an example of a system for implementing a method for predicting drug response according to an embodiment, conceptually showing parts related to the present embodiment.
  • Each component may be provided in one device and may be processed independently, but is not limited thereto, and may also include a device connected through a network so that each component is performed in a separate device.
  • the external server 20 may be connected to the prediction system 10 through a network, and may provide information on chemical characteristics of drugs, cell line information, drug response information, and the like.
  • drug chemical property information may include SMILES (Simplified Molecular-Input Line-Entry System) information
  • cell line information may include cell line gene expression information.
  • GDSC provides information on drug responses observed in all pairs of cell lines and drugs, which may be received from the external server 20 .
  • the external server 20 may be a database for drug response prediction processing of the prediction system 10 or a server that provides it.
  • the prediction system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.
  • the control unit 11 is a component that controls the entire prediction system 10, and may include, for example, a processing unit such as a CPU or a GPU.
  • the control unit 11 may use information stored in the memory unit 14 to learn models to be described later, and may also calculate a predicted value for a new input through the learned model.
  • the controller 11 may control a model for predicting a drug response.
  • the control unit 11 may include an internal memory for storing a control program such as an OS (operating system), a program defining various processing procedures, and data. Then, the control unit 11 can perform information processing for executing various processes by these programs and the like.
  • In one embodiment, the communication unit 12 may include an interface that can be connected to a communication device, such as a router connected to a communication line, and can control communication between the prediction system 10 and the external server 20.
  • the input/output interface unit 13 may be an interface connected to the input unit 15 and/or the display unit 16 .
  • the prediction system 10 and the user may communicate through the input/output interface 13 .
  • the display unit 16 may be a display means (for example, a display made of liquid crystal or organic EL, a monitor, a touch panel, etc.) for displaying a display screen of an application or the like.
  • the input unit 15 may be, for example, a key input unit, a touch panel, a control pad (eg, a touch pad, a game pad, etc.), a mouse, a keyboard, a microphone, or the like.
  • the memory unit 14 may be a device for storing various databases or tables.
  • the memory unit may provide information about chemical properties of drugs, cell line information, drug response information, and the like.
  • drug chemical property information may include SMILES (Simplified Molecular-Input Line-Entry System) information
  • cell line information may include cell line gene expression information.
  • In addition, processes for the input and output of the prediction system 10 may be stored, and the result values of such processing may also be stored.
  • Embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium.
  • a computer-readable recording medium may include program instructions, data files, data structures, etc. alone or in combination.
  • Program instructions recorded on a computer readable recording medium may be specially designed and configured for the present invention or may be known and usable to those skilled in the art of computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter or the like as well as machine language codes generated by a compiler.
  • the hardware device may be configured to act as one or more software modules to perform processing according to the present invention and vice versa.
  • the embodiments according to the present invention described above may be a set of program commands that can be executed through various computer components and a user application itself for executing them. Specifically, it may be a program itself that can be installed on a client computer after being downloaded through a server or through a storage medium.
  • embodiments of the present invention are not mutually exclusive, and configurations of one embodiment may be applied to other embodiments.
  • Embodiments of the present invention are provided as examples of some of the various forms that can be derived from various combinations of components, and are not limited to the specific embodiments of the present invention itself.
  • 11: control unit; 12: communication unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Medicinal Chemistry (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

Human cancer cell lines are frequently used in researching the biology of cancer and in research for testing the effects of cancer treatment. Accurately predicting drug responses by using pharmacogenomic data is an essential problem in oncology and precision medicine. On the basis that similar cell lines respond similarly to similar drugs, provided is an ensemble deep learning system in which a cross product between column vectors of a drug-drug similarity matrix and a cell line-cell line similarity matrix is applied to a convolutional neural network. Therefore, it has been identified that genetic characteristics of patients are connected to drug sensitivity so as to be useful in precision medicine.

Description

A system for predicting drug response using a convolutional neural network based on similarity matrices of drugs and cell lines
Precision medicine aims to elaborately select cancer treatments based on the genetic information of each patient. One of the most important problems in precision medicine is predicting the anticancer drug response for each patient. Because of tumor heterogeneity, patients with the same type of cancer may have different responses to similar drugs. Therefore, it is very important to provide a predictive method that reveals the relationship between genomic information and drug response, which can be helpful for precision medicine.
Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines treated with multiple anticancer drugs. These large datasets allow the development of methods for predicting patient-specific drug response. Generally, methods for predicting drug response are classified into two categories. The first is a classification approach that predicts sensitive drug-cell line pairs. The second is a regression analysis approach that predicts a reference value for measuring the response of a cell line to a drug. One embodiment of the present invention discloses a regression analysis approach that predicts the log-transformed drug response value of a cell line for a drug, quantified through the IC50 (half-maximal inhibitory concentration) value. Various regression analysis approaches have been proposed to predict drug response using gene expression profiles or other molecular information of a cell line. Some prediction methods have improved drug response prediction by incorporating drug information, such as the chemical substructure of the drug, together with cell line information. In addition, numerous machine learning methods have been applied to the problem of predicting drug response; for example, regression analysis methods such as lasso (least absolute shrinkage and selection operator), elastic nets, random forests, kernel-based methods, neural networks, and deep learning have been applied. Ali and Aittokallio provide a comprehensive recent review (Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31-39 (2019)). Recent advances in deep learning have opened a new avenue for finding regression models for predicting drug response, ultimately providing a more accurate tool for predicting treatment response.
Wang et al. proposed a similarity-regularized matrix factorization (SRMF) method for predicting drug response, which simultaneously includes the gene expression profile similarity of cell lines and the chemical substructure similarity of drugs (Wang, L., Li, X., Zhang, L. & Gao, Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer 17, 513 (2017)). Patients with similar genetic characteristics have been shown to have similar responses to similar drugs. Suphavilai et al. devised a matrix factorization-based recommender system called "CaDRReS" that can predict drug responses for new drugs and new cell lines by learning representations of drugs and cell lines in a latent space (Suphavilai, C., Bertrand, D. & Nagarajan, N. Predicting cancer drug response using a recommender system. Bioinformatics 34, 3907-3914 (2018)). This showed that the characteristics of the latent space are correlated with the pathways of the drugs. Chang et al. proposed "CDRscan", an ensemble model that includes five convolutional neural networks (CNNs). It uses the mutation profile of the cell line and the chemical substructure of the drug as input features for the CNNs. The drug response value is computed as the average of the output values of the five CNNs. However, "CDRscan" tends not to predict drug responses for new drugs and new cell lines well. Wei et al. recently devised the cell line-drug complex network (CDCN), which predicts drug response by inferring information from a simple network composed of cell lines and drugs (Wei, D., Liu, C., Zheng, X. & Li, Y. Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model. BMC Bioinf. 20, 44 (2019)).
CDCN provides satisfactory results for imputing missing drug information. Moughari and Eslahchi devised a model for Anticancer Drug Response Prediction using Manifold Learning (ADRML) (Moughari, F. A. & Eslahchi, C. ADRML: anticancer drug response prediction using manifold learning. Sci. Rep. 10, 14245 (2020)). ADRML maps drug response values to a low-dimensional latent space and computes drug response values for new cell line-drug pairs from the latent space. It takes into account several types of cell line similarity and drug similarity and utilizes them in the manifold learning procedure. ADRML has been shown to provide accurate and robust predictions.
Patients with the same type of cancer may respond differently to similar drugs. Therefore, it is very important to accurately predict the drug response of each individual patient.
In addition, existing approaches tend not to predict drug responses for new drugs and new cell lines well. Therefore, a new approach is required to overcome this.
In addition, a new method is required that can simplify the computational process while providing more accurate drug response predictions.
One embodiment of the present invention provides a method of predicting the drug response of a drug and a cell line using a convolutional neural network model, the method comprising: preparing a first similarity matrix between drugs; preparing a second similarity matrix between cell lines; calculating the cross product between the first similarity matrix and the second similarity matrix, wherein the cross product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix is calculated; learning the convolutional neural network model by taking the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value; and predicting and outputting the drug response value of a new drug and cell line using the learned convolutional neural network model.
In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction method is thereby provided.
In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction method is thereby provided.
In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug. In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction method is thereby provided.
In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction method is thereby provided.
Also, the convolutional neural network model may be a two-dimensional (2D) model.
In addition, the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
Another embodiment of the present invention provides a system for predicting the drug response of a drug and a cell line using a convolutional neural network model, the system comprising: a control unit for controlling the convolutional neural network model; a communication unit for communication with an external server; a memory unit; a display unit; and an input unit that receives a user's input, wherein the memory unit stores a first similarity matrix between drugs and a second similarity matrix between cell lines, and the control unit calculates the cross product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix, trains the convolutional neural network model using the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and predicts the drug response value of a new drug and cell line using the learned convolutional neural network model.
In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction system is thereby provided.
In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction system is thereby provided.
In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction system is thereby provided.
In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction system is thereby provided.
Also, the convolutional neural network model may be a two-dimensional (2D) model.
In addition, the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
One embodiment discloses the Dr.CNN model to solve the problem of predicting drug response values. Dr.CNN simply divides the RBF kernel matrix associated with the cell lines into two sub-matrices and calculates the cross product of the column vectors of the Tanimoto similarity matrix with the column vectors of each RBF kernel sub-matrix. These cross products are then used as the input values of the two respective CNN models, which allows faster computation and improved prediction performance through ensemble learning. The RBF kernel matrix may be divided randomly, and it may be divided into two or more sub-matrices according to the number of cell lines. Dr.CNN is the first non-linear method that applies a 2D CNN to the cross product between the column vectors of the drug Tanimoto similarity matrix and the column vectors of the cell line RBF kernel matrix for drug response prediction.
Experimental results show that Dr.CNN exceeds the performance of existing models such as elastic net, RF, SVR, and the 1D CNN ensemble. Dr.CNN can be further improved by adjusting the CNN architecture according to the data structure. The main idea behind Dr.CNN is to integrate the two modalities using the cross product and apply a CNN to the resulting matrix. Dr.CNN is a very effective approach for drug response prediction and can play a significant role in the drug development process.
Figure 1 shows the overall workflow of an embodiment proposed for the prediction of drug response values.
Figure 2 shows a summary of the GDSC1 and GDSC2 datasets used in one embodiment.
Figure 3 shows scatter plots of the $\ln(\mathrm{IC}_{50})$ values predicted by one embodiment against the measured $\ln(\mathrm{IC}_{50})$ values of the GDSC2 and GDSC1 datasets.
Figure 4 shows the workflow of an ensemble 1D CNN model for the prediction of drug response values.
Figure 5 shows the architecture of the similarity-based CNN submodel of one embodiment.
Figure 6 is a block diagram showing an example of a system for implementing a drug response prediction method according to an embodiment.
The present invention will become clear with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the present invention belongs, and the present invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless specifically stated otherwise. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more components or steps other than the recited components or steps. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms; such terms are used only to distinguish one component from another.
One embodiment of the present invention provides a similarity-based ensemble deep learning model that predicts drug response values using a two-dimensional CNN approach applied to the outer product of column vectors of the Tanimoto similarity matrix of drugs and the radial basis function (RBF) kernel matrix of cell lines. In one embodiment, this model is referred to as Dr.CNN.
Figure 1 shows the overall workflow of the proposed Dr.CNN for predicting drug response values. The Dr.CNN workflow predicts the final drug response value from the drug response values obtained from two subnetworks running inside an ensemble deep learning architecture, for example as their average. With the outer product of a drug similarity vector and a cell line similarity vector as the input, a 2D CNN is used to learn the features. Each CNN may consist of, for example, two convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and three fully connected (FC) layers. The CNN structure is described in detail with reference to Figure 5 below.
To obtain more refined prediction performance and faster computation through ensemble learning, one embodiment divides the RBF kernel matrix of the cell lines into two submatrices, constructs the outer products of column vectors of the Tanimoto similarity matrix and of each RBF kernel submatrix, and uses them as the CNN inputs. The RBF kernel matrix of the cell lines may be divided into two or more submatrices depending on the number of cell lines. An embodiment can consist of two steps. In the first step, the Tanimoto similarity matrix of the drugs and the RBF kernel matrix of the cell lines are computed, and then the outer product between a column vector of the Tanimoto similarity matrix and a column vector of each RBF kernel submatrix is computed. In the second step, a 2D CNN model is applied to extract features from the outer products and to predict the drug response values of the two subnetworks. The final prediction of the drug response value can be obtained from the drug response values of the two trained subnetworks, for example by taking their average.
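As an illustration only, the following Python sketch with NumPy mirrors the two steps described above for a single drug-cell line pair; the matrix names, the row-wise split of the kernel matrix, and the submodel interfaces are assumptions introduced for this example rather than details taken from the embodiment.

```python
import numpy as np

# Toy sizes and random matrices stand in for the real similarity matrices.
rng = np.random.default_rng(0)
m, n = 40, 60                      # hypothetical numbers of drugs and cell lines
S_d = rng.random((m, m))           # stands in for the drug-drug Tanimoto matrix
S_c = rng.random((n, n))           # stands in for the cell line RBF kernel matrix

# Step 1: split the cell line kernel matrix into two submatrices (a row-wise
# split is assumed here) and build one outer-product input per submodel.
n1 = n // 2
S_c1, S_c2 = S_c[:n1, :], S_c[n1:, :]

def pair_inputs(i, j):
    """Outer-product inputs of the two 2D CNN submodels for drug i and cell line j."""
    d_i = S_d[:, i]                       # similarities of drug i to all drugs
    x1 = np.outer(d_i, S_c1[:, j])        # m x n1 input of the first submodel
    x2 = np.outer(d_i, S_c2[:, j])        # m x n2 input of the second submodel
    return x1, x2

# Step 2: each trained submodel maps its input to a predicted ln(IC50); the
# final prediction is the average of the two submodel outputs.
def ensemble_predict(submodel_1, submodel_2, i, j):
    x1, x2 = pair_inputs(i, j)
    return 0.5 * (submodel_1(x1) + submodel_2(x2))
```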
Dr.CNN according to an embodiment is validated by comparing its root mean squared error (RMSE), concordance index (CI), and modified squared correlation coefficient ($r_m^2$) with those of other machine learning and deep learning models. The CI is a rank correlation between observed and predicted data.
Embodiments of the present invention can help users predict drug response values.
Experimental datasets
The inputs of Dr.CNN in one embodiment are cell line gene expression and the SMILES (Simplified Molecular-Input Line-Entry System) strings of anticancer compounds. GDSC, a publicly available database, is used for the drug responses observed for all pairs of cell lines and drugs. The SMILES of the GDSC drugs are obtained from PubChem. GDSC includes the GDSC1 (Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740-754 (2016)) and GDSC2 (Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019)) datasets. One embodiment uses the GDSC1 and GDSC2 datasets to evaluate drug response prediction. The GDSC1 dataset screened 681 cell lines across 234 compounds using Resazurin or Syto60 assays. The GDSC2 dataset screened 588 cell lines across 147 compounds with the CellTitreGlo assay. The GDSC1 dataset contains 131,894 drug response values, measured as $\ln(\mathrm{IC}_{50})$, observed over the pairs of 234 drugs and 681 cell lines. The GDSC2 dataset contains 72,393 drug response values, measured as $\ln(\mathrm{IC}_{50})$, observed over the pairs of 147 drugs and 588 cell lines. Table 1 shows these two datasets in the form used in the actual experiments. One embodiment uses the $\mathrm{IC}_{50}$ values converted to log space as $\ln(\mathrm{IC}_{50})$.
| Dataset | Drugs | Cell lines | Interactions | Density (%) |
|---------|-------|------------|--------------|-------------|
| GDSC1   | 234   | 681        | 131,894      | 82.77       |
| GDSC2   | 147   | 588        | 72,393       | 83.75       |
Figure 2 shows a summary of the GDSC1 and GDSC2 datasets. Panels (a) and (b) of Figure 2 show the distributions of the $\ln(\mathrm{IC}_{50})$ drug response values in the GDSC1 and GDSC2 datasets, respectively, and panels (c) and (d) show the distributions of the SMILES string lengths of the drugs in the GDSC1 and GDSC2 datasets, respectively. For the $\ln(\mathrm{IC}_{50})$ values of the GDSC1 dataset, the mean and standard deviation are -0.9032 and 1.1777, respectively. For the $\ln(\mathrm{IC}_{50})$ values of the GDSC2 dataset, the mean and standard deviation are -1.2472 and 1.2182, respectively. For the drugs in the GDSC1 dataset, the maximum SMILES length is 133 and the mean is 62. For the drugs in the GDSC2 dataset, the maximum SMILES length is 126 and the mean is 62.
Representation of inputs and outputs
In Dr.CNN according to one embodiment, drug-drug and cell line-cell line similarity matrices are used for the GDSC1 and GDSC2 datasets. These two matrices are denoted by $S_d \in \mathbb{R}^{m \times m}$ and $S_c \in \mathbb{R}^{n \times n}$, respectively. To build the ensemble model, one embodiment divides the matrix $S_c$ into two submatrices, that is, $S_c = \begin{bmatrix} S_{c_1} \\ S_{c_2} \end{bmatrix}$, where $S_{c_k} \in \mathbb{R}^{n_k \times n}$ and $n_1 + n_2 = n$. Accordingly, the input of Dr.CNN for each drug-cell line pair is the outer product $d_i \otimes c_j = d_i c_j^t$ of $d_i$ and $c_j$, where $d_i$ is the $i$-th column of the similarity matrix $S_d$, $c_j$ is the $j$-th column of the similarity submatrix $S_{c_k}$ for $k = 1, 2$, and $\otimes$ denotes the outer product. The superscript $t$ denotes the transpose of a vector. The outer product $d_i \otimes c_j$ is defined as in Equation (1) below.

$$d_i \otimes c_j = d_i c_j^t = \begin{bmatrix} d_{1i}c_{1j} & d_{1i}c_{2j} & \cdots & d_{1i}c_{n_k j} \\ d_{2i}c_{1j} & d_{2i}c_{2j} & \cdots & d_{2i}c_{n_k j} \\ \vdots & \vdots & \ddots & \vdots \\ d_{mi}c_{1j} & d_{mi}c_{2j} & \cdots & d_{mi}c_{n_k j} \end{bmatrix} \qquad (1)$$
This outer product yields two sets of information. Since the entry of $d_i$ corresponding to the $i$-th drug and the entry of $c_j$ corresponding to the $j$-th cell line are self-similarities equal to 1, the outer product yields the bimodal interactions as well as the raw unimodal representations of the individual modalities. Therefore, $d_i \otimes c_j$ contains every combination of the information in $d_i$ and $c_j$. This indicates that, in predicting the drug response of the $i$-th drug and the $j$-th cell line, $d_i \otimes c_j$ can be a more effective input than a simple concatenation of $d_i$ and $c_j$.
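The following short NumPy check, written for illustration with made-up similarity values, shows how the outer product of Equation (1) retains both unimodal vectors when the self-similarity entries equal 1.

```python
import numpy as np

# Hypothetical similarity vectors for drug i and cell line j; the entries at
# positions i and j are the self-similarities, which equal 1.
i, j = 1, 2
d_i = np.array([0.3, 1.0, 0.7, 0.2])        # Tanimoto similarities of drug i
c_j = np.array([0.5, 0.9, 1.0, 0.4, 0.6])   # RBF similarities of cell line j

P = np.outer(d_i, c_j)                      # Equation (1): P[k, l] = d_i[k] * c_j[l]

# Because d_i[i] == 1 and c_j[j] == 1, the outer product keeps the raw unimodal
# vectors (as one row and one column) in addition to all bimodal interaction terms.
assert np.allclose(P[i, :], c_j)            # row i reproduces the cell line vector
assert np.allclose(P[:, j], d_i)            # column j reproduces the drug vector
```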
The drug-drug similarity is computed with the Tanimoto coefficient T, the most popular similarity measure for comparing chemical structures represented as fingerprints. One embodiment uses the topological fingerprint of RDKit. The Tanimoto similarity measure takes values from 0 to 1 and can be interpreted as the percentage of features shared by two drugs. The cell line-cell line similarity, on the other hand, can be computed from the gene expression vectors using the RBF kernel described by Equation (2). The RBF kernel is a popular kernel function used in various kernel learning algorithms.

$$K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \qquad (2)$$

Here, $x_i$ is the gene expression vector of the $i$-th cell line, and $\sigma$ is the bandwidth parameter, whose estimate is obtained as in He et al. (He, T. et al. SimBoost: A read-across approach for predicting drug-target binding affinities using gradient boosting machines. J. Cheminf. 9, 24. https://doi.org/10.1186/s13321-017-0209-z (2017)). The RBF kernel is a measure of the similarity between vectors. Its value decreases with the distance $\lVert x_i - x_j \rVert$ and ranges from 0 (in the limit) to 1 (when $x_i = x_j$). That is, when two vectors are close to each other, $\lVert x_i - x_j \rVert$ is small and the kernel value increases toward 1. Therefore, nearby vectors have larger RBF kernel values than distant vectors.
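A minimal sketch of how the two similarity matrices could be computed with RDKit and NumPy is given below; the SMILES strings, the expression matrix, and the bandwidth value are placeholders, and the embodiment's own fingerprint settings and bandwidth estimation may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs

def tanimoto_matrix(smiles_list):
    """Drug-drug similarity matrix from RDKit topological fingerprints."""
    fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    m = len(fps)
    S_d = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            S_d[a, b] = DataStructs.TanimotoSimilarity(fps[a], fps[b])
    return S_d

def rbf_kernel_matrix(X, sigma):
    """Cell line-cell line similarity matrix from the rows of an expression matrix X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Placeholder inputs: three example SMILES strings and random expression vectors.
S_d = tanimoto_matrix(["CCO", "c1ccccc1", "CC(=O)O"])
S_c = rbf_kernel_matrix(np.random.rand(5, 172), sigma=1.0)
```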
The output value of each drug-cell line pair corresponds to its $\ln(\mathrm{IC}_{50})$ value.
Performance evaluation metrics
Since the elastic net, random forest (RF), and support vector regression (SVR) are common regression methods proposed for drug response prediction, one embodiment considers them as baseline methods. One embodiment compares the performance of the elastic net, RF, SVR, a 1D CNN ensemble, and Dr.CNN on the evaluation datasets described above. The 1D CNN ensemble used here is similar to that used in Park et al. (Park, H. et al. Detection of chromosome structural variation by targeted next-generation sequencing and deep learning application. Sci. Rep. 9, 3644 (2019)). The inputs of the elastic net, RF, SVR, and the 1D CNN ensemble are the concatenated vectors of the 2048-bit extended-connectivity fingerprints (ECFPs) and gene expression vectors composed of 172 values selected from the 19,144 values. One embodiment evaluated the predictive performance on the GDSC2 dataset with a 5-fold cross-validation experiment. This technique randomly partitions the dataset into 5 folds of approximately equal size. One fold is treated as the validation set, and training is applied to the remaining 4 folds. The procedure is repeated 5 times, and in each repetition a different group of instances is treated as the validation set. One embodiment also evaluated the predictive performance on the GDSC1 dataset after training the five models using the GDSC2 dataset as the training dataset. One embodiment used four metrics to evaluate the performance of the regression models: RMSE, CI, the Pearson correlation coefficient $r$, and $r_m^2$.
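For reference, the 5-fold cross-validation procedure can be sketched with scikit-learn as follows; `build_model`, `X`, `y`, and `metric` are placeholders for whichever model, feature representation, and metric are being evaluated.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(build_model, X, y, metric, n_splits=5, seed=0):
    """Return the metric computed on each held-out fold of a random 5-fold split."""
    scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])   # train on the remaining four folds
        y_hat = model.predict(X[val_idx])       # predict the held-out validation fold
        scores.append(metric(y[val_idx], y_hat))
    return np.array(scores)                     # one score per fold
```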
Because a regression technique is used, one embodiment uses the RMSE, a metric commonly used for the error of continuous predictions, defined in Equation (3), where $y_i$ is the actual output value, $\hat{y}_i$ is the corresponding prediction, and $n$ is the number of samples.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (3)$$
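Equation (3) corresponds to the following small NumPy helper, shown only as a sketch with assumed array inputs.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between measured values y and predictions y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```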
As proposed by Pahikkala et al., the CI can be used as an evaluation metric for prediction accuracy (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)). The intuition behind the CI is as follows. The CI over a set of paired data is the probability that the predictions for two randomly drawn drug-cell line pairs with different label values are in the correct order, that is, that the prediction $p_x$ for the larger affinity value $\delta_x$ is larger than the prediction $p_z$ for the smaller affinity value $\delta_z$:

$$\mathrm{CI} = \frac{1}{Z}\sum_{\delta_x > \delta_z} h(p_x - p_z) \qquad (4)$$

Here, $Z$ is a normalization constant and $h(\cdot)$ is the step function (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)):

$$h(u) = \begin{cases} 1, & u > 0 \\ 0.5, & u = 0 \\ 0, & u < 0 \end{cases} \qquad (5)$$

The CI ranges from 0.5 to 1.0, where 0.5 corresponds to a random prediction and 1.0 corresponds to perfect prediction accuracy.
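A straightforward (quadratic-time) sketch of the CI of Equations (4) and (5) is shown below; it assumes plain NumPy arrays of measured and predicted values.

```python
import numpy as np

def concordance_index(y, y_hat):
    """CI over all pairs with different measured values, per Equations (4) and (5)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    num, Z = 0.0, 0
    for a in range(len(y)):
        for b in range(len(y)):
            if y[a] > y[b]:                    # ordered pair with a larger measured value
                Z += 1                         # normalization constant
                diff = y_hat[a] - y_hat[b]
                num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return num / Z if Z > 0 else 0.5
```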
To strengthen the assessment of a model's predictive ability, Roy and Roy introduced the modified squared correlation coefficient $r_m^2$ (Roy, P. & Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27, 302-313 (2008)):

$$r_m^2 = r^2\left(1 - \sqrt{\left| r^2 - r_0^2 \right|}\right) \qquad (6)$$

Here, $r^2$ and $r_0^2$ are the squared correlation coefficients with and without the intercept, respectively. A model with $r_m^2 > 0.5$ on the test dataset is judged to be an acceptable model.
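Under a common reading of Equation (6), in which $r_0^2$ is the squared correlation of the regression through the origin, a sketch of the computation is given below; the exact convention used in the embodiment may differ.

```python
import numpy as np

def r_m_squared(y, y_hat):
    """Modified squared correlation coefficient of Equation (6)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2          # squared correlation with intercept
    k = np.sum(y * y_hat) / np.sum(y_hat ** 2)     # slope of the fit through the origin
    r0_2 = 1.0 - np.sum((y - k * y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2 * (1.0 - np.sqrt(abs(r2 - r0_2)))
```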
Training and evaluation
The following describes the prediction of the drug response values, $\ln(\mathrm{IC}_{50})$, by Dr.CNN according to one embodiment on the GDSC1 and GDSC2 datasets and its predictive performance. One embodiment evaluated the performance of its model on the two benchmark datasets. First, since the GDSC2 dataset is more recent than the GDSC1 dataset, the predictive performance on the GDSC2 dataset was evaluated through 5-fold cross-validation. Table 2 shows the performance results of the five models under nested 5-fold cross-validation on the GDSC2 dataset. The best result for each metric is shown in bold, and standard errors are given in parentheses.
| Model | RMSE | CI | $r$ | $r_m^2$ |
|-------|------|----|-----|---------|
| RF | 0.5386 (0.0026) | 0.8433 (0.0002) | 0.8970 (0.0011) | 0.7997 (0.0018) |
| SVR | 0.7305 (0.0398) | 0.7923 (0.0021) | 0.8307 (0.0033) | 0.6455 (0.0227) |
| Elastic Net | 0.7522 (0.0377) | 0.7843 (0.0094) | 0.8084 (0.0176) | 0.5416 (0.0410) |
| 1D CNN | 0.5390 (0.0022) | 0.8448 (0.0004) | 0.8979 (0.0009) | 0.7909 (0.0060) |
| Dr.CNN | **0.5085 (0.0042)** | **0.8536 (0.0007)** | **0.9098 (0.0011)** | **0.8162 (0.0043)** |
As can be seen in Table 2, Dr.CNN of one embodiment shows the best performance on all metrics for the GDSC2 dataset. To statistically assess the clear improvement of the model of one embodiment, a one-sided t-test was performed, comparing Dr.CNN, which had the best results, with the other models. The null hypotheses associated with Table 2 therefore state that Dr.CNN performs no better than the competing models with respect to RMSE, CI, $r$, and $r_m^2$, respectively. All relevant P values of these hypothesis tests are computed to be smaller than 0.01. Therefore, Dr.CNN performs significantly better than the other models on all four metrics. In particular, Dr.CNN is the most acceptable model, since it yields a markedly larger $r_m^2$ than the other models in the 5-fold cross-validation on the GDSC2 dataset.
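One way to carry out such a one-sided test on fold-wise metric values is sketched below with SciPy (the `alternative` argument requires SciPy 1.6 or later); the pairing of scores by fold and the 0.01 threshold follow the description above, while the function and variable names are placeholders.

```python
from scipy import stats

def dr_cnn_significantly_better(scores_dr_cnn, scores_other, higher_is_better=True):
    """One-sided paired t-test on fold-wise scores of Dr.CNN versus another model."""
    alternative = "greater" if higher_is_better else "less"   # e.g. RMSE: lower is better
    t_stat, p_value = stats.ttest_rel(scores_dr_cnn, scores_other,
                                      alternative=alternative)
    return p_value < 0.01, p_value
```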
Next, after the five models were trained using the GDSC2 dataset as the training dataset, their predictive performance on the GDSC1 dataset was evaluated. Table 3 shows the performance results of the five models on the GDSC1 dataset. The best result for each metric is shown in bold, and standard errors are given in parentheses. Because only GDSC1 is used in computing the evaluation metrics of each model, a statistically significant model cannot be identified from a single value of each metric. Therefore, a bootstrap method was used to estimate the mean and standard error of each evaluation metric (Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman Hall, 1993)). This makes it possible to identify a statistically significant model based on the estimated mean and standard error of each metric. The bootstrap method is known to be useful for estimating the sampling distribution of an evaluation metric without relying on normal theory. It involves repeatedly sampling the GDSC1 dataset with replacement. When applying bootstrap sampling to the GDSC1 dataset, the bootstrap sample size is set to 131,894 and the sampling process is repeated 20 times.
| Model | RMSE | CI | $r$ | $r_m^2$ |
|-------|------|----|-----|---------|
| RF | 1.1992 (0.0006) | 0.6180 (0.0002) | 0.4210 (0.0008) | 0.1420 (0.0007) |
| SVR | 1.0948 (0.0004) | 0.6176 (0.0002) | 0.4606 (0.0005) | 0.1756 (0.0005) |
| Elastic Net | 1.0890 (0.0004) | 0.6398 (0.0002) | 0.4829 (0.0006) | 0.2034 (0.0006) |
| 1D CNN | 1.0652 (0.0006) | 0.6321 (0.0001) | 0.5080 (0.0005) | 0.2536 (0.0003) |
| Dr.CNN | **1.0524 (0.0005)** | **0.6531 (0.0002)** | **0.5597 (0.0005)** | **0.3024 (0.0007)** |
As shown in Table 3, Dr.CNN shows the best performance on all four metrics over the 20 bootstrap samples of the GDSC1 dataset. As above, a one-sided t-test was performed to statistically assess the significant improvement of Dr.CNN, comparing the best-performing Dr.CNN model with the other models. The null hypotheses associated with Table 3 therefore state that Dr.CNN performs no better than the competing models with respect to RMSE, CI, $r$, and $r_m^2$, respectively. All relevant P values of these hypothesis tests are computed to be smaller than 0.01. Therefore, Dr.CNN performs significantly better than the other models on all four metrics. Although Dr.CNN did not achieve a sufficiently large $r_m^2$, it is the most acceptable model, because it shows larger $r_m^2$ values than the other models over the 20 bootstrap samples of the GDSC1 dataset.
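The bootstrap estimate of a metric's mean and standard error can be sketched as follows; the resample size and the 20 repetitions follow the values stated above, while the metric function and the prediction arrays are placeholders.

```python
import numpy as np

def bootstrap_metric(y, y_hat, metric, n_boot=20, sample_size=131_894, seed=0):
    """Bootstrap estimate of a metric's mean and standard error (resampling with replacement)."""
    rng = np.random.default_rng(seed)
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=sample_size)   # resample pairs with replacement
        values.append(metric(y[idx], y_hat[idx]))
    values = np.array(values)
    return values.mean(), values.std(ddof=1)              # mean and standard error
```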
To demonstrate the predictive ability visually, the predicted values are plotted against the measured values for the datasets. Figure 3 shows scatter plots of the $\ln(\mathrm{IC}_{50})$ values predicted by Dr.CNN against the measured $\ln(\mathrm{IC}_{50})$ values of the GDSC2 and GDSC1 datasets. For the GDSC2 dataset, the predictions are obtained by alternately using four folds of the GDSC2 dataset as the training dataset and the remaining fold as the test dataset. For the GDSC1 dataset, the predictions are obtained by using the GDSC2 dataset as the training dataset and the GDSC1 dataset as the test dataset. An ideal regression model would be expected to produce predicted values $\hat{y}$ equal to the measured values $y$, that is, $\hat{y} = y$. In particular, for the GDSC2 dataset, the points show a high density around the line $\hat{y} = y$.
Baseline models
For the baseline models, three conventional machine learning models and one deep learning based model can be considered. The conventional machine learning models are the elastic net, RF, and SVR. The elastic net first emerged from the criticism of the Lasso (least absolute shrinkage and selection operator) that its variable selection depends too strongly on the data and can therefore be unstable; the solution is to combine the ridge regression and Lasso penalties so as to be optimal in both respects. The Lasso is a regression method that performs both variable selection and regularization to improve the prediction accuracy and interpretability of a statistical model. RF assigns input data sampled with replacement to many decision trees for training, collects the decision results for a drug-cell line pair, and determines the drug response by averaging them. As each tree is grown, the split at each node is determined by considering only a subset of all features. This algorithm is simple and fast and does not readily cause overfitting; in general, it performs better than a single well-fitted regression model. SVR offers the flexibility to define an acceptable error in the model and can find a high-dimensional hyperplane that fits the data. The objective of SVR is to minimize not the squared error but the coefficients, specifically the L2-norm of the coefficient vector; the error term is instead handled in a constraint that requires the absolute error to be at most a specified margin, called the maximum error $\varepsilon$. One embodiment can tune $\varepsilon$ to obtain the required accuracy of the model. SVR has proven to be an effective tool for real-valued function estimation. One of its advantages is that its computational complexity does not depend on the dimensionality of the input space; it also has excellent generalization capability with high prediction accuracy. The input of the elastic net, RF, and SVR is the concatenated vector of the 2048-bit ECFPs and a gene expression vector composed of 172 values selected from the 19,144 values. The deep learning model is an ensemble 1D CNN based prediction model that uses the 2048-bit ECFP vector and the gene expression vector of 172 values selected from the 19,144 values as the input of each individual 1D CNN.
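For orientation, the three classical baselines can be instantiated with scikit-learn roughly as follows; the hyperparameter values and the placeholder data are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# X: concatenation of a 2048-bit ECFP vector and a 172-value gene expression
# vector per drug-cell line pair; y: ln(IC50) responses (placeholder data here).
X = np.random.rand(100, 2048 + 172)
y = np.random.rand(100)

baselines = {
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),                 # ridge + lasso penalties
    "RF": RandomForestRegressor(n_estimators=500, max_features="sqrt"),
    "SVR": SVR(kernel="rbf", epsilon=0.1),                              # epsilon-insensitive loss
}
for name, model in baselines.items():
    model.fit(X, y)                                                     # fit each baseline
```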
Figure 4 shows the workflow of the ensemble 1D CNN model for the prediction of drug response values. In the 1D CNN, the kernels and the pooling move along one dimension. The stride is set to 1 for all convolutional layers and to 2 for all max-pooling layers.
Dr.CNN is a 2D CNN based prediction model that uses the outer product of a drug similarity vector and a cell line similarity vector as its input. For the input of the Dr.CNN model, the $m \times m$ drug-drug similarity matrix $S_d$ based on the Tanimoto coefficient and the $n \times n$ cell line-cell line similarity matrix $S_c$ based on the gene expression scores via the RBF kernel function are first computed. The 2048-bit ECFP fingerprints of RDKit are used to compute the Tanimoto coefficients. Here, $m$ and $n$ denote the numbers of drugs and cell lines in the training dataset, respectively. Then, for every drug-cell line pair, the outer product $d_i \otimes c_j$ of the $m \times 1$ drug similarity vector $d_i$ and the $n \times 1$ cell line similarity vector $c_j$ is computed, where $d_i$ and $c_j$ are the $i$-th column of the similarity matrix $S_d$ and the $j$-th column of the similarity matrix $S_c$, respectively. That is, $d_i$ consists of the Tanimoto similarities between the $i$-th drug and the other drugs including itself, and $c_j$ consists of the gene expression similarities between the $j$-th cell line and the other cell lines including itself.
The parameters of Dr.CNN can be obtained by using the outer product as the input and the drug response as the output.
Figure 5 shows the architecture of the similarity-based CNN submodel of Dr.CNN. As shown in Figure 5, the CNN submodel may consist of two 2D convolutional layers, each followed by max-pooling, one flatten layer, and FC(128), FC(64), and FC(1) layers, where the number in parentheses denotes the number of nodes. The FC(128), FC(64), and FC(1) layers may use the rectified linear unit (ReLU), ReLU, and a linear function, respectively. To suppress overfitting, one embodiment may include dropout layers with a rate of 0.1 between the flatten layer and the FC(128) layer, between the FC(128) layer and the FC(64) layer, and between the FC(64) layer and the FC(1) layer. The numbers of filters of the two convolutional layers are 18 and 24, respectively, and one embodiment may use filters with kernel sizes of 5 x 5 and 3 x 3 for the convolutional layers, respectively. The max-pooling layers have size 2 and stride 2. For the training algorithm, one embodiment may set the batch size and the number of epochs to 32 and 20, respectively, and may use the Adam optimizer with a learning rate of 0.001.
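A sketch of one CNN submodel with the architecture described above might look as follows in Keras; the input shape, the ReLU activations of the convolutional layers, and the mean squared error loss are assumptions not specified above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_submodel(input_shape=(64, 64, 1)):   # placeholder shape of the outer-product input
    """2D CNN submodel: two conv/max-pool stages, flatten, then FC(128)-FC(64)-FC(1)."""
    return keras.Sequential([
        layers.Conv2D(18, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(24, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dropout(0.1),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(1, activation="linear"),
    ])

model = build_submodel()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# model.fit(x_train, y_train, batch_size=32, epochs=20)   # training settings described above
```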
Figure 6 is a block diagram showing an example of a system for implementing the drug response prediction method according to an embodiment, and conceptually shows the parts related to the present embodiment. All of the components may be provided in a single device and processed therein, but the system is not limited thereto and may also include components connected through a network and performed in separate devices.
The external server 20 may be connected to the prediction system 10 through a network and may provide information such as chemical property information of drugs, cell line information, and drug response information. For example, the chemical property information of a drug may include SMILES (Simplified Molecular-Input Line-Entry System) information, and the cell line information may include cell line gene expression information. Specifically, GDSC provides information on the drug responses observed for all pairs of cell lines and drugs, which may be received from the external server 20. For example, the external server 20 may be a database for the drug response prediction processing of the prediction system 10 or a server that provides such a database.
The prediction system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.
The control unit 11 controls the entire prediction system 10 and may include, for example, a processing unit such as a CPU or a GPU. The control unit 11 may train the models described above using the information stored in the memory unit 14 and may also compute predicted values for new inputs using the trained models. Specifically, the control unit 11 may control the model that predicts the drug response. To this end, the control unit 11 may include an internal memory for storing a control program such as an operating system (OS), programs defining various processing procedures, and data, and may perform the information processing for executing various processes based on these programs.
The communication unit 12 may include an interface that can be connected to a communication device such as a router connected to a communication line, and may control the communication between the prediction system 10 and the external server 20.
The input/output interface unit 13 may be an interface connected to the input unit 15 and/or the display unit 16, and the user may communicate with the prediction system 10 through the input/output interface unit 13. For example, the display unit 16 may be a display means for displaying a display screen of an application or the like (for example, a display composed of liquid crystal or organic EL, a monitor, or a touch panel). The input unit 15 may be, for example, a key input unit, a touch panel, a control pad (for example, a touch pad or a game pad), a mouse, a keyboard, or a microphone.
The memory unit 14 may be a device that stores various databases, tables, and the like. For example, the memory unit may provide information such as chemical property information of drugs, cell line information, and drug response information; the chemical property information of a drug may include SMILES information, and the cell line information may include cell line gene expression information. The memory unit may also store the processes for the inputs and outputs of the prediction system 10 and the results of such processing.
The embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be those specially designed and configured for the present invention or those known and available to those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.
In addition, the embodiments according to the present invention described above may be a set of program instructions that can be executed through various computer components and a user application itself for executing them. Specifically, it may be a program itself that can be downloaded through a server or a storage medium and installed on a client computer.
Although the present invention has been described above with reference to specific details such as specific components, limited embodiments, and drawings, these are provided only to assist a more general understanding of the present invention; the present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations from these descriptions.
Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also all modifications equivalent to these claims fall within the scope of the spirit of the present invention.
In addition, the embodiments of the present invention are not mutually exclusive, and a configuration of one embodiment may be applied to another embodiment. The embodiments of the present invention are provided as examples of some of the various forms that can be derived from various combinations of the components, and the present invention is not limited to the specific embodiments themselves.
Description of reference numerals
10: prediction system 20: external server
11: control unit 12: communication unit
13: input/output interface unit 14: memory unit
15: input unit 16: display unit

Claims (16)

  1. A method of predicting a drug response of a drug and a cell line using a convolutional neural network model, the method comprising:
    preparing a first drug-drug similarity matrix;
    preparing a second cell line-cell line similarity matrix;
    calculating an outer product between the first similarity matrix and the second similarity matrix, by calculating the outer product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix;
    training the convolutional neural network model using the outer product as an input value and using the drug response value of the i-th drug and the j-th cell line as an output value; and
    predicting and outputting a drug response value of a new drug and cell line using the trained convolutional neural network model.
  2. The method of claim 1, wherein
    the first similarity matrix is an m x m drug-drug similarity matrix based on Tanimoto coefficients,
    the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
    m and n denote the numbers of drugs and cell lines in a training dataset, respectively.
  3. The method of claim 2, wherein
    the i-th column vector of the first similarity matrix is the Tanimoto similarity between the i-th drug and the other drugs including the i-th drug, and
    the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines including the j-th cell line.
  4. The method of claim 1, wherein
    the drug response value is a $\ln(\mathrm{IC}_{50})$ value quantified through the half-maximal inhibitory concentration (IC50) value of the cell line for the drug.
  5. The method of claim 4, wherein
    the second similarity matrix includes a first submatrix and a second submatrix, and
    the calculating of the outer product between the first similarity matrix and the second similarity matrix comprises
    calculating a first outer product between the first similarity matrix and the first submatrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second submatrix of the second similarity matrix.
  6. The method of claim 5, wherein
    the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and
    the training of the convolutional neural network model comprises
    using the first outer product and the second outer product as input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using an average of output values of the first convolutional neural network model and the second convolutional neural network model as an output value.
  7. The method of claim 1, wherein
    the convolutional neural network model is a two-dimensional (2D) model.
  8. The method of claim 1, wherein
    the convolutional neural network model includes two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  9. A system for predicting a drug response between a drug and a cell line by using a convolutional neural network model, the system comprising:
    a controller for controlling the convolutional neural network model;
    a communication unit for communicating with an external server;
    a memory unit;
    a display unit; and
    an input unit for receiving a user's input,
    wherein the memory unit stores a first similarity matrix between drugs and a second similarity matrix between cell lines,
    the controller computes the outer product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix, and trains the convolutional neural network model by using the outer product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and
    the controller predicts the drug response value of a new drug and a new cell line by using the trained convolutional neural network model.
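    For the prediction step of the system claim, one hedged way to score a pair the model has not seen is to compute the new drug's Tanimoto similarities to the m training drugs and the new cell line's RBF similarities to the n training cell lines, take their outer product, and pass it through the trained network; the names train_fps, train_expr, model, and gamma below are placeholders, and the claim itself does not prescribe this featurisation of new inputs.

    import numpy as np

    def new_drug_column(new_fp, train_fps):
        # Tanimoto similarity of one new drug fingerprint to the m training drugs.
        new_fp = new_fp.astype(float)
        train_fps = train_fps.astype(float)
        inter = train_fps @ new_fp
        union = train_fps.sum(axis=1) + new_fp.sum() - inter
        return inter / np.maximum(union, 1e-12)

    def new_cell_column(new_expr, train_expr, gamma=1e-3):
        # RBF similarity of one new expression profile to the n training cell lines.
        d2 = ((train_expr - new_expr) ** 2).sum(axis=1)
        return np.exp(-gamma * d2)

    def predict_new_pair(model, new_fp, new_expr, train_fps, train_expr):
        d = new_drug_column(new_fp, train_fps)        # length m
        c = new_cell_column(new_expr, train_expr)     # length n
        x = np.outer(d, c)[None, ..., None]           # batch and channel axes for the 2D CNN
        return float(np.asarray(model(x)).ravel()[0])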
  10. The drug response prediction system of claim 9, wherein
    the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient,
    the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
    m and n denote the numbers of drugs and cell lines in the training dataset, respectively.
  11. The drug response prediction system of claim 10, wherein
    the i-th column vector of the first similarity matrix is the Tanimoto similarity between the i-th drug and the other drugs, including the i-th drug itself, and
    the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines, including the j-th cell line itself.
  12. The drug response prediction system of claim 9, wherein
    the drug response value is a value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug, as expressed by the formula shown in image PCTKR2022013647-appb-img-000098.
  13. The drug response prediction system of claim 12, wherein
    the second similarity matrix comprises a first sub-matrix and a second sub-matrix, and
    calculating the outer product between the first similarity matrix and the second similarity matrix comprises:
    calculating a first outer product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second sub-matrix of the second similarity matrix.
  14. The drug response prediction system of claim 13, wherein
    the convolutional neural network model comprises a first convolutional neural network model and a second convolutional neural network model, and
    training the convolutional neural network model comprises:
    using the first outer product and the second outer product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first convolutional neural network model and the second convolutional neural network model as the output value.
  15. The drug response prediction system of claim 9, wherein the convolutional neural network model is a two-dimensional (2D) model.
  16. The drug response prediction system of claim 9, wherein the convolutional neural network model comprises two two-dimensional convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
PCT/KR2022/013647 2021-09-10 2022-09-13 System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix WO2023038501A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2021-0121233 2021-09-10
KR1020210121233A KR102653969B1 (en) 2021-09-10 2021-09-10 A system of predicting drug response with convolutional neural networks based on similarity matrices of drugs and cell lines

Publications (1)

Publication Number Publication Date
WO2023038501A1 true WO2023038501A1 (en) 2023-03-16

Family

ID=85506828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/013647 WO2023038501A1 (en) 2021-09-10 2022-09-13 System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix

Country Status (2)

Country Link
KR (1) KR102653969B1 (en)
WO (1) WO2023038501A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101953762B1 (en) * 2017-09-25 2019-03-04 (주)신테카바이오 Drug indication and response prediction systems and method using AI deep learning based on convergence of different category data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ahmadi Moughari, Fatemeh; Eslahchi, Changiz: "A computational method for drug sensitivity prediction of cancer cell lines based on various molecular information", PLoS ONE, vol. 16, no. 4, 29 April 2021 (2021-04-29), e0250620, XP093045892, DOI: 10.1371/journal.pone.0250620 *
Menden, Michael P. et al.: "Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties", PLoS ONE, vol. 8, no. 4, 2013, e61318, XP055338891, DOI: 10.1371/journal.pone.0061318 *
Liu, Pengfei; Li, Hongjian; Li, Shuai; Leung, Kwong-Sak: "Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network", BMC Bioinformatics, vol. 20, no. 1, 29 July 2019 (2019-07-29), pages 1-14, XP021272446, DOI: 10.1186/s12859-019-2910-6 *
Wang, Yongcui; Fang, Jianwen; Chen, Shilong: "Inferences of drug responses in cancer cells from cancer genomic features and compound chemical and therapeutic properties", Scientific Reports, vol. 6, no. 1, 2016, 32679, XP093045889, DOI: 10.1038/srep32679 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275608A (en) * 2023-09-08 2023-12-22 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs
CN117275608B (en) * 2023-09-08 2024-04-26 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs

Also Published As

Publication number Publication date
KR102653969B1 (en) 2024-04-03
KR20230038016A (en) 2023-03-17

Similar Documents

Publication Publication Date Title
Wang et al. Confounder adjustment in multiple hypothesis testing
Rosenbaum et al. Inferring multi-target QSAR models with taxonomy-based multi-task learning
Wu et al. Network-based structural learning nonnegative matrix factorization algorithm for clustering of scRNA-seq data
WO2023038501A1 (en) System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix
Montserrat et al. Lai-net: Local-ancestry inference with neural networks
Huang et al. Clustering of cancer attributed networks by dynamically and jointly factorizing multi-layer graphs
US8972406B2 (en) Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
EP3338211A1 (en) Multi-level architecture of pattern recognition in biological data
Ma et al. Layer-specific modules detection in cancer multi-layer networks
CN111951886A (en) Drug relocation prediction method based on Bayesian inductive matrix completion
Weighill et al. Gene regulatory network inference as relaxed graph matching
Feng et al. PBPI: a high performance implementation of Bayesian phylogenetic inference
Yi et al. Learning representation of molecules in association network for predicting intermolecular associations
Yu et al. NPI-RGCNAE: fast predicting ncRNA-protein interactions using the relational graph convolutional network auto-encoder
Du et al. Deep multi-label joint learning for RNA and DNA-binding proteins prediction
US20170329913A1 (en) Method and system for determining an association of biological feature with medical condition
Wang et al. Multi-view random-walk graph regularization low-rank representation for cancer clustering and differentially expressed gene selection
Yu et al. Novel graphical representation of genome sequence and its applications in similarity analysis
Newaz et al. Inference of a dynamic aging-related biological subnetwork via network propagation
Li et al. Nonnegative matrix factorization for dynamic modules in cancer attribute temporal networks
Tian et al. GTAMP-DTA: Graph transformer combined with attention mechanism for drug-target binding affinity prediction
Fan et al. The EM algorithm and the rise of computational biology
Spirko-Burns et al. Supervised dimension reduction for large-scale “omics” data with censored survival outcomes under possible non-proportional hazards
Ma et al. Fusing heterogeneous genomic data to discover cancer progression related dynamic modules
Testa et al. A Non-Negative Matrix Tri-Factorization Based Method for Predicting Antitumor Drug Sensitivity

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22867769

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE