WO2023038501A1 - System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix - Google Patents

System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix

Info

Publication number
WO2023038501A1
Authority
WO
WIPO (PCT)
Prior art keywords
drug
neural network
similarity matrix
convolutional neural network model
Prior art date
Application number
PCT/KR2022/013647
Other languages
French (fr)
Korean (ko)
Inventor
심주용
황창하
손인석
Original Assignee
주식회사 아론티어
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 아론티어 filed Critical 주식회사 아론티어
Publication of WO2023038501A1 publication Critical patent/WO2023038501A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30Prediction of properties of chemical compounds, compositions or mixtures
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/60In silico combinatorial chemistry
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Definitions

  • Precision medicine aims to elaborately select cancer treatments based on the genetic information of each patient. One of the most important problems in precision medicine is predicting the anticancer drug response for each patient. Because of tumor heterogeneity, patients with the same type of cancer may have different responses to similar drugs. Therefore, it is very important to provide a predictive method that reveals the relationship between genomic information and drug response, which can be helpful for precision medicine.
  • Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines treated with multiple anticancer drugs. These large datasets allow the development of methods for predicting patient-specific drug response. Generally, methods for predicting drug response fall into two categories: classification approaches that predict sensitive drug-cell line pairs, and regression approaches that predict a quantitative measure of a cell line's response to a drug.
  • Various regression analysis approaches have been proposed to predict drug response using gene expression profiles or other molecular information of a cell line. Some prediction methods have improved drug response prediction by incorporating drug information, such as the chemical substructure of the drug, together with cell line information. In addition, numerous machine learning methods have been applied to the drug response prediction problem, including lasso (least absolute shrinkage and selection operator), elastic nets, random forests, kernel-based methods, neural networks, and deep learning. Ali and Aittokallio provide a comprehensive recent review (Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31-39 (2019)). Recent advances in deep learning have opened a new avenue for finding regression models for predicting drug response, ultimately providing a more accurate tool for predicting treatment response.
  • CaDRReS is a matrix factorization-based recommender system that can predict drug responses for new drugs and new cell lines by learning representations of drugs and cell lines in a latent space (Suphavilai, C., Bertrand, D. & Nagarajan, N. Predicting cancer drug response using a recommender system. Bioinformatics 34, 3907-3914 (2018)).
  • CDRscan is an ensemble model that includes five convolutional neural networks (CNNs). It uses the mutation profile of the cell line and the chemical substructure of the drug as input features for the CNNs.
  • CDCN (cell line-drug complex network) predicts drug response by inferring information from a simple network composed of cell lines and drugs.
  • ADRML (Anticancer Drug Response Prediction using Manifold Learning) maps drug response values to a low-dimensional latent space and computes drug response values for new cell line-drug pairs from that latent space (Moughari, F. A. & Eslahchi, C. ADRML: anticancer drug response prediction using manifold learning. Sci. Rep. 10, 14245 (2020)). It takes into account several types of cell line similarity and drug similarity and utilizes them in the manifold learning procedure. ADRML has been shown to provide accurate and robust predictions.
  • One embodiment provides a method of predicting the drug response of a drug and a cell line using a convolutional neural network model, the method comprising: preparing a first similarity matrix between drugs; preparing a second similarity matrix between cell lines; calculating the cross product between the i-th column vector (i = 1, 2, ..., m, where m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n, where n is an integer) of the second similarity matrix; learning the convolutional neural network model by taking the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value; and predicting and outputting the drug response value of a new drug and cell line using the learned convolutional neural network model.
  • In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction method is thereby provided.
  • RBF radial basis function
  • In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line.
  • In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
  • In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction method is thereby provided.
  • In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction method is thereby provided.
  • the convolutional neural network model may be a 2-dimensional model.
  • the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  • In another embodiment, the convolutional neural network model is trained using the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and the learned convolutional neural network model is used to predict the drug response value of a new drug and cell line; a drug response prediction system is thereby provided.
  • In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction system is thereby provided.
  • RBF radial basis function
  • In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction system is thereby provided.
  • In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
  • IC 50 half-maximal inhibitory concentration
  • In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction system is thereby provided.
  • In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction system is thereby provided.
  • the convolutional neural network model may be a 2-dimensional model.
  • the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  • One embodiment discloses the Dr.CNN model to solve the problem of predicting drug response values.
  • Dr.CNN simply divides the RBF kernel matrix associated with the cell lines into two sub-matrices and computes the cross product of each column vector of the Tanimoto similarity matrix with each column vector of each RBF kernel sub-matrix. These cross products are then used as the input values of the two respective CNN models, which allows faster computation and improved prediction performance through ensemble learning.
  • the RBF kernel matrix may be randomly divided and may be divided into two or more sub-matrices according to the number of cell lines.
  • Dr.CNN is the first non-linear method that applies 2D CNN to the cross product between the column vectors of the drug's Tanimoto similarity matrix and the column vectors of the cell line's RBF kernel matrix for drug response prediction.
  • Experimental results show that Dr.CNN exceeds the performance of existing models such as elastic net, RF, SVR, and the 1D CNN ensemble. Dr.CNN can be further improved by adjusting the CNN architecture according to the data structure.
  • The main idea behind Dr.CNN is to integrate the two modalities using the cross product and apply a CNN to the resulting matrix.
  • Dr.CNN is a very effective approach for drug response prediction and can play a huge role in the drug development process.
  • Figure 1 shows the overall workflow of an embodiment proposed for the prediction of drug response values.
  • FIG. 2 shows a summary of the GDSC1 and GDSC2 datasets used in one example.
  • Figure 3 shows scatterplots of the drug response values predicted by one embodiment versus the measured values of the GDSC2 and GDSC1 datasets.
  • Figure 4 shows the workflow of an ensemble 1D CNN model for prediction of drug response values.
  • FIG. 5 shows the architecture of a CNN submodel based on similarity according to an embodiment.
  • FIG. 6 is a block diagram showing an example of a system for implementing a drug response prediction method according to an embodiment.
  • One embodiment of the present invention provides a similarity-based ensemble deep learning model that predicts drug response values by applying a two-dimensional CNN to the outer product of a column vector of the drug Tanimoto similarity matrix and a column vector of the cell line radial basis function (RBF) kernel matrix.
  • In one embodiment, this model is referred to as Dr.CNN.
  • Figure 1 shows the overall workflow of Dr.CNN proposed for predicting drug response values.
  • The Dr.CNN workflow predicts the final drug response value based on the drug response values obtained from the two sub-networks running inside the ensemble deep learning architecture, for example, as their average.
  • Using the cross product of the drug similarity vector and the cell line similarity vector as an input value, a 2D CNN is used to learn features.
  • Each CNN may be composed of, for example, two convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and three fully connected (FC) layers. A detailed description of the CNN structure is given later with reference to FIG. 5.
  • In one embodiment, the RBF kernel matrix of the cell lines is divided into two sub-matrices, the cross product of the column vectors of the Tanimoto similarity matrix and the column vectors of each RBF kernel sub-matrix is built, and it is used as the input value of the CNN.
  • the RBF kernel matrix of cell lines may be divided into two or more sub-matrices according to the number of cell lines.
  • An embodiment may consist of two steps. In the first step, the Tanimoto similarity matrix of the drugs and the RBF kernel matrix of the cell lines are calculated, and then the cross product between the column vectors of the Tanimoto similarity matrix and the column vectors of each RBF kernel sub-matrix is computed.
  • In the second step, a 2D CNN model is applied to extract features from the outer product and to predict drug response values in each of the two sub-networks.
  • In one embodiment, the final prediction of the drug response value may be obtained by combining the drug response values from the two learned sub-networks, for example, by taking their average, as in the sketch below.
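  • As an illustration only, the following Python sketch shows this ensemble step with NumPy, assuming two already-trained sub-models that expose a predict method; the helper names (split_rbf_kernel, ensemble_predict) are hypothetical and not taken from the original.

```python
import numpy as np

def split_rbf_kernel(C, n_parts=2, seed=0):
    """Randomly split the columns of the n x n cell line RBF kernel matrix C
    into n_parts sub-matrices (two by default), one per CNN sub-network."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(C.shape[1])
    return [C[:, idx] for idx in np.array_split(order, n_parts)]

def ensemble_predict(sub_models, X):
    """Final drug response value = average of the predictions of the trained
    sub-networks for the same cross-product inputs X."""
    preds = np.stack([m.predict(X).ravel() for m in sub_models])
    return preds.mean(axis=0)
```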
  • Dr.CNN is validated by comparing it to other machine learning and deep learning models in terms of root mean squared error (RMSE), concordance index (CI), Pearson's correlation coefficient r, and the modified squared correlation coefficient r_m^2.
  • CI is the rank correlation between observed data and predicted data.
  • Embodiments of the present invention may help users predict drug response values.
  • SMILES Simple Molecular-Input Line-Entry System
  • In one embodiment, a publicly available database, GDSC, is used for the drug responses observed in all pairs of cell lines and drugs.
  • The SMILES of the GDSC drugs are obtained from PubChem.
  • GDSC consists of the GDSC1 (Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740-754 (2016)) and GDSC2 (Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019)) datasets.
  • One embodiment uses the GDSC1 and GDSC2 datasets for evaluation of drug response prediction.
  • the GDSC1 dataset tested 681 cell lines across 234 compounds using Resazurin or Syto60 assays.
  • the GDSC2 dataset tested 588 cell lines across 147 compounds with the CellTitreGlo assay.
  • The GDSC1 dataset contains 131,894 drug response values actually observed for pairs of the 234 drugs and 681 cell lines, measured as IC50 values.
  • The GDSC2 dataset includes 72,393 drug response values observed for pairs of the 147 drugs and 588 cell lines. Table 1 shows these two datasets in the form used in the actual experiments. In one embodiment, the IC50 values converted to log space are used.
  • Figure 2 shows a summary of the GDSC1 dataset and the GDSC2 dataset.
  • FIG. 2(a) and (b) show the distributions of drug response values in the GDSC1 and GDSC2 datasets, respectively, and FIG. 2(c) and (d) show the distributions of SMILES string lengths of drugs in the GDSC1 and GDSC2 datasets, respectively. For the drug response values of the GDSC1 dataset, the mean and standard deviation are -0.9032 and 1.1777, respectively; for the GDSC2 dataset, they are -1.2472 and 1.2182, respectively.
  • In the GDSC1 dataset, the maximum SMILES length of the drugs is 133 and the average is 62.
  • the maximum SMILES length for drugs in the GDSC2 dataset is 126, and the average is 62.
  • For Dr.CNN, a drug-drug similarity matrix and a cell line-cell line similarity matrix are used for the GDSC1 and GDSC2 datasets. These two matrices are denoted D (of size m x m) and C (of size n x n), respectively. To create an ensemble model, one embodiment divides the matrix C into two sub-matrices, i.e., C = [C_1, C_2], where C_1 and C_2 together contain the n columns of C. The input value of Dr.CNN for each drug-cell line pair is the cross product of d_i and c_j, where d_i is the i-th column of the drug similarity matrix D, c_j is the j-th column of the cell line similarity matrix C, and the superscript t denotes the transpose of a vector. The cross product d_i ⊗ c_j is defined as Equation (1) below:

$$d_i \otimes c_j = d_i c_j^{t} = \begin{pmatrix} d_{1i}c_{1j} & d_{1i}c_{2j} & \cdots & d_{1i}c_{nj} \\ d_{2i}c_{1j} & d_{2i}c_{2j} & \cdots & d_{2i}c_{nj} \\ \vdots & \vdots & \ddots & \vdots \\ d_{mi}c_{1j} & d_{mi}c_{2j} & \cdots & d_{mi}c_{nj} \end{pmatrix} \qquad (1)$$

  • This cross product yields two kinds of information: the bimodal interactions and the primitive unimodal representations of the individual modalities. Therefore, d_i ⊗ c_j includes all combinations of the information in d_i and c_j, which indicates that it can be a more effective input value for predicting the drug response of the i-th drug and the j-th cell line than a simple concatenation of d_i and c_j. A minimal construction of this input is sketched below.
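  • As an illustration only, a minimal NumPy sketch of building the cross-product input of Equation (1) for one drug-cell line pair; the matrix names D and C follow the notation introduced above, and the random matrices are placeholders.

```python
import numpy as np

def cross_product_input(D, C, i, j):
    """Outer product of the i-th column of the m x m drug similarity matrix D
    and the j-th column of the n x n cell line similarity matrix C
    (Equation (1)); the result is an m x n matrix d_i c_j^t."""
    d_i = D[:, i]              # Tanimoto similarities of drug i to all m drugs
    c_j = C[:, j]              # RBF similarities of cell line j to all n cell lines
    return np.outer(d_i, c_j)

# toy usage with random symmetric stand-ins for the similarity matrices
m, n = 4, 6
D = np.random.rand(m, m); D = (D + D.T) / 2
C = np.random.rand(n, n); C = (C + C.T) / 2
X_ij = cross_product_input(D, C, i=1, j=2)   # shape (4, 6)
```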
  • the drug-drug similarity is calculated using the Tanimoto coefficient T, which is the most popular similarity measure for comparing chemical structures represented by fingerprints.
  • T Tanimoto coefficient
  • One embodiment uses RDKit's topological fingerprint.
  • the Tanimoto similarity scale has a value from 0 to 1, and can be interpreted as a percentage of a property shared by two drugs.
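  • As an illustration only, a sketch of computing such a drug-drug Tanimoto similarity matrix with RDKit (assumed installed); the document mentions both RDKit's topological fingerprint and a 2048-bit ECFP, so the fingerprint choice below is illustrative.

```python
import numpy as np
from rdkit import Chem, DataStructs

def tanimoto_similarity_matrix(smiles_list):
    """m x m drug-drug similarity matrix from SMILES strings, using RDKit's
    topological fingerprint and the Tanimoto coefficient."""
    fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    m = len(fps)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            D[i, j] = DataStructs.TanimotoSimilarity(fps[i], fps[j])
    return D

# toy usage: three small molecules
D = tanimoto_similarity_matrix(["CCO", "CCN", "c1ccccc1"])   # values in [0, 1]
```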
  • In one embodiment, the cell line-cell line similarity can be calculated from the gene expression vectors using the RBF kernel described by Equation (2):

$$K(x_j, x_k) = \exp\!\left(-\frac{\lVert x_j - x_k \rVert^2}{2\sigma^2}\right) \qquad (2)$$

where x_j and x_k are the gene expression vectors of two cell lines and σ is the kernel bandwidth.
  • The RBF kernel is a popular kernel function used in various kernel learning algorithms.
  • The RBF kernel is a measure of the degree of similarity between vectors.
  • The value of the RBF kernel decreases with the distance between the two vectors and lies in the range from 0 (in the limit of large distance) to 1 (when the two vectors coincide). That is, if two vectors are close to each other, the distance is small and the kernel value is close to 1; as the distance increases, the kernel value gradually decreases toward 0. Thus, nearby vectors have larger RBF kernel values than distant vectors.
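  • As an illustration only, a sketch of the cell line-cell line similarity computation with scikit-learn's rbf_kernel; the kernel bandwidth is not specified in the document, so the library default (1 divided by the number of genes) is used as a placeholder.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def cell_line_similarity_matrix(expr, gamma=None):
    """n x n cell line-cell line similarity matrix from gene expression vectors
    using the RBF kernel of Equation (2); expr has shape (n_cell_lines, n_genes)."""
    return rbf_kernel(expr, gamma=gamma)

# toy usage: 5 cell lines with 20 gene expression values each
expr = np.random.rand(5, 20)
C = cell_line_similarity_matrix(expr)   # values in (0, 1], diagonal equal to 1
```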
  • In one embodiment, the output value for each drug-cell line pair corresponds to the log-transformed IC50 value.
  • the predictive performance of the GDSC2 dataset was evaluated by a 5-fold cross validation experiment. This technique randomly partitions the dataset into 5 folds of approximately equal size. One fold is treated as a validation set, and learning is applied to the remaining 4 folds. This procedure is repeated 5 times, in each procedure a different instance group is treated as a validation set.
  • One embodiment also evaluated the predictive performance of the GDSC1 dataset after training five models using the GDSC2 dataset as a training dataset.
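  • As an illustration only, a sketch of the 5-fold partitioning with scikit-learn; the shuffling and random seed are assumptions, not taken from the original.

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_indices(n_samples, seed=0):
    """Randomly partition the drug-cell line pairs into 5 folds of roughly
    equal size; each fold serves once as the validation set while the other
    four folds are used for training."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return list(kf.split(np.arange(n_samples)))

# toy usage over 1,000 pairs
for fold, (train_idx, val_idx) in enumerate(five_fold_indices(1000)):
    print(fold, len(train_idx), len(val_idx))
```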
  • One embodiment used four indicators to evaluate the performance of the regression models: RMSE, CI, Pearson's correlation coefficient r, and the modified squared correlation coefficient r_m^2.
  • Because a regression technique is used, one embodiment uses RMSE, a commonly used metric for the error of continuous predictions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k - \hat{y}_k\right)^2}$$

where y_k is the actual output value, ŷ_k is the corresponding prediction, and n is the number of samples.
  • CI can be used as an evaluation index for prediction accuracy (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)).
  • The intuition behind CI is as follows: the CI for a set of paired data is the probability that the predictions for two randomly drawn drug-cell line pairs with different label values are in the correct order, i.e., that the prediction for the pair with the larger affinity value is larger than the prediction for the pair with the smaller affinity value.
  • the CI ranges from 0.5 to 1.0, where 0.5 corresponds to random prediction and 1.0 corresponds to perfect prediction accuracy.
  • Roy and Roy's modified squared correlation coefficient is used (Roy, P. & Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27, 302-313 (2008)):

$$r_m^2 = r^2\left(1 - \sqrt{\left|\,r^2 - r_0^2\,\right|}\right)$$

  • r^2 and r_0^2 are the squared correlation coefficients between observed and predicted values with and without the intercept, respectively.
  • A model with r_m^2 greater than 0.5 is judged to be an acceptable model.
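  • As an illustration only, a sketch of the four evaluation indicators in Python; the convention used below for r_0^2 (a regression through the origin of the observed on the predicted values) is an assumption, since the document does not spell it out.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def concordance_index(y, y_hat):
    """Probability that two randomly drawn pairs with different observed values
    are predicted in the correct order (ties in the prediction count as 0.5)."""
    num, den = 0.0, 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] == y[j]:
                continue
            den += 1
            s = (y_hat[i] - y_hat[j]) * (y[i] - y[j])
            num += 1.0 if s > 0 else (0.5 if s == 0 else 0.0)
    return num / den

def r_m_squared(y, y_hat):
    """Roy and Roy's modified squared correlation coefficient."""
    r2 = pearsonr(y, y_hat)[0] ** 2
    k = np.sum(y * y_hat) / np.sum(y_hat ** 2)                    # slope without intercept
    r0_2 = 1 - np.sum((y - k * y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2 * (1 - np.sqrt(abs(r2 - r0_2)))

# toy usage
y, y_hat = np.random.randn(50), np.random.randn(50)
print(rmse(y, y_hat), concordance_index(y, y_hat), pearsonr(y, y_hat)[0], r_m_squared(y, y_hat))
```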
  • The following describes the predictive performance of Dr.CNN, which is one embodiment, for the prediction of drug response values on the GDSC1 and GDSC2 datasets. The performance of the model of one embodiment was evaluated against these two benchmark datasets.
  • the predictive performance of the GDSC2 dataset was evaluated through 5-fold cross-validation.
  • Table 2 shows the performance results of five models through overlapping 5-fold cross-validation on the GDSC2 dataset. Values in bold represent the best performance results. Standard errors are given in parentheses.
  • Dr.CNN of one embodiment shows the best performance in all indicators of the GDSC2 dataset.
  • To statistically evaluate the significant improvement of Dr.CNN, a one-sided t-test was performed.
  • Dr.CNN, which showed the best performance results, was compared with each of the other models. The null hypotheses related to Table 2 therefore state, for each of the four indicators, that the compared model performs at least as well as Dr.CNN.
  • All relevant P-values in the above hypothesis tests are calculated to be less than 0.01. Therefore, Dr.CNN shows significantly better performance than other models for all four indicators.
  • Dr.CNN is the most acceptable model because it yields a significantly greater r_m^2 than the other models in the 5-fold cross-validation on the GDSC2 dataset.
  • Table 3 shows the performance results of the five models on the GDSC1 dataset. Values shown in bold represent the best performance results, and standard errors are listed in parentheses. Since only GDSC1 is used to calculate the evaluation index of each model, it is not possible to specify a statistically significant model through the evaluation index. Therefore, a bootstrap method was used to estimate the mean and standard error of each evaluation index (Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman Hall, 1993)). This makes it possible to specify a statistically significant model based on the mean and standard error estimated for each indicator.
  • Bootstrap methods are known to be useful for estimating the sampling distribution of evaluation indicators without using normal theory.
  • The bootstrap method involves repeatedly resampling the GDSC1 dataset by sampling with replacement.
  • When applying bootstrap sampling to the GDSC1 dataset, one embodiment sets the bootstrap sample size to 131,894 and repeats the sampling process 20 times, as sketched below.
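  • As an illustration only, a sketch of this bootstrap estimation with NumPy; using the standard deviation of the bootstrap replicates as the standard-error estimate is an assumption, not stated in the original.

```python
import numpy as np

def bootstrap_metric(y, y_hat, metric, n_boot=20, sample_size=None, seed=0):
    """Estimate the mean and standard error of an evaluation metric by
    resampling the (observed, predicted) pairs with replacement, e.g. 20
    resamples of size 131,894 for the GDSC1 dataset."""
    rng = np.random.default_rng(seed)
    size = len(y) if sample_size is None else sample_size
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=size)     # sampling with replacement
        scores.append(metric(y[idx], y_hat[idx]))
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1)         # mean and standard-error estimate

# toy usage with RMSE as the metric
y, y_hat = np.random.randn(500), np.random.randn(500)
mean_rmse, se_rmse = bootstrap_metric(y, y_hat, lambda a, b: np.sqrt(np.mean((a - b) ** 2)))
```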
  • Model | RMSE | CI | r | r_m^2
  • RF | 1.1992 (0.0006) | 0.6180 (0.0002) | 0.4210 (0.0008) | 0.1420 (0.0007)
  • SVR | 1.0948 (0.0004) | 0.6176 (0.0002) | 0.4606 (0.0005) | 0.1756 (0.0005)
  • Elastic Net | 1.0890 (0.0004) | 0.6398 (0.0002) | 0.4829 (0.0006) | 0.2034 (0.0006)
  • 1D CNNs | 1.0652 (0.0006) | 0.6321 (0.0001) | 0.5080 (0.0005) | 0.2536 (0.0003)
  • Dr.CNN | 1.0524 (0.0005) | 0.6531 (0.0002) | 0.5597 (0.0005) | 0.3024 (0.0007)
  • Dr.CNN shows the best performance in all 4 indicators for 20 bootstrap samples of the GDSC1 dataset.
  • a one-tailed t-test was performed to statistically evaluate the significant improvement of Dr.CNN.
  • The best-performing Dr.CNN model was compared with each of the other models. The null hypotheses related to Table 3 therefore state, for each of the four indicators, that the compared model performs at least as well as Dr.CNN. All relevant P-values in these hypothesis tests are calculated to be less than 0.01; therefore, Dr.CNN shows significantly better performance than the other models for all four indicators.
  • Although Dr.CNN did not obtain a sufficiently large r_m^2 on the GDSC1 dataset, it is the most acceptable model, because it shows a larger r_m^2 value than the other models for the 20 bootstrap samples of the GDSC1 dataset.
  • Figure 3 shows scatterplots of the values predicted by Dr.CNN versus the measured values of the GDSC2 and GDSC1 datasets.
  • the prediction value is obtained by alternately using 4 folds of the GDSC2 dataset as a training dataset and the remaining one fold as a test dataset.
  • the predicted values are obtained using the GDSC2 dataset as a training dataset and the GDSC1 dataset as a test dataset.
  • For an ideal regression model, the predicted values are expected to equal the measured values, i.e., the points are expected to lie on the identity line. In particular, for the GDSC2 dataset, the points show high density around this straight line.
  • In the random forest (RF), the split at each node is determined by considering only a subset of all the features.
  • This algorithm is simple and fast and is resistant to overfitting; in general, it shows better performance than using a single good regression model.
  • SVR provides the flexibility to define an acceptable error for the model and can find a high-dimensional hyperplane that fits the data.
  • The objective of SVR is to minimize the coefficients, specifically the L2-norm of the coefficient vector, rather than the squared error. The error term is instead handled by a constraint that requires the absolute error to be less than or equal to a specified margin, called the maximum error ε. In one embodiment, ε can be tuned to obtain the required accuracy of the model. SVR has proven to be an effective tool in real-valued function estimation.
  • In one embodiment, the input of the elastic net, RF, and SVR is a gene expression vector consisting of 172 values selected from 1,444 values, concatenated with the 2048-bit-long ECFP vector (see the sketch below).
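  • As an illustration only, a sketch of the three baseline regressors with scikit-learn; the hyperparameters are not given in the document, so the values below are placeholders, and the random inputs stand in for the concatenated feature vectors described above.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# stand-in features: 2048 ECFP bits plus 172 selected gene expression values
X = np.random.rand(200, 2048 + 172)
y = np.random.randn(200)                 # drug response values

baselines = {
    "Elastic Net": ElasticNet(),
    "RF": RandomForestRegressor(n_estimators=100),
    "SVR": SVR(),                        # epsilon (the error margin) can be tuned
}
for name, model in baselines.items():
    model.fit(X, y)
    print(name, model.predict(X[:3]))
```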
  • One of the deep learning models is an ensemble 1D CNN-based prediction model that uses a 2048-bit ECFP vector and a gene expression vector consisting of 172 values selected from 19,144 values as input to each individual 1D CNN.
  • FIG 4 shows the workflow of an ensemble 1D CNN model for prediction of drug response values.
  • In a 1D CNN, the kernel and the pooling move along one dimension.
  • In one embodiment, the stride is set to 1 for all convolutional layers and to 2 for all max-pooling layers.
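  • As an illustration only, a Keras sketch of one such 1D CNN with a drug branch and a cell line branch; only the strides (1 for the convolutions, 2 for the max-pooling) and the input sizes come from the document, while the filter counts, kernel sizes, and the way the two branches are combined are assumptions.

```python
from tensorflow.keras import layers, Model

def branch(length, name):
    """One 1D CNN branch; inputs are reshaped to (length, 1) so that the kernel
    and the pooling move along a single dimension."""
    inp = layers.Input(shape=(length, 1), name=name)
    x = layers.Conv1D(16, 7, strides=1, activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    x = layers.Conv1D(32, 5, strides=1, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
    return inp, layers.Flatten()(x)

drug_in, drug_feat = branch(2048, "ecfp")        # 2048-bit ECFP vector
cell_in, cell_feat = branch(172, "expression")   # 172 selected gene expression values
x = layers.Concatenate()([drug_feat, cell_feat])
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(1, activation="linear")(x)
model = Model([drug_in, cell_in], out)
model.compile(optimizer="adam", loss="mse")
```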
  • Dr.CNN is a 2D CNN-based prediction model that uses the cross product of a drug similarity vector and a cell line similarity vector as an input.
  • In one embodiment, an m x m drug-drug similarity matrix based on the Tanimoto coefficient and an n x n cell line-cell line similarity matrix based on gene expression values through the RBF kernel function are calculated. RDKit's 2048-bit-long ECFP fingerprint is used to compute the Tanimoto coefficients.
  • m and n represent the number of drugs and cell lines in the training dataset, respectively.
  • Then, the cross product of the m x 1 drug similarity vector d_i and the n x 1 cell line similarity vector c_j is calculated, where d_i is the i-th column of the drug similarity matrix and c_j is the j-th column of the cell line similarity matrix.
  • In other words, d_i is composed of the Tanimoto similarities between the i-th drug and the other drugs including itself, and c_j is composed of the gene expression similarities between the j-th cell line and the other cell lines including itself.
  • Parameters related to Dr.CNN can be obtained by taking the cross product as an input and taking the drug response as an output.
  • In one embodiment, the CNN submodel may consist of two 2D convolution layers, each followed by max-pooling, one flatten layer, and FC(128), FC(64), and FC(1) layers. The numbers in parentheses indicate the number of nodes.
  • The FC(128) and FC(64) layers may use a Rectified Linear Unit (ReLU) activation, and the FC(1) layer may use a linear activation function.
  • ReLU Rectified Linear Unit
  • To reduce overfitting, a dropout layer with a drop ratio of 0.1 may be included between the flatten layer and the FC(128) layer, between the FC(128) layer and the FC(64) layer, and between the FC(64) layer and the FC(1) layer.
  • the number of filters for each convolution layer is 18 and 24, respectively.
  • filters having kernel sizes of 5 x 5 and 3 x 3 may be used for the convolution layer, respectively.
  • the max pooling layer has size 2 and stride 2.
  • the batch size and the number of epochs may be set to 32 and 20, respectively, for the learning algorithm.
  • An embodiment may use an Adam optimizer with a learning rate of 0.001.
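  • As an illustration only, a Keras sketch of the 2D CNN submodel described above; the convolutional activations and the loss function are assumptions, since the document does not state them.

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn_submodel(m, n):
    """Two Conv2D layers (18 and 24 filters, 5x5 and 3x3 kernels), each followed
    by 2x2 max-pooling with stride 2, a flatten layer, dropout (rate 0.1) around
    the dense layers, and FC(128), FC(64), FC(1) layers."""
    model = models.Sequential([
        layers.Conv2D(18, (5, 5), activation="relu", input_shape=(m, n, 1)),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(24, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dropout(0.1),
        layers.Dense(128, activation="relu"),   # FC(128), ReLU
        layers.Dropout(0.1),
        layers.Dense(64, activation="relu"),    # FC(64), ReLU
        layers.Dropout(0.1),
        layers.Dense(1, activation="linear"),   # FC(1): predicted drug response
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001), loss="mse")
    return model

# training would use, for example, batch_size=32 and epochs=20:
# model = build_cnn_submodel(m=147, n=588)    # e.g. GDSC2: 147 drugs, 588 cell lines
# model.fit(X_train, y_train, batch_size=32, epochs=20)
```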
  • FIG. 6 is a block diagram showing an example of a system for implementing a method for predicting drug response according to an embodiment, conceptually showing parts related to the present embodiment.
  • Each component may be provided in one device and may be processed independently, but is not limited thereto, and may also include a device connected through a network so that each component is performed in a separate device.
  • the external server 20 may be connected to the prediction system 10 through a network, and may provide information on chemical characteristics of drugs, cell line information, drug response information, and the like.
  • drug chemical property information may include SMILES (Simplified Molecular-Input Line-Entry System) information
  • cell line information may include cell line gene expression information.
  • GDSC provides information on drug responses observed in all pairs of cell lines and drugs, which may be received from the external server 20 .
  • the external server 20 may be a database for drug response prediction processing of the prediction system 10 or a server that provides it.
  • the prediction system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.
  • the control unit 11 is a component that controls the entire prediction system 10, and may include, for example, a processing unit such as a CPU or a GPU.
  • the control unit 11 may use information stored in the memory unit 14 to learn models to be described later, and may also calculate a predicted value for a new input through the learned model.
  • the controller 11 may control a model for predicting a drug response.
  • the control unit 11 may include an internal memory for storing a control program such as an OS (operating system), a program defining various processing procedures, and data. Then, the control unit 11 can perform information processing for executing various processes by these programs and the like.
  • In one embodiment, the communication unit 12 may include an interface that can be connected to a communication device, such as a router connected to a communication line, and can control communication between the prediction system 10 and the external server 20.
  • the input/output interface unit 13 may be an interface connected to the input unit 15 and/or the display unit 16 .
  • the prediction system 10 and the user may communicate through the input/output interface 13 .
  • the display unit 16 may be a display means (for example, a display made of liquid crystal or organic EL, a monitor, a touch panel, etc.) for displaying a display screen of an application or the like.
  • the input unit 15 may be, for example, a key input unit, a touch panel, a control pad (eg, a touch pad, a game pad, etc.), a mouse, a keyboard, a microphone, or the like.
  • the memory unit 14 may be a device for storing various databases or tables.
  • the memory unit may provide information about chemical properties of drugs, cell line information, drug response information, and the like.
  • drug chemical property information may include SMILES (Simplified Molecular-Input Line-Entry System) information
  • cell line information may include cell line gene expression information.
  • In addition, processes for the input and output of the prediction system 10 may be stored, and the result values of such processing may also be stored.
  • Embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium.
  • a computer-readable recording medium may include program instructions, data files, data structures, etc. alone or in combination.
  • Program instructions recorded on a computer readable recording medium may be specially designed and configured for the present invention or may be known and usable to those skilled in the art of computer software.
  • Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.
  • Examples of program instructions include high-level language codes that can be executed by a computer using an interpreter or the like as well as machine language codes generated by a compiler.
  • the hardware device may be configured to act as one or more software modules to perform processing according to the present invention and vice versa.
  • the embodiments according to the present invention described above may be a set of program commands that can be executed through various computer components and a user application itself for executing them. Specifically, it may be a program itself that can be installed on a client computer after being downloaded through a server or through a storage medium.
  • embodiments of the present invention are not mutually exclusive, and configurations of one embodiment may be applied to other embodiments.
  • Embodiments of the present invention are provided as examples of some of the various forms that can be derived from various combinations of components, and are not limited to the specific embodiments of the present invention itself.
  • 11: control unit; 12: communication unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Medicinal Chemistry (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

Human cancer cell lines are frequently used in researching the biology of cancer and in research for testing the effects of cancer treatment. Accurately predicting drug responses by using pharmacogenomic data is an essential problem in oncology and precision medicine. On the basis that similar cell lines respond similarly to similar drugs, provided is an ensemble deep learning system in which a cross product between column vectors of a drug-drug similarity matrix and a cell line-cell line similarity matrix is applied to a convolutional neural network. Therefore, it has been identified that genetic characteristics of patients are connected to drug sensitivity so as to be useful in precision medicine.

Description

A system for predicting drug response using a convolutional neural network based on similarity matrices of drugs and cell lines
Precision medicine aims to elaborately select cancer treatments based on the genetic information of each patient. One of the most important problems in precision medicine is predicting the anticancer drug response for each patient. Because of tumor heterogeneity, patients with the same type of cancer may have different responses to similar drugs. Therefore, it is very important to provide a predictive method that reveals the relationship between genomic information and drug response, which can be helpful for precision medicine.
Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) are two projects that have provided molecular profiles and drug response values for hundreds of cancer cell lines treated with multiple anticancer drugs. These large datasets allow the development of methods for predicting patient-specific drug response. Generally, methods for predicting drug response are classified into two categories. The first is a classification approach that predicts sensitive drug-cell line pairs. The second is a regression analysis approach that predicts a reference value for measuring the response of a cell line to a drug. One embodiment of the present invention discloses a regression analysis approach that predicts the log-transformed drug response value of a cell line for a drug, quantified through the IC50 (half-maximal inhibitory concentration) value. Various regression analysis approaches have been proposed to predict drug response using gene expression profiles or other molecular information of a cell line. Some prediction methods have improved drug response prediction by incorporating drug information, such as the chemical substructure of the drug, together with cell line information. In addition, numerous machine learning methods have been applied to the problem of predicting drug response; for example, regression analysis methods such as lasso (least absolute shrinkage and selection operator), elastic nets, random forests, kernel-based methods, neural networks, and deep learning have been applied. Ali and Aittokallio provide a comprehensive recent review (Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31-39 (2019)). Recent advances in deep learning have opened a new avenue for finding regression models for predicting drug response, ultimately providing a more accurate tool for predicting treatment response.
Wang et al. proposed a similarity-regularized matrix factorization (SRMF) method for predicting drug response, which simultaneously includes the gene expression profile similarity of cell lines and the chemical substructure similarity of drugs (Wang, L., Li, X., Zhang, L. & Gao, Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer 17, 513 (2017)). Patients with similar genetic characteristics have been shown to have similar responses to similar drugs. Suphavilai et al. devised a matrix factorization-based recommender system called "CaDRReS" that can predict drug responses for new drugs and new cell lines by learning representations of drugs and cell lines in a latent space (Suphavilai, C., Bertrand, D. & Nagarajan, N. Predicting cancer drug response using a recommender system. Bioinformatics 34, 3907-3914 (2018)). This showed that the characteristics of the latent space are correlated with the pathways of the drugs. Chang et al. proposed "CDRscan", an ensemble model that includes five convolutional neural networks (CNNs). It uses the mutation profile of the cell line and the chemical substructure of the drug as input features for the CNNs. The drug response value is computed as the average of the output values of the five CNNs. However, "CDRscan" tends not to predict drug responses for new drugs and new cell lines well. Wei et al. recently devised the cell line-drug complex network (CDCN), which predicts drug response by inferring information from a simple network composed of cell lines and drugs (Wei, D., Liu, C., Zheng, X. & Li, Y. Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model. BMC Bioinf. 20, 44 (2019)).
CDCN provides satisfactory results for imputing missing drug information. Moughari and Eslahchi devised a model for Anticancer Drug Response Prediction using Manifold Learning (ADRML) (Moughari, F. A. & Eslahchi, C. ADRML: anticancer drug response prediction using manifold learning. Sci. Rep. 10, 14245 (2020)). ADRML maps drug response values to a low-dimensional latent space and computes drug response values for new cell line-drug pairs from the latent space. It takes into account several types of cell line similarity and drug similarity and utilizes them in the manifold learning procedure. ADRML has been shown to provide accurate and robust predictions.
Patients with the same type of cancer may respond differently to similar drugs. Therefore, it is very important to accurately predict the drug response of each individual patient.
In addition, existing approaches tend not to predict drug responses for new drugs and new cell lines well. Therefore, a new approach is required to overcome this.
In addition, a new method is required that can simplify the computational process while providing more accurate drug response predictions.
One embodiment of the present invention provides a method of predicting the drug response of a drug and a cell line using a convolutional neural network model, the method comprising: preparing a first similarity matrix between drugs; preparing a second similarity matrix between cell lines; calculating the cross product between the first similarity matrix and the second similarity matrix, wherein the cross product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix is calculated; learning the convolutional neural network model by taking the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value; and predicting and outputting the drug response value of a new drug and cell line using the learned convolutional neural network model.
In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction method is thereby provided.
In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction method is thereby provided.
In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug. In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction method is thereby provided.
In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction method is thereby provided.
Also, the convolutional neural network model may be a two-dimensional (2D) model.
In addition, the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
Another embodiment of the present invention provides a system for predicting the drug response of a drug and a cell line using a convolutional neural network model, the system comprising: a control unit for controlling the convolutional neural network model; a communication unit for communication with an external server; a memory unit; a display unit; and an input unit that receives a user's input, wherein the memory unit stores a first similarity matrix between drugs and a second similarity matrix between cell lines, and the control unit calculates the cross product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix, trains the convolutional neural network model using the cross product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and predicts the drug response value of a new drug and cell line using the learned convolutional neural network model.
In addition, the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, and the second similarity matrix is an n x n cell line-cell line similarity matrix given by a radial basis function (RBF) kernel, where m and n represent the numbers of drugs and cell lines in the training dataset, respectively; a drug response prediction system is thereby provided.
In addition, the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the other drugs including the i-th drug, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the other cell lines including the j-th cell line; a drug response prediction system is thereby provided.
In addition, the drug response value may be a log-transformed value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug.
In addition, the second similarity matrix includes a first sub-matrix and a second sub-matrix, and calculating the cross product between the first similarity matrix and the second similarity matrix includes calculating a first cross product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second cross product between the first similarity matrix and the second sub-matrix of the second similarity matrix; a drug response prediction system is thereby provided.
In addition, the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and learning the convolutional neural network model includes using the first cross product and the second cross product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first and second convolutional neural network models as the output value; a drug response prediction system is thereby provided.
Also, the convolutional neural network model may be a two-dimensional (2D) model.
In addition, the convolutional neural network model may include two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
One embodiment discloses the Dr.CNN model to solve the problem of predicting drug response values. Dr.CNN simply divides the RBF kernel matrix associated with the cell lines into two sub-matrices and calculates the cross product of the column vectors of the Tanimoto similarity matrix with the column vectors of each RBF kernel sub-matrix. These cross products are then used as the input values of the two respective CNN models, which allows faster computation and improved prediction performance through ensemble learning. The RBF kernel matrix may be divided randomly, and it may be divided into two or more sub-matrices according to the number of cell lines. Dr.CNN is the first non-linear method that applies a 2D CNN to the cross product between the column vectors of the drug Tanimoto similarity matrix and the column vectors of the cell line RBF kernel matrix for drug response prediction.
Experimental results show that Dr.CNN exceeds the performance of existing models such as elastic net, RF, SVR, and the 1D CNN ensemble. Dr.CNN can be further improved by adjusting the CNN architecture according to the data structure. The main idea behind Dr.CNN is to integrate the two modalities using the cross product and apply a CNN to the resulting matrix. Dr.CNN is a very effective approach for drug response prediction and can play a significant role in the drug development process.
Figure 1 shows the overall workflow of an embodiment proposed for the prediction of drug response values.
Figure 2 shows a summary of the GDSC1 and GDSC2 datasets used in one embodiment.
Figure 3 shows scatter plots of the $\ln(\mathrm{IC}_{50})$ values predicted by one embodiment against the measured $\ln(\mathrm{IC}_{50})$ values of the GDSC2 and GDSC1 datasets.
Figure 4 shows the workflow of an ensemble 1D CNN model for the prediction of drug response values.
Figure 5 shows the architecture of the similarity-based CNN submodel of one embodiment.
Figure 6 is a block diagram showing an example of a system for implementing a drug response prediction method according to an embodiment.
The present invention will become clear with reference to the embodiments described below in detail in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms; these embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the present invention belongs, and the present invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms unless specifically stated otherwise. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more components or steps other than the recited components or steps. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms; such terms are used only to distinguish one component from another.
One embodiment of the present invention provides a similarity-based ensemble deep learning model that predicts drug response values using a two-dimensional CNN approach applied to the outer product of column vectors of the Tanimoto similarity matrix of drugs and the radial basis function (RBF) kernel matrix of cell lines. In one embodiment, this model is referred to as Dr.CNN.
Figure 1 shows the overall workflow of the proposed Dr.CNN for predicting drug response values. The Dr.CNN workflow predicts the final drug response value from the drug response values obtained from two subnetworks running inside an ensemble deep learning architecture, for example as their average. With the outer product of a drug similarity vector and a cell line similarity vector as the input, a 2D CNN is used to learn the features. Each CNN may consist of, for example, two convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and three fully connected (FC) layers. The CNN structure is described in detail with reference to Figure 5 below.
To obtain more refined prediction performance and faster computation through ensemble learning, one embodiment divides the RBF kernel matrix of the cell lines into two submatrices, constructs the outer products of column vectors of the Tanimoto similarity matrix and of each RBF kernel submatrix, and uses them as the CNN inputs. The RBF kernel matrix of the cell lines may be divided into two or more submatrices depending on the number of cell lines. An embodiment can consist of two steps. In the first step, the Tanimoto similarity matrix of the drugs and the RBF kernel matrix of the cell lines are computed, and then the outer product between a column vector of the Tanimoto similarity matrix and a column vector of each RBF kernel submatrix is computed. In the second step, a 2D CNN model is applied to extract features from the outer products and to predict the drug response values of the two subnetworks. The final prediction of the drug response value can be obtained from the drug response values of the two trained subnetworks, for example by taking their average.
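As an illustration only, the following Python sketch with NumPy mirrors the two steps described above for a single drug-cell line pair; the matrix names, the row-wise split of the kernel matrix, and the submodel interfaces are assumptions introduced for this example rather than details taken from the embodiment.

```python
import numpy as np

# Toy sizes and random matrices stand in for the real similarity matrices.
rng = np.random.default_rng(0)
m, n = 40, 60                      # hypothetical numbers of drugs and cell lines
S_d = rng.random((m, m))           # stands in for the drug-drug Tanimoto matrix
S_c = rng.random((n, n))           # stands in for the cell line RBF kernel matrix

# Step 1: split the cell line kernel matrix into two submatrices (a row-wise
# split is assumed here) and build one outer-product input per submodel.
n1 = n // 2
S_c1, S_c2 = S_c[:n1, :], S_c[n1:, :]

def pair_inputs(i, j):
    """Outer-product inputs of the two 2D CNN submodels for drug i and cell line j."""
    d_i = S_d[:, i]                       # similarities of drug i to all drugs
    x1 = np.outer(d_i, S_c1[:, j])        # m x n1 input of the first submodel
    x2 = np.outer(d_i, S_c2[:, j])        # m x n2 input of the second submodel
    return x1, x2

# Step 2: each trained submodel maps its input to a predicted ln(IC50); the
# final prediction is the average of the two submodel outputs.
def ensemble_predict(submodel_1, submodel_2, i, j):
    x1, x2 = pair_inputs(i, j)
    return 0.5 * (submodel_1(x1) + submodel_2(x2))
```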
Dr.CNN according to an embodiment is validated by comparing its root mean squared error (RMSE), concordance index (CI), and modified squared correlation coefficient ($r_m^2$) with those of other machine learning and deep learning models. The CI is a rank correlation between observed and predicted data.
Embodiments of the present invention can help users predict drug response values.
Experimental datasets
The inputs of Dr.CNN in one embodiment are cell line gene expression and the SMILES (Simplified Molecular-Input Line-Entry System) strings of anticancer compounds. GDSC, a publicly available database, is used for the drug responses observed for all pairs of cell lines and drugs. The SMILES of the GDSC drugs are obtained from PubChem. GDSC includes the GDSC1 (Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740-754 (2016)) and GDSC2 (Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019)) datasets. One embodiment uses the GDSC1 and GDSC2 datasets to evaluate drug response prediction. The GDSC1 dataset screened 681 cell lines across 234 compounds using Resazurin or Syto60 assays. The GDSC2 dataset screened 588 cell lines across 147 compounds with the CellTitreGlo assay. The GDSC1 dataset contains 131,894 drug response values, measured as $\ln(\mathrm{IC}_{50})$, observed over the pairs of 234 drugs and 681 cell lines. The GDSC2 dataset contains 72,393 drug response values, measured as $\ln(\mathrm{IC}_{50})$, observed over the pairs of 147 drugs and 588 cell lines. Table 1 shows these two datasets in the form used in the actual experiments. One embodiment uses the $\mathrm{IC}_{50}$ values converted to log space as $\ln(\mathrm{IC}_{50})$.
| Dataset | Drugs | Cell lines | Interactions | Density (%) |
|---------|-------|------------|--------------|-------------|
| GDSC1   | 234   | 681        | 131,894      | 82.77       |
| GDSC2   | 147   | 588        | 72,393       | 83.75       |
Figure 2 shows a summary of the GDSC1 and GDSC2 datasets. Panels (a) and (b) of Figure 2 show the distributions of the $\ln(\mathrm{IC}_{50})$ drug response values in the GDSC1 and GDSC2 datasets, respectively, and panels (c) and (d) show the distributions of the SMILES string lengths of the drugs in the GDSC1 and GDSC2 datasets, respectively. For the $\ln(\mathrm{IC}_{50})$ values of the GDSC1 dataset, the mean and standard deviation are -0.9032 and 1.1777, respectively. For the $\ln(\mathrm{IC}_{50})$ values of the GDSC2 dataset, the mean and standard deviation are -1.2472 and 1.2182, respectively. For the drugs in the GDSC1 dataset, the maximum SMILES length is 133 and the mean is 62. For the drugs in the GDSC2 dataset, the maximum SMILES length is 126 and the mean is 62.
Representation of inputs and outputs
In Dr.CNN according to one embodiment, drug-drug and cell line-cell line similarity matrices are used for the GDSC1 and GDSC2 datasets. These two matrices are denoted by $S_d \in \mathbb{R}^{m \times m}$ and $S_c \in \mathbb{R}^{n \times n}$, respectively. To build the ensemble model, one embodiment divides the matrix $S_c$ into two submatrices, that is, $S_c = \begin{bmatrix} S_{c_1} \\ S_{c_2} \end{bmatrix}$, where $S_{c_k} \in \mathbb{R}^{n_k \times n}$ and $n_1 + n_2 = n$. Accordingly, the input of Dr.CNN for each drug-cell line pair is the outer product $d_i \otimes c_j = d_i c_j^t$ of $d_i$ and $c_j$, where $d_i$ is the $i$-th column of the similarity matrix $S_d$, $c_j$ is the $j$-th column of the similarity submatrix $S_{c_k}$ for $k = 1, 2$, and $\otimes$ denotes the outer product. The superscript $t$ denotes the transpose of a vector. The outer product $d_i \otimes c_j$ is defined as in Equation (1) below.

$$d_i \otimes c_j = d_i c_j^t = \begin{bmatrix} d_{1i}c_{1j} & d_{1i}c_{2j} & \cdots & d_{1i}c_{n_k j} \\ d_{2i}c_{1j} & d_{2i}c_{2j} & \cdots & d_{2i}c_{n_k j} \\ \vdots & \vdots & \ddots & \vdots \\ d_{mi}c_{1j} & d_{mi}c_{2j} & \cdots & d_{mi}c_{n_k j} \end{bmatrix} \qquad (1)$$
This outer product yields two sets of information. Since the entry of $d_i$ corresponding to the $i$-th drug and the entry of $c_j$ corresponding to the $j$-th cell line are self-similarities equal to 1, the outer product yields the bimodal interactions as well as the raw unimodal representations of the individual modalities. Therefore, $d_i \otimes c_j$ contains every combination of the information in $d_i$ and $c_j$. This indicates that, in predicting the drug response of the $i$-th drug and the $j$-th cell line, $d_i \otimes c_j$ can be a more effective input than a simple concatenation of $d_i$ and $c_j$.
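The following short NumPy check, written for illustration with made-up similarity values, shows how the outer product of Equation (1) retains both unimodal vectors when the self-similarity entries equal 1.

```python
import numpy as np

# Hypothetical similarity vectors for drug i and cell line j; the entries at
# positions i and j are the self-similarities, which equal 1.
i, j = 1, 2
d_i = np.array([0.3, 1.0, 0.7, 0.2])        # Tanimoto similarities of drug i
c_j = np.array([0.5, 0.9, 1.0, 0.4, 0.6])   # RBF similarities of cell line j

P = np.outer(d_i, c_j)                      # Equation (1): P[k, l] = d_i[k] * c_j[l]

# Because d_i[i] == 1 and c_j[j] == 1, the outer product keeps the raw unimodal
# vectors (as one row and one column) in addition to all bimodal interaction terms.
assert np.allclose(P[i, :], c_j)            # row i reproduces the cell line vector
assert np.allclose(P[:, j], d_i)            # column j reproduces the drug vector
```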
The drug-drug similarity is computed with the Tanimoto coefficient T, the most popular similarity measure for comparing chemical structures represented as fingerprints. One embodiment uses the topological fingerprint of RDKit. The Tanimoto similarity measure takes values from 0 to 1 and can be interpreted as the percentage of features shared by two drugs. The cell line-cell line similarity, on the other hand, can be computed from the gene expression vectors using the RBF kernel described by Equation (2). The RBF kernel is a popular kernel function used in various kernel learning algorithms.

$$K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \qquad (2)$$

Here, $x_i$ is the gene expression vector of the $i$-th cell line, and $\sigma$ is the bandwidth parameter, whose estimate is obtained as in He et al. (He, T. et al. SimBoost: A read-across approach for predicting drug-target binding affinities using gradient boosting machines. J. Cheminf. 9, 24. https://doi.org/10.1186/s13321-017-0209-z (2017)). The RBF kernel is a measure of the similarity between vectors. Its value decreases with the distance $\lVert x_i - x_j \rVert$ and ranges from 0 (in the limit) to 1 (when $x_i = x_j$). That is, when two vectors are close to each other, $\lVert x_i - x_j \rVert$ is small and the kernel value increases toward 1. Therefore, nearby vectors have larger RBF kernel values than distant vectors.
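A minimal sketch of how the two similarity matrices could be computed with RDKit and NumPy is given below; the SMILES strings, the expression matrix, and the bandwidth value are placeholders, and the embodiment's own fingerprint settings and bandwidth estimation may differ.

```python
import numpy as np
from rdkit import Chem, DataStructs

def tanimoto_matrix(smiles_list):
    """Drug-drug similarity matrix from RDKit topological fingerprints."""
    fps = [Chem.RDKFingerprint(Chem.MolFromSmiles(s)) for s in smiles_list]
    m = len(fps)
    S_d = np.zeros((m, m))
    for a in range(m):
        for b in range(m):
            S_d[a, b] = DataStructs.TanimotoSimilarity(fps[a], fps[b])
    return S_d

def rbf_kernel_matrix(X, sigma):
    """Cell line-cell line similarity matrix from the rows of an expression matrix X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# Placeholder inputs: three example SMILES strings and random expression vectors.
S_d = tanimoto_matrix(["CCO", "c1ccccc1", "CC(=O)O"])
S_c = rbf_kernel_matrix(np.random.rand(5, 172), sigma=1.0)
```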
The output value of each drug-cell line pair corresponds to its $\ln(\mathrm{IC}_{50})$ value.
Performance evaluation metrics
Since the elastic net, random forest (RF), and support vector regression (SVR) are common regression methods proposed for drug response prediction, one embodiment considers them as baseline methods. One embodiment compares the performance of the elastic net, RF, SVR, a 1D CNN ensemble, and Dr.CNN on the evaluation datasets described above. The 1D CNN ensemble used here is similar to that used in Park et al. (Park, H. et al. Detection of chromosome structural variation by targeted next-generation sequencing and deep learning application. Sci. Rep. 9, 3644 (2019)). The inputs of the elastic net, RF, SVR, and the 1D CNN ensemble are the concatenated vectors of the 2048-bit extended-connectivity fingerprints (ECFPs) and gene expression vectors composed of 172 values selected from the 19,144 values. One embodiment evaluated the predictive performance on the GDSC2 dataset with a 5-fold cross-validation experiment. This technique randomly partitions the dataset into 5 folds of approximately equal size. One fold is treated as the validation set, and training is applied to the remaining 4 folds. The procedure is repeated 5 times, and in each repetition a different group of instances is treated as the validation set. One embodiment also evaluated the predictive performance on the GDSC1 dataset after training the five models using the GDSC2 dataset as the training dataset. One embodiment used four metrics to evaluate the performance of the regression models: RMSE, CI, the Pearson correlation coefficient $r$, and $r_m^2$.
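For reference, the 5-fold cross-validation procedure can be sketched with scikit-learn as follows; `build_model`, `X`, `y`, and `metric` are placeholders for whichever model, feature representation, and metric are being evaluated.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(build_model, X, y, metric, n_splits=5, seed=0):
    """Return the metric computed on each held-out fold of a random 5-fold split."""
    scores = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx])   # train on the remaining four folds
        y_hat = model.predict(X[val_idx])       # predict the held-out validation fold
        scores.append(metric(y[val_idx], y_hat))
    return np.array(scores)                     # one score per fold
```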
Because a regression technique is used, one embodiment uses the RMSE, a metric commonly used for the error of continuous predictions, defined in Equation (3), where $y_i$ is the actual output value, $\hat{y}_i$ is the corresponding prediction, and $n$ is the number of samples.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (3)$$
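Equation (3) corresponds to the following small NumPy helper, shown only as a sketch with assumed array inputs.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean squared error between measured values y and predictions y_hat."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))
```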
As proposed by Pahikkala et al., the CI can be used as an evaluation metric for prediction accuracy (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)). The intuition behind the CI is as follows. The CI over a set of paired data is the probability that the predictions for two randomly drawn drug-cell line pairs with different label values are in the correct order, that is, that the prediction $p_x$ for the larger affinity value $\delta_x$ is larger than the prediction $p_z$ for the smaller affinity value $\delta_z$:

$$\mathrm{CI} = \frac{1}{Z}\sum_{\delta_x > \delta_z} h(p_x - p_z) \qquad (4)$$

Here, $Z$ is a normalization constant and $h(\cdot)$ is the step function (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)):

$$h(u) = \begin{cases} 1, & u > 0 \\ 0.5, & u = 0 \\ 0, & u < 0 \end{cases} \qquad (5)$$

The CI ranges from 0.5 to 1.0, where 0.5 corresponds to a random prediction and 1.0 corresponds to perfect prediction accuracy.
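A straightforward (quadratic-time) sketch of the CI of Equations (4) and (5) is shown below; it assumes plain NumPy arrays of measured and predicted values.

```python
import numpy as np

def concordance_index(y, y_hat):
    """CI over all pairs with different measured values, per Equations (4) and (5)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    num, Z = 0.0, 0
    for a in range(len(y)):
        for b in range(len(y)):
            if y[a] > y[b]:                    # ordered pair with a larger measured value
                Z += 1                         # normalization constant
                diff = y_hat[a] - y_hat[b]
                num += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return num / Z if Z > 0 else 0.5
```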
To strengthen the assessment of a model's predictive ability, Roy and Roy introduced the modified squared correlation coefficient $r_m^2$ (Roy, P. & Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27, 302-313 (2008)):

$$r_m^2 = r^2\left(1 - \sqrt{\left| r^2 - r_0^2 \right|}\right) \qquad (6)$$

Here, $r^2$ and $r_0^2$ are the squared correlation coefficients with and without the intercept, respectively. A model with $r_m^2 > 0.5$ on the test dataset is judged to be an acceptable model.
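Under a common reading of Equation (6), in which $r_0^2$ is the squared correlation of the regression through the origin, a sketch of the computation is given below; the exact convention used in the embodiment may differ.

```python
import numpy as np

def r_m_squared(y, y_hat):
    """Modified squared correlation coefficient of Equation (6)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2          # squared correlation with intercept
    k = np.sum(y * y_hat) / np.sum(y_hat ** 2)     # slope of the fit through the origin
    r0_2 = 1.0 - np.sum((y - k * y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return r2 * (1.0 - np.sqrt(abs(r2 - r0_2)))
```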
Training and evaluation
The following describes the prediction of the drug response values, $\ln(\mathrm{IC}_{50})$, by Dr.CNN according to one embodiment on the GDSC1 and GDSC2 datasets and its predictive performance. One embodiment evaluated the performance of its model on the two benchmark datasets. First, since the GDSC2 dataset is more recent than the GDSC1 dataset, the predictive performance on the GDSC2 dataset was evaluated through 5-fold cross-validation. Table 2 shows the performance results of the five models under nested 5-fold cross-validation on the GDSC2 dataset. The best result for each metric is shown in bold, and standard errors are given in parentheses.
| Model | RMSE | CI | $r$ | $r_m^2$ |
|-------|------|----|-----|---------|
| RF | 0.5386 (0.0026) | 0.8433 (0.0002) | 0.8970 (0.0011) | 0.7997 (0.0018) |
| SVR | 0.7305 (0.0398) | 0.7923 (0.0021) | 0.8307 (0.0033) | 0.6455 (0.0227) |
| Elastic Net | 0.7522 (0.0377) | 0.7843 (0.0094) | 0.8084 (0.0176) | 0.5416 (0.0410) |
| 1D CNN | 0.5390 (0.0022) | 0.8448 (0.0004) | 0.8979 (0.0009) | 0.7909 (0.0060) |
| Dr.CNN | **0.5085 (0.0042)** | **0.8536 (0.0007)** | **0.9098 (0.0011)** | **0.8162 (0.0043)** |
As can be seen in Table 2, Dr.CNN of one embodiment shows the best performance on all metrics for the GDSC2 dataset. To statistically assess the clear improvement of the model of one embodiment, a one-sided t-test was performed, comparing Dr.CNN, which had the best results, with the other models. The null hypotheses associated with Table 2 therefore state that Dr.CNN performs no better than the competing models with respect to RMSE, CI, $r$, and $r_m^2$, respectively. All relevant P values of these hypothesis tests are computed to be smaller than 0.01. Therefore, Dr.CNN performs significantly better than the other models on all four metrics. In particular, Dr.CNN is the most acceptable model, since it yields a markedly larger $r_m^2$ than the other models in the 5-fold cross-validation on the GDSC2 dataset.
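One way to carry out such a one-sided test on fold-wise metric values is sketched below with SciPy (the `alternative` argument requires SciPy 1.6 or later); the pairing of scores by fold and the 0.01 threshold follow the description above, while the function and variable names are placeholders.

```python
from scipy import stats

def dr_cnn_significantly_better(scores_dr_cnn, scores_other, higher_is_better=True):
    """One-sided paired t-test on fold-wise scores of Dr.CNN versus another model."""
    alternative = "greater" if higher_is_better else "less"   # e.g. RMSE: lower is better
    t_stat, p_value = stats.ttest_rel(scores_dr_cnn, scores_other,
                                      alternative=alternative)
    return p_value < 0.01, p_value
```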
Next, after the five models were trained using the GDSC2 dataset as the training dataset, their predictive performance on the GDSC1 dataset was evaluated. Table 3 shows the performance results of the five models on the GDSC1 dataset. The best result for each metric is shown in bold, and standard errors are given in parentheses. Because only GDSC1 is used in computing the evaluation metrics of each model, a statistically significant model cannot be identified from a single value of each metric. Therefore, a bootstrap method was used to estimate the mean and standard error of each evaluation metric (Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman Hall, 1993)). This makes it possible to identify a statistically significant model based on the estimated mean and standard error of each metric. The bootstrap method is known to be useful for estimating the sampling distribution of an evaluation metric without relying on normal theory. It involves repeatedly sampling the GDSC1 dataset with replacement. When applying bootstrap sampling to the GDSC1 dataset, the bootstrap sample size is set to 131,894 and the sampling process is repeated 20 times.
| Model | RMSE | CI | $r$ | $r_m^2$ |
|-------|------|----|-----|---------|
| RF | 1.1992 (0.0006) | 0.6180 (0.0002) | 0.4210 (0.0008) | 0.1420 (0.0007) |
| SVR | 1.0948 (0.0004) | 0.6176 (0.0002) | 0.4606 (0.0005) | 0.1756 (0.0005) |
| Elastic Net | 1.0890 (0.0004) | 0.6398 (0.0002) | 0.4829 (0.0006) | 0.2034 (0.0006) |
| 1D CNN | 1.0652 (0.0006) | 0.6321 (0.0001) | 0.5080 (0.0005) | 0.2536 (0.0003) |
| Dr.CNN | **1.0524 (0.0005)** | **0.6531 (0.0002)** | **0.5597 (0.0005)** | **0.3024 (0.0007)** |
As shown in Table 3, Dr.CNN shows the best performance on all four metrics over the 20 bootstrap samples of the GDSC1 dataset. As above, a one-sided t-test was performed to statistically assess the significant improvement of Dr.CNN, comparing the best-performing Dr.CNN model with the other models. The null hypotheses associated with Table 3 therefore state that Dr.CNN performs no better than the competing models with respect to RMSE, CI, $r$, and $r_m^2$, respectively. All relevant P values of these hypothesis tests are computed to be smaller than 0.01. Therefore, Dr.CNN performs significantly better than the other models on all four metrics. Although Dr.CNN did not achieve a sufficiently large $r_m^2$, it is the most acceptable model, because it shows larger $r_m^2$ values than the other models over the 20 bootstrap samples of the GDSC1 dataset.
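The bootstrap estimate of a metric's mean and standard error can be sketched as follows; the resample size and the 20 repetitions follow the values stated above, while the metric function and the prediction arrays are placeholders.

```python
import numpy as np

def bootstrap_metric(y, y_hat, metric, n_boot=20, sample_size=131_894, seed=0):
    """Bootstrap estimate of a metric's mean and standard error (resampling with replacement)."""
    rng = np.random.default_rng(seed)
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=sample_size)   # resample pairs with replacement
        values.append(metric(y[idx], y_hat[idx]))
    values = np.array(values)
    return values.mean(), values.std(ddof=1)              # mean and standard error
```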
To demonstrate the predictive ability visually, the predicted values are plotted against the measured values for the datasets. Figure 3 shows scatter plots of the $\ln(\mathrm{IC}_{50})$ values predicted by Dr.CNN against the measured $\ln(\mathrm{IC}_{50})$ values of the GDSC2 and GDSC1 datasets. For the GDSC2 dataset, the predictions are obtained by alternately using four folds of the GDSC2 dataset as the training dataset and the remaining fold as the test dataset. For the GDSC1 dataset, the predictions are obtained by using the GDSC2 dataset as the training dataset and the GDSC1 dataset as the test dataset. An ideal regression model would be expected to produce predicted values $\hat{y}$ equal to the measured values $y$, that is, $\hat{y} = y$. In particular, for the GDSC2 dataset, the points show a high density around the line $\hat{y} = y$.
Baseline models
For the baseline models, three conventional machine learning models and one deep learning based model can be considered. The conventional machine learning models are the elastic net, RF, and SVR. The elastic net first emerged from the criticism of the Lasso (least absolute shrinkage and selection operator) that its variable selection depends too strongly on the data and can therefore be unstable; the solution is to combine the ridge regression and Lasso penalties so as to be optimal in both respects. The Lasso is a regression method that performs both variable selection and regularization to improve the prediction accuracy and interpretability of a statistical model. RF assigns input data sampled with replacement to many decision trees for training, collects the decision results for a drug-cell line pair, and determines the drug response by averaging them. As each tree is grown, the split at each node is determined by considering only a subset of all features. This algorithm is simple and fast and does not readily cause overfitting; in general, it performs better than a single well-fitted regression model. SVR offers the flexibility to define an acceptable error in the model and can find a high-dimensional hyperplane that fits the data. The objective of SVR is to minimize not the squared error but the coefficients, specifically the L2-norm of the coefficient vector; the error term is instead handled in a constraint that requires the absolute error to be at most a specified margin, called the maximum error $\varepsilon$. One embodiment can tune $\varepsilon$ to obtain the required accuracy of the model. SVR has proven to be an effective tool for real-valued function estimation. One of its advantages is that its computational complexity does not depend on the dimensionality of the input space; it also has excellent generalization capability with high prediction accuracy. The input of the elastic net, RF, and SVR is the concatenated vector of the 2048-bit ECFPs and a gene expression vector composed of 172 values selected from the 19,144 values. The deep learning model is an ensemble 1D CNN based prediction model that uses the 2048-bit ECFP vector and the gene expression vector of 172 values selected from the 19,144 values as the input of each individual 1D CNN.
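For orientation, the three classical baselines can be instantiated with scikit-learn roughly as follows; the hyperparameter values and the placeholder data are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# X: concatenation of a 2048-bit ECFP vector and a 172-value gene expression
# vector per drug-cell line pair; y: ln(IC50) responses (placeholder data here).
X = np.random.rand(100, 2048 + 172)
y = np.random.rand(100)

baselines = {
    "Elastic Net": ElasticNet(alpha=0.1, l1_ratio=0.5),                 # ridge + lasso penalties
    "RF": RandomForestRegressor(n_estimators=500, max_features="sqrt"),
    "SVR": SVR(kernel="rbf", epsilon=0.1),                              # epsilon-insensitive loss
}
for name, model in baselines.items():
    model.fit(X, y)                                                     # fit each baseline
```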
Figure 4 shows the workflow of the ensemble 1D CNN model for the prediction of drug response values. In the 1D CNN, the kernels and the pooling move along one dimension. The stride is set to 1 for all convolutional layers and to 2 for all max-pooling layers.
Dr.CNN is a 2D CNN based prediction model that uses the outer product of a drug similarity vector and a cell line similarity vector as its input. For the input of the Dr.CNN model, the $m \times m$ drug-drug similarity matrix $S_d$ based on the Tanimoto coefficient and the $n \times n$ cell line-cell line similarity matrix $S_c$ based on the gene expression scores via the RBF kernel function are first computed. The 2048-bit ECFP fingerprints of RDKit are used to compute the Tanimoto coefficients. Here, $m$ and $n$ denote the numbers of drugs and cell lines in the training dataset, respectively. Then, for every drug-cell line pair, the outer product $d_i \otimes c_j$ of the $m \times 1$ drug similarity vector $d_i$ and the $n \times 1$ cell line similarity vector $c_j$ is computed, where $d_i$ and $c_j$ are the $i$-th column of the similarity matrix $S_d$ and the $j$-th column of the similarity matrix $S_c$, respectively. That is, $d_i$ consists of the Tanimoto similarities between the $i$-th drug and the other drugs including itself, and $c_j$ consists of the gene expression similarities between the $j$-th cell line and the other cell lines including itself.
The parameters of Dr.CNN can be obtained by using the outer product as the input and the drug response as the output.
Figure 5 shows the architecture of the similarity-based CNN submodel of Dr.CNN. As shown in Figure 5, the CNN submodel may consist of two 2D convolutional layers, each followed by max-pooling, one flatten layer, and FC(128), FC(64), and FC(1) layers, where the number in parentheses denotes the number of nodes. The FC(128), FC(64), and FC(1) layers may use the rectified linear unit (ReLU), ReLU, and a linear function, respectively. To suppress overfitting, one embodiment may include dropout layers with a rate of 0.1 between the flatten layer and the FC(128) layer, between the FC(128) layer and the FC(64) layer, and between the FC(64) layer and the FC(1) layer. The numbers of filters of the two convolutional layers are 18 and 24, respectively, and one embodiment may use filters with kernel sizes of 5 x 5 and 3 x 3 for the convolutional layers, respectively. The max-pooling layers have size 2 and stride 2. For the training algorithm, one embodiment may set the batch size and the number of epochs to 32 and 20, respectively, and may use the Adam optimizer with a learning rate of 0.001.
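A sketch of one CNN submodel with the architecture described above might look as follows in Keras; the input shape, the ReLU activations of the convolutional layers, and the mean squared error loss are assumptions not specified above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_submodel(input_shape=(64, 64, 1)):   # placeholder shape of the outer-product input
    """2D CNN submodel: two conv/max-pool stages, flatten, then FC(128)-FC(64)-FC(1)."""
    return keras.Sequential([
        layers.Conv2D(18, (5, 5), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(24, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dropout(0.1),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(1, activation="linear"),
    ])

model = build_submodel()
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")
# model.fit(x_train, y_train, batch_size=32, epochs=20)   # training settings described above
```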
Figure 6 is a block diagram showing an example of a system for implementing the drug response prediction method according to an embodiment, and conceptually shows the parts related to the present embodiment. All of the components may be provided in a single device and processed therein, but the system is not limited thereto and may also include components connected through a network and performed in separate devices.
The external server 20 may be connected to the prediction system 10 through a network and may provide information such as chemical property information of drugs, cell line information, and drug response information. For example, the chemical property information of a drug may include SMILES (Simplified Molecular-Input Line-Entry System) information, and the cell line information may include cell line gene expression information. Specifically, GDSC provides information on the drug responses observed for all pairs of cell lines and drugs, which may be received from the external server 20. For example, the external server 20 may be a database for the drug response prediction processing of the prediction system 10 or a server that provides such a database.
The prediction system 10 may include a control unit 11, a communication unit 12, an input/output interface unit 13, and a memory unit 14.
The control unit 11 controls the entire prediction system 10 and may include, for example, a processing unit such as a CPU or a GPU. The control unit 11 may train the models described above using the information stored in the memory unit 14 and may also compute predicted values for new inputs using the trained models. Specifically, the control unit 11 may control the model that predicts the drug response. To this end, the control unit 11 may include an internal memory for storing a control program such as an operating system (OS), programs defining various processing procedures, and data, and may perform the information processing for executing various processes based on these programs.
The communication unit 12 may include an interface that can be connected to a communication device such as a router connected to a communication line, and may control the communication between the prediction system 10 and the external server 20.
The input/output interface unit 13 may be an interface connected to the input unit 15 and/or the display unit 16, and the user may communicate with the prediction system 10 through the input/output interface unit 13. For example, the display unit 16 may be a display means for displaying a display screen of an application or the like (for example, a display composed of liquid crystal or organic EL, a monitor, or a touch panel). The input unit 15 may be, for example, a key input unit, a touch panel, a control pad (for example, a touch pad or a game pad), a mouse, a keyboard, or a microphone.
The memory unit 14 may be a device that stores various databases, tables, and the like. For example, the memory unit may provide information such as chemical property information of drugs, cell line information, and drug response information; the chemical property information of a drug may include SMILES information, and the cell line information may include cell line gene expression information. The memory unit may also store the processes for the inputs and outputs of the prediction system 10 and the results of such processing.
The embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the computer-readable recording medium may be those specially designed and configured for the present invention or those known and available to those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine language code produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.
In addition, the embodiments according to the present invention described above may be a set of program instructions that can be executed through various computer components and a user application itself for executing them. Specifically, it may be a program itself that can be downloaded through a server or a storage medium and installed on a client computer.
Although the present invention has been described above with reference to specific details such as specific components, limited embodiments, and drawings, these are provided only to assist a more general understanding of the present invention; the present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains may make various modifications and variations from these descriptions.
Therefore, the spirit of the present invention should not be limited to the embodiments described above, and not only the claims set forth below but also all modifications equivalent to these claims fall within the scope of the spirit of the present invention.
In addition, the embodiments of the present invention are not mutually exclusive, and a configuration of one embodiment may be applied to another embodiment. The embodiments of the present invention are provided as examples of some of the various forms that can be derived from various combinations of the components, and the present invention is not limited to the specific embodiments themselves.
Description of reference numerals
10: prediction system 20: external server
11: control unit 12: communication unit
13: input/output interface unit 14: memory unit
15: input unit 16: display unit

Claims (16)

  1. A method of predicting a drug response of a drug and a cell line using a convolutional neural network model, the method comprising:
    preparing a first drug-drug similarity matrix;
    preparing a second cell line-cell line similarity matrix;
    calculating an outer product between the first similarity matrix and the second similarity matrix, by calculating the outer product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix;
    training the convolutional neural network model using the outer product as an input value and using the drug response value of the i-th drug and the j-th cell line as an output value; and
    predicting and outputting a drug response value of a new drug and cell line using the trained convolutional neural network model.
  2. The method of claim 1, wherein
    the first similarity matrix is an m x m drug-drug similarity matrix based on Tanimoto coefficients,
    the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
    m and n denote the numbers of drugs and cell lines in a training dataset, respectively.
  3. The method of claim 2, wherein
    the i-th column vector of the first similarity matrix is the Tanimoto similarity between the i-th drug and the other drugs including the i-th drug, and
    the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines including the j-th cell line.
  4. The method of claim 1, wherein
    the drug response value is a $\ln(\mathrm{IC}_{50})$ value quantified through the half-maximal inhibitory concentration (IC50) value of the cell line for the drug.
  5. The method of claim 4, wherein
    the second similarity matrix includes a first submatrix and a second submatrix, and
    the calculating of the outer product between the first similarity matrix and the second similarity matrix comprises
    calculating a first outer product between the first similarity matrix and the first submatrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second submatrix of the second similarity matrix.
  6. The method of claim 5, wherein
    the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and
    the training of the convolutional neural network model comprises
    using the first outer product and the second outer product as input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using an average of output values of the first convolutional neural network model and the second convolutional neural network model as an output value.
  7. The method of claim 1, wherein
    the convolutional neural network model is a two-dimensional (2D) model.
  8. The method of claim 1, wherein
    the convolutional neural network model includes two 2D convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
  9. A system for predicting a drug response between a drug and a cell line by using a convolutional neural network model, the system comprising:
    a controller for controlling the convolutional neural network model;
    a communication unit for communicating with an external server;
    a memory unit;
    a display unit; and
    an input unit for receiving a user's input,
    wherein the memory unit stores a first similarity matrix between drugs and a second similarity matrix between cell lines,
    the controller computes the outer product between the i-th column vector (i = 1, 2, ..., m; m is an integer) of the first similarity matrix and the j-th column vector (j = 1, 2, ..., n; n is an integer) of the second similarity matrix, and trains the convolutional neural network model by using the outer product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value, and
    the controller predicts the drug response value of a new drug and a new cell line by using the trained convolutional neural network model.
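    For the prediction step of the system claim, one hedged way to score a pair the model has not seen is to compute the new drug's Tanimoto similarities to the m training drugs and the new cell line's RBF similarities to the n training cell lines, take their outer product, and pass it through the trained network; the names train_fps, train_expr, model, and gamma below are placeholders, and the claim itself does not prescribe this featurisation of new inputs.

    import numpy as np

    def new_drug_column(new_fp, train_fps):
        # Tanimoto similarity of one new drug fingerprint to the m training drugs.
        new_fp = new_fp.astype(float)
        train_fps = train_fps.astype(float)
        inter = train_fps @ new_fp
        union = train_fps.sum(axis=1) + new_fp.sum() - inter
        return inter / np.maximum(union, 1e-12)

    def new_cell_column(new_expr, train_expr, gamma=1e-3):
        # RBF similarity of one new expression profile to the n training cell lines.
        d2 = ((train_expr - new_expr) ** 2).sum(axis=1)
        return np.exp(-gamma * d2)

    def predict_new_pair(model, new_fp, new_expr, train_fps, train_expr):
        d = new_drug_column(new_fp, train_fps)        # length m
        c = new_cell_column(new_expr, train_expr)     # length n
        x = np.outer(d, c)[None, ..., None]           # batch and channel axes for the 2D CNN
        return float(np.asarray(model(x)).ravel()[0])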
  10. The drug response prediction system of claim 9, wherein
    the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient,
    the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
    m and n denote the numbers of drugs and cell lines in the training dataset, respectively.
  11. The drug response prediction system of claim 10, wherein
    the i-th column vector of the first similarity matrix is the Tanimoto similarity between the i-th drug and the other drugs, including the i-th drug itself, and
    the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines, including the j-th cell line itself.
  12. The drug response prediction system of claim 9, wherein
    the drug response value is a value quantified through the IC50 (half-maximal inhibitory concentration) value of the cell line for the drug, as expressed by the formula shown in image PCTKR2022013647-appb-img-000098.
  13. The drug response prediction system of claim 12, wherein
    the second similarity matrix comprises a first sub-matrix and a second sub-matrix, and
    calculating the outer product between the first similarity matrix and the second similarity matrix comprises:
    calculating a first outer product between the first similarity matrix and the first sub-matrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second sub-matrix of the second similarity matrix.
  14. The drug response prediction system of claim 13, wherein
    the convolutional neural network model comprises a first convolutional neural network model and a second convolutional neural network model, and
    training the convolutional neural network model comprises:
    using the first outer product and the second outer product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively, and using the average of the output values of the first convolutional neural network model and the second convolutional neural network model as the output value.
  15. The drug response prediction system of claim 9, wherein the convolutional neural network model is a two-dimensional (2D) model.
  16. The drug response prediction system of claim 9, wherein the convolutional neural network model comprises two two-dimensional convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
PCT/KR2022/013647 2021-09-10 2022-09-13 System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix WO2023038501A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2021-0121233 2021-09-10
KR1020210121233A KR102653969B1 (en) 2021-09-10 2021-09-10 A system of predicting drug response with convolutional neural networks based on similarity matrices of drugs and cell lines

Publications (1)

Publication Number Publication Date
WO2023038501A1 true WO2023038501A1 (en) 2023-03-16

Family

ID=85506828

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/013647 WO2023038501A1 (en) 2021-09-10 2022-09-13 System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix

Country Status (2)

Country Link
KR (1) KR102653969B1 (en)
WO (1) WO2023038501A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101953762B1 (en) * 2017-09-25 2019-03-04 (주)신테카바이오 Drug indication and response prediction systems and method using AI deep learning based on convergence of different category data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Ahmadi Moughari, Fatemeh; Eslahchi, Changiz: "A computational method for drug sensitivity prediction of cancer cell lines based on various molecular information", PLoS ONE, vol. 16, no. 4, 29 April 2021 (2021-04-29), e0250620, XP093045892, DOI: 10.1371/journal.pone.0250620 *
Menden, Michael P. et al.: "Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties", PLoS ONE, vol. 8, no. 4, 2013, e61318, XP055338891, DOI: 10.1371/journal.pone.0061318 *
Liu, Pengfei; Li, Hongjian; Li, Shuai; Leung, Kwong-Sak: "Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network", BMC Bioinformatics, vol. 20, no. 1, 29 July 2019 (2019-07-29), pages 1-14, XP021272446, DOI: 10.1186/s12859-019-2910-6 *
Wang, Yongcui; Fang, Jianwen; Chen, Shilong: "Inferences of drug responses in cancer cells from cancer genomic features and compound chemical and therapeutic properties", Scientific Reports, vol. 6, no. 1, 2016, 32679, XP093045889, DOI: 10.1038/srep32679 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117275608A (en) * 2023-09-08 2023-12-22 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs
CN117275608B (en) * 2023-09-08 2024-04-26 浙江大学 Cooperative attention-based method and device for cooperative prediction of interpretable anticancer drugs

Also Published As

Publication number Publication date
KR102653969B1 (en) 2024-04-03
KR20230038016A (en) 2023-03-17

Similar Documents

Publication Publication Date Title
Wang et al. Confounder adjustment in multiple hypothesis testing
Rosenbaum et al. Inferring multi-target QSAR models with taxonomy-based multi-task learning
Wu et al. Network-based structural learning nonnegative matrix factorization algorithm for clustering of scRNA-seq data
WO2023038501A1 (en) System for predicting drug responses by using convolutional neural network based on drug and cell line similarity matrix
Montserrat et al. Lai-net: Local-ancestry inference with neural networks
Huang et al. Clustering of cancer attributed networks by dynamically and jointly factorizing multi-layer graphs
US8972406B2 (en) Generating epigenetic cohorts through clustering of epigenetic surprisal data based on parameters
EP3338211A1 (en) Multi-level architecture of pattern recognition in biological data
Ma et al. Layer-specific modules detection in cancer multi-layer networks
CN111951886A (en) Drug relocation prediction method based on Bayesian inductive matrix completion
Weighill et al. Gene regulatory network inference as relaxed graph matching
Feng et al. PBPI: a high performance implementation of Bayesian phylogenetic inference
Yi et al. Learning representation of molecules in association network for predicting intermolecular associations
Yu et al. NPI-RGCNAE: fast predicting ncRNA-protein interactions using the relational graph convolutional network auto-encoder
Du et al. Deep multi-label joint learning for RNA and DNA-binding proteins prediction
US20170329913A1 (en) Method and system for determining an association of biological feature with medical condition
Wang et al. Multi-view random-walk graph regularization low-rank representation for cancer clustering and differentially expressed gene selection
Yu et al. Novel graphical representation of genome sequence and its applications in similarity analysis
Newaz et al. Inference of a dynamic aging-related biological subnetwork via network propagation
Li et al. Nonnegative matrix factorization for dynamic modules in cancer attribute temporal networks
Tian et al. GTAMP-DTA: Graph transformer combined with attention mechanism for drug-target binding affinity prediction
Fan et al. The EM algorithm and the rise of computational biology
Spirko-Burns et al. Supervised dimension reduction for large-scale “omics” data with censored survival outcomes under possible non-proportional hazards
Ma et al. Fusing heterogeneous genomic data to discover cancer progression related dynamic modules
Testa et al. A Non-Negative Matrix Tri-Factorization Based Method for Predicting Antitumor Drug Sensitivity

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22867769

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE