KR102653969B1

KR102653969B1 - A system of predicting drug response with convolutional neural networks based on similarity matrices of drugs and cell lines

Info

Publication number: KR102653969B1
Application number: KR1020210121233A
Authority: KR
Inventors: 심주용; 황창하; 손인석
Original assignee: 주식회사 아론티어
Priority date: 2021-09-10
Filing date: 2021-09-10
Publication date: 2024-04-03
Also published as: WO2023038501A1; KR20230038016A

Abstract

Human cancer cell lines are frequently used in research to study the biology of cancer and to test the efficacy of cancer treatments. Accurately predicting drug response from pharmacogenomic data is an essential problem in precision oncology. Based on the observation that similar cell lines respond similarly to similar drugs, an ensemble deep learning model is provided that applies a similarity-based convolutional neural network to the outer product between column vectors of a drug-drug similarity matrix and a cell line-cell line similarity matrix. By linking a patient's genetic characteristics to drug sensitivity, the model is shown to be useful for precision medicine.

Description

A system for predicting drug response using a convolutional neural network based on the similarity matrix of drugs and cell lines {A SYSTEM OF PREDICTING DRUG RESPONSE WITH CONVOLUTIONAL NEURAL NETWORKS BASED ON SIMILARITY MATRICES OF DRUGS AND CELL LINES}

Precision medicine aims to select cancer treatments precisely on the basis of each patient's genetic information. One of the most important problems in precision medicine is predicting the anticancer drug response of each patient. Because of tumor heterogeneity, even patients with the same type of cancer may respond differently to similar drugs. It is therefore very important to provide prediction methods that reveal the relationship between genomic information and drug response, which can benefit precision medicine.

Genomics of Drug Sensitivity in Cancer (GDSC) and the Cancer Cell Line Encyclopedia (CCLE) are two projects that provide molecular profiles and drug response values for hundreds of cancer cell lines treated with multiple anticancer drugs. Such large-scale datasets make it possible to develop methods for predicting patient-specific drug response. Methods for predicting drug response generally fall into two categories. The first is the classification approach, which predicts sensitive drug-cell line pairs. The second is the regression approach, which predicts a reference value that measures the response of a cell line to a drug. One embodiment of the present invention discloses a regression approach that predicts the drug response quantified by the IC₅₀ (half-maximal inhibitory concentration) value of a cell line for a drug. Various regression approaches have been proposed that predict drug response from gene expression profiles or other molecular information of cell lines. Some prediction methods have improved drug response prediction by integrating drug information, such as the chemical substructure of drugs, with cell line information. In addition, numerous machine learning methods have been applied to the drug response prediction problem, including regression methods such as lasso (least absolute shrinkage and selection operator), elastic net, random forest, kernel-based methods, neural networks, and deep learning. Ali and Aittokallio provide a comprehensive recent review (see [Ali, M. & Aittokallio, T. Machine learning and feature selection for drug response prediction in precision oncology applications. Biophys. Rev. 11, 31-39 (2019)]). Recent advances in deep learning have opened a new avenue for finding regression models that predict drug response, and may ultimately provide more accurate tools for predicting treatment response.

Wang et al. proposed a similarity-regularized matrix factorization (SRMF) method for drug response prediction, which incorporates both the gene expression profile similarity of cell lines and the chemical substructure similarity of drugs (see [Wang, L., Li, X., Zhang, L. & Gao, Q. Improved anticancer drug response prediction in cell lines using matrix factorization with similarity regularization. BMC Cancer 17, 513 (2017)]). Patients with similar genetic characteristics were shown to respond similarly to similar drugs. Suphavilai et al. devised "CaDRReS", a matrix factorization-based recommender system that can predict drug response for new drugs and new cell lines by learning to map drugs and cell lines into a latent space (see [Suphavilai, C., Bertrand, D. & Nagarajan, N. Predicting cancer drug response using a recommender system. Bioinformatics 34, 3907-3914 (2018)]). They showed that the features of the latent space correlate with drug pathways. Chang et al. proposed "CDRscan", an ensemble model comprising five convolutional neural networks (CNNs). It uses the mutation profiles of cell lines and the chemical substructures of drugs as input features of the CNNs, and the drug response value is computed as the average of the outputs of the five CNNs. However, "CDRscan" tends to predict drug response poorly for new drugs and new cell lines. Wei et al. recently devised a cell line-drug complex network (CDCN), which predicts drug response by inferring information from a simple network composed of cell lines and drugs (see [Wei, D., Liu, C., Zheng, X. & Li, Y. Comprehensive anticancer drug response prediction based on a simple cell line-drug complex network model. BMC Bioinf. 20, 44 (2019)]). CDCN gives satisfactory results when imputing missing drug information. Moughari and Eslahchi devised a model for Anticancer Drug Response Prediction using Manifold Learning (ADRML) (see [Moughari, F. A. & Eslahchi, C. ADRML: anticancer drug response prediction using manifold learning. Sci. Rep. 10, 14245 (2020)]). ADRML maps drug response values into a low-dimensional latent space and computes the drug response values of new cell line-drug pairs from that latent space. It considers various types of cell line similarity and drug similarity and uses them in the manifold learning procedure. ADRML has been shown to provide accurate and robust predictions.

Even patients with the same type of cancer may respond differently to similar drugs. It is therefore very important to predict drug response accurately for each individual patient.

In addition, existing drug response prediction methods tend to predict drug response poorly for new drugs and new cell lines. A new approach is therefore needed to overcome this limitation.

In addition, a new method is required that can predict drug response with higher accuracy while simplifying the computational process.

One embodiment of the present invention provides a method of predicting the drug response of a drug and a cell line using a convolutional neural network model, the method comprising: preparing a first similarity matrix between drugs; preparing a second similarity matrix between cell lines; calculating an outer product between the first similarity matrix and the second similarity matrix, namely the outer product between the i-th column vector of the first similarity matrix (i = 1, 2, …, m, where m is an integer) and the j-th column vector of the second similarity matrix (j = 1, 2, …, n, where n is an integer); training the convolutional neural network model with the outer product as the input value and the drug response value of the i-th drug and the j-th cell line as the output value; and predicting and outputting the drug response value of a new drug and cell line using the trained convolutional neural network model.

In addition, a drug response prediction method is provided in which the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and m and n denote the numbers of drugs and cell lines in the training dataset, respectively.

In addition, a drug response prediction method is provided in which the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the drugs, including the i-th drug itself, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the cell lines, including the j-th cell line itself.

In addition, the drug response value may be a value quantified through the IC₅₀ (half-maximal inhibitory concentration) value of the cell line for the drug. In addition, a drug response prediction method is provided in which the second similarity matrix includes a first submatrix and a second submatrix, and the step of calculating the outer product between the first similarity matrix and the second similarity matrix comprises calculating a first outer product between the first similarity matrix and the first submatrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second submatrix of the second similarity matrix.

In addition, a drug response prediction method is provided in which the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and the step of training the convolutional neural network model uses the first outer product and the second outer product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively.

In addition, the convolutional neural network model may be a two-dimensional (2D) model.

In addition, the convolutional neural network model may include two 2D convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and two fully connected layers.

Another embodiment of the present invention provides a system for predicting the drug response of a drug and a cell line using a convolutional neural network model, the system comprising: a control unit for controlling the convolutional neural network model; a communication unit for communicating with an external server; a memory unit; a display unit; and an input unit that receives user input, wherein the memory unit contains a first similarity matrix between drugs and a second similarity matrix between cell lines, and the control unit calculates the outer product between the i-th column vector of the first similarity matrix (i = 1, 2, …, m, where m is an integer) and the j-th column vector of the second similarity matrix (j = 1, 2, …, n, where n is an integer), trains the convolutional neural network model with the outer product as the input value and the drug response value of the i-th drug and the j-th cell line as the output value, and predicts the drug response value of a new drug and cell line using the trained convolutional neural network model.

In addition, a drug response prediction system is provided in which the first similarity matrix is an m x m drug-drug similarity matrix based on the Tanimoto coefficient, the second similarity matrix is an n x n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and m and n denote the numbers of drugs and cell lines in the training dataset, respectively.

In addition, a drug response prediction system is provided in which the i-th column vector of the first similarity matrix contains the Tanimoto similarities between the i-th drug and the drugs, including the i-th drug itself, and the j-th column vector of the second similarity matrix contains the gene expression similarities between the j-th cell line and the cell lines, including the j-th cell line itself.

In addition, the drug response value may be a value quantified through the IC₅₀ (half-maximal inhibitory concentration) value of the cell line for the drug.

In addition, a drug response prediction system is provided in which the second similarity matrix includes a first submatrix and a second submatrix, and calculating the outer product between the first similarity matrix and the second similarity matrix comprises calculating a first outer product between the first similarity matrix and the first submatrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second submatrix of the second similarity matrix.

In addition, a drug response prediction system is provided in which the convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and training of the convolutional neural network model uses the first outer product and the second outer product as the input values of the first convolutional neural network model and the second convolutional neural network model, respectively.

One embodiment discloses the Dr.CNN model to address the problems of drug response value prediction. Dr.CNN simply divides the RBF kernel matrix associated with the cell lines into two submatrices and computes the outer product of a column vector of the Tanimoto similarity matrix with a column vector of each RBF kernel submatrix. These outer products are used as the input values of two separate CNN models, with the aim of obtaining fast computation and improved prediction performance through ensemble learning. The RBF kernel matrix may be divided randomly, and may be divided into two or more submatrices depending on the number of cell lines. Dr.CNN is the first nonlinear method that applies a 2D CNN to the outer product between a column vector of the Tanimoto similarity matrix of drugs and a column vector of the RBF kernel matrix of cell lines for drug response prediction.

Experimental results show that Dr.CNN outperforms existing models such as elastic net, random forest (RF), support vector regression (SVR), and a 1D CNN ensemble. Dr.CNN can be improved further by adjusting the CNN architecture to the structure of the data. The main idea behind Dr.CNN is to integrate two modalities through an outer product and to apply a CNN to the resulting matrix. Dr.CNN is a highly effective approach to drug response prediction and can play a major role in the drug development process.

Figure 1 shows the overall workflow of one embodiment proposed for predicting drug response values.
Figure 2 shows a summary of the GDSC1 and GDSC2 datasets used in one embodiment.
Figure 3 shows scatter plots of the values predicted by one embodiment against the measured values for the GDSC2 and GDSC1 datasets.
Figure 4 shows the workflow of an ensemble 1D CNN model for predicting drug response values.
Figure 5 shows the architecture of the similarity-based CNN submodel of one embodiment.
Figure 6 is a block diagram showing an example of a system for implementing the drug response prediction method according to one embodiment.

The present invention will become clear with reference to the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that a person of ordinary skill in the art to which the invention pertains is fully informed of the scope of the invention; the invention is defined only by the scope of the claims. The terms used in this specification are for describing the embodiments and are not intended to limit the present invention. In this specification, singular forms include plural forms unless the context specifically states otherwise. As used herein, "comprises" or "comprising" does not exclude the presence or addition of one or more components or steps other than those mentioned. Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another.

One embodiment of the present invention provides a similarity-based ensemble deep learning model that predicts drug response values by applying a two-dimensional CNN to the outer product of column vectors of the Tanimoto similarity matrix of drugs and the radial basis function (RBF) kernel matrix of cell lines. In one embodiment, this model is called Dr.CNN.

Figure 1 shows the overall workflow of Dr.CNN proposed for predicting drug response values. In this workflow, the final drug response value is predicted from the drug response values produced by two sub-networks running inside the ensemble deep learning architecture, for example as their average. A 2D CNN is used to learn features, with the outer product of a drug similarity vector and a cell line similarity vector as its input. Each CNN may consist of, for example, two convolutional layers, two max-pooling layers, one flatten layer, one dropout layer, and three fully connected (FC) layers. The CNN structure is described in detail below with reference to Figure 5.

To obtain more refined prediction performance and fast computation through ensemble learning, one embodiment divides the RBF kernel matrix of cell lines into two submatrices, constructs the outer product of a column vector of the Tanimoto similarity matrix with a column vector of each RBF kernel submatrix, and uses it as the input of a CNN. The RBF kernel matrix of cell lines may also be divided into more than two submatrices, depending on the number of cell lines. The embodiment may consist of two stages. In the first stage, the Tanimoto similarity matrix of drugs and the RBF kernel matrix of cell lines are computed, and then the outer products between column vectors of the Tanimoto similarity matrix and of each RBF kernel submatrix are computed. In the second stage, a 2D CNN model is applied to extract features from the outer products and to predict a drug response value in each of the two sub-networks. The final prediction is obtained by combining the drug response values from the two trained sub-networks, for example by taking their average.
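The two-stage workflow above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation of the embodiment: the row-wise split of the cell line matrix, the helper names (`split_rows`, `ensemble_predict`), the toy similarity values, and the `mean_entry` stand-in for a trained CNN are all assumptions made for the example.

```python
def outer_product(u, v):
    """Return the len(u) x len(v) matrix with entries u[a] * v[b]."""
    return [[a * b for b in v] for a in u]


def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def split_rows(matrix, n_first):
    """Assumed split: first n_first rows versus the remaining rows."""
    return matrix[:n_first], matrix[n_first:]


def ensemble_predict(drug_sim, cell_sim_parts, sub_models, i, j):
    """Average the sub-model outputs for drug i and cell line j."""
    drug_vec = column(drug_sim, i)
    predictions = []
    for part, model in zip(cell_sim_parts, sub_models):
        # One outer-product input per submatrix, fed to its own sub-model.
        pair_input = outer_product(drug_vec, column(part, j))
        predictions.append(model(pair_input))
    return sum(predictions) / len(predictions)


def mean_entry(matrix):
    """Hypothetical stand-in for a trained CNN: mean of the input entries."""
    values = [x for row in matrix for x in row]
    return sum(values) / len(values)


# Toy similarity matrices (hypothetical values): 2 drugs, 3 cell lines.
drug_sim = [[1.0, 0.5], [0.5, 1.0]]
cell_sim = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.6], [0.4, 0.6, 1.0]]

parts = split_rows(cell_sim, 2)
prediction = ensemble_predict(drug_sim, parts, [mean_entry, mean_entry], i=0, j=2)
```

Splitting by rows keeps a column for every cell line in each submatrix, so both sub-models can score every drug-cell line pair; other ways of dividing the kernel matrix would need a different lookup.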

Dr.CNN according to one embodiment is validated by comparing its root mean squared error (RMSE), concordance index (CI), and modified squared correlation coefficient with those of other machine learning and deep learning models. The CI is the rank correlation between the observed data and the predicted data.
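Two of the evaluation metrics named above can be written down compactly. The sketch below assumes the common pairwise definition of the concordance index (fraction of comparable pairs ranked in the same order, with tied predictions counted as one half); the text does not spell out the exact variant used, and the modified squared correlation coefficient is not covered here. The sample values are hypothetical.

```python
import math
from itertools import combinations


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the observed order."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # pairs with tied observations are not comparable
        comparable += 1
        if p1 == p2:
            concordant += 0.5  # tied predictions count as half concordant
        elif (p1 < p2) == (t1 < t2):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Hypothetical observed and predicted drug response values.
observed = [-2.0, -1.0, 0.0, 1.0]
predicted = [-1.5, -1.0, 0.5, 0.0]

error = rmse(observed, predicted)
ci = concordance_index(observed, predicted)
```

In this toy case five of the six comparable pairs are ordered correctly, so the CI is 5/6, while the RMSE is the square root of 0.375.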

Embodiments of the present invention can help users predict drug response values.

Experimental dataset

The inputs to Dr.CNN in one embodiment are the gene expression of cell lines and the SMILES (Simplified Molecular-Input Line-Entry System) strings of anticancer compounds. GDSC, a publicly available database, is used as the source of the drug responses observed among the pairs of cell lines and drugs. The SMILES strings of the GDSC drugs are obtained from PubChem. GDSC comprises the GDSC1 dataset (see [Iorio, F. et al. A landscape of pharmacogenomic interactions in cancer. Cell 166, 740-754 (2016)]) and the GDSC2 dataset (see [Picco, G. et al. Functional linkage of gene fusions to cancer cell fitness assessed by pharmacological and CRISPR-Cas9 screening. Nat. Commun. 10, 2198 (2019)]). One embodiment uses the GDSC1 and GDSC2 datasets to evaluate drug response prediction. The GDSC1 dataset covers 681 cell lines screened against 234 compounds using the Resazurin or Syto60 assay. The GDSC2 dataset covers 588 cell lines screened against 147 compounds using the CellTitreGlo assay. Of all pairs of the 234 drugs and 681 cell lines, the GDSC1 dataset contains 131,894 observed drug response values, measured as IC₅₀ values. The GDSC2 dataset contains 72,393 observed drug response values, measured as IC₅₀ values, among all pairs of the 147 drugs and 588 cell lines. Table 1 shows the two datasets in the form used in the actual experiments. In one embodiment, the IC₅₀ values are used after conversion to log space.

Table 1
Dataset   Drugs   Cell lines   Interactions   Density (%)
GDSC1     234     681          131,894        82.77
GDSC2     147     588          72,393         83.75
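The density column of Table 1 can be checked from the other three columns, on the reading that density is the share of drug-cell line pairs with an observed response. The short Python check below uses only the figures from the table; the function name `density_percent` is chosen for this example.

```python
def density_percent(n_drugs: int, n_cell_lines: int, n_observed: int) -> float:
    """Share of drug-cell line pairs with an observed response, in percent."""
    return 100.0 * n_observed / (n_drugs * n_cell_lines)


gdsc1 = density_percent(234, 681, 131_894)  # 159,354 possible pairs
gdsc2 = density_percent(147, 588, 72_393)   # 86,436 possible pairs
print(f"GDSC1: {gdsc1:.2f}%  GDSC2: {gdsc2:.2f}%")
```

Both values round to the densities listed in the table (82.77 and 83.75), which also shows that not every drug-cell line pair has a measured response.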

Figure 2 shows a summary of the GDSC1 and GDSC2 datasets. Panels (a) and (b) of Figure 2 show the distributions of the drug response values in the GDSC1 and GDSC2 datasets, respectively, and panels (c) and (d) show the distributions of the SMILES string lengths of the drugs in the GDSC1 and GDSC2 datasets, respectively. For the drug response values of the GDSC1 dataset, the mean and standard deviation are -0.9032 and 1.1777, respectively. For those of the GDSC2 dataset, the mean and standard deviation are -1.2472 and 1.2182, respectively. For the drugs in the GDSC1 dataset, the maximum SMILES length is 133 and the mean is 62. For the drugs in the GDSC2 dataset, the maximum SMILES length is 126 and the mean is 62.

Representation of input and output values

In one embodiment, Dr.CNN uses a drug-drug similarity matrix and a cell line-cell line similarity matrix computed for the GDSC1 and GDSC2 datasets. To build an ensemble model, one embodiment partitions a similarity matrix into two submatrices. The input to Dr.CNN for each drug-cell line pair is therefore the outer product of two similarity vectors: the i-th column of the drug-drug similarity matrix and the j-th column of the cell line-cell line similarity matrix, with the column vectors taken from the similarity matrices and submatrices described above. For an m x 1 drug similarity vector $x$ and an n x 1 cell line similarity vector $y$, where the superscript $T$ denotes the transpose of a vector, the outer product is the m x n matrix defined in Formula (1).

$$x \otimes y = x\,y^{T}, \qquad (x \otimes y)_{ab} = x_a\,y_b$$
Formula (1)

This outer product yields two sets of information: the bimodal interactions between the two modalities and the raw unimodal representations of the individual modalities. The outer product therefore contains every pairwise combination of the information in the drug similarity vector and the cell line similarity vector. This suggests that, for predicting the response of the j-th cell line to the i-th drug, the outer product can be a more effective input than a simple concatenation of the two vectors.
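A minimal sketch of how such an input could be assembled, assuming small toy similarity matrices stored as plain Python lists. The names `column`, `outer_product`, `drug_sim`, and `cell_sim` are illustrative and do not come from the patent, and the submatrix split used for the ensemble is not modeled here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def column(matrix: Matrix, index: int) -> Vector:
    """Return one column of a matrix stored as a list of rows."""
    return [row[index] for row in matrix]


def outer_product(x: Vector, y: Vector) -> Matrix:
    """Return the len(x)-by-len(y) matrix whose (a, b) entry is x[a] * y[b]."""
    return [[xa * yb for yb in y] for xa in x]


# Toy similarity matrices (assumed values): 3 drugs and 2 cell lines.
drug_sim = [
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
]
cell_sim = [
    [1.0, 0.8],
    [0.8, 1.0],
]

# Input for the pair (drug 0, cell line 1): a 3 x 2 matrix.
pair_input = outer_product(column(drug_sim, 0), column(cell_sim, 1))
```

In this toy case each item has similarity 1 with itself, so row 0 of `pair_input` reproduces the cell line vector and column 1 reproduces the drug vector, while the remaining entries are pairwise products. That is one way to read the statement that the outer product carries both unimodal and bimodal information.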

Drug-drug similarity is computed using the Tanimoto coefficient, the most popular similarity measure for comparing chemical structures represented as fingerprints. One embodiment uses the topological fingerprint of RDKit. The Tanimoto similarity takes values from 0 to 1 and can be interpreted as the proportion of features shared by two drugs. Cell line-cell line similarity, in contrast, can be computed from gene expression vectors using the RBF kernel given in Formula (2). The RBF kernel is a popular kernel function used in various kernel learning algorithms.
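The Tanimoto coefficient itself is simple to state: the number of features two fingerprints share divided by the number of features present in either. The sketch below assumes each fingerprint is given as a set of on-bit indices; it does not call RDKit, and the function names and the convention for two empty fingerprints are assumptions made for illustration.

```python
from typing import List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union_size = len(a | b)
    if union_size == 0:
        # Assumed convention: two empty fingerprints have similarity 0.
        return 0.0
    return len(a & b) / union_size


def similarity_matrix(fingerprints: List[Set[int]]) -> List[List[float]]:
    """Symmetric drug-drug similarity matrix built from pairwise Tanimoto values."""
    return [[tanimoto(fa, fb) for fb in fingerprints] for fa in fingerprints]


# Toy fingerprints (assumed values) for three hypothetical drugs.
toy_fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
toy_matrix = similarity_matrix(toy_fingerprints)
```

Each column of a matrix built this way is a drug similarity vector of the kind used as one factor of the outer product.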

Formula (2)

Here, the kernel is evaluated on the gene expression vectors of the cell lines, the vector with index i being that of the i-th cell line, and the bandwidth parameter is estimated as in He et al. (He, T. et al. SimBoost: A read-across approach for predicting drug-target binding affinities using gradient boosting machines. J. Cheminf. 9, 24. https://doi.org/10.1186/s13321-017-0209-z (2017)). The RBF kernel is a measure of similarity between vectors. Its value decreases with the distance between the vectors and ranges from 0 (in the limit of infinite distance) to 1 (when the two vectors are identical). That is, when two vectors are close to each other, the distance between them is small and the kernel value approaches 1. Close vectors therefore have larger RBF kernel values than distant vectors.
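The behavior described above can be seen in a small sketch. It assumes the common parameterization exp(-d^2 / (2 * sigma^2)), where d is the Euclidean distance and `sigma` is the bandwidth; the exact form of Formula (2) and the bandwidth estimation procedure are not reproduced here, so both the parameterization and the name `rbf_kernel` are assumptions.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], sigma: float) -> float:
    """RBF similarity of two vectors, assuming exp(-||u - v||^2 / (2 * sigma^2))."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Toy gene expression vectors (assumed values).
reference = [0.0, 0.0]
near = [1.0, 0.0]
far = [3.0, 0.0]

k_same = rbf_kernel(reference, reference, sigma=1.0)  # identical vectors
k_near = rbf_kernel(reference, near, sigma=1.0)       # small distance
k_far = rbf_kernel(reference, far, sigma=1.0)         # larger distance
```

The three values decrease in that order, from exactly 1 for identical vectors toward 0 as the distance grows.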

The output value for each drug-cell line pair is the corresponding drug response value.

Performance evaluation metrics

Elastic net, random forest (RF), and support vector regression (SVR) are common regression methods proposed for drug response prediction, so one embodiment treats them as baseline methods. One embodiment compares the performance of elastic net, RF, SVR, a 1D CNN ensemble, and Dr.CNN on the evaluation datasets described above. The 1D CNN ensemble is similar to the one used by Park et al. (Park, H. et al. Detection of chromosome structural variation by targeted next-generation sequencing and deep learning application. Sci. Rep. 9, 3644 (2019)). The input to elastic net, RF, SVR, and the 1D CNN ensemble is the concatenation of a 2048-bit extended-connectivity fingerprint (ECFP) and a gene expression vector of 172 values selected from 19,144 values. In one embodiment, prediction performance on the GDSC2 dataset was evaluated by 5-fold cross-validation. This technique randomly splits the dataset into 5 folds of approximately equal size. One fold is held out as the validation set and the model is trained on the remaining 4 folds. The procedure is repeated 5 times, each time with a different fold held out as the validation set. One embodiment also evaluated prediction performance on the GDSC1 dataset after training the five models with the GDSC2 dataset as the training dataset. To evaluate the regression models, one embodiment used four metrics: RMSE, CI, the Pearson correlation coefficient, and the modified squared correlation coefficient.
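The 5-fold procedure can be sketched with the standard library alone. The helper below is a hypothetical illustration, not the patent's implementation: it shuffles sample indices with a fixed seed, deals them into k folds of nearly equal size, and yields each fold once as the validation set.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [
            idx
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for idx in fold
        ]
        yield training, validation


# Example: 23 drug-cell line pairs split into 5 folds.
splits = list(k_fold_indices(23, k=5))
```

Every sample appears in exactly one validation set, so each observed pair is predicted once by a model that did not see it during training.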

Because the task is a regression task, one embodiment uses the root mean squared error (RMSE), a commonly used measure of error in continuous prediction. With $y_i$ the actual output value, $\hat{y}_i$ the corresponding prediction, and $n$ the number of samples, RMSE is defined in Formula (3).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}$$
Formula (3)
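A short sketch of the RMSE computation, written with the standard library only. The function name `rmse` and the toy values are illustrative.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example (assumed values): errors of 3 and 4 give sqrt((9 + 16) / 2).
toy_rmse = rmse([0.0, 0.0], [3.0, 4.0])
```

Lower values are better, and a value of 0 means every prediction matches its measured value exactly.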

As proposed by Pahikkala et al., the concordance index (CI) can be used as an evaluation metric for prediction accuracy (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)). The intuition behind CI is as follows: over a set of paired data, CI is the probability that the predictions for two randomly drawn drug-cell line pairs with different label values are in the correct order, meaning that the prediction $b_i$ for the larger label value $\delta_i$ is greater than the prediction $b_j$ for the smaller label value $\delta_j$.

$$\mathrm{CI} = \frac{1}{Z}\sum_{\delta_i > \delta_j} h\left(b_i - b_j\right)$$
Formula (4)

Here, $Z$ is a normalization constant and $h(x)$ is a step function (Pahikkala, T. et al. Toward more realistic drug-target interaction predictions. Brief. Bioinform. 16, 325-337 (2015)).

$$h(x) = \begin{cases} 1, & x > 0 \\ 0.5, & x = 0 \\ 0, & x < 0 \end{cases}$$
Formula (5)

CI ranges from 0.5 to 1.0, where 0.5 corresponds to random prediction and 1.0 corresponds to perfect prediction accuracy.
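The concordance index can be computed by direct enumeration of ordered pairs, as in the sketch below. It assumes the normalization constant is the number of pairs with different label values, and it uses a quadratic-time loop that is meant for illustration rather than for datasets of the size discussed here. The function name is hypothetical.

```python
from typing import Sequence


def concordance_index(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of comparable pairs whose predictions are in the correct order."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    concordant = 0.0
    comparable = 0
    for i in range(len(y_true)):
        for j in range(len(y_true)):
            if y_true[i] > y_true[j]:  # only pairs with different labels count
                comparable += 1
                difference = y_pred[i] - y_pred[j]
                if difference > 0:
                    concordant += 1.0   # correct order
                elif difference == 0:
                    concordant += 0.5   # tie in the predictions
    if comparable == 0:
        raise ValueError("at least two distinct label values are required")
    return concordant / comparable


# Toy labels (assumed values) with perfectly ordered and constant predictions.
labels = [0.1, 0.4, 0.9, 1.5]
ci_perfect = concordance_index(labels, [1.0, 2.0, 3.0, 4.0])
ci_constant = concordance_index(labels, [2.0, 2.0, 2.0, 2.0])
```

Perfectly ordered predictions give 1.0, while predictions that carry no ordering information, such as a constant, give 0.5.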

To better assess the predictive ability of a model, Roy and Roy introduced the modified squared correlation coefficient $r_m^2$ (Roy, P. & Roy, K. On some aspects of variable selection for partial least squares regression models. QSAR Comb. Sci. 27, 302-313 (2008)).

$$r_m^2 = r^2\left(1 - \sqrt{r^2 - r_0^2}\right)$$
Formula (6)

Here, $r^2$ and $r_0^2$ are the squared correlation coefficients with and without an intercept, respectively. A model with $r_m^2 > 0.5$ on the test dataset is judged to be an acceptable model.
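One common way to compute the modified squared correlation coefficient is sketched below. It assumes that the coefficient with intercept is the squared Pearson correlation and that the coefficient without intercept comes from a least-squares line through the origin; these choices, the absolute value inside the square root (a guard against small negative differences), and all function names are assumptions rather than the patent's stated procedure.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def r_squared(y_obs: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Pearson correlation (regression with an intercept)."""
    mean_obs, mean_pred = _mean(y_obs), _mean(y_pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(y_obs, y_pred))
    var_obs = sum((o - mean_obs) * (o - mean_obs) for o in y_obs)
    var_pred = sum((p - mean_pred) * (p - mean_pred) for p in y_pred)
    return (cov * cov) / (var_obs * var_pred)


def r_squared_through_origin(y_obs: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared correlation for a regression line forced through the origin."""
    slope = sum(o * p for o, p in zip(y_obs, y_pred)) / sum(p * p for p in y_pred)
    mean_obs = _mean(y_obs)
    residual = sum((o - slope * p) ** 2 for o, p in zip(y_obs, y_pred))
    total = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - residual / total


def modified_r_squared(y_obs: Sequence[float], y_pred: Sequence[float]) -> float:
    """Modified squared correlation: r^2 * (1 - sqrt(|r^2 - r0^2|))."""
    r2 = r_squared(y_obs, y_pred)
    r02 = r_squared_through_origin(y_obs, y_pred)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))


# Toy measured and predicted values (assumed).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
toy_rm2 = modified_r_squared(observed, predicted)
```

The metric penalizes models whose fit depends on a non-zero intercept, so it never exceeds the ordinary squared correlation.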

Training and evaluation

The following describes the performance of Dr.CNN, one embodiment, in predicting drug response values on the GDSC1 and GDSC2 datasets. One embodiment evaluated the model on these two benchmark datasets. First, because the GDSC2 dataset is more recent than the GDSC1 dataset, prediction performance on the GDSC2 dataset was evaluated by 5-fold cross-validation. Table 2 reports the performance of the five models on the GDSC2 dataset under nested 5-fold cross-validation. Values in bold indicate the best result, and standard errors are given in parentheses.

Model          RMSE               CI                 Pearson correlation    Modified squared correlation
RF             0.5386 (0.0026)    0.8433 (0.0002)    0.8970 (0.0011)        0.7997 (0.0018)
SVR            0.7305 (0.0398)    0.7923 (0.0021)    0.8307 (0.0033)        0.6455 (0.0227)
Elastic Net    0.7522 (0.0377)    0.7843 (0.0094)    0.8084 (0.0176)        0.5416 (0.0410)
1D CNN         0.5390 (0.0022)    0.8448 (0.0004)    0.8979 (0.0009)        0.7909 (0.0060)
Dr.CNN         0.5085 (0.0042)    0.8536 (0.0007)    0.9098 (0.0011)        0.8162 (0.0043)

As Table 2 shows, Dr.CNN achieves the best performance on every metric for the GDSC2 dataset. To assess statistically whether the improvement is significant, one-sided t-tests were performed comparing Dr.CNN, the best-performing model, with each of the other models. The null hypotheses associated with Table 2 state, for each of the four metrics, that Dr.CNN performs no better than the compared model. All of the corresponding p-values are less than 0.01. Dr.CNN therefore performs significantly better than the other models on all four metrics. In particular, Dr.CNN yields a significantly larger modified squared correlation coefficient than the other models in 5-fold cross-validation on the GDSC2 dataset, which makes it the most acceptable model.

Next, the five models were trained with the GDSC2 dataset as the training dataset, and their prediction performance on the GDSC1 dataset was evaluated. Table 3 reports the results of the five models on the GDSC1 dataset. Values in bold indicate the best result, and standard errors are given in parentheses. Because each evaluation metric is computed once on the single GDSC1 dataset, the metric values alone cannot establish which model is significantly better in a statistical sense. The bootstrap method was therefore used to estimate the mean and standard error of each metric (Efron, B. & Tibshirani, R. An Introduction to the Bootstrap (Chapman Hall, 1993)), which makes it possible to identify a statistically significant model from the estimated means and standard errors. The bootstrap method is known to be useful for estimating the sampling distribution of an evaluation metric without relying on normal theory. It involves repeatedly sampling the GDSC1 dataset with replacement. When bootstrap sampling is applied to the GDSC1 dataset, the bootstrap sample size is set to 131,894 and the sampling process is repeated 20 times.
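The resampling step can be sketched as follows. The sketch assumes that the standard error is taken as the standard deviation of the metric across bootstrap replicates, which is the usual bootstrap estimate; the patent does not spell out this detail. The helper names and the mean absolute error used as the example metric are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Metric = Callable[[Sequence[float], Sequence[float]], float]


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Example metric used only to demonstrate the resampling loop."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_metric(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    metric: Metric,
    n_resamples: int = 20,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, standard error) of a metric over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates: List[float] = []
    for _ in range(n_resamples):
        # Draw n indices with replacement, so the resample has the original size.
        picks = [rng.randrange(n) for _ in range(n)]
        estimates.append(
            metric([y_true[i] for i in picks], [y_pred[i] for i in picks])
        )
    mean = sum(estimates) / n_resamples
    variance = sum((e - mean) ** 2 for e in estimates) / (n_resamples - 1)
    return mean, math.sqrt(variance)


# Toy data (assumed): every prediction is off by exactly 1.
boot_mean, boot_se = bootstrap_metric([0.0] * 50, [1.0] * 50, mean_absolute_error)
```

With the settings described above, `n` would be 131,894 and `n_resamples` would be 20, and the same loop would be run once per metric and per model.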

Model          RMSE               CI                 Pearson correlation    Modified squared correlation
RF             1.1992 (0.0006)    0.6180 (0.0002)    0.4210 (0.0008)        0.1420 (0.0007)
SVR            1.0948 (0.0004)    0.6176 (0.0002)    0.4606 (0.0005)        0.1756 (0.0005)
Elastic Net    1.0890 (0.0004)    0.6398 (0.0002)    0.4829 (0.0006)        0.2034 (0.0006)
1D CNN         1.0652 (0.0006)    0.6321 (0.0001)    0.5080 (0.0005)        0.2536 (0.0003)
Dr.CNN         1.0524 (0.0005)    0.6531 (0.0002)    0.5597 (0.0005)        0.3024 (0.0007)

As Table 3 shows, Dr.CNN achieves the best performance on all four metrics over the 20 bootstrap samples of the GDSC1 dataset. As above, one-sided t-tests were performed to assess statistically whether the improvement of Dr.CNN is significant, comparing Dr.CNN, the best-performing model, with each of the other models. The null hypotheses associated with Table 3 state, for each of the four metrics, that Dr.CNN performs no better than the compared model. All of the corresponding p-values are less than 0.01. Dr.CNN therefore performs significantly better than the other models on all four metrics. Although Dr.CNN does not attain a sufficiently large modified squared correlation coefficient, it is still the most acceptable model, because its value is larger than those of the other models over the 20 bootstrap samples of the GDSC1 dataset.

To illustrate the predictive ability visually, predicted values are plotted against measured values for the datasets. Figure 3 shows scatter plots of the values predicted by Dr.CNN against the measured values for the GDSC2 and GDSC1 datasets. For the GDSC2 dataset, the predictions are obtained by using 4 folds of the GDSC2 dataset as the training dataset and the remaining fold as the test dataset, rotating through the folds. For the GDSC1 dataset, the predictions are obtained by using the GDSC2 dataset as the training dataset and the GDSC1 dataset as the test dataset. An ideal regression model is expected to produce predicted values equal to the measured values, so that the points lie on the identity line. For the GDSC2 dataset in particular, the points are densely concentrated around this line.

Baseline models

For the baseline models, three conventional machine learning models and one deep learning-based model are considered. The conventional machine learning models are elastic net, RF, and SVR. Elastic net first emerged in response to the criticism of the Lasso (least absolute shrinkage and selection operator) that its variable selection can depend too heavily on the data and thus be unstable. The solution is to combine the penalties of ridge regression and the Lasso to get the best of both. The Lasso is a regression method that performs both variable selection and regularization in order to improve the prediction accuracy and interpretability of a statistical model. RF assigns input data sampled with replacement to many decision trees for training, collects the predictions of the trees for a drug-cell line pair, and determines the drug response by averaging them. As each tree is grown, only a subset of all features is considered when choosing the split at each node. The algorithm is simple, fast, and resistant to overfitting, and it generally performs better than a single good regression model. SVR offers the flexibility to define how much error is acceptable in the model and finds an appropriate hyperplane in a high-dimensional space to fit the data. The objective function of SVR minimizes not the squared error but the coefficients, specifically the L2 norm of the coefficient vector. The error term is instead handled in the constraints, which require the absolute error to be less than or equal to a specified margin called the maximum error, ε. One embodiment can tune ε to obtain the required accuracy of the model. SVR has proven to be an effective tool for real-valued function estimation. One advantage of SVR is that its computational complexity does not depend on the dimensionality of the input space. It also has excellent generalization capability with high prediction accuracy. The input to elastic net, RF, and SVR is the concatenation of a 2048-bit ECFP and a gene expression vector of 172 values selected from 19,144 values. The deep learning model is an ensemble 1D CNN-based prediction model in which each individual 1D CNN takes as input a 2048-bit ECFP vector and a gene expression vector of 172 values selected from 19,144 values.

Figure 4 shows the workflow of the ensemble 1D CNN model for predicting drug response values. In a 1D CNN, the kernels and pooling windows move along a single dimension. The stride is set to 1 for all convolutional layers and to 2 for all max-pooling layers.

Dr.CNN is a 2D CNN-based prediction model that takes as input the outer product of a drug similarity vector and a cell line similarity vector. To prepare the input, an m x m drug-drug similarity matrix based on the Tanimoto coefficient and an n x n cell line-cell line similarity matrix based on gene expression scores passed through the RBF kernel function are first computed. The 2048-bit ECFP fingerprint of RDKit is used to compute the Tanimoto coefficient. Here, m and n are the numbers of drugs and cell lines in the training dataset, respectively. Then, for every drug-cell line pair, the outer product of the m x 1 drug similarity vector and the n x 1 cell line similarity vector is computed, where these vectors are the i-th column of the drug-drug similarity matrix and the j-th column of the cell line-cell line similarity matrix, respectively. In other words, the drug similarity vector consists of the Tanimoto similarities between the i-th drug and all drugs, including itself, and the cell line similarity vector consists of the gene expression similarities between the j-th cell line and all cell lines, including itself.

The parameters of Dr.CNN can be obtained by training with the outer product as input and the drug response as output.

Figure 5 shows the architecture of the similarity-based CNN submodel of Dr.CNN. As shown in Figure 5, the CNN submodel may consist of two 2D convolutional layers, each followed by max pooling, one flatten layer, and FC(128), FC(64), and FC(1) layers, where the number in parentheses is the number of nodes. The FC(128) and FC(64) layers may use the rectified linear unit (ReLU) activation, and the FC(1) layer may use a linear activation. To reduce overfitting, one embodiment may include dropout layers with a rate of 0.1 between the flatten layer and the FC(128) layer, between the FC(128) layer and the FC(64) layer, and between the FC(64) layer and the FC(1) layer. The two convolutional layers have 18 and 24 filters, with kernel sizes of 5 x 5 and 3 x 3, respectively. The max-pooling layers have size 2 and stride 2. For training, one embodiment may set the batch size to 32 and the number of epochs to 20, and may use the Adam optimizer with a learning rate of 0.001.
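To get a feel for the size of this submodel, the sketch below traces the feature-map shape through the two convolution and pooling stages and reports the length of the flattened vector fed to FC(128). It assumes unpadded ("valid") convolutions with stride 1, which the description of the 2D submodel does not state, and it uses a 147 x 588 input purely as an illustration; the actual input size depends on the numbers of drugs and cell lines in the training data. All function names are hypothetical.

```python
from typing import Tuple


def conv_output(size: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


def pool_output(size: int, pool: int = 2, stride: int = 2) -> int:
    """Output length of max pooling along one dimension."""
    return (size - pool) // stride + 1


def flattened_size(n_rows: int, n_cols: int) -> int:
    """Length of the flatten layer output for an n_rows x n_cols input."""
    shape: Tuple[int, int] = (n_rows, n_cols)
    channels = 1
    # (kernel size, number of filters) for the two convolutional layers.
    for kernel, filters in ((5, 18), (3, 24)):
        shape = (conv_output(shape[0], kernel), conv_output(shape[1], kernel))
        shape = (pool_output(shape[0]), pool_output(shape[1]))
        channels = filters
    return shape[0] * shape[1] * channels


# Illustrative input: 147 drugs x 588 cell lines.
example_flat = flattened_size(147, 588)
```

Under these assumptions the flatten layer is much wider than the 128-node layer that follows it, so most trainable weights sit in the first fully connected layer.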

Figure 6 is a block diagram showing an example of a system for implementing a drug response prediction method according to one embodiment, and conceptually shows the parts related to this embodiment. The components may all be provided in a single device that performs the processing on its own, but the invention is not limited to this; the components may also run on separate devices connected through a network.

The external server (20) may be connected to the prediction system (10) through a network and may provide information such as chemical property information of drugs, cell line information, and drug response information. For example, the chemical property information of a drug may include SMILES (Simplified Molecular-Input Line-Entry System) information, and the cell line information may include cell line gene expression information. Specifically, GDSC provides information on the drug responses observed for pairs of cell lines and drugs, and this information may be received from the external server (20). For example, the external server (20) may be a database for the drug response prediction processing of the prediction system (10), or a server that provides such a database.

The prediction system (10) may include a control unit (11), a communication unit (12), an input/output interface unit (13), and a memory unit (14).

The control unit (11) controls the prediction system (10) as a whole and may include, for example, a processing unit such as a CPU or GPU. The control unit (11) can train the models using information stored in the memory unit (14), and can also compute predicted values for new inputs using a trained model. Specifically, the control unit (11) can control the model that predicts drug response. To this end, the control unit (11) may include a control program such as an operating system (OS), programs that define various processing procedures, and an internal memory for storing data. The control unit (11) can then perform information processing to execute various processes by means of these programs.

The communication unit (12) may include an interface that can be connected to a communication device, such as a router connected to a communication line, and can control communication between the prediction system (10) and the external server (20).

The input/output interface unit (13) may be an interface connected to the input unit (15) and/or the display unit (16). The user can interact with the prediction system (10) through the input/output interface unit (13). For example, the display unit (16) may be a display means that shows the screen of an application or the like (for example, a liquid crystal or organic EL display, a monitor, or a touch panel). The input unit (15) may be, for example, a key input unit, a touch panel, a control pad (for example, a touch pad or a game pad), a mouse, a keyboard, or a microphone.

The memory unit (14) may be a device that stores various databases, tables, and the like. For example, the memory unit may provide information such as chemical property information of drugs, cell line information, and drug response information. For example, the chemical property information of a drug may include SMILES (Simplified Molecular-Input Line-Entry System) information, and the cell line information may include cell line gene expression information. The memory unit may also store the processes for the input and output of the prediction system (10), as well as the results of those processes.

이상 설명된 본 발명에 따른 실시예들은 다양한 컴퓨터 구성요소를 통하여 수행될 수 있는 프로그램 명령어의 형태로 구현되어 컴퓨터 판독 가능한 기록 매체에 기록될 수 있다. 컴퓨터 판독 가능한 기록 매체는 프로그램 명령어, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 컴퓨터 판독 가능한 기록 매체에 기록되는 프로그램 명령어는 본 발명을 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 분야의 당업자에게 공지되어 사용 가능한 것일 수도 있다. 컴퓨터 판독 가능한 기록 매체의 예에는, 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체, CD-ROM, DVD와 같은 광기록 매체, 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media), 및 ROM, RAM, 플래시 메모리 등과 같은 프로그램 명령어를 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. 프로그램 명령어의 예에는, 컴파일러에 의해 만들어지는 것과 같은 기계어 코드 뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드도 포함된다. 상기 하드웨어 장치는 본 발명에 따른 처리를 수행하기 위해 하나 이상의 소프트웨어 모듈로서 작동하도록 구성될 수 있으며, 그 역도 마찬가지이다.Embodiments according to the present invention described above may be implemented in the form of program instructions that can be executed through various computer components and recorded on a computer-readable recording medium. A computer-readable recording medium may include program instructions, data files, data structures, etc., singly or in combination. Program instructions recorded on a computer-readable recording medium may be specially designed and configured for the present invention, or may be known and usable by those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks and magnetic tapes, optical recording media such as CD-ROMs and DVDs, and magneto-optical media such as floptical disks. media), and hardware devices specifically configured to store and perform program instructions, such as ROM, RAM, flash memory, etc. Examples of program instructions include not only machine language code such as that created by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform processing according to the invention and vice versa.

또한, 이상 설명된 본 발명에 따른 실시예들은 다양한 컴퓨터 구성요소를 통하여 수행될 수 있는 프로그램 명령어의 집합 및 이를 실행하기 위한 사용자 애플리케이션 자체일 수도 있다. 구체적으로, 서버를 통해 또는 저장매체를 통해 다운로드하여 클라이언트 컴퓨터에 설치할 수 있는 프로그램 그 자체일 수도 있다.Additionally, the embodiments according to the present invention described above may be a set of program instructions that can be executed through various computer components and a user application for executing them. Specifically, it may be a program itself that can be downloaded from a server or via a storage medium and installed on a client computer.

Although the present invention has been described above with reference to specific details, such as particular components, and to limited embodiments and drawings, these are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be limited to the embodiments described above. The claims set forth below, and everything modified equally or equivalently to those claims, fall within the scope of the spirit of the present invention.

In addition, the embodiments of the present invention are not mutually exclusive, and the configuration of one embodiment may be applied to another embodiment. The embodiments present, by way of example, some of the many forms that can be derived from various combinations of the components, and the present invention is not limited to the specific embodiments themselves.

10: Prediction system 20: External server
11: Control unit 12: Communication unit
13: Input/output interface unit 14: Memory unit
15: Input unit 16: Display unit

Claims

A method of predicting the drug response of a drug and a cell line using a two-dimensional convolutional neural network model, the method comprising:
preparing a first similarity matrix between drugs;
preparing a second similarity matrix between cell lines;
calculating an outer product between the first similarity matrix and the second similarity matrix, namely an outer product between the i-th column vector of the first similarity matrix (i = 1, 2, ..., m; m is an integer) and the j-th column vector of the second similarity matrix (j = 1, 2, ..., n; n is an integer);
training the two-dimensional convolutional neural network model using the two-dimensional matrix obtained as the outer product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value; and
predicting and outputting a drug response value of a new drug and a cell line using the trained two-dimensional convolutional neural network model,
wherein the i-th column vector of the first similarity matrix is the similarity between the i-th drug and the other drugs, including the i-th drug itself, and
the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines, including the j-th cell line itself.
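
By way of illustration only, the product between column vectors recited above (an outer product, since it yields a two-dimensional matrix) can be sketched in Python as follows. The toy matrices `drug_sim` and `cell_sim` and the helper names `outer_product`, `column`, and `cnn_input` are assumptions introduced for this example and are not taken from the specification.

```python
def outer_product(u, v):
    """Return the len(u) x len(v) matrix whose (p, q) entry is u[p] * v[q]."""
    return [[a * b for b in v] for a in u]


def column(matrix, k):
    """Return the k-th column of a matrix stored as a list of rows."""
    return [row[k] for row in matrix]


# Hypothetical toy similarity matrices (values are illustrative only).
drug_sim = [[1.0, 0.5, 0.2],
            [0.5, 1.0, 0.4],
            [0.2, 0.4, 1.0]]   # m = 3 drugs
cell_sim = [[1.0, 0.25],
            [0.25, 1.0]]       # n = 2 cell lines


def cnn_input(i, j):
    """m x n input matrix for the pair (drug i, cell line j), 0-based indices."""
    return outer_product(column(drug_sim, i), column(cell_sim, j))


x = cnn_input(0, 1)  # 3 x 2 matrix paired with the response of drug 0 on cell line 1
```

Each (drug, cell line) pair therefore yields one m × n matrix, and that matrix would be paired with the measured drug response value of the pair as the training target.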

The method according to claim 1,
wherein the first similarity matrix is an m×m drug-drug similarity matrix based on the Tanimoto coefficient,
the second similarity matrix is an n×n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
m and n denote the number of drugs and the number of cell lines in the training dataset, respectively.
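
As a minimal sketch of the two similarity measures named above, the following Python code builds a Tanimoto-based drug-drug matrix and an RBF kernel cell line-cell line matrix. It assumes that each drug is represented by the set of "on" bits of a binary fingerprint and each cell line by a gene expression vector. The fingerprint encoding, the value of `gamma`, and all function and variable names are assumptions made for illustration.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient |A & B| / |A | B| of two sets of 'on' fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def rbf(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2) of two equal-length vectors."""
    squared_distance = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * squared_distance)


def similarity_matrix(items, sim):
    """Square matrix of pairwise similarities sim(a, b) over all items."""
    return [[sim(a, b) for b in items] for a in items]


# Hypothetical toy data: three drug fingerprints and three expression profiles.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
expression = [[0.0, 1.0], [0.0, 1.0], [3.0, 5.0]]

S_drug = similarity_matrix(fingerprints, tanimoto)                         # m x m
S_cell = similarity_matrix(expression, lambda x, y: rbf(x, y, gamma=0.1))  # n x n
```

Both matrices are symmetric with ones on the diagonal, so the i-th column of each matrix describes the i-th item by its similarity to every item in the training dataset, itself included.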

The method according to claim 1,
wherein the i-th column vector of the first similarity matrix is the Tanimoto similarity between the i-th drug and the other drugs, including the i-th drug itself, and
the two-dimensional convolutional neural network model includes two two-dimensional convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.
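
To make the recited layer stack concrete, the following Python sketch traces tensor shapes through one possible arrangement of those layers. The layer order (convolution, pooling, convolution, pooling, flattening, dropout, two fully connected layers), the 3×3 kernels without padding, the 2×2 pooling, the filter counts, and the hidden width are all assumptions chosen for illustration; the claim itself fixes only the number of layers of each type.

```python
def conv2d_shape(h, w, kernel, stride=1):
    """Output height and width of a 2D convolution without padding."""
    return (h - kernel) // stride + 1, (w - kernel) // stride + 1


def maxpool_shape(h, w, pool):
    """Output height and width of non-overlapping max pooling (stride == pool)."""
    return h // pool, w // pool


def trace_shapes(m, n, filters=(32, 64), kernel=3, pool=2, hidden=128):
    """List (layer name, output shape) pairs for an m x n single-channel input."""
    shapes = [("input", (m, n, 1))]
    h, w = m, n
    for k, f in enumerate(filters, start=1):
        h, w = conv2d_shape(h, w, kernel)
        shapes.append((f"conv{k}", (h, w, f)))
        h, w = maxpool_shape(h, w, pool)
        shapes.append((f"pool{k}", (h, w, f)))
    flat = h * w * filters[-1]
    shapes.append(("flatten", (flat,)))
    shapes.append(("dropout", (flat,)))   # dropout leaves the shape unchanged
    shapes.append(("dense1", (hidden,)))
    shapes.append(("dense2", (1,)))       # single regression output
    return shapes


shapes = trace_shapes(m=20, n=30)  # hypothetical: 20 drugs, 30 cell lines
```

With these assumed settings a 20 × 30 input shrinks to 3 × 6 × 64 before flattening, which gives 1152 features for the fully connected layers.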

The method according to any one of claims 1 to 3,
wherein the drug response value is a value quantified by the IC₅₀ (half-maximal inhibitory concentration) of the drug for the cell line.

The method according to claim 1,
wherein the first similarity matrix is an m×m drug-drug similarity matrix,
the second similarity matrix is an n×n cell line-cell line similarity matrix,
m and n denote the number of drugs and the number of cell lines in the training dataset, respectively,
the second similarity matrix includes a first submatrix and a second submatrix, and
the step of calculating the outer product between the first similarity matrix and the second similarity matrix includes:
calculating a first outer product between the first similarity matrix and the first submatrix of the second similarity matrix, and calculating a second outer product between the first similarity matrix and the second submatrix of the second similarity matrix.
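
The submatrix variant above can be sketched as follows. The claim does not state how the second similarity matrix is divided, so this example assumes a split by rows, which shortens each cell line column vector into two parts; the split point `cut`, the toy values, and all names are hypothetical.

```python
def outer_product(u, v):
    """Return the len(u) x len(v) matrix whose (p, q) entry is u[p] * v[q]."""
    return [[a * b for b in v] for a in u]


def column(matrix, k):
    """Return the k-th column of a matrix stored as a list of rows."""
    return [row[k] for row in matrix]


def split_rows(matrix, cut):
    """Split a matrix into a top block (rows before cut) and a bottom block."""
    return matrix[:cut], matrix[cut:]


# Hypothetical toy similarity matrices.
drug_sim = [[1.0, 0.5],
            [0.5, 1.0]]              # m = 2
cell_sim = [[1.0, 0.3, 0.1, 0.2],
            [0.3, 1.0, 0.4, 0.6],
            [0.1, 0.4, 1.0, 0.7],
            [0.2, 0.6, 0.7, 1.0]]    # n = 4

sub1, sub2 = split_rows(cell_sim, cut=2)  # assumed row-wise split: two 2 x 4 blocks


def paired_inputs(i, j):
    """First and second products for drug i and cell line j, one per submatrix."""
    d = column(drug_sim, i)
    return outer_product(d, column(sub1, j)), outer_product(d, column(sub2, j))


x1, x2 = paired_inputs(0, 1)  # each is 2 x 2 under this split
```

Under this assumption each product is smaller than the full m × n matrix, and each can be supplied to its own convolutional network.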

The method according to claim 5,
wherein the two-dimensional convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and
the step of training the two-dimensional convolutional neural network model comprises
training the first convolutional neural network model and the second convolutional neural network model using the first outer product and the second outer product, respectively, as input values.

A method of training a two-dimensional convolutional neural network model for predicting the drug response of a drug and a cell line, the method comprising:
preparing a first similarity matrix between drugs;
preparing a second similarity matrix between cell lines;
calculating an outer product between the first similarity matrix and the second similarity matrix, namely an outer product between the i-th column vector of the first similarity matrix (i = 1, 2, ..., m; m is an integer) and the j-th column vector of the second similarity matrix (j = 1, 2, ..., n; n is an integer); and
training the two-dimensional convolutional neural network model using the two-dimensional matrix obtained as the outer product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value,
wherein the i-th column vector of the first similarity matrix is the similarity between the i-th drug and the other drugs, including the i-th drug itself, and
the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines, including the j-th cell line itself.

The method according to claim 7,
wherein the second similarity matrix includes a first submatrix and a second submatrix,
the step of calculating the outer product between the first similarity matrix and the second similarity matrix includes
calculating a first outer product between the first similarity matrix and the first submatrix, and calculating a second outer product between the first similarity matrix and the second submatrix,
the two-dimensional convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and
the first convolutional neural network model is trained using the first outer product as an input value, and the second convolutional neural network model is trained using the second outer product as an input value.

A system for predicting the drug response of a drug and a cell line using a two-dimensional convolutional neural network model, the system comprising:
a control unit that controls the two-dimensional convolutional neural network model;
a communication unit for communicating with an external server;
a memory unit;
a display unit; and
an input unit that receives user input,
wherein the memory unit stores a first similarity matrix between drugs and a second similarity matrix between cell lines,
the control unit calculates an outer product between the i-th column vector of the first similarity matrix (i = 1, 2, ..., m; m is an integer) and the j-th column vector of the second similarity matrix (j = 1, 2, ..., n; n is an integer), and trains the two-dimensional convolutional neural network model using the two-dimensional matrix obtained as the outer product as an input value and the drug response value of the i-th drug and the j-th cell line as an output value,
the control unit predicts a drug response value of a new drug and a cell line using the trained two-dimensional convolutional neural network model, the i-th column vector of the first similarity matrix being the similarity between the i-th drug and the other drugs, including the i-th drug itself, and
the j-th column vector of the second similarity matrix is the gene expression similarity between the j-th cell line and the other cell lines, including the j-th cell line itself.

The system according to claim 9,
wherein the first similarity matrix is an m×m drug-drug similarity matrix based on the Tanimoto coefficient,
the second similarity matrix is an n×n cell line-cell line similarity matrix that is a radial basis function (RBF) kernel matrix, and
m and n denote the number of drugs and the number of cell lines in the training dataset, respectively.

The system according to claim 9,
wherein the second similarity matrix includes a first submatrix and a second submatrix, and
the calculated outer product includes
a first outer product between the first similarity matrix and the first submatrix, and a second outer product between the first similarity matrix and the second submatrix.

The system according to any one of claims 9 to 11,
wherein the drug response value is a value quantified by the IC₅₀ (half-maximal inhibitory concentration) of the drug for the cell line.

The system according to claim 11,
wherein the two-dimensional convolutional neural network model includes a first convolutional neural network model and a second convolutional neural network model, and
the first convolutional neural network model is trained using the first outer product as an input value, and the second convolutional neural network model is trained using the second outer product as an input value.

The system according to claim 13,
wherein predicting the drug response value of a new drug and a cell line using the trained two-dimensional convolutional neural network model comprises:
inputting the first outer product and the second outer product, calculated between the column vector of the first similarity matrix that includes the new drug and the column vector of the first submatrix and the column vector of the second submatrix, respectively, into the first convolutional neural network model and the second convolutional neural network model, respectively, and predicting a final drug response value based on the two drug response values output by the first convolutional neural network model and the second convolutional neural network model.
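
The combination of the two model outputs can be sketched as below. The claim states only that the final value is based on the two drug response values; taking their arithmetic mean is an assumption made for this example, and `model_a` and `model_b` are simple stand-in functions rather than trained convolutional networks.

```python
from statistics import mean


def ensemble_predict(models, inputs):
    """Feed one input matrix to each sub-model and average the scalar outputs."""
    if len(models) != len(inputs):
        raise ValueError("exactly one input matrix is needed per sub-model")
    return mean([model(x) for model, x in zip(models, inputs)])


# Stand-ins for trained sub-models: each maps a 2D matrix to a scalar (hypothetical).
def model_a(x):
    return sum(map(sum, x))


def model_b(x):
    return max(map(max, x))


# Hypothetical first and second products for a new drug and a cell line.
x1 = [[0.5, 1.0], [0.25, 0.5]]
x2 = [[0.5, 0.75], [0.25, 0.375]]

final_response = ensemble_predict([model_a, model_b], [x1, x2])
```

The same function accepts more than two sub-models, which matches the case in which the second similarity matrix is divided into more than two submatrices.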

The system according to claim 9,
wherein the second similarity matrix includes two or more submatrices,
the calculation of the outer product between the first similarity matrix and the second similarity matrix
comprises two or more outer products between the first similarity matrix and each of the two or more submatrices,
the two-dimensional convolutional neural network model includes two or more two-dimensional convolutional neural network submodels, and
the control unit trains the two or more two-dimensional convolutional neural network submodels using the two or more outer products as the respective input values of the submodels and the drug response values as the respective output values.

The system according to claim 9,
wherein the two-dimensional convolutional neural network model includes two two-dimensional convolutional layers, two max pooling layers, one flattening layer, one dropout layer, and two fully connected layers.