KR20180022537A - Method for predicting therapeutic efficacy of combined drug by machine learning ensemble model

Info

Publication number: KR20180022537A
Application number: KR1020170031981A
Authority: KR
Inventors: 송상옥; 김진한; 윤소정
Original assignee: 주식회사 스탠다임
Priority date: 2016-08-23
Filing date: 2017-03-14
Publication date: 2018-03-06
Also published as: KR102639616B1

Abstract

The present invention relates to predicting the therapeutic efficacy of a combination drug. More specifically, it relates to a new method that efficiently predicts the effect of a drug combination on a cell by constructing, on a computer, data from cell data, individual drug data, and data on the response of cells to individual drugs.

Description

TECHNICAL FIELD The present invention relates to a method for predicting the effect of a combination drug using a machine learning ensemble model.

The present invention relates to a method for predicting the effect of a combination drug, and more particularly to a new method that applies machine learning to cell data, individual drug data, and data on the response of cells to individual drugs in order to efficiently predict the effect of a drug combination on a specific cell.

The development of a new drug, from concept to readiness for sale, typically costs hundreds of millions of dollars and takes several years. Drug discovery begins by identifying the target to be affected by the drug, finding potential drugs that affect the target, and determining which of these potential drugs are safe and reliable. Often no suitable drug is found, and one of the drug candidates is modified in various ways to make it more suitable.

The development process starts with matching a molecule that is a potential drug to a target, for example a protein in the human body or in a microorganism. Such a match is known as a drug lead, because it can lead to the development of a drug. The molecule is then modified to be more active, more selective, and more pharmaceutically acceptable (for example, less toxic and easier to administer). The failure rate at these stages is very high.

In many cases, moreover, several drugs may exist that can treat a single disease. Knowing which targets (and housekeeping proteins and/or other human proteins) are affected by which drugs, and how they interact, can be useful for choosing among alternative therapies, avoiding side effects, preventing or controlling drug interactions, and/or selecting a treatment for a disease for which no exact drug is available, such as a rare disease.

Research on deriving such drug combinations has proceeded in various ways. However, no method has been studied that predicts the effect of a drug combination by machine learning in which the analysis of the cell line and the analysis of the drugs and their targets are each associated with in vivo pathway modules (pathway map modules).

To solve the above problems of the prior art, an object of the present invention is to provide a new method that uses machine learning to predict the effect of a drug combination by inferring, in terms of in vivo pathway modules (pathway map modules), both the analysis of the cell line and the analysis of a plurality of drugs and their respective targets.

Another object of the present invention is to provide a method for designing input features for machine learning so that the effect of a combination drug on a specific cell line can be predicted.

A further object of the present invention is to build a learning model of the correlation between drugs and cells by fitting a gradient boosting classifier.

To solve the above problems, the present invention provides a method for predicting the effect of a combination drug, comprising:

providing cell-related data;

providing drug-related data for each drug;

providing data on the correlation between the drugs and the cells;

learning the cell-related data, the drug-related data, and the data on the correlation between the drugs and the cells using a computer algorithm; and

evaluating the combination effect of the drugs to be combined.

FIG. 1 shows a flowchart of the method for predicting the effect of a combination drug according to the present invention. As shown in FIG. 1, the method is characterized in that it predicts the combination effect of drugs by sequential ensemble learning, using an ensemble of a plurality of gradient boosting classifiers.

The method according to the present invention is characterized in that, when the plurality of gradient boosting classifiers are ensembled, the results predicted by the individual models are summed with the prediction confidence of each result taken into account.

The method according to the present invention includes converting gene-level data into pathway-level data, and mapping inference may be applied in this conversion. The method is characterized by comprising a first step of providing the cell-related data, the drug-related data, and the data on the correlation between the drugs and the cells as gene-level data, and a second step of inferring pathway-level data from the gene-level data of the first step and providing it.

In the method according to the present invention, the step of providing the cell-related data comprises a first step of providing gene-level data and a second step of providing pathway-level data inferred from the gene-level data.

In the step of providing the cell-related data, providing the gene-level data is characterized by providing mutation data and gene copy number variation data.

FIG. 2 shows the mutation data provided at the gene level, and FIG. 3 shows the copy number variation data provided at the gene level. In the method according to the present invention, the gene-level mutation data may be encoded as 1 if a mutation is present and 0 otherwise. The gene-level copy number variation data may be encoded as 1 if the copy number is greater than 8 and 0 otherwise.
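
As a minimal sketch of the binary encoding described above, the following Python fragment builds one mutation row and one copy number variation row for a single cell line. The gene names, the copy numbers, and the function names are illustrative assumptions and do not come from the specification; only the rules (mutation present gives 1, copy number greater than 8 gives 1) are taken from the text.

```python
# Hypothetical gene-level inputs for one cell line (illustrative values only).
genes = ["EGFR", "KRAS", "MYC", "TP53"]
mutated_genes = {"TP53", "KRAS"}
copy_number = {"EGFR": 8, "KRAS": 3, "MYC": 12, "TP53": 2}

CNV_THRESHOLD = 8  # copy number > 8 is encoded as 1, as stated in the text


def encode_mutation(genes, mutated):
    """Return 1 for each gene that carries a mutation, 0 otherwise."""
    return [1 if gene in mutated else 0 for gene in genes]


def encode_cnv(genes, copies, threshold=CNV_THRESHOLD):
    """Return 1 for each gene whose copy number exceeds the threshold, 0 otherwise."""
    return [1 if copies.get(gene, 0) > threshold else 0 for gene in genes]


mutation_row = encode_mutation(genes, mutated_genes)  # [0, 1, 0, 1]
cnv_row = encode_cnv(genes, copy_number)              # [0, 0, 1, 0]
```

Note that a copy number of exactly 8 is encoded as 0 here, because the text specifies a strict inequality.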

In the step of providing the cell-related data, pathway-level data inferred from the gene-level data are provided. The method includes converting gene-level data into pathway-level data, and mapping inference may be applied in this conversion. For example, pathway-level cell-related data may be obtained from the number of mutated genes contained in each pathway.

FIG. 4 shows the mutation data at the pathway level, and FIG. 5 shows the copy number variation data provided at the pathway level.

In the method according to the present invention, ACSN (Atlas of Cancer Signaling Network) may be used for the pathway-level data. ACSN contains information on cancer-related signaling mechanisms and comprises five maps covering basic cell signaling pathways (apoptosis, cell cycle, DNA repair, cell survival, and EMT and cell motility), subdivided into 52 modules. Gene and module information is normalized to HUGO names. Using ACSN, the number of mutated genes in each map or module is counted, which converts the gene-level data into a map-based (or module-based) matrix.
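
The counting step described above can be sketched as follows. The module names and gene memberships below are hypothetical placeholders and are not the actual ACSN modules; the sketch only illustrates how a set of altered genes could be turned into one row of a module-based matrix by counting overlaps.

```python
# Hypothetical module membership; real ACSN modules and their gene lists differ.
modules = {
    "APOPTOSIS_MODULE_A": {"TP53", "BAX", "CASP3"},
    "CELL_CYCLE_MODULE_B": {"TP53", "CDK4", "RB1"},
    "DNA_REPAIR_MODULE_C": {"BRCA1", "ATM"},
}


def to_pathway_level(altered_genes, modules):
    """Count, for each module, how many of the altered genes it contains."""
    return {name: len(members & altered_genes) for name, members in modules.items()}


# Genes mutated in one (hypothetical) cell line, given as HUGO symbols.
cell_line_mutations = {"TP53", "RB1", "KRAS"}
pathway_row = to_pathway_level(cell_line_mutations, modules)
```

The same function could be applied to the set of genes with copy number variation, or to the target genes of a drug, to obtain the corresponding pathway-level rows.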

In the method according to the present invention, the step of providing the drug-related data is characterized by providing drug-related data for each individual drug among the plurality of drugs whose combination effect is to be evaluated.

The step of providing the drug-related data comprises a first step of providing gene-level drug-related data and a second step of providing pathway-level drug-related data.

In the step of providing the gene-level drug-related data, information on the target is provided at the gene level. In this step, the dose-response curve and a drug specificity score of each individual drug are also provided.

FIG. 6 shows the data on drug targets provided at the gene level, and FIG. 7 shows the data on drug targets provided at the pathway level.

In the step of providing the pathway-level drug-related data, mapping information and module information for the target are provided at the pathway level.

In the method according to the present invention, the step of providing data on the correlation between the drugs and the cells is characterized by providing such data for each individual drug among the plurality of drugs whose combination effect is to be evaluated.

The step of providing data on the correlation between the drugs and the cells comprises a first step of providing gene-level data and a second step of providing pathway-level data on the correlation between the drugs and the cells, inferred from the gene-level data.

The gene-level data on the correlation between the drugs and the cells include drug target data, dose data, and drug response parameters, specifically the half-maximal inhibitory concentration (IC50), the slope of the fitted dose-response curve (H), and the maximum percentage of cells killed (E_inf).

In the method according to the present invention, the step of learning the cell-related data, the drug-related data, and the data on the correlation between the drugs and the cells using the computer algorithm is characterized by using a number of classification models, each built from a different combination of input data and learning parameters, and adopting the ensemble model that shows the best cross-validation performance.

The method is characterized in that, when the gradient boosting classifiers are ensembled, the results predicted by the individual models are summed with the prediction confidence of each result taken into account.

The step of predicting and evaluating with the computer algorithm is performed by computing the probability of the classification models that predict synergy and the probability of the classification models that predict no synergy.

The method is characterized in that it predicts the combination effect of drugs by sequential ensemble learning, using an ensemble of n (n > 1) gradient boosting classifiers chosen to maximize cross-validation performance. FIG. 13 shows the probability prediction model based on the ensemble model.

In the step of building a learning model of the cell-related data, the drug-related data, and the correlation between the drugs and the cells using the computer algorithm, n (n > 1) gradient boosting classifiers are fitted, each from a different combination of feature data and learning parameters. The gradient boosting classifiers may be implemented with scikit-learn. The models that make up the ensemble differ in predictive performance depending on the data they are built from and the learning parameters used.

In the step of predicting and evaluating the combination effect of the drugs to be combined, the probability of the classification models that predict synergy (P_S) and the probability of the classification models that predict no synergy (P_N) are computed.

To prevent the synergy prediction from changing with the order in which the drugs of a combination are given, the method according to the present invention may further include a step of verifying the prediction by changing the input order of the drugs.

To address the imbalance in class distribution between combinations with and without synergy, the method may raise precision by class weighting.

By building and learning, on a computer, data that serve as input features for machine learning from cell data, individual drug data, and data on the response of cells to individual drugs, the method according to the present invention can efficiently predict the effect of a combination drug on a specific cell line.

Accordingly, the method can be applied to the new drug development process by selecting promising drug combinations.

FIG. 1 is a schematic diagram of the method for predicting the effect of a combination drug according to the present invention.
FIGS. 2 to 8 show the input data for machine learning.
FIG. 2 shows the mutation matrix at the gene level.
FIG. 3 shows the copy number variation matrix at the gene level.
FIG. 4 shows the drug target matrix at the gene level.
FIG. 5 shows the mutation matrix at the pathway level.
FIG. 6 shows the copy number variation matrix at the pathway level.
FIG. 7 shows the drug target matrix at the pathway level.
FIG. 8 shows the correlation matrix between drugs and cells.
FIG. 9 shows the input features used for ensemble learning of the gradient boosting classifiers in the method according to the present invention.
FIG. 10 shows that the gradient boosting classifiers making up the overall ensemble model exhibit different, complementary performance patterns depending on their learning parameters.
FIG. 11 shows a method for preventing the synergy prediction from changing with the order of the drugs in a combination.
FIG. 12 shows that precision is raised by class weighting in order to address the imbalance between the synergy and no-synergy classes.
FIG. 13 shows that, according to an embodiment of the present invention, the models are ensembled on the probability with which each prediction was made rather than on the predicted values themselves.
FIG. 14 shows, for each model making up the ensemble and for each prediction case, the synergy result predicted by the ensemble model and the corresponding confidence value.
FIG. 15 shows the overall performance of the ensemble model according to an embodiment of the present invention.
FIG. 16 shows the cell line types for which the ensemble model according to an embodiment of the present invention has high prediction accuracy.

Hereinafter, the present invention is described in more detail by way of examples. The present invention is not, however, limited to the following examples.

Three models were set up according to an embodiment of the method of the present invention, and the test results for each model are shown in FIG. 9.

The models that make up the ensemble differ in predictive performance depending on the data they are built from and the learning parameters used. For the three models S4, S5, and S11, different combinations were formed and trained from the dataset and, as learning parameters, the loss function type, the learning rate, the data sampling ratio, the number of trees in the gradient boosting model, and the class weight.
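
One way to enumerate such combinations is sketched below. The specification names the kinds of parameters but not their values, so every value in the grid, the feature set labels, and the function name are assumptions made for illustration. The keys loss, learning_rate, subsample, and n_estimators follow the keyword arguments of scikit-learn's GradientBoostingClassifier ("log_loss" is the name used by recent scikit-learn versions). That class has no class weight argument, so the class_weight entry is assumed to be applied separately, for example through the sample_weight argument of fit.

```python
from itertools import product

# Hypothetical search space; the values are illustrative and not taken from the patent.
feature_sets = ["gene_level", "pathway_level", "gene_and_pathway"]
grid = {
    "loss": ["log_loss", "exponential"],   # loss function type
    "learning_rate": [0.05, 0.1],          # learning rate
    "subsample": [0.8, 1.0],               # data sampling ratio
    "n_estimators": [100, 300],            # number of trees
    "class_weight": [1.0, 1.6, 2.2],       # weight assumed for the synergy class
}


def candidate_configs(feature_sets, grid):
    """Yield one configuration per combination of feature set and learning parameters."""
    keys = list(grid)
    for features in feature_sets:
        for values in product(*(grid[key] for key in keys)):
            yield {"features": features, **dict(zip(keys, values))}


configs = list(candidate_configs(feature_sets, grid))
```

Each configuration would then be trained and scored by cross-validation, and the n best-performing models retained as ensemble members.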

FIG. 10 shows the performance characteristics of each model, obtained by comparing the predicted values with the ground truth. FIG. 10 confirms that models with different performance characteristics complement one another through the ensemble effect to determine the optimal prediction.

FIG. 11 shows, according to an embodiment of the method of the present invention, a method for preventing the synergy prediction from changing with the order of the drugs in a combination.

In the method according to the present invention, the data for machine learning are divided into training data and test data used for prediction. A correct prediction requires that the test data have the same format as the training data. For example, if the position of a piece of information changes, so that different information appears where the original was expected, the result is an unintended test.

The method must learn from a plurality of drugs, but the positions of the drugs may be swapped at test time. In the present invention, therefore, the drug information is also presented with its positions swapped during training, so that each combination is learned in both orders, which increases the reliability of the results.
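
The duplicated learning with swapped drug positions can be sketched as a simple augmentation step. The feature layout (drug 1 features, then drug 2 features, then cell features) and the function name are assumptions made for illustration; the specification does not fix a layout.

```python
def augment_with_swapped_order(samples):
    """Return every sample twice: once as (drug A, drug B) and once as (drug B, drug A).

    Each input sample is (drug_a_features, drug_b_features, cell_features, label).
    Each output row is (feature_vector, label), with the label unchanged by the swap.
    """
    augmented = []
    for drug_a, drug_b, cell, label in samples:
        augmented.append((drug_a + drug_b + cell, label))
        augmented.append((drug_b + drug_a + cell, label))
    return augmented


# One hypothetical training example with two-element drug vectors and one cell feature.
samples = [([1, 0], [0, 1], [5], 1)]
training_rows = augment_with_swapped_order(samples)
```

A model trained on both orders sees the same label for either arrangement, so a swap of drug positions at test time no longer presents it with an unfamiliar input.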

In the training data, the numbers of examples labeled synergy and no synergy are generally imbalanced. This is because, in biological problems, cases in which an effect is observed are far fewer in absolute terms.

FIG. 12 shows the method used in the present invention to address the class imbalance problem. To compensate for the imbalance, the method varies the class weight to find the range that is optimal in terms of performance. As shown in FIG. 12, the weight was varied in steps of 0.2 from the default weight of 1.0 up to 2.2, the highest weight at which recall does not fall below the reference level.
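
The weight sweep described above can be sketched as follows. The range 1.0 to 2.2 and the step 0.2 come from the text; the assumption that the weight is applied to the synergy class (label 1) as a per-sample weight, and both function names, are illustrative.

```python
def weight_sweep(start=1.0, stop=2.2, step=0.2):
    """Return the class weights from start to stop inclusive, in increments of step."""
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n_steps + 1)]


def sample_weights(labels, positive_weight):
    """Assign positive_weight to synergy examples (label 1) and 1.0 to the rest."""
    return [positive_weight if label == 1 else 1.0 for label in labels]


weights = weight_sweep()
example = sample_weights([1, 0, 0, 1], positive_weight=1.4)
```

For each weight in the sweep, the model would be retrained with the corresponding per-sample weights, and precision and recall compared to pick the operating range.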

To evaluate how well the algorithm predicts the synergy of drug combinations, the following six evaluation metrics were defined, as shown in FIG. 13, and the performance was measured against them.

1. Sequential three-way ANOVA: score_global = -sgn × log10(p)

sgn: sign of the effect size

p: p-value of the F-statistic

2. BAC_20 = (Sensitivity + Specificity) / 2

3. Precision_20 = TP / (TP + FP)

4. Sensitivity_20 = TP / (TP + FN)

5. Specificity_20 = TN / (TN + FP)

6. F1_20 = 2TP / (2TP + FP + FN)

TP: true positive, TN: true negative, FP: false positive, FN: false negative

When building the confusion matrix used to compute BAC, precision, sensitivity, specificity, and F1, a cutoff of 20 on the synergy value was used to separate synergy from no synergy.
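
Metrics 2 to 6 above can be computed from the confusion matrix as in the following sketch. The cutoff of 20 is from the text; treating a score strictly greater than the cutoff as synergy, returning 0.0 when a denominator is zero, and the example scores are assumptions made for illustration.

```python
def confusion_counts(true_scores, predicted_scores, cutoff=20):
    """Count TP, TN, FP, and FN after thresholding both score lists at the cutoff."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for true_score, predicted_score in zip(true_scores, predicted_scores):
        actual = true_score > cutoff          # assumed: strictly above cutoff means synergy
        predicted = predicted_score > cutoff
        if actual and predicted:
            counts["TP"] += 1
        elif actual:
            counts["FN"] += 1
        elif predicted:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


def ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_at_cutoff(counts):
    """Compute BAC, precision, sensitivity, specificity, and F1 from the counts."""
    tp, tn, fp, fn = counts["TP"], counts["TN"], counts["FP"], counts["FN"]
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    return {
        "BAC": (sensitivity + specificity) / 2,
        "Precision": ratio(tp, tp + fp),
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical observed and predicted synergy scores for six combinations.
observed = [35, 5, 25, -10, 40, 10]
predicted = [30, 25, 10, 0, 22, 3]
counts = confusion_counts(observed, predicted)
metrics = metrics_at_cutoff(counts)
```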

When the results predicted by the individual gradient boosting models of the ensemble (synergy or no synergy) are combined, the method according to the present invention does not simply count the synergy and no-synergy votes. Instead, as shown in FIG. 14, it sums the votes weighted by each model's prediction confidence for synergy and for no synergy, which further improves the predictive performance of the ensemble.
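
The following sketch shows one possible reading of this confidence-weighted combination, in which each model contributes its own probability to the side it votes for. The 0.5 voting threshold, the way the summed values P_S and P_N are turned into a final label and confidence, and the example probabilities are assumptions; the specification does not give an exact formula.

```python
def ensemble_predict(synergy_probabilities):
    """Combine per-model synergy probabilities by confidence-weighted voting.

    Returns (label, p_s, p_n, confidence), where p_s sums the probabilities of the
    models voting for synergy and p_n sums the no-synergy probabilities of the rest.
    """
    if not synergy_probabilities:
        raise ValueError("at least one model probability is required")
    p_s = sum(p for p in synergy_probabilities if p >= 0.5)
    p_n = sum(1.0 - p for p in synergy_probabilities if p < 0.5)
    label = 1 if p_s > p_n else 0
    confidence = max(p_s, p_n) / (p_s + p_n)
    return label, p_s, p_n, confidence


# Two confident synergy votes against three weak no-synergy votes (hypothetical values).
label, p_s, p_n, confidence = ensemble_predict([0.95, 0.90, 0.45, 0.48, 0.49])
```

In this example a plain majority vote would return no synergy (three models against two), whereas the confidence-weighted sum returns synergy because the two synergy votes are much more confident.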

FIG. 15 shows the synergy value (synergy_score) and the confidence value (confidence) predicted by the developed algorithm for each drug combination (id).

In FIG. 15, id denotes the drug combination applied to a specific cell and the concentrations used.

(Example) NCI-H747;IGFR_2;MAP2K_1;3.000;10.000

NCI-H1793: cell type; IGFR_2: name of drug 1; MAP2K_1: name of drug 2; 3.000: maximum concentration of drug 1 (uM); 10.000: maximum concentration of drug 2 (uM)
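
The semicolon-separated id format described above can be parsed as in the following sketch. The field order is taken from the text; the type name, the field names, and the function name are hypothetical.

```python
from typing import NamedTuple


class CombinationId(NamedTuple):
    """Fields of one id string, in the order given in the text."""
    cell_line: str
    drug_1: str
    drug_2: str
    max_conc_1_um: float  # maximum concentration of drug 1 (uM)
    max_conc_2_um: float  # maximum concentration of drug 2 (uM)


def parse_combination_id(text):
    """Split an id of the form cell;drug1;drug2;conc1;conc2 into typed fields."""
    cell_line, drug_1, drug_2, conc_1, conc_2 = text.split(";")
    return CombinationId(cell_line, drug_1, drug_2, float(conc_1), float(conc_2))


parsed = parse_combination_id("NCI-H747;IGFR_2;MAP2K_1;3.000;10.000")
```

An id with a different number of fields raises a ValueError at the unpacking step, which makes malformed rows easy to detect.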

In FIG. 15, synergy_score is 1 or 0, indicating whether the combination is synergistic according to a specific cutoff on the synergy value.

Whether gradient boosting predicts synergy depends on the confidence value: a cutoff is applied to the confidence to decide between synergy and no synergy. In FIG. 15, confidence is the output of the gradient boosting model, a probability used to decide whether synergy is present.

FIG. 16 shows the result of analyzing whether the algorithm developed by the method for predicting the effect of a combination drug according to the present invention is particularly strong at predicting combination effects for specific cell lines and, if so, for which cell lines it predicts well.

For each of the 85 cell lines, a confusion matrix between the provided ground-truth answers and the predictions was first generated using a specific cutoff. From this confusion matrix, the accuracy and the accuracy p-value of the predictions were computed for each cell line; 11 of the 85 cell lines had an accuracy p-value below 0.1.
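The per-cell-line evaluation can be sketched as follows. The text does not say how the accuracy p-value was obtained; this sketch assumes a one-sided exact binomial test of the observed accuracy against the no-information rate (the accuracy of always predicting the majority class), which is one common definition. The function names and the example labels are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels 1 = synergy, 0 = no synergy."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def accuracy_with_p_value(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("no samples for this cell line")
    correct = tp + tn
    positives = tp + fn
    # Assumed null hypothesis: accuracy is no better than the majority-class rate.
    no_information_rate = max(positives, n - positives) / n
    return correct / n, binomial_upper_tail(correct, n, no_information_rate)


# Hypothetical labels for one cell line.
truth = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
pred = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
acc, p_value = accuracy_with_p_value(truth, pred)
print(acc, p_value, p_value < 0.1)
```

Running this once per cell line and keeping those with a p-value below 0.1 mirrors the selection described above.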

The present prediction method was therefore judged to be significantly strong for these 11 cell lines, and the primary site and histology of each cell line were checked. FIG. 16 is a bar graph of the 11 cell lines with accuracy p-value < 0.1, sorted by accuracy and colored by organ.

As shown in FIG. 16, 8 of the 11 are lung carcinoma cell lines, 2 are breast carcinoma cell lines, and 1 is a large intestine carcinoma cell line.

Claims

1. A method for predicting the effect of a combination drug, comprising:
providing cell-related data;
providing data related to a plurality of drugs to be combined;
providing data on the correlation between the drugs and the cell;
learning from the cell-related data, the drug-related data, and the data on the correlation between the drugs and the cell using a computer algorithm; and
evaluating the combination effect of the drugs to be combined.

2. The method of claim 1, wherein the step of providing the cell-related data comprises:
a first step of providing gene-level data; and
a second step of providing pathway-level data inferred from the gene-level data.

3. The method of claim 2, wherein the gene-level data comprises mutation-related data or copy number variation data for a gene.

4. The method of claim 1, wherein the step of providing the plurality of drug-related data provides drug-related data for each of the plurality of drugs whose combination effect is to be evaluated.

5. The method of claim 1, wherein the step of providing the plurality of drug-related data extracts pathway-level drug-related data from gene-level drug-related data.

6. The method of claim 5, wherein the gene-level drug-related data provided in the step of providing the drug-related data provides gene-level information about the drug target.

7. The method of claim 5, wherein the pathway-level drug-related data inferred from the gene-level drug-related data provides pathway-level mapping information and module information for the target.

8. The method of claim 1, wherein the step of providing data on the correlation between the drug and the cell provides data on the relationship between the drug and the cell for each of the plurality of drugs whose combination effect is to be evaluated.

9. The method of claim 1, wherein the step of providing data on the correlation between the drug and the cell maps pathway-level feature data from gene-level data on the correlation between each individual drug and the cell.

10. The method of claim 8, wherein the gene-level data on the correlation between individual drugs and cells comprises drug-target-related data, amount-related data, and a drug-response-related parameter.

11. The method of claim 1, wherein the step of constructing a learning model for the cell-related data, the drug-related data, and the correlation between the drug and the cell using the computer algorithm infers n (n > 1) gradient boosting classifiers.

12. The method of claim 11, wherein, in the step of constructing a learning model for the cell-related data, the drug-related data, and the correlation between the drug and the cell using the computer algorithm, the combination effect of the drugs is predicted with an ensemble of the n (n > 1) classifiers so as to maximize cross-validation performance.

13. The method of claim 1, wherein the step of predicting and evaluating the combination effect of the drugs to be combined is performed by calculating the probability P_S with which the classification models predict that the combination drug has synergy and the probability P_N with which the classification models predict that the combination drug has no synergy.