KR102639616B1

KR102639616B1 - Method for predicting therapeutic efficacy of combined drug by machine learning ensemble model

Info

Publication number: KR102639616B1
Application number: KR1020170031981A
Authority: KR
Inventors: 송상옥; 김진한; 윤소정
Original assignee: 주식회사 스탠다임
Priority date: 2016-08-23
Filing date: 2017-03-14
Publication date: 2024-02-23
Also published as: KR20180022537A

Abstract

The present invention relates to a method for predicting the effect of a combination drug. More specifically, it relates to a new method that uses cell data, individual drug data, and data on the response of cells to individual drugs to build and learn from a dataset on a computer, and thereby efficiently predicts the effect of a combination drug on cells.

Description

Method for predicting the effect of combination drugs using a machine learning ensemble model {METHOD FOR PREDICTING THERAPEUTIC EFFICACY OF COMBINED DRUG BY MACHINE LEARNING ENSEMBLE MODEL}

The present invention relates to a method for predicting the effect of a combination drug. More specifically, it relates to a new method that efficiently predicts the effect of a drug combination on a specific cell through machine learning based on cell data, individual drug data, and data on the response of cells to individual drugs.

Developing a new drug, from concept to readiness for sale, typically costs hundreds of millions of dollars and takes several years. Drug discovery begins by identifying the target to be affected by a drug, finding potential drugs that act on that target, and determining which of these candidates are safe and reliable. Often no suitable drug is found, and one of the drug candidates is modified in various ways to make it more suitable.

The development process starts by matching a molecule that is a potential drug to a target, for example a protein in the human body or in a microorganism. Such a match is known as a drug lead, which can lead to the development of a drug. The molecule is then modified to be more active, more selective, and more pharmaceutically acceptable (e.g., less toxic and easier to administer). The failure rate at these stages is very high.

In addition, in many cases several drugs may exist that can treat a single disease. Knowing which targets (and housekeeping proteins and/or other human proteins) are affected by which drugs, and how they interact, can be useful for choosing among alternative therapies, avoiding side effects, preventing or controlling drug interactions, and/or selecting a treatment when no precisely suited drug exists for a disease, for example a rare disease.

Research into deriving such drug combinations has been conducted in various ways. However, no method has been studied that predicts the effect of a drug combination by machine learning by relating the analysis of cell lines and the analysis of drugs and their targets to in vivo pathway map modules.

To solve the above problems of the prior art, an object of the present invention is to provide a new method that uses computer-based learning to predict the effect of a drug combination by inferring in vivo pathway map modules from the analysis of a cell line and from the analysis of a plurality of drugs and their respective targets.

Another object of the present invention is to provide a method for designing input features for machine learning so that the effect of a combination drug on a specific cell line can be predicted.

A further object of the present invention is to build a learning model of the correlation between drugs and cells by inferring gradient boosting classifiers.

To solve the above problems, the present invention provides a method for predicting the effect of a combination drug, comprising:

providing cell-related data;

providing drug-related data for each drug;

providing data on the correlation between the drugs and cells;

learning the cell-related data, the drug-related data, and the data on the correlation between the drugs and cells using a computer algorithm; and

evaluating the combination effect of the drugs to be combined.

Figure 1 shows a flowchart of the method for predicting the effect of a combination drug according to the present invention. As shown in Figure 1, the method is characterized in that, as sequential ensemble learning, it predicts the combination effect of drugs using an ensemble of a plurality of gradient boosting classifiers.

The method is characterized in that, when the plurality of gradient boosting classifiers are ensembled, the results predicted by the individual models are summed with each model's prediction confidence taken into account.

The method includes converting gene-level data into pathway-level data, and mapping inference may be applied in this conversion. The method is characterized by comprising a first step of providing the cell-related data, the drug-related data, and the data on the correlation between the drugs and cells at the gene level, and a second step of inferring and providing pathway-level data from the gene-level data of the first step.

In the method, the step of providing cell-related data comprises a first step of providing gene-level data, and a second step of providing pathway-level data inferred from the gene-level data.

In the step of providing cell-related data, the gene-level data provided comprise mutation data and gene copy number variation data.

Figure 2 shows the mutation data provided at the gene level, and Figure 3 shows the copy number variation data provided at the gene level. The gene-level mutation data may be encoded as 1 if a mutation is present and 0 otherwise. The gene-level copy number variation data may be encoded as 1 if the copy number is greater than 8 and 0 otherwise.
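The binary encoding described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the gene names, the copy number values, and the helper names `encode_mutation` and `encode_cnv` are assumptions made for the example. Only the rules (mutation present gives 1, copy number greater than 8 gives 1) come from the text.

```python
# Hypothetical calls for one cell line; gene names and values are illustrative only.
genes = ["EGFR", "KRAS", "MYC", "TP53"]
mutated_genes = {"TP53", "KRAS"}
copy_numbers = {"EGFR": 8, "KRAS": 3, "MYC": 12, "TP53": 2}

CNV_THRESHOLD = 8  # from the text: copy number > 8 is encoded as 1


def encode_mutation(genes, mutated):
    """Return 1 for each gene that carries a mutation, else 0."""
    return [1 if gene in mutated else 0 for gene in genes]


def encode_cnv(genes, copy_numbers, threshold=CNV_THRESHOLD):
    """Return 1 for each gene whose copy number is strictly above the threshold."""
    return [1 if copy_numbers.get(gene, 0) > threshold else 0 for gene in genes]


mutation_row = encode_mutation(genes, mutated_genes)
cnv_row = encode_cnv(genes, copy_numbers)
print(mutation_row)
print(cnv_row)
```

Stacking one such row per cell line gives a cell-by-gene matrix for each data type. Note that a copy number of exactly 8 is encoded as 0, because the text specifies a strict inequality.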

The step of providing cell-related data is characterized by providing pathway-level data inferred from the gene-level data. The method includes converting gene-level data into pathway-level data, and mapping inference may be applied in this conversion. For example, the pathway-level cell-related data may be derived from the number of mutated genes contained in each pathway.

Figure 4 shows the mutation data at the pathway level, and Figure 5 shows the copy number variation data provided at the pathway level.

In the method, ACSN (Atlas of Cancer Signaling Network) can be used as the source of pathway-level data. ACSN contains information on cancer-related signaling mechanisms and comprises 5 maps covering basic cell signaling pathways (apoptosis, cell cycle, DNA repair, cell survival, EMT and cell motility) and 52 finer-grained modules. Gene and module information is normalized using HUGO names. By counting the mutated genes in each ACSN map or module, the gene-level data are converted into a map-based (or module-based) matrix.
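The gene-to-pathway conversion amounts to counting, for each module, how many of its member genes are flagged at the gene level. The sketch below assumes a simple dictionary of module membership; the module names and gene lists are placeholders and do not reproduce actual ACSN content, and `to_pathway_level` is a hypothetical helper name.

```python
# Placeholder module membership; real ACSN modules and gene lists are not reproduced here.
module_genes = {
    "MODULE_A": {"TP53", "MDM2", "ATM"},
    "MODULE_B": {"KRAS", "BRAF", "TP53"},
}

# Gene-level binary mutation vector for one cell line (1 = mutated).
gene_level = {"TP53": 1, "KRAS": 1, "BRAF": 0, "ATM": 0, "MDM2": 0}


def to_pathway_level(gene_level, module_genes):
    """Count the flagged genes that belong to each module."""
    return {
        module: sum(gene_level.get(gene, 0) for gene in members)
        for module, members in module_genes.items()
    }


pathway_level = to_pathway_level(gene_level, module_genes)
print(pathway_level)
```

A gene that belongs to several modules (TP53 in this toy example) contributes to the count of each of them, so the pathway-level features are not mutually exclusive.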

In the method, the step of providing drug-related data is characterized by providing drug-related data for each individual drug among the plurality of drugs whose combination effect is to be evaluated.

In the method, the step of providing drug-related data comprises a first step of providing drug-related data at the gene level and a second step of providing drug-related data at the pathway level.

In the method, the step of providing gene-level drug-related data is characterized by providing information on the drug targets at the gene level. It is further characterized by providing a dose-response curve and a drug specificity score for each individual drug.

Figure 6 shows the drug target data provided at the gene level, and Figure 7 shows the drug target data provided at the pathway level.

In the method, the step of providing pathway-level drug-related data is characterized by providing map information and module information for the targets at the pathway level.

In the method, the step of providing data on the correlation between the drugs and cells is characterized by providing such data for each individual drug among the plurality of drugs whose combination effect is to be evaluated.

In the method, the step of providing data on the correlation between the drugs and cells comprises a first step of providing the data at the gene level, and a second step of providing pathway-level data on the correlation between the drugs and cells inferred from the gene-level data.

In the method, the gene-level data on the correlation between the drugs and cells include drug-target data, dose data, and drug-response parameters, specifically the half-maximal inhibitory concentration (IC50), the slope of the fitted dose-response curve (H), and the maximum percentage of cells killed (E_inf).

In the method, the step of learning the cell-related data, the drug-related data, and the data on the correlation between the drugs and cells using the computer algorithm is characterized in that a plurality of classification models, each built from a different combination of input data and learning parameters, are used, and the ensemble model that shows the best cross-validation performance is employed.

The method is characterized in that, when the gradient boosting classifiers are ensembled, the results predicted by the individual models are summed with each model's prediction confidence taken into account.

In the method, the step of predicting and evaluating with the cell-related data, the drug-related data, and the data on the correlation between the drugs and cells using the computer algorithm is performed by calculating the probability from the classification models that predict synergy and the probability from the classification models that predict no synergy.

The method is characterized in that, as sequential ensemble learning, it predicts the combination effect of drugs using an ensemble of n (n>1) gradient boosting classifiers so as to maximize cross-validation performance. Figure 13 shows the probability prediction model based on the ensemble model.

In the method, the step of building a learning model of the cell-related data, the drug-related data, and the correlation between the drugs and cells using the computer algorithm is characterized by inferring n (n>1) gradient boosting classifiers, each composed of a different combination of feature data and learning parameters. The gradient boosting classifiers can be implemented with scikit-learn. Each model in the ensemble shows different prediction performance depending on the data it was built from and the learning parameters used.

In the method, the step of predicting and evaluating the combination effect of the drugs to be combined is performed by calculating the probability (P_S) from the classification models that predict synergy of the combination drug and the probability (P_N) from the classification models that predict no synergy.

The method may further include verifying the prediction result by changing the input order of the combined drugs, so that the synergy prediction does not depend on the order in which the drugs are given.

To address the imbalance in distribution between the class with synergy and the class without synergy, the method can increase the precision value by using a class weighting technique.

The method uses cell data, individual drug data, data on the response of cells to individual drugs, and a computer to build and learn from data designed as input features for machine learning, and can thereby efficiently predict the effect of a combination drug on a specific cell line.

Accordingly, the method can be applied to the new drug development process by selecting promising drug combinations.

Figure 1 is a schematic diagram showing the method for predicting the effect of a combination drug according to the present invention.
Figures 2 to 8 show input data for machine learning.
Figure 2 shows the mutation matrix at the gene level.
Figure 3 shows the copy number variation matrix at the gene level.
Figure 4 shows the drug target matrix at the gene level.
Figure 5 shows the mutation matrix at the pathway level.
Figure 6 shows the copy number variation matrix at the pathway level.
Figure 7 shows the drug target matrix at the pathway level.
Figure 8 shows the correlation matrix between drugs and cells.
Figure 9 shows the input features used for ensemble learning of the gradient boosting classifiers in the method according to the present invention.
Figure 10 shows that the gradient boosting classifiers making up the overall ensemble model exhibit different, complementary performance patterns depending on the learning parameters.
Figure 11 shows a method for preventing the synergy prediction from changing with the order in which the drugs are combined.
Figure 12 shows that the precision value is increased with a class weighting technique to address the imbalance between the class with synergy and the class without.
Figure 13 shows that, in an ensemble model according to an embodiment of the present invention, the models are ensembled using the probability with which each prediction was made rather than the predicted value itself.
Figure 14 shows, for each prediction case, the synergy result predicted by the ensemble model and the corresponding confidence value.
Figure 15 shows the overall performance of the ensemble model according to an embodiment of the present invention.
Figure 16 shows the cell line types for which the ensemble model according to an embodiment of the present invention has high prediction accuracy.

Hereinafter, the present invention is described in more detail by way of examples. However, the present invention is not limited to the following examples.

According to one embodiment of the method, three models were set up, and the test results for each model are shown in Figure 9.

Each model in the ensemble shows different prediction performance depending on the data it was built from and the learning parameters used. The three models S4, S5, and S11 were each trained with a different combination of dataset and learning parameters: type of loss function, learning rate, data sampling rate, number of trees in the gradient boosting model, and class weight.
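As a rough sketch of what training several differently configured models could look like with scikit-learn's `GradientBoostingClassifier`: the model names S4, S5, and S11 come from the text, but every parameter value, the feature-column subsets, and the synthetic data below are assumptions made for illustration, since the actual settings are not disclosed here. The class weight is expressed through per-sample weights because this estimator has no `class_weight` argument.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 200 samples, 6 features, imbalanced binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hypothetical settings; only the names S4, S5, S11 come from the text.
# "columns" stands in for "a different dataset", "pos_weight" for the class weight.
configs = {
    "S4": {"columns": [0, 1, 2, 3, 4, 5], "pos_weight": 1.0,
           "params": {"learning_rate": 0.10, "subsample": 1.0, "n_estimators": 50}},
    "S5": {"columns": [0, 1, 2], "pos_weight": 1.6,
           "params": {"loss": "exponential", "learning_rate": 0.05,
                      "subsample": 0.8, "n_estimators": 100}},
    "S11": {"columns": [0, 1, 3, 4, 5], "pos_weight": 2.2,
            "params": {"learning_rate": 0.20, "subsample": 0.7, "n_estimators": 30}},
}

models = {}
for name, cfg in configs.items():
    # Weight positive ("synergy") samples more heavily than negative ones.
    sample_weight = np.where(y == 1, cfg["pos_weight"], 1.0)
    clf = GradientBoostingClassifier(random_state=0, **cfg["params"])
    clf.fit(X[:, cfg["columns"]], y, sample_weight=sample_weight)
    models[name] = clf

# Probability of the positive class from each model for the first five samples.
probs = {name: models[name].predict_proba(X[:5][:, cfg["columns"]])[:, 1]
         for name, cfg in configs.items()}
```

Each model must be given the same feature columns at prediction time that it saw during training, which is why the column list is stored alongside the parameters.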

Figure 10 shows the performance characteristics of each model, obtained by comparing the predicted values with the ground-truth values. Figure 10 confirms that models with different performance characteristics complement one another through the ensemble effect to determine the optimal predicted value.

Figure 11 shows, according to one embodiment of the method, a way of preventing the synergy prediction from changing with the order in which the drugs are combined.

In the method, the data for machine learning are divided into training data and test data used for prediction. At prediction time, the data format of the test data must match that of the training data for the prediction to be correct. For example, if the positions of pieces of information are swapped, values end up in the wrong positions and the test performed is not the one intended.

The method must learn from a plurality of drugs, but the positions of the drugs may be swapped at test time. Therefore, in the present invention the reliability of the result can be increased by also training on copies of the data in which the positions of the drug information are swapped.
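One way to read the order-swapping idea is as a data augmentation step: every training example is emitted twice, once with the features of the two drugs in each order. The sketch below assumes that an input row is the concatenation of cell features, drug 1 features, and drug 2 features; that layout and the helper name `build_rows` are illustrative assumptions, not details given in the text.

```python
def build_rows(cell_feats, drug_a_feats, drug_b_feats, label):
    """Return two training rows for one combination, one per drug order."""
    forward = cell_feats + drug_a_feats + drug_b_feats
    swapped = cell_feats + drug_b_feats + drug_a_feats
    return [(forward, label), (swapped, label)]


# Toy feature vectors for one cell line and two drugs.
rows = build_rows([1, 0], [0.3, 2.0], [1.5, 0.7], label=1)
for features, label in rows:
    print(features, label)
```

Both rows carry the same label, so a model trained on them is pushed toward giving the same answer whichever drug is listed first.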

In the method, the numbers of training examples labeled "synergy" and "no synergy" are generally imbalanced. This is because, in biological problems, cases in which an effect is observed are far rarer.

Figure 12 shows the method used in the present invention to address the class imbalance problem. To compensate for the class imbalance, the method searches for the optimal performance range while varying the class weight. As shown in Figure 12, weights were applied in steps of 0.2 from the default weight of 1.0 up to 2.2, the highest weight at which recall did not fall below the reference level.
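The weight sweep can be sketched as a small grid search that keeps the highest class weight whose recall stays at or above a floor. The grid (1.0 to 2.2 in steps of 0.2) comes from the text; the recall floor values, the `toy_recall` function, and the helper name `highest_acceptable_weight` are assumptions for illustration, since the reference level used in the patent is not stated.

```python
def weight_grid(start=1.0, stop=2.2, step=0.2):
    """Class weights from start to stop inclusive, rounded to one decimal."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + step * i, 1) for i in range(count)]


def highest_acceptable_weight(weights, recall_of, recall_floor):
    """Return the largest weight whose recall is at least recall_floor, or None."""
    passing = [w for w in weights if recall_of(w) >= recall_floor]
    return max(passing) if passing else None


def toy_recall(weight):
    """Toy stand-in for training and evaluating a model at a given class weight."""
    return 1.0 - 0.15 * (weight - 1.0)


weights = weight_grid()
best_loose = highest_acceptable_weight(weights, toy_recall, recall_floor=0.8)
best_strict = highest_acceptable_weight(weights, toy_recall, recall_floor=0.9)
print(weights, best_loose, best_strict)
```

In practice `recall_of` would retrain and evaluate the classifier at each weight, for example under cross-validation, rather than use a closed-form toy function.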

To evaluate how well the algorithm predicts drug combination synergy values, the following six evaluation metrics were defined, as shown in Figure 13, and the performance was measured against them.

1. Sequential three-way ANOVA: score_global = -sgn × log10(p)

sgn: sign of the effect size

p: p-value of the F-statistic

2. BAC_20 = (Sensitivity + Specificity) / 2

3. Precision_20 = TP / (TP + FP)

4. Sensitivity_20 = TP / (TP + FN)

5. Specificity_20 = TN / (TN + FP)

6. F1_20 = 2TP / (2TP + FP + FN)

TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative

To compute the BAC, Precision, Sensitivity, Specificity, and F1 values, the cut-off for the presence or absence of synergy was set to 20 when building the confusion matrix.
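Metrics 2 to 6 above follow directly from a confusion matrix built at the cut-off of 20. The sketch below assumes that a score greater than or equal to 20 counts as synergy (the text does not say whether the boundary is inclusive) and uses made-up observed and predicted scores; it does not cover the ANOVA-based global score.

```python
def confusion(observed, predicted, cutoff=20.0):
    """Count TP, TN, FP, FN after thresholding both score lists at the cutoff."""
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        obs_pos, pred_pos = obs >= cutoff, pred >= cutoff
        if obs_pos and pred_pos:
            tp += 1
        elif not obs_pos and not pred_pos:
            tn += 1
        elif pred_pos:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Metrics 2 to 6 from the list; assumes no denominator is zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "BAC": (sensitivity + specificity) / 2,
        "Precision": tp / (tp + fp),
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up synergy scores for six drug combinations.
observed = [35, 5, 22, -3, 40, 10]
predicted = [30, 25, 8, 0, 50, 2]
counts = confusion(observed, predicted)
scores = metrics(*counts)
print(counts, scores)
```

A production version would need to guard against empty classes, where the sensitivity, specificity, or precision denominators become zero.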

In the method, when the results predicted by the individual gradient boosting models in the ensemble (synergy present or absent) are combined, the "present" and "absent" predictions are not simply counted. Instead, as shown in Figure 14, each model's prediction confidence for "present" and "absent" is applied and summed, which further improves the prediction performance of the ensemble.
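The exact combination rule is not spelled out, so the following is only one plausible reading of a confidence-weighted vote. It assumes each model outputs a probability of synergy, that a model "predicts synergy" when this probability is at least 0.5, and that P_S and P_N (the names used earlier in this description) are the summed confidences of the models on each side. The function name `ensemble_vote` is hypothetical.

```python
def ensemble_vote(synergy_probs, threshold=0.5):
    """Sum per-model confidences for 'synergy' and 'no synergy' and compare them."""
    # Confidence of models that predict synergy is their synergy probability.
    p_s = sum(p for p in synergy_probs if p >= threshold)
    # Confidence of models that predict no synergy is the complementary probability.
    p_n = sum(1.0 - p for p in synergy_probs if p < threshold)
    label = 1 if p_s > p_n else 0
    return label, p_s, p_n


# Five hypothetical models: three weakly predict synergy, two strongly predict none.
label, p_s, p_n = ensemble_vote([0.55, 0.55, 0.55, 0.05, 0.05])
print(label, p_s, p_n)
```

In this toy case a simple majority vote would report synergy (three models against two), while the confidence-weighted sum reports no synergy because the two dissenting models are far more certain.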

Figure 15 shows, for each drug combination (id), the synergy value (synergy_score) and the confidence value (confidence) predicted by the algorithm developed in the method for predicting the effect of combination drugs according to the present invention.

In Figure 15, the id field records the cell line, the drug combination applied to it, and the drug concentrations.

(Example) NCI-H747;IGFR_2;MAP2K_1;3.000;10.000

NCI-H747: cell type, IGFR_2: name of drug 1, MAP2K_1: name of drug 2, 3.000: maximum concentration of drug 1 (µM), 10.000: maximum concentration of drug 2 (µM)
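The five semicolon-separated fields of an id can be split apart with a short parser. The sketch below is illustrative only. The class and field names are hypothetical, and it assumes that every id has exactly five fields and that the two concentrations are plain decimal numbers in µM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CombinationId:
    cell_line: str        # e.g. "NCI-H747"
    drug1: str            # name of drug 1
    drug2: str            # name of drug 2
    max_conc1_um: float   # maximum concentration of drug 1 (uM)
    max_conc2_um: float   # maximum concentration of drug 2 (uM)


def parse_combination_id(raw):
    """Split 'cell;drug1;drug2;conc1;conc2' into typed fields."""
    parts = raw.strip().split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {raw!r}")
    cell_line, drug1, drug2, conc1, conc2 = parts
    return CombinationId(cell_line, drug1, drug2, float(conc1), float(conc2))


print(parse_combination_id("NCI-H747;IGFR_2;MAP2K_1;3.000;10.000"))
```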

In Figure 15, synergy_score is a binary label: 1 if the combination is judged to have synergy and 0 if not, according to a specific cut-off.

In the method for predicting the effect of combination drugs according to the present invention, whether gradient boosting reports synergy as present or absent depends on the confidence value: a cut-off is applied to the confidence to decide between present and absent. In Figure 15, confidence is the output of the gradient boosting model, a probability value used to decide whether synergy is present.
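The relationship between the confidence and synergy_score columns can be shown in a few lines of Python. The patent does not state the cut-off applied to the confidence, so the value 0.5 below is an assumption, as are the function name, the second id, and the confidence values.

```python
def label_synergy(confidence_by_id, cutoff=0.5):
    """Map each combination id to synergy_score 1 (present) or 0 (absent).

    Assumption: a confidence at or above the cutoff means synergy is present.
    """
    return {cid: int(conf >= cutoff) for cid, conf in confidence_by_id.items()}


# Hypothetical confidence values; only the first id appears in the text.
confidences = {
    "NCI-H747;IGFR_2;MAP2K_1;3.000;10.000": 0.81,
    "CELL_X;DRUG_A;DRUG_B;1.000;1.000": 0.27,  # placeholder id
}
for cid, score in label_synergy(confidences).items():
    print(cid, score, confidences[cid])
```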

Figure 16 shows the result of analyzing whether the algorithm developed by the method for predicting the effect of combination drugs according to the present invention is particularly strong at predicting combination effects for specific cell lines and, if so, for which cell lines it predicts well.

For each of the 85 cell lines, a confusion matrix between the provided ground-truth labels and the predictions was first generated using a specific cut-off. From this confusion matrix, the accuracy and the accuracy p-value of the predictions were computed for each cell line. As a result, 11 of the 85 cell lines had an accuracy p-value below 0.1.
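A per-cell-line accuracy and accuracy p-value can be computed as sketched below. The patent does not define how the accuracy p-value was obtained. This sketch assumes a one-sided exact binomial test of the observed accuracy against the no-information rate (the frequency of the majority class), which is how some statistics packages report an accuracy p-value. The function names and records are hypothetical.

```python
from collections import defaultdict
from math import comb


def binomial_upper_tail(successes, trials, p_null):
    """P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1.0 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def per_cell_line_accuracy(records):
    """records: iterable of (cell_line, truth, predicted) with 0/1 labels.

    Returns {cell_line: (accuracy, accuracy_p_value)}.
    Assumption: the p-value tests accuracy > no-information rate.
    """
    grouped = defaultdict(list)
    for cell_line, truth, predicted in records:
        grouped[cell_line].append((truth, predicted))

    results = {}
    for cell_line, pairs in grouped.items():
        n = len(pairs)
        correct = sum(1 for truth, predicted in pairs if truth == predicted)
        positives = sum(truth for truth, _ in pairs)
        no_information_rate = max(positives, n - positives) / n
        p_value = binomial_upper_tail(correct, n, no_information_rate)
        results[cell_line] = (correct / n, p_value)
    return results


# Hypothetical labels for two placeholder cell lines.
records = [("A", i % 2, i % 2) for i in range(10)]          # all correct
records += [("B", 1, 1), ("B", 1, 0), ("B", 0, 1), ("B", 0, 0)]
stats = per_cell_line_accuracy(records)
selected = [name for name, (_, p) in stats.items() if p < 0.1]
print(stats, selected)
```

Filtering with `p < 0.1` mirrors the selection of the 11 cell lines described above; here only the placeholder cell line "A" passes.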

The prediction method was therefore judged to be significantly strong for these 11 cell lines, and the primary site and histology of each were checked. Figure 16 shows a bar graph of the 11 cell lines with accuracy p-value < 0.1, sorted by accuracy and colored by organ.

As Figure 16 shows, 8 of the 11 are lung carcinoma cell lines, 2 are breast carcinoma cell lines, and 1 is a large intestine carcinoma cell line.

Claims

A method for predicting the effect of a combination drug in a computing device, the method comprising:
identifying first data related to a cell;
identifying second data on a plurality of drugs to be combined;
identifying third data on the correlation between the plurality of drugs and the cell;
building a learning model based on the first data, the second data, and the third data; and
outputting, using the learning model, information about the effect of the plurality of drugs,
wherein the first data includes cell-related data at the pathway level extracted based on mapping inference from cell-related data at the gene level,
the second data includes drug-related data at the pathway level extracted based on mapping inference from drug-related data at the gene level,
the third data includes data on the correlation between drugs and cells at the pathway level extracted based on mapping inference from data on the correlation between drugs and cells at the gene level,
the learning model is built based on a plurality of gradient boosting classification models, each of which uses a different combination of feature data and a different combination of learning parameters, and
each of the plurality of gradient boosting classification models further uses a different class weight corresponding to that model.

delete

The method of claim 1,
wherein the cell-related data at the gene level includes mutation-related data or copy number variation data for genes.

delete

The method of claim 1,
wherein the second data includes mapping information about the target at the pathway level and module information about the target at the pathway level.

The method of claim 1,
wherein the drug-related data at the gene level includes information about the target at the gene level.

The method of claim 1,
wherein the class weight is determined based on information about the synergy effect corresponding to the plurality of gradient boosting classification models.

The method of claim 1,
wherein the cell-related data at the pathway level is determined based on a plurality of mutant genes included in each pathway.

delete

The method of claim 1,
wherein the data on the correlation between drugs and cells at the gene level includes drug target-related data, dose-related data, and drug response-related parameters.

The method of claim 1,
wherein the learning parameters include a loss function and a learning rate.

The method of claim 1,
wherein the learning model is built based on cross-validation performance using an ensemble of the plurality of gradient boosting classification models.

The method of claim 1,
wherein the information about the effect is determined by calculating the probability of the classification models that predict synergy of the combination drug and the probability of the classification models that predict no synergy.