KR102182091B1

KR102182091B1 - Prediction method for resistance to immunotherapeutic agent and analysis apparatus

Info

Publication number: KR102182091B1
Application number: KR1020190123919A
Authority: KR
Inventors: 최정균; 김권일
Original assignee: 한국과학기술원
Priority date: 2019-10-07
Filing date: 2019-10-07
Publication date: 2020-11-23
Also published as: WO2021071181A1

Abstract

The present invention provides a technique for rapidly predicting a specific patient's resistance to an anticancer immunotherapeutic agent. According to the present invention, a method for predicting resistance to an anticancer immunotherapeutic agent comprises the steps of: receiving, by an analysis device, genome data of a sample; inputting, by the analysis device, the genome data to a previously trained classifier; and predicting, by the analysis device, resistance of the sample to the anticancer immunotherapeutic agent based on output information of the classifier. The classifier predicts resistance to the anticancer immunotherapeutic agent based on characteristics of sequences associated with functional mutations caused by the tumor.

Description

Method and analysis device for predicting resistance to cancer immunotherapy agents {PREDICTION METHOD FOR RESISTANCE TO IMMUNOTHERAPEUTIC AGENT AND ANALYSIS APPARATUS}

The technology described below relates to a technique for predicting resistance to cancer immunotherapy agents.

Unlike conventional anticancer drugs that attack the cancer itself, a cancer immunotherapy agent is a therapeutic drug that stimulates the immune system by injecting artificial immune proteins into the body, inducing immune cells to selectively attack only cancer cells. Cancer immunotherapy agents include immune checkpoint inhibitors (CTLA4 inhibitors, PD-1 inhibitors, PD-L1 inhibitors), immune cell therapies, and immunoviral therapies.

Cancer cells evade immunity by exploiting the immune checkpoints of immune cells. Immune checkpoint inhibitors block these checkpoints so that the body's activated immune cells kill cancer cells. However, not all patients respond to immune checkpoint inhibitors. It is therefore important to discover biomarkers that can predict responsiveness to cancer immunotherapy.

Korean Laid-Open Patent Publication No. 10-2019-0094710

Tumor mutation burden (TMB) is a representative biomarker for predicting responsiveness to cancer immunotherapy. When TMB is high, epitopes of neoantigens are known to be readily recognized by T cells, so responsiveness to cancer immunotherapy is good. However, a large proportion of cancer cell mutations are not immunogenic, and an increase in functional mutations may even induce resistance to treatment.

The technology described below aims to provide a technique for predicting resistance to a cancer immunotherapy agent (immune checkpoint inhibitor). It also aims to provide a technique for discovering markers that predict resistance to cancer immunotherapy agents.

A method of predicting resistance to a cancer immunotherapy agent includes: receiving, by an analysis device, genome data of a sample; inputting, by the analysis device, the genome data to a pre-trained classifier; and predicting, by the analysis device, the resistance of the sample to the cancer immunotherapy agent based on output information of the classifier. The classifier predicts resistance to the cancer immunotherapy agent based on features of sequences associated with functional mutations caused by the tumor.

An analysis device for predicting resistance to a cancer immunotherapy agent includes: an input device that receives genome data of a sample; a storage device that stores a classifier that predicts resistance to the cancer immunotherapy agent based on features of sequences associated with functional mutations caused by a tumor; and a computing device that inputs the genome data to the classifier to predict the resistance of the sample to the cancer immunotherapy agent.

The technology described below can quickly predict a specific patient's resistance to a cancer immunotherapy agent by using a learning model. It can therefore contribute to patient-specific care. Furthermore, it can contribute to personalized treatment by discovering markers that predict resistance to cancer immunotherapy agents for a specific disease or a specific cohort.

FIG. 1 is an example of a system for predicting resistance to a cancer immunotherapy agent.
FIG. 2 is an example of a process of training a model that predicts resistance to a cancer immunotherapy agent.
FIG. 3 is an example of a process of predicting resistance to a cancer immunotherapy agent.
FIG. 4 is an example of a process of discovering a marker for determining resistance to a cancer immunotherapy agent.
FIG. 5 is an example of the structure of an analysis device for predicting resistance to a cancer immunotherapy agent.
FIG. 6 shows the result of evaluating a model that predicts resistance to a cancer immunotherapy agent.

The technology described below may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the technology to a specific embodiment, and it should be understood to include all changes, equivalents, and substitutes that fall within the spirit and technical scope of the technology described below.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of rights of the technology described below, a first component may be named a second component, and similarly a second component may be named a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of the plurality of related listed items.

As used in this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise. Terms such as "includes" mean that the stated features, numbers, steps, operations, components, parts, or combinations thereof exist, and should be understood as not excluding the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In addition, when a method or an operating method is performed, the steps constituting the method may occur in an order different from the stated order unless a specific order is clearly described in context. That is, the steps may occur in the stated order, may be performed substantially simultaneously, or may be performed in the reverse order.

The terms used in the following description are explained below.

An antigen is a substance that induces an immune response.

A neoantigen is an antigen with an alteration that arises through a mutation in a tumor cell or a post-translational modification specific to a tumor cell. A neoantigen may comprise a polypeptide sequence or a nucleotide sequence. Mutations may include frameshift or non-frameshift indels, missense or nonsense substitutions, splice site alterations, genomic rearrangements or gene fusions, or any genomic or expression alteration that gives rise to a neoORF. Mutations may also include splice variants. Post-translational modifications specific to tumor cells may include aberrant phosphorylation. They may also include proteasome-generated spliced antigens.

The exome is the subset of the genome that encodes proteins. An exome may refer to the set of exons present in a cell, a group of cells, or an individual.

An epitope may refer to the specific portion of an antigen to which an antibody or T-cell receptor typically binds.

Immunogenicity is the ability to induce an immune response through T cells, B cells, or both.

Tolerance, immune tolerance, or resistance is a state of immune non-responsiveness to one or more antigens.

A specimen or sample means a single cell or multiple cells, cell fragments, body fluid, or the like collected from the subject to be analyzed.

A subject includes a cell, a tissue, or an organism. The subject is basically a human, but is not limited thereto.

Genome data or genome information means genetic information produced by analyzing a sample. For example, genome data may include base sequences obtained from deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or protein from cells, tissues, and the like, as well as gene expression data, genetic variation relative to reference genome data, and DNA methylation. In general, genome data includes sequence information obtained by analyzing a specific sample. Genome data can be obtained in various ways. For example, it can be generated through NGS analysis. Genome data can be represented as digital data that a computer can process.

Machine learning, or learning, is a field of artificial intelligence concerned with developing algorithms that allow computers to learn. A machine learning model, or learning model, means a model developed so that a computer can learn. Depending on the approach, learning models come in various forms, such as artificial neural networks and decision trees.

Ensemble is a generic term for techniques that use a plurality of learning algorithms in machine learning. Representative ensemble techniques include bagging techniques, of which the random forest is one, and boosting techniques.

A random forest is a type of bagging algorithm built from a combination of CART decision trees. A random forest consists of a plurality of decision trees. Each decision tree is trained in advance on a randomly selected portion of the training data and feature variables. In a random forest, each tree individually determines the target variable, and the decisions of all trees are then aggregated to make the final decision.

FIG. 1 is an example of a system 100 for predicting resistance to a cancer immunotherapy agent. An analysis device predicts the resistance. In FIG. 1, the analysis device is shown in the form of a server 130 and a computer terminal 140. The server 130 may provide, over a network, a service for predicting resistance to cancer immunotherapy agents. The computer terminal 140 may analyze genome data, either connected to a network or as a standalone device, to predict resistance to a cancer immunotherapy agent. The analysis devices 130 and 140 may be implemented in various forms.

The analysis devices 130 and 140 analyze resistance to a cancer immunotherapy agent using genome data. Here, the genome data includes information on the genome sequence. A genome analysis device 110 analyzes a specimen and generates the genome data. For example, the genome analysis device 110 may be an NGS analysis device. The genome analysis device 110 may also store the generated genome data in a separate DB 120.

The genome analysis device 110 generates genome data using a genome library. The genome analysis device 110 may perform whole exome sequencing on the genome library. The genome library can be prepared using a commercial kit. For example, a genome library for sequencing can be generated using the AllPrep DNA/RNA Mini Kit (Qiagen, 80204), the AllPrep DNA/RNA Micro Kit (Qiagen, 80284), or the QIAamp DNA FFPE Tissue Kit (Qiagen, 56404).

Users 10 and 20 can check the predicted resistance of a specific patient to a cancer immunotherapy agent. The user 10 may access the server 130 through a user terminal (PC, smartphone, etc.) and check the analysis result produced by the server 130. The user 20 may check the resistance result through the computer terminal 140 that the user 20 operates.

The users 10 and 20 may be researchers who evaluate resistance to cancer immunotherapy agents. Alternatively, the users 10 and 20 may be medical staff considering prescribing a cancer immunotherapy agent for a specific patient.

The analysis devices 130 and 140 predict resistance to a cancer immunotherapy agent by analyzing genome data. They do so using a learning model prepared in advance, and may use various learning models. For example, the analysis devices 130 and 140 may analyze resistance using an ensemble technique. For convenience of description, it is assumed below that the analysis devices 130 and 140 analyze resistance to a cancer immunotherapy agent using a random forest model.

The learning model used by the analysis devices 130 and 140 must be prepared in advance. FIG. 2 is an example of a process 200 of training a model that predicts resistance to a cancer immunotherapy agent. The learning model is trained using training data prepared in advance.

A cancer patient cohort contains the genome data of a plurality of patients. The initial cancer patient cohort is the population. Training data can then be selected according to several criteria, by repeatedly selecting from the initial population a group that satisfies a given criterion. The filtering steps may be applied in any order.

Several criteria for selecting training data are described. (i) TMB may be a criterion. That is, individuals with a TMB higher than a reference value can be selected from the population. (ii) The number of neoantigens produced by cancer cells may be a criterion. That is, individuals whose number of neoantigens exceeds a reference value can be selected from the population. (iii) Functional mutation may be a criterion. That is, individuals whose degree of functional mutation is at or above a reference value can be selected from the population. Functional mutations can be evaluated with various algorithms.

For example, a functional mutation can be evaluated by the degree to which the mutated sequence affects protein function. The degree to which a mutation affects the function of the associated protein may be measured using several solutions or algorithms. Some examples follow.

(i) SIFT (Sorting Intolerant From Tolerant, https://sift.bii.a-star.edu.sg/) quantifies the degree to which an amino acid substitution affects protein function. The SIFT score is a quantification of the extent to which a mutation affects protein function. (ii) PROVEAN (Protein Variation Effect Analyzer, http://provean.jcvi.org) quantifies the degree to which an amino acid substitution or indel affects protein function. The PROVEAN score is a quantification of the extent to which a mutation affects protein function.

Within the group under analysis, when an individual's SIFT score is at or above a reference value and its PROVEAN score is also at or above a reference value, that individual may be judged to have functional mutations at or above the threshold.
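
The two-score rule above can be sketched as follows. This is only an illustration under stated assumptions: the reference values are hypothetical (the text gives no numbers), the names `IndividualScores` and `has_functional_mutation` are invented for the sketch, and both scores are assumed to be already oriented so that a larger value means a greater effect on protein function, as the text describes.

```python
from dataclasses import dataclass

# Hypothetical reference values; the source only says "reference value".
SIFT_REFERENCE = 0.5
PROVEAN_REFERENCE = 0.5


@dataclass
class IndividualScores:
    """Per-individual functional-impact scores (larger = greater impact, by assumption)."""
    sift: float
    provean: float


def has_functional_mutation(scores, sift_ref=SIFT_REFERENCE, provean_ref=PROVEAN_REFERENCE):
    """Return True only when BOTH scores are at or above their reference values."""
    return scores.sift >= sift_ref and scores.provean >= provean_ref
```

Requiring both scores to pass makes the selection stricter than either tool alone; a looser variant would combine the two conditions with `or`.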

Clinical data may also be used to select training data. This presupposes that clinical data exist for the individuals constituting the population. The clinical data include information on whether each individual is actually resistant to a cancer immunotherapy agent. For example, training data can be filtered by selecting from the population the individuals who, according to the clinical data, are resistant to a cancer immunotherapy agent.

The process of selecting training data is described with reference to FIG. 2. A cancer patient cohort to be analyzed is obtained (210). From the cancer patient cohort, the data set of individuals whose number of neoantigens is at or above a reference value is selected (220). For example, the data set with more than 70 neoantigens may be selected. From the selected set, training data whose degree of functional mutation is at or above a reference value are selected (230). The random forest model is trained using the training data finally prepared (240).

The learning model may be prepared using positive training data and negative training data. The training data include genome data for a plurality of individuals. Accordingly, the positive training data form a positive training data group and the negative training data form a negative training data group. The process of selecting the positive training data group from the cancer patient cohort is as described above. FIG. 2 is an example in which data with a number of neoantigens at or above the reference value and a degree of functional mutation at or above the reference value are selected from the cancer patient cohort as the positive training data group. The degree of functional mutation can be determined based on the SIFT score and the PROVEAN score. The negative training data group consists of the training data in the population that remain after the positive training data group is excluded.
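
The split into positive and negative training data groups described above might look like the following minimal sketch. The record layout (`id`, `neoantigens`, `functional_score`) and the functional-mutation reference value are assumptions made for illustration; only the neoantigen cutoff of 70 comes from the example in the text.

```python
NEOANTIGEN_REFERENCE = 70  # example value given in the text ("more than 70")


def split_training_groups(cohort, functional_reference):
    """Split a cohort into (positive_ids, negative_ids).

    cohort: list of dicts with hypothetical keys
            "id", "neoantigens", "functional_score".
    An individual is positive when it has more neoantigens than the
    reference AND a functional-mutation score at or above its reference.
    Everyone else in the population forms the negative group.
    """
    positive, negative = [], []
    for patient in cohort:
        selected = (
            patient["neoantigens"] > NEOANTIGEN_REFERENCE
            and patient["functional_score"] >= functional_reference
        )
        (positive if selected else negative).append(patient["id"])
    return positive, negative
```

Because the two conditions are combined with a logical AND, applying the neoantigen filter first or the functional-mutation filter first yields the same groups, which is consistent with the statement that the filtering order does not matter.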

The random forest model can learn from the genome data of the plurality of individuals included in the training data. Each of the decision trees constituting the random forest is trained on randomly selected training data and randomly selected feature variables.
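
As a rough sketch of the random selection described above, the function below draws, for each tree, a bootstrap sample of training rows and a random subset of feature columns. The function name and parameters are hypothetical. It follows the per-tree feature selection stated in the text; many random forest implementations instead draw candidate features at each split.

```python
import random


def draw_tree_inputs(n_samples, n_features, n_trees, features_per_tree, seed=0):
    """Return one (row_indices, feature_indices) plan per decision tree.

    Rows are drawn WITH replacement (bootstrap sampling, i.e. bagging);
    features are drawn WITHOUT replacement so each tree sees distinct columns.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    plans = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        cols = sorted(rng.sample(range(n_features), features_per_tree))
        plans.append((rows, cols))
    return plans
```

Each plan would then be handed to whatever decision-tree learner is in use; the tree learner itself is outside the scope of this sketch.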

The feature variables on which the random forest is trained may be the sequences in the genome sequence where mutations have occurred. That is, the random forest can be trained not on the entire sequence but on specific sequence regions that are highly relevant to resistance to cancer immunotherapy agents. For random forest training, the genome sequence may be converted in advance into information in a fixed vector form.
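
The source does not specify how a genome sequence is converted into a vector. One simple possibility, shown here purely as an assumption, is a fixed-length binary vector with one position per tracked gene, set to 1 when that gene carries a mutation. The gene names are illustrative only.

```python
def encode_mutations(mutated_genes, tracked_genes):
    """Encode a sample as a 0/1 vector aligned with tracked_genes.

    mutated_genes: iterable of gene names mutated in the sample.
    tracked_genes: fixed, ordered list of genes used as feature variables.
    Genes outside tracked_genes are ignored, so every sample yields a
    vector of the same length.
    """
    mutated = set(mutated_genes)
    return [1 if gene in mutated else 0 for gene in tracked_genes]
```

For example, `encode_mutations(["TP53", "KRAS"], ["KRAS", "EGFR", "TP53"])` produces `[1, 0, 1]`.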

FIG. 3 is an example of a process 300 of predicting resistance to a cancer immunotherapy agent. In FIG. 3, the analysis device predicts resistance using a learning model trained in advance. The analysis device thus predicts, ahead of treatment, the effect of a cancer immunotherapy agent on a specific patient.

The analysis device receives the genome data of a sample (310). The sample here refers to the individual (patient) whose resistance to a cancer immunotherapy agent is to be determined. The sample genome data are the genome data of the patient under analysis.

The analysis device inputs the sample genome data into the pre-trained model. The learning model analyzes the sample genome data (320). FIG. 3 illustrates a random forest model. Each decision tree constituting the random forest starts from the input data, makes a series of decisions, and outputs its final judgment. In FIG. 3, decision tree A outputs the result that resistance is high (High), and decision tree B outputs the result that resistance is low (Low). The random forest makes its final judgment by considering the outputs of all the decision trees. For example, the random forest may reach its final conclusion by majority vote. The analysis device predicts the sample's resistance to the cancer immunotherapy agent based on the information output by the random forest (330). For example, the analysis device may output information indicating that the patient's resistance to the cancer immunotherapy agent is high. Because the learning model classifies the resistance of a sample, it can be called a classifier.
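
The majority-vote step described above can be sketched as follows. The three toy trees are hypothetical stand-ins for trained decision trees, each reduced to a single rule on an invented binary feature; only the vote aggregation reflects the text.

```python
from collections import Counter


def forest_predict(trees, sample):
    """Return (majority_label, vote_counts) over all decision trees.

    trees: callables that map a sample to "High" or "Low".
    Ties are resolved in favor of the label that was voted first.
    """
    votes = Counter(tree(sample) for tree in trees)
    label, _count = votes.most_common(1)[0]
    return label, dict(votes)


# Toy stand-ins for trained trees (hypothetical single-feature rules).
def tree_a(sample):
    return "High" if sample["GENE_A"] == 1 else "Low"


def tree_b(sample):
    return "High" if sample["GENE_B"] == 1 else "Low"


def tree_c(sample):
    return "High" if sample["GENE_C"] == 1 else "Low"
```

With the sample `{"GENE_A": 1, "GENE_B": 0, "GENE_C": 1}`, two of the three toy trees vote "High", so `forest_predict` returns "High" together with the vote counts.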

FIG. 4 is an example of a process 400 of discovering a marker for determining resistance to a cancer immunotherapy agent. Biomarkers can be determined individually for a specific patient cohort or for a specific patient.

The analysis device receives sample genome data (410). The analysis device inputs the sample genome data into the learning model (classifier) and selects candidate data with high resistance to the cancer immunotherapy agent (420).

For the candidate data, the analysis device may detect candidate genes based on variable importance (430). Variable importance is a quantification of the effect that a specific variable has on the analysis result of the learning model.

As described above, the analysis device may use mutated sequences in the genome data as variables. In this case, variable importance may be determined by altering the composition of a specific sequence and observing the result of inputting data containing the altered sequence into the learning model. For example, the analysis device randomly permutes the order of a sequence that serves as a specific variable, and inputs the permuted variable, or input data containing it, into the learning model. Sample genome data containing the altered variable are referred to as processed sample genome data.

When the processed sample genome data are input to the classifier, the output may differ from the output obtained when the original sample genome data are input. The original sample genome data are the sample genome data in which the order of the sequence has not been randomly changed. The degree to which the prediction accuracy changes is the variable importance. Variable importance can be quantified by various criteria. In the case of a random forest, variable importance can be calculated over the plurality of decision trees.

For example, the outputs of the decision trees may differ between the case where they receive the original sample genome data and the case where they receive the processed sample genome data. The number of decision trees whose output changes can then serve as the criterion for variable importance. For example, when three or more decision trees change their output, the variable importance of the corresponding variable may be judged to be high. In this case, the analysis device may detect that variable (sequence) as a candidate gene (430).
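
The tree-counting criterion above can be sketched as follows, under several assumptions: each variable is a short sequence string, each tree is a callable returning "High" or "Low", and the permutation is passed in as a function so that the example stays deterministic. The toy trees and all function names are hypothetical; the threshold of three changed trees is the example from the text.

```python
import random


def random_shuffle(sequence, rng):
    """Randomly permute the characters of a sequence string."""
    chars = list(sequence)
    rng.shuffle(chars)
    return "".join(chars)


def changed_tree_count(trees, sample, variable, permute):
    """Count trees whose output differs between original and processed data."""
    processed = dict(sample)                      # copy; the original stays intact
    processed[variable] = permute(sample[variable])
    return sum(1 for tree in trees if tree(sample) != tree(processed))


def candidate_variables(trees, sample, permute, min_changed=3):
    """Return variables for which at least min_changed trees change output."""
    return [
        variable
        for variable in sample
        if changed_tree_count(trees, sample, variable, permute) >= min_changed
    ]


# Toy stand-ins for trained trees; all of them inspect GENE_A only.
def tree_1(s):
    return "High" if s["GENE_A"].startswith("AC") else "Low"


def tree_2(s):
    return "High" if s["GENE_A"].endswith("GT") else "Low"


def tree_3(s):
    return "High" if "G" in s["GENE_A"] else "Low"


def tree_4(s):
    return "High" if s["GENE_A"][0] == "A" else "Low"
```

For a random permutation, `permute` could be `lambda seq: random_shuffle(seq, random.Random(0))`. Using a deterministic permutation such as sequence reversal on the sample `{"GENE_A": "ACGT", "GENE_B": "TTTT"}` changes the output of three of the four toy trees for `GENE_A` and of none for `GENE_B`, so only `GENE_A` is returned as a candidate.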

The analysis device or a researcher may designate a candidate gene as a marker that identifies resistance to a cancer immunotherapy agent.

Furthermore, the analysis device or the researcher may determine the candidate gene after additionally performing an interactome analysis on the protein associated with the candidate gene. The associated protein means a protein affected by the candidate gene, a protein produced by translation of the candidate gene, and the like. Through interactome analysis of the associated protein, the analysis device can identify which mechanistic pathway the associated protein affects. When the associated protein affects the development of resistance to cancer immunotherapy agents, the analysis device may designate the candidate gene as a marker. Alternatively, when the associated protein affects the occurrence of cancer, the analysis device may designate the candidate gene as a marker.

FIG. 5 shows an example of the structure of an analysis device 500 that predicts resistance to an anticancer immunotherapeutic agent. The analysis device 500 corresponds to the analysis device 130 or 140 of FIG. 1.

The analysis device 500 may predict resistance to the anticancer immunotherapeutic agent using the learning model described above. The analysis device 500 may be physically implemented in various forms. For example, it may take the form of a computer device such as a PC, a network server, or a chipset dedicated to image processing. The computer device may include a mobile device such as a smart device.

The analysis device 500 includes a storage device 510, a memory 520, a computing device 530, an interface device 540, a communication device 550, and an output device 560.

The storage device 510 stores a classifier that predicts resistance to the anticancer immunotherapeutic agent. The classifier must be trained in advance. The storage device 510 may also store programs or source code required for data processing. The storage device 510 may store the input genome data and data on the predicted resistance.

The memory 520 may store data and information generated while the analysis device 500 analyzes the received data.

The interface device 540 receives commands and data from the outside. The interface device 540 may receive genome data from a physically connected input device or an external storage device. The interface device 540 may receive a learning model for data analysis. The interface device 540 may also receive training data, information, and parameter values for training the learning model.

The communication device 550 is a component that receives and transmits information over a wired or wireless network. The communication device 550 may receive genome data from an external object. The communication device 550 may also receive data for model training. The communication device 550 may transmit, to an external object, information on the resistance to the anticancer immunotherapeutic agent determined for the input sample.

The communication device 550 and the interface device 540 are devices that receive data or commands from the outside. The communication device 550 and the interface device 540 may be referred to as input devices.

The output device 560 outputs information. The output device 560 may output the interface required for data processing, the analysis results, and the like.

The computing device 530 may predict resistance to the anticancer immunotherapeutic agent for the input sample genome data using the classifier stored in the storage device 510. The computing device 530 may predict the resistance either directly from the result output by the classifier or after processing that result in a certain way. The computing device 530 may also train the learning model used for predicting resistance to the anticancer immunotherapeutic agent using given training data. The computing device 530 may be a device that processes data and performs certain operations, such as a processor, an AP, or a chip with an embedded program.

The following describes the process by which the researchers built the above-described model for predicting resistance to the anticancer immunotherapeutic agent, and the effect of that model.

Table 1
Cohort name | Reference                    | Tumor type  | Cohort size | Target immune checkpoint
SMC         | -                            | Lung cancer | 122         | PD-1/PD-L1
Rizvi       | Science 348:124              | Lung cancer | 34          | PD-1
Hellmann    | Cancer Cell 33:843           | Lung cancer | 75          | PD-1 & CTLA-4
Van Allen   | Science 350:207              | Melanoma    | 110         | CTLA-4
Snyder      | NEJM 371:2189                | Melanoma    | 64          | CTLA-4
Roh         | Sci. Transl. Med. 9:eaah3560 | Melanoma    | 56          | PD-1 & CTLA-4
Riaz        | Cell 171:934                 | Melanoma    | 68          | PD-1

Table 1 lists the cohorts used in the experiments and development. All cohorts other than the SMC cohort were used in previous studies.

The SMC cohort consists of data provided by a domestic hospital. The details are as follows. The cohort comprised 122 patients with advanced non-small cell lung cancer who were treated with anti-PD-1/PD-L1 therapy at the hospital from 2014 to 2017. Clinical response was assessed according to the Response Evaluation Criteria in Solid Tumours (RECIST) version 1.1, with follow-up of at least 6 months. Response to immunotherapy was classified as responsive (durable clinical benefit, DCB) or non-responsive (non-durable benefit, NDB). Patients with a partial response (PR) or stable disease (SD) lasting longer than 6 months were considered DCB/responsive. Progressive disease (PD) or SD lasting less than 6 months was considered NDB/non-responsive. Progression-free survival (PFS) was calculated from the start of treatment to the date of progression or death, whichever came first. For patients who were alive, PFS was assessed at the date of the last follow-up.
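The PFS calculation described above can be sketched as follows. This is a minimal illustration, not the procedure actually used for the SMC cohort: the function name `pfs_days`, its parameters, and the choice to return an event flag are hypothetical, and treating "alive at last follow-up" as "no event observed" is an assumption.

```python
from datetime import date
from typing import Optional, Tuple


def pfs_days(start: date,
             progression: Optional[date] = None,
             death: Optional[date] = None,
             last_follow_up: Optional[date] = None) -> Tuple[int, bool]:
    """Return (days from treatment start, event_observed).

    The end point is the earlier of progression and death. If neither is
    recorded, the date of the last follow-up is used and no event is flagged.
    """
    events = [d for d in (progression, death) if d is not None]
    if events:
        return (min(events) - start).days, True
    if last_follow_up is None:
        raise ValueError("need a progression, death, or last follow-up date")
    return (last_follow_up - start).days, False


# Hypothetical patients.
with_event = pfs_days(date(2015, 1, 1),
                      progression=date(2015, 7, 1),
                      death=date(2015, 9, 1))
without_event = pfs_days(date(2015, 1, 1), last_follow_up=date(2015, 1, 31))
```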

Mutations were examined in all samples. Functional mutations were evaluated with SIFT and PROVEAN. A functional mutation was defined as a mutation classified as damaging by SIFT and, at the same time, as deleterious by PROVEAN.
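The definition of a functional mutation given above can be sketched as follows. The text does not state the cutoffs used, so the thresholds here are assumptions: they are the commonly used defaults of the two tools (SIFT score of 0.05 or lower for "damaging", PROVEAN score of -2.5 or lower for "deleterious"). The function name `is_functional_mutation` is hypothetical.

```python
# Assumed default cutoffs; the source text does not specify them.
SIFT_DAMAGING_MAX = 0.05        # SIFT score <= 0.05 is treated as "damaging"
PROVEAN_DELETERIOUS_MAX = -2.5  # PROVEAN score <= -2.5 is treated as "deleterious"


def is_functional_mutation(sift_score: float, provean_score: float) -> bool:
    """A mutation counts as functional only if BOTH tools flag it."""
    damaging = sift_score <= SIFT_DAMAGING_MAX
    deleterious = provean_score <= PROVEAN_DELETERIOUS_MAX
    return damaging and deleterious


# Hypothetical scores for three mutations.
both_flag = is_functional_mutation(0.01, -4.0)
only_sift = is_functional_mutation(0.01, -1.0)
only_provean = is_functional_mutation(0.50, -4.0)
```

Requiring agreement between the two tools is stricter than using either alone, which matches the "at the same time" wording of the definition.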

Mutations of genes with a mutation frequency higher than 5% in the training data were selected as features (variables) for training the random forest. The random forest consisted of 1000 decision trees. The random forest was trained with the random forest R package, repeating 5-fold cross-validation 10 times.
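The feature selection and resampling scheme described above can be sketched as follows. This sketch is written in Python rather than R and does not train a random forest; it only illustrates the 5% frequency filter and the 10 repetitions of 5-fold cross-validation. The helper names `select_features` and `repeated_kfold`, the gene names, and the toy data are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Set, Tuple


def select_features(mutated_genes: List[Set[str]], min_freq: float = 0.05) -> List[str]:
    """Keep genes mutated in more than `min_freq` of the training samples."""
    n = len(mutated_genes)
    counts: Dict[str, int] = {}
    for genes in mutated_genes:
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(g for g, c in counts.items() if c / n > min_freq)


def repeated_kfold(n_samples: int, k: int = 5, repeats: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for `repeats` shuffled k-fold rounds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]  # every k-th shuffled index forms one fold
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test


# Hypothetical toy data: 40 samples, each a set of functionally mutated genes.
samples: List[Set[str]] = [set() for _ in range(40)]
for i in range(10):
    samples[i].add("GENE_A")  # 25% of samples
samples[0].add("GENE_B")      # 2.5% of samples, filtered out
for i in range(3):
    samples[i].add("GENE_C")  # 7.5% of samples

features = select_features(samples)
splits = list(repeated_kfold(len(samples)))
```

Each of the resulting splits would be used to fit and evaluate one forest; the number of trees (1000 in the text) would be a parameter of whichever random forest implementation is used.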

In addition, a negative control model was created by training, in the same way as the resistance prediction model, on the status of synonymous mutations in the same gene set. The negative control model used the same number of features as the resistance prediction model.

For each cancer type, one cohort was held out and the remaining cohorts were combined to train the model. The held-out cohort was used as test data. That is, the results were examined by feeding the held-out cohort as input data to the trained model.
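The hold-one-cohort-out scheme described above can be sketched as follows. This is an illustration only: the helper `leave_one_cohort_out` and the placeholder sample identifiers are hypothetical, and the cohort sizes are the lung cancer values from Table 1.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_cohort_out(
    cohorts: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, training_samples, test_samples) for each cohort."""
    for held_out in cohorts:
        train = [s for name, samples in cohorts.items()
                 if name != held_out for s in samples]
        yield held_out, train, list(cohorts[held_out])


# Placeholder sample IDs; sizes follow the lung cancer cohorts in Table 1.
lung_cohorts = {
    "SMC": [f"SMC_{i}" for i in range(122)],
    "Rizvi": [f"Rizvi_{i}" for i in range(34)],
    "Hellmann": [f"Hellmann_{i}" for i in range(75)],
}

split_sizes = {name: (len(train), len(test))
               for name, train, test in leave_one_cohort_out(lung_cohorts)}
```

Testing on a cohort that contributed nothing to training gives an estimate of how the model transfers to data from a different study.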

FIG. 6 shows the results of evaluating the model for predicting resistance to the anticancer immunotherapeutic agent. The red curve is the receiver operating characteristic (ROC) curve for deleterious/damaging mutations. The blue curve is the ROC curve for synonymous mutations, the negative control. AUC (area under the curve) is the area under the ROC curve; a value of 1 indicates an ideal model.
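For reference, the AUC can be computed without drawing the curve, using its equivalent interpretation as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The sketch below is illustrative; the function name `roc_auc` and the toy labels and scores are hypothetical and are not results from FIG. 6.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy examples: 1 = resistant, 0 = not resistant.
perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])
partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
uninformative = roc_auc([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

A model that ranks every resistant sample above every non-resistant one reaches 1, while a model that assigns the same score to all samples stays at 0.5.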

FIG. 6(A) shows the evaluation results for the melanoma cohorts, and FIG. 6(B) shows the evaluation results for the lung cancer cohorts. In each graph, the cohort named at the top is the cohort used for testing. For both melanoma and lung cancer, resistance prediction performed better than the negative control.

In addition, the method for predicting resistance to an anticancer immunotherapeutic agent or the method for discovering biomarkers described above may be implemented as a program (or application) containing an algorithm executable on a computer. The program may be provided stored in a non-transitory computer readable medium.

A non-transitory readable medium is a medium that stores data semi-permanently and can be read by a device, as opposed to a medium that stores data only briefly, such as a register, cache, or memory. Specifically, the various applications or programs described above may be provided stored in a non-transitory readable medium such as a CD, DVD, hard disk, Blu-ray disc, USB device, memory card, or ROM.

The present embodiment and the drawings attached to this specification merely illustrate part of the technical idea included in the above-described technology. It will be apparent that all modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea included in the specification and drawings fall within the scope of rights of the above-described technology.

Claims

A method for predicting resistance to an anticancer immunotherapeutic agent, the method comprising:
receiving, by an analysis device, genome data of a sample;
inputting, by the analysis device, the genome data into a pre-trained classifier; and
predicting, by the analysis device, resistance to the anticancer immunotherapeutic agent for the sample based on output information of the classifier,
wherein the classifier predicts resistance to the anticancer immunotherapeutic agent based on features of a sequence associated with a functional mutation caused by a tumor,
wherein the classifier is trained using a positive training data group having resistance and a negative training data group not having resistance, the groups being defined by resistance to the anticancer immunotherapeutic agent, and
wherein the positive training data group is selected according to the degree to which a tumor-induced mutation affects protein function.

The method of claim 1,
wherein the classifier is an ensemble model.

The method of claim 1,
wherein the classifier is a random forest model.

The method of claim 1,
wherein the classifier is trained using genome data of patients having tumor-induced neoantigens at or above a reference value.

(Deleted)

The method of claim 1,
wherein the positive training data group is training data in which the SIFT score and the PROVEAN score, which quantitatively evaluate the degree to which an amino acid sequence affects protein function, are greater than or equal to a reference value.

The method of claim 1,
wherein the positive training data group is training data in which the number of neoantigens is greater than or equal to a reference value.

A method for detecting a marker for predicting resistance to an anticancer immunotherapeutic agent, the method comprising:
receiving, by an analysis device, genome data of a sample;
inputting, by the analysis device, the genome data into a pre-trained classifier; and
determining a marker capable of identifying resistance to the anticancer immunotherapeutic agent based on output information of the classifier,
wherein the classifier predicts resistance to the anticancer immunotherapeutic agent based on features of a sequence associated with a functional mutation caused by a tumor, and
wherein determining the marker comprises: detecting, in the genome data, a candidate gene having a variable importance greater than or equal to a reference value; and determining the marker among the candidate genes through interactome analysis of proteins associated with the candidate gene.

The method of claim 8,
wherein the classifier is trained using a positive training data group having resistance and a negative training data group not having resistance, the groups being defined by resistance to the anticancer immunotherapeutic agent, and
wherein the positive training data group is selected according to the degree to which a tumor-induced mutation affects protein function.

The method of claim 8,
wherein the positive training data group is training data having a SIFT score and a PROVEAN score that quantitatively evaluate the degree to which an amino acid sequence affects protein function.

The method of claim 9,
wherein the positive training data group is training data in which the number of neoantigens is greater than or equal to a reference value.

The method of claim 8,
wherein the variable importance indicates the degree to which the prediction result of the classifier becomes inaccurate after random permutation of the mutation sequence.

An analysis device for predicting resistance to an anticancer immunotherapeutic agent, the analysis device comprising:
an input device that receives genome data of a sample;
a storage device that stores a classifier that predicts resistance to the anticancer immunotherapeutic agent based on features of a sequence associated with a tumor-induced functional mutation; and
a computing device that predicts resistance to the anticancer immunotherapeutic agent for the sample by inputting the genome data into the classifier,
wherein the classifier is trained in advance using a positive training data group having resistance, defined by resistance to the anticancer immunotherapeutic agent, and the positive training data group is selected according to the degree to which a tumor-induced mutation affects protein function.

The analysis device of claim 13,
wherein the classifier is a random forest model.

(Deleted)

The analysis device of claim 13,
wherein the positive training data group includes, among training data in which the number of neoantigens is higher than a reference value, data in which both the SIFT score and the PROVEAN score, which quantitatively evaluate the degree to which an amino acid sequence affects protein function, are positive.

The analysis device of claim 13,
wherein the computing device detects, in the genome data, a candidate gene whose variable importance is greater than or equal to a reference value, and determines a resistance marker for the anticancer immunotherapeutic agent among the candidate genes through interactome analysis of proteins associated with the candidate gene.

A computer-readable recording medium on which is recorded a program for executing, on a computer, the method for predicting resistance to an anticancer immunotherapeutic agent according to any one of claims 1 to 4 and 6 to 7.