KR20190000169A

KR20190000169A - System and method of biomarker identification for cancer recurrence prediction

Info

Publication number: KR20190000169A
Application number: KR1020170079168A
Authority: KR
Inventors: 이관수; 차기훈
Original assignee: 한국과학기술원
Priority date: 2017-06-22
Filing date: 2017-06-22
Publication date: 2019-01-02
Also published as: KR101990430B1

Abstract

The present invention provides a method by which a system operated by at least one processor discovers a biomarker for predicting the recurrence prognosis of cancer. The method comprises: collecting large-scale gene expression data of patients with a specific type of cancer; classifying the large-scale gene expression data into a plurality of clusters by analyzing the similarity among the data; extracting genes commonly expressed in the gene expression data classified into each cluster and determining them as the cancer-associated gene set of that cluster; and selecting, as a biomarker, at least one gene set from the plurality of cancer-associated gene sets made up of the cancer-associated gene set of each cluster, based on a hazard ratio calculated from the cancer recurrence prognosis-related information of each cancer-associated gene set.

Description

System and method of biomarker identification for cancer recurrence prediction

The present invention relates to a biomarker.

Cancer is classified by the site where it arises, for example as breast cancer, liver cancer, or lung cancer. Even cancers of the same type can arise from mutations in different genes or from different abnormalities of intracellular function. Consequently, patients with the same type of cancer may differ in prognosis, such as the timing of recurrence or death, and may respond differently to the same drug. The need for personalized prediction of recurrence prognosis is therefore growing, both for predicting the prognosis of cancer patients and for establishing treatment strategies. To ultimately establish a treatment strategy suited to each patient, cancer patients must first be classified.

The diagnostic kits studied so far for predicting the recurrence prognosis of cancer patients include Mammaprint, Prosigna, and Oncotype DX; Mammaprint and Prosigna received FDA approval in 2007 and 2013, respectively, and are in use. Because these kits classify patients using labeled data, they tend to overfit to the data of the kit. In addition, they discriminate patients using the entire gene panel without considering the characteristics of patient subgroups, which makes accurate classification difficult, and because they do not consider the mechanism of the disease, their robustness in classifying patient samples is low.

The problem to be solved by the present invention is to provide a system and method that analyze gene expression data of a large number of cancer patients to extract gene sets co-expressed in each patient group, and that discover, from the extracted gene sets, a biomarker with clinical characteristics capable of predicting the recurrence prognosis of cancer.

According to one embodiment, a method by which a system operated by at least one processor discovers a biomarker for predicting the recurrence prognosis of cancer includes: collecting large-scale gene expression data of patients with a specific type of cancer; classifying the large-scale gene expression data into a plurality of clusters by analyzing the similarity among the data; extracting genes commonly expressed in the gene expression data classified into each cluster and determining them as the cancer-associated gene set of that cluster; and selecting, as a biomarker, at least one gene set from the plurality of cancer-associated gene sets made up of the cancer-associated gene set of each cluster, based on a hazard ratio calculated from the cancer recurrence prognosis-related information of each cancer-associated gene set.

In the classifying step, the large-scale gene expression data may be biclustered so that any given gene expression data is classified into at least one cluster.

In the classifying step, the large-scale gene expression data may be biclustered such that each cluster has a Pearson correlation coefficient of 0.9 or more and a significance probability (P value) of 0.05 or less.

In the step of determining the cancer-associated gene set of the cluster, expressed genes may be extracted from the gene expression data classified into each cluster, and, among the expressed genes of each cluster, genes that are commonly expressed in at least a reference number of gene expression data and have a Pearson correlation coefficient equal to or greater than a reference value may be determined as the cancer-associated gene set of that cluster.

In the collecting step, the expression level of each gene expression data may be normalized and stored.

The cancer recurrence prognosis-related information includes, for each patient, whether the cancer recurred and the survival time, and may be obtained from the gene expression data of each cancer-associated gene set.

In the step of selecting the biomarker, whether the hazard ratio of each cancer-associated gene set is significant may be evaluated, and at least one gene set with a significant hazard ratio among the plurality of cancer-associated gene sets may be selected as the biomarker.

According to another embodiment, a method by which a system operated by at least one processor discovers a biomarker for predicting the recurrence prognosis of cancer includes: classifying large-scale gene expression data of patients with a specific type of cancer into a plurality of clusters using a biclustering technique; determining the genes commonly expressed in each of the plurality of clusters as the cancer-associated gene set of that cluster; determining the plurality of cancer-associated gene sets made up of the cancer-associated gene set of each cluster as biomarker candidates; and selecting a biomarker by evaluating the prognostic significance of each cancer-associated gene set determined as a biomarker candidate, based on the cancer recurrence prognosis-related information included in each gene expression data.

In the classifying step, any given gene expression data among the large-scale gene expression data may be classified into at least one cluster.

The cancer recurrence prognosis-related information may include, for each patient, whether the cancer recurred and the survival time.

In the step of selecting the biomarker, a hazard ratio indicating the degree to which a given cancer-associated gene set determined as a biomarker candidate affects survival may be calculated based on the cancer recurrence prognosis-related information of that gene set, and whether to select that gene set as a biomarker may be decided based on the significance of the calculated hazard ratio.

According to yet another embodiment, a system for discovering a biomarker for predicting the recurrence prognosis of cancer includes: a cancer-associated gene set extraction apparatus that classifies cancer patients based on large-scale gene expression data of patients with a specific type of cancer and extracts the gene set commonly expressed in each patient group; and a biomarker selection apparatus that evaluates, among the plurality of gene sets made up of the gene set of each patient group, the prognostic significance of each gene set based on its cancer recurrence prognosis-related information, and selects at least one of the plurality of gene sets as a biomarker.

The cancer-associated gene set extraction apparatus may bicluster the large-scale gene expression data to classify the data into a plurality of patient groups.

The cancer recurrence prognosis-related information includes, for each patient, whether the cancer recurred and the survival time. Based on the cancer recurrence prognosis-related information of each gene set, the biomarker selection apparatus may calculate a hazard ratio indicating the degree to which that gene set affects survival, and may decide whether to select that gene set as a biomarker based on the significance of the calculated hazard ratio.

The biomarker may be a combination of gene sets whose prognostic significance is at or above a reference level, among the gene sets commonly expressed in each patient group.

According to an embodiment of the present invention, the problem that conventional diagnostic kits overfit to their labeled data is solved, and cancer-associated gene sets subdivided according to the genetic characteristics of individual cancer patients can be extracted. Because the biomarker according to an embodiment of the present invention is composed of cancer-associated gene sets extracted from the genetic characteristics of diverse patients, it enables personalized prediction of cancer recurrence prognosis.

A diagnostic agent or diagnostic kit developed using the biomarker according to an embodiment of the present invention offers a high return on investment and can generate revenue comparable to that of a new drug. Through the biomarker according to an embodiment of the present invention, BT-based biomarker content related to molecular biological information such as genomics, proteomics, and epigenomics can be combined with IT-based hardware and services, such as healthcare-related medical devices and web-based health management services, and thereby contribute to the development of the healthcare and pharmaceutical industries.

FIG. 1 is a configuration diagram of a biomarker discovery system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating, by way of example, a biomarker discovery method according to an embodiment of the present invention.
FIG. 3 and FIG. 4 are each a flowchart of a biomarker discovery method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", and "module" used in the specification refer to a unit that processes at least one function or operation, and such a unit may be implemented in hardware, in software, or in a combination of hardware and software.

A biomarker is an indicator that can reveal changes in the body through specific molecular information derived from proteins or genes. Biomarkers are used to diagnose various intractable diseases, including cancer, stroke, and dementia, to predict recurrence prognosis, and to develop new drugs.

Among the various cancers, breast cancer is one whose molecular subtypes have been studied extensively. It is traditionally classified as Luminal A, Luminal B, HER2-overexpressed, or Basal-like (triple-negative) according to the expression of three proteins: estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). The treatment is chosen according to the subtype; for example, patients with the HER2-overexpressed subtype are given Trastuzumab to inhibit HER2. Even with such a targeted treatment strategy, however, Trastuzumab shows a drug response rate of about 26%, and about 65% when used together with drugs such as Doxorubicin or Docetaxel. Because of this heterogeneity of cancer, diagnostic kits that predict the recurrence prognosis of cancer patients are used. Diagnostic kits such as Mammaprint, Prosigna, and Oncotype DX use biomarkers that predict recurrence prognosis from the expression of mRNA extracted from the breast cancer tissue of breast cancer patients.

However, existing diagnostic kits classify patients using the full set of genes analyzed as being associated with a specific cancer (analyzed genes), that is, labeled data for 70 genes (Mammaprint), 50 genes (Prosigna), or 21 genes (Oncotype DX), so classification is limited to genes that are already known. Furthermore, because existing kits discriminate patients using the entire set of genes associated with a specific cancer, they cannot reflect the genetic characteristics of individual patient groups.

The following describes in detail a system and method for discovering biomarkers that address these problems of the prior art.

FIG. 1 is a configuration diagram of a biomarker discovery system according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating, by way of example, a biomarker discovery method according to an embodiment of the present invention.

Referring to FIG. 1, the biomarker discovery system 10 analyzes gene expression data of a large number of cancer patients to discover the gene sets expressed in each patient group of a specific cancer, and thereby discovers a biomarker with clinical characteristics capable of predicting the recurrence prognosis of the cancer. To this end, the biomarker discovery system 10 includes a cancer-associated gene set extraction apparatus 100 and a biomarker selection apparatus for cancer recurrence prognosis prediction (hereinafter simply the "biomarker selection apparatus") 200. The biomarker discovery system 10 may obtain the gene expression data it needs by interworking with at least one public gene expression database 300 (for example, Gene Expression Omnibus).

The cancer-associated gene set extraction apparatus 100 extracts a plurality of candidate gene sets that may serve as biomarkers; it clusters the gene expression data of a large number of cancer patients and extracts the gene set commonly expressed in each patient group. The cancer-associated gene set extraction apparatus 100 is operated by at least one processor and includes a gene expression data collection unit 110 for a specific cancer, a gene expression data clustering unit 130, and a cancer-associated gene set determination unit 150.

The gene expression data collection unit 110 obtains gene expression data of patients with a specific type of cancer from the gene expression database 300, and normalizes and stores the expression level of each data item. For example, the gene expression data collection unit 110 may use keywords such as breast, tumor, cancer, and neoplasm to obtain, from the gene expression database 300, gene expression data extracted from patients' breast cancer tissue. The gene expression data collection unit 110 may preprocess the data by applying quantile normalization to the expression levels.
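The quantile normalization step can be sketched as follows. This is a minimal illustration rather than the collection unit's actual implementation: the function name `quantile_normalize`, the dictionary layout (sample ID mapped to per-gene values in a fixed gene order), and the tie-breaking by gene order are all assumptions made for the example.

```python
from typing import Dict, List


def quantile_normalize(samples: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Quantile-normalize expression vectors so every sample shares one distribution.

    `samples` maps a (hypothetical) sample ID to its expression values,
    one per gene, with genes in the same order for every sample.
    """
    ids = list(samples)
    n_genes = len(samples[ids[0]])
    # Sort each sample's values, then average across samples rank by rank.
    sorted_cols = [sorted(samples[s]) for s in ids]
    rank_means = [
        sum(col[k] for col in sorted_cols) / len(ids) for k in range(n_genes)
    ]
    normalized: Dict[str, List[float]] = {}
    for s in ids:
        # Order gene indices by expression; ties are broken by gene order.
        order = sorted(range(n_genes), key=lambda g: samples[s][g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            # Each gene receives the cross-sample mean of its rank.
            out[gene] = rank_means[rank]
        normalized[s] = out
    return normalized
```

After normalization every sample contains the same set of values, assigned according to each gene's rank within that sample, which makes expression levels from different experiments comparable before clustering.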

The gene expression data clustering unit 130 analyzes the gene expression data of all patients and clusters the data that are similar to one another. Because the gene expression data clustering unit 130 generates clusters (patient groups) based on the similarity among the data, each patient may be classified into one or more clusters. For example, referring to FIG. 2, the gene expression data of patient P1 may be classified into a plurality of clusters (C1, C2, C3). In this case, patient P1 is predicted to have gene expression data similar to that of another patient P2 in cluster C1, and also similar to that of another patient P3 in cluster C2. Since patients P2 and P3 do not belong to the same cluster, they can be regarded as having the same type of cancer but different gene expression data.
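The overlapping membership in the FIG. 2 example can be made concrete with a small sketch. The cluster contents below mirror only what the example states (P1 in C1, C2, and C3; P2 in C1; P3 in C2); the helper `shared_clusters` and the assumption that C3 contains no other named patient are illustrative.

```python
from typing import Dict, Set


def shared_clusters(clusters: Dict[str, Set[str]], a: str, b: str) -> Set[str]:
    """Return the names of the clusters that contain both patient a and patient b."""
    return {name for name, members in clusters.items() if a in members and b in members}


# Cluster membership as described for FIG. 2 (other members are omitted).
clusters = {
    "C1": {"P1", "P2"},
    "C2": {"P1", "P3"},
    "C3": {"P1"},
}

# P1 shares a cluster with P2 and with P3, but P2 and P3 share none.
p1_p2 = shared_clusters(clusters, "P1", "P2")
p1_p3 = shared_clusters(clusters, "P1", "P3")
p2_p3 = shared_clusters(clusters, "P2", "P3")
```

Here `p1_p2` and `p1_p3` are non-empty while `p2_p3` is empty, which is the situation the paragraph describes: similarity to a common patient does not imply that two patients are similar to each other.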

The gene expression data clustering unit 130 may classify the gene expression data (samples) into a plurality of clusters using a biclustering technique. Each cluster may consist of data having a Pearson correlation coefficient of 0.9 or more and a significance probability (P value) of 0.05 or less. The gene expression data clustering unit 130 may obtain the P value from the expression difference calculated on the basis of the t-test and the false discovery rate. The Pearson correlation coefficient (r) may be calculated as in Equation 1, and the t-test as in Equation 2. Equation 1 calculates the Pearson correlation coefficient r for two vectors $X=\{X_1, X_2, \ldots, X_n\}$ and $Y=\{Y_1, Y_2, \ldots, Y_n\}$ with mean values $\bar{X}$ and $\bar{Y}$. Equation 2 calculates the value t for two vectors $X_1=\{X_{11}, X_{12}, \ldots, X_{1n}\}$ and $X_2=\{X_{21}, X_{22}, \ldots, X_{2n}\}$ with mean values $\bar{X}_1$ and $\bar{X}_2$, where $s_p$ is the standard deviation calculated from the variance $s^2$ of each vector.

[Equation 1]

[Equation 2]
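The equation bodies are not present in this text. For orientation, the standard forms that match the symbols defined above are sketched below. The Pearson formula is the usual definition; the t statistic is written in the equal-sample-size pooled form, which is an assumption inferred from both vectors having length n and from the description of $s_p$.

```latex
% Equation 1 (standard Pearson correlation coefficient)
r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\;\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}

% Equation 2 (two-sample t statistic, equal sample sizes n; assumed form)
t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{2/n}},
\qquad
s_p = \sqrt{\frac{s_{X_1}^2 + s_{X_2}^2}{2}}
```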

The cancer-associated gene set determination unit 150 extracts the genes commonly expressed in each cluster as the gene set of that cluster (patient group). Referring again to FIG. 2, the cancer-associated gene set determination unit 150 extracts a cancer-associated gene set for each cluster (for example, a C1 gene set, a C2 gene set, a C3 gene set, and so on).

Commonly expressed genes may be extracted in various ways. For example, the cancer-associated gene set determination unit 150 first extracts the expressed genes from the gene expression data belonging to each cluster. Then, among the expressed genes of each cluster, the cancer-associated gene set determination unit 150 may determine as the cancer-associated gene set of that cluster those genes that are commonly expressed in a plurality of gene expression data (for example, five or more) and have a Pearson correlation coefficient equal to or greater than a reference value (for example, 0.7).
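One possible reading of this two-stage filter is sketched below. The text does not say how "expressed" is decided or against what the Pearson coefficient is measured, so the expression cutoff `expressed_above` and the choice to correlate each gene with the mean profile of the candidate genes are assumptions made for illustration; only the defaults of five samples and 0.7 come from the example values above.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile is treated as uncorrelated
    return cov / sqrt(vx * vy)


def cluster_gene_set(
    expr: Dict[str, List[float]],
    expressed_above: float = 1.0,  # assumed cutoff for "expressed"
    min_samples: int = 5,          # example reference count from the text
    min_r: float = 0.7,            # example reference correlation from the text
) -> List[str]:
    """Pick the genes of one cluster that pass both filters.

    `expr` maps a gene name to its expression values across the cluster's samples.
    """
    # Stage 1: genes expressed in at least `min_samples` samples.
    candidates = [
        g for g, values in expr.items()
        if sum(v > expressed_above for v in values) >= min_samples
    ]
    if not candidates:
        return []
    # Stage 2 (assumed): correlate each candidate with the mean candidate profile.
    n = len(expr[candidates[0]])
    centroid = [sum(expr[g][i] for g in candidates) / len(candidates) for i in range(n)]
    return sorted(g for g in candidates if pearson(expr[g], centroid) >= min_r)
```

Other choices for the second stage, such as requiring a minimum pairwise correlation among genes, would fit the description equally well.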

The cancer-associated gene set determination unit 150 extracts the plurality of cancer-associated gene sets, made up of the cancer-associated gene set of each cluster, as biomarker candidates.

The biomarker selection apparatus 200 evaluates the prognostic significance of the plurality of cancer-associated gene sets extracted by the cancer-associated gene set extraction apparatus 100, and selects at least one statistically significant gene set among them as the biomarker for predicting cancer recurrence prognosis.

The biomarker selection apparatus 200 identifies the cancer recurrence prognosis-related information from the gene expression data obtained from the gene expression database 300. Because gene expression data are extracted from patient tissue, they may include cancer recurrence prognosis-related information such as whether each patient's cancer recurred and the survival time.

The biomarker selection apparatus 200 calculates a prognostic index for each cancer-associated gene set based on the cancer recurrence prognosis-related information. The prognostic index may be the hazard ratio; that is, the biomarker selection apparatus 200 calculates a hazard ratio for each gene set (cluster). The hazard ratio of a cancer-associated gene set may be calculated with the Cox proportional hazards model, which analyzes the influence of factors that affect survival, using Equations 3 to 5.

Equation 3 is the hazard function h(t), where T is the random variable for survival time and t is a specific value of the random variable T. h₀ is the baseline hazard function, which corresponds to the case X₁=0 and X₂=0.

[Equation 3]

[Equation 4]

[Equation 5]
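The bodies of Equations 3 to 5 are not present in this text. As a reference point only, the standard statement of the Cox proportional hazards model with two covariates, using the symbols defined above, is sketched below. The coefficients $b_1$ and $b_2$, and the assignment of each formula to an equation number, are assumptions rather than a reconstruction of the original figures.

```latex
% Hazard function of the survival-time random variable T at a specific value t
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

% Cox proportional hazards model with covariates X_1, X_2;
% h_0(t) is the baseline hazard, i.e. the case X_1 = 0, X_2 = 0
h(t \mid X_1, X_2) = h_0(t)\,\exp(b_1 X_1 + b_2 X_2)

% Hazard ratio for a one-unit increase in X_1 with X_2 held fixed
\mathrm{HR} = \frac{h(t \mid X_1 + 1, X_2)}{h(t \mid X_1, X_2)} = \exp(b_1)
```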

The biomarker selection apparatus 200 calculates a P value based on the log-rank test to evaluate whether the hazard ratio of each gene set is significant. The biomarker selection apparatus 200 determines the gene set whose hazard ratio has the lowest P value as the biomarker for the specific cancer. Depending on the case, the biomarker selection apparatus 200 may determine a plurality of gene sets whose hazard ratio P value is at or below a reference value as the biomarker for the specific cancer.
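A two-group log-rank test can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the selection apparatus's code: the text does not say how patients are divided into groups for a gene set, so the function simply takes two groups of (time, event) observations, for example patients with high and low expression of the gene set. The name `logrank_p` and the data layout are hypothetical.

```python
from math import erfc, sqrt
from typing import List, Tuple

# (follow-up time, True if recurrence was observed, False if censored)
Obs = Tuple[float, bool]


def logrank_p(group_a: List[Obs], group_b: List[Obs]) -> float:
    """Two-group log-rank test; returns the P value (chi-square, 1 d.f.)."""
    everyone = group_a + group_b
    event_times = sorted({t for t, event in everyone if event})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t in each group.
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Events occurring exactly at time t.
        d_a = sum(1 for time, event in group_a if event and time == t)
        d_b = sum(1 for time, event in group_b if event and time == t)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0  # no information to separate the groups
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))
```

A small P value indicates that the recurrence-free survival curves of the two groups differ, which is the sense in which a gene set's hazard ratio is judged significant here.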

Depending on the amount of collected information, the biomarker selection apparatus 200 may perform tens of thousands to hundreds of thousands of significance calculations, and may process them in parallel using distributed computing for high-speed computation.

FIG. 3 and FIG. 4 are each a flowchart of a biomarker discovery method according to an embodiment of the present invention.

Referring to FIG. 3, the cancer-associated gene set extraction apparatus 100 collects gene expression data of patients with a specific type of cancer (S110).

The cancer-associated gene set extraction apparatus 100 analyzes the similarity of the collected gene expression data and classifies the data into a plurality of clusters (S120).

The cancer-associated gene set extraction apparatus 100 extracts the genes commonly expressed in each cluster as the cancer-associated gene set of that cluster (S130).

The cancer-associated gene set extraction apparatus 100 extracts the plurality of cancer-associated gene sets, made up of the cancer-associated gene set of each cluster, as biomarker candidates (S140).

Referring to FIG. 4, the biomarker selection apparatus 200 receives, as biomarker candidates, the plurality of cancer-associated gene sets made up of the cancer-associated gene set of each cluster (S210).

The biomarker selection apparatus 200 calculates the hazard ratio of each cancer-associated gene set based on the cancer recurrence prognosis-related information of the gene expression data (S220).

The biomarker selection apparatus 200 evaluates whether the hazard ratio of each cancer-associated gene set is significant (S230).

The biomarker selection apparatus 200 selects, as the biomarker for the specific cancer, at least one gene set with a significant hazard ratio among the plurality of cancer-associated gene sets (S240). The biomarker includes the gene set whose hazard ratio has the lowest P value. The biomarker may also be composed of a plurality of gene sets whose hazard ratio P value is at or below a reference value.
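The selection in step S240 can be sketched as a small function that takes the P value of each candidate gene set and returns both the single best set and every set at or below the reference value. The function name `select_biomarkers` and the default reference value of 0.05 are assumptions; the text does not state the reference value used for selection.

```python
from typing import Dict, List, Optional, Tuple


def select_biomarkers(
    p_values: Dict[str, float],
    reference: float = 0.05,  # assumed reference value, not given in the text
) -> Tuple[Optional[str], List[str]]:
    """Return (gene set with the lowest P value, all gene sets with P <= reference).

    `p_values` maps a candidate gene set name to the P value of its hazard ratio.
    The single best set is reported only if it is itself significant.
    """
    significant = sorted(
        (name for name, p in p_values.items() if p <= reference),
        key=lambda name: p_values[name],
    )
    best = significant[0] if significant else None
    return best, significant
```

For instance, `select_biomarkers({"C1": 0.20, "C2": 0.003, "C3": 0.04})` picks the C2 gene set as the single biomarker and the C2 and C3 gene sets as the combined biomarker.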

As described above, according to an embodiment of the present invention, the gene sets commonly expressed in each patient group are discovered by analyzing gene expression data of a large number of cancer patients, so gene sets that reflect diverse genetic characteristics can be obtained. From these, a biomarker can be discovered that predicts the recurrence prognosis of cancer in patients with diverse genetic characteristics. In connection with cancer diagnosis, the biomarker according to an embodiment of the present invention ultimately makes it possible to establish a treatment strategy suited to the patient's condition.

The embodiments of the present invention described above are not implemented only through an apparatus and a method; they may also be implemented through a program that realizes the functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

A method by which a system operated by at least one processor discovers a biomarker for predicting cancer recurrence prognosis, the method comprising:
collecting large-scale gene expression data of patients with a specific type of cancer;
classifying the large-scale gene expression data into a plurality of clusters by analyzing similarity among the large-scale gene expression data;
extracting genes commonly expressed in the gene expression data classified into each cluster, and determining the extracted genes as the cancer-associated gene set of that cluster; and
selecting, as a biomarker, at least one gene set from among a plurality of cancer-associated gene sets consisting of the cancer-associated gene set of each cluster, based on a hazard ratio calculated from the cancer recurrence prognosis-related information of each cancer-associated gene set.

The method of claim 1,
wherein the step of classifying into the plurality of clusters comprises
bi-clustering the large-scale gene expression data to classify arbitrary gene expression data into at least one cluster.

The method of claim 2,
wherein the step of classifying into the plurality of clusters comprises
clustering the large-scale gene expression data such that each cluster has a Pearson correlation coefficient of 0.9 or more and a significance P value of 0.05 or less.
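The similarity criterion in this claim can be illustrated with a pairwise check. This is a minimal sketch and not a bi-clustering algorithm: it only tests whether two expression profiles meet the stated thresholds (Pearson correlation coefficient of 0.9 or more, P value of 0.05 or less). The source does not say how the P value is obtained, so the sketch assumes the Fisher z-transform normal approximation. The function names and the toy profiles are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided P value via the Fisher z-transform (assumed method, n > 3)."""
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation; atanh would be undefined
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_similar(x, y, r_min=0.9, p_max=0.05):
    """True when the pair meets both thresholds stated in the claim."""
    r = pearson_r(x, y)
    return r >= r_min and approx_p_value(r, len(x)) <= p_max


# Toy expression profiles (made-up values)
profile_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
profile_b = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8]
profile_c = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
```

Here `is_similar(profile_a, profile_b)` holds because the two profiles rise together, while `profile_c` correlates too weakly with `profile_a` to pass the 0.9 threshold.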

The method of claim 1,
wherein the step of determining the cancer-associated gene set of the cluster comprises
extracting expressed genes from the gene expression data classified into each cluster, and determining, as the cancer-associated gene set, those genes among the expressed genes of each cluster that have a Pearson correlation coefficient equal to or higher than a reference value and are commonly expressed in at least a reference number of gene expression data.
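The "commonly expressed in at least a reference number of gene expression data" condition can be sketched as a simple count over the samples of one cluster. This is a minimal illustration under stated assumptions: each sample is represented as the set of genes considered expressed in it, the Pearson correlation criterion of the claim is left out, and the function name, the `min_samples` parameter, and the gene symbols used as placeholders are hypothetical.

```python
from collections import Counter


def common_gene_set(expressed_by_sample, min_samples):
    """Return genes expressed in at least `min_samples` samples of a cluster."""
    counts = Counter(
        gene for genes in expressed_by_sample for gene in set(genes)
    )
    return {gene for gene, count in counts.items() if count >= min_samples}


# One toy cluster: each entry is the set of genes expressed in one sample
cluster = [
    {"TP53", "BRCA1", "MYC"},
    {"TP53", "MYC", "EGFR"},
    {"TP53", "BRCA1", "MYC"},
]
gene_set = common_gene_set(cluster, min_samples=3)
```

Lowering `min_samples` widens the gene set; with `min_samples=2` the toy cluster also yields `BRCA1`.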

The method of claim 1,
wherein the step of collecting the large-scale gene expression data comprises
normalizing and storing the expression level of each gene expression data.

The method of claim 1,
wherein the cancer recurrence prognosis-related information includes the recurrence and survival time of each patient and is obtained from the gene expression data of each cancer-associated gene set.

The method of claim 1,
wherein the step of selecting the biomarker comprises
determining whether the hazard ratio of each cancer-associated gene set is significant, and selecting at least one gene set having a significant hazard ratio among the plurality of cancer-associated gene sets as the biomarker.

A method by which a system operated by at least one processor discovers a biomarker for predicting cancer recurrence prognosis, the method comprising:
classifying large-scale gene expression data of patients with a specific type of cancer into a plurality of clusters using a bi-clustering technique;
determining genes commonly expressed in each of the plurality of clusters as the cancer-associated gene set of that cluster;
determining a plurality of cancer-associated gene sets consisting of the cancer-associated gene set of each cluster as biomarker candidates; and
selecting the biomarker by evaluating the prognostic significance of each cancer-associated gene set determined as a biomarker candidate, based on cancer recurrence prognosis information included in each gene expression data.

The method of claim 8,
wherein the step of classifying into the plurality of clusters comprises
classifying arbitrary gene expression data among the large-scale gene expression data into at least one cluster.

The method of claim 8,
wherein the cancer recurrence prognosis information includes a recurrence rate and a survival time of each patient.

The method of claim 9,
wherein the step of selecting the biomarker comprises
calculating, based on the cancer recurrence prognosis information of an arbitrary cancer-associated gene set determined as a biomarker candidate, a hazard ratio indicating the degree to which that gene set affects survival, and determining whether to select that gene set as the biomarker based on the significance of the calculated hazard ratio.

A system for discovering biomarkers for predicting cancer recurrence prognosis, the system comprising:
a cancer-associated gene set extraction apparatus that classifies cancer patients based on large-scale gene expression data of patients with a specific type of cancer and extracts a gene set commonly expressed in each patient group; and
a biomarker selection apparatus that evaluates the prognostic significance of each gene set, among a plurality of gene sets consisting of the gene set of each patient group, based on the cancer recurrence prognosis information of that gene set, and selects at least one of the plurality of gene sets as a biomarker.

The system of claim 12,
wherein the cancer-associated gene set extraction apparatus
classifies the large-scale gene expression data into a plurality of patient groups by bi-clustering the large-scale gene expression data.

The system of claim 12,
wherein the cancer recurrence prognosis information includes the recurrence rate and the survival time of each patient, and
the biomarker selection apparatus
calculates, based on the cancer recurrence prognosis information of each gene set, a hazard ratio indicating the degree to which the gene set affects survival, and determines whether to select the gene set as the biomarker based on the significance of the calculated hazard ratio.

The system of claim 12,
wherein the biomarker
consists of gene sets, among the gene sets commonly expressed for each patient group, whose prognostic significance is equal to or greater than a reference.