KR20160089981A

KR20160089981A - Device and method for evaluating performance of cancer biomarker

Info

Publication number: KR20160089981A
Application number: KR1020150009686A
Authority: KR
Inventors: 박태성; 권민석; 이성영
Original assignee: 서울대학교산학협력단
Priority date: 2015-01-21
Filing date: 2015-01-21
Publication date: 2016-07-29
Also published as: KR101765999B1; WO2016117812A1

Abstract

According to an embodiment of the present invention, a device for evaluating the performance of a cancer biomarker comprises: a preprocessing module that collects cancer expression datasets from public DBs and preprocesses them; a database module that rearranges the preprocessed cancer expression datasets to construct a new cancer information DB; and a performance evaluation module that evaluates the performance of a cancer biomarker for each cancer type using the constructed cancer information DB. The device can thus evaluate a stored prediction model on any selected testing dataset and perform leave-one-out cross-validation (LOOCV) on a selected dataset. In addition, the device can perform evaluation using a user-provided training dataset and a selected testing dataset, and can evaluate multi-markers using selected training and testing datasets.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an apparatus and a method for evaluating the performance of cancer biomarkers, and more particularly to an apparatus and a method that evaluate single and/or multiple genes as biomarker candidates.

More than 200 types of cancer arise in over 60 different tissues of the human body. A cancer type is defined by its tissue, the cell type at the primary site, and the gene transcriptional alterations that drive tumor progression and influence treatment response. Cancers of different tissues can share many features, such as treatment response, whereas subtypes of cancer from the same tissue often differ considerably. These characteristics depend on the molecular patterns of single or multiple genes in each cancer type.

Transcriptome analysis of molecular patterns across all cancer types can yield etiological and therapeutic knowledge from one cancer type that applies to another. For example, ERBB2 (HER2) is amplified in subsets of glioblastoma and of gastric, serous endometrial, bladder, and lung cancers. In some cases these tumors respond to HER2-targeted therapy in a way similar to what was previously observed in HER2-amplified breast cancer.

Despite this one success story (ERBB2), fewer than 100 of the countless tumor biomarkers reported in the literature have been approved. To address the bench-to-bedside problem of clinical utility (that is, applying laboratory findings to patients as quickly as possible), various standards and guidelines have been proposed with the aim of increasing clinical adoption, which has made the development process more difficult.

In particular, of the 84 biomarkers approved since 1994, 53 were used in only one test and 30 in more than one test, and the average time to approval was 15 years. Moreover, despite the widespread use of several prognostic biomarkers (for example, CA125 for ovarian cancer and CA19-9 for pancreatic cancer), little is known about the exact role of these proteins in the progression of the respective tumors.

Screening with diagnostic biomarkers for most common cancers generally requires a high bar of 95% specificity and 95% sensitivity. A single biomarker usually cannot meet this bar, but a combination of several biomarkers (a "biomarker panel") can.

Human gene expression data in the form of "big data" will continue to accumulate in repositories such as GEO, TCGA, ICGC, ArrayExpress, and the Pan-Cancer initiative. An integrated analysis method is therefore needed for discovering candidate biomarkers and biomarker panels from these stored human gene expression data.

1. US Patent Application Publication No. 2009-0269856
2. Korean Patent Application Publication No. 2010-0131435

1. Nam S, et al. PATHOME: an algorithm for accurately detecting differentially expressed subpathways. Oncogene 33, 4941-4951 (2014).

An object of the present invention is to provide an apparatus and a method for evaluating the performance of cancer biomarkers, in which single and/or multiple genes are evaluated as cancer biomarker candidates.

To achieve the above object, the present invention provides an apparatus for evaluating the performance of cancer biomarkers, comprising: a preprocessing module that collects cancer expression datasets from public DBs and preprocesses them; a database module that rearranges the preprocessed cancer expression datasets to construct a new cancer information DB; and a performance evaluation module that evaluates the performance of cancer biomarkers for each cancer type using the constructed cancer information DB.

In one embodiment of the present invention, the apparatus may further include a web-interface module that displays the performance evaluation results of the cancer biomarkers for each cancer type as a mapping, a graph, or a table.

In one embodiment of the present invention, the preprocessing module may analyze the collected cancer expression datasets and normalize them using quantile normalization and robust multi-array average (RMA) normalization.

In one embodiment of the present invention, the preprocessing module may use normalized counts as expression values for RNA expression data obtained by RNA sequencing from the TCGA public DB.

In one embodiment of the present invention, the preprocessing module may detect outliers using within-group correlation and between-group correlation.

In one embodiment of the present invention, the preprocessing module may exclude samples flagged as outliers in response to external input from the user.

In one embodiment of the present invention, when the collected cancer expression datasets include clinical information and sample annotations, the preprocessing module may define the precise cancer type or subtype by analyzing diagnosis, prognosis, and drug response information and redesigning the data as a new dataset.

In one embodiment of the present invention, the database module may include the preprocessed expression datasets and the corresponding annotation data.

In one embodiment of the present invention, the database module may store all expression data as customized indexed binary files.

In one embodiment of the present invention, for a user-selected multi-marker, the performance evaluation module may provide evaluation results that include at least one of area under the curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), and F1 score.

In one embodiment of the present invention, to measure the contribution of a single marker to the performance of a multi-marker, the performance evaluation module may provide the difference between the balanced accuracy (BA) obtained with all markers and the BA obtained with all markers except that single marker.

In one embodiment of the present invention, the web-interface module may include an input layout that transfers the user-selected multi-marker and the queried parameters to the performance evaluation module, and a result explorer that provides table and graph visualizations of the performance evaluation results.

In one embodiment of the present invention, the input layout may allow a preprocessed public dataset or a user-provided individual dataset to be selected as the training dataset.

The present invention also provides a method for evaluating the performance of cancer biomarkers, comprising: collecting cancer expression datasets from public DBs and preprocessing them; rearranging the preprocessed cancer expression datasets to construct a new cancer information DB; and evaluating the performance of cancer biomarkers for each cancer type using the constructed cancer information DB.

In one embodiment of the present invention, the method may further include a step of displaying the performance evaluation results of the cancer biomarkers for each cancer type as a mapping, a graph, or a table.

In one embodiment of the present invention, in the preprocessing step, the collected cancer expression datasets may be analyzed and normalized using quantile normalization and robust multi-array average (RMA) normalization.

In one embodiment of the present invention, in the preprocessing step, outliers may be detected using within-group correlation and between-group correlation.

In one embodiment of the present invention, in the preprocessing step, when the collected cancer expression datasets include clinical information and sample annotations, the precise cancer type or subtype may be defined by analyzing diagnosis, prognosis, and drug response information and redesigning the data as a new dataset.

In one embodiment of the present invention, in the step of constructing the new cancer information DB, all expression data may be stored as customized indexed binary files.

In one embodiment of the present invention, in the performance evaluation step, for a user-selected multi-marker, evaluation results may be provided that include at least one of area under the curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), and F1 score.

In one embodiment of the present invention, in the performance evaluation step, to measure the contribution of a single marker to the performance of a multi-marker, the difference between the balanced accuracy (BA) obtained with all markers and the BA obtained with all markers except that single marker may be provided.

In one embodiment of the present invention, in the step of displaying the performance evaluation results, at least one of the following evaluations may be performed: evaluation of a multi-marker using selected training and testing datasets; evaluation using a stored prediction model on any selected testing dataset; leave-one-out cross-validation (LOOCV) evaluation using a selected dataset; and evaluation using a user-provided training dataset and a selected testing dataset.

In one embodiment of the present invention, in the step of displaying the performance evaluation results, a preprocessed public dataset or a user-provided individual dataset may be selected as the training dataset.

The apparatus and method for evaluating the performance of cancer biomarkers according to the present invention can evaluate multi-markers using selected training and testing datasets, and can evaluate a stored prediction model on any selected testing dataset. They can also perform leave-one-out cross-validation (LOOCV) on a selected dataset, and can perform evaluation using a user-provided training dataset and a selected testing dataset.

FIG. 1 is a functional block diagram of an apparatus for evaluating the performance of cancer biomarkers according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for evaluating the performance of cancer biomarkers according to an embodiment of the present invention.
FIG. 3 shows the process of evaluating the performance of cancer biomarkers according to an embodiment of the present invention, divided into functional blocks.
FIG. 4 shows the overall process of evaluating the performance of single-cancer-marker and/or multi-cancer-marker candidates on datasets for 18 tumor types according to an embodiment of the present invention.
FIGS. 5A to 5C show the CANES evaluation report for a multi-marker of seven breast cancer biomarkers, classified by a support vector machine with leave-one-out cross-validation using the test dataset and a lung cancer dataset.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. Detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the present invention, and terms written in the singular may also cover the plural. Preferred embodiments are described below, but the technical idea of the present invention is not limited to them and may be modified and practiced in various ways by those skilled in the art.

Over the past decade, repositories holding large amounts of cancer-related gene expression data in the form of "big data" have been developed continuously. Despite the great potential of these datasets, very few biomarkers have been approved in the cancer field, because a consistent method for clinically evaluating single and multiple biomarkers is still lacking.

The present inventors therefore developed the CANcer-specific multi-marker Evaluation System (CANES), a comprehensive evaluation scheme for the early stage of evaluating single and/or multiple genes as biomarker candidates using various classification methods.

CANES is an intuitive, clinically valuable, web-based cancer-specific multi-marker evaluation system for biomarker assessment. It applies classification methods to a total of 2,134 transcriptome datasets comprising 94,147 samples (cell lines, normal tissue, and cancer tissue). CANES measures the diagnostic and prognostic power of single and multiple cancer markers for 18 cancer types through rigorous evaluation with support vector machines, random forests, neural networks, and classification and regression trees. For user-provided multi-markers, CANES provides simplified evaluation results and graphical visualizations. Finally, the inventors demonstrated the utility of CANES by carrying out two analyses of biomarker sets that had not been studied before.

Hereinafter, an apparatus and a method for evaluating the performance of cancer biomarkers according to an embodiment of the present invention are described with reference to the accompanying drawings.

FIG. 1 is a functional block diagram of an apparatus for evaluating the performance of cancer biomarkers according to an embodiment of the present invention, and FIG. 2 is a flowchart of a method for evaluating the performance of cancer biomarkers according to an embodiment of the present invention.

Referring to FIGS. 1 and 2, the apparatus (100) for evaluating the performance of cancer biomarkers according to an embodiment of the present invention includes a preprocessing module (110) that collects cancer expression datasets from public DBs and preprocesses them, a database module (120) that rearranges the preprocessed cancer expression datasets to construct a new cancer information DB, and a performance evaluation module (130) that evaluates the performance of cancer biomarkers for each cancer type using the constructed cancer information DB. Preferably, the apparatus (100) may further include a web-interface module (140) that displays the performance evaluation results for each cancer type as a mapping, a graph, or a table.

The inventors named the apparatus (100) CANES; hereinafter, CANES (100) refers to the performance evaluation apparatus (100).

To illustrate the utility of CANES (100), the inventors re-evaluated the potential of seven known breast cancer markers on lung cancer data, which shows how CANES can be used to carry a biomarker for one cancer type over to another cancer type.

FIG. 3 shows the process of evaluating the performance of cancer biomarkers according to an embodiment of the present invention, divided into functional blocks. FIG. 4 shows the overall process of evaluating the performance of single-cancer-marker and/or multi-cancer-marker candidates on datasets for 18 tumor types. FIGS. 5A to 5C show the CANES evaluation report for a multi-marker of seven breast cancer biomarkers, classified by a support vector machine with leave-one-out cross-validation using the test dataset and a lung cancer dataset.

The apparatus (100) for evaluating the performance of cancer biomarkers according to an embodiment of the present invention is described in more detail below with reference to FIGS. 1 to 5C.

CANES (100) collects RNA molecular profiles obtained from public databases and reclassifies them by tumor type through a mapping process and a quality control procedure. Using the public expression datasets, CANES (100) provides evaluation results for user-specified multi-markers across various cancer types or studies. As shown in FIG. 1, CANES (100) has four modules: the preprocessing module (110), the database module (120), the performance evaluation module (130), and the web-interface module (140).

The four modules of CANES (100) are described in more detail below.

1. Preprocessing module (110)

The current version of CANES (100) uses microarray data obtained from two public repositories and two cancer consortia: GEO, ArrayExpress, TCGA, and ICGC. All expression datasets can be collected using R packages. In the preprocessing module (110), the expression datasets obtained from the public repositories are analyzed and normalized using quantile normalization and robust multi-array average (RMA) normalization. For RNA expression data obtained by RNA sequencing in the TCGA repository, normalized counts were used as expression values. All datasets with a missing rate above 5% were excluded, and missing values in the remaining datasets were imputed with the Bioconductor "impute" package. To detect outliers caused by instrument error, sample contamination, mislabeling, or misprocessing, outlier detection was performed using within-group correlation and between-group correlation. Because every sample detected as an outlier is flagged, users can exclude these samples from their analyses through a dedicated option. In addition, when clinical information and sample annotations are available, they are parsed into the CANES database. To define the precise cancer type or subtype, diagnosis, prognosis, and drug response were verified or re-categorized as the design of each dataset. All processed expression data are converted into customized indexed binary files for fast preprocessing. The preprocessing module was implemented in Python and R.
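The preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration, not the CANES implementation: the function names are hypothetical, ties in quantile normalization are broken by row order, KNN imputation and RMA are omitted, and the outlier rule (flag a sample whose mean within-group correlation does not exceed its mean between-group correlation) is an assumption, since the text does not state the exact criterion.

```python
from statistics import mean

MAX_MISSING_RATE = 0.05  # datasets above this missing rate are excluded (from the text)


def missing_rate(matrix):
    """Fraction of missing (None) entries in a genes x samples matrix."""
    total = sum(len(row) for row in matrix)
    missing = sum(value is None for row in matrix for value in row)
    return missing / total if total else 0.0


def quantile_normalize(matrix):
    """Quantile-normalize a complete genes x samples matrix (list of rows).

    Every value is replaced by the mean, across samples, of the values that
    share its rank. Ties are broken by row order (a simplification).
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[g][s] for g in range(n_genes)] for s in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for s, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda g: col[g])
        for rank, g in enumerate(order):
            result[g][s] = rank_means[rank]
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def flag_outliers(matrix, groups):
    """Flag samples that correlate no better with their own group than with others.

    Assumed rule for illustration only; groups[i] is the label of sample i.
    """
    n_samples = len(groups)
    profiles = [[row[s] for row in matrix] for s in range(n_samples)]
    flags = []
    for i in range(n_samples):
        within = [pearson(profiles[i], profiles[j])
                  for j in range(n_samples) if j != i and groups[j] == groups[i]]
        between = [pearson(profiles[i], profiles[j])
                   for j in range(n_samples) if groups[j] != groups[i]]
        flags.append(bool(within) and bool(between) and mean(within) <= mean(between))
    return flags
```

Flagging rather than deleting outliers matches the text: the flags can be stored with the sample annotations so that exclusion stays a user option.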

2. Database module (120)

The database module (120) contains the preprocessed expression datasets and the corresponding annotation data. Referring to Table 1, the preprocessed datasets consist of gene expression data and annotation data for 94,147 samples. The gene expression data were obtained from these samples on widely used gene expression microarray platforms and RNA sequencing platforms and then processed as described. The database module (120) was implemented using MySQL and Python. All expression data are stored as customized indexed binary files. Table 1 below shows the number of samples in CANES.
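The text does not describe the layout of the customized indexed binary files, so the following is only a sketch of one way such a file could work. The layout (a 4-byte index length, a JSON index of gene and sample names, then one little-endian float32 row per gene) and the names `write_indexed` and `read_gene` are assumptions made for illustration.

```python
import json
import struct


def write_indexed(path, genes, samples, matrix):
    """Write a genes x samples matrix as float32 rows behind a JSON index."""
    index = json.dumps({"genes": genes, "samples": samples}).encode("utf-8")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<I", len(index)))  # 4-byte length of the index
        fh.write(index)
        for row in matrix:
            fh.write(struct.pack(f"<{len(samples)}f", *row))


def read_gene(path, gene):
    """Read one gene's expression row by seeking directly to its offset."""
    with open(path, "rb") as fh:
        (index_len,) = struct.unpack("<I", fh.read(4))
        index = json.loads(fh.read(index_len).decode("utf-8"))
        n_samples = len(index["samples"])
        row = index["genes"].index(gene)
        # Rows have a fixed width, so the offset follows from the row number.
        fh.seek(4 + index_len + row * 4 * n_samples)
        values = struct.unpack(f"<{n_samples}f", fh.read(4 * n_samples))
    return dict(zip(index["samples"], values))
```

The point of the fixed-width layout is that evaluating a small marker set reads only the rows for those markers instead of loading the whole expression matrix.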

3. Performance evaluation module (130)

The performance evaluation module (130) of CANES runs classification methods such as SVM, RF, NN, and CART. Referring to Table 2, for a user-selected multi-marker this module provides evaluation results that include area under the curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), and F1 score. To measure the contribution of a single marker to the performance of a multi-marker, CANES provides an influence measure, defined as the difference between the BA obtained with all markers and the BA obtained with all markers except that single marker. All of these operations can be carried out under the following four evaluation schemes, as shown in FIG. 4.

1) Evaluation of multi-markers using selected training and testing datasets

CANES can perform predictive analysis on a specific cancer type or study. Users can build and save a prediction model for their own multi-marker list using a selected dataset and classification model. Graphs and an interactive result layout are provided and can be saved.

2) Evaluation using a stored prediction model on any selected testing dataset

CANES stores evaluation results, and these can be applied to different testing datasets. For example, a user can save a prediction model built from breast cancer markers and a breast cancer dataset, and then evaluate the saved model on a liver cancer dataset.
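The store-then-reuse workflow can be sketched as follows. The text names SVM, RF, NN, and CART as the classifiers; a nearest-centroid classifier is swapped in here only to keep the example self-contained, and the JSON storage format, function names, marker names (`GENE_A`, `GENE_B`), and data values are all hypothetical.

```python
import json
from statistics import mean


def train_centroids(samples, labels, markers):
    """Fit a nearest-centroid model on the chosen markers.

    samples: list of dicts mapping marker name -> expression value.
    """
    model = {"markers": list(markers), "centroids": {}}
    for label in sorted(set(labels)):
        rows = [s for s, l in zip(samples, labels) if l == label]
        model["centroids"][label] = [mean(r[m] for r in rows) for m in markers]
    return model


def predict(model, sample):
    """Return the label whose centroid is closest to the sample."""
    def sq_dist(centroid):
        return sum((sample[m] - c) ** 2 for m, c in zip(model["markers"], centroid))
    return min(sorted(model["centroids"]), key=lambda lab: sq_dist(model["centroids"][lab]))


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    return mean(predict(model, s) == l for s, l in zip(samples, labels))


# Train on one (made-up) dataset and store the fitted model as JSON text.
train = [{"GENE_A": 9.0, "GENE_B": 2.0}, {"GENE_A": 8.0, "GENE_B": 3.0},
         {"GENE_A": 2.0, "GENE_B": 8.0}, {"GENE_A": 3.0, "GENE_B": 9.0}]
stored = json.dumps(train_centroids(train, ["tumor", "tumor", "normal", "normal"],
                                    ["GENE_A", "GENE_B"]))

# Later: reload the stored model and evaluate it on a different dataset.
reloaded = json.loads(stored)
other = [{"GENE_A": 7.0, "GENE_B": 4.0}, {"GENE_A": 1.0, "GENE_B": 7.0}]
other_accuracy = accuracy(reloaded, other, ["tumor", "normal"])
```

Because the stored object contains both the marker list and the fitted parameters, the second dataset only needs to provide values for the same markers.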

3) Leave-one-out cross-validation (LOOCV) evaluation using a selected dataset

To prevent overfitting to any particular training dataset, CANES can evaluate multi-markers using LOOCV. In this evaluation scheme, CANES also supports the evaluation of individual markers within a multi-marker set by measuring each marker's contribution to the performance of the multi-marker.
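A minimal sketch of LOOCV combined with the influence measure (BA with all markers minus BA with one marker removed) is given below. The nearest-centroid classifier is again a stand-in for the SVM, RF, NN, and CART classifiers named in the text, the function names are hypothetical, and labels are assumed to be 1 (case) and 0 (control).

```python
from statistics import mean


def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label of x from class centroids of the training rows."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, l in zip(train_x, train_y) if l == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return min(sorted(centroids),
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])))


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for labels 1 (case) and 0 (control)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return (tp / y_true.count(1) + tn / y_true.count(0)) / 2


def loocv_ba(X, y, columns):
    """Leave each sample out once, train on the rest, and score BA."""
    preds = []
    for i in range(len(X)):
        train_x = [[row[c] for c in columns] for j, row in enumerate(X) if j != i]
        train_y = [label for j, label in enumerate(y) if j != i]
        preds.append(nearest_centroid_predict(train_x, train_y,
                                              [X[i][c] for c in columns]))
    return balanced_accuracy(y, preds)


def influence(X, y, n_markers):
    """Influence of marker m: BA(all markers) - BA(all markers except m)."""
    all_cols = list(range(n_markers))
    full = loocv_ba(X, y, all_cols)
    return [full - loocv_ba(X, y, [c for c in all_cols if c != m])
            for m in all_cols]
```

A positive influence means the multi-marker performs worse without that marker; a value near zero suggests the marker adds little to the panel on this dataset.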

4) Evaluation using a user-provided training dataset and a selected testing dataset

With CANES, a prediction model generated from a user-provided dataset can be evaluated. The user dataset is uploaded through the web-interface module, preprocessed and then normalized, and subsequently used as a training dataset to which the different classification methods are applied. The prediction model trained on the user's own dataset is tested using independent datasets from the public repository.

4. Web-interface module (140)

The web-interface module (140) of CANES consists of an input layout and a result explorer. The input layout is the interface that passes the user-selected multiple markers and the queried parameters to the evaluation module. In the input layout, the user can enter a set of gene symbols, miRNAs, or probe IDs, and select either a preprocessed public dataset or a user-uploaded individual dataset as the training dataset. The result explorer provides table and graph visualizations of the evaluation results. The CANES web-interface module (140) is implemented in PHP together with the jQuery and CodeIgniter frameworks.

The main features of CANES (100) are described in detail below.

FIGS. 1, 3, and 4 show the overall scheme of CANES (100) for evaluating the performance of single and/or multiple cancer marker candidates on datasets of 18 tumor types, based on the four evaluation schemes. The distinctive features of CANES are summarized and compared with other biomarker-related databases (Oncomine, IPA-biomarker (http://www.qiagen.com/ingenuity), cBioPortal) in terms of the tool functions essential for biomarker evaluation, as described in Table 1.

In addition, CANES 1) provides survival analysis using Kaplan-Meier plots and the log-rank test, 2) evaluates genes as well as miRNA markers as biomarkers for diagnostic or prognostic purposes, and 3) provides pan-cancer evaluation results for each single marker (see Table 3).
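For readers unfamiliar with the survival analysis mentioned here, the Kaplan-Meier estimator behind such plots can be sketched in a few lines. This is a generic textbook illustration, not code from CANES; the function name and the input convention (`events[i]` is `True` for an observed death and `False` for a censored observation) are assumptions. The log-rank test is not shown.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  -- follow-up time of each patient
    events -- True if the event (death) was observed, False if censored
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(order) and order[i][0] == t:
            if order[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve
```

To compare a marker-high group with a marker-low group, the estimator would be run once per group and the two step curves plotted together.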

Multi-marker evaluation of breast cancer markers for lung cancer

The inventors evaluated the multi-marker prediction power of known breast cancer markers for another cancer type within the panel of 18 tumor types. One of the key features of CANES is the evaluation of multi-markers across multiple cancer types. In this study, the known breast cancer markers BRCA1, BRCA2, BRIP1, CHEK2, PALB2, RB1, and TP53 were assessed for their power to predict lung cancer. FIG. 5 shows the CANES evaluation report for the seven breast cancer biomarkers, classified by a support vector machine with leave-one-out cross-validation, using a lung cancer dataset of 46 cancer tissues and 45 normal tissues as the test dataset. FIG. 5 is a representative CANES performance report on a test dataset. The seven multi-markers (see FIG. 5A) were evaluated across multiple cancer types (see FIG. 5B) and showed high AUC, BA, sensitivity (SN), and specificity (SP) in lung cancer (see FIG. 5C). This performance evaluation indicates that these seven biomarkers are potentially applicable to lung cancer as well. Previous reports describe ERBB2-HER2 as overexpressed in subsets of glioblastoma, gastric cancer, serous endometrial cancer, bladder cancer, and lung cancer. Pathological knowledge and therapeutic approaches can be shared across cancer types that have the same transcriptome expression pattern. CANES therefore provides a powerful means of evaluating biomarkers across cancer types.

In conclusion, CANES (100) is a powerful tool that provides landscape evaluation results across 18 cancer types for single or multiple markers for diagnosis and prognosis. It can be used by researchers as well as by clinicians seeking better patient stratification, which is a key component of successful targeted therapy.

Despite more than 1,000 publications, the actual number of clinically approved biomarkers is fewer than 100. For screening purposes, prostate-specific antigen (PSA) is the only approved serum biomarker, and despite its use the guidelines remain controversial. Although whole-genome and transcriptome sequencing is regarded as "personalized medicine" for patients diagnosed with a specific cancer, the cost/benefit of such massive analyses is still debated. Moreover, the use of such profiles to determine the presence of diagnostic gene expression "signatures" has also been questioned. Even for well-known diagnostic biomarkers such as carcinoembryonic antigen (CEA, colon cancer), CA19-9 (pancreatic cancer), and CA-125 (ovarian cancer), the exact role of these markers in disease progression is largely unknown.

To address these issues, the inventors designed and developed CANES, a simple, user-friendly, web-based single/multiple biomarker evaluation tool for evaluating multiple markers on a wide range of public cancer datasets. Furthermore, CANES can evaluate the performance of multiple biomarkers for a number of parameters (diagnosis, treatment response, survival, etc.) in an area that has seen little clinical success. All evaluation results are provided with table and graph visualizations and can be downloaded as high-quality PDF images and CSV text-based spreadsheet files. CANES is a powerful tool for evaluating multiple candidate markers on independent datasets for diagnosis or prognosis. At present, CANES can use not only all publicly available microarray datasets but also RNA-seq datasets for specific cancers recently generated with next-generation sequencing technology in the TCGA database.

The usefulness of CANES extends beyond the examples above. For example, CANES can be used to provide the evaluation results (for example, > 95% sensitivity and > 95% specificity) of a particular biomarker panel as markers for broad screening for an individual cancer type (at least for individuals at high risk for that cancer type). Through the use of CANES, the inventors facilitate the development of biomarkers with improved accuracy. Moreover, from a research standpoint, confirming a strong association between particular gene(s) and a particular tumor can aid mechanistic understanding of that tumor's progression (and its possible inhibition).

In summary, the inventors have developed a new publicly available tool for the research and development of a single biomarker or a set of multiple biomarkers for a specific cancer type. The tool will be particularly useful to the clinical and biomedical research communities seeking to characterize specific cancer types, identify pathways of cancer progression, and substantially improve the potential clinical utility (for example, diagnosis, prognosis, survival) of individual biomarkers.

List of abbreviations used in the present invention

AUC, area under the curve; BA, balanced accuracy; CART, classification and regression tree; GEO, Gene Expression Omnibus; FPR, false positive rate; FDR, false discovery rate; GC, gastric cancer; ICGC, International Cancer Genome Consortium; LOOCV, leave-one-out cross-validation; NN, neural network; NPV, negative predictive value; PPV, positive predictive value; RF, random forest; RMA, robust multi-array average; SVM, support vector machine; TCGA, The Cancer Genome Atlas.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains may make various modifications, changes, and substitutions without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed herein and the accompanying drawings are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments and drawings. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention. In addition, claims that have no explicit citation relationship in the claims may be combined to form an embodiment, or may be included as new claims by amendment after filing.

110: preprocessing module
120: database module
130: performance evaluation module
140: web-interface module

Claims

1. An apparatus for evaluating the performance of a cancer biomarker, comprising:
a preprocessing module that collects cancer expression datasets from a public DB and preprocesses them;
a database module that rearranges the preprocessed cancer expression datasets to construct a new cancer information DB; and
a performance evaluation module that evaluates the performance of a cancer biomarker for each cancer type using the constructed new cancer information DB.

2. The apparatus according to claim 1,
further comprising a web-interface module that displays the performance evaluation results of the cancer biomarker evaluated for each cancer type as a mapping, a graph, or a table.

3. The apparatus according to claim 1,
wherein the preprocessing module analyzes the collected cancer expression datasets and normalizes them using quantile normalization and robust multi-array average (RMA) normalization.

4. The apparatus according to claim 3,
wherein the preprocessing module uses normalized counts as expression values for RNA expression data obtained by RNA sequencing in the TCGA public DB.

5. The apparatus according to claim 1,
wherein the preprocessing module detects outliers using within-group correlation and between-group correlation.

6. The apparatus according to claim 5,
wherein the preprocessing module excludes samples having the detected outliers according to an external input from the user.

7. The apparatus according to claim 1,
wherein, when the collected cancer expression datasets include clinical information and sample annotations, the preprocessing module designs a new dataset through analysis of diagnosis, prognosis, and drug response information in order to specify the precise cancer type or subtype.

8. The apparatus according to claim 1,
wherein the database module comprises the preprocessed expression datasets and corresponding annotation data.

9. The apparatus according to claim 8,
wherein the database module stores all the expression data in the form of a user-customized indexed binary file.

10. The apparatus according to claim 1,
wherein the performance evaluation module provides, for user-selected multiple markers, evaluation results including the area under the curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), and F1 score.

11. The apparatus according to claim 1,
wherein, in order to measure the contribution of a single marker to the performance of a multi-marker, the performance evaluation module provides the difference between the balanced accuracy (BA) for all markers and the balanced accuracy (BA) for all markers other than the single marker.

12. The apparatus according to claim 2,
wherein the web-interface module comprises:
an input layout that passes user-selected multiple markers and queried parameters to the performance evaluation module; and
a result explorer that provides table and graph visualizations of the performance evaluation results.

13. The apparatus according to claim 12,
wherein the input layout selects, as a training dataset, either a preprocessed public dataset or a user-provided individual dataset.

14. A method for evaluating the performance of a cancer biomarker, comprising:
a preprocessing step of collecting cancer expression datasets from a public DB and preprocessing them;
a step of rearranging the preprocessed cancer expression datasets to construct a new cancer information DB; and
a performance evaluation step of evaluating the performance of a cancer biomarker for each cancer type using the constructed new cancer information DB.

15. The method according to claim 14,
further comprising a performance evaluation result display step of displaying the performance evaluation results of the cancer biomarker evaluated for each cancer type as a mapping, a graph, or a table.

16. The method according to claim 14,
wherein the preprocessing step comprises analyzing the collected cancer expression datasets and normalizing them using quantile normalization and robust multi-array average (RMA) normalization.

17. The method according to claim 14,
wherein the preprocessing step detects outliers using within-group correlation and between-group correlation.

18. The method according to claim 14,
wherein, in the preprocessing step, when the collected cancer expression datasets include clinical information and sample annotations, a new dataset is designed through analysis of diagnosis, prognosis, and drug response information in order to specify the precise cancer type or subtype.

19. The method according to claim 14,
wherein the step of constructing the new cancer information DB stores all the expression data in the form of a user-customized indexed binary file.

20. The method according to claim 14,
wherein, in the performance evaluation step, evaluation results including the area under the curve (AUC), balanced accuracy (BA), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), false positive rate (FPR), false discovery rate (FDR), and F1 score are provided for user-selected multiple markers.

21. The method according to claim 14,
wherein, in the performance evaluation step, in order to measure the contribution of a single marker to the performance of a multi-marker, the difference between the balanced accuracy (BA) for all markers and the balanced accuracy (BA) for all markers other than the single marker is provided.

22. The method according to claim 20 or 21,
wherein, in the performance evaluation result display step,
at least one of the following evaluations is performed: evaluation of multiple markers using selected training and testing datasets; evaluation using a stored prediction model on any selected testing dataset; leave-one-out cross-validation (LOOCV) evaluation using a selected dataset; and evaluation using a user-provided training dataset and a selected testing dataset.

23. The method according to claim 15,
wherein, in the performance evaluation result display step, either a preprocessed public dataset or a user-provided individual dataset is selected as a training dataset.