KR20210044400A

KR20210044400A - Method and apparatus for discovering biomarker for predicting cancer prognosis using heterogeneous platform of DNA methylation data

Info

Publication number: KR20210044400A
Application number: KR1020190127477A
Authority: KR
Inventors: 김진호; 박웅양; 김가현; 정준기; 박동현
Original assignee: 사회복지법인 삼성생명공익재단; 지니너스 주식회사
Priority date: 2019-10-15
Filing date: 2019-10-15
Publication date: 2021-04-23
Also published as: KR102543757B1

Abstract

The present invention relates to a biomarker discovery method for predicting the prognosis of cancer using DNA methylation data of a heterogeneous platform and a device thereof. According to an aspect, the method and the device increase the accuracy of a cancer prognosis prediction model using DNA methylation data of a heterogeneous platform. The method and the device can derive methylation markers related to the prognosis of cancer and develop a cancer prognosis prediction kit with higher accuracy.

Description

Method and apparatus for discovering biomarker for predicting cancer prognosis using heterogeneous platform of DNA methylation data

본 발명은 이종 플랫폼의 DNA 메틸레이션 데이터를 이용한 암의 예후 예측용 바이오마커 발굴 방법 및 장치에 관한 것이다. The present invention relates to a method and apparatus for discovering biomarkers for predicting cancer prognosis using DNA methylation data of heterogeneous platforms.

Epigenetics is the field that studies the regulation of gene expression that occurs without changes to the DNA nucleotide sequence. It studies regulation through epigenetic alterations such as DNA methylation, miRNA, and the acetylation, methylation, phosphorylation, and ubiquitination of histones. Of these, DNA methylation is the most extensively studied. Epigenetic alterations can change gene function and drive transformation into tumor cells. DNA methylation is therefore associated with the expression or suppression of disease-regulating genes in cells. A number of studies on diagnosing the risk or prognosis of cancer by measuring DNA methylation have recently been reported (Korean Patent Laid-Open No. 10-2014-0022293), and studies using machine learning are in progress.

Meanwhile, when a model is built from methylation data and the number of methylation sites far exceeds the number of samples, the model learns the training data too closely: accuracy is excessively high on the training data and low on the test data, which is the problem of overfitting. For this reason, only a small number of well-known or experimentally verified gene methylations are currently used as markers. Solving the overfitting problem requires more data, but obtaining a large amount of data from the same platform is costly. There is therefore a need for a method that improves model accuracy by solving the overfitting problem without increasing cost.

An object is to provide a method and apparatus for discovering methylation markers for predicting cancer prognosis using DNA methylation data from heterogeneous platforms. The technical problem to be solved by the present embodiment is not limited to the one described above, and other technical problems may be inferred from the following embodiments.

One aspect provides a method of discovering a methylation marker for predicting cancer prognosis using machine learning, performed by a device operated by at least one processor, the method comprising: a first step in which the at least one processor extracts, from first methylation data, CpG sites that are significantly correlated with a variable important for predicting cancer prognosis; a second step in which the at least one processor selects, from second methylation data, the data corresponding to the extracted CpG sites as final training input data; and a third step in which the at least one processor generates a cancer prognosis prediction model from the final training input data.
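The second step can be pictured as a column selection on the second platform's data. The following is a minimal sketch under stated assumptions, not the applicants' implementation: the function name `select_training_columns`, the dictionary layout, and the CpG identifiers are all hypothetical, and dropping sites that the second platform did not measure is an assumed policy.

```python
def select_training_columns(second_platform, extracted_sites):
    """Keep only the extracted CpG sites from each sample of the second data set.

    second_platform: dict of sample id -> dict of CpG id -> methylation level
    extracted_sites: CpG ids chosen from the first data set (step 1)
    Sites not measured in every sample of the second platform are dropped
    (an assumption made for this sketch).
    """
    shared = [site for site in extracted_sites
              if all(site in levels for levels in second_platform.values())]
    matrix = {sample: [levels[site] for site in shared]
              for sample, levels in second_platform.items()}
    return shared, matrix


# Hypothetical second-platform (e.g. bisulfite sequencing) data
bisulfite = {
    "patient_1": {"cg0001": 0.91, "cg0002": 0.10, "cg0003": 0.55},
    "patient_2": {"cg0001": 0.15, "cg0002": 0.80, "cg0003": 0.40},
}
sites, X = select_training_columns(bisulfite, ["cg0003", "cg0001", "cg0999"])
print(sites)           # ['cg0003', 'cg0001']
print(X["patient_1"])  # [0.55, 0.91]
```

The resulting per-sample rows would then serve as the final training input data for the third step.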

In another aspect, the method may further include, before the first step: deriving, by the at least one processor, variables by reducing the dimension of the first methylation data; generating, by the at least one processor, a cancer prognosis prediction test model through machine learning using the derived variables as training input data and clinical data of cancer patients as training output data; and evaluating, by the at least one processor, the performance of the test model to determine one or more variables important for predicting cancer prognosis.

또 다른 측면에 있어서, 상기 제 1 메틸레이션 데이터 또는 상기 제 2 메틸레이션 데이터는 array 기반 메틸레이션 데이터 또는 바이설파이트 시퀀싱 (bisulfite sequencing) 데이터일 수 있다. In another aspect, the first methylation data or the second methylation data may be array-based methylation data or bisulfite sequencing data.

In another aspect, the step of deriving variables by reducing the dimension may be performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).

In another aspect, the cancer prognosis prediction test model may be generated using a bootstrap algorithm.

또 다른 측면에 있어서, 상기 암의 예후 예측에 중요한 하나 이상의 변수를 도출하는 단계는 교차 검증을 수행하여 최대 성능을 나타내는 모델로부터 암의 예후 예측에 중요한 하나 이상의 변수를 도출할 수 있다.In another aspect, in the step of deriving one or more variables important for predicting the prognosis of cancer, one or more variables important for predicting the prognosis of cancer may be derived from a model showing maximum performance by performing cross-validation.

또 다른 측면은 상기 방법을 컴퓨터에서 실행시키기 위한 프로그램을 기록한 컴퓨터로 읽을 수 있는 기록 매체를 제공한다.Another aspect provides a computer-readable recording medium storing a program for executing the method on a computer.

Another aspect provides an apparatus for discovering a methylation marker for predicting cancer prognosis using machine learning, the apparatus comprising: a CpG site extraction unit that extracts, from first methylation data, CpG sites that are significantly correlated with a variable important for predicting cancer prognosis; a final training input data selection unit that selects, from second methylation data, the data corresponding to the extracted CpG sites as final training input data; and a model generation unit that generates a cancer prognosis prediction model from the final training input data.

In another aspect, the apparatus may further include: a variable derivation unit that derives variables by reducing the dimension of the first methylation data; a test model generation unit that generates a cancer prognosis prediction test model through machine learning using the derived variables as training input data and clinical data of cancer patients as training output data; and an important variable determination unit that evaluates the performance of the test model to determine one or more variables important for predicting cancer prognosis.

또 다른 측면에 있어서, 상기 제 1 메틸레이션 데이터 또는 상기 제 2 메틸레이션 데이터는 array 기반 메틸레이션 데이터 또는 바이설파이트 시퀀싱 (bisulfite sequencing) 데이터일 수 있다.In another aspect, the first methylation data or the second methylation data may be array-based methylation data or bisulfite sequencing data.

In another aspect, the dimension reduction may be performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).

또 다른 측면에 있어서, 상기 테스트 모델 생성은 부트스트랩 (bootstrap) 알고리즘을 사용하여 생성할 수 있다. In another aspect, the test model generation may be generated using a bootstrap algorithm.

또 다른 측면에 있어서, 상기 암의 예후 예측에 중요한 하나 이상의 변수의 결정은 교차 검증을 수행하여 최대 성능을 나타내는 모델로부터 암의 예후 예측에 중요한 하나 이상의 변수를 결정할 수 있다.In another aspect, the determination of one or more variables important for predicting the prognosis of cancer may perform cross-validation to determine one or more variables important for predicting the prognosis of cancer from a model showing maximum performance.

According to the method and apparatus of one aspect, the accuracy of a cancer prognosis prediction model is improved by using DNA methylation data from heterogeneous platforms, which makes it possible to derive methylation markers related to cancer prognosis prediction and to develop a cancer prognosis prediction kit with higher accuracy.

FIG. 1 is a block diagram illustrating the hardware configuration of a computing device that discovers methylation markers for predicting cancer prognosis using machine learning, according to an embodiment.
FIG. 2 is a block diagram showing the detailed hardware configuration of the processor of FIG. 1.
FIG. 3 is a flowchart of a method of discovering methylation markers for predicting cancer prognosis using machine learning, according to an embodiment.
FIG. 4 is a graph illustrating the process of determining one or more variables important for predicting cancer prognosis by evaluating test model performance, according to an embodiment.
FIG. 5 is a graph comparing the average prediction accuracy, for each cancer stage, of (A) a cancer prognosis prediction model generated using bisulfite sequencing data as training input data and (B) a cancer prognosis prediction model generated using, as training input data, bisulfite sequencing data for the CpG sites extracted by processing array-based methylation data according to an embodiment.

The terms used in the present embodiments were chosen, as far as possible, from general terms currently in wide use, in consideration of their functions in the embodiments, but they may vary depending on the intention of those working in the relevant technical field, precedent, the emergence of new technologies, and the like. In certain cases a term has been arbitrarily selected, and its meaning is then described in detail in the description of the corresponding embodiment. Accordingly, the terms used in the present embodiments should be defined on the basis of their meaning and the overall content of the embodiments, not simply by their names.

In the descriptions of the embodiments, when a part is said to be connected to another part, this includes not only the case where it is directly connected but also the case where it is electrically connected with another component interposed between them. When a part is said to include a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise. In addition, the terms "... unit" and "... module" used in the embodiments refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

본 실시예들에서 사용되는 "구성된다" 또는 "포함된다" 등의 용어는 명세서 상에 기재된 여러 구성 요소들, 또는 여러 단계들을 반드시 모두 포함하는 것으로 해석되지 않아야 하며, 그 중 일부 구성 요소들 또는 일부 단계들은 포함되지 않을 수도 있고, 또는 추가적인 구성 요소 또는 단계들을 더 포함할 수 있는 것으로 해석되어야 한다.Terms such as "consisting" or "included" used in the present embodiments should not be construed as necessarily including all of the various constituent elements or various steps described in the specification, and some constituent elements or It should be construed that some steps may not be included, or may further include additional elements or steps.

하기 실시예들에 대한 설명은 권리범위를 제한하는 것으로 해석되지 말아야 하며, 해당 기술분야의 당업자가 용이하게 유추할 수 있는 것은 실시예들의 권리범위에 속하는 것으로 해석되어야 할 것이다. 이하 첨부된 도면들을 참조하면서 오로지 예시를 위한 실시예들을 상세히 설명하기로 한다.The description of the following embodiments should not be construed as limiting the scope of the rights, and what those skilled in the art can easily infer should be construed as belonging to the scope of the embodiments. Hereinafter, embodiments for illustration only will be described in detail with reference to the accompanying drawings.

도 1은 일 실시예에 따른 기계 학습을 이용한 암의 예후 예측용 메틸레이션 마커를 발굴하는 컴퓨팅 장치의 하드웨어 구성을 도시한 블록도이다. 1 is a block diagram illustrating a hardware configuration of a computing device for discovering a methylation marker for predicting a prognosis of cancer using machine learning according to an exemplary embodiment.

Referring to FIG. 1, the computing device 50 is a device for discovering markers for predicting cancer prognosis by preprocessing, processing, and analyzing first methylation data 40, second methylation data 41, and clinical information data 42 of cancer patients, and may include a data interface 80, a memory 90, and a processor 100. FIG. 1 shows only the components of the computing device 50 that are related to the present embodiment, so as not to obscure its features; the computing device 50 may further include other general-purpose components in addition to those shown in FIG. 1.

The terms "cancer", "tumor", and "malignant" generally refer to or describe the physiological condition in mammals that is characterized by unregulated cell growth. Examples of cancer include, but are not limited to, carcinoma, lymphoma, leukemia, blastoma, and sarcoma.

The term "prognosis" refers to predicting the course and outcome of a disease, and may include predicting the recurrence or progression of cancer.

용어 "메틸레이션 (methylation)"은 DNA를 구성하는 염기에 메틸기가 부착되어 유전자 발현 양상이 변하는 것으로, 특정 유전자의 특정 CpG 부위의 시토신에서 일어나는 것을 의미할 수 있다.The term "methylation" refers to a change in gene expression pattern by attaching a methyl group to a base constituting DNA, and may mean that it occurs in the cytosine of a specific CpG site of a specific gene.

The first methylation data 40 or the second methylation data 41 may be obtained experimentally. Methylation data may be obtained experimentally at any time point by methods or devices known to those of ordinary skill in the art. For example, they may be obtained by any one method selected from the group consisting of PCR, methylation-specific PCR, real-time methylation-specific PCR, MethyLight PCR, MethyLight digital PCR, EpiTYPER, PCR using a methylated-DNA-specific binding protein, quantitative PCR, DNA chip, pyrosequencing, bisulfite sequencing, and microarray, but the method is not limited thereto.

나아가서, 제 1 메틸레이션 데이터 또는 제 2 메틸레이션 데이터는 공개 데이터베이스(DB)로부터 획득된 것일 수 있다. 예를 들어, NCBI(National Center for Biotechnology Information), Gene Expression Omnibus (GEO), European Bioinformatics Institute databases, 또는 European Nucleotide Archive, 등과 같은 당해 기술분야에서 이미 공지된 데이터베이스(DB)로부터 획득된 것일 수 있다. 일 실시예에 있어서, 제 1 메틸레이션 데이터 또는 제 2 메틸레이션 데이터는 공개된 array 기반 메틸레이션 데이터 또는 바이설파이트 시퀀싱 (bisulfite sequencing) 데이터로부터 획득된 것일 수 있다. 하지만, 유전자 분석 기술의 발달로 인하여 새로운 DNA 메틸레이션 데이터가 계속하여 발견되고 업데이트되고 있으므로, 본 실시예에서 설명하고자 하는 DNA 메틸레이션 데이터는 공개 데이터베이스(DB)로부터 얻을 수 있는 것에만 제한되지 않는다.Furthermore, the first methylation data or the second methylation data may be obtained from a public database (DB). For example, it may be obtained from a database (DB) already known in the art, such as National Center for Biotechnology Information (NCBI), Gene Expression Omnibus (GEO), European Bioinformatics Institute databases, or European Nucleotide Archive. In an embodiment, the first methylation data or the second methylation data may be obtained from published array-based methylation data or bisulfite sequencing data. However, since new DNA methylation data are continuously discovered and updated due to the development of gene analysis technology, the DNA methylation data to be described in this example is not limited to those that can be obtained from the public database (DB).

Array-based methylation data come from one of Illumina's methods for assessing methylation levels, and can be obtained by measuring the intensities of the methylated and unmethylated beads of the Illumina Infinium HumanMethylation 450 BeadChip.
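The source does not state how the two bead intensities are combined into a methylation level. A common convention for bead arrays is the beta value, sketched below as an assumption; the function name `beta_value` is hypothetical, and the offset of 100 is the customary default rather than a value taken from this document.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Methylation level from methylated and unmethylated bead intensities.

    Negative intensities (possible after background correction) are clipped
    to zero. The offset keeps the ratio stable when both intensities are low;
    100 is an assumed default, not a value from the source.
    """
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


print(beta_value(900.0, 0.0))    # 0.9
print(beta_value(450.0, 450.0))  # 0.45
print(beta_value(0.0, 900.0))    # 0.0
```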

For bisulfite sequencing data, the sample is treated with bisulfite before sequencing in order to reveal the methylation pattern. The treatment converts cytosine bases to uracil, whereas methylated cytosines remain cytosine even after treatment, and this difference can be used to assess the methylation level.
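As an illustration of how that difference yields a number: after bisulfite treatment and amplification, unmethylated cytosines are read as T and methylated ones as C, so the fraction of C reads at a CpG site estimates its methylation level. The sketch below assumes per-site read counts are already available; the function name `methylation_level` and the handling of uncovered sites are hypothetical.

```python
def methylation_level(c_reads, t_reads):
    """Fraction of reads at a CpG site that still show cytosine.

    c_reads: reads showing C (cytosine protected by methylation)
    t_reads: reads showing T (unmethylated cytosine converted via uracil)
    Returns None when the site has no coverage (an assumed convention).
    """
    total = c_reads + t_reads
    if total == 0:
        return None
    return c_reads / total


print(methylation_level(30, 10))  # 0.75
print(methylation_level(0, 0))    # None
```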

The data interface 80 acquires the first methylation data 40, the second methylation data 41, and the clinical information data 42 of cancer patients, which are either measured experimentally from biological samples or stored in a database (DB). That is, the data interface 80 may be implemented as the hardware of a wired or wireless network interface through which the computing device 50 communicates with other external devices.

The memory 90 is hardware for storing data to be processed in the computing device 50 and the results of completed processing, and includes memory chips such as random access memory (RAM) and read only memory (ROM), or storage such as a hard disk drive (HDD) and a solid state drive (SSD). That is, the memory 90 may store the first methylation data 40, the second methylation data 41, and the clinical information data 42 of cancer patients acquired by the data interface 80, and may also store data on the methylation markers discovered by the processor 100.

The processor 100 is the marker-discovery hardware that generates a model for predicting cancer prognosis using the first methylation data 40, the second methylation data 41, and the clinical information data 42 of cancer patients, and derives methylation markers from that model. The processor 100 is a module implemented with one or more processing units, and may be implemented as a combination of a microprocessor having an array of logic gates and a memory module storing a program executable on that microprocessor. The processor 100 may also be implemented in the form of a module of an application program.

프로세서(100)에 의해 발굴된 메틸레이션 마커는 데이터 인터페이스(80)를 통해 외부의 다른 디바이스, 예를 들어 디스플레이 디바이스, 다른 컴퓨팅 장치 등으로 전송되거나, 또는 외부 네트워크, 예를 들어 인터넷, 공개 데이터베이스(DB) 상으로 전송될 수 있다.The methylation marker discovered by the processor 100 is transmitted to an external device, for example, a display device, another computing device, or the like through the data interface 80, or an external network, for example, the Internet, a public database ( DB) can be transmitted.

도 2는 도 1의 프로세서의 상세 하드웨어 구성을 도시한 블록도이다.FIG. 2 is a block diagram showing a detailed hardware configuration of the processor of FIG. 1.

Referring to FIG. 2, the processor 100 may include a variable derivation unit 101, a test model generation unit 102, an important variable determination unit 103, a CpG site extraction unit 104, a final training input data selection unit 105, and a model generation unit 106. FIG. 2 shows only the components of the processor 100 that are related to the present embodiment, so as not to obscure its features; the processor 100 may further include other general-purpose components in addition to those shown. The units 101 to 106 are given separate names only according to their functions, and may all be implemented in a single processor 100. Alternatively, each of the units 101 to 106 may correspond to one or more processing modules within the processor 100, or to a separate software algorithm unit divided by function. That is, the form in which the units 101 to 106 are implemented within the processor 100 is not limited to any one of these.

The variable derivation unit 101 receives the first methylation data and derives variables by reducing the dimension of the received data. The term "dimension reduction" may refer to a data preprocessing step that compresses the data to increase its density. Machine learning on high-dimensional data can slow training and degrade model performance, so the apparatus according to an embodiment includes the variable derivation unit 101 to improve training speed and model performance. In one embodiment, the dimension reduction may be performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF). In the dimension reduction according to an embodiment, principal component analysis (PCA) is performed to compress the array-based methylation data: the axes along which the variance of the data is greatest are found and the principal components (PCs) are derived, so that samples in the high-dimensional space are transformed into a low-dimensional space. In the present specification, the term "variable (feature)" may be used in a sense that includes the term "principal components (PCs)".
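To make "the axis that maximizes the variance" concrete, the following minimal sketch finds the first principal component of a small sample-by-site matrix by power iteration, using only the standard library. It is an illustration under stated assumptions, not the embodiment's implementation: the function name, the toy data, and the choice of power iteration (a production pipeline would more likely use a library PCA or SVD routine) are all assumptions.

```python
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (axis, scores) for the leading principal component.

    rows: list of samples, each a list of methylation levels (one per CpG site).
    axis: unit-length weights over CpG sites; scores: one value per sample.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]

    rng = random.Random(seed)
    axis = [rng.random() for _ in range(p)]
    for _ in range(iterations):
        # Apply X^T X to the axis without building the p-by-p covariance matrix
        scores = [sum(c[j] * axis[j] for j in range(p)) for c in centered]
        axis = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = sum(a * a for a in axis) ** 0.5
        axis = [a / norm for a in axis]
    scores = [sum(c[j] * axis[j] for j in range(p)) for c in centered]
    return axis, scores


# Toy data: sites 0 and 1 vary together across samples, site 2 is constant
rows = [[0.1, 0.2, 0.5], [0.9, 0.8, 0.5], [0.2, 0.1, 0.5], [0.8, 0.9, 0.5]]
axis, scores = first_principal_component(rows)
print([round(a, 3) for a in axis])
print([round(s, 3) for s in scores])
```

In this toy case the axis puts equal weight on the two co-varying sites and none on the constant site, so each sample's score works as a single compressed variable in place of three site-level values.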

테스트 모델 생성부 (102)는 변수 도출부 (101)로부터 차원 축소를 통해 도출된 변수를 수신하여 이를 학습용 입력데이터로 하고, 암 환자의 임상데이터를 수신하여 이를 학습용 출력 데이터로 하여 기계 학습을 통해 암의 예후 예측 테스트 모델을 생성한다. 테스트 모델 생성은 부트스트랩 (bootstrap) 알고리즘을 사용하여 생성될 수 있으나, 이에 제한되는 것은 아니다.The test model generation unit 102 receives the variable derived through the dimension reduction from the variable derivation unit 101 and uses it as input data for learning, receives clinical data of cancer patients, and uses it as output data for learning through machine learning. Generate a test model for predicting the prognosis of cancer. The test model generation may be generated using a bootstrap algorithm, but is not limited thereto.
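The source names a bootstrap algorithm without giving details. One common reading is sketched here as an assumption: each test model is trained on patients drawn with replacement, and the patients left out of the draw are used for evaluation. The function name `bootstrap_split` and the patient identifiers are hypothetical.

```python
import random


def bootstrap_split(sample_ids, rng):
    """Draw a bootstrap training set and return the left-out samples as a test set.

    The training set has the same size as the input and may contain repeats.
    Using the left-out (out-of-bag) samples for testing is an assumption.
    """
    train = [rng.choice(sample_ids) for _ in sample_ids]
    chosen = set(train)
    test = [s for s in sample_ids if s not in chosen]
    return train, test


rng = random.Random(42)
patients = ["p1", "p2", "p3", "p4", "p5", "p6"]
train, test = bootstrap_split(patients, rng)
print(train)
print(test)
```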

The important variable determination unit 103 evaluates the performance of the test model generated by the test model generation unit 102 and, from that evaluation, determines one or more variables important for predicting cancer prognosis. These variables may be determined by performing cross-validation and taking them from the model that shows the best performance.
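The embodiment reports model performance as a C-index. A minimal sketch of that measure follows, assuming Harrell's concordance index for right-censored survival data; the source does not say which variant was used, and the function name and toy values are hypothetical.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs ranked correctly by predicted risk.

    times:  follow-up time per patient
    events: 1 if the event (e.g. recurrence or death) was observed, 0 if censored
    risks:  predicted risk score, higher meaning worse prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.3, 0.1]))  # 1.0
print(concordance_index(times, events, [0.1, 0.3, 0.7, 0.9]))  # 0.0
```

A value of 0.5 corresponds to random ranking, so models are compared by how far their average C-index across validation runs rises above that level.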

In one embodiment, test model generation and important variable determination proceed as follows. Principal component analysis (PCA) is performed on array-based methylation data, the resulting principal components are used as training input data, and clinical data of cancer patients are used as training output data to generate a cancer prognosis prediction model; the performance of the model is then evaluated to determine the important variables. When a prognosis prediction model is trained on data such as methylation data, in which the number of variables (features) is large relative to the number of samples, the model may overfit. The test model according to one embodiment may therefore be generated using Elastic net, a feature regularization method that prevents overfitting by adding constraints on the coefficients of the variables. Because Elastic net drives the model coefficients of variables that are not important for prognosis prediction to 0, variables that have a non-zero coefficient in 100 or more of 200 cross-validation runs may be judged important for prognosis prediction and selected. FIG. 4 is a graph illustrating the process of determining one or more variables important for predicting cancer prognosis through test model performance evaluation according to one embodiment. As shown in FIG. 4, the average C-index obtained from 200 cross-validation runs was evaluated for test models generated with different numbers of principal components. The model using PC1 to PC50 as predictors was selected as the most suitable model, and the principal components most frequently chosen as important variables across the 200 runs were selected as the variables important for prognosis prediction.
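The frequency-based selection described above (keep a variable if its coefficient is non-zero in at least 100 of 200 runs) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_stable_variables`, the dict-per-run layout, and the toy coefficients are all assumptions, and the Elastic net fits themselves are taken as already done.

```python
from collections import Counter


def select_stable_variables(coef_runs, min_count):
    """Return variables whose coefficient is non-zero in at least min_count runs.

    coef_runs: list of dicts mapping variable name -> fitted coefficient,
               one dict per cross-validation run (hypothetical layout).
    """
    counts = Counter()
    for coefs in coef_runs:
        for name, value in coefs.items():
            if value != 0.0:
                counts[name] += 1
    return sorted(name for name, n in counts.items() if n >= min_count)


# Toy illustration with 4 runs instead of 200; the coefficients are made up.
runs = [
    {"PC1": 0.8, "PC2": 0.0, "PC3": -0.1},
    {"PC1": 0.7, "PC2": 0.0, "PC3": 0.0},
    {"PC1": 0.9, "PC2": 0.2, "PC3": -0.3},
    {"PC1": 0.6, "PC2": 0.0, "PC3": 0.0},
]
# Keep variables that are non-zero in at least half of the runs,
# mirroring the 100-of-200 rule in the description.
selected = select_stable_variables(runs, min_count=2)
print(selected)
```

With the description's settings, the call would be `select_stable_variables(runs, min_count=100)` over 200 runs.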

The CpG site extraction unit 104 receives the first methylation data, receives from the important variable determination unit 103 the one or more variables determined to be important for predicting cancer prognosis, and extracts from the first methylation data the CpG sites that are highly correlated with those variables. In principal component analysis, each principal component is a weighted sum of CpG sites. A CpG site with a larger weight, that is, a more highly correlated CpG site, contributes more to forming that principal component, so the CpG sites that most strongly influence the principal components important for prognosis prediction can be selected. In one embodiment, CpG sites in the array-based methylation data whose correlation with each principal component determined to be important for prognosis prediction (through test model generation and evaluation) is 0.6 or more may be selected.
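As a rough sketch of the correlation filter described above, the following computes the Pearson correlation between each CpG site and each important principal component and keeps sites that reach 0.6 for at least one component. The function names, the dict layouts, the site identifiers, and the toy values are assumptions for illustration. The description does not state whether the sign of the correlation is ignored; this sketch compares the signed value, and taking the absolute value would be an equally plausible reading.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector is treated as uncorrelated
    return cov / math.sqrt(vx * vy)


def extract_cpg_sites(methylation, pc_scores, threshold=0.6):
    """Keep sites whose correlation with any important PC is >= threshold.

    methylation: dict site id -> list of methylation levels, one per sample
    pc_scores:   dict PC name -> list of PC scores, one per sample
    """
    selected = []
    for site, values in methylation.items():
        if any(pearson(values, s) >= threshold for s in pc_scores.values()):
            selected.append(site)
    return sorted(selected)


# Toy data with hypothetical site ids and made-up values.
methylation = {
    "cg_a": [0.1, 0.2, 0.3, 0.4],  # rises with the PC score
    "cg_b": [0.5, 0.5, 0.5, 0.5],  # constant across samples
    "cg_c": [0.9, 0.1, 0.8, 0.2],  # weak negative relation
}
pc_scores = {"PC1": [1.0, 2.0, 3.0, 4.0]}
print(extract_cpg_sites(methylation, pc_scores))
```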

The final training input data selection unit 105 receives the second methylation data, receives the extracted CpG sites from the CpG site extraction unit 104, and selects, from the second methylation data, the data corresponding to the extracted CpG sites as the final training input data.
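The selection step amounts to restricting the second platform's data to the CpG sites extracted from the first platform. A minimal sketch follows; the nested-dict layout, the sample and site names, and the rule of keeping only sites measured in every sample are assumptions, since the description does not say how sites missing from the second platform are handled.

```python
def select_final_inputs(second_data, extracted_sites):
    """Restrict second-platform data to the extracted CpG sites.

    second_data:     dict sample id -> dict site id -> methylation level
    extracted_sites: list of site ids chosen on the first platform
    Returns (kept_sites, matrix) where matrix maps sample id -> list of
    levels in the order of kept_sites.
    """
    samples = sorted(second_data)
    # Assumption: a site is usable only if every sample has a value for it.
    kept = [site for site in extracted_sites
            if all(site in second_data[name] for name in samples)]
    matrix = {name: [second_data[name][site] for site in kept]
              for name in samples}
    return kept, matrix


# Toy second-platform data (made-up values, hypothetical ids).
second_data = {
    "patient_1": {"cg_a": 0.82, "cg_b": 0.10, "cg_d": 0.40},
    "patient_2": {"cg_a": 0.35, "cg_b": 0.55, "cg_d": 0.45},
}
kept, matrix = select_final_inputs(second_data, ["cg_a", "cg_c", "cg_d"])
print(kept)
print(matrix)
```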

The model generation unit 106 receives the methylation data from the final training input data selection unit 105 and uses it as training input data to generate a cancer prognosis prediction model.

In one embodiment, the final training input data are selected by taking, from the bisulfite sequencing data, the data corresponding to the CpG sites extracted from the array-based methylation data. A prognosis prediction model is generated from these data, and 200 cross-validation runs are performed to measure the average prediction accuracy of the model (area under the curve; AUC), which makes it possible to evaluate whether prognosis can be predicted from the bisulfite sequencing data for the extracted CpG sites alone. In one embodiment, the prognosis prediction model may be generated by dividing the data into training data and test data using a bootstrap technique and applying Elastic net.
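For reference, the AUC used as the accuracy measure can be computed directly from its pairwise definition: the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below averages it over several validation runs. The function names, the binary 0/1 label encoding, and the toy scores are assumptions; how the clinical outcome is binarized is not specified in the description.

```python
def auc(labels, scores):
    """Area under the ROC curve from its pairwise (Mann-Whitney) definition."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


def mean_auc(runs):
    """Average AUC over (labels, scores) pairs, one per validation run."""
    return sum(auc(labels, scores) for labels, scores in runs) / len(runs)


# Two toy validation runs with made-up predictions.
runs = [
    ([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),  # 3 of 4 pairs ordered correctly
    ([1, 0, 1, 0], [0.8, 0.2, 0.7, 0.3]),  # all pairs ordered correctly
]
print(mean_auc(runs))
```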

FIG. 5 is a graph comparing the average prediction accuracy at each cancer stage, obtained by AUC analysis, to check whether a model generated using methylation data from heterogeneous platforms predicts more accurately than a model using methylation data from a single platform. FIG. 5A shows the average prediction accuracy of the cancer prognosis prediction model when methylation data from a single platform (bisulfite sequencing data) was used, and FIG. 5B shows it when methylation data from heterogeneous platforms (bisulfite sequencing data for the CpG sites extracted by processing array-based methylation data) was used. For both FIG. 5A and FIG. 5B, principal component analysis was performed on the respective single-platform or heterogeneous-platform methylation data, a model was generated using the principal components with large variance as predictors, and its accuracy was measured. The number of principal components used was increased in steps of 10, and the model with the largest AUC was taken as the optimal model for the given cancer stage. As shown in FIG. 5, when heterogeneous-platform methylation data was used as the final training input data, the AUC increased by approximately 0.16 at stage 1, 0.13 at stage 2, 0.06 at stage 3, and 0.19 at stage 4 compared with single-platform methylation data, confirming that the accuracy of the prognosis prediction model improves when DNA methylation data from heterogeneous platforms is used. This means that the problem of overfitting in the prognosis prediction model is resolved and that methylation markers related to cancer prognosis can be discovered.
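The model selection loop described above (increase the number of principal components in steps of 10 and keep the setting with the largest AUC) can be sketched as below. The helper name `best_component_count` and the `evaluate` callback are hypothetical, and the scores in the toy table are invented purely to exercise the loop; they are not results from FIG. 5.

```python
def best_component_count(evaluate, max_components, step=10):
    """Try step, 2*step, ... principal components and keep the best score.

    evaluate: callable taking the number of components and returning a
              score such as the mean AUC (hypothetical callback).
    """
    best_k, best_score = None, float("-inf")
    for k in range(step, max_components + 1, step):
        score = evaluate(k)
        if score > best_score:  # strict: the smallest k wins on ties
            best_k, best_score = k, score
    return best_k, best_score


# Invented scores for illustration only.
toy_auc = {10: 0.61, 20: 0.66, 30: 0.70, 40: 0.68, 50: 0.65}
best_k, best_score = best_component_count(toy_auc.__getitem__,
                                          max_components=50)
print(best_k, best_score)
```

In practice `evaluate` would fit a model on the first k principal components and return its cross-validated AUC for one cancer stage.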

FIG. 3 is a flowchart of a method for discovering methylation markers for predicting cancer prognosis using machine learning according to one embodiment. Referring to FIG. 3, the method includes steps processed in time series by the computing device 50 described with reference to the preceding drawings. Accordingly, even where a description is omitted below, the content described with reference to the preceding drawings also applies to the method of FIG. 3.

In the first step, at least one processor may derive variables by reducing the dimension of the first methylation data (S10). For example, the first methylation data may be array-based methylation data. In one embodiment, the dimension reduction may be performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).
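To make the dimension reduction in step S10 concrete, the sketch below derives a single variable, the first principal component, from a small sample-by-site matrix. It uses power iteration on the centered data so that it runs with the standard library alone; that choice, the function name, and the toy matrix are assumptions for illustration, and a real analysis would more likely use a numerical library and keep many components.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.

    rows: list of samples, each a list of feature values.
    Power iteration is used here only for illustration. It can fail if the
    starting vector happens to be orthogonal to the leading component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Non-uniform start vector, normalized to unit length.
    v = [1.0 / (j + 1) for j in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        # w = X^T (X v), proportional to the covariance matrix times v.
        proj = [sum(c[j] * v[j] for j in range(d)) for c in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n))
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy matrix: 4 samples, 2 features lying exactly on the line y = 2x.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(rows)
print(direction)
print(scores)
```

The returned scores are the derived variable for each sample; they would serve as training input in the next step.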

In the second step, the at least one processor may generate a cancer prognosis prediction test model through machine learning, using the derived variables as training input data and clinical data of cancer patients as training output data (S20). In one embodiment, the test model may be generated using a bootstrap algorithm.
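One common way to use the bootstrap when building a model is to draw training samples with replacement and hold out the samples that were never drawn. The sketch below shows that split. Treating the out-of-bag samples as the test set is an assumption, as the description only says that a bootstrap algorithm is used; the function name and the fixed seed are also illustrative.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap training set and return the out-of-bag test set.

    Returns (train_idx, test_idx): train_idx has n_samples indices drawn
    with replacement; test_idx lists the indices that were never drawn.
    """
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag = set(train_idx)
    test_idx = [i for i in range(n_samples) if i not in in_bag]
    return train_idx, test_idx


rng = random.Random(0)  # fixed seed so the example is repeatable
train_idx, test_idx = bootstrap_split(10, rng)
print(train_idx)
print(test_idx)
```

Repeating the split (for example 200 times) and refitting the model on each training set gives the repeated evaluations that the description averages over.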

In the third step, the at least one processor may determine one or more variables important for predicting cancer prognosis by evaluating the performance of the test model (S30). In one embodiment, cross-validation may be performed, and the one or more variables important for predicting cancer prognosis may be derived from the model showing the maximum performance.
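The description evaluates test models by their average C-index. As a reference point, a minimal sketch of one standard form of that measure (Harrell's concordance index for right-censored survival data) is given below. Whether the patent uses exactly this variant is an assumption, and the function name and toy values are illustrative.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times:  observed follow-up time per patient
    events: 1 if the event was observed, 0 if censored
    risks:  predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort with made-up values: 5 comparable pairs, 4 ordered correctly.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.5, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.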

In the fourth step, the at least one processor may extract, from the first methylation data, CpG sites having a significant correlation with the variables important for predicting cancer prognosis (S40).

In the fifth step, the at least one processor may select, from the second methylation data, the data corresponding to the extracted CpG sites as the final training input data (S50). For example, the second methylation data may be bisulfite sequencing data.

In the sixth step, the at least one processor may generate a cancer prognosis prediction model using the final training input data (S60).

Meanwhile, the above-described embodiments of the present invention can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program using a computer-readable recording medium. In addition, the data structures used in the above-described embodiments can be recorded on a computer-readable recording medium through various means. The computer-readable recording medium includes storage media such as magnetic storage media (for example, ROM, floppy disk, hard disk) and optically readable media (for example, CD-ROM, DVD).

40: first methylation data
41: second methylation data
42: clinical information data of cancer patients
50: computing device
80: data interface
90: memory
100: processor
101: variable derivation unit
102: test model generation unit
103: important variable determination unit
104: CpG site extraction unit
105: final training input data selection unit
106: model generation unit

Claims

A method for discovering a methylation marker for predicting cancer prognosis using machine learning, performed by an apparatus operated by at least one processor, the method comprising:
a first step of extracting, by the at least one processor, from first methylation data, a CpG site having a significant correlation with a variable important for predicting cancer prognosis;
a second step of selecting, by the at least one processor, from second methylation data, data corresponding to the extracted CpG site as final training input data; and
a third step of generating, by the at least one processor, a cancer prognosis prediction model using the final training input data.

The method according to claim 1, further comprising, before the first step:
deriving, by the at least one processor, a variable by reducing a dimension of the first methylation data;
generating, by the at least one processor, a test model for predicting cancer prognosis through machine learning by using the derived variable as training input data and clinical data of a cancer patient as training output data; and
determining, by the at least one processor, one or more variables important for predicting cancer prognosis by evaluating the performance of the test model.

The method of claim 1, wherein the first methylation data or the second methylation data is array-based methylation data or bisulfite sequencing data.

The method of claim 2, wherein the deriving of the variable by reducing the dimension is performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).

The method of claim 2, wherein the test model for predicting cancer prognosis is generated using a bootstrap algorithm.

The method of claim 2, wherein the determining of the one or more variables important for predicting cancer prognosis comprises performing cross-validation and deriving the one or more variables from the model showing maximum performance.

A computer-readable recording medium storing a program for executing the method of any one of claims 1 to 6 on a computer.

An apparatus for discovering a methylation marker for predicting cancer prognosis using machine learning, the apparatus comprising:
a CpG site extraction unit for extracting, from first methylation data, a CpG site having a significant correlation with a variable important for predicting cancer prognosis;
a final training input data selection unit for selecting, from second methylation data, data corresponding to the extracted CpG site as final training input data; and
a model generation unit for generating a cancer prognosis prediction model using the final training input data.

The apparatus of claim 8, further comprising:
a variable derivation unit for deriving a variable by reducing the dimension of the first methylation data;
a test model generation unit for generating a cancer prognosis prediction test model through machine learning by using the derived variable as training input data and clinical data of a cancer patient as training output data; and
an important variable determination unit for determining one or more variables important for predicting cancer prognosis by evaluating the performance of the test model.

The apparatus of claim 8, wherein the first methylation data or the second methylation data is array-based methylation data or bisulfite sequencing data.

The apparatus of claim 8, wherein the dimension reduction is performed using any one method selected from the group consisting of principal component analysis (PCA), singular value decomposition (SVD), and non-negative matrix factorization (NMF).

The apparatus of claim 8, wherein the test model is generated using a bootstrap algorithm.

The apparatus of claim 8, wherein the one or more variables important for predicting cancer prognosis are determined by performing cross-validation and taking the variables from the model showing maximum performance.