KR102265529B1

KR102265529B1 - Method for predicting disease risk based on analysis of complex genetic information

Info

Publication number: KR102265529B1
Application number: KR1020190064200A
Authority: KR
Inventors: 조동호; 서혜인; 송용준; 한규범; 지동진
Original assignee: 한국과학기술원
Priority date: 2018-05-31
Filing date: 2019-05-31
Publication date: 2021-06-17
Also published as: KR20190137012A

Abstract

The present invention relates to a method for diagnosing disease risk based on network analysis of complex genetic information. By introducing an optimization or learning method, the method can derive a stable correlation with a disease from a small number of combinations of genetic information, and it provides correlations among genetic information based on a network model. Using the correlations between genetic information and disease derived in the present invention, it is expected that a diagnostic technology can be secured whose accuracy and cost-effectiveness are sufficient for commercial use in the medical industry.
In addition, the biomarkers derived in the present invention are expected to be used efficiently for purposes such as determining disease prognosis, through the manufacture of medical devices including diagnostic chips and terminals and through commercialization as a disease diagnosis service.

Description

Method for predicting disease risk based on analysis of complex genetic information

The present invention relates to a method for diagnosing disease risk based on network analysis of complex genetic information.

Current technology for disease diagnosis relies either on common human polymorphisms of specific genes (single nucleotide polymorphisms, copy number variations, base insertions/deletions, and the like) or on expression information across whole gene sets. Research has proceeded by measuring changes in gene and protein expression with microarrays or protein chips, in order to search for genes associated with a specific disease and to study their functions.

However, existing research has focused on examining the association between a single type of specimen and a single disease, so the relationships and interactions between diverse genetic information and diseases remain poorly understood. In addition, because techniques for analyzing the relationship between complex genetic information and disease are lacking, it is difficult to find previously unidentified disease-specific variants, and the accuracy of diagnostic techniques is markedly low.

Biomarker extraction from genetic information is a technique that extracts markers by statistically analyzing disease-related genetic information. However, conventional biomarker extraction is performed only within the range of existing information obtained bottom-up, remains at the level of extracting markers centered on a subset of genetic information such as genes, and is limited to a one-to-one relationship between a single piece of genetic information and a disease.

In addition, biomarker-based disease diagnosis services work by calculating how much a specific piece of genetic information contributes to a disease or trait. However, conventional diagnosis services depend on deriving a simple relationship between one disease and one type of genetic information and cannot perform a combined analysis of diseases and genetic information. They also fail to reflect changes in characteristics that depend on higher-dimensional variables such as the passage of time, treatment, and recurrence. As a result, diagnostic accuracy is low, and different service platforms produce different results.

International Patent Publication No. 2014-052909.

To solve these problems, the present inventors sought to derive disease state-specific information from the relationships among complex genetic information and, by using network model-based optimization and artificial intelligence-based machine learning techniques, to develop highly accurate biomarkers for disease diagnosis and a method for predicting disease risk.

To achieve this goal, the relationships among genetic information were analyzed in order to understand how complex, diverse genetic information relates to disease, and optimal combinations of genetic information were extracted and analyzed with a network model-based technique, with the aim of deriving biomarkers that offer high diagnostic accuracy and good cost-effectiveness.

The present invention provides a method for predicting disease risk through analysis of relationships among complex genetic information, the method comprising:

extracting complex genetic information from specimens of disease patients and normal subjects;

constructing a complex genetic information library by comparing and analyzing the complex genetic information;

deriving a disease state-specific biomarker by applying an optimization method or a learning method to the complex genetic information library; and

constructing a network model for predicting disease risk from the disease state-specific biomarker and predicting the risk.

The present invention also provides a disease state-specific biomarker derived by the above method for predicting disease risk through analysis of relationships among complex genetic information.

The method for diagnosing disease risk based on network analysis of complex genetic information in blood, developed according to the present invention, can derive a stable correlation with a disease from a small number of combinations of genetic information by introducing a learning method, and it provides correlations among genetic information based on a network model. Using the correlations between genetic information and disease derived in the present invention, it is expected that a diagnostic technology can be secured whose accuracy and cost-effectiveness are sufficient for commercial use in the medical industry.

In addition, the biomarkers derived in the present invention are expected to be used efficiently for purposes such as determining disease prognosis, through the manufacture of medical devices including diagnostic chips and terminals and through commercialization as a disease diagnosis service.

FIG. 1 illustrates an example of the concept of the method for diagnosing disease risk based on network analysis of complex genetic information according to the present invention.
FIG. 2 illustrates the concept of stage-by-stage genetic information based on the gene expression process.
FIG. 3 illustrates an example of the concept of deriving and verifying a disease state-specific biomarker using a learning method.
FIG. 4 illustrates an example of characteristic modeling of a disease state-specific biomarker.
FIG. 5 illustrates an example of a method for CNN analysis of protein expression data.
FIG. 6 illustrates an example of an algorithm for predicting the risk of digestive system cancer from mi-RNA information.
FIG. 7 shows verification results obtained using basic CNN analysis alone.
FIG. 8 shows how the learning results change when combinations of important mi-RNA candidates are extracted and used for learning.
FIG. 9 shows results on the feasibility of simultaneous screening and precise diagnosis, as confirmed with proteins.

Hereinafter, the method for diagnosing disease risk based on network analysis of complex genetic information according to the present invention is described in detail with reference to the accompanying tables and drawings.

Where drawings are provided, they are given as examples so that the spirit of the present invention can be fully conveyed to those skilled in the art. Accordingly, the present invention is not limited to the drawings presented and may be embodied in other forms, and the drawings may be exaggerated to make the spirit of the invention clear.

Unless otherwise defined, the technical and scientific terms used herein have the meanings commonly understood by a person of ordinary skill in the art to which this invention belongs. Descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the invention are omitted from the following description and the accompanying drawings.

In the present invention, "specimen" and "sample" both denote the genetic information obtained for analysis, and the two terms are used interchangeably throughout the specification.

The present invention relates to a method for diagnosing disease risk based on network analysis of complex genetic information in blood.

Based on the extracted complex genetic information, the present invention compares, analyzes, and classifies information related to general biological phenomena and to disease. This aids understanding of the functions of genetic information and, further, makes it possible to derive highly accurate disease state-specific biomarkers and to build a disease risk prediction model.

To derive disease state-specific biomarkers and build a disease risk prediction model, the present invention may apply a combination of techniques to the vast amount of genetic information, such as big data processing, artificial intelligence-based deep learning methods, and, as an example, machine learning methods.

Hereinafter, the method for predicting disease risk through analysis of relationships among complex genetic information is described in detail.

Each step of the method is described in detail below.

First, the step of extracting complex genetic information from specimens of disease patients and normal subjects is described in detail.

In the step of extracting complex genetic information from specimens of disease patients and normal subjects, information on DNA, RNA, proteins, and the like can be obtained for the whole genome of each specimen. The way this information is obtained is not limited as long as it does not defeat the purpose of the invention. As one example, it can be obtained from a genetic information database; more specifically, a database provided by the U.S. National Institutes of Health (NIH) can be used; and, still more specifically, cancer-related information can be obtained from the whole-genome information that TCGA (The Cancer Genome Atlas) provides for each type of disease. As another example, the information can be obtained by commissioning genome sequencing of patient specimens collected at a hospital or collected directly. As yet another example, a whole exome sequence set, which plays a direct role in protein synthesis within genes, can be obtained and used, but the invention is not limited to these examples.

In the present invention, the genome sequence information of a specimen may vary somewhat depending on the type of genetic information database, the instrument used for sequencing, the sequencing method, and the like. The genome sequence information is also not limited as long as it does not defeat the purpose of the invention; as one example, it may be referenced against the information provided by the human genome map produced by the Human Genome Project.

In the present invention, the whole-genome sequence information of specimens from disease patients and normal subjects can serve as the basis for detecting biomarkers according to the invention. Analysis is performed on the differences in the specimens' genome sequence information, including DNA information such as cf-DNA and ct-DNA, RNA expression information such as mRNA and mi-RNA, and protein synthesis information obtainable from that sequence information. Within the whole-genome sequence information, and without limitation, the following can be used principally for detecting diagnostic biomarkers: chromosome information, information on the position of a nucleotide sequence within a chromosome, sequence variation information concerning base additions, deletions, or substitutions, RNA information, protein expression information, and information including the three-dimensional structure of proteins and its reliability.

In the present invention, the information obtained when analyzing the genome sequence information may be larger or smaller depending on the type, version, and operating environment of the program used.

Next, the step of constructing a complex genetic information library by comparing and analyzing the complex genetic information is described in detail.

In this step, the complex relationships that exist among the genetic information obtained in the extraction step are analyzed, and the important disease-related genetic information is extracted and organized into a library.

In the present invention, the genetic information is not limited as long as it does not defeat the purpose of the invention. Examples include DNA information such as cf-DNA and ct-DNA, RNA expression information such as mRNA and mi-RNA, and protein synthesis information, all of which relate to the gene expression process (FIG. 2).

Extraction of the important genetic information factors to be analyzed is not limited as long as it does not defeat the purpose of the invention, but it may include the following process.

First, classification accuracy can be extracted for cases in which a single genetic information factor distinguishes the normal group from the disease group. Such a single factor is not limited in type or number as long as it alone can separate the two groups. Examples include, without limitation, single nucleotide polymorphisms (including sequence changes from base additions, deletions, or substitutions), copy number variations, and amino acid sequence polymorphisms of proteins. For example, when a sequence variant appears in common across the disease-group specimens and is consistently absent from the normal-group specimens, it is advisable to identify that genetic information and to extract and store the position and variation information of the sequence.
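
The single-factor screening described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the factor names (`SNP_A`, `SNP_B`), the toy labels, and the simple "variant present means disease" rule are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def single_factor_accuracy(has_variant: List[bool], is_disease: List[bool]) -> float:
    """Accuracy of the rule 'variant present means disease' for one factor."""
    if not has_variant or len(has_variant) != len(is_disease):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(v == d for v, d in zip(has_variant, is_disease))
    return correct / len(has_variant)


def rank_factors(calls: Dict[str, List[bool]], is_disease: List[bool]) -> List[Tuple[str, float]]:
    """Score every factor on its own and sort from most to least accurate."""
    scored = [(name, single_factor_accuracy(c, is_disease)) for name, c in calls.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three disease samples followed by three normal samples.
labels = [True, True, True, False, False, False]
variant_calls = {
    "SNP_A": [True, True, True, False, False, False],  # present only in the disease group
    "SNP_B": [True, False, True, False, True, False],  # weakly associated
}
ranking = rank_factors(variant_calls, labels)
```

In this toy case `SNP_A` separates the two groups perfectly and would be the factor whose position and variation information is stored.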

Next, for each genetic information factor, the difference between the actual expression level and a reference level can be measured to set an on/off tag that indicates whether the factor affects the selection of the disease group. As an example, let the reference expression levels for the stages of the gene expression process be $Th_1$, $Th_2$, and $Th_3$, respectively. When a disease increases or decreases the expression of genetic information, increase thresholds ($Th_1^{up}$, $Th_2^{up}$, $Th_3^{up}$) and decrease thresholds ($Th_1^{down}$, $Th_2^{down}$, $Th_3^{down}$) can be defined and used. Using these variables, genetic information that satisfies each expression criterion and whose expression level changes with the disease can be extracted from the obtained specimens. If necessary, the nucleotide sequence of that genetic information can also be obtained, and variants such as the single nucleotide polymorphisms (including sequence changes from base additions, deletions, or substitutions) and copy number variations mentioned above can be extracted so that disease-related sequence variation information can be used, but the invention is not limited to this.
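
A minimal sketch of the on/off tagging follows. It assumes one plausible reading of the thresholds: a factor is tagged "up" at or above the increase threshold, "down" at or below the decrease threshold, and "off" otherwise. The stage names, the numeric threshold values, and the sample values are placeholders invented for illustration.

```python
from typing import Dict, Tuple

# Hypothetical (decrease, increase) thresholds per expression stage.
# The numbers are placeholders, not values from the patent.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "dna": (0.5, 2.0),      # Th_1^down, Th_1^up
    "rna": (0.4, 1.8),      # Th_2^down, Th_2^up
    "protein": (0.6, 1.5),  # Th_3^down, Th_3^up
}


def expression_tag(stage: str, level: float) -> str:
    """Return 'up', 'down', or 'off' for one factor at the given stage."""
    down, up = THRESHOLDS[stage]
    if level >= up:
        return "up"
    if level <= down:
        return "down"
    return "off"


def tag_sample(sample: Dict[str, Tuple[str, float]]) -> Dict[str, str]:
    """Tag every factor in a sample given as name -> (stage, relative expression)."""
    return {name: expression_tag(stage, level) for name, (stage, level) in sample.items()}


sample = {"mi-RNA1": ("rna", 0.2), "protein5": ("protein", 1.9), "ct-DNA5": ("dna", 1.0)}
tags = tag_sample(sample)
```

Factors tagged "off" carry no disease-related expression change under these thresholds and can be excluded from later steps.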

Using the extracted genetic information factors, a library is constructed by analyzing, across the different stages illustrated in FIG. 2, the changes in expression level, the sequence variations, and the like among the genetic information belonging to each stage. This makes it possible to identify correlations among the complex genetic information, which can then be used to derive biomarkers.

[Table 1] Example of constructing a genetic information relationship analysis library

As an example of library construction through genetic information relationship analysis, as in Table 1, the library can record cases such as the following: mi-RNA1 is observed to be expressed at or below the threshold $Th_2^{down}$ in a male patient with stage 1 gastric cancer and a male patient with stage 2 liver cancer while, at the same time, protein5 is expressed at or above the threshold $Th_3^{up}$; or SNP1 of mi-RNA5 is found in a male and a female with stage 1 gastric cancer, but no relationship with other specific genetic information is found.

The information analyzed in this way can be converted into a fixed platform, that is, a uniform format such as that of Table 1, and then stored or managed.
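
One way to hold such records in a uniform format is sketched below. The four entries restate the cases described in the text; the field names and the `LibraryEntry` type are assumptions, since the layout of Table 1 is not reproduced here.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LibraryEntry:
    """One observed relationship; field names are hypothetical."""
    disease: str
    stage: str
    sex: str
    factor: str
    change: str                          # "up", "down", or "variant"
    variant: Optional[str] = None        # e.g. an SNP identifier
    linked_factor: Optional[str] = None  # co-observed factor, if any
    linked_change: Optional[str] = None


library: List[LibraryEntry] = [
    LibraryEntry("gastric cancer", "1", "male", "mi-RNA1", "down",
                 linked_factor="protein5", linked_change="up"),
    LibraryEntry("liver cancer", "2", "male", "mi-RNA1", "down",
                 linked_factor="protein5", linked_change="up"),
    LibraryEntry("gastric cancer", "1", "male", "mi-RNA5", "variant", variant="SNP1"),
    LibraryEntry("gastric cancer", "1", "female", "mi-RNA5", "variant", variant="SNP1"),
]


def entries_for(entries: List[LibraryEntry], disease: str, stage: str) -> List[LibraryEntry]:
    """Select the library rows recorded for one disease state."""
    return [e for e in entries if e.disease == disease and e.stage == stage]


# Every row serializes to the same set of keys, i.e. a uniform storage format.
rows = [asdict(e) for e in library]
```

Because each record has the same fields, rows from different diseases and different stages of gene expression can be stored and queried together.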

Next, the step of deriving a disease state-specific biomarker by applying an optimization method or a learning method to the complex genetic information library is described in detail.

In this step, the complex genetic information library constructed above is analyzed with an optimization method or a learning method to derive biomarkers specific to a disease state.

The method of extracting candidate disease state-specific biomarkers is not limited as long as it does not defeat the purpose of the invention. From the library derived above, it is checked whether the same genetic information relationship holds in both the specimens and the library for the disease state of interest. From the confirmed genetic information, the increase/decrease relationships, the sequence variation information, and the number of pieces of genetic information are extracted and selected as the candidate group for deriving the disease state-specific biomarker. In selecting the candidate group, it is preferable to maximize the accuracy with which the biomarker indicates the disease state while at the same time minimizing the number of pieces of genetic information that must be considered. To this end, the problem can be defined as a multi-variable optimization, and the disease state-specific marker can then be derived by applying a mathematical algorithm, but the invention is not limited to this.
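
The two goals above (high accuracy, few markers) can be folded into one score, for example by subtracting a size penalty from the accuracy. The sketch below assumes this scalarized form; the penalty weight and the toy accuracy table are invented for illustration, and the text does not prescribe any particular objective.

```python
from typing import Callable, Dict, FrozenSet


def objective(subset: FrozenSet[str],
              accuracy_of: Callable[[FrozenSet[str]], float],
              penalty: float = 0.01) -> float:
    """Higher is better: classification accuracy minus a cost per marker used."""
    return accuracy_of(subset) - penalty * len(subset)


# Hypothetical accuracies for three candidate marker sets.
toy_accuracy: Dict[FrozenSet[str], float] = {
    frozenset({"mi-RNA1"}): 0.80,
    frozenset({"mi-RNA1", "protein5"}): 0.92,
    frozenset({"mi-RNA1", "protein5", "ct-DNA5"}): 0.925,
}

best = max(toy_accuracy, key=lambda s: objective(s, toy_accuracy.__getitem__))
```

With these numbers the third marker adds only 0.005 accuracy, less than its 0.01 cost, so the two-marker set scores highest.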

Any mathematical algorithm capable of solving the multi-variable problem may be adopted without limitation. Examples include simulated annealing, genetic algorithms, tabu search, simulated evolution, and stochastic evolution, and a genetic algorithm is preferably used. When a disease state-specific biomarker is extracted by such a method, the whole process need not run to completion. The search for the optimum may be stopped partway, and the best solution found up to that point may be used.

The genetic algorithm takes its basic theory from the genetics of living organisms. It is a parallel, global search algorithm that represents the possible solutions to a problem in a fixed data structure and then modifies them gradually to produce better and better solutions. The data structure that represents a solution corresponds to a gene, and the process of producing better solutions by modifying it corresponds to evolution. Put differently, a genetic algorithm is a search algorithm that simulates evolution to find a solution x that optimizes some unknown function Y = f(x). It is closer to an approach to problem solving than to an algorithm for one specific problem, and it can be applied to any problem that can be expressed in a form a genetic algorithm can use. In general, when a problem is too complex to compute exactly, a genetic algorithm is a suitable way to obtain an answer close to the optimum even if the true optimum cannot be found.
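
A compact genetic algorithm over marker subsets might look like the sketch below. Each candidate solution is a 0/1 mask over the factors. The operators chosen here (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness function are all assumptions for illustration, not details taken from the patent.

```python
import random
from typing import Callable, List, Tuple

Mask = Tuple[int, ...]  # 1 = factor included in the marker set


def genetic_search(n_factors: int, fitness: Callable[[Mask], float],
                   pop_size: int = 30, generations: int = 40,
                   mutation_rate: float = 0.05, seed: int = 0) -> Tuple[Mask, List[float]]:
    """Return the best mask found and the best fitness seen at each generation."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_factors)) for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):  # stopping early simply returns the best so far
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_gen = [population[0]]  # elitism: the current best survives unchanged
        while len(next_gen) < pop_size:
            parent_a = max(rng.sample(population, 3), key=fitness)  # tournament selection
            parent_b = max(rng.sample(population, 3), key=fitness)
            cut = rng.randint(1, n_factors - 1)                     # one-point crossover
            child = parent_a[:cut] + parent_b[cut:]
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit for bit in child)
            next_gen.append(child)
        population = next_gen
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


INFORMATIVE = (0, 3, 7)  # hypothetical indices of the truly useful factors


def toy_fitness(mask: Mask) -> float:
    """Reward covering the informative factors, penalize every factor used."""
    hits = sum(mask[i] for i in INFORMATIVE) / len(INFORMATIVE)
    return hits - 0.05 * sum(mask)


best_mask, best_history = genetic_search(10, toy_fitness)
```

Because the best individual is always carried over, the recorded best fitness never decreases, which is what makes it safe to stop the search early and keep the best solution found so far.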

As an example of deriving a disease state-specific biomarker, as shown in FIG. 3, a set of learning samples to be analyzed and a set of verification samples for checking the accuracy obtained on the learning samples can be prepared. The verification samples may, for example, contain only genetic information already known from prior analysis to be specific to the disease state, but the invention is not limited to this. As in one embodiment of the present invention, the library under analysis may instead be split at random into learning samples and verification samples, and the learning process may be repeated several times to improve accuracy.
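
The random learning/verification split with repetition can be sketched as below. The split fraction, the number of repeats, and the stand-in learner (a threshold placed midway between the class means) are assumptions chosen to keep the example short, and they stand in for whatever learning method is actually used.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (expression value, label: 1 = disease, 0 = normal)


def repeated_holdout(samples: Sequence[Sample],
                     train_and_score: Callable[[List[Sample], List[Sample]], float],
                     repeats: int = 5, train_fraction: float = 0.7,
                     seed: int = 0) -> List[float]:
    """Randomly split into learning/verification sets several times; return each score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return scores


def threshold_learner(train: List[Sample], valid: List[Sample]) -> float:
    """Stand-in learner: threshold halfway between class means, scored on valid."""
    normal = [x for x, y in train if y == 0]
    disease = [x for x, y in train if y == 1]
    threshold = (sum(normal) / len(normal) + sum(disease) / len(disease)) / 2
    hits = sum((x > threshold) == bool(y) for x, y in valid)
    return hits / len(valid)


# Hypothetical, cleanly separable toy data: ten normal and ten disease samples.
samples = [(0.1 * i, 0) for i in range(1, 11)] + [(2.0 + 0.1 * i, 1) for i in range(1, 11)]
scores = repeated_holdout(samples, threshold_learner)
```

Averaging the scores over the repeats gives a steadier estimate than a single split, at the cost of retraining once per repeat.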

When the library is large, it is preferable to carry out a step that reduces complexity. For a library of size N, the number of possible subsets is 2^N, so computing the classification accuracy of every subset becomes impractical as the library grows. To address this, the complexity needs to be reduced, for example with a heuristic algorithm. As one example, starting from the set of size N, if the potential of each marker is checked and only the most promising case is carried forward, so that the set shrinks step by step, the total number of marker cases to be examined falls to N + (N-1) + ... + 1 = N(N+1)/2.
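
One reading of this stepwise reduction is greedy backward elimination: at each round, score the current set with each remaining marker removed, keep the best reduced set, and repeat. The sketch below follows that reading and counts evaluations to show the N(N+1)/2 total. The scoring function and marker names are toy assumptions.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_elimination(markers: Iterable[str],
                         score: Callable[[FrozenSet[str]], float]
                         ) -> Tuple[FrozenSet[str], int]:
    """Greedily shrink the marker set; return the best subset seen and the evaluation count."""
    remaining = frozenset(markers)
    best_subset, best_score = remaining, float("-inf")
    evaluations = 0
    while remaining:
        candidates = []
        for marker in sorted(remaining):  # one evaluation per remaining marker
            subset = remaining - {marker}
            candidates.append((score(subset), subset))
            evaluations += 1
        round_score, round_subset = max(candidates, key=lambda c: c[0])
        if round_score > best_score:
            best_subset, best_score = round_subset, round_score
        remaining = round_subset  # carry forward only the most promising case
    return best_subset, evaluations


INFORMATIVE_MARKERS = frozenset({"mi-RNA1", "protein5"})  # hypothetical ground truth


def toy_score(subset: FrozenSet[str]) -> float:
    """Reward covering the informative markers, penalize set size."""
    return len(subset & INFORMATIVE_MARKERS) / 2 - 0.05 * len(subset)


all_markers = ["ct-DNA5", "mRNA2", "mRNA9", "mi-RNA1", "protein5"]
chosen, n_evaluations = backward_elimination(all_markers, toy_score)
```

For the five markers here, the search makes 5 + 4 + 3 + 2 + 1 = 15 evaluations instead of scoring all 2^5 = 32 subsets. The greedy path is a heuristic and is not guaranteed to find the global optimum.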

The selection of genetic information as variables for the multi-variable optimization is not limited as long as it does not defeat the purpose of the invention. As one example, genetic information may be selected arbitrarily according to the heuristic algorithm, and preferably the combination of genetic information with the highest accuracy is selected. For example, if mi-RNA1 and ct-DNA5 tend to increase together, the increase or decrease in the expression level of each can be used in learning as two separate features, and whether the joint increase of mi-RNA1 and ct-DNA5 is present in a sample can be used as one additional feature.
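
The feature construction in this example can be sketched as follows. The tag values ("up", "down", "off") and the feature naming scheme are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def build_features(tags: Dict[str, str], pairs: List[Tuple[str, str]]) -> Dict[str, int]:
    """One feature per factor (increased or not) plus one per co-increase pair."""
    features = {f"{name}_up": int(tag == "up") for name, tag in tags.items()}
    for a, b in pairs:
        both_up = tags.get(a) == "up" and tags.get(b) == "up"
        features[f"{a}+{b}_co_up"] = int(both_up)
    return features


pairs = [("mi-RNA1", "ct-DNA5")]
both = build_features({"mi-RNA1": "up", "ct-DNA5": "up"}, pairs)
only_one = build_features({"mi-RNA1": "up", "ct-DNA5": "off"}, pairs)
```

The co-increase feature is redundant for a model that can learn interactions on its own, but it makes the relationship recorded in the library explicit to simpler learners.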

In the present invention, the artificial-intelligence-based learning method used for deriving biomarkers is not limited in kind as long as it does not impair the purpose of the present invention. For example, a neural network or deep learning may be used, and examples of the neural network include a convolutional neural network (CNN) and a recurrent neural network (RNN). A CNN may be used, as in one embodiment of the present invention, but the method is not limited thereto, and an appropriate learning method may be selected according to the characteristics of the acquired data and biomarkers.

In the present invention, a process for verifying the performance of the disease state-specific biomarker derived by the above method is preferably further performed. To this end, the derived disease state-specific biomarker is applied to samples that were not used for biomarker derivation, or to normal samples, and the classification accuracy is calculated, which verifies the accuracy of the derived biomarker.

Next, the step of constructing a network model for predicting disease risk from the disease state-specific biomarker and predicting the risk is described in detail.

In the step of constructing a network model for predicting disease risk from the disease state-specific biomarker, state changes of the disease such as onset, progression, and recurrence can be built in network form from the complex genetic information library, obtained by analyzing the relationships among the complex genetic information, and from the disease state-specific biomarker derived using the optimization method or the learning method.

The method of constructing the network is not limited as long as it does not impair the purpose of the present invention, but it may include analyzing, with the genetic information library constructed by the above method, how the information in the disease state-specific biomarker changes as a specific disease state changes. As one example of this analysis, the discontinuous expression changes of the genetic information ct-DNA1 or mi-RNA5 can be tracked and modeled as a mathematical function, as in FIG. 4. The form of the mathematical function is not particularly limited, but it is preferable to select a regression function that approximately fits the discontinuous expression-change data.

The regression analysis used to construct the regression function is broadly divided into simple regression and multiple regression. Simple regression analyzes the relationship between one dependent variable and one independent variable, and multiple regression identifies the relationship between one dependent variable and several independent variables. Each expression change illustrated in FIG. 4 has one dependent variable and one independent variable, so its regression function can be obtained by simple regression. For example, the expression of ct-DNA1 in FIG. 4 can be modeled by a regression function in the form of an exponential function, and the expression of mi-RNA5 by one in the form of a step function.
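The two regression forms named above can be fitted with simple least squares, as in the sketch below. It is an illustration under stated assumptions: the independent variable is taken to be a time-like index `t`, the exponential is fitted on log-transformed values (so all `y` must be positive), the step function has a single jump, and the data are synthetic rather than taken from FIG. 4.

```python
import math
from typing import List, Sequence, Tuple


def fit_exponential(t: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Fit y ~ a * exp(b * t) by ordinary least squares on log(y)."""
    n = len(t)
    log_y = [math.log(v) for v in y]
    mean_t = sum(t) / n
    mean_l = sum(log_y) / n
    sxx = sum((ti - mean_t) ** 2 for ti in t)
    sxy = sum((ti - mean_t) * (li - mean_l) for ti, li in zip(t, log_y))
    b = sxy / sxx
    a = math.exp(mean_l - b * mean_t)
    return a, b


def fit_step(t: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Fit a one-jump step function; return (threshold, low level, high level).

    Every split point between consecutive observations is tried and the one
    with the smallest squared error is kept.
    """
    pairs = sorted(zip(t, y))
    best = None
    for i in range(1, len(pairs)):
        left: List[float] = [v for _, v in pairs[:i]]
        right: List[float] = [v for _, v in pairs[i:]]
        low = sum(left) / len(left)
        high = sum(right) / len(right)
        sse = sum((v - low) ** 2 for v in left) + sum((v - high) ** 2 for v in right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, low, high)
    return best[1], best[2], best[3]


# Synthetic data: an exponential trend and a single jump.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
a, b = fit_exponential(t, [2.0 * math.exp(0.5 * ti) for ti in t])
threshold, low, high = fit_step(t, [1.0, 1.0, 1.0, 5.0, 5.0])
print(a, b, threshold, low, high)
```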

After the characteristics of the disease state-specific biomarker have been modeled mathematically by the above method, a genetic-information relation network model, that is, a network model composed of genetic information for predicting disease risk, can be established so that the changes in genetic information accompanying major changes in disease state can be tracked.

The form of the genetic information relation network model is not limited as long as it does not impair the purpose of the present invention. It may be a static disease network, to which only the correlations among the complex genetic information are applied, or a dynamic disease network, to which individual-specific genetic information such as elapsed time and habits is added as variables, and the dynamic disease network is preferred. A network model of this form is preferable because continuously changing genetic information characteristics can be tracked and disease can be diagnosed and predicted.

In the present invention, the accuracy of the biomarker and of the genetic information relation network model, which is the network model for predicting disease risk, may be evaluated with the following indicators, although the evaluation is not limited to them.

- Sensitivity: a metric that evaluates how well patients who actually have the disease are classified. To guard against diagnostic failure caused by misdiagnosis, it is defined as TP/(TP+FN), where TP is the number of cases in which a diseased patient is classified as diseased and FN is the number of cases in which a diseased patient is classified as normal. For the biomarker and the network model for predicting disease risk, a sensitivity of preferably 95% or more, more preferably 99% or more, and most preferably 99.9% or more is favorable in terms of test cost and commercialization potential, and increases the number of cases that can be confirmed with a single test using key genetic information related to multiple diseases.

- Specificity: a metric that evaluates how well people who are actually healthy are classified. To prevent unnecessary follow-up tests caused by a false disease diagnosis, it is defined as TN/(TN+FP), where TN is the number of cases in which a healthy person is classified as normal and FP is the number of cases in which a healthy person is classified as a diseased patient. For the biomarker and the network model for predicting disease risk, a specificity of preferably 90% or more, more preferably 95% or more, and most preferably 99% or more is favorable in terms of test cost and commercialization potential, and increases the number of cases that can be confirmed with a single test using key genetic information related to multiple diseases.

In predicting disease risk, sensitivity or specificity may be used alone or in combination. Sensitivity carries more weight than specificity in disease risk prediction, and using the two together is more preferable.
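The two indicators defined above can be computed from predicted and true labels as in this short sketch. The label encoding (1 for diseased, 0 for normal) and the toy label lists are assumptions for illustration only.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (TP, FN, TN, FP) with 1 = diseased and 0 = normal."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of diseased patients classified as diseased."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of healthy people classified as normal."""
    return tn / (tn + fp)


# Toy labels: 4 diseased and 6 healthy samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp))
```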

In the present invention, the disease may be any disease for which a biomarker can be derived. For example, it may be a disease requiring rapid diagnosis, such as cancer. More specifically, the cancer may be one or more selected from the group consisting of bladder cancer (Bladder urothelial carcinoma), breast cancer (Breast invasive carcinoma), cervical cancer (Cervical and endocervical cancers), colorectal cancer, colon cancer (Colon adenocarcinoma), esophageal cancer (Esophageal carcinoma), glioblastoma (Glioblastoma multiforme), head and neck cancer (Head and Neck squamous cell carcinoma), chromophobe renal cell carcinoma (Kidney Chromophobe), clear cell renal cell carcinoma (Kidney renal clear cell carcinoma), papillary renal cell carcinoma (Kidney renal papillary cell carcinoma), acute myeloid leukemia (Acute Myeloid Leukemia), benign brain tumor (Brain Lower Grade Glioma), liver cancer (Liver hepatocellular carcinoma), lung adenocarcinoma (Lung adenocarcinoma), lung squamous cell carcinoma (Lung squamous cell carcinoma), serous ovarian cancer (Ovarian serous cystadenocarcinoma), pancreatic cancer (Pancreatic adenocarcinoma), adrenal cancer (Pheochromocytoma and Paraganglioma), prostate cancer (Prostate adenocarcinoma), rectal cancer (Rectum adenocarcinoma), sarcoma (Sarcoma), malignant melanoma (Skin Cutaneous Melanoma), gastric cancer (Stomach adenocarcinoma), testicular cancer (Testicular Germ Cell Tumors), thyroid cancer (Thyroid carcinoma), thymoma (Thymoma), and endometrial cancer (Uterine Corpus Endometrial Carcinoma). Preferably, it may be one or more selected from the group consisting of bladder cancer, breast cancer, colorectal cancer, colon cancer, cervical cancer, liver cancer, lung adenocarcinoma, chromophobe renal cell carcinoma, clear cell renal cell carcinoma, papillary renal cell carcinoma, serous ovarian cancer, prostate cancer, lung squamous cell carcinoma, and gastric cancer. More preferably, it may be one or more selected from the group consisting of breast cancer, colorectal cancer, and gastric cancer, but it is not limited thereto.

The biomarkers derived by the present invention are expected to be used efficiently for purposes such as disease prognosis, through the manufacture of medical devices including diagnostic chips and terminals and through commercialization as disease diagnosis services.

Hereinafter, the present invention is described in more detail through examples. The examples serve only to describe the present invention more specifically, and the scope of the present invention is not limited by them.

[Experimental materials]

1. The following mi-RNA-related data were obtained and used.

1) GSE54397

The data were provided by the research database of Professor Na-Young Kim of Seoul National University.

From this database, data on 3,523 types of mi-RNA were obtained from the microarray data of normal-tissue and cancer-tissue samples from 16 gastric cancer patients.

2) GSE61741

The data were provided by the database of Saarland University.

This database holds microarray data for blood samples from a total of 1,049 people, including cancer patients and healthy individuals. From it, data on 848 types of mi-RNA were used for a total of 136 samples: 13 from gastric cancer patients, 29 from colorectal cancer patients, and 94 from healthy individuals.

3) TCGA NGS data

The data were obtained from the TCGA database.

From this database, 491 tissue samples from gastric cancer patients were downloaded, consisting of 45 normal-tissue samples and 446 cancer-tissue samples, and miRNA and NGS read information was obtained.

Data for a total of 211 mi-RNAs were used.

2. The following data other than mi-RNA were obtained and used.

1) TCGA protein expression array data

This is a database of protein expression levels for patients with breast cancer, thyroid cancer, liver cancer, kidney cancer, and lung cancer, and for healthy individuals.

The following data sets were obtained from this database, with the number of samples given for each:

Breast cancer (disease): 1078
Breast cancer (normal): 45
Thyroid cancer: 426
Liver cancer: 183
Kidney cancer 1 (Kidney Renal Clear Cell Carcinoma): 478
Kidney cancer 2 (Kidney Renal Papillary Cell Carcinoma): 215
Kidney cancer 3 (Kidney Chromophobe): 63
Lung cancer 1 (Lung Adenocarcinoma): 365
Lung cancer 2 (Lung Squamous Cell Carcinoma): 327

Each sample contains expression data for about 200 proteins. Only the data for the 146 proteins present in every sample were extracted and used.
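Selecting only the proteins present in every sample amounts to a set intersection, as in the sketch below. The sample identifiers and protein names are hypothetical placeholders; the actual TCGA array layout is not described in the source.

```python
from typing import Dict, Iterable, List


def common_proteins(samples: Dict[str, Iterable[str]]) -> List[str]:
    """Return the sorted list of proteins measured in every sample."""
    iterator = iter(samples.values())
    common = set(next(iterator))
    for proteins in iterator:
        common &= set(proteins)  # keep only proteins seen in this sample too
    return sorted(common)


# Hypothetical samples with partly overlapping protein panels.
samples = {
    "sample_A": ["P1", "P2", "P3", "P4"],
    "sample_B": ["P2", "P3", "P4", "P5"],
    "sample_C": ["P0", "P2", "P4"],
}
shared = common_proteins(samples)
print(shared)
```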

[Table 2] Examples of mi-RNA and protein expression data

[Example 1] Application of a learning method to protein data

For the protein data, the relationships among the input data were derived using the CNN method of FIG. 5. The result was passed through a fully connected layer and finally classified using softmax.
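The final softmax classification step can be sketched in isolation. The logits and the two class labels below are hypothetical; in the described pipeline the logits would come from the fully connected layer that follows the CNN.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: Sequence[float], labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Return the most probable label together with all class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


label, probs = classify([2.0, 0.5], ["disease", "normal"])
print(label, probs)
```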

[Example 2] Accuracy prediction

Training was performed with the same CNN method as in Example 1.

1. GSE54397

Of the 32 samples, 22 were used for training and the remaining 10 for validation.

2. GSE61741

Of the 136 samples, 106 were used for training and the remaining 30 for validation.

3. TCGA NGS data

Of the 491 samples, 391 were used for training and the remaining 100 for validation.
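The three training/validation splits listed above can be reproduced in size with a simple hold-out split. This is a sketch only: the source does not say how samples were assigned to each part, so the random shuffle and the fixed seed are assumptions.

```python
import random
from typing import Dict, List, Tuple


def holdout_split(n_samples: int, n_train: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into training and validation parts."""
    if not 0 < n_train < n_samples:
        raise ValueError("n_train must be between 1 and n_samples - 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# (total samples, training samples) as stated for each data set.
SPLITS: Dict[str, Tuple[int, int]] = {
    "GSE54397": (32, 22),
    "GSE61741": (136, 106),
    "TCGA NGS": (491, 391),
}

validation_sizes = {}
for name, (total, n_train) in SPLITS.items():
    train_idx, val_idx = holdout_split(total, n_train)
    validation_sizes[name] = len(val_idx)
print(validation_sizes)
```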

The validation results are shown in FIGS. 7 and 8.

According to FIG. 7, as training proceeds the GSE54397 model (tissue, microarray data) reaches a classification accuracy of 100%, GSE61741 (blood, microarray data) converges to about 96.67%, and the TCGA NGS data converges to about 99%, so all three show a very high accuracy of 95% or more.
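Because the validation sets are small, the reported accuracies can only take values that are multiples of 1/(validation size). The sketch below shows the arithmetic. The correct-classification counts it prints are inferred from the reported percentages and the validation sizes of 10, 30, and 100; they are not stated in the source.

```python
def accuracy(correct: int, total: int) -> float:
    """Classification accuracy as a fraction."""
    return correct / total


def nearest_count(reported_accuracy: float, total: int) -> int:
    """Number of correct classifications closest to a reported accuracy."""
    return round(reported_accuracy * total)


# (reported accuracy, validation-set size) for each data set.
REPORTED = {
    "GSE54397": (1.00, 10),
    "GSE61741": (0.9667, 30),
    "TCGA NGS": (0.99, 100),
}

for name, (acc, total) in REPORTED.items():
    correct = nearest_count(acc, total)
    print(f"{name}: {correct}/{total} = {accuracy(correct, total):.2%}")
```

Under this reading, an accuracy of about 96.67% on 30 validation samples corresponds to a single misclassified sample.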

According to FIG. 8, which tracks the extraction of the important mi-RNAs, the sensitivity converges to 1 as training proceeds both when all 848 mi-RNAs are used and when the 30 optimal mi-RNAs are extracted (BEST). The specificity fluctuates near 1 as training proceeds with the 848 mi-RNAs, and converges to 0.95 or higher with the 30 optimal mi-RNAs (BEST).

[Example 3] Derivation of biomarkers using clinical data

The data for the three diseases (breast cancer, gastric cancer, and colorectal cancer) were all obtained from the database of The Cancer Genome Atlas (TCGA) project, which the US NIH has run since 2006. The specific data set used for each disease is as follows.

Breast cancer: TCGA-BRCA, gastric cancer: TCGA-STAD, colorectal cancer: TCGA-COAD.

For the mi-RNA genetic information data among these, 30 biomarkers were derived for each cancer type by training with the CNN method in the same manner as in the above examples.

The results are shown in Table 3 below.

[Table 3] Optimal mi-RNA biomarkers (BEST) by cancer type

Breast cancer: hsa-mir-30d, hsa-mir-145, hsa-mir-425, hsa-mir-203a, hsa-mir-452, hsa-mir-378a, hsa-mir-455, hsa-mir-100, hsa-mir-199b, hsa-mir-205, hsa-mir-542, hsa-mir-532, hsa-mir-625, hsa-mir-200c, hsa-mir-183, hsa-mir-22, hsa-mir-451a, hsa-mir-30a, hsa-mir-30e, hsa-mir-148a, hsa-mir-143, hsa-mir-375, hsa-mir-584, hsa-mir-379, hsa-mir-10a, hsa-mir-182, hsa-mir-21, hsa-mir-486-1, hsa-mir-486-2, hsa-mir-10b

Colorectal cancer: hsa-mir-6086, hsa-mir-3118-1, hsa-mir-1321, hsa-mir-548f-5, hsa-let-7c, hsa-mir-4752, hsa-mir-183, hsa-mir-29a, hsa-mir-30e, hsa-mir-486-1, hsa-mir-194-1, hsa-mir-194-2, hsa-mir-30a, hsa-mir-28, hsa-mir-25, hsa-mir-486-2, hsa-mir-182, hsa-mir-30d, hsa-mir-203a, hsa-mir-10b, hsa-mir-148a, hsa-mir-145, hsa-mir-378a, hsa-mir-143, hsa-mir-22, hsa-mir-10a, hsa-mir-200c, hsa-mir-21, hsa-mir-192, hsa-mir-375

Gastric cancer: hsa-mir-500b, hsa-mir-496, hsa-mir-2392, hsa-mir-5739, hsa-mir-4540, hsa-mir-6749, hsa-mir-1915, hsa-mir-202, hsa-mir-2467, hsa-mir-27b, hsa-mir-583, hsa-mir-374c, hsa-mir-219b, hsa-mir-299, hsa-mir-142, hsa-mir-30d, hsa-mir-3074, hsa-mir-147b, hsa-mir-5009, hsa-mir-624, hsa-mir-181d, hsa-mir-489, hsa-mir-581, hsa-mir-29b-2, hsa-mir-541, hsa-mir-485, hsa-mir-4519, hsa-mir-20b, hsa-mir-486-1, hsa-mir-527

From the above, there are 11 mi-RNA biomarkers common to breast cancer, colorectal cancer, and gastric cancer, which can be interpreted as biomarkers with characteristics shared by the three cancers.

[Table 4] Biomarkers common to the three cancers

Breast cancer, colorectal cancer, and gastric cancer: hsa-mir-143, hsa-mir-148a, hsa-mir-182, hsa-mir-203a, hsa-mir-21, hsa-mir-22, hsa-mir-30a, hsa-mir-30e, hsa-mir-375, hsa-mir-486-1, hsa-mir-486-2

Among the biomarkers common to the three cancers, derived from the data of Example 3 by the analysis method of the present invention, the hsa-mir-486 family is known to be associated with other cancers as well, the hsa-mir-375 family is known to be involved in circulation in the body, and the hsa-mir-30 family is known to be involved in cancer suppression.

That is, it was confirmed that the method of the present invention accurately extracts, as key factors, biomarkers previously identified as cancer-related, and that its conclusions agree with previously known results.

In addition, the individual cancer-specific biomarkers other than those described above can be regarded as novel biomarkers for diagnosing the individual cancers.

[Example 4] Accuracy prediction of biomarkers derived using clinical data

The results of the disease risk prediction performed with the biomarkers derived in the same manner as in Examples 1 and 2 are as follows.

Sensitivity and specificity were each measured with 100-fold cross-validation, so that the results reflect the algorithm in general rather than one particular training set.
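The k-fold procedure can be sketched as below, with k = 100 as in the text. The fold construction (contiguous, near-equal folds over sample indices that are assumed to be shuffled beforehand) and the `evaluate` callback are assumptions for illustration; the source does not describe how the folds were formed.

```python
from typing import Callable, List, Tuple

Fold = Tuple[List[int], List[int]]


def k_fold_indices(n_samples: int, k: int) -> List[Fold]:
    """Split indices 0..n_samples-1 into k (train, test) folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds: List[Fold] = []
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


def cross_validate(n_samples: int, k: int,
                   evaluate: Callable[[List[int], List[int]], float]) -> float:
    """Average a metric (e.g. sensitivity) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 491 samples (the TCGA gastric set size) split into 100 folds.
folds = k_fold_indices(491, 100)
fold_sizes = [len(test) for _, test in folds]
print(len(folds), min(fold_sizes), max(fold_sizes))
```

Each sample appears in exactly one test fold, so every sample contributes to the averaged sensitivity and specificity once.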

The risk prediction algorithm is a convolutional neural network consisting of 7 convolutional layers and 4 fully connected layers.

All convolutional layers use 1-dimensional filters: 20 by 1 in the first layer, 10 by 1 in the second layer, and 3 by 1 in the third and subsequent layers.

'Valid' padding was used.

The fully connected layers have 1024, 512, 256, and 128 nodes, and they are followed by a readout layer with softmax activation that classifies the disease probability.
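With 'valid' padding, each convolution shortens the 1-dimensional feature map, and the sketch below works through the resulting lengths for the seven filter sizes given above. A stride of 1 and the absence of pooling layers are assumptions, since the source does not state them, and the input length of 848 is only an illustrative value (the number of mi-RNA types in the GSE61741 data).

```python
from typing import List, Sequence

KERNEL_SIZES = [20, 10, 3, 3, 3, 3, 3]  # seven convolutional layers, as described
FC_NODES = [1024, 512, 256, 128]        # four fully connected layers, as described


def valid_conv_lengths(input_length: int, kernel_sizes: Sequence[int],
                       stride: int = 1) -> List[int]:
    """Output length after each 1-D convolution with 'valid' (no) padding."""
    lengths = []
    length = input_length
    for kernel in kernel_sizes:
        if length < kernel:
            raise ValueError(f"feature map of length {length} is shorter than filter {kernel}")
        length = (length - kernel) // stride + 1
        lengths.append(length)
    return lengths


def min_input_length(kernel_sizes: Sequence[int]) -> int:
    """Shortest input that survives every layer at stride 1."""
    return sum(kernel - 1 for kernel in kernel_sizes) + 1


lengths = valid_conv_lengths(848, KERNEL_SIZES)
print(lengths, min_input_length(KERNEL_SIZES))
```

Under these assumptions the stack removes 38 positions in total, so the input needs at least 39 positions for the last convolutional layer to produce an output.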

The results are shown in Table 5.

[Table 5] Disease risk prediction results by cancer type

Breast cancer: sensitivity 98.0%, specificity 95.5%
Colorectal cancer: sensitivity 99.3%, specificity 96.0%
Gastric cancer: sensitivity 99.0%, specificity 96.2%

These results show that the disease risk prediction method through complex genetic information relationship analysis according to the present invention not only provides biomarkers with a high accuracy of 95% or more but also achieves a sensitivity and specificity of 95% or more. The present invention can therefore lower test costs and raise the likelihood of commercialization, and it increases the number of cases that can be confirmed with a single test using key genetic information related to multiple diseases. The present invention was completed by showing that the method can serve as a diagnostic technology that meets commercially viable levels of accuracy and economy.

Claims

A method for predicting disease risk through complex genetic information relationship analysis, the method comprising:
extracting complex genetic information from samples of diseased patients and normal persons;
constructing a complex genetic information library by comparing and analyzing the complex genetic information;
deriving a disease state-specific biomarker by applying an optimization method or a learning method to the complex genetic information library; and
constructing a network model for predicting disease risk from the disease state-specific biomarker and predicting the risk,
wherein the complex genetic information library is derived and constructed through statistical analysis or an optimization method, and
wherein, when the complex genetic information library is constructed, the difference between the actual expression level and a reference level is measured for each genetic information factor, and an on/off tag is set that determines whether the genetic information factor affects the selection of disease groups.

The method of claim 1,
wherein the complex genetic information is expression or synthesis information of any one or two or more selected from the group consisting of DNA, RNA, and protein.

(Deleted)

The method of claim 1,
wherein setting the on/off tag comprises:
a) where Th_1, Th_2, and Th_3 are the expression levels at the respective steps of the main gene expression process, defining as variables the increase thresholds (Th_1^up, Th_2^up, Th_3^up) and the decrease thresholds (Th_1^down, Th_2^down, Th_3^down) that apply when the expression level of genetic information increases or decreases due to disease; and
b) using the variables to extract, from the specimen samples, genetic information whose expression level is changed by disease while satisfying each expression level criterion.

The method of claim 5, further comprising:
when extracting the genetic information, obtaining nucleotide sequence information of the genetic information and extracting variants, including single-nucleotide polymorphisms or gene copy number variations, that involve addition, deletion, or substitution in the nucleotide sequence.

The method of claim 1,
wherein biomarkers usable for disease analysis are derived by analyzing, through an optimization method or a learning method, the relationship between disease and the complex genetic information in the complex genetic information library.

The method of claim 1,
wherein a static disease network model is constructed based on the disease state-specific biomarker.

The method of claim 1,
wherein a dynamic disease network model is constructed in the step of constructing a network model for predicting disease risk and predicting the risk.

The method of claim 7,
wherein the optimization method is selected from the group consisting of simulated annealing, a genetic algorithm, tabu search, simulated evolution, and stochastic evolution.

The method of claim 7,
wherein the learning method is selected from the group consisting of a neural network and deep learning.

The method of claim 11,
wherein the neural network is selected from the group consisting of a convolutional neural network (CNN) and a recurrent neural network (RNN).

The method of claim 1,
wherein the network model for predicting disease risk has a sensitivity of 95% or more and a specificity of 90% or more.

A disease state-specific biomarker derived by the method for predicting disease risk through complex genetic information relationship analysis of any one of claims 1, 2 and 5 to 13,
wherein the disease state is breast cancer, colorectal cancer, or gastric cancer,
the breast cancer-specific biomarker is any one or two or more selected from the group consisting of hsa-mir-30d, hsa-mir-145, hsa-mir-425, hsa-mir-203a, hsa-mir-452, hsa-mir-378a, hsa-mir-455, hsa-mir-100, hsa-mir-199b, hsa-mir-205, hsa-mir-542, hsa-mir-532, hsa-mir-625, hsa-mir-200c, hsa-mir-183, hsa-mir-22, hsa-mir-451a, hsa-mir-30a, hsa-mir-30e, hsa-mir-148a, hsa-mir-143, hsa-mir-375, hsa-mir-584, hsa-mir-379, hsa-mir-10a, hsa-mir-182, hsa-mir-21, hsa-mir-486-1, hsa-mir-486-2, and hsa-mir-10b,
the colorectal cancer-specific biomarker is any one or two or more selected from the group consisting of hsa-mir-6086, hsa-mir-3118-1, hsa-mir-1321, hsa-mir-548f-5, hsa-let-7c, hsa-mir-4752, hsa-mir-183, hsa-mir-29a, hsa-mir-30e, hsa-mir-486-1, hsa-mir-194-1, hsa-mir-194-2, hsa-mir-30a, hsa-mir-28, hsa-mir-25, hsa-mir-486-2, hsa-mir-182, hsa-mir-30d, hsa-mir-203a, hsa-mir-10b, hsa-mir-148a, hsa-mir-145, hsa-mir-378a, hsa-mir-143, hsa-mir-22, hsa-mir-10a, hsa-mir-200c, hsa-mir-21, hsa-mir-192, and hsa-mir-375, and
the gastric cancer-specific biomarker is any one or two or more selected from the group consisting of hsa-mir-500b, hsa-mir-496, hsa-mir-2392, hsa-mir-5739, hsa-mir-4540, hsa-mir-6749, hsa-mir-1915, hsa-mir-202, hsa-mir-2467, hsa-mir-27b, hsa-mir-583, hsa-mir-374c, hsa-mir-219b, hsa-mir-299, hsa-mir-142, hsa-mir-30d, hsa-mir-3074, hsa-mir-147b, hsa-mir-5009, hsa-mir-624, hsa-mir-181d, hsa-mir-489, hsa-mir-581, hsa-mir-29b-2, hsa-mir-541, hsa-mir-485, hsa-mir-4519, hsa-mir-20b, hsa-mir-486-1, and hsa-mir-527.