KR20140098561A

KR20140098561A - Method for extracting a single nucleotide polymorphism combination related to a disease, and method and apparatus for predicting a disease using the same

Info

Publication number: KR20140098561A
Application number: KR1020130011376A
Authority: KR
Inventors: 이관수; 강지용
Original assignee: 한국과학기술원
Priority date: 2013-01-31
Filing date: 2013-01-31
Publication date: 2014-08-08
Also published as: KR101483284B1

Abstract

The present invention relates to an apparatus for predicting the risk of disease of a user based on single nucleotide polymorphism (SNP) combination related to the target disease. The apparatus comprises a whole genome analysis-based disease related database saving the relation between the disease and SNP combinations extracted from a disease group and a control group based on the whole genome analysis information thereof; and an SNP combination extraction part filtering the target disease related SNP data from the disease related database in stages according to priority and conducting the following stage of filtering based on error rates of candidate SNP combinations generated from each stage of filtering, and extracting a candidate SNP combination showing the lowest error rate as the target disease related SNP combination.

Description

Method for extracting a disease-related single nucleotide polymorphism combination, and method and apparatus for predicting disease occurrence risk using the same

The present invention relates to a method for extracting a disease-related single nucleotide polymorphism combination, a method for predicting disease occurrence risk, and an apparatus for predicting disease occurrence risk using the same.

A single nucleotide polymorphism (SNP) is a type of genetic variation in which the DNA sequence differs between individuals at a single base. It is a position at which a bi-allelic variation occurs with a frequency of at least 1% in a population.

Recent advances in genome analysis technologies, such as genome-wide association studies and next-generation sequencing, have made it possible to analyze human genomic variants, particularly SNP information.

Recent studies have shown that although each individual SNP generally shows weak disease association, certain combinations of SNPs can show strong disease association. To find the optimal SNP combination for predicting disease occurrence, recent studies have used Bayes factors, logistic regression, hidden Markov models, support vector machines, and random forest learners.

Diabetes is a disease in which hyperglycemia persists and metabolic abnormalities appear because of abnormal action of insulin, the hormone that keeps the blood glucose level constant. There are two types of diabetes: insulin-dependent type 1 diabetes and non-insulin-dependent type 2 diabetes. Type 2 diabetes arises from insulin deficiency caused by β-cell dysfunction and from insulin resistance, in which the amount of insulin is normal but cells cannot use it effectively. Type 2 diabetes develops mainly in middle-aged and older adults, but its incidence among young people has recently been rising because of changes in eating habits and similar factors.

Type 2 diabetes is one of the most common complex diseases worldwide, and it is dangerous because it is accompanied by various complications such as heart disease, eye disease, cardiovascular disease, and diseases of the hands and feet, in addition to metabolic syndrome. Unlike type 1 diabetes, which is partly inherited and triggered by certain infections, type 2 diabetes develops gradually through genetic factors and environmental factors such as lifestyle, and it is therefore known as a representative common complex disease.

Many association studies have sought to link genetic variation with type 2 diabetes, and studies have reported that certain SNPs affect functions directly related to type 2 diabetes, such as insulin secretion and insulin sensitivity. A recent study used a support vector machine algorithm to find SNP combinations associated with type 2 diabetes, but because it was based on a small population and a small number of SNPs, it is limited in its ability to find the optimal SNP combination.

With recent advances in smartphone technology and rising smartphone adoption, a variety of mobile-based services have been developed and deployed, and many mobile applications are being developed in the healthcare field in particular. However, there is currently no service that predicts an individual's risk of developing a specific disease from the individual's genotype and makes that prediction available anytime and anywhere as a mobile-based service.

The problem to be solved by the present invention is to provide a method and an apparatus that extract candidate single nucleotide polymorphism (SNP) combinations capable of predicting the risk of a specific disease through several kinds of filtering, extract from those candidates the SNP combination most strongly related to the disease using a random forest learner, and predict an individual's disease occurrence risk by comparing the disease-related SNP combination with the individual's SNPs.

A disease occurrence risk prediction apparatus according to one embodiment of the present invention predicts a user's disease occurrence risk based on a single nucleotide polymorphism (SNP) combination related to a target disease. The apparatus includes a genome-wide association study (GWAS)-based disease association database that stores the disease association of SNPs extracted from the GWAS information of a disease group and a control group, and an SNP combination extraction unit. The SNP combination extraction unit filters the target-disease-related SNP data taken from the disease association database step by step according to a filtering order, proceeds to the next filtering step based on the error rates of the SNP candidate combinations generated in each step, and extracts the SNP candidate combination with the lowest error rate as the disease-related SNP combination.

The disease occurrence risk prediction apparatus may further include a personal genome data processing unit that extracts, from the user's genome information, personal SNP information corresponding to the disease-related SNP combination, and a disease occurrence risk prediction unit that trains a random forest learner with the disease-related SNP combination and uses the trained learner to calculate the disease occurrence risk for the personal SNP information.

The SNP combination extraction unit includes a plurality of filtering units, each of which filters the target-disease-related SNP data to generate SNP candidate combinations and calculates the error rates of those combinations, and a control unit that controls the filtering units so that the target-disease-related SNP data is filtered step by step and that extracts, from the SNP candidate combinations generated by the filtering units, the combination with the lowest error rate as the disease-related SNP combination. The error rate may be calculated using a random forest learner.

The plurality of filtering units includes at least two of a Bonferroni filtering unit, a p-value rank filtering unit, and a p-value range filtering unit. The filtering order of the Bonferroni filtering unit, the p-value rank filtering unit, and the p-value range filtering unit may be determined based on the range of SNP counts in the SNP candidate combinations that each unit generates.

The control unit may receive, from a first filtering unit among the plurality of filtering units, the error rates of first SNP candidate combinations generated by filtering the target-disease-related SNP data. By determining whether the combination with the most SNPs among the first SNP candidate combinations has the lowest error rate, the control unit may decide whether to request filtering of the target-disease-related SNP data from a second filtering unit, which is the next step after the first filtering unit.

If the combination with the most SNPs has the lowest error rate among the first SNP candidate combinations, the control unit may request the second filtering unit to filter the target-disease-related SNP data.

If the first SNP candidate combinations include a lowest-error combination whose error rate is lower than that of the combination with the most SNPs, the control unit may extract that lowest-error combination as the disease-related SNP combination.

A method according to another embodiment of the present invention, by which a disease occurrence risk prediction apparatus extracts a single nucleotide polymorphism (SNP) combination related to a target disease, includes: receiving target-disease-related GWAS data; generating first SNP candidate combinations by applying first-step filtering to the target-disease-related GWAS data; analyzing the error rates of the first SNP candidate combinations to determine whether the error rate decreases as the number of SNPs in a candidate combination increases; and, if the first SNP candidate combinations include a lowest-error combination whose error rate is lower than that of the combination with the most SNPs, extracting that lowest-error combination as the disease-related SNP combination. The error rate is calculated using a random forest learner.

The SNP combination extraction method may further include: if the combination with the most SNPs has the lowest error rate among the first SNP candidate combinations, generating second SNP candidate combinations by applying second-step filtering to the target-disease-related GWAS data; analyzing the error rates of the second SNP candidate combinations to determine whether the error rate decreases as the number of SNPs increases; and, if the second SNP candidate combinations include a lowest-error combination whose error rate is lower than that of the combination with the most SNPs, extracting that lowest-error combination as the disease-related SNP combination.

The first-step filtering may be Bonferroni filtering, and the second-step filtering may be p-value rank filtering.

Alternatively, the first-step filtering may be p-value rank filtering, and the second-step filtering may be p-value range filtering.

A method according to yet another embodiment of the present invention, by which a disease occurrence risk prediction apparatus predicts a user's disease occurrence risk based on a single nucleotide polymorphism (SNP) combination related to a target disease, includes: receiving a request to predict the user's disease occurrence risk; extracting, from the user's genome information, personal SNP information corresponding to the target-disease-related SNP combination; and calculating, using a random forest learner, the user's disease occurrence risk corresponding to the personal SNP information.

The disease occurrence risk prediction method may further include generating the target-disease-related SNP combination, which indicates disease association, by filtering the target-disease-related SNP data contained in the GWAS information in multiple steps.

The step of calculating the user's disease occurrence risk may use a random forest learner trained with the target-disease-related SNP combination.

According to an embodiment of the present invention, the optimal SNP candidate combinations extracted through filtering such as linkage disequilibrium filtering, Bonferroni filtering, and p-value-based filtering are input to a random forest learner, so SNP combinations with strong disease association can be found accurately while computational complexity is reduced. The disease occurrence risk can therefore be predicted accurately on the basis of an SNP combination with strong disease association. Because the embodiment is based on GWAS data associated with a specific disease, the optimal SNP combination needed to predict the risk of that disease can be extracted. In addition, using a mobile platform such as a smartphone, an individual can immediately check the risk of a specific disease such as type 2 diabetes on the basis of his or her own genome information and clinical information, regardless of place and time.

FIG. 1 is a block diagram of a disease occurrence risk prediction apparatus according to one embodiment of the present invention.
FIG. 2 is a block diagram of a disease-related single nucleotide polymorphism combination extraction unit according to one embodiment of the present invention.
FIG. 3 is a block diagram of the filtering units according to one embodiment of the present invention.
FIG. 4 is a graph showing the error rate of p-value rank filtering according to one embodiment of the present invention.
FIG. 5 is a flowchart of a method by which the single nucleotide polymorphism combination extraction unit extracts a disease-related SNP combination according to one embodiment of the present invention.
FIG. 6 is a flowchart of a disease occurrence risk prediction method according to one embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described here. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram of a disease occurrence risk prediction apparatus according to one embodiment of the present invention.

Referring to FIG. 1, the disease occurrence risk prediction apparatus 100 extracts a single nucleotide polymorphism (SNP) combination from which disease occurrence risk can be predicted. The apparatus 100 predicts a user's disease occurrence risk based on the SNP combination and the user's genome information, and it can provide that risk to the user terminal 10. The target disease may vary; type 2 diabetes is used as the example in this description.

The disease occurrence risk prediction apparatus 100 includes a genome-wide association study (GWAS)-based disease association database 200, a disease-related single nucleotide polymorphism combination extraction unit 300, a personal genome information database 400, a personal genome data processing unit 500, and a disease occurrence risk prediction unit 600.

The GWAS-based disease association database 200 (hereinafter the "disease association database") stores GWAS information for a disease group and a control group. For example, the disease association database 200 collects, from the Wellcome Trust Case Control Consortium (WTCCC) GWAS database, the GWAS information of a plurality of patients with the target disease, for example type 2 diabetes, and of a plurality of healthy controls. From the collected GWAS information, the disease association database 200 excludes SNPs that are difficult to analyze because they have a high proportion of missing values or a low allele frequency, and it extracts the disease association of the remaining SNPs. For example, the disease association database 200 extracts the disease association of the SNPs using the Cochran-Armitage trend test statistic.
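As an illustration of how a per-SNP disease association value could be obtained, the following is a minimal sketch of a Cochran-Armitage trend test on genotype counts. The function name, the additive weights (0, 1, 2), and the two-sided normal approximation are assumptions of this sketch; the specification names the test but does not give these details.

```python
import math


def cochran_armitage_p(cases, controls, weights=(0, 1, 2)):
    """Two-sided p-value of the Cochran-Armitage trend test for one SNP.

    cases / controls: genotype counts in the same order, e.g. (AA, Aa, aa).
    weights: genotype scores; (0, 1, 2) is an additive model, assumed here.
    """
    r1, r2 = sum(cases), sum(controls)
    n = r1 + r2
    if n == 0:
        return 1.0
    cols = [a + b for a, b in zip(cases, controls)]
    # Trend statistic T = sum_i t_i * (N1i * R2 - N2i * R1)
    t_stat = sum(t * (a * r2 - b * r1)
                 for t, a, b in zip(weights, cases, controls))
    # Variance of T under the null hypothesis of no association
    s1 = sum(t * t * c * (n - c) for t, c in zip(weights, cols))
    s2 = sum(weights[i] * weights[j] * cols[i] * cols[j]
             for i in range(len(cols))
             for j in range(i + 1, len(cols)))
    var = (r1 * r2 / n) * (s1 - 2 * s2)
    if var <= 0:
        return 1.0  # degenerate table: no evidence of a trend
    z = t_stat / math.sqrt(var)
    # Two-sided tail probability of the standard normal distribution
    return math.erfc(abs(z) / math.sqrt(2))
```

A smaller p-value indicates a stronger trend between allele dosage and disease status, so the p-values can serve as the ranking key for the later filtering steps.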

The disease-related single nucleotide polymorphism combination extraction unit 300 (hereinafter the "SNP combination extraction unit") fetches the target-disease-related GWAS data, that is, the target-disease-related SNP data, from the disease association database 200. The SNP combination extraction unit 300 filters the GWAS data step by step to extract an SNP combination strongly associated with the target disease. It applies linkage disequilibrium pruning, Bonferroni filtering, p-value rank filtering, and p-value range filtering in stages to generate that combination. In doing so, the SNP combination extraction unit 300 proceeds through the filtering stages while increasing the number of SNPs in the SNP candidate set until it finds the optimal SNP combination.

The SNP combination extraction unit 300 trains a random forest learner using the group of SNP candidate combinations generated by the filtering. Using the variable selection of the random forest learner, the SNP combination extraction unit 300 extracts the SNP combination with the lowest error rate from the group of SNP candidate combinations.

The personal genome information database 400 stores the personal genome information of users. For example, the personal genome information includes, for the various SNPs collected through GWAS or similar analyses, a unique identifier, the position in the genome, and each user's genotype.

The personal genome data processing unit 500 extracts, from the personal genome information stored in the personal genome information database 400, the personal SNP information needed to predict the target disease. At the request of the disease occurrence risk prediction unit 600, the personal genome data processing unit 500 extracts from the user's personal genome information the personal SNP information for the SNPs in the target-disease-related SNP combination optimized for diagnosing the target disease. The personal genome data processing unit 500 generates personal information that includes the personal SNP information, and this personal information is provided to the disease occurrence risk prediction unit 600. The personal information may further include clinical information.
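The extraction step performed by the personal genome data processing unit can be pictured as a lookup of the disease-related SNP identifiers in the user's stored genome record. The sketch below assumes a hypothetical record layout (a dictionary keyed by SNP identifier with `position` and `genotype` fields) and hypothetical identifiers; the specification lists these fields but does not define a storage format.

```python
def extract_personal_snps(personal_genome, disease_snp_ids):
    """Return the user's genotype for each SNP in the disease-related combination.

    personal_genome: dict mapping SNP identifier -> {"position": ..., "genotype": ...}
        (hypothetical layout, assumed for this sketch).
    disease_snp_ids: ordered list of SNP identifiers in the combination.
    SNPs missing from the user's record map to None so the caller can decide
    how to handle them.
    """
    return {
        snp_id: (personal_genome[snp_id]["genotype"]
                 if snp_id in personal_genome else None)
        for snp_id in disease_snp_ids
    }


# Usage with made-up identifiers and genotypes
genome = {
    "snpA": {"position": 1001, "genotype": "AG"},
    "snpB": {"position": 2002, "genotype": "CC"},
    "snpC": {"position": 3003, "genotype": "TT"},
}
personal = extract_personal_snps(genome, ["snpB", "snpA", "snpX"])
```

Keeping the order of `disease_snp_ids` in the result makes it straightforward to build a feature vector in the same column order that the trained learner expects.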

The disease occurrence risk prediction unit 600 obtains the SNP combination related to the target disease from the SNP combination extraction unit 300. It divides the genotypes of the SNPs in that combination into a disease group and a control group, and it applies the two groups to a random forest learner to train the learner for disease occurrence risk prediction. The disease occurrence risk prediction unit 600 then applies the personal SNP information received from the personal genome data processing unit 500 to the trained random forest learner to predict the individual's disease occurrence risk. The disease occurrence risk prediction unit 600 can provide the disease occurrence risk to the user terminal 10.

Conventional random-forest-based GWAS techniques extract SNP candidate combinations either by considering all SNPs or by using SNPs filtered with an arbitrary p-value threshold. Using all SNPs is the most accurate approach, but the computational complexity increases and efficiency suffers. Using SNPs filtered with an arbitrary p-value threshold does not yield the optimal SNP combination, so the accuracy of the target disease prediction suffers.

In contrast, the disease occurrence risk prediction apparatus 100 applies linkage disequilibrium filtering, Bonferroni filtering, p-value rank filtering, and p-value range filtering in stages, increasing the number of SNPs in the SNP candidate combinations while extracting the SNP combination with the lowest error rate. The apparatus 100 can therefore find an SNP combination strongly associated with the target disease both efficiently and accurately.

FIG. 2 is a block diagram of a disease-related single nucleotide polymorphism combination extraction unit according to one embodiment of the present invention.

Referring to FIG. 2, the SNP combination extraction unit 300 includes a GWAS data input unit 310, a control unit 330, and a plurality of filtering units 350, 360, and 370. The filtering units 350 to 370 may implement various filtering methods; for example, they may perform Bonferroni filtering, p-value rank filtering, and p-value range filtering.

The GWAS data input unit 310 receives the target-disease-related GWAS data, that is, the target-disease-related SNP data, from the disease association database 200. The GWAS data input unit 310 may apply linkage disequilibrium filtering to the target-disease-related SNP data and pass the result to the filtering units 350, 360, and 370.
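The specification does not state how the linkage disequilibrium filtering is carried out. One common approach, shown here purely as an assumed sketch, is greedy pruning: SNPs are visited in order of increasing p-value, and an SNP is dropped if its genotype vector is highly correlated (r² at or above a threshold) with an SNP already kept. The 0/1/2 genotype coding, the 0.8 threshold, and the function names are all assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def ld_prune(genotypes, p_values, r2_threshold=0.8):
    """Greedy LD pruning (assumed scheme, not taken from the specification).

    genotypes: dict SNP id -> list of per-sample genotype codes.
    p_values:  dict SNP id -> association p-value.
    Returns the kept SNP ids, most significant first.
    """
    kept = []
    for snp in sorted(genotypes, key=lambda s: p_values[s]):
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_threshold
               for k in kept):
            kept.append(snp)
    return kept
```

Pruning before the p-value-based stages keeps near-duplicate SNPs from occupying several slots in a candidate combination.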

The control unit 330 controls the filtering units 350 to 370 so that the target-disease-related GWAS data is filtered step by step. The control unit 330 may control the filtering control information, including the threshold of each filtering unit 350 to 370.

Based on the relationship between the number of SNPs in a candidate combination and the error rate, the control unit 330 searches step by step for the SNP combination with the lowest error rate while increasing the number of SNPs. That is, the control unit 330 decides whether to perform the next filtering step based on the filtering result received from a given filtering unit. The filtering result includes the error rates of the SNP candidate combinations generated by that filtering unit.

If the filtering result of a given filtering unit shows that the error rate decreases as the number of SNPs increases, the control unit 330 requests error rates for SNP candidate combinations from the filtering unit of the next step, which produces combinations with more SNPs. If the filtering result shows that, as the number of SNPs increases, the error rate decreases and then increases, the control unit 330 selects the SNP candidate combination with the lowest error rate as the SNP combination strongly associated with the disease.
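The control unit's decision rule can be sketched as a loop over the filtering stages. In this sketch each stage is represented only by the (SNP count, error rate) pairs of the candidate combinations it produced; the function name and data layout are hypothetical, and returning the lowest-error candidate seen across all examined stages is an assumption for the case where every stage keeps improving.

```python
def select_snp_combination(stages):
    """Pick a candidate combination following the staged stopping rule.

    stages: filtering stages in order; each stage is a list of
        (snp_count, error_rate) pairs for its candidate combinations.
    Returns (stage_index, snp_count, error_rate), or None if no stages.
    """
    best = None
    for idx, candidates in enumerate(stages):
        ordered = sorted(candidates)            # ascending SNP count
        lowest = min(ordered, key=lambda c: c[1])
        largest = ordered[-1]                   # candidate with the most SNPs
        if best is None or lowest[1] < best[2]:
            best = (idx, lowest[0], lowest[1])
        if lowest[1] < largest[1]:
            # The error rate fell and then rose within this stage: stop here.
            return best
        # The largest candidate is also the best one: try the next stage,
        # which produces combinations with more SNPs.
    return best
```

Later stages are only evaluated when the current stage gives no sign that the error rate has bottomed out, which avoids training the learner on the largest candidate sets unless they are needed.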

Each of the filtering units 350-370 filters the whole-genome analysis data related to the disease to be analyzed with its designated filtering method to generate candidate SNP combinations. Each filtering unit 350-370 calculates the error rates of its candidate SNP combinations, and may do so using the variable selection of a random forest learner.

The filtering units 350-370 differ in the range of the number of SNPs contained in their candidate SNP combinations. The filtering order of the filtering units 350-370 is therefore determined by the range of SNP counts that each unit covers.

FIG. 3 is a block diagram of the filtering units according to an embodiment of the present invention, and FIG. 4 is a graph showing the error rate of p-value rank filtering according to an embodiment of the present invention.

Referring to FIG. 3, the plurality of filtering units 350-370 may implement various filtering methods and may include, for example, a Bonferroni filtering unit 350, a p-value rank filtering unit 360, and a p-value range filtering unit 370.

At the request of the control unit 330, the Bonferroni filtering unit 350 applies Bonferroni filtering to the whole-genome analysis data related to the disease to be analyzed and generates candidate SNP combinations (r, 2r, 5r, 10r), that is, a group of candidate SNP combinations. As shown in Table 1, the Bonferroni filtering unit 350 generates candidate SNP combinations that each contain a fixed number of SNPs. Here, r is the number of SNPs whose p-value lies within the Bonferroni-corrected p-value threshold, which is calculated by dividing 0.05 by the total number of SNPs.

The Bonferroni filtering unit 350 measures the error rates of the candidate SNP combinations generated by Bonferroni filtering. The SNP combination extraction unit 300 may measure the error rate of each group of candidate SNP combinations using the variable selection of a random forest learner. For example, the Bonferroni filtering unit 350 measures the error rate of each candidate SNP combination as shown in Table 1. In that table, the number of SNPs is the number of SNPs that passed Bonferroni filtering, and the number of SNPs in the selected SNP combination is the number of SNPs in the optimal combination, the one with the lowest error rate, extracted from the SNPs selected by Bonferroni filtering. As the Bonferroni filter widens from r to 10r, the number of SNPs that can be considered when selecting the SNP combination also grows.

The Bonferroni filtering unit 350 passes the error rate of each candidate SNP combination to the control unit 330. Because the candidate SNP combination with the most SNPs (10r) has the lowest error rate, the control unit 330 proceeds to the next stage of filtering.
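The Bonferroni stage can be sketched as follows. The threshold (0.05 divided by the total number of SNPs) and the multipliers r, 2r, 5r, 10r come from the description; the assumption that each candidate set holds the top m·r SNPs ordered by ascending p-value is an interpretation, and the function name and the dictionary input are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def bonferroni_candidates(
    p_values: Dict[str, float],
    alpha: float = 0.05,
    multipliers: Sequence[int] = (1, 2, 5, 10),
) -> Tuple[float, int, Dict[int, List[str]]]:
    """Return the corrected threshold, r, and one candidate SNP set per multiplier."""
    if not p_values:
        raise ValueError("p_values must not be empty")
    threshold = alpha / len(p_values)  # Bonferroni-corrected p-value threshold
    ranked = sorted(p_values, key=p_values.get)  # most significant SNP first
    r = sum(1 for p in p_values.values() if p <= threshold)
    # Assumption: the m*r candidate set is the m*r most significant SNPs.
    candidates = {m: ranked[: m * r] for m in multipliers}
    return threshold, r, candidates
```

Each candidate set would then be scored by the error-rate measurement, which is not shown here.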

At the request of the control unit 330, the p-value rank filtering unit 360 applies p-value rank filtering to the whole-genome analysis data related to the disease to be analyzed and generates candidate SNP combinations. The p-value rank filtering unit 360 accumulates SNPs in order of p-value rank, from rank 1 up to a threshold rank (for example, rank 500), and thereby generates 500 candidate SNP combinations.

Referring to FIG. 3, the p-value rank filtering unit 360 measures the error rates of the candidate SNP combinations it generated. The SNP combination extraction unit 300 may measure the error rate of each group of candidate SNP combinations using the variable selection of a random forest learner.

The p-value rank filtering unit 360 passes the error rate of each candidate SNP combination to the control unit 330. Because the candidate SNP combination accumulated up to rank 500 has the lowest error rate, the control unit 330 proceeds to the next stage of filtering.
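The cumulative construction used by p-value rank filtering can be sketched in a few lines of Python. The function name and the dictionary input are hypothetical; the threshold rank of 500 is the example value given in the description.

```python
from typing import Dict, List


def rank_candidates(p_values: Dict[str, float], max_rank: int = 500) -> List[List[str]]:
    """Build one candidate per rank k: the k most significant SNPs, for k = 1..max_rank."""
    ranked = sorted(p_values, key=p_values.get)[:max_rank]  # rank 1 has the smallest p-value
    return [ranked[:k] for k in range(1, len(ranked) + 1)]
```

With at least 500 SNPs in the input this yields 500 nested candidates, so the candidate at rank k always contains the candidate at rank k-1.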

At the request of the control unit 330, the p-value range filtering unit 370 applies p-value range filtering to the whole-genome analysis data related to the disease to be analyzed and generates candidate SNP combinations. The p-value range filtering unit 370 generates the candidates while widening the p-value range. For example, as shown in Table 2, it widens the p-value cutoff in the order 0.01, 0.05, 0.1, 0.2, ..., 1.0 and generates a candidate SNP combination at each cutoff.

The p-value range filtering unit 370 measures the error rates of the candidate SNP combinations it generated. The SNP combination extraction unit 300 may measure the error rate of each group of candidate SNP combinations using the variable selection of a random forest learner.

The p-value range filtering unit 370 passes the error rate of each candidate SNP combination to the control unit 330. The control unit 330 selects the candidate SNP combination in the range with the minimum error rate (<0.6) as the disease-related SNP combination.
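The widening p-value range can be sketched as below. The cutoffs are passed in by the caller, because the description elides the values between 0.2 and 1.0 and this sketch does not fill them in. The strict comparison mirrors the "<0.6" notation in the text and is otherwise an assumption, as are the function name and the input format.

```python
from typing import Dict, List, Sequence


def range_candidates(p_values: Dict[str, float],
                     cutoffs: Sequence[float]) -> Dict[float, List[str]]:
    """Build one candidate per cutoff: all SNPs whose p-value is below that cutoff."""
    return {
        cutoff: [snp for snp, p in p_values.items() if p < cutoff]
        for cutoff in sorted(cutoffs)
    }
```

Because the cutoffs grow, each candidate is a superset of the previous one, which matches the stage-by-stage increase in the number of SNPs.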

FIG. 5 is a flowchart of the method by which the SNP combination extraction unit extracts a disease-related SNP combination according to an embodiment of the present invention.

Referring to FIG. 5, the SNP combination extraction unit 300 performs multiple stages of filtering in sequence. Here it is assumed that the SNP combination extraction unit 300 performs at most three stages of filtering in sequence.

The SNP combination extraction unit 300 receives the whole-genome analysis data related to the disease to be analyzed (S110).

The SNP combination extraction unit 300 applies first-stage filtering to the whole-genome analysis data related to the disease to be analyzed and generates first candidate SNP combinations (S120). The first-stage filtering may be Bonferroni filtering.

The SNP combination extraction unit 300 uses a random forest learner to calculate the error rates of the first candidate SNP combinations generated by the first-stage filtering (S122).

The SNP combination extraction unit 300 analyzes the error rates of the first candidate SNP combinations and determines whether the error rate decreases as the number of SNPs increases (S130).

If the candidate SNP combination with the most SNPs does not have the lowest error rate, the SNP combination extraction unit 300 extracts the candidate with the lowest error rate as the SNP combination related to the disease to be analyzed (S132). That is, because the lowest error rate lies among the first candidate SNP combinations, the SNP combination extraction unit 300 completes the extraction without performing the next stage of filtering.

If the candidate SNP combination with the most SNPs has the lowest error rate, the SNP combination extraction unit 300 applies second-stage filtering to the whole-genome analysis data related to the disease to be analyzed and generates second candidate SNP combinations (S140). The second-stage filtering may be p-value rank filtering, which generates candidate SNP combinations by accumulating SNPs in order of p-value rank from rank 1 up to a threshold rank.

The SNP combination extraction unit 300 uses a random forest learner to calculate the error rates of the second candidate SNP combinations generated by the second-stage filtering (S142).

The SNP combination extraction unit 300 analyzes the error rates of the second candidate SNP combinations and determines whether the error rate decreases as the number of SNPs increases (S150).

If the candidate SNP combination with the most SNPs does not have the lowest error rate, the SNP combination extraction unit 300 extracts the candidate with the lowest error rate as the SNP combination related to the disease to be analyzed (S152). That is, the extraction ends at this stage when the candidate that includes the last-ranked SNP does not have the lowest error rate.

If the candidate SNP combination with the most SNPs has the lowest error rate, the SNP combination extraction unit 300 applies third-stage filtering to the whole-genome analysis data related to the disease to be analyzed and generates third candidate SNP combinations (S160). The third-stage filtering may be p-value range filtering, which generates candidate SNP combinations while widening the p-value range. Because the error rate is still decreasing as the number of SNPs increases, the SNP combination extraction unit 300 examines the result of the third-stage filtering, which covers more SNPs.

The SNP combination extraction unit 300 uses a random forest learner to calculate the error rates of the third candidate SNP combinations generated by the third-stage filtering (S162).

The SNP combination extraction unit 300 analyzes the error rates of the third candidate SNP combinations and extracts the candidate with the lowest error rate as the SNP combination related to the disease to be analyzed (S170).

When the SNP combination extraction unit 300 extracts an SNP combination related to type 2 diabetes, it passes through Bonferroni filtering, p-value rank filtering, and p-value range filtering and finds the candidate SNP combination with the lowest error rate in Table 2. As a result of p-value range filtering, the candidate with the lowest error rate (<0.6) is determined to be the type 2 diabetes-related SNP combination, which consists of 101 SNPs as shown in FIG. 3.

In this way, the SNP combination extraction unit 300 gradually increases the number of SNPs in the candidate SNP combinations as it passes through the filtering stages, and it continues filtering stage by stage until it finds the candidate SNP combination with the lowest error rate.
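The staged flow of steps S110 to S170 can be summarized in a short Python sketch. It is not the implementation of the SNP combination extraction unit 300: the stage callables, the `error_rate` callable (a stand-in for the random forest error measurement), and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Candidate = Sequence[str]
Stage = Callable[[], List[Candidate]]


def extract_combination(stages: Sequence[Stage],
                        error_rate: Callable[[Candidate], float]) -> Candidate:
    """Run the stages in order; stop as soon as the minimum error lies inside a stage."""
    if not stages:
        raise ValueError("at least one filtering stage is required")
    best: Optional[Candidate] = None
    for make_candidates in stages:
        candidates = make_candidates()
        if not candidates:
            raise ValueError("a filtering stage produced no candidates")
        scored = [(len(c), error_rate(c), c) for c in candidates]
        _, best_error, best = min(scored, key=lambda item: item[1])
        _, largest_error, _ = max(scored, key=lambda item: item[0])
        if largest_error > best_error:
            return best  # error rate rose again: the minimum is in this stage
    return best  # last stage reached: take its lowest-error candidate
```

A caller would pass the three filters in order of increasing SNP count, for example Bonferroni, p-value rank, then p-value range, each wrapped as a zero-argument callable.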

FIG. 6 is a flowchart of a method for predicting the risk of developing a disease according to an embodiment of the present invention.

Referring to FIG. 6, the disease risk prediction unit 600 receives a request to predict the user's risk of developing a disease (S210).

The disease risk prediction unit 600 obtains the SNP combination related to the disease to be analyzed from the SNP combination extraction unit 300 (S220).

The disease risk prediction unit 600 trains a random forest learner based on the SNP combination related to the disease to be analyzed (S230). It divides the genotypes of the SNPs included in that combination into a disease group and a control group, and applies the two groups to the random forest learner to train it for disease risk prediction.

The disease risk prediction unit 600 extracts, from the user's genome information, the individual SNP information corresponding to the SNP combination related to the disease to be analyzed (S240). Using the trained random forest learner, the disease risk prediction unit 600 calculates the user's risk of developing the disease from the individual SNP information (S250).

The disease risk prediction unit 600 outputs the user's risk of developing the disease (S260). It may provide this risk to the user terminal 10.
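Steps S240 and S250 can be sketched as follows. The genotype encoding, the function names, and the `trained_model` callable are hypothetical; in the described apparatus the model is a trained random forest learner, which this sketch replaces with an arbitrary callable so that the example stays self-contained.

```python
from typing import Callable, Dict, List, Optional, Sequence

Genotype = Optional[int]  # assumed encoding, e.g. a minor-allele count; None if not genotyped


def extract_individual_snps(genome: Dict[str, int],
                            combination: Sequence[str]) -> List[Genotype]:
    """S240: keep only the user's genotypes for SNPs in the disease-related combination."""
    return [genome.get(snp_id) for snp_id in combination]


def predict_risk(genome: Dict[str, int],
                 combination: Sequence[str],
                 trained_model: Callable[[List[Genotype]], float]) -> float:
    """S250: feed the extracted genotypes, in combination order, to the trained model."""
    return trained_model(extract_individual_snps(genome, combination))
```

The feature order must match the order used during training, which is why the combination, not the user's genome, drives the iteration.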

As described above, according to an embodiment of the present invention, the optimal candidate SNP combinations extracted through the various filters are fed to a random forest learner, so that SNP combinations highly associated with a disease can be found accurately while system complexity is reduced. Accordingly, the risk of developing a disease can be predicted accurately on the basis of the SNP combination highly associated with that disease. Because the embodiment is based on whole-genome analysis data associated with a specific disease, it can extract the optimal SNP combination needed to predict the risk of that disease. In addition, using a mobile platform such as a smartphone, an individual can check the risk of developing a specific disease, such as type 2 diabetes, immediately and regardless of place and time, based on his or her own genome information and clinical information.

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes the functions corresponding to the configuration of the embodiments, or through a recording medium on which that program is recorded.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

An apparatus for predicting a user's risk of developing a disease based on a single nucleotide polymorphism (SNP) combination related to a disease to be analyzed, the apparatus comprising:
a whole-genome-analysis-based disease association database that stores disease associations of SNPs extracted from the whole-genome analysis information of a disease group and a control group; and
an SNP combination extraction unit that filters, in stages according to a filtering order, the SNP data related to the disease to be analyzed retrieved from the disease association database, performs the next stage of filtering based on the error rates of the candidate SNP combinations generated by each stage of filtering, and extracts the candidate SNP combination with the lowest error rate as the SNP combination related to the disease to be analyzed.

The apparatus of claim 1, further comprising:
a personal genome data processing unit that extracts, from the user's genome information, individual SNP information corresponding to the disease-related SNP combination; and
a disease risk prediction unit that trains a random forest learner with the disease-related SNP combination and calculates the risk of developing the disease for the individual SNP information using the trained random forest learner.

The apparatus of claim 1, wherein the SNP combination extraction unit comprises:
a plurality of filtering units that each filter the SNP data related to the disease to be analyzed to generate candidate SNP combinations and calculate the error rates of those candidate SNP combinations; and
a control unit that controls the plurality of filtering units to filter the SNP data related to the disease to be analyzed in stages, and extracts, as the disease-related SNP combination, the candidate SNP combination with the lowest error rate among the candidates generated by the filtering units,
wherein the error rate is calculated using a random forest learner.

The apparatus of claim 3,
wherein the plurality of filtering units include at least two of a Bonferroni filtering unit, a p-value rank filtering unit, and a p-value range filtering unit, and
wherein the filtering order is determined based on the range of the number of SNPs in the candidate SNP combinations generated by each of the Bonferroni filtering unit, the p-value rank filtering unit, and the p-value range filtering unit.

The apparatus of claim 3,
wherein the control unit
receives, from a first filtering unit among the plurality of filtering units, the error rates of first candidate SNP combinations generated by filtering the SNP data related to the disease to be analyzed, determines whether the candidate with the most SNPs among the first candidate SNP combinations has the lowest error rate, and on that basis decides whether to request filtering of the SNP data from a second filtering unit, which is the stage following the first filtering unit.

The apparatus of claim 5,
wherein the control unit
requests the second filtering unit to filter the SNP data related to the disease to be analyzed when, among the first candidate SNP combinations, the candidate with the most SNPs has the lowest error rate.

The apparatus of claim 5,
wherein the control unit,
when the first candidate SNP combinations include a lowest-error candidate whose error rate is lower than that of the candidate with the most SNPs, extracts that lowest-error candidate as the disease-related SNP combination.

A method by which a disease risk prediction apparatus extracts a single nucleotide polymorphism (SNP) combination related to a disease to be analyzed, the method comprising:
receiving whole-genome analysis data related to the disease to be analyzed;
generating first candidate SNP combinations by applying first-stage filtering to the whole-genome analysis data;
analyzing the error rates of the first candidate SNP combinations to determine whether the error rate decreases as the number of SNPs in a candidate SNP combination increases; and
when the first candidate SNP combinations include a lowest-error candidate whose error rate is lower than that of the candidate with the most SNPs, extracting that lowest-error candidate as the disease-related SNP combination,
wherein the error rate is calculated using a random forest learner.

The method of claim 8, further comprising:
generating second candidate SNP combinations by applying second-stage filtering to the whole-genome analysis data related to the disease to be analyzed when, among the first candidate SNP combinations, the candidate with the most SNPs has the lowest error rate;
analyzing the error rates of the second candidate SNP combinations to determine whether the error rate decreases as the number of SNPs increases; and
when the second candidate SNP combinations include a lowest-error candidate whose error rate is lower than that of the candidate with the most SNPs, extracting that lowest-error candidate as the disease-related SNP combination.

The method of claim 8,
wherein the first-stage filtering is Bonferroni filtering and the second-stage filtering is p-value rank filtering.

The method of claim 8,
wherein the first-stage filtering is p-value rank filtering and the second-stage filtering is p-value range filtering.

A method for predicting a user's risk of developing a disease based on a single nucleotide polymorphism (SNP) combination related to a disease to be analyzed, the method comprising:
receiving a request to predict the user's risk of developing the disease;
extracting, from the user's genome information, individual SNP information corresponding to the SNP combination related to the disease to be analyzed; and
calculating the user's risk of developing the disease from the individual SNP information using a random forest learner.

The method of claim 12, further comprising:
filtering, in a plurality of stages, the disease-related SNP data included in whole-genome analysis information to generate the disease-related SNP combination that indicates association with the disease.

The method of claim 13,
wherein calculating the user's risk of developing the disease
uses a random forest learner trained with the disease-related SNP combination.