KR20200084807A

KR20200084807A - Apparatus and method for predicting chronic kidney disease

Info

Publication number: KR20200084807A
Application number: KR1020200000929A
Authority: KR
Inventors: 박수경; 김종효; 안서경; 김경식; 문성지; 홍유진
Original assignee: 서울대학교산학협력단
Priority date: 2019-01-03
Filing date: 2020-01-03
Publication date: 2020-07-13
Also published as: KR102316403B1

Abstract

The present invention relates to a device for predicting chronic kidney disease. The device for predicting chronic kidney disease comprises: a genome marker selection unit that selects genome markers related to a plurality of diseases; an integrated genome index calculation unit that calculates an integrated genome index using the genome markers related to the plurality of diseases selected by the genome marker selection unit; an integrated risk factor building unit that derives factors related to the plurality of diseases and builds an integrated risk factor model; a disease occurrence prediction model generation unit that constructs a disease occurrence prediction model by inputting the integrated genome index and the integrated risk factor model; and a disease prediction unit that predicts the occurrence of chronic kidney disease of a subject by inputting new disease prediction data into the disease occurrence prediction model. Accordingly, high-risk groups in clinical trials can be identified and managed.

Description

Apparatus and method for predicting chronic kidney disease occurrence{APPARATUS AND METHOD FOR PREDICTING CHRONIC KIDNEY DISEASE}

The present application relates to an apparatus and method for predicting the occurrence of chronic kidney disease.

Breast cancer is a representative disease for which health risk prediction tools have been implemented and interventions for high-risk groups are actively carried out. The breast cancer risk assessment models developed in Western countries fall broadly into three types.

The first is a model that predicts the absolute probability of occurrence in the general population from the combination of baseline risk and the joint risk of risk factors. The second predicts the probability of occurrence from the relative risk of each risk factor. The third is a model specialized for predicting hereditary breast cancer: based on family history, it estimates the probability of carrying a BRCA gene mutation, or predicts the probability of developing breast cancer from that carrier probability.

In Korea, the Korean Academy of Family Medicine has developed a Korean health risk prediction tool. The National Health Insurance Service applies this tool to provide a personalized health management program service, through its website <건강iN> (Health iN), to citizens who have received its health checkups.

However, existing disease risk models predict the risk of newly occurring disease on the basis of healthy subjects without disease, which limits their application to the general population. In addition, they mainly present a risk probability for a single outcome, the occurrence of or death from one disease, so that they predict only one disease despite requiring many input variables. Above all, these existing techniques are ill-suited to managing an individual's health status, which changes periodically, by identifying and correcting health risk factors.

Accordingly, there is a need for health status management and personalized prevention through an integrated, precise disease occurrence prediction model that identifies the future risk of hypertension, diabetes, their comorbidities, and chronic kidney disease. Such a model should take into account the timing of factor exposure, the temporal continuity of states that may change afterward, and the natural history of disease. Starting from the individual's genetic characteristics, it should consider demographic indicators such as age and sex, sociological factors, family history of disease, environmental and behavioral factors to which exposure may change later, obesity-related measurements and blood markers of blood abnormalities that change with those behavioral factors, and the disease and pre-disease states that result from changes across all of these stages, in light of the biological continuity of disease.

The background art of the present application is disclosed in Korean Patent Registration No. 10-0931300.

The present application is intended to solve the problems of the prior art described above. Its object is to provide an apparatus and method for predicting the occurrence of chronic kidney disease that build an integrated occurrence model predicting the future risk of three conditions: hypertension, diabetes and their comorbidities, and chronic kidney disease. The model considers risk factors in temporal order, starting from the genetic characteristics an individual has from birth and proceeding through changes in environmental factors and the resulting changes in blood markers, and it reflects how these factors change from moment to moment.

However, the technical problems to be solved by the embodiments of the present application are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical object, an apparatus for predicting the occurrence of chronic kidney disease according to an embodiment of the present application may include: a genome marker selection unit that selects genome markers related to a plurality of diseases; an integrated genome index calculation unit that calculates an integrated genome index using the genome markers selected by the genome marker selection unit; an integrated risk factor building unit that derives factors related to the plurality of diseases and builds an integrated risk factor model; a disease occurrence prediction model generation unit that builds a disease occurrence prediction model using the integrated genome index and the integrated risk factor model as inputs; and a disease prediction unit that predicts the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.

In addition, the genome marker selection unit may select a first genome marker related to a first disease, verify the selected first genome marker in each of an urban cohort and a rural cohort, and determine it as the first genome marker. Likewise, it may select a second genome marker related to a second disease, verify the selected second genome marker in each of the urban cohort and the rural cohort, and determine it as the second genome marker.
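The two-cohort verification described above can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes that "verified" means the association reaches significance in both the urban and the rural cohort, and the function name `validate_markers`, the threshold `alpha`, and all marker IDs and p-values are hypothetical.

```python
def validate_markers(candidates, urban_p, rural_p, alpha=0.05):
    """Keep candidate markers whose association replicates in both cohorts.

    candidates: iterable of marker IDs from the selection step.
    urban_p / rural_p: dict mapping marker ID -> p-value in that cohort.
    alpha: hypothetical significance threshold (not given in the text).
    """
    validated = []
    for marker in candidates:
        # A marker absent from a cohort is treated as not replicated.
        if urban_p.get(marker, 1.0) < alpha and rural_p.get(marker, 1.0) < alpha:
            validated.append(marker)
    return validated


# Made-up marker IDs and p-values, for illustration only.
hypertension_candidates = ["snp_a", "snp_b", "snp_c"]
urban = {"snp_a": 0.001, "snp_b": 0.20, "snp_c": 0.03}
rural = {"snp_a": 0.010, "snp_b": 0.01}

print(validate_markers(hypertension_candidates, urban, rural))
```

The same function would be called once per disease, for example once for hypertension candidates and once for diabetes candidates.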

In addition, the genome marker selection unit may select single nucleotide polymorphism (SNP) markers associated with the plurality of diseases based on a regression analysis algorithm, and may derive core gene information in consideration of at least one of the SNP markers, the first genome marker, and the second genome marker.

In addition, the integrated risk factor building unit may derive the factors related to the plurality of diseases and build the integrated risk factor model in consideration of at least one of: basic demographic factors, sociological factors, disease and family history factors, environmental and behavioral factors, obesity-related indicator factors, indicator factors for recognizing blood abnormalities, and baseline disease morbidity factors.

In addition, the apparatus for predicting the occurrence of chronic kidney disease may further include an explanatory variable derivation unit that derives, as explanatory variables, data of subjects who have at least one of the plurality of diseases.

In addition, the disease occurrence prediction model generation unit may generate a disease occurrence prediction model that takes as inputs a plurality of state variables, including living-state variables and health-state variables of patients with chronic kidney disease, the core gene information, and the disease risk of chronic kidney disease, and that learns the degree of relationship between at least one of the state variables and the core gene information on the one hand and the disease risk of chronic kidney disease on the other.

In addition, the disease occurrence prediction model may learn the degree of relationship between at least one of the plurality of state variables and the gene information and the disease risk of chronic kidney disease through two learning stages. In a first learning, a first state variable among the plurality of state variables and the previous-time hidden layer form the input layer, a second state variable or a current-time state variable among the plurality of state variables forms the hidden layer, and the degree of relationship between the input layer and the hidden layer is learned. In a second learning, the hidden layer and the gene information form the input layer, the disease risk forms the output layer, and the degree of relationship between the hidden layer and the output layer is learned. The first learning learns the degree of relationship between the input layer and the hidden layer based on [Equation 1]:

[Equation 1]

$$h_t = f\left(W_1 h_{t-1} + W_2 x_t\right)$$

Here, $h_t$ is the hidden layer at time t, $W_1$ is a first weight indicating the degree of a first type of relationship between the input layer and the hidden layer, $h_{t-1}$ is the previous-time hidden layer, $W_2$ is a second weight indicating the degree of a second type of relationship between the input layer and the hidden layer, and $x_t$ may be the first state variable at time t. The activation function, written here as $f$, is not specified in this passage.
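As a rough illustration of the first learning, the sketch below computes one recurrent update in which the hidden layer at time t depends on the previous-time hidden layer (through a first weight) and on the state variable at time t (through a second weight). It assumes a tanh activation and a zero initial hidden layer, neither of which is stated in the text; the names `hidden_step`, `run_sequence`, `w_hidden`, and `w_input` are hypothetical.

```python
import math


def hidden_step(h_prev, x_t, w_hidden, w_input):
    """One recurrent update: tanh(w_hidden @ h_prev + w_input @ x_t).

    h_prev: previous-time hidden layer (list of floats).
    x_t: state variables observed at time t (list of floats).
    w_hidden: first weight, one row per hidden unit (hidden x hidden).
    w_input: second weight, one row per hidden unit (hidden x inputs).
    The tanh activation is an assumption made for this sketch.
    """
    h_t = []
    for i in range(len(w_hidden)):
        total = sum(w_hidden[i][j] * h_prev[j] for j in range(len(h_prev)))
        total += sum(w_input[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(math.tanh(total))
    return h_t


def run_sequence(xs, w_hidden, w_input):
    """Fold a time series of state variables into a final hidden layer."""
    h = [0.0] * len(w_hidden)  # assumed zero initial hidden layer
    for x_t in xs:
        h = hidden_step(h, x_t, w_hidden, w_input)
    return h


# Two hidden units, two state variables, two follow-up waves (made-up values).
w_hidden = [[0.5, 0.0], [0.0, 0.5]]
w_input = [[1.0, 0.0], [0.0, 1.0]]
waves = [[0.2, -0.1], [0.4, 0.3]]
print(run_sequence(waves, w_hidden, w_input))
```

Each list in `waves` would correspond to one repeated measurement of a subject, so the final hidden layer summarizes the subject's trajectory across follow-up waves.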

In addition, the second learning learns the degree of relationship between the hidden layer and the output layer based on [Equation 1] and [Equation 2]:

[Equation 2]

$$y = g\left(W_3 h_t + W_4 z\right)$$

Here, $y$ is the output layer, $W_3$ is a third weight indicating the degree of relationship between the hidden layer and the output layer, $h_t$ is the hidden layer at time t, $W_4$ is a fourth weight indicating the degree of relationship between the gene information in the input layer and the output layer, and $z$ may be the gene information in the input layer. The output function, written here as $g$, is not specified in this passage.
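The second learning combines the hidden layer with the gene information to produce the disease risk. A minimal sketch of that output step follows. It assumes a logistic (sigmoid) output so that the result can be read as a probability, which the text does not state; `predict_risk`, `w_hidden_out`, and `w_gene_out` are hypothetical names.

```python
import math


def predict_risk(h_t, z, w_hidden_out, w_gene_out):
    """Disease risk from the hidden layer and gene information.

    h_t: hidden layer at time t (list of floats).
    z: gene information, e.g. genome index values (list of floats).
    w_hidden_out: third weight, one value per hidden unit.
    w_gene_out: fourth weight, one value per gene feature.
    The sigmoid output is an assumption made for this sketch.
    """
    total = sum(w * h for w, h in zip(w_hidden_out, h_t))
    total += sum(w * g for w, g in zip(w_gene_out, z))
    return 1.0 / (1.0 + math.exp(-total))


# Made-up values: two hidden units and one genome index.
print(predict_risk([0.3, -0.2], [1.5], [0.8, 0.4], [0.6]))
```

Feeding the gene information directly into the output step, rather than into the recurrent step, reflects that it does not change across follow-up waves.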

In addition, the disease occurrence prediction model generation unit may update the weights according to the error that arises when generating the machine learning model that learns the degree of relationship between at least one of the plurality of state variables and the gene information and the disease risk of chronic kidney disease, based on [Equation 3]:

[Equation 3]

$$E = \ell(t, y) + \lambda \lVert W \rVert_2^2$$

Here, $E$ is the detected value of the error of the disease occurrence prediction model generation unit, $t$ is whether chronic kidney disease occurred, $y$ is the disease risk predicted by the machine learning model, and $\lambda \lVert W \rVert_2^2$ may be an L2 regularization term for preventing overfitting to the error. The form of the data-fit term, written here as $\ell(t, y)$, and the value of the coefficient $\lambda$ are not specified in this passage.
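The error with an L2 penalty can be illustrated as below. This sketch assumes binary cross-entropy as the mismatch between the observed outcome and the predicted risk; the text names only the L2 term, so the loss form, the coefficient `lam`, and the function name `regularized_error` are all assumptions.

```python
import math


def regularized_error(t, y, weights, lam):
    """Error between outcome t (0 or 1) and predicted risk y, plus an L2 penalty.

    weights: flat list of all model weights being regularized.
    lam: hypothetical regularization coefficient.
    Binary cross-entropy is assumed as the data-fit term.
    """
    eps = 1e-12  # keep log() away from 0 when y is exactly 0 or 1
    y = min(max(y, eps), 1.0 - eps)
    data_fit = -(t * math.log(y) + (1 - t) * math.log(1.0 - y))
    penalty = lam * sum(w * w for w in weights)
    return data_fit + penalty


# A confident wrong prediction costs more than a confident right one.
print(regularized_error(1, 0.9, [0.5, -0.5], 0.01))
print(regularized_error(1, 0.1, [0.5, -0.5], 0.01))
```

Larger weights raise the penalty regardless of the prediction, which is how the L2 term discourages overfitting.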

In addition, the disease prediction unit may provide disease prevention and management information linked to the subject's chronic kidney disease occurrence prediction result.

According to an embodiment of the present application, a method for predicting the occurrence of chronic kidney disease may include: selecting genome markers related to a plurality of diseases; calculating an integrated genome index using the selected genome markers related to the plurality of diseases; deriving factors related to the plurality of diseases and building an integrated risk factor model; building a disease occurrence prediction model using the integrated genome index and the integrated risk factor model as inputs; and predicting the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.

The means for solving the problem described above are merely exemplary and should not be construed as limiting the present application. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to the means for solving the problem described above, subjects with a high probability of developing hypertension, diabetes, and chronic kidney disease, which is a multimorbid form of these, can be identified on the basis of the individual's genetic characteristics, changes in environmental factors, and the resulting changes in clinical values. Preventive measures can then be offered to these high-probability subjects through early diagnosis, and patients who already have the disease can be treated in advance, before it worsens further (to cardiovascular disease or death), as secondary prevention. This can reduce the risk of complications such as cardiovascular disease, chronic kidney disease, metabolic disorders, and neurological disease, and thereby improve quality of life.

According to the means for solving the problem described above, information is presented on how disease risk varies with the risk factors for hypertension, diabetes, and chronic kidney disease. Individuals can then control these risk factors through lifestyle modification and confirm the resulting reduction in disease risk, which can encourage active self-management of health.

According to the means for solving the problem described above, using a model built on machine learning provides high predictive power, which makes it possible to identify and manage high-risk groups in actual community general populations or in clinical trials.

According to the means for solving the problem described above, predicting and preventing in advance metabolic diseases such as hypertension, diabetes, and chronic kidney disease can help lower the risk of multimorbidity, which raises the risk of death.

According to the means for solving the problem described above, the algorithm can be used in disease risk prediction models or in web and app products aimed at personal health management services.

However, the effects obtainable from the present application are not limited to those described above, and other effects may exist.

FIG. 1 is a schematic configuration diagram of an apparatus for predicting the occurrence of chronic kidney disease according to an embodiment of the present application.
FIG. 2 is a schematic block diagram of the chronic kidney disease apparatus according to an embodiment of the present application.
FIG. 3 is a diagram showing a list of information that can be used as reference genome information for expanding the genetic variation information of the chronic kidney disease apparatus according to an embodiment of the present application.
FIG. 4 is a diagram for explaining significant genome markers associated with hypertension according to an embodiment of the present application.
FIG. 5 is a diagram for explaining significant genome markers associated with diabetes according to an embodiment of the present application.
FIG. 6 is a diagram for explaining the result of building the integrated genome score for hypertension according to an embodiment of the present application.
FIG. 7 is a diagram for explaining the result of building the integrated genome score for diabetes according to an embodiment of the present application.
FIG. 8 is a diagram showing the variables included sequentially in the prediction model in consideration of the timing of factor exposure, the temporal continuity of states that may change afterward, and the natural history of disease, according to an embodiment of the present application.
FIG. 9 is a diagram for explaining the structure of a deep learning model that integrates time series data and genetic data according to an embodiment of the present application.
FIG. 10 is a schematic diagram of a disease occurrence prediction model according to an embodiment of the present application.
FIG. 11 is a schematic diagram of a random forest prediction model according to an embodiment of the present application.
FIG. 12 is a diagram schematically showing the process of training models for various cases with an ensemble technique according to an embodiment of the present application.
FIG. 13 is a diagram comparing the predictive performance of statistical and machine learning models for the risk of developing hypertension according to an embodiment of the present application.
FIG. 14 is a diagram comparing the predictive performance of statistical and machine learning models for the risk of developing diabetes according to an embodiment of the present application.
FIG. 15 is a diagram comparing the predictive performance of statistical and machine learning models for the risk of developing chronic kidney disease according to an embodiment of the present application.
FIG. 16 is an operation flowchart of a method for predicting the occurrence of chronic kidney disease according to an embodiment of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can readily practice them. However, the present application may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout this specification, when a part is said to be "connected" to another part, this includes not only the case of being "directly connected" but also the case of being "electrically connected" or "indirectly connected" with another element in between.

Throughout this specification, when a member is said to be located "on", "over", "at the top of", "under", "below", or "at the bottom of" another member, this includes not only the case where the member is in contact with the other member but also the case where yet another member exists between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present application relates to a method for an integrated disease occurrence prediction model that identifies the future risk of three conditions: hypertension, diabetes and their comorbidities, and chronic kidney disease. The model takes into account the timing of factor exposure, the temporal continuity of states that may change afterward, and the natural history of disease. Starting from the individual's genetic characteristics, it considers demographic indicators such as age and sex, sociological factors, family history of disease, environmental and behavioral factors to which exposure may change later, obesity-related measurements and blood markers of blood abnormalities that change with those behavioral factors, and the disease and pre-disease states that result from changes across all of these stages, in light of the biological continuity of disease.

FIG. 1 is a schematic configuration diagram of an apparatus for predicting the occurrence of chronic kidney disease according to an embodiment of the present application.

Referring to FIG. 1, the apparatus (10) for predicting the occurrence of chronic kidney disease and the disease prediction server (20) may be linked through a network. For example, the disease prediction server (20) may hold the genomic data source of the Ansan-Anseong cohort, which is part of the Korean Genome and Epidemiology Study of the Korea Centers for Disease Control and Prevention, together with the follow-up data from the first through seventh waves. The disease prediction server (20) may provide the information in these genomic and follow-up data sources to the apparatus (10) over the network.

The apparatus (10) for predicting the occurrence of chronic kidney disease may include any kind of server, terminal, or device that exchanges data, content, and various communication signals with the disease prediction server (20) over a network and that has data storage and processing functions.

The apparatus (10) for predicting the occurrence of chronic kidney disease is a device linked to the disease prediction server (20) through a network. It may be, for example, any kind of wireless communication device, such as a smartphone, smart pad, tablet PC, or wearable device, or a PCS (Personal Communication System), GSM (Global System for Mobile communication), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (W-Code Division Multiple Access), or Wibro (Wireless Broadband Internet) terminal, or a fixed terminal such as a desktop computer or smart TV.

Examples of the network for sharing information between the apparatus (10) for predicting the occurrence of chronic kidney disease and the disease prediction server (20) include, but are not limited to, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5G network, a WIMAX (World Interoperability for Microwave Access) network, the wired or wireless Internet, a LAN (Local Area Network), a Wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, a Wifi network, an NFC (Near Field Communication) network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

The method for predicting the occurrence of chronic kidney disease described below may be performed by the apparatus (10) for predicting the occurrence of chronic kidney disease. As another example, each step of the method may be performed by the disease prediction server (20). As yet another example, some steps of the method may be performed by the apparatus (10) and the remaining steps by the disease prediction server (20). For example, the apparatus (10) may perform only the functions of receiving user input, transmitting the received user input to the server, and displaying on a screen the information sent back by the server in response, while the remaining steps of the method are performed by the disease prediction server (20). For convenience of description, the following describes an example in which the method is performed by the apparatus (10).

According to an embodiment of the present application, the chronic kidney disease occurrence prediction apparatus 10 may predict the occurrence of chronic kidney disease by receiving, from the disease prediction server 20, cohort data of the Korea Centers for Disease Control and Prevention (KCDC). The cohort data were collected from Koreans and comprise genomic information together with information gathered in a total of six repeated measurements from phase 1 to phase 7: demographic and sociological factors, environmental ecological factors, body measurements for obesity-related factors, major blood marker information, and current disease morbidity status.

In addition, the apparatus 10 may predict the occurrence of chronic kidney disease using KCDC cohort data that surveyed Koreans for genomic information, lifestyle, obesity-related body measurements, major blood marker information, current disease status, and the like. The apparatus 10 may build the model on subjects of the KCDC Ansan-Anseong cohort and validate the built model in three cohorts: an urban-based cohort, a rural-based cohort, and the urban-rural integrated cohort that combines the two.

In addition, for genome marker analysis, the apparatus 10 may combine prevalent and newly incident cases of hypertension, diabetes, and chronic kidney disease in the whole population and classify the prevalent/incident subjects as the patient group. The apparatus 10 may classify subjects who are neither prevalent nor incident cases as the control group and select the genome markers associated with each of hypertension and diabetes. The apparatus 10 may then construct an 'integrated genomic score' by combining the patient group and the control group. In particular, whereas other existing studies derived hypertension and diabetes markers from microarray-based genotyping data, the apparatus 10 extends the genomic information to data imputed on the basis of the latest version of 1K genome 3.0 and applies the extended information to the genome marker analysis.

In addition, for hypertension, diabetes, the diabetes-hypertension comorbid state (hypertension and diabetes frequently co-occur because of their biological and pathological mechanisms), and chronic kidney disease (hypertension and diabetes are the two largest of the four major causative diseases of chronic kidney disease), the apparatus 10 may identify the factors related to the prevalence and to the incidence of each disease and combine them all to construct 'integrated risk factors' covering the environmental and behavioral factors and their changes over time.

In addition, although they cannot be regarded as risk factors of disease, the general population includes patients with hypertension, pre-hypertension, diabetes, and pre-diabetes, and these people have a higher risk of developing other diseases than people in a normal state (for example, pre-hypertension carries a higher risk of diabetes and chronic kidney disease than the normal state, and hypertension carries a still higher risk than the normal or pre-disease state). To reflect this state of the general population directly in the risk prediction model, the apparatus 10 may add a 'baseline disease/pre-disease status indicator' (an indicator of prevalent hypertension, diabetes, and chronic kidney disease and of prevalent pre-disease states) to the explanatory variables.

In addition, the apparatus 10 defines the outcomes of the risk prediction model as the occurrence of hypertension, of diabetes, of the comorbid state of hypertension and diabetes, and of chronic kidney disease. As the explanatory variables of the risk prediction model, the apparatus 10 may construct an 'integrated risk factor panel' by combining the 'baseline integrated genomic score', the 'repeated-measurement integrated risk factors', and the 'baseline disease/pre-disease status indicator'.

In addition, the apparatus 10 may build the risk prediction model by applying a statistics-based time-varying proportional hazards model and machine learning methods such as neural networks and random forests. Through repeated training and comparison of the models' predictive power, the apparatus 10 may select the random forest model, which has the highest predictive power, as the final 'integrated risk prediction model'.
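The comparison of predictive power across candidate models can be sketched as follows. This is a minimal illustration only: the text does not state which measure of predictive power is used, so the area under the ROC curve (AUC) is an assumption here, and the function names `auc` and `select_best_model` as well as the model names in the usage example are hypothetical.

```python
from typing import Dict, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count 0.5).

    Assumption: AUC stands in for the unspecified 'predictive power' measure.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def select_best_model(labels: Sequence[int],
                      predictions: Dict[str, Sequence[float]]) -> str:
    """Return the name of the candidate model with the highest AUC."""
    aucs = {name: auc(labels, scores) for name, scores in predictions.items()}
    return max(aucs, key=aucs.get)
```

For example, `select_best_model(y_true, {"cox": p1, "neural_net": p2, "random_forest": p3})` returns the key whose predicted risks rank cases above non-cases most often on the supplied validation data.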

In addition, when including explanatory variables for the outcome variables in the machine learning model, the apparatus 10 may take into account 'the timing of exposure to a factor, the temporal continuity of states that may subsequently change, and the natural history of the disease'. After all explanatory variables have been included, the outcome variables may likewise be generated so that they are included according to the natural history and the biological continuity of the disease.

In addition, the apparatus 10 may predict whether chronic kidney disease will occur by following this flow: factors to which a person may be exposed from birth (genome) -> demographic indicators (age and sex) -> sociological factors (education level, income level, marital status) -> medical history and family history factors -> behavioral factors to which a person may be exposed and which may change afterwards -> obesity-related measurements that may change because of the behavioral factors -> blood markers that detect hematological abnormalities that may change because of these -> prevalence of disease and of pre-disease states resulting from all of these changes -> occurrence of hypertension or diabetes alone -> occurrence of comorbid hypertension-diabetes -> occurrence of chronic kidney disease.

FIG. 2 is a schematic block diagram of a chronic kidney disease apparatus according to an embodiment of the present application.

Referring to FIG. 2, the chronic kidney disease apparatus 10 may include a genome marker selection unit 11, an integrated genome index calculation unit 12, an integrated risk factor building unit 13, an explanatory variable derivation unit 14, a disease occurrence prediction model generation unit 15, and a disease prediction unit 16. However, the configuration of the chronic kidney disease apparatus 10 is not limited thereto.

According to an embodiment of the present application, the genome marker selection unit 11 may select genome markers associated with a plurality of diseases. For example, the plurality of diseases may include hypertension, diabetes, the diabetes-hypertension comorbid state, and chronic kidney disease. Illustratively, to select genome markers, which are factors to which a person may be exposed from birth, the genome marker selection unit 11 may choose as marker selection subjects those members of the Ansan-Anseong cohort who have genomic data and for whom hypertension and diabetes can be defined because information is available on blood pressure, use of blood pressure medication, fasting blood glucose, glycated hemoglobin, and use of diabetes medication.

In addition, the genome marker selection unit 11 may select first genome markers associated with a first disease and validate the selected first genome markers in each of the urban cohort and the rural cohort to determine the first genome markers. For example, the first disease may be hypertension. Illustratively, the genome marker selection unit 11 may receive the Ansan-Anseong cohort information from the disease prediction server 20. The genome marker selection unit 11 may select the first genome markers associated with the first disease (for example, hypertension-related genome markers) in the Ansan-Anseong cohort and validate them in each of the urban cohort and the rural cohort to select the final hypertension-related genome markers. Illustratively, of the 10,030 subjects in the Ansan-Anseong cohort, the genome marker selection unit 11 may use the 8,840 subjects who have genomic data and for whom hypertension can be defined (information on blood pressure and on use of blood pressure medication is available) to select the first genome markers associated with the first disease. The genome marker selection unit 11 may also combine baseline data and follow-up data. The genome marker selection unit 11 may define a subject as belonging to the hypertension patient group if systolic blood pressure is at or above a first blood pressure, diastolic blood pressure is at or above a second blood pressure, or the subject takes blood pressure medication. Subjects who are never classified as prevalent or incident cases may be set as the hypertension control group. In other words, combining baseline and follow-up data, the genome marker selection unit 11 may define subjects with systolic blood pressure (SBP) of 130 mmHg or higher, diastolic blood pressure (DBP) of 90 mmHg or higher, or use of blood pressure medication as the hypertension patient group, and set subjects who are never classified as prevalent or incident cases as the control group. For example, the genome marker selection unit 11 may derive 7,444 subjects in the Ansan-Anseong cohort, 858 subjects in the urban cohort, and 332 subjects in the rural cohort for genome marker selection. However, the present application is not limited thereto, and various embodiments may exist.
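The case and control definition above can be sketched as a small classifier over a subject's baseline and follow-up visits. The thresholds (SBP 130 mmHg, DBP 90 mmHg, medication use) come from the text; the `Visit` record, the function names, and the choice to reject subjects with no visits are assumptions made for illustration.

```python
from typing import List, NamedTuple

SBP_THRESHOLD_MMHG = 130  # from the text
DBP_THRESHOLD_MMHG = 90   # from the text


class Visit(NamedTuple):
    """One baseline or follow-up examination (hypothetical record layout)."""
    sbp: float
    dbp: float
    on_bp_medication: bool


def is_hypertensive(visit: Visit) -> bool:
    """True if the visit meets any of the three criteria in the text."""
    return (visit.sbp >= SBP_THRESHOLD_MMHG
            or visit.dbp >= DBP_THRESHOLD_MMHG
            or visit.on_bp_medication)


def hypertension_group(visits: List[Visit]) -> str:
    """'patient' if any visit is hypertensive, else 'control'.

    Combining baseline and follow-up data means a subject counts as a
    control only if no visit ever meets the definition.
    """
    if not visits:
        raise ValueError("hypertension cannot be defined without any visit")
    return "patient" if any(is_hypertensive(v) for v in visits) else "control"
```

A subject who is normotensive at baseline but crosses a threshold at a later phase is therefore placed in the patient group, which mirrors the pooling of prevalent and incident cases.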

In addition, the genome marker selection unit 11 may select second genome markers associated with a second disease and validate the selected second genome markers in each of the urban cohort and the rural cohort to determine the second genome markers. Illustratively, the genome marker selection unit 11 may receive the Ansan-Anseong cohort information from the disease prediction server 20. The genome marker selection unit 11 may select the second genome markers associated with the second disease (for example, diabetes-related genome markers) in the Ansan-Anseong cohort and validate them in each of the urban cohort and the rural cohort to select the final diabetes-related genome markers. Of the 10,030 subjects in the Ansan-Anseong cohort, the genome marker selection unit 11 may use the 8,831 subjects who have genomic data and for whom diabetes can be defined (fasting blood glucose, HbA1c, use of diabetes medication) to select the second genome markers associated with the second disease. The genome marker selection unit 11 may define diabetes, across phases 1 to 7, as fasting blood glucose at or above a first blood glucose level, glycated hemoglobin at or above a first glycated hemoglobin level, or use of diabetes medication. In other words, the genome marker selection unit 11 may define diabetes, across phases 1 to 7, as fasting blood glucose of 126 or higher, glycated hemoglobin of 6.5% or higher, or use of diabetes medication. Combining baseline and follow-up data, the genome marker selection unit 11 may define subjects who meet any of these criteria as the diabetes patient group and set subjects who are never classified as prevalent or incident cases as the control group. For example, the genome marker selection unit 11 may finally derive 7,444 subjects in the Ansan-Anseong cohort, 858 subjects in the urban cohort, and 332 subjects in the rural cohort for genome marker selection. However, the present application is not limited thereto, and various embodiments may exist.

According to an embodiment of the present application, the genome marker selection unit 11 may expand genetic information through genotype imputation of the genotyping data. Illustratively, the genomic information of the Ansan-Anseong population was obtained with the Affymetrix 6.0 array, a genome chip that comprised up to about 600,000 SNPs at the time. However, the chip does not provide information on every SNP. It probes SNPs at regular intervals, so information is hard to obtain for SNPs that are rare or show no heterogeneity in the Korean population. Variation could be confirmed in only about 300,000 to 400,000 of the 600,000 SNPs, and genome markers can be sought only among those. To compensate for this limitation of the genome chip, the genome marker selection unit 11 may expand the genetic variation information by applying genotype imputation (hereinafter, 'genetic variation information expansion'). This approach can bring in tens of millions of genetic variants or more on the basis of the genomic information of thousands of people or more, and it can be used to expand genetic variation information, check variant frequencies, and so on. As a result, the genome marker selection unit 11 may expand the several hundred thousand variants on the genome chip to about 80 million or more, which is the level of whole-genome sequencing (WGS). Genetic variation information expansion is an analysis method that compares reference genome information with the variant information of the genome chip and statistically estimates variants that are absent from the chip but present in the reference genome information.

FIG. 3 shows a list of information that can be used as reference genome information for the genetic variation information expansion of a chronic kidney disease apparatus according to an embodiment of the present application.

Illustratively, FIG. 3 lists the information that can be used as reference genome information for genetic variation information expansion. The genome marker selection unit 11 may obtain the reference genome information by applying the East Asian results of the 1K Genome project, shown at the top of the list.

For example, the genome marker selection unit 11 may expand the genetic variation information on the basis of the Ansan-Anseong cohort subjects who have genomic information. Before imputation, the genome marker selection unit 11 may perform QC on the raw data with the 'Plink' program (MAF: 0.01, missing rate per sample: 0.05, missing rate per SNP: 0.02, Hardy-Weinberg: < 10^-6). The genome marker selection unit 11 may then change the annotation to 'HG19' and remove SNPs that cannot be annotated. For pre-phasing, the genome marker selection unit 11 may use 'Shapeit2, 1000 Genome Phase 3 East Asian population', and the imputation itself may be performed with 'impute2'. For file conversion, the genome marker selection unit 11 may use 'Qctool v2' in Plink version 2.0. The genome marker selection unit 11 may filter the imputation results using probability 0.9, completion rate 0.98, and info (r2) 0.7 or higher as criteria. After the final imputation, the original genotyping data may be expanded to a total of 31,563,540 SNPs. Whereas other existing studies derived hypertension and diabetes markers from genotyping data (Affymetrix 6.0 array genotyping), the genome marker selection unit 11 expands the information by imputation based on the latest version of 1K genome 3.0 and uses the expanded information for genome marker analysis.

According to an embodiment of the present application, the genome marker selection unit 11 may select single nucleotide polymorphism (SNP) markers associated with the plurality of diseases on the basis of a regression analysis algorithm, and may derive core genetic information by considering at least one of the SNP markers, the first genome markers, and the second genome markers. For example, the genome marker selection unit 11 may use logistic regression to calculate the risk that a SNP marker by itself poses for a disease (crude odds ratio [crude OR]). Because sex and age are the variables with the greatest simultaneous effect on both the disease and the other risk factors, the genome marker selection unit 11 may also calculate the disease risk of the SNP marker adjusted for these two variables (age-sex adjusted OR).
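As a worked illustration of a crude odds ratio, the sketch below computes the OR and a Wald 95% confidence interval from a 2x2 table. It relies on the fact that a logistic regression with a single binary predictor and no covariates gives the same OR as the 2x2 table. The binary carrier/non-carrier coding is a simplifying assumption (the text later assumes an additive effect), the function name is hypothetical, and the age-sex adjusted OR would require fitting a multivariable logistic model, which is not shown.

```python
import math
from typing import Tuple


def crude_odds_ratio(case_exposed: int, case_unexposed: int,
                     control_exposed: int, control_unexposed: int,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Return (OR, lower 95% limit, upper 95% limit) from a 2x2 table.

    'Exposed' here means carrying the allele of interest (an assumed
    binary coding used only for this illustration).
    """
    cells = (case_exposed, case_unexposed, control_exposed, control_unexposed)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive")
    odds_ratio = (case_exposed * control_unexposed) / (case_unexposed * control_exposed)
    # Standard error of log(OR) (Woolf's method).
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

The three returned values correspond to the kind of OR, L95, and U95 figures that association results typically report for each SNP.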

Illustratively, the genome marker selection unit 11 may perform detailed SNP-level quality control on the input data. The genome marker selection unit 11 may remove 11,313,891 SNPs under the genotyping missingness criterion (genotyping missing < 5%), 183 SNPs under the Hardy-Weinberg equilibrium criterion (HWE p < 1E-06), and 17,950,690 SNPs under the minor allele frequency criterion (MAF < 1%). The genome marker selection unit 11 may thereby derive a final set of 2,298,777 SNPs. Subject-level missingness and similar checks need no further processing because the KCDC has already performed that QC in advance.
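The SNP-level quality control can be sketched as a simple filter over per-SNP summary statistics. The three thresholds come from the text. The reading that SNPs are kept when missingness is below 5%, the HWE p-value is at least 1E-06, and MAF is at least 1% is an interpretation, and the `SnpStats` record, the handling of values exactly at a threshold, and the SNP identifiers in the usage example are hypothetical.

```python
from typing import Iterable, List, NamedTuple

MAX_MISSING_RATE = 0.05  # keep SNPs with genotyping missingness below 5%
MIN_HWE_P = 1e-6         # drop SNPs with HWE p < 1E-06
MIN_MAF = 0.01           # drop SNPs with MAF < 1%


class SnpStats(NamedTuple):
    """Per-SNP summary statistics (hypothetical record layout)."""
    snp_id: str
    missing_rate: float
    hwe_p: float
    maf: float


def passes_qc(snp: SnpStats) -> bool:
    """True if the SNP survives all three filters."""
    return (snp.missing_rate < MAX_MISSING_RATE
            and snp.hwe_p >= MIN_HWE_P
            and snp.maf >= MIN_MAF)


def filter_snps(snps: Iterable[SnpStats]) -> List[str]:
    """Return the identifiers of the SNPs that pass QC, in input order."""
    return [s.snp_id for s in snps if passes_qc(s)]
```

For instance, `filter_snps([SnpStats("snp_a", 0.01, 0.5, 0.20), SnpStats("snp_b", 0.10, 0.5, 0.20)])` keeps only `snp_a`, because `snp_b` exceeds the missingness limit.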

In addition, when a Q-Q plot and lambda indicate between-group heterogeneity that needs to be corrected, the genome marker selection unit 11 may apply a correction based on the population. For the statistical significance of a SNP with respect to the phenotype, the genome marker selection unit 11 may use < 1 x 10^-6 as the criterion and select SNPs below this threshold. To visualize the analysis results, the genome marker selection unit 11 may generate a Manhattan plot in which the p-value of each SNP is laid out in order of chromosome and physical position. Under the assumption of an additive effect for each SNP, the genome marker selection unit 11 performs the Cochran-Armitage test (1 df) to calculate raw p-values, and the results can be checked in the Manhattan plot. In other words, by providing the user terminal with the Manhattan plot, the genome marker selection unit 11 gives the user a visualized analysis result. The genome marker selection unit 11 also uses logistic regression to calculate the risk that a SNP marker by itself poses for a disease (crude odds ratio [crude OR]). Because sex and age are the variables with the greatest simultaneous effect on both the disease and the other risk factors, it may also calculate the disease risk of the SNP marker adjusted for these two variables (age-sex adjusted OR).
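The Cochran-Armitage trend test with one degree of freedom can be sketched from genotype counts in cases and controls. The additive weights (0, 1, 2) follow the additive-effect assumption in the text. The function name and the counts in the usage example are hypothetical, and the p-value uses the identity that the upper tail of a chi-square distribution with 1 df equals `erfc(sqrt(x / 2))`.

```python
import math
from typing import Sequence, Tuple


def cochran_armitage(cases: Sequence[int], controls: Sequence[int],
                     weights: Sequence[float] = (0, 1, 2)) -> Tuple[float, float]:
    """Return (chi-square statistic with 1 df, p-value) for a trend test.

    cases[i] and controls[i] are the counts of subjects carrying i copies
    of the tested allele; weights encode the additive model.
    """
    if not (len(cases) == len(controls) == len(weights)):
        raise ValueError("cases, controls and weights must have equal length")
    per_genotype = [r + s for r, s in zip(cases, controls)]
    total = sum(per_genotype)
    total_cases = sum(cases)
    total_controls = total - total_cases
    sum_tr = sum(t * r for t, r in zip(weights, cases))
    sum_tn = sum(t * n for t, n in zip(weights, per_genotype))
    sum_t2n = sum(t * t * n for t, n in zip(weights, per_genotype))
    numerator = total * sum_tr - total_cases * sum_tn
    denominator = total_cases * total_controls * (total * sum_t2n - sum_tn ** 2)
    if denominator == 0:
        raise ValueError("test undefined: no variation in outcome or genotype")
    chi2 = total * numerator ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Calling `cochran_armitage([10, 20, 30], [30, 20, 10])` tests whether the proportion of cases rises with the number of allele copies; the resulting p-value is the quantity that would be compared with the significance threshold and plotted in the Manhattan plot.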

In other words, the genome marker selection unit 11 may select genomic indicators that are markers of disease and that can also be used for disease risk prediction. For each of the plurality of diseases, the genome marker selection unit 11 may select the SNP markers that are significantly associated with the disease in both the crude and the age-sex adjusted results. Among the selected markers, the genome marker selection unit 11 may choose as genetic markers those that increase disease risk (SNPs with an odds ratio (OR) ≥ 1). Diabetes is a common comorbidity in patients with hypertension, and hypertension is likewise a common comorbidity in patients with diabetes. Hypertension and diabetes affect the same target organs through the vascular network, and they are causative diseases of chronic kidney disease, cardio-cerebrovascular disease, renal failure, heart failure, eye disease, and the like, sharply increasing the incidence and mortality of these conditions.
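The selection rule can be sketched as an intersection of the crude and age-sex adjusted results followed by a direction filter. The significance threshold is passed in as a parameter rather than fixed. Requiring OR ≥ 1 in both models is an assumption (the text does not say which model's OR is checked), and the `Assoc` record, the function name, and the SNP identifiers in the usage example are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Assoc(NamedTuple):
    """Association result for one SNP in one model (hypothetical layout)."""
    odds_ratio: float
    p_value: float


def select_markers(crude: Dict[str, Assoc], adjusted: Dict[str, Assoc],
                   p_threshold: float) -> List[str]:
    """SNPs significant in both models whose OR is at least 1 in both."""
    selected = []
    for snp_id, crude_result in crude.items():
        adjusted_result = adjusted.get(snp_id)
        if adjusted_result is None:
            continue  # not present in the adjusted results
        significant = (crude_result.p_value < p_threshold
                       and adjusted_result.p_value < p_threshold)
        risk_increasing = (crude_result.odds_ratio >= 1.0
                           and adjusted_result.odds_ratio >= 1.0)
        if significant and risk_increasing:
            selected.append(snp_id)
    return sorted(selected)
```

With `crude = {"snp_a": Assoc(1.3, 1e-8), "snp_b": Assoc(0.8, 1e-9)}` and `adjusted = {"snp_a": Assoc(1.25, 5e-8), "snp_b": Assoc(0.82, 1e-8)}`, only `snp_a` is returned at a threshold of `1e-6`, because `snp_b` is protective rather than risk-increasing.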

Therefore, the genome marker selection unit 11 may integrate the genomic SNP markers of the first disease (hypertension) and the second disease (diabetes) and apply them to hypertension, diabetes, the hypertension-diabetes comorbid state, and chronic kidney disease in order to predict the occurrence of chronic kidney disease. On the basis of this biological plausibility, the genome marker selection unit 11 may integrate the selected SNP markers to construct the final genome marker panel.

FIG. 4 illustrates significant genome markers associated with hypertension according to an embodiment of the present application. FIG. 4(a) is the crude model, a logistic regression with no adjustment. FIG. 4(b) is the age-sex adjusted model, which additionally adjusts for sex and age. The genome marker selection unit 11 may check the hypertension-related genetic indicators from the crude model and from the age-sex adjusted model using a Q-Q plot and a Manhattan plot.

In addition, the genome marker selection unit 11 may finally select, among the SNPs common to both the crude results and the age-sex adjusted results, those with an odds ratio (OR) of 1 or higher. The four SNPs finally selected are shown in Table 1. Table 1 lists the finally selected hypertension SNPs.

SNP | Gene | CHR | POS | MAF | Alleles | OR | L95 | U95 | P
rs35639474 |  | 6 |  | 0.928329 | -/T | 1.35641 | 1.19034 | 1.54564 | 4.77E-06
rs57808874 | STOML3 | 13 |  | 0.756094 | -/G | 1.21593 | 1.11934 | 1.32085 | 3.66E-06
rs7040217 | RAD23B | 9 |  | 0.822863 | A/C | 1.19135 | 1.10399 | 1.28563 | 6.61E-06
rs7986278 |  | 13 |  | 0.960859 | A/G | 1.56804 | 1.32225 | 1.85952 | 2.33E-07

FIG. 5 illustrates significant genome markers associated with diabetes according to an embodiment of the present application. FIG. 5(a) is the crude model, a logistic regression with no adjustment. FIG. 5(b) is the age-sex adjusted model, which additionally adjusts for sex and age. The genome marker selection unit 11 may identify the diabetes-related genetic markers in the crude model and in the age-sex adjusted model. In addition, the genome marker selection unit 11 may finally select, among the SNPs common to both the crude results and the age-sex adjusted results, those with an odds ratio (OR) of 1 or higher and p-value < 10e-6. The 28 SNPs finally selected are shown in Table 2. Table 2 lists the finally selected diabetes SNPs.

| SNP | Gene | CHR | POS | MAF | Alleles | OR | L95 | U95 | P |
|---|---|---|---|---|---|---|---|---|---|
| rs10536170 | CDKAL1 | 6 | 2.1E+07 | 0.535241 | -/TATAT | 1.2966 | 1.19898 | 1.40217 | 7.84E-11 |
| rs138420022 | CDKAL1 | 6 | 2.1E+07 | 0.532877 | -/T | 1.29703 | 1.19891 | 1.40318 | 9.18E-11 |
| rs9356744 | CDKAL1 | 6 | 2.1E+07 | 0.522241 | C/T | 1.29027 | 1.19407 | 1.39424 | 1.15E-10 |
| rs34499031 | CDKAL1 | 6 | 2.1E+07 | 0.525247 | -/AA | 1.28926 | 1.19316 | 1.3931 | 1.29E-10 |
| rs9356748 | CDKAL1 | 6 | 2.1E+07 | 0.521774 | A/T | 1.26924 | 1.17589 | 1.37001 | 9.55E-10 |
| rs6906327 | CDKAL1 | 6 | 2.1E+07 | 0.528224 | A/G | 1.2584 | 1.16749 | 1.3564 | 1.88E-09 |
| rs9358356 | CDKAL1 | 6 | 2.1E+07 | 0.525268 | C/T | 1.259 | 1.16703 | 1.35822 | 2.67E-09 |
| rs9295474 | CDKAL1 | 6 | 2.1E+07 | 0.52291 | C/G | 1.2532 | 1.16206 | 1.35148 | 4.66E-09 |
| rs9348440 | CDKAL1 | 6 | 2.1E+07 | 0.532331 | C/T | 1.24481 | 1.15518 | 1.34141 | 9.28E-09 |
| rs9356743 | CDKAL1 | 6 | 2.1E+07 | 0.509704 | C/T | 1.24565 | 1.15059 | 1.34856 | 5.85E-08 |
| rs56099357 | CDKAL1 | 6 | 2.1E+07 | 0.558728 | -/T | 1.23076 | 1.14152 | 1.32698 | 6.43E-08 |
| rs4515379 | CDKAL1 | 6 | 2.1E+07 | 0.563085 | C/G | 1.22286 | 1.13425 | 1.3184 | 1.59E-07 |
| rs80151164 | | 8 | 6.1E+07 | 0.934038 | A/C | 1.51167 | 1.27822 | 1.78775 | 1.38E-06 |
| rs77039794 | | 6 | 2.1E+07 | 0.556653 | A/G | 1.21547 | 1.12272 | 1.31589 | 1.45E-06 |
| rs75897742 | | 8 | 6.1E+07 | 0.940407 | A/G | 1.54327 | 1.29299 | 1.842 | 1.54E-06 |
| rs2328529 | CDKAL1 | 6 | 2.1E+07 | 0.586855 | A/C | 1.19771 | 1.11127 | 1.29087 | 2.35E-06 |
| rs17193507 | LOC107987429 | 6 | 3.1E+07 | 0.960873 | G/T | 1.58192 | 1.30641 | 1.91553 | 2.63E-06 |
| rs75925737 | | 8 | 6.1E+07 | 0.938786 | A/G | 1.52515 | 1.27831 | 1.81966 | 2.79E-06 |
| rs116381594 | | 8 | 6.1E+07 | 0.939492 | C/G | 1.52499 | 1.27806 | 1.81963 | 2.84E-06 |
| rs114462576 | | 8 | 6.1E+07 | 0.94065 | G/T | 1.52641 | 1.27809 | 1.82297 | 3.03E-06 |
| rs115952070 | | 8 | 6.1E+07 | 0.940802 | C/T | 1.52735 | 1.27851 | 1.82462 | 3.04E-06 |
| rs116468467 | | 8 | 6.1E+07 | 0.940623 | C/T | 1.52611 | 1.27792 | 1.82252 | 3.04E-06 |
| rs114961690 | | 8 | 6.1E+07 | 0.941057 | A/G | 1.52852 | 1.27901 | 1.82671 | 3.07E-06 |
| rs2932331 | | 5 | 3863434 | 0.015545 | A/G | 2.55321 | 1.71879 | 3.79272 | 3.44E-06 |
| rs11187007 | IDE | 10 | 9.4E+07 | 0.326168 | A/G | 1.20482 | 1.11315 | 1.30405 | 3.94E-06 |
| rs191296992 | LPP | 3 | 1.9E+08 | 0.98986 | A/G | 2.98624 | 1.87495 | 4.75618 | 4.09E-06 |
| rs116198656 | | 5 | 3867721 | 0.985373 | A/G | 2.58996 | 1.72272 | 3.8938 | 4.77E-06 |
| rs115880178 | | 8 | 6.1E+07 | 0.938456 | C/T | 1.50468 | 1.26266 | 1.79311 | 4.96E-06 |
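The selection rule described for the diabetes SNPs can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the genome marker selection unit 11: the record layout `SnpResult`, the function name `select_snps`, and the toy values are hypothetical, and the sketch assumes that the thresholds (OR of 1 or more, p-value below 10e-6) must be met in both the crude and the age-sex adjusted model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpResult:
    """One SNP's association result from a single logistic model (hypothetical layout)."""
    snp: str
    odds_ratio: float
    p_value: float


def select_snps(crude, adjusted, or_min=1.0, p_max=10e-6):
    """Keep SNPs that pass both thresholds in the crude AND the adjusted model."""
    def passing(results):
        return {r.snp for r in results
                if r.odds_ratio >= or_min and r.p_value < p_max}

    # Intersection: the SNP must qualify in both models (an assumption of this sketch).
    return sorted(passing(crude) & passing(adjusted))


# Toy inputs with made-up numbers; they are not taken from Table 1 or Table 2.
crude = [SnpResult("rsA", 1.30, 1e-9), SnpResult("rsB", 0.80, 1e-8),
         SnpResult("rsC", 1.20, 2e-6)]
adjusted = [SnpResult("rsA", 1.28, 3e-9), SnpResult("rsB", 0.82, 2e-8),
            SnpResult("rsC", 1.15, 5e-4)]

selected = select_snps(crude, adjusted)
print(selected)
```

In the toy data, `rsB` is dropped because its odds ratio is below 1, and `rsC` is dropped because it loses significance after adjustment.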

According to an embodiment of the present application, the integrated genome index calculation unit 12 may calculate an integrated genome index using the genome markers associated with the plurality of diseases selected by the genome marker selection unit 11. For example, the integrated genome marker panel built by the integrated genome index calculation unit 12 contains dozens of SNP markers. The share of disease attributable to genes is generally regarded as 5-7% or less in most cases, so if many SNPs enter the model simultaneously, the genetic influence on disease prediction becomes larger than the environmental influence, which contradicts biological plausibility. To solve this problem, the integrated genome index calculation unit 12 may combine the markers into a single index in the form of a genomic score. The genomic score (polygenic risk score) may be calculated by multiplying the beta value of each SNP by the individual's probability value of that SNP for the disease, summing the products over all SNPs, and dividing the sum by the total number of SNPs.

Here, X_j is the genomic score (polygenic risk score), OR_i is the beta value of each SNP, SNP_ij is each individual's probability value of each SNP for the disease, and m is the total number of SNPs.

[Equation 4]

$$X_j = \frac{1}{m}\sum_{i=1}^{m} OR_i \times SNP_{ij}$$
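The score described in the text (multiply each SNP's beta value by the individual's value for that SNP, sum, and divide by the number of SNPs) can be sketched in a few lines. This is an illustration only: the function name `polygenic_risk_score` and the toy numbers are assumptions, and the variable names merely mirror the symbols OR_i, SNP_ij, and m.

```python
def polygenic_risk_score(betas, snp_values):
    """Return X_j = (1/m) * sum_i OR_i * SNP_ij for one individual j."""
    if len(betas) != len(snp_values):
        raise ValueError("betas and snp_values must have the same length")
    m = len(betas)
    if m == 0:
        raise ValueError("at least one SNP is required")
    return sum(b * s for b, s in zip(betas, snp_values)) / m


# Toy example: three SNPs with made-up weights and per-person values.
betas = [0.2, 0.4, 0.6]    # OR_i in the text's notation (the beta value of each SNP)
person = [1.0, 0.5, 0.0]   # SNP_ij for one individual j
score = polygenic_risk_score(betas, person)
print(round(score, 4))
```

Dividing by m keeps scores comparable between panels of different sizes, such as the 4-SNP hypertension panel and the 28-SNP diabetes panel.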

According to an embodiment of the present application, the integrated risk factor building unit 13 may derive factors associated with a plurality of diseases and build an integrated risk factor model. The plurality of diseases may include hypertension, diabetes, comorbid diabetes and hypertension, and chronic kidney disease. The integrated risk factor building unit 13 may derive the factors related to the occurrence of each disease and combine the derived factors with environmental factors to build an integrated risk factor model for these conditions.

FIG. 6 is a view for explaining the result of building the integrated genomic score for hypertension according to an embodiment of the present application.

For example, referring to FIG. 6, when the integrated risk factor building unit 13 calculates the genomic score from the finally selected SNP markers, the distribution of the hypertension PRS is as shown in FIG. 6.

The basic values of the hypertension polygenic risk score in the integrated risk factor building unit 13 are shown in Table 3.

| Minimum | Median | Mean | Maximum |
|---|---|---|---|
| 0.00 | 0.05240 | 0.06465 | 0.34822 |

FIG. 7 is a view for explaining the result of building the integrated genomic score for diabetes according to an embodiment of the present application. For example, referring to FIG. 7, when the integrated risk factor building unit 13 calculates the genomic score from the finally selected SNP markers, the distribution of the diabetes PRS is as shown in FIG. 7.

The basic values of the diabetes polygenic risk score in the integrated risk factor building unit 13 are shown in Table 4.

| Minimum | Median | Mean | Maximum |
|---|---|---|---|
| 0.04813 | 0.1903 | 0.1996 | 0.57534 |

According to an embodiment of the present application, the integrated risk factor building unit 13 may derive the factors associated with the plurality of diseases and build the integrated risk factor model by considering at least one of basic demographic factors, sociological factors, disease and family history factors, environmental behavior factors, obesity-related index factors, blood abnormality index factors, and baseline disease morbidity factors. For example, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes variables corresponding to sex and age as the basic demographic factors. In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes variables corresponding to education level and income level as the sociological factors. In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes, as the disease and family history factor, a variable corresponding to a cardiovascular disease family history score. For each of three family histories (hypertension, diabetes, and heart disease), a value of 1 is assigned if the family history is present and 0 if it is absent; the score is then 0 if none of the family histories is present and 1 if one or more are present.
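The family history score described above reduces three 0/1 flags to a single 0/1 value. A minimal sketch, assuming the flags are already coded as 0 or 1 (the function name `any_history_score` is hypothetical):

```python
def any_history_score(*flags):
    """Return 1 if at least one of the 0/1 history flags is set, else 0."""
    for flag in flags:
        if flag not in (0, 1):
            raise ValueError("each flag must be 0 or 1")
    return 1 if any(flags) else 0


# Flag order: hypertension, diabetes, heart disease family history (toy values).
print(any_history_score(0, 0, 0))
print(any_history_score(0, 1, 0))
print(any_history_score(1, 1, 1))
```

Because the score only records whether any history is present, a subject with one affected relative category and a subject with all three receive the same value.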

In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes variables corresponding to alcohol drinking status, regular exercise status, and cigarette smoking status as the environmental behavior factors. In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes, as the obesity-related index factors, variables corresponding to body mass index, which is an obesity index, and waist circumference, which is an abdominal obesity index. In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes, as indices for recognizing blood abnormalities, variables corresponding to TG, HDL-cholesterol, and total cholesterol, which are dyslipidemia indices, and ALBUMIN, which is a liver abnormality index.

In addition, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes, as the baseline disease morbidity status (blood abnormality, blood pressure abnormality, and disease history), variables corresponding to a cardiovascular disease history score, hypertension, prehypertension, diabetes, prediabetes, and prevalent chronic kidney disease. For the cardiovascular disease history score, each of three disease histories (diagnosis of congestive heart failure, diagnosis of coronary artery disease, and diagnosis of cerebrovascular disease such as stroke or palsy) is coded 1 if present and 0 if absent; the score is then 0 if none is present and 1 if one or more are present.

Meanwhile, the integrated risk factor building unit 13 may build the integrated risk factor model so that it includes the integrated risk factors and the integrated genome index (genomic score). The integrated genome index (genomic score) may be the result calculated by the integrated genome index calculation unit 12. The integrated genome index calculation unit 12 may calculate the genomic score so that it includes all of the genome indices calculated for hypertension and diabetes.

In addition, the integrated risk factor building unit 13 may build an integrated risk factor model for each of the plurality of diseases. The integrated risk factor building unit 13 may generate an integrated risk factor model that predicts the occurrence of hypertension by having the genomic score calculated from only the genome indices associated with hypertension. Likewise, it may generate an integrated risk factor model that predicts the occurrence of diabetes by having the genomic score calculated from only the genome indices associated with diabetes. For the integrated risk factor models for comorbid diabetes and hypertension and for chronic kidney disease, it may have the genomic score calculated from all of the genome indices.

According to an embodiment of the present application, the explanatory variable derivation unit 14 may derive, as explanatory variables, data of subjects who have at least one of the plurality of diseases. In other words, the explanatory variable derivation unit 14 may derive the explanatory variables to be included in the disease occurrence prediction model. The explanatory variables may include risk factors associated with the prevalence of hypertension and diabetes in the whole population and risk factors associated with the occurrence of hypertension and diabetes. For comorbid hypertension and diabetes, it was assumed that combining the risk factors of the two diseases covers the comorbid state, so no separate risk factors were derived.

Although chronic kidney disease is caused by hypertension and diabetes, the risk factors of those two diseases alone are judged insufficient to cover all of its risk factors, so the explanatory variable derivation unit 14 may select the risk factors associated with each disease. This is because kidney diseases such as polycystic kidney disease and glomerulonephosis are further causes of chronic kidney disease.

In addition, because prevalent cases in the baseline population amounted to less than 5% of the whole population, statistical power was insufficient and prevalence was not analyzed; instead, because cases occurring during the follow-up period amounted to about 20% of the whole population, the explanatory variable derivation unit 14 may analyze and include the risk factors for occurrence. For factors other than genetic factors, namely environmental factors and other changes in the state of the human body, the explanatory variable derivation unit 14 may build the integrated risk factor model by combining the risk factors for the prevalence and the occurrence of hypertension and diabetes with the risk factors for the occurrence of chronic kidney disease. For example, with respect to genetic factors, the genome marker selection unit 11 expanded the original genotyping information by imputation based on 1k genome information and selected SNP markers associated with hypertension and diabetes, and the integrated genome index calculation unit 12 may convert the multiple markers into a genomic score (polygenic risk score, PRS) so as to form a single genome index. The explanatory variable derivation unit 14 may build the integrated risk factor model using this score as an index. The variables included in the final model therefore consist of the genome index (genomic score) and the integrated risk factors, and together they may be referred to as the integrated risk factor model.

According to an embodiment of the present application, the general population includes patients with hypertension, people in a pre-disease state of hypertension, patients with diabetes, and people in a pre-disease state of diabetes. Although these states cannot be regarded as risk factors for disease in themselves, people in these states have a higher risk of developing other diseases than people in a normal state. For example, people in the pre-disease state of hypertension have a higher risk of developing diabetes and chronic kidney disease than people in a normal state, and patients with hypertension have a higher risk still than people in a normal or pre-disease state. To apply these states of the general population directly to the risk prediction model as explanatory variables, the explanatory variable derivation unit 14 may derive a 'baseline disease and pre-disease status index' (an index evaluating prevalent hypertension, diabetes, and chronic kidney disease and their pre-disease states).

According to an embodiment of the present application, the integrated risk factor building unit 13 may apply, as the outcome variables of the disease occurrence prediction model, the occurrence of hypertension, diabetes, comorbid hypertension and diabetes, and chronic kidney disease. In addition, the integrated risk factor building unit 13 may build the integrated risk factor model by combining the genome index calculated by the integrated genome index calculation unit 12, the integrated risk factors, and the baseline disease and pre-disease status index derived by the explanatory variable derivation unit 14.

In addition, the integrated risk factor building unit 13 may select indices that reflect environmental exposures after birth, changes in those exposures, and biological changes within the human body. From the full set of variables, the integrated risk factor building unit 13 may exclude variables unrelated to disease prediction, such as personal identifiers and survey dates, variables directly tied to the disease outcome, and variables whose missing-value ratio exceeds 20%. The integrated risk factor building unit 13 may then apply three variable selection methods (forward, backward, and stepwise) in a multinomial logistic regression model for the prevalence of hypertension and diabetes, and the same three methods in a Cox proportional hazards model for the occurrence of hypertension, diabetes, and chronic kidney disease, and may select the variables chosen by at least two of the three procedures.
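The screening logic in this paragraph (drop variables with more than 20% missing values, then keep variables chosen by at least two of the three selection procedures) can be sketched as below. The regression fitting itself is not shown; the three lists stand in for the output of the forward, backward, and stepwise procedures, and all names and numbers are illustrative assumptions.

```python
from collections import Counter


def drop_high_missing(missing_rate, max_missing=0.20):
    """Keep variables whose missing-value ratio does not exceed max_missing."""
    return {v for v, rate in missing_rate.items() if rate <= max_missing}


def consensus_variables(selections, min_votes=2):
    """Keep variables chosen by at least min_votes of the selection procedures."""
    votes = Counter(v for chosen in selections for v in set(chosen))
    return sorted(v for v, n in votes.items() if n >= min_votes)


# Toy inputs: variable names and missing ratios are made up for illustration.
missing_rate = {"age": 0.00, "bmi": 0.05, "tg": 0.10, "diet": 0.35}
candidates = drop_high_missing(missing_rate)

forward = ["age", "bmi", "diet"]   # stand-in for forward selection output
backward = ["age", "tg"]           # stand-in for backward elimination output
stepwise = ["bmi", "age"]          # stand-in for stepwise selection output
selections = [[v for v in s if v in candidates]
              for s in (forward, backward, stepwise)]

final_vars = consensus_variables(selections)
print(final_vars)
```

Here `diet` is removed by the missing-value filter, and `tg` is removed because only one of the three procedures selected it.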

Thereafter, based on clinical criteria, the integrated risk factor building unit 13 may finally select 19 variables in order to build an integrated occurrence model for hypertension, diabetes, their comorbidity, and chronic kidney disease. These include basic demographic factors (sex, age); sociological factors (education level, income level, marital status); a family history factor (the cardiovascular disease family history score: for each of hypertension, diabetes, and heart disease, family history is coded 1 if present and 0 if absent, and the score is 0 if none is present and 1 if one or more are present); environmental behavior factors (drinking, regular exercise, smoking status); obesity indices (body mass index, waist circumference); indices for recognizing blood abnormalities (TG, HDL-cholesterol, and total cholesterol as dyslipidemia indices, hemoglobin as an anemia index, and ALT as a liver abnormality index); and the baseline disease morbidity status (the cardiovascular disease history score, in which each of a diagnosis of congestive heart failure, a diagnosis of coronary artery disease, and a diagnosis of cerebrovascular disease such as stroke or palsy is coded 1 if present and 0 if absent, and the score is 0 if none is present and 1 if one or more are present; hypertension; prehypertension; diabetes; prediabetes; and chronic kidney disease). The variables included in the final model therefore consist of the genome index (genomic score) and the integrated risk factors, and together they may be referred to as the 'integrated risk factor panel'.

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may build a disease occurrence prediction model using the integrated genome index and the integrated risk factor model as inputs. For example, an integrated risk prediction model for the occurrence of hypertension, diabetes, comorbid hypertension and diabetes, and chronic kidney disease may be built using a statistical method (a proportional hazards model) and machine learning methods (support vector machine, neural network, random forest). The disease occurrence prediction model generation unit 15 may build the disease occurrence prediction model by applying a first statistical technique and a second machine learning method. For example, the first statistical technique may be a Cox proportional hazards model, and the second machine learning method may include any one of a support vector machine (SVM), a recurrent neural network (RNN), and a random forest (RF). In addition, the disease occurrence prediction model generation unit 15 may compare the disease occurrence prediction models generated by the various algorithms and select the model using the random forest, which has the highest predictive power, as the final integrated risk prediction model (disease occurrence prediction model).

For example, the disease occurrence prediction model generation unit 15 may use a Cox regression model with the hypertension and diabetes genome markers and the repeatedly measured factors to determine the disease risk for each factor, in order to confirm the disease risk associated with combinations of previously known risk factors. The disease occurrence prediction model generation unit 15 may build the disease occurrence prediction model using a statistical modeling method based on the Cox proportional hazards model and machine learning methods: an artificial-neural-network-based support vector machine (SVM), a recurrent neural network (RNN), which is a deep learning method, and a random forest (RF). The disease occurrence prediction model generation unit 15 may set, as the final disease prediction model, the model with the best predictive power among the prediction models built from the plurality of neural networks.

The disease occurrence prediction model generation unit 15 may take as inputs the genome marker information of patients with the plurality of diseases, the risk factor variables of the plurality of diseases, and the disease risk of chronic kidney disease, and may generate a disease risk machine learning model that learns the degree of the relationship among these inputs. For example, the machine learning model may be generated using a Cox regression model.

The disease occurrence prediction model generation unit 15 may use a time-varying Cox regression model with the hypertension and diabetes genome markers and the repeatedly measured factors to determine the disease risk for each factor, in order to confirm the disease risk associated with changes in the repeatedly measured factors. For the final disease occurrence prediction model, the model may be built using a statistical modeling method based on the time-varying Cox regression model and machine learning methods: an artificial-neural-network-based support vector machine (SVM), a recurrent neural network (RNN), which is a deep learning method, and a random forest (RF). The disease occurrence prediction model generation unit 15 may set the model with the best predictive power among them as the final disease prediction model.

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may dichotomize the variables derived by the genome marker selection unit 11, the integrated genome index calculation unit 12, and the integrated risk factor building unit 13. For example, the disease occurrence prediction model generation unit 15 may divide continuous variables, such as body measurements and blood markers, into a normal range and an out-of-normal risk range according to the diagnostic criteria for metabolic syndrome. For categorical variables, such as the degree of obesity based on body mass index, the age distribution defined in four bands (under 50, 50-60, 60-70, and 70 or older), the environmental behavior variables, and the PRS defined by quartiles, it may create a dummy variable for each level so that each takes a binary value. Dichotomizing the variables in this way makes it possible to evaluate the effect of each state of each variable on disease occurrence.
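The dummy coding described above can be illustrated with the age bands and PRS quartiles. This is a sketch under assumptions: the helper names are hypothetical, the band boundaries are treated as half-open intervals (an age of exactly 60 falls in the 60-70 band), and the quartile cut points are taken from the standard library's `statistics.quantiles` on made-up PRS values. The metabolic syndrome cut-offs for continuous variables are not reproduced here.

```python
import bisect
import statistics

AGE_LEVELS = ["<50", "50-60", "60-70", ">=70"]


def age_band(age):
    """Map an age in years to one of the four bands named in the text."""
    if age < 50:
        return "<50"
    if age < 60:
        return "50-60"
    if age < 70:
        return "60-70"
    return ">=70"


def one_hot(value, levels):
    """Return one 0/1 dummy variable per level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {level: int(value == level) for level in levels}


def quartile_index(value, cut_points):
    """Return 1..4 according to which side of the three cut points value falls."""
    return bisect.bisect_right(cut_points, value) + 1


age_dummies = one_hot(age_band(63), AGE_LEVELS)

# Made-up PRS values used only to derive illustrative quartile cut points.
prs_values = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
cuts = statistics.quantiles(prs_values, n=4)
prs_dummies = one_hot(quartile_index(0.33, cuts), [1, 2, 3, 4])

print(age_dummies)
print(prs_dummies)
```

In a regression model one level per variable would normally be dropped as the reference category; the sketch keeps all levels so that each state is visible.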

In addition, to validate the disease occurrence prediction model, the disease occurrence prediction model generation unit 15 may build the model with the data divided into a training data set (training set) and a validation data set (test set). For example, because the model must be validated after it is built, the disease occurrence prediction model generation unit 15 may define the Ansan-Anseong cohort as the training set, the urban cohort as test set 1, the rural cohort as test set 2, and the combined urban and rural cohort as test set 3, and perform external validation on these.

The disease occurrence prediction model generation unit 15 may check how well the built model generalizes by applying it to the training set and the test set and checking the predictive performance. The disease occurrence prediction model generation unit 15 divided all subjects in a 7:3 ratio: 70% of the normal subjects and of the subjects who developed each disease (hypertension, diabetes, comorbid hypertension and diabetes, chronic kidney disease) were used for model development, and 30% were used as test data for model validation. In the training set, the variables of the integrated risk factor panel were used to predict the risk of each outcome variable while changing only the outcome variable. For this purpose, a time-varying Cox proportional hazards model was selected as the statistical method, and two artificial-neural-network-based methods (the support vector machine and the deep learning recurrent neural network (RNN)) and the random forest method were selected as the machine learning methods, and a learning model may be built by applying each algorithm.
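The 7:3 division, applied separately to the normal subjects and to each disease group, corresponds to a stratified split. A minimal standard-library sketch follows; the function name, the fixed random seed, and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split subject indices so that each outcome group is divided about 70/30."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


# Toy cohort: 10 normal subjects followed by 10 subjects who developed CKD.
labels = ["normal"] * 10 + ["ckd"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each group keeps the proportion of cases the same in the development and validation data, which matters when an outcome is comparatively rare.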

A support vector machine is a supervised learning model used for classification and regression analysis. Given a set of data in which each item belongs to one of two categories, it builds a non-probabilistic binary linear classification model that decides which category new data belongs to, separating the categories with the hyperplane whose distance to the nearest training data is largest. Among the various available variants, the disease occurrence prediction model generation unit 15 applied the spline kernel and the ANOVA RBF kernel.
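As a rough illustration of the kernels named above, the following sketch computes a Gaussian RBF kernel and one common parameterisation of the ANOVA RBF kernel. The exact kernel form and the values of `sigma` and `degree` used in the application are not stated, so both are assumptions here.

```python
import math

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel: exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)

def anova_rbf_kernel(x, y, sigma=1.0, degree=2):
    """ANOVA RBF kernel (one common form): (sum_k exp(-sigma * (x_k - y_k)^2)) ** degree."""
    per_feature = sum(math.exp(-sigma * (a - b) ** 2) for a, b in zip(x, y))
    return per_feature ** degree

# Two hypothetical, already-scaled feature vectors.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.5, 2.0]
similarity_rbf = rbf_kernel(x, y)
similarity_anova = anova_rbf_kernel(x, y)
```

The ANOVA form sums a one-dimensional RBF term per feature before raising to a power, so each variable contributes separately instead of through a single overall distance.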

The random forest method is an ensemble learning method used for classification, regression analysis, and the like; it outputs the class or the mean prediction obtained from a large number of decision trees constructed during training, and may operate as follows. First, the tree structure and parameters are learned automatically from the training data, with each tree trained independently, until the terminal node, that is, whether or not the disease occurs, is reached. In the typical random forest method, the principle is to build one model from variables selected at random from the data and then build further models while randomly drawing other variables through sampling with replacement. In this process, a variable that has a large weight but is erroneous or biologically irrelevant may enter the model, or a variable that is an intermediate outcome and should have occurred later may enter before the variable corresponding to its cause. To mitigate this problem, an algorithm may be developed and used that retains ensemble learning but groups the variables and admits each group step by step, so that variables enter the trees in a biologically and temporally ordered sequence.
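The application does not disclose the staged algorithm itself. The following is only a hypothetical sketch of the idea that candidate variables are admitted group by group in a fixed temporal order, with each tree still drawing its variables at random from the groups admitted so far. All group names and variable names are illustrative assumptions.

```python
import random

# Hypothetical variable groups in an assumed biological/temporal order.
VARIABLE_GROUPS = [
    ("genomic", ["genomic_score"]),
    ("demographic", ["sex", "age", "education", "income"]),
    ("behavior", ["drinking", "exercise", "smoking"]),
    ("obesity", ["bmi", "waist"]),
    ("blood", ["total_chol", "hdl_chol", "tg", "albumin"]),
    ("baseline_disease", ["baseline_htn", "baseline_dm"]),
]

def staged_candidates(groups):
    """Yield (stage name, cumulative candidate pool) as each group is admitted."""
    pool = []
    for name, variables in groups:
        pool = pool + variables          # later groups never precede earlier ones
        yield name, list(pool)

def sample_tree_variables(pool, n_vars, rng):
    """Randomly draw the variables offered to one tree at the current stage."""
    return rng.sample(pool, min(n_vars, len(pool)))

stages = list(staged_candidates(VARIABLE_GROUPS))
picked = sample_tree_variables(stages[2][1], 3, random.Random(0))
```

Under this arrangement a blood marker, for example, cannot be offered to a tree before the demographic and behavioral groups have been admitted.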

To test the predictive power of each model, the disease occurrence prediction model generation unit 15 performed internal validation to verify whether the model generated from the training set is reproduced in the test set: the test set was subjected to 1,000 permutations using the bootstrapping technique, the probability calculation of each fitted model was applied unchanged, and the predicted values of the training set and the test set were checked for agreement. The ROC curve and AUC value in the test set may then be presented. For the random forest, the AUC value may be calculated using ntree=300. Differences in predictive power between models were tested using the 95% confidence interval of the AUC, and the model with the highest AUC may be selected as the final prediction model.
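A minimal sketch of an AUC with a bootstrap 95% confidence interval is given below. It assumes plain resampling of test-set subjects with replacement and a percentile interval; the application does not specify its exact resampling procedure, and the labels and scores are made-up illustration data.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random case outscores a random non-case (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval of the AUC from bootstrap resamples."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]     # sample subjects with replacement
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue                                   # AUC is undefined for one class
        stats.append(auc(sample_labels, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical test-set outcomes and predicted risks.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]
point_estimate = auc(labels, scores)
lower, upper = bootstrap_auc_ci(labels, scores)
```

Two models can then be compared by checking whether their AUC intervals overlap, which is the kind of comparison the paragraph above describes.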

FIG. 8 is a diagram showing the variables included sequentially in the prediction model, in consideration of the timing of factor exposure, the temporal continuity of states that may change afterwards, and the natural history of the disease, according to an embodiment of the present application.

FIG. 8 shows a flow that starts from a genomic score based on genomic markers significant for the diseases, reflecting the genetic characteristics of the individual, and to which time-series data are added on demographic and sociological factors, environmental and behavioral factors, obesity-related measurements, current health status according to blood markers, and disease status at baseline, leading finally to the occurrence of hypertension, diabetes, their multimorbidity (hypertension-diabetes), and ultimately chronic kidney disease.

FIG. 9 is a diagram for explaining a deep learning model structure that integrates time-series data and genetic data according to an embodiment of the present application. FIG. 10 is a schematic diagram of a disease occurrence prediction model according to an embodiment of the present application. FIG. 11 is a schematic diagram of a random forest prediction model according to an embodiment of the present application. FIG. 12 is a diagram schematically showing a process of training models for various cases with an ensemble technique according to an embodiment of the present application.

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may take as input a plurality of state variables, including lifestyle state variables and health state variables of patients with chronic kidney disease, together with key gene information and the disease risk of chronic kidney disease, and generate a disease occurrence prediction model that learns the degree of the relationship between at least one of the state variables and the key gene information and the disease risk of chronic kidney disease.

The disease occurrence prediction model generation unit 15 may generate a machine learning model that learns information on the relationship between at least one of the plurality of state variables and the gene information and the disease risk of chronic kidney disease. For example, the machine learning model may be generated using a recurrent neural network (RNN) and a multi-layer perceptron (MLP) neural network.

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may input the genes related to each disease associated with chronic kidney disease through a multi-layer perceptron neural network connected to the recurrent neural network. In addition, the unit may feed the repeatedly measured state variables into the recurrent neural network sequentially, so that both the correlation of each epidemiological variable over time and the correlations between variables can be analyzed.

In addition, the disease occurrence prediction model generation unit 15 may repeatedly measure the subject's state variables and gene information and input the repeatedly measured information. Based on this information, the unit may check whether lifestyle has changed across the repeatedly measured values, such as lifestyle habits, anthropometric measurements, and clinical values. The unit may group subjects whose repeated measurements show similar patterns, generate a cluster for each group, and distinguish groups showing similar lifestyle changes by sex and by disease. Based on the subject's gene information, the unit may select significant genes related to lifestyle change for each disease associated with chronic kidney disease. A significant gene may be a gene linked to each such disease.

For example, referring to FIG. 9, the disease occurrence prediction model generation unit 15 may sequentially input the repeatedly measured genetic data of the subject into a recurrent neural network, which is one type of artificial neural network, and the significant genes related to lifestyle change for each disease associated with chronic kidney disease may be connected to the recurrent neural network through a multi-layer perceptron.

For example, referring to FIG. 10, the disease occurrence prediction model generation unit 15 may generate the machine learning model by applying a recurrent neural network, a type of artificial neural network that can accept time-series data such as the plurality of state variables including lifestyle state variables and health state variables. To integrate the genetic information, which was collected at a single time point, the unit may additionally connect a multi-layer perceptron neural network to the last layer of the recurrent neural network. The unit may set the presence or absence of chronic kidney disease occurrence as the final output layer.

For example, an artificial neural network may be divided into three layers: an input layer, a hidden layer, and an output layer. Each layer consists of nodes. The input layer receives input data from outside the system and passes it into the system. The hidden layer sits inside the system; it takes the input values, processes them, and produces a result. The output layer produces the system output based on the input values and the current state of the system. The values of the predictor variables (input variables) used to derive the predicted value (output variable) are entered at the input layer. If there are n input values, the input layer has n nodes; in the present application, the values entered at the input layer may be the plurality of state variables, including lifestyle state variables and health state variables, and the gene information. The hidden layer receives input values from a plurality of input nodes, computes their weighted sum, applies a transfer function to that sum, and passes the result to the output layer. For example, the input layer of the machine learning model may be the plurality of state information, the gene information, and the hidden layer of the previous time point; the hidden layer may be the plurality of state information or information obtained by grouping it; and the output layer may represent the disease risk.

According to an embodiment of the present application, the disease occurrence prediction model may perform a first learning that learns information on the relationship between the input layer and the hidden layer, with a first state variable among the plurality of state variables as the input layer and a second state variable among them as the hidden layer. The machine learning model may also perform the first learning with the previous-time-point values of the plurality of state variables as the input layer and their current-time-point values as the hidden layer.

Based on [Equation 1], the machine learning model may learn the degree of the relationship between the input layer and the hidden layer. The degree of the relationship may mean the weighted sum of the information received at the input layer, but is not limited thereto.

[Equation 1]

$$h_t = f\left(W_{xh}\,x_t + W_{hh}\,h_{t-1}\right)$$

The equation and its symbols are written here in conventional recurrent-network notation, reconstructed from the definitions that follow; $f$ denotes the transfer function applied to the weighted sum. Here, $h_t$ is the hidden layer at time t, $h_{t-1}$ is the hidden layer at the time point before t, $x_t$ is the first state variable, $W_{xh}$ is a first weight indicating the degree of a first type of relationship between the input layer and the hidden layer, and $W_{hh}$ is a second weight indicating the degree of a second type of relationship between the input layer and the hidden layer. For example, in [Equation 1], $x_t$ may be the first state variable among the plurality of state variables at time t, $h_t$ may represent the hidden layer at time t, $W_{xh}$ may be the weight between the plurality of state variables (input variables) and the hidden layer, and $W_{hh}$ may be the weight between hidden layers, but they are not limited thereto. As an example, the degree of the first type of relationship may be the correlation (weight) of the plurality of state variables over time, and the degree of the second type of relationship may be the correlation (weight) between the plurality of state variables, but they are not limited thereto.

The disease occurrence prediction model may feed the repeatedly measured state variables (for example, each individual's lifestyle and health state variables) into the recurrent neural network expressed by [Equation 1], and thereby analyze not only correlations over time but also correlations between the lifestyle and health state variables.
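The recurrence described above can be sketched as follows, assuming the conventional form in which each hidden state is a transfer function of a weighted sum of the current measurements and the previous hidden state. The use of tanh as the transfer function, the weight values, and the names `w_xh` and `w_hh` are assumptions for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_hidden_states(sequence, w_xh, w_hh, h0=None):
    """Return the hidden state after each repeated measurement in the sequence."""
    hidden_size = len(w_hh)
    h = [0.0] * hidden_size if h0 is None else list(h0)
    states = []
    for x_t in sequence:
        from_input = matvec(w_xh, x_t)       # contribution of the current state variables
        from_previous = matvec(w_hh, h)      # contribution of the previous hidden state
        h = [math.tanh(a + b) for a, b in zip(from_input, from_previous)]
        states.append(h)
    return states

# Hypothetical weights: 2 state variables, 2 hidden units, 3 follow-up visits.
w_xh = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[0.0, 0.1], [0.2, 0.0]]
visits = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
states = rnn_hidden_states(visits, w_xh, w_hh)
```

Because each hidden state depends on the previous one, the final state summarizes the whole measurement history of the subject.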

According to an embodiment of the present application, the disease occurrence prediction model may perform a second learning that learns information on the relationship between the hidden layer and the output layer, with the hidden layer and the gene information as the input and the disease risk as the output layer. The machine learning model may likewise perform this second learning under the same arrangement.

Based on [Equation 2], the machine learning model may learn the degree of the relationship between the hidden layer and the output layer. The second learning may learn this degree of relationship based on [Equation 1] and [Equation 2]. Using [Equation 1] and [Equation 2], the machine learning model may learn information on the relationships among the input layer, the hidden layer, and the output layer, and learn the predicted disease risk as the result of the output layer.

[Equation 2]

$$y = g\left(W_{hy}\,h + W_{zy}\,z\right)$$

The equation and its symbols are written here in conventional notation, reconstructed from the definitions that follow; $g$ denotes the transfer function of the output layer. Here, $y$ is the output layer, $W_{hy}$ is a third weight indicating the degree of the relationship between the hidden layer and the output layer, $h$ is the hidden layer, $W_{zy}$ is a fourth weight indicating the degree of the relationship between the gene information in the input layer and the output layer, and $z$ may be the gene information in the input layer. As an example, the third weight is the degree of the relationship between the plurality of state variables and the output layer used to predict disease risk, and the fourth weight may be the degree of the relationship between the gene information and the output layer, used to give weight to specific genes.

According to an embodiment of the present application, because the genetic information was collected at a single time point, it may be input by connecting a multi-layer perceptron neural network to the last layer of the recurrent neural network, as in [Equation 2], so that it is integrated into the recurrent neural network. For example, the genetic information was collected in the form of single nucleotide polymorphisms, and the genetic information already known for each disease associated with chronic kidney disease may be converted into a risk index (risk factor) according to allele and then input. Through the second learning, the machine learning model may learn the degree of the relationship between the hidden layer and the output layer, that is, the weights between the hidden layer and the output layer.
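The combination of the last hidden state with single-time-point genetic information can be sketched as below. This simplifies the multi-layer perceptron to a single output unit with a logistic transfer function; that simplification, the weight values, and the names `w_hy` and `w_zy` are assumptions, not details from the application.

```python
import math

def sigmoid(a):
    """Logistic function mapping a weighted sum to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-a))

def predict_risk(h_last, z, w_hy, w_zy, bias=0.0):
    """Combine the final hidden state and the genetic risk scores into one risk value."""
    from_hidden = sum(w * h for w, h in zip(w_hy, h_last))   # state-variable pathway
    from_genes = sum(w * g for w, g in zip(w_zy, z))         # genetic pathway
    return sigmoid(from_hidden + from_genes + bias)

# Hypothetical final hidden state and one allele-based genetic risk score.
h_last = [0.2, -0.1]
low_genetic_risk = predict_risk(h_last, [0.0], w_hy=[1.0, 1.0], w_zy=[0.8])
high_genetic_risk = predict_risk(h_last, [2.0], w_hy=[1.0, 1.0], w_zy=[0.8])
```

With a positive genetic weight, a higher allele-based score raises the predicted risk while the time-series pathway is left unchanged.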

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may update the weights, based on [Equation 3], according to the error that arises when generating the disease occurrence prediction model, which learns the degree of the relationship between at least one of the plurality of state variables and the gene information and the disease risk of chronic kidney disease.

In [Equation 3], E is the detected value of the error of the disease occurrence prediction model (machine learning model), t is whether chronic kidney disease occurred, y is the disease risk predicted by the machine learning model, and the remaining term is an L2 regularization term that prevents overfitting caused by errors.

[Equation 3] is the error function of the disease risk machine learning model generation unit 140; the calculated error may be used to learn the weights of the artificial neural network through the backpropagation algorithm. An L2 regularization term was added to prevent overfitting to noise that arises during learning, and t may indicate the actual presence or absence of each chronic kidney disease outcome, but is not limited thereto.
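The exact form of the error in [Equation 3] is not reproduced in this text. As a hedged illustration only, the sketch below assumes a binary cross-entropy error between the observed outcome t and the predicted risk y, plus an L2 penalty on the weights; the cross-entropy choice and the value of `l2_lambda` are assumptions.

```python
import math

def regularized_error(targets, predictions, weights, l2_lambda=0.01, eps=1e-12):
    """Mean binary cross-entropy plus an L2 penalty on the weights (assumed form)."""
    cross_entropy = 0.0
    for t, y in zip(targets, predictions):
        y = min(max(y, eps), 1.0 - eps)              # keep log() finite
        cross_entropy -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    penalty = l2_lambda * sum(w * w for w in weights)  # discourages large weights
    return cross_entropy / len(targets) + penalty

# One subject who developed the disease, predicted risk 0.5.
base = regularized_error([1], [0.5], [0.0], l2_lambda=0.1)
with_penalty = regularized_error([1], [0.5], [1.0, 2.0], l2_lambda=0.1)
```

The penalty grows with the squared size of the weights, so backpropagation on this error trades prediction fit against weight magnitude.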

According to an embodiment of the present application, in the Cox proportional hazards model applied by the disease occurrence prediction model generation unit 15, the genomic indicators and the demographic, social, and family history indicators were measured once, and hypertension, diabetes, and chronic kidney disease status were likewise treated as measured once, because only the disease status at the baseline survey was included. Behavioral factors, obesity-related indicators, and abnormal blood indicators were also taken from the measurements at the baseline survey, and the disease occurrence prediction model generation unit 15 may predict the risk of disease occurrence from them using the Cox proportional hazards model. Through this analysis, the unit may obtain, for each variable, the effect size (beta) and the hazard ratio (HR) for the occurrence of hypertension, diabetes, and chronic kidney disease.
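In a Cox proportional hazards model the hazard ratio for a variable is the exponential of its coefficient, HR = exp(beta). The short sketch below converts a coefficient and, optionally, its standard error into a hazard ratio with a Wald-type 95% confidence interval; the numeric inputs are made up for illustration.

```python
import math

def hazard_ratio(beta, se=None, z=1.96):
    """Return (HR, confidence interval) for a Cox coefficient; interval is None without se."""
    hr = math.exp(beta)
    if se is None:
        return hr, None
    return hr, (math.exp(beta - z * se), math.exp(beta + z * se))

# Hypothetical coefficient and standard error for one risk factor.
hr, interval = hazard_ratio(beta=0.405, se=0.10)
```

A coefficient of zero corresponds to HR = 1, meaning the variable is not associated with a change in hazard.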

For example, referring to FIG. 11, the model admits the variables group by group in stages in order to respect biological temporal precedence, and, as shown in FIG. 11, a generalizable optimal model may be generated by training models for the various cases with ensemble techniques, through bagging and boosting. FIG. 11 is a simplified diagram of model selection for the various cases, ensemble training, and the setting of the optimized model.

According to an embodiment of the present application, the disease prediction unit 16 may predict the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.

The disease prediction unit 16 may predict the occurrence of chronic kidney disease in the subject by applying the subject's gene information, the plurality of integrated risk factor indicators, blood markers, and the like to the disease occurrence prediction model. For example, the disease prediction unit 16 may predict the subject's disease risk by applying the subject's state variables and gene information to the disease risk machine learning model. The disease prediction unit 16 may also predict the subject's disease risk by applying the subject's state variables and gene information to both the disease risk machine learning model and the gene information statistical probability model.

According to an embodiment of the present application, the disease prediction unit 16 may predict the subject's disease risk by applying the subject's state variables and gene information to the machine learning model and the statistical probability model. The disease prediction unit 16 may also visualize the subject's predicted disease risk according to preset classification items. For example, the disease prediction unit 16 may build a deep-learning-based visualization algorithm and provide visualized results for each subject based on the disease occurrence prediction model of the disease occurrence prediction model generation unit 15. Based on how negative factors change, the disease prediction unit 16 may predict and visualize changes in the individual's disease risk path. Based on how positive factors change, it may visualize a safe path along which the individual's probability of disease may be reduced. Considering the changes in negative and positive factors together, and based on the lifestyle changes of each subject, the disease prediction unit 16 may provide a personalized preventive care service model by guiding the subject along risk-avoidance paths for chronic kidney disease and for the final health outcomes, namely cardiovascular disease, chronic heart disease, and death.

In addition, the disease prediction unit 16 may provide disease prevention and management information linked to the subject's predicted chronic kidney disease occurrence. The disease prediction unit 16 may provide this information to a user terminal (not shown). For example, the disease prevention and management information may include information on diet, exercise, and the like tailored to subjects with hypertension, diabetes, comorbid hypertension-diabetes, or chronic kidney disease.

According to an embodiment of the present application, the chronic kidney disease occurrence prediction apparatus 10 may select the factors associated with each disease in the baseline population. First, the apparatus 10 may exclude variables from the full set of 501 variables according to the following criteria. Variables unrelated to predicting the diseases (hypertension, diabetes, chronic kidney disease), such as personal identifiers and survey dates, may be excluded. Variables directly related to the final disease state (hypertension, diabetes, chronic kidney disease), such as whether the disease has been diagnosed and disease history, may be excluded. Variables whose proportion of missing values exceeds 20% of the data may also be excluded. Through this process, the apparatus 10 may derive a total of 120 variables for consideration.
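The three exclusion criteria above can be sketched as a simple filter. The column names, the example values, and the function name `filter_variables` are hypothetical; missing values are represented as `None`.

```python
def filter_variables(columns, unrelated, outcome_related, max_missing=0.20):
    """Keep variables that are relevant, not outcome-defining, and at most 20% missing."""
    kept = []
    for name, values in columns.items():
        if name in unrelated or name in outcome_related:
            continue                                   # criteria 1 and 2
        missing_rate = sum(v is None for v in values) / len(values)
        if missing_rate > max_missing:
            continue                                   # criterion 3
        kept.append(name)
    return kept

# Hypothetical survey columns for five subjects.
columns = {
    "subject_id": [1, 2, 3, 4, 5],
    "survey_date": ["2003-05", "2003-05", "2003-06", "2003-06", "2003-07"],
    "htn_diagnosed": [0, 1, 0, 0, 1],
    "bmi": [22.1, None, 27.3, 24.0, 25.5],
    "hs_crp": [None, None, 0.4, 0.2, 0.1],
    "age": [45, 52, 61, 48, 57],
}
kept = filter_variables(
    columns,
    unrelated={"subject_id", "survey_date"},
    outcome_related={"htn_diagnosed"},
)
```

Here `bmi` is retained because its missing rate equals, but does not exceed, the 20% threshold, while `hs_crp` is dropped at 40% missing.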

In addition, the chronic kidney disease occurrence prediction apparatus 10 may select variables by fitting a full multinomial logistic regression model on the 120 variables, with the disease status defined in the preprocessing step as the response variable. After fitting the multinomial logistic regression model, the apparatus 10 may run variable selection (forward, backward, and stepwise selection) to make a first selection, and then retain the variables chosen by at least two of the three selection methods. In addition, the apparatus 10 may select risk factors significantly associated with the occurrence of hypertension, diabetes, or chronic kidney disease on the basis of clinical and epidemiological significance, as follows. Specifically, the analysis is performed with each of the three selection methods (backward, forward, and stepwise) in the Cox proportional hazards model, and the variables appearing in the results of at least two of the methods may be selected. Variables used in cardiovascular and metabolic disease prediction models established in previous studies may be selected as additional explanatory variables. The cardiovascular disease prediction models established in previous studies are shown in Table 5.

| Demographics and family history | Lifestyle | Disease history | Blood pressure | Blood test (lipids) | Other blood tests |
|---|---|---|---|---|---|
| Sex | Smoking | Diabetes | SBP | Total cholesterol | Blood glucose |
| Age | Body mass index | CVD | DBP | HDL-cholesterol | C-reactive protein |
| Race | Physical activity | AF | Hypertension | LDL-cholesterol | Albumin |
| Family history of cardiovascular disease | Diet | Glucose intolerance | | TG | Creatinine |
| | Psychological factors | Angina pectoris | | | |
| | Socioeconomic factors | Other | | | |
| | Drinking | | | | |
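
The rule of keeping variables chosen by at least two of the three selection procedures can be sketched as a simple vote count. This is a minimal illustration, not the apparatus's implementation: the function name `select_by_agreement` and the variable lists are hypothetical, and the three lists stand in for the outputs of forward, backward, and stepwise selection run elsewhere.

```python
from collections import Counter

def select_by_agreement(selections, min_votes=2):
    """Return the variables chosen by at least `min_votes` selection procedures."""
    # set() guards against a procedure listing the same variable twice.
    votes = Counter(var for chosen in selections.values() for var in set(chosen))
    return sorted(var for var, n in votes.items() if n >= min_votes)

# Hypothetical outputs of the three procedures (illustrative names only).
selections = {
    "forward": ["age", "sex", "bmi", "smoking"],
    "backward": ["age", "sex", "bmi", "albumin"],
    "stepwise": ["age", "bmi", "smoking"],
}

selected = select_by_agreement(selections)
print(selected)  # ['age', 'bmi', 'sex', 'smoking']
```

Here `albumin` is dropped because only one procedure picked it, while `sex` and `smoking` are kept with two votes each.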

The variables finally included in the model were chosen with Korea's health promotion and disease prevention policy in mind: smoking, drinking, obesity, and regular exercise are factors that the policy targets for improvement, and smoking in particular carries public health importance. Based on the process above, to build an integrated occurrence model for hypertension, diabetes, and chronic kidney disease, the apparatus 10 may finally select 19 variables, including sex, age, education level, income level, cardiovascular disease history, cardiovascular disease family history score, environmental and behavioral factors (drinking, regular exercise, smoking status), the obesity indicators body mass index and waist circumference, blood tests (total cholesterol, HDL-cholesterol, TG), liver function blood markers such as albumin, baseline disease status, and genomic markers of the diseases. According to an embodiment of the present application, the integrated risk factor building unit 13 may evaluate the effect of the variables selected for the integrated risk factor panel on the occurrence of hypertension and diabetes. For example, the integrated risk factor building unit 13 may evaluate their effect on the occurrence of hypertension. Table 6 shows the effect of the integrated risk factor panel variables on the occurrence of hypertension.

A beta value greater than 0 for a factor indicates that the risk of hypertension increases with that factor; a beta value less than 0 indicates that the risk decreases.

| Variable | Level | beta | Standard error | HR (95% CI) | P-value |
|---|---|---|---|---|---|
| Age, years | <50 | ref | | | |
| | 50-60 | 0.582 | 0.040 | 1.79 (1.65-1.94) | <.0001 |
| | 60-70 | 0.927 | 0.039 | 2.53 (2.34-2.73) | <.0001 |
| Sex | Male | ref | | | |
| | Female | -0.119 | 0.032 | 0.89 (0.83-0.95) | 0.001 |
| Education level | Elementary school or below | ref | | | |
| | High school | -0.501 | 0.035 | 0.61 (0.57-0.65) | <.0001 |
| | University or higher | -0.634 | 0.055 | 0.53 (0.48-0.59) | <.0001 |
| Income, 10,000 KRW/month | <100 | ref | | | |
| | 100-200 | -0.418 | 0.040 | 0.66 (0.61-0.71) | <.0001 |
| | 200-400 | -0.609 | 0.042 | 0.54 (0.50-0.59) | <.0001 |
| | 400+ | -0.588 | 0.067 | 0.56 (0.49-0.63) | <.0001 |
| BMI, kg/m² | <23 | ref | | | |
| | 23-27.5 | 0.382 | 0.040 | 1.47 (1.36-1.59) | <.0001 |
| | 27.5+ | 0.778 | 0.048 | 2.18 (1.98-2.39) | <.0001 |
| Waist circumference, cm | Men <90, women <85 | ref | | | |
| | Men 90+, women 85+ | 0.645 | 0.033 | 1.91 (1.79-2.04) | <.0001 |
| Smoking | Never | ref | | | |
| | Past | 0.148 | 0.044 | 1.16 (1.06-1.27) | <.0001 |
| | Current | 0.039 | 0.039 | 1.04 (0.96-1.12) | 0.314 |
| Drinking | Never | ref | | | |
| | Past | 0.056 | 0.068 | 1.06 (0.93-1.21) | 0.411 |
| | Current | 0.018 | 0.033 | 1.02 (0.95-1.09) | 0.584 |
| Regular exercise | No | ref | | | |
| | Yes | -0.031 | 0.032 | 0.97 (0.91-1.03) | 0.331 |
| Hypertension status (blood pressure)¹ | Normal | ref | | | |
| | Prehypertension | 1.629 | 0.044 | 5.10 (4.68-5.56) | <.0001 |
| | Hypertension | 2.741 | 0.057 | 15.50 (13.86-17.33) | <.0001 |
| Diabetes status (fasting glucose and HbA1c)² | Normal | ref | | | |
| | Prediabetes | 0.415 | 0.070 | 1.51 (1.32-1.74) | <.0001 |
| | Diabetes | 0.528 | 0.047 | 1.70 (1.55-1.86) | <.0001 |
| Chronic kidney disease (creatinine)³ | Normal | ref | | | |
| | CKD | 0.694 | 0.087 | 2.00 (1.69-2.38) | <.0001 |
| Total cholesterol, mg/dL | <200 | ref | | | |
| | 200-240 | 0.115 | 0.036 | 1.12 (1.05-1.20) | 0.002 |
| | 240+ | 0.324 | 0.053 | 1.38 (1.25-1.54) | <.0001 |
| HDL-C, mg/dL | Men 40+, women 50+ | ref | | | |
| | Men <40, women <50 | 0.151 | 0.033 | 1.16 (1.09-1.24) | <.0001 |
| TG, mg/dL | <150 | ref | | | |
| | 150+ | 0.486 | 0.033 | 1.63 (1.53-1.73) | <.0001 |
| Albumin, g/dL | 3.4-5.4 | ref | | | |
| | <3.4 | 0.839 | 0.353 | 2.32 (1.16-4.63) | 0.018 |
| Cardiovascular disease history score⁴ | 0 | ref | | | |
| | 1+ | 0.650 | 0.098 | 1.92 (1.58-2.32) | <.0001 |
| Cardiovascular disease family history score⁵ | 0 | ref | | | |
| | 1+ | 0.060 | 0.044 | 1.06 (0.97-1.16) | 0.171 |
| Hypertension genome score | Continuous scale | | | | |
| Hypertension genome score, categorical scale | Q1 | ref | | | |
| | Q2 | 0.070 | 0.054 | 1.07 (0.96-1.19) | 0.201 |
| | Q3 | 0.095 | 0.054 | 1.10 (0.99-1.22) | 0.078 |
| | Q4 | 0.223 | 0.052 | 1.25 (1.13-1.39) | <.0001 |
| | Q5 | 0.299 | 0.051 | 1.35 (1.22-1.49) | <.0001 |

¹ Hypertension status: normal is defined as SBP of 120 mmHg or lower with DBP of 80 mmHg or lower; prehypertension as SBP of 130 mmHg or higher or DBP of 80 mmHg or higher, or SBP below 140 mmHg with DBP below 90 mmHg; and hypertension as SBP of 140 mmHg or higher or DBP of 90 mmHg or higher.

² Diabetes status: normal is defined as fasting glucose below 100 mg/dL with HbA1c below 6.5%; prediabetes as fasting glucose of at least 100 mg/dL but below 126 mg/dL with HbA1c below 6.5%; and diabetes as fasting glucose of 126 mg/dL or higher or HbA1c of 6.5% or higher.

³ Chronic kidney disease status: chronic kidney disease is defined as an estimated glomerular filtration rate (eGFR), computed from the serum creatinine concentration, below 60.

⁴ Cardiovascular disease history score: each of three diagnoses (congestive heart failure; coronary artery disease; cerebrovascular disease such as stroke) is coded 1 if present and 0 if absent. The score is 0 when none is present and 1 when one or more are present.

⁵ Cardiovascular disease family history score: each of three family histories (hypertension, diabetes, heart disease) is coded 1 if present and 0 if absent. The score is 0 when none is present and 1 when one or more are present.
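
The status definitions and binary scores above translate directly into small classification functions. The sketch below is illustrative only. All function names are hypothetical, and because the blood pressure ranges as written overlap, the code assumes a precedence that the text does not state: hypertension is checked first, then normal, and everything else is treated as prehypertension.

```python
def blood_pressure_status(sbp, dbp):
    """Classify blood pressure (mmHg). The check order is an assumption."""
    if sbp >= 140 or dbp >= 90:
        return "hypertension"
    if sbp <= 120 and dbp <= 80:
        return "normal"
    return "prehypertension"

def diabetes_status(fasting_glucose, hba1c):
    """Classify by fasting glucose (mg/dL) and HbA1c (%)."""
    if fasting_glucose >= 126 or hba1c >= 6.5:
        return "diabetes"
    if fasting_glucose >= 100:
        return "prediabetes"
    return "normal"

def ckd_status(egfr):
    """Chronic kidney disease is defined as eGFR below 60."""
    return "CKD" if egfr < 60 else "normal"

def binary_history_score(flags):
    """Return 1 if any of the three history items is present, else 0."""
    return 1 if any(flags) else 0

print(blood_pressure_status(132, 84))   # prehypertension
print(diabetes_status(110, 5.9))        # prediabetes
print(ckd_status(55))                   # CKD
print(binary_history_score([0, 1, 0]))  # 1
```

The same `binary_history_score` helper serves both the cardiovascular disease history score and the family history score, since both collapse three yes/no items into 0 or 1.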

Compared with subjects under 50, the risk of developing hypertension rises with age, and it is higher in men than in women. For education, with "elementary school or below" as the reference, the rate of hypertension falls as the length of education increases. The risk also generally falls as income rises. These effects may reflect differences, by socioeconomic level, in how much attention people pay to their health, the kind of work they do, and their knowledge of how to stay healthy. The risk of hypertension rises with body mass index, an obesity indicator, and with waist circumference, an indicator of abdominal obesity. Drinking and regular exercise show no significant association with the risk of hypertension. For smoking, however, the risk is significantly higher in past smokers. The blood markers, namely the lipid indicators TG, HDL-cholesterol, and total cholesterol and the liver indicator albumin, were categorized using clinical reference values, and values outside the normal range are associated with a statistically significant increase in hypertension. The risk of hypertension also increases with a positive cardiovascular disease history score and with hypertension, diabetes, or chronic kidney disease status at baseline. With a family history of cardiovascular disease, the risk of hypertension is higher, but not significantly so. The higher the genome score built from hypertension-related markers, the higher the risk of hypertension.
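
The beta, standard error, and HR (95% CI) columns of Table 6 are related in the usual way for a Cox model: the hazard ratio is exp(beta), and an approximate 95% interval is exp(beta ± 1.96 × SE). The sketch below checks this against one row of the table; the function name `hazard_ratio` is hypothetical, and the use of a normal-approximation (Wald) interval is an assumption, since the text does not say how the intervals were computed.

```python
import math

def hazard_ratio(beta, se, z=1.96):
    """Return (HR, lower, upper) for a Cox coefficient and its standard error."""
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))

# Age 50-60 row of Table 6: beta = 0.582, standard error = 0.040.
hr, lower, upper = hazard_ratio(0.582, 0.040)
print(f"{hr:.2f} ({lower:.2f}-{upper:.2f})")  # 1.79 (1.65-1.94)
```

The printed values match the HR (95% CI) reported for that row, which also illustrates why a positive beta corresponds to HR above 1 and a negative beta to HR below 1.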

According to an embodiment of the present application, the integrated risk factor building unit 13 may evaluate the effect of the variables selected for the integrated risk factor panel on the occurrence of hypertension and diabetes. For example, the integrated risk factor building unit 13 may evaluate their effect on the occurrence of diabetes. Table 7 shows the effect of the integrated risk factor panel variables on the occurrence of diabetes.

A beta value greater than 0 for a factor indicates that the risk of diabetes increases with that factor; a beta value less than 0 indicates that the risk decreases.

| Variable | Level | beta | Standard error | HR (95% CI) | P-value |
|---|---|---|---|---|---|
| Age, years | <50 | ref | | | |
| | 50-60 | 0.539 | 0.065 | 1.71 (1.51-1.95) | <.0001 |
| | 60-70 | 0.639 | 0.065 | 1.89 (1.67-2.15) | <.0001 |
| Sex | Male | ref | | | |
| | Female | -0.160 | 0.054 | 0.85 (0.77-0.95) | 0.003 |
| Education level | Elementary school or below | ref | | | |
| | High school | -0.341 | 0.058 | 0.71 (0.64-0.80) | <.0001 |
| | University or higher | -0.323 | 0.087 | 0.72 (0.61-0.86) | <.0001 |
| Income, 10,000 KRW/month | <100 | ref | | | |
| | 100-200 | -0.235 | 0.066 | 0.79 (0.69-0.90) | <.0001 |
| | 200-400 | -0.353 | 0.068 | 0.70 (0.62-0.80) | <.0001 |
| | 400+ | -0.246 | 0.105 | 0.78 (0.64-0.96) | 0.019 |
| BMI, kg/m² | <23 | ref | | | |
| | 23-27.5 | 0.505 | 0.072 | 1.66 (1.44-1.91) | <.0001 |
| | 27.5+ | 1.151 | 0.079 | 3.16 (2.71-3.69) | <.0001 |
| Waist circumference, cm | Men <90, women <85 | ref | | | |
| | Men 90+, women 85+ | 0.880 | 0.054 | 2.41 (2.17-2.68) | <.0001 |
| Smoking | Never | ref | | | |
| | Past | 0.140 | 0.075 | 1.15 (0.99-1.33) | 0.061 |
| | Current | 0.244 | 0.062 | 1.28 (1.13-1.44) | <.0001 |
| Drinking | Never | ref | | | |
| | Past | 0.317 | 0.102 | 1.37 (1.13-1.68) | 0.002 |
| | Current | -0.023 | 0.056 | 0.98 (0.88-1.09) | 0.675 |
| Regular exercise | No | ref | | | |
| | Yes | -0.077 | 0.053 | 0.93 (0.83-1.03) | 0.148 |
| Hypertension status (blood pressure)¹ | Normal | ref | | | |
| | Prehypertension | 0.600 | 0.060 | 1.82 (1.62-2.05) | <.0001 |
| | Hypertension | 0.883 | 0.078 | 2.42 (2.08-2.82) | <.0001 |
| Diabetes status (fasting glucose and HbA1c)² | Normal | ref | | | |
| | Prediabetes | 1.927 | 0.093 | 6.87 (5.73-8.24) | <.0001 |
| Chronic kidney disease (creatinine)³ | Normal | ref | | | |
| | CKD | 0.203 | 0.157 | 1.23 (0.90-1.66) | 0.195 |
| Total cholesterol, mg/dL | <200 | ref | | | |
| | 200-240 | 0.328 | 0.057 | 1.39 (1.24-1.55) | <.0001 |
| | 240+ | 0.694 | 0.077 | 2.00 (1.72-2.33) | <.0001 |
| HDL-C, mg/dL | Men 40+, women 50+ | ref | | | |
| | Men <40, women <50 | 0.363 | 0.053 | 1.44 (1.30-1.60) | <.0001 |
| TG, mg/dL | <150 | ref | | | |
| | 150+ | 0.994 | 0.054 | 2.70 (2.43-3.00) | <.0001 |
| Albumin, g/dL | 3.4-5.4 | ref | | | |
| | <3.4 | 1.473 | 0.448 | 4.36 (1.81-10.50) | 0.001 |
| Cardiovascular disease history score⁴ | 0 | ref | | | |
| | 1+ | 0.415 | 0.162 | 1.51 (1.10-2.08) | 0.011 |
| Cardiovascular disease family history score⁵ | 0 | ref | | | |
| | 1+ | -0.022 | 0.072 | 0.98 (0.85-1.23) | 0.755 |
| Diabetes genome score | Continuous scale | | | | |
| Diabetes genome score, categorical scale | Q1 | ref | | | |
| | Q2 | 0.095 | 0.088 | 1.10 (0.93-1.31) | 0.283 |
| | Q3 | 0.137 | 0.087 | 1.15 (0.97-1.36) | 0.116 |
| | Q4 | 0.315 | 0.084 | 1.37 (1.16-1.62) | <.0001 |
| | Q5 | 0.491 | 0.082 | 1.63 (1.39-1.92) | <.0001 |

According to an embodiment of the present application, the risk of diabetes rises with body mass index, an obesity indicator, and with waist circumference, an indicator of abdominal obesity. Current smokers and past drinkers have a significantly higher risk of diabetes. Regular exercise, however, shows no significant association with the risk of diabetes. The blood markers, namely the lipid indicators TG, HDL-cholesterol, and total cholesterol and the liver indicator albumin, were categorized using clinical reference values, and values outside the normal range are associated with a statistically significant increase in diabetes. The risk of diabetes also tends to increase with hypertension, prediabetes, or chronic kidney disease at baseline. A history of cardiovascular disease significantly increases the risk of diabetes, whereas a family history of cardiovascular disease shows no significant association. The higher the genome score built from diabetes-related markers, the higher the risk of diabetes.

According to an embodiment of the present application, the integrated risk factor building unit 13 may evaluate the effect of the variables selected for the integrated risk factor panel on chronic kidney disease. For example, the integrated risk factor building unit 13 may evaluate their effect on the occurrence of chronic kidney disease. Table 8 shows the effect of the integrated risk factor panel variables on the occurrence of chronic kidney disease.

A beta value greater than 0 for a factor indicates that the risk of chronic kidney disease increases with that factor; a beta value less than 0 indicates that the risk decreases.

| Variable | Level | beta | Standard error | HR (95% CI) | P-value |
|---|---|---|---|---|---|
| Age, years | <50 | ref | | | |
| | 50-60 | 0.692 | 0.072 | 2.00 (1.74-2.30) | <.0001 |
| | 60-70 | 1.545 | 0.064 | 4.69 (4.14-5.31) | <.0001 |
| Sex | Male | ref | | | |
| | Female | 0.494 | 0.054 | 1.64 (1.47-1.82) | <.0001 |
| Education level | Elementary school or below | ref | | | |
| | High school | -0.637 | 0.055 | 0.53 (0.48-0.59) | <.0001 |
| | University or higher | -0.694 | 0.089 | 0.50 (0.42-0.59) | <.0001 |
| Income, 10,000 KRW/month | <100 | ref | | | |
| | 100-200 | -0.519 | 0.064 | 0.60 (0.53-0.68) | <.0001 |
| | 200-400 | -0.580 | 0.066 | 0.56 (0.49-0.64) | <.0001 |
| | 400+ | -0.753 | 0.116 | 0.47 (0.38-0.59) | <.0001 |
| BMI, kg/m² | <23 | ref | | | |
| | 23-27.5 | 0.171 | 0.062 | 1.19 (1.05-1.34) | 0.006 |
| | 27.5+ | 0.489 | 0.076 | 1.63 (1.41-1.89) | <.0001 |
| Waist circumference, cm | Men <90, women <85 | ref | | | |
| | Men 90+, women 85+ | 0.587 | 0.052 | 1.80 (1.62-1.99) | <.0001 |
| Smoking | Never | ref | | | |
| | Past | -0.286 | 0.077 | 0.75 (0.65-0.87) | <.0001 |
| | Current | -0.416 | 0.068 | 0.66 (0.58-0.75) | <.0001 |
| Drinking | Never | ref | | | |
| | Past | -0.124 | 0.105 | 0.88 (0.72-1.09) | 0.239 |
| | Current | -0.522 | 0.055 | 0.59 (0.53-0.66) | <.0001 |
| Regular exercise | No | ref | | | |
| | Yes | -0.094 | 0.052 | 0.91 (0.82-1.01) | 0.068 |
| Hypertension status (blood pressure)¹ | Normal | ref | | | |
| | Prehypertension | 0.424 | 0.059 | 1.53 (1.36-1.71) | <.0001 |
| | Hypertension | 0.800 | 0.077 | 2.23 (1.92-2.59) | <.0001 |
| Diabetes status (fasting glucose and HbA1c)² | Normal | ref | | | |
| | Prediabetes | -0.060 | 0.136 | 0.94 (0.72-1.23) | 0.660 |
| | Diabetes | 0.807 | 0.067 | 2.24 (1.97-2.56) | <.0001 |
| Total cholesterol, mg/dL | <200 | ref | | | |
| | 200-240 | 0.314 | 0.056 | 1.37 (1.23-1.53) | <.0001 |
| | 240+ | 0.478 | 0.083 | 1.61 (1.37-1.90) | <.0001 |
| HDL-C, mg/dL | Men 40+, women 50+ | ref | | | |
| | Men <40, women <50 | 0.430 | 0.054 | 1.54 (1.38-1.71) | <.0001 |
| TG, mg/dL | <150 | ref | | | |
| | 150+ | 0.447 | 0.052 | 1.56 (1.41-1.73) | <.0001 |
| Albumin, g/dL | 3.4-5.4 | ref | | | |
| | <3.4 | 2.457 | 0.501 | 11.67 (4.37-31.16) | <.0001 |
| Cardiovascular disease history score⁴ | 0 | ref | | | |
| | 1+ | 0.747 | 0.150 | 2.11 (1.57-2.83) | <.0001 |
| Cardiovascular disease family history score⁵ | 0 | ref | | | |
| | 1+ | 0.055 | 0.070 | 1.06 (0.92-1.21) | 0.439 |
| Hypertension genome score | Continuous scale | | | | |
| Hypertension genome score, categorical scale | Q1 | ref | | | |
| | Q2 | -0.015 | 0.086 | 0.99 (0.83-1.17) | 0.865 |
| | Q3 | 0.102 | 0.084 | 1.11 (0.94-1.31) | 0.223 |
| | Q4 | 0.118 | 0.082 | 1.13 (0.96-1.32) | 0.154 |
| | Q5 | 0.107 | 0.082 | 1.12 (0.95-1.31) | 0.192 |
| Diabetes genome score | Continuous scale | | | | |
| Diabetes genome score, categorical scale | Q1 | ref | | | |
| | Q2 | -0.074 | 0.081 | 0.93 (0.79-1.09) | 0.359 |
| | Q3 | -0.011 | 0.079 | 0.99 (0.85-1.16) | 0.890 |
| | Q4 | -0.141 | 0.082 | 0.87 (0.74-1.02) | 0.086 |
| | Q5 | -0.082 | 0.081 | 0.92 (0.79-1.08) | 0.316 |

The risk of chronic kidney disease rises with body mass index, an obesity indicator, and with waist circumference, an indicator of abdominal obesity. Regular exercise shows no significant association with the occurrence of chronic kidney disease. The blood markers, namely the lipid indicators TG, HDL-cholesterol, and total cholesterol, were categorized using clinical reference values, and values outside the normal range are associated with a statistically significant increase in chronic kidney disease. The risk of chronic kidney disease also increases with hypertension or diabetes at baseline. A family history of cardiovascular disease is associated with a slightly higher risk of chronic kidney disease (HR 1.06), although the association in Table 8 is not statistically significant (P = 0.439).

According to an embodiment of the present application, the disease occurrence prediction model generation unit 15 may compare the predictive power of disease occurrence prediction models built with a statistical method and with several machine learning methods. After comparing models built with various artificial intelligence learning methods, the disease occurrence prediction model generation unit 15 may select, as the final disease occurrence prediction model, the model built with the method that best predicts the risk of hypertension, diabetes, and their comorbidity, chronic kidney disease.

For example, a disease prediction model was built for each disease with a statistical method (Cox model) and with three machine learning methods (artificial neural network, deep-learning-based recurrent neural network (RNN), and random forest), and the predictive power of the models was compared by plotting ROC curves. The resulting AUC values are shown in Tables 9 to 11.

Table 9 shows the results of predicting the risk of hypertension.

| Hypertension | Ansan-Anseong | Urban-based | Rural-based | Urban + rural combined |
|---|---|---|---|---|
| Statistical technique | 0.855 | 0.605 | 0.654 | 0.616 |
| Machine learning: support vector machine (SVM) | 0.982 | 0.547 | 0.73 | 0.562 |
| Machine learning: RNN | 0.853 | 0.534 | 0.552 | 0.508 |
| Machine learning: random forest | 0.988 | 0.683 | 0.623 | 0.664 |

Table 10 shows the results of predicting the risk of diabetes.

| Diabetes | Ansan-Anseong | Urban-based | Rural-based | Urban + rural combined |
|---|---|---|---|---|
| Statistical technique | 0.881 | 0.791 | 0.804 | 0.816 |
| Machine learning: support vector machine (SVM) | 0.993 | 0.692 | 0.704 | 0.711 |
| Machine learning: RNN | 0.853 | 0.624 | 0.640 | 0.605 |
| Machine learning: random forest | 0.998 | 0.880 | 0.812 | 0.869 |

Table 11 shows the results of predicting the risk of chronic kidney disease.

| Chronic kidney disease | Ansan-Anseong | Urban-based | Rural-based | Urban + rural combined |
|---|---|---|---|---|
| Statistical technique | 0.748 | 0.742 | 0.860 | 0.804 |
| Machine learning: support vector machine (SVM) | 0.996 | 0.512 | 0.668 | 0.589 |
| Machine learning: RNN | 0.735 | 0.641 | 0.666 | 0.594 |
| Machine learning: random forest | 0.998 | 0.803 | 0.851 | 0.824 |
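
The AUC values in Tables 9 to 11 summarize ROC curves. As a minimal sketch of what such a value measures, the AUC can be computed as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are hypothetical; the source does not state which software produced its AUC values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs ranked correctly."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(cases) * len(controls))

# Toy data: 1 = developed the disease, 0 = did not; scores are predicted risks.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(labels, scores))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is the scale on which the table entries should be read.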

FIG. 13 compares the predicted risk of hypertension across the statistical and machine learning models according to an embodiment of the present application, FIG. 14 compares the predicted risk of diabetes across the same models, and FIG. 15 compares the predicted risk of chronic kidney disease across the same models.

After comparing the models, the disease occurrence prediction model generation unit 15 may select the random forest method, which best predicts the risk of hypertension, diabetes, and their comorbidity, chronic kidney disease, as the final model.
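
The choice of the final model can be sketched as picking, for each disease, the method with the highest AUC. The values below are copied from the "urban + rural combined" column of Tables 9 to 11. Using that column as the selection criterion is an assumption made for illustration, since the text does not say which cohort drove the choice, and the dictionary layout is hypothetical.

```python
# AUC values from the "urban + rural combined" column of Tables 9 to 11.
auc_combined = {
    "hypertension": {
        "statistical": 0.616, "SVM": 0.562, "RNN": 0.508, "random forest": 0.664,
    },
    "diabetes": {
        "statistical": 0.816, "SVM": 0.711, "RNN": 0.605, "random forest": 0.869,
    },
    "chronic kidney disease": {
        "statistical": 0.804, "SVM": 0.589, "RNN": 0.594, "random forest": 0.824,
    },
}

# For each disease, keep the method whose AUC is largest.
best = {disease: max(aucs, key=aucs.get) for disease, aucs in auc_combined.items()}
for disease, method in best.items():
    print(f"{disease}: {method} (AUC {auc_combined[disease][method]:.3f})")
```

On the combined cohort, random forest has the highest AUC for all three diseases, which is consistent with its selection as the final model.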

According to an embodiment of the present application, the chronic kidney disease occurrence prediction apparatus 10 includes genomic markers for hypertension and diabetes, which are common diseases among Koreans. Unlike existing models, it can therefore precisely predict the risk of these diseases, of their comorbidity, and of the chronic kidney disease that they cause. In particular, unlike in previous studies, these genomic markers were selected using extended genetic information based on the latest version, 1K genome 3.0.

In addition, because the model includes the current disease status of members of the general population (normal, pre-disease, or diseased), the chronic kidney disease occurrence prediction apparatus 10 accounts for diseased and non-optimal states that may carry a higher risk of disease occurrence in the general population. Disease prevention and management plans can therefore be proposed and applied to existing patients and to people in a non-optimal state as well.

In addition, whereas other existing models build a disease prediction model that targets the occurrence or prevalence of a single disease, the disease occurrence prediction model built by the chronic kidney disease occurrence prediction apparatus 10 is distinctive in that it is an integrated model that can identify four disease states at once, taking into account the biological and pathological mechanisms and the natural history of the diseases.

In addition, unlike existing models, the disease occurrence prediction model built by the chronic kidney disease occurrence prediction apparatus 10 was chosen by training several candidate models and selecting the one with the best predictive power. The current final model has very high predictive power: at least 80% and up to 91%, depending on the disease.

When the variables of the final model were assigned, the model was built sequentially under a biological algorithm that considers the timing of exposure to each factor, the temporal continuity of states that may change afterwards, and the natural history and continuity of the diseases. It can therefore be regarded as a model that can predict the biological changes over time that occur in real human populations.

According to another embodiment of the present application, the chronic kidney disease occurrence prediction apparatus 10 may generate a genetic information machine learning model that takes as input the genetic information of patients with chronic kidney disease and their disease risk, and learns the degree of the relationship between the genetic information and the disease risk of chronic kidney disease. The apparatus 10 may use the genetic information machine learning model to select key genetic information from the genetic information. The apparatus 10 may also generate a disease risk machine learning model that takes as input a plurality of state variables of the patients (including lifestyle variables and health status variables), the key genetic information, and the disease risk of chronic kidney disease, and learns the degree of the relationship between at least one of the state variables and the key genetic information on the one hand and the disease risk on the other. The apparatus 10 may receive a subject's state variables and genetic information, and may predict the subject's disease risk by applying them to the disease risk machine learning model.

In addition, the chronic kidney disease occurrence prediction apparatus 10 may receive, as inputs, the genetic information of patients with chronic kidney disease and the disease risk of chronic kidney disease, and generate a genetic information statistical probability model that expresses the disease risk probabilistically according to the presence, absence, or value of each item of genetic information. The apparatus 10 may select core genetic information from the genetic information using both the genetic information statistical probability model and the genetic information machine learning model.
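As an illustration of the statistical part of this selection, the following minimal sketch estimates disease risk by marker presence and keeps markers whose presence is associated with a clearly higher risk. The function names, the record layout, and the `min_diff` threshold are assumptions made for illustration; the source does not specify how the statistical probability model is computed, and the machine learning model that is also used for selection is omitted here.

```python
def risk_by_presence(records, marker):
    """Return (risk when marker present, risk when marker absent).

    records: list of (set_of_marker_ids, diseased_flag) pairs (hypothetical layout).
    A group with no records is given a risk of 0.0.
    """
    present = [diseased for markers, diseased in records if marker in markers]
    absent = [diseased for markers, diseased in records if marker not in markers]

    def rate(flags):
        return sum(flags) / len(flags) if flags else 0.0

    return rate(present), rate(absent)


def select_key_markers(records, markers, min_diff=0.1):
    """Keep markers whose presence raises the empirical risk by at least min_diff."""
    selected = []
    for marker in markers:
        p_present, p_absent = risk_by_presence(records, marker)
        if p_present - p_absent >= min_diff:
            selected.append(marker)
    return selected
```

In practice the threshold would be replaced by a proper significance test and the result combined with the learned model, which this sketch does not attempt.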

In addition, the chronic kidney disease occurrence prediction apparatus 10 may receive, as inputs, a plurality of state variables and genetic information of patients with chronic kidney disease and the disease risk of chronic kidney disease, and generate a statistical probability model that expresses the disease risk probabilistically according to the presence, absence, or value of at least one of the state variables and the genetic information. The apparatus 10 may then predict a subject's disease risk by applying the subject's state variables and genetic information to the disease risk machine learning model and the genetic information statistical probability model.

In addition, the chronic kidney disease occurrence prediction apparatus 10 may receive, as inputs, a plurality of state variables and genetic information of patients with chronic kidney disease and the disease risk of chronic kidney disease, select from the plurality of state variables at least one state variable associated with chronic kidney disease, and generate a base statistical probability model that expresses the disease risk probabilistically according to the presence or value of the selected state variable or variables. The apparatus 10 may then generate the statistical probability model from the base statistical probability model by applying a weight to the disease risk according to whether genetic information associated with chronic kidney disease is present.
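The two-step construction described above (a base model from state variables, then a genetic weight) can be sketched as follows. This is only one plausible reading: the logistic form of the base model, the multiplicative odds weighting, and all names and coefficients are assumptions, since the source does not state how the weight is applied.

```python
import math


def base_risk(state, coef, intercept):
    """Base statistical probability model (assumed logistic) over selected state variables."""
    score = intercept + sum(coef[name] * state.get(name, 0.0) for name in coef)
    return 1.0 / (1.0 + math.exp(-score))


def weighted_risk(p_base, markers_present, odds_weight):
    """Scale the base odds by a hypothetical weight for each marker that is present."""
    odds = p_base / (1.0 - p_base)
    for marker, weight in odds_weight.items():
        if markers_present.get(marker, False):
            odds *= weight
    return odds / (1.0 + odds)
```

Working on the odds scale keeps the weighted result between 0 and 1 regardless of how many markers are present, which a direct multiplication of probabilities would not guarantee.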

In addition, the genetic information machine learning model of the chronic kidney disease occurrence prediction apparatus 10 may learn in two stages. In a first learning, a first state variable among the plurality of state variables serves as the input layer and a second state variable among the plurality of state variables serves as the hidden layer, and the model learns the degree of relationship between the input layer and the hidden layer. In a second learning, the hidden layer and the genetic information serve as the input layer and the disease risk serves as the output layer, and the model learns the degree of relationship between the hidden layer and the output layer. In this way the model learns the degree of relationship between the disease risk of chronic kidney disease and at least one of the plurality of state variables and the genetic information.

Alternatively, the genetic information machine learning model of the chronic kidney disease occurrence prediction apparatus 10 may use the state variables of a previous time point as the input layer and the state variables of the current time point as the hidden layer in the first learning, which learns the degree of relationship between the input layer and the hidden layer. In the second learning, the hidden layer and the genetic information serve as the input layer and the disease risk serves as the output layer, and the model learns the degree of relationship between the hidden layer and the output layer. In this way the model learns the degree of relationship between the disease risk of chronic kidney disease and at least one of the plurality of state variables and the genetic information.
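The layered structure described above can be illustrated with a forward pass only. The sketch below assumes a tanh hidden layer and a logistic output; the equations that define these mappings are not reproduced in this text, so the activation functions, the weight names, and the function names are all hypothetical, and no training step is shown.

```python
import math


def hidden_layer(x_prev, w_in):
    """First stage: map previous-time state variables to a hidden representation.

    w_in is a list of weight rows, one row per hidden unit (assumed tanh activation).
    """
    return [math.tanh(sum(w * x for w, x in zip(row, x_prev))) for row in w_in]


def predict_risk(hidden, genes, w_hidden, w_gene, bias=0.0):
    """Second stage: combine the hidden layer and genetic information into a risk.

    w_gene plays the role of the weight linking genetic information to the output
    (assumed logistic output).
    """
    score = bias
    score += sum(w * h for w, h in zip(w_hidden, hidden))
    score += sum(w * z for w, z in zip(w_gene, genes))
    return 1.0 / (1.0 + math.exp(-score))
```

A call such as `predict_risk(hidden_layer(x_prev, w_in), genes, w_hidden, w_gene)` chains the two stages, with the genetic information entering only at the second stage as the text describes.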

The operation flow of the present application is briefly described below, based on the details given above.

FIG. 16 is an operation flowchart of a method for predicting the occurrence of chronic kidney disease according to an embodiment of the present application.

The method shown in FIG. 16 may be performed by the chronic kidney disease occurrence prediction apparatus 10 described above. Accordingly, even where details are omitted below, the description given for the apparatus 10 applies equally to the method.

In step S161, the chronic kidney disease occurrence prediction apparatus 10 may select genomic markers associated with a plurality of diseases.

In step S162, the chronic kidney disease occurrence prediction apparatus 10 may calculate an integrated genomic index using the selected genomic markers associated with the plurality of diseases.

In step S163, the chronic kidney disease occurrence prediction apparatus 10 may derive factors associated with the plurality of diseases and construct an integrated risk factor model.

In step S164, the chronic kidney disease occurrence prediction apparatus 10 may construct a disease occurrence prediction model using the integrated genomic index and the integrated risk factor model as inputs.

In step S165, the chronic kidney disease occurrence prediction apparatus 10 may predict the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.

In the above description, steps S161 to S165 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present application. Some steps may also be omitted as needed, and the order of the steps may be changed.
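The flow of steps S161 to S165 can be sketched end to end as below. Every concrete choice here is an assumption made for illustration: the p-value cutoff for marker selection, the effect-weighted allele count used as the integrated genomic index, the linear risk factor score, and the logistic combination are not specified in this section, and all names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Marker:
    snp_id: str
    effect: float   # hypothetical per-allele effect size
    p_value: float  # hypothetical association p-value


def select_markers(candidates, alpha=5e-8):
    """S161: keep markers whose association p-value is below alpha."""
    return [m for m in candidates if m.p_value < alpha]


def genomic_index(selected, allele_counts):
    """S162: effect-weighted sum of risk-allele counts (0, 1 or 2 per marker)."""
    return sum(m.effect * allele_counts.get(m.snp_id, 0) for m in selected)


def risk_factor_score(factors, coef):
    """S163: linear score over the derived risk factors."""
    return sum(coef[name] * factors.get(name, 0.0) for name in coef)


def predict_ckd(index, score, w_index=1.0, w_score=1.0, intercept=0.0):
    """S164 and S165: combine both inputs in a logistic model and return a risk."""
    return 1.0 / (1.0 + math.exp(-(intercept + w_index * index + w_score * score)))
```

Keeping each step as a separate function mirrors the statement above that steps may be split, merged, or reordered in a given implementation.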

The method for predicting the occurrence of chronic kidney disease according to an embodiment of the present application may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The method for predicting the occurrence of chronic kidney disease described above may also be implemented in the form of a computer program or application that is stored on a recording medium and executed by a computer.

The foregoing description of the present application is provided for illustration, and a person of ordinary skill in the art to which the present application pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present application is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present application.

10: Chronic kidney disease occurrence prediction apparatus
11: Genomic marker selection unit
12: Integrated genomic index calculation unit
13: Integrated risk factor construction unit
14: Explanatory variable derivation unit
15: Disease occurrence prediction model generation unit
16: Disease prediction unit
20: Disease prediction server

Claims

An apparatus for predicting the occurrence of chronic kidney disease, comprising:
a genomic marker selection unit that selects genomic markers associated with a plurality of diseases;
an integrated genomic index calculation unit that calculates an integrated genomic index using the genomic markers associated with the plurality of diseases selected by the genomic marker selection unit;
an integrated risk factor construction unit that derives factors associated with the plurality of diseases and constructs an integrated risk factor model;
a disease occurrence prediction model generation unit that constructs a disease occurrence prediction model using the integrated genomic index and the integrated risk factor model as inputs; and
a disease prediction unit that predicts the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.

The apparatus of claim 1,
wherein the genomic marker selection unit
selects a first genomic marker associated with a first disease and verifies the selected first genomic marker in each of an urban cohort and a rural cohort to determine the first genomic marker, and selects a second genomic marker associated with a second disease and verifies the selected second genomic marker in each of the urban cohort and the rural cohort to determine the second genomic marker.

The apparatus of claim 2,
wherein the genomic marker selection unit
selects, based on a regression analysis algorithm, single nucleotide polymorphism (SNP) markers associated with the plurality of diseases, and selects core genetic information in consideration of at least one of the SNP markers, the first genomic marker, and the second genomic marker.

The apparatus of claim 1,
wherein the integrated risk factor construction unit
derives the factors associated with the plurality of diseases and constructs the integrated risk factor model in consideration of at least one of basic demographic factors, sociological factors, disease and family history factors, environmental and behavioral factors, obesity-related index factors, hematological abnormality index factors, and underlying disease morbidity factors.

The apparatus of claim 1,
further comprising an explanatory variable derivation unit that derives, as explanatory variables, data of subjects having at least one of the plurality of diseases.

The apparatus of claim 3,
wherein the disease occurrence prediction model generation unit
receives, as inputs, a plurality of state variables including lifestyle state variables and health state variables of patients with chronic kidney disease, the core genetic information, and the disease risk of chronic kidney disease, and generates a disease occurrence prediction model that learns the degree of relationship between the disease risk of chronic kidney disease and at least one of the plurality of state variables and the core genetic information.

The apparatus of claim 6,
wherein the disease occurrence prediction model
performs a first learning that learns the degree of relationship between an input layer and a hidden layer, where a first state variable among the plurality of state variables and the hidden layer of the previous time point serve as the input layer, and a second state variable or a current-time state variable among the plurality of state variables serves as the hidden layer, and
performs a second learning that learns the degree of relationship between the hidden layer and an output layer, where the hidden layer and the genetic information serve as the input layer and the disease risk serves as the output layer, thereby learning the degree of relationship between the disease risk of chronic kidney disease and at least one of the plurality of state variables and the genetic information,
wherein the first learning learns the degree of relationship between the input layer and the hidden layer based on [Equation 1],
[Equation 1]
(formula not reproduced in this text)
where the symbols of [Equation 1] denote, in order, the hidden layer at time t, the hidden layer at the previous time point, and the first state variable at time t.

The apparatus of claim 7,
wherein the second learning learns the degree of relationship between the hidden layer and the output layer based on [Equation 1] and [Equation 2],
[Equation 2]
(formula not reproduced in this text)
where y is the output layer, z is the genetic information in the input layer, and the remaining symbols of [Equation 2] denote, in order, the hidden layer at time t and a fourth weight indicating the degree of relationship between the genetic information in the input layer and the output layer.

The apparatus of claim 8,
wherein the disease occurrence prediction model generation unit
updates the weights, based on [Equation 3], according to the error that occurs when generating the machine learning model that learns the degree of relationship between the disease risk of chronic kidney disease and at least one of the plurality of state variables and the genetic information,
[Equation 3]
(formula not reproduced in this text)
where E is the detected error value of the disease occurrence prediction model generation unit, t indicates whether chronic kidney disease has occurred, y is the disease risk predicted by the machine learning model, and the remaining term of [Equation 3] is an L2 regularization term for preventing overfitting.

The apparatus of claim 1,
wherein the disease prediction unit
provides disease prevention and management information associated with the result of predicting the occurrence of chronic kidney disease in the subject.

A method for predicting the occurrence of chronic kidney disease, comprising:
selecting genomic markers associated with a plurality of diseases;
calculating an integrated genomic index using the selected genomic markers associated with the plurality of diseases;
deriving factors associated with the plurality of diseases and constructing an integrated risk factor model;
constructing a disease occurrence prediction model using the integrated genomic index and the integrated risk factor model as inputs; and
predicting the occurrence of chronic kidney disease in a subject by inputting new disease prediction data into the disease occurrence prediction model.