KR101923654B1

KR101923654B1 - Method and apparatus for machine-learning of a model predicting probability of outbreak of disease

Info

Publication number: KR101923654B1
Application number: KR1020160157476A
Authority: KR
Inventors: 최상훈; 채명훈; 박서진; 이관홍; 민충기
Original assignee: 주식회사 셀바스에이아이
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2018-11-29
Also published as: KR20180058466A

Abstract

The present invention relates to a method and apparatus for training a disease onset probability prediction model. A method for training a disease onset probability prediction model according to an embodiment of the present invention includes: receiving original data comprising a plurality of items from at least one external database; generating, based on the original data and according to a predetermined criterion, processed data in which one consultation or one health checkup is represented as one event; generating a disease prediction model for each of a plurality of events included in the processed data; inputting the processed data into the disease prediction model to compute an output value indicating whether the disease occurs; comparing the output value with a correct answer; and updating the disease prediction model according to the comparison result. By comparing the disease probability predetermined for the processed data with the disease onset probability computed by the disease prediction model, and updating the disease prediction model according to the comparison result, the invention can provide a training method and apparatus that improve the accuracy of the disease onset probability computed by the disease prediction model.

Description

TECHNICAL FIELD

The present invention relates to a method and apparatus for training a disease onset probability prediction model, and more particularly to a disease onset prediction learning method and apparatus that learn events and the correct answers corresponding to those events, thereby generating a disease prediction model capable of computing the disease onset probability for a specific event when that event is input.

In the medical field, disease onset is predicted using only a single factor, or multiple factors are used only in a statistical manner, and there are limits to filtering multiple factors to extract the essential ones. Accordingly, if medical data are used so that factors extracted through machine learning from the multiple factors contained in the medical data are considered in multidimensional form, the disease onset probability can be predicted with much higher accuracy, and furthermore a disease onset prediction model suited to Koreans can be implemented.

[Related Art Document]

Periodontal disease prediction system and method for predicting periodontal disease using the same (Korean Patent Laid-Open Publication No. 10-2016-0083502)

An object of the present invention is to provide a method and apparatus for training a disease onset probability prediction model that can generate a model computing the disease onset probability with high accuracy, by generating processed data that excludes data from which the disease can be directly determined.

Another object of the present invention is to provide a method and apparatus for training a disease onset probability prediction model that can improve the accuracy of the disease onset probability computed by the disease prediction model, by comparing the disease onset probability predetermined for the processed data with the disease onset probability computed by the disease prediction model and updating the disease prediction model according to the comparison result.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the objects described above, a method for training a disease onset probability prediction model according to an embodiment of the present invention includes: receiving original data comprising a plurality of items from at least one external database; generating, based on the original data and according to a predetermined criterion, processed data in which one consultation or one health checkup is represented as one event; generating a disease prediction model for each of a plurality of events included in the processed data; inputting the processed data into the disease prediction model to compute an output value indicating whether the disease occurs; comparing the output value with a correct answer; and updating the disease prediction model according to the comparison result.

According to another feature of the present invention, the processed data may be the one event, expressed as the one consultation or the one health checkup, obtained by processing one or more of sociological data, medical record data including at least the one consultation, and health checkup data including at least the one health checkup.

According to another feature of the present invention, the processed data may contain events of diseased subjects and events of non-diseased subjects in equal proportion.

According to another feature of the present invention, the correct answer may indicate whether the disease has occurred as 0 or 1.

According to yet another feature of the present invention, the correct answer may be set to 1 from the time a principal diagnosis is assigned onward.

According to yet another feature of the present invention, the correct answer may be set to 1 only at the time of onset.

According to yet another feature of the present invention, the correct answer may be set to a value from 0 to 1 inclusive, starting immediately before the time of onset.
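
The labeling schemes described above can be illustrated with a short Python sketch. The function names, the representation of a subject's history as an indexed sequence of events, the linear shape of the ramp, and the choice to keep the label at 1 after onset in the third scheme are all assumptions made for illustration; the specification states only the value ranges.

```python
from typing import List


def labels_after_onset(num_events: int, onset_index: int) -> List[float]:
    """Label is 1 from the event where the principal diagnosis is assigned onward."""
    return [1.0 if i >= onset_index else 0.0 for i in range(num_events)]


def labels_at_onset_only(num_events: int, onset_index: int) -> List[float]:
    """Label is 1 only at the onset event and 0 everywhere else."""
    return [1.0 if i == onset_index else 0.0 for i in range(num_events)]


def labels_ramp_before_onset(num_events: int, onset_index: int, ramp: int) -> List[float]:
    """Label takes values in [0, 1], rising over the `ramp` events before onset.

    Assumptions: the rise is linear, and the label stays at 1 from onset onward.
    """
    labels = []
    for i in range(num_events):
        if i >= onset_index:
            labels.append(1.0)
        elif i >= onset_index - ramp:
            # Position inside the ramp, counted from 1 up to `ramp`.
            step = i - (onset_index - ramp) + 1
            labels.append(step / (ramp + 1))
        else:
            labels.append(0.0)
    return labels
```

A soft label of this kind lets events shortly before onset contribute a partial positive signal, which is one plausible reading of a correct answer that lies between 0 and 1.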

According to yet another feature of the present invention, the step of generating the processed data may include: listing the medication classification codes and medication dosages among the plurality of items included in an event; determining whether a medication classification code is associated with the disease to be predicted; and, when the medication classification code is associated with the disease, deleting that medication classification code and the corresponding medication dosage.
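
As a minimal sketch of the medication filtering step, the following Python function removes medication entries whose classification code is associated with the target disease. The event layout (a dict with a `medications` list of `code` and `dose` entries) and the example codes are hypothetical; the specification does not define a data structure or list any codes.

```python
from typing import Dict, List, Set

# Hypothetical classification codes assumed to be associated with the target disease.
RELATED_DRUG_CODES: Set[str] = {"C01", "C07"}


def drop_related_medications(event: Dict, related_codes: Set[str]) -> Dict:
    """Return a copy of the event without medications tied to the target disease.

    Both the classification code and its dosage are removed together,
    because they are stored in the same medication entry.
    """
    kept: List[Dict] = [
        med for med in event.get("medications", [])
        if med["code"] not in related_codes
    ]
    cleaned = dict(event)  # shallow copy so the original event is not modified
    cleaned["medications"] = kept
    return cleaned
```

Removing such entries keeps the model from learning a shortcut in which a prescription that already presupposes the diagnosis reveals the answer.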

According to yet another feature of the present invention, the method for training a disease onset probability prediction model may further include generating additional processed data having a partial time interval selected within the time interval of the processed data.

According to yet another feature of the present invention, when the processed data is data on a diseased subject, the step of generating the additional processed data may be a step of generating, as the additional processed data, the processed data corresponding to a time interval selected within a time interval that includes the time of disease onset.

According to yet another feature of the present invention, when the processed data is data on a non-diseased subject, the step of generating the additional processed data may be a step of generating, as the additional processed data, the processed data corresponding to a selected time interval.
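
The generation of additional processed data can be sketched as sampling a contiguous sub-sequence of a subject's events. The function name and the use of event indices in place of time stamps are assumptions. The sketch also assumes that, for a diseased subject, the selected window itself must contain the onset event; the text says only that the window is selected within an interval that includes the onset time.

```python
import random
from typing import List, Optional


def sample_sub_sequence(events: List[dict], length: int,
                        onset_index: Optional[int] = None,
                        rng: Optional[random.Random] = None) -> List[dict]:
    """Pick a contiguous window of `length` events from one subject's history.

    onset_index=None means a non-diseased subject: any window is allowed.
    Otherwise the window is constrained to contain the onset event.
    """
    if length < 1 or length > len(events):
        raise ValueError("length must be between 1 and len(events)")
    rng = rng or random.Random()
    if onset_index is None:
        lowest, highest = 0, len(events) - length
    else:
        # A window [start, start + length) contains onset_index exactly when
        # onset_index - length + 1 <= start <= onset_index.
        lowest = max(0, onset_index - length + 1)
        highest = min(onset_index, len(events) - length)
    start = rng.randint(lowest, highest)  # randint is inclusive on both ends
    return events[start:start + length]
```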

According to yet another feature of the present invention, the step of generating the processed data may include filtering the medical record data so as to exclude, from the medical record data included in the original data, data from which the disease can be directly identified.

According to yet another feature of the present invention, the step of generating the processed data may further include correcting the average event length of diseased subjects and non-diseased subjects.

According to yet another feature of the present invention, the step of generating the processed data may include extracting the unit of each of the plurality of items and converting each unit into the unit required for the processed data.
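
The unit conversion step might look like the following Python sketch. The conversion table, the item names, and the `(value, unit)` tuple layout are hypothetical and serve only to illustrate extracting each item's unit and converting it to the unit required for the processed data.

```python
from typing import Dict, Tuple

# Hypothetical conversion factors: (source unit, target unit) -> multiplier.
UNIT_FACTORS: Dict[Tuple[str, str], float] = {
    ("m", "cm"): 100.0,
    ("g", "kg"): 0.001,
}


def convert_value(value: float, source_unit: str, target_unit: str) -> float:
    """Convert one value between units using the factor table above."""
    if source_unit == target_unit:
        return value
    if (source_unit, target_unit) not in UNIT_FACTORS:
        raise ValueError(f"no conversion from {source_unit} to {target_unit}")
    return value * UNIT_FACTORS[(source_unit, target_unit)]


def normalise_units(items: Dict[str, Tuple[float, str]],
                    target_units: Dict[str, str]) -> Dict[str, Tuple[float, str]]:
    """Convert every item of an event to the unit required for the processed data."""
    return {
        name: (convert_value(value, unit, target_units[name]), target_units[name])
        for name, (value, unit) in items.items()
    }
```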

According to yet another feature of the present invention, the step of generating the processed data may include calculating the mean and standard deviation of the values of each of the plurality of items, and converting the values into z-scores using the mean and standard deviation.
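
The z-score conversion is the standard transformation $z = (x - \mu) / \sigma$ applied per item. The sketch below uses only the Python standard library. The row layout (one dict per event, keyed by item name), the use of the population standard deviation, and the handling of constant items are assumptions, since the specification does not state them.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore_items(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Replace every item value with its z-score computed over all rows."""
    names = list(rows[0].keys())
    stats = {}
    for name in names:
        values = [row[name] for row in rows]
        stats[name] = (mean(values), pstdev(values))  # population std (assumption)

    normalised = []
    for row in rows:
        out = {}
        for name in names:
            mu, sigma = stats[name]
            # A constant item has sigma == 0; map it to 0.0 instead of dividing by zero.
            out[name] = (row[name] - mu) / sigma if sigma > 0 else 0.0
        normalised.append(out)
    return normalised
```

Putting items such as blood pressure and body mass index on a common scale in this way keeps items with large numeric ranges from dominating the model input.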

To achieve the objects described above, an apparatus for training a disease onset probability prediction model according to an embodiment of the present invention includes: a communication unit configured to receive original data comprising a plurality of items from at least one external database; a processor configured to generate, based on the original data and according to a predetermined criterion, processed data in which one consultation or one health checkup is represented as one event; and a storage unit that stores the original data and the processed data. The processor is configured to generate a disease prediction model for each of a plurality of events included in the processed data, input the processed data into the disease prediction model to compute an output value indicating whether the disease occurs, compare the output value with a correct answer, and update the disease prediction model according to the comparison result.

According to another feature of the present invention, the processor may be configured to list the medication classification codes and medication dosages among the plurality of items included in an event, determine whether a medication classification code is associated with the disease to be predicted, and, when the medication classification code is associated with the disease, delete the medication classification code and the dosage data.

According to yet another feature of the present invention, the processor may be configured to generate additional processed data having a partial time interval selected within the time interval of the processed data.

According to yet another feature of the present invention, the processor may be configured to filter the medical record data so as to exclude, from the medical record data included in the original data, data from which the disease can be directly identified.

According to yet another feature of the present invention, the processor may be configured to correct the average event length of diseased subjects and non-diseased subjects.

Specific details of other embodiments are included in the detailed description and the drawings.

The present invention has the effect of providing a method and apparatus for training a disease onset probability prediction model that can generate a model computing the disease onset probability with high accuracy, by generating processed data that excludes data from which disease onset can be directly determined.

The present invention also has the effect of providing a method and apparatus for training a disease onset probability prediction model that can improve the accuracy of the disease onset probability computed by the disease prediction model, by comparing the disease onset probability predetermined for the processed data with the disease onset probability computed by the disease prediction model and updating the disease prediction model according to the comparison result.

The effects of the present invention are not limited to those exemplified above, and further effects are included in this specification.

FIG. 1 is a schematic diagram for explaining a method for training a disease onset probability prediction model according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a schematic configuration of an apparatus for training a disease onset probability prediction model according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a procedure for generating and updating a disease prediction model in accordance with the method for training a disease onset probability prediction model according to an embodiment of the present invention.
FIGS. 4A to 4C are graphs showing correct answers according to an embodiment of the present invention.
FIGS. 5A to 5C are schematic diagrams showing processed data tables from which data associated with cardiovascular disease has been excluded according to an embodiment of the present invention.
FIG. 6A is a schematic diagram showing additional processed data for a diseased subject according to an embodiment of the present invention.
FIG. 6B is a schematic diagram showing additional processed data for a non-diseased subject according to an embodiment of the present invention.
FIG. 6C is a schematic diagram showing a composition table of additional processed data for diseased subjects and additional processed data for non-diseased subjects according to an embodiment of the present invention.
FIG. 6D is a schematic diagram showing a table of performance data for the case where the disease prediction model is trained with the processed data only and the case where it is trained with the processed data and the additional processed data.
FIGS. 7A and 7B are schematic diagrams showing processed data tables in which the values of a plurality of items have been normalized before input according to an embodiment of the present invention.
FIGS. 8A and 8B are schematic diagrams showing processed data tables in which the values of a plurality of items have been converted into defined units before input according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention; the present invention is defined only by the scope of the claims.

The shapes, sizes, ratios, angles, numbers, and the like shown in the drawings for describing embodiments of the present invention are illustrative, and the present invention is not limited to what is shown. In describing the present invention, detailed descriptions of related known art are omitted where they are judged to unnecessarily obscure the gist of the invention. Where terms such as 'includes', 'has', and 'consists of' are used in this specification, other parts may be added unless 'only' is used. A component expressed in the singular includes the plural unless explicitly stated otherwise.

In interpreting a component, it is construed as including a margin of error even where there is no separate explicit statement.

Although terms such as first and second are used to describe various components, the components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may be a second component within the technical idea of the present invention.

Unless otherwise specified, like reference numerals refer to like components throughout the specification.

The features of the various embodiments of the present invention may be combined with one another in part or in whole and, as those skilled in the art will fully appreciate, may interoperate and be operated in technically various ways; the embodiments may be practiced independently of one another or together in association.

FIG. 1 is a schematic diagram for explaining a method for training a disease onset probability prediction model according to an embodiment of the present invention.

Referring to FIG. 1, the disease onset prediction learning system 1000 is a system that generates a disease prediction model 200 for processed data 100 and updates the disease prediction model 200 according to a result 500 of comparing an output value 300 computed by the disease prediction model 200 with a predetermined correct answer 400.

The processed data 100 is data obtained by processing original data received from an external database; the original data is consolidated according to a predetermined criterion and processed so as to contain one event. The processed data 100 includes at least one event. An event is defined as a medical activity associated with the probability of cardiovascular disease onset. For example, an event may be defined as a consultation at a hospital, a prescription, or a health checkup. One event may include a consultation and a prescription on the same date. The number of processed data 100 and the number of events included in the processed data 100 are not limited. Generating the processed data 100 can address the degradation in training performance of the disease prediction model 200 that arises as the event length increases.
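
One way to picture the consolidation of original data into events is to group raw records by subject and date, so that a consultation and a prescription on the same date fall into one event. This is a sketch only: the field names `patient_id`, `date`, and `kind` are hypothetical, and grouping by calendar date is merely one possible "predetermined criterion".

```python
from collections import defaultdict
from typing import Dict, List


def build_events(records: List[dict]) -> Dict[str, List[dict]]:
    """Group raw records into per-subject, date-ordered events.

    Records that share a subject and a date are merged into one event.
    Dates are ISO strings (YYYY-MM-DD), so sorting them as text is chronological.
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["patient_id"], record["date"])].append(record)

    events: Dict[str, List[dict]] = defaultdict(list)
    for patient_id, date in sorted(grouped):
        events[patient_id].append({"date": date, "records": grouped[(patient_id, date)]})
    return dict(events)
```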

The disease prediction model 200 is a model that processes input data to compute an output value 300. Here, the input data is the processed data 100 and the output value 300 indicates whether the disease occurs. The disease prediction model 200 may receive a plurality of processed data 100 and compute a respective output value 300 for each of them. Furthermore, the disease prediction model 200 may process a plurality of processed data 100 to compute a single output value 300 for the plurality of processed data 100.

The output value 300 is a value indicating whether the disease occurs and is computed by the disease prediction model 200. The output value 300 may be a plurality of output values 300, one for each of a plurality of processed data 100, or a single output value 300 for the plurality of processed data 100. When the disease is likely to occur, the output value 300 is close to 1; when the disease is not likely to occur, the output value 300 is close to 0. Furthermore, there may be more than one output value 300 for a given event. For example, the output value 300 for a given event may consist of two outputs, the probability that the disease occurs and the probability that it does not, which together have a value of 1. The output value 300 for a given event may also consist of a plurality of outputs, one for each probability of a subdivided disease.
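
A two-output form whose values sum to 1 can be obtained, for example, by applying a softmax to two raw scores. The specification does not name the function that produces the outputs, so the softmax below and the function name are assumptions chosen for illustration.

```python
import math
from typing import Tuple


def two_class_probabilities(score_disease: float, score_no_disease: float) -> Tuple[float, float]:
    """Turn two raw scores into (P(disease), P(no disease)) that sum to 1."""
    # Subtracting the larger score before exponentiating avoids overflow.
    largest = max(score_disease, score_no_disease)
    exp_disease = math.exp(score_disease - largest)
    exp_no_disease = math.exp(score_no_disease - largest)
    total = exp_disease + exp_no_disease
    return exp_disease / total, exp_no_disease / total
```

The same construction extends to more than two scores when one output per subdivided disease is wanted.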

The correct answer 400 is a value predetermined for the processed data 100 and indicates whether the disease occurred for the event included in the processed data 100. The correct answer 400 has a value of 0 or 1: 1 is the value when the disease occurred, and 0 is the value when it did not.

The result 500 is data on the outcome determined by comparing the output value 300 with the correct answer 400. For example, when the output value 300 is a value indicating that the disease will not occur and the correct answer 400 is the value for the disease having occurred, the result 500 includes data indicating that the output value 300 and the correct answer 400 do not match. The disease prediction model 200 may then be updated according to the result 500.
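
The cycle of computing an output value, comparing it with the correct answer, and updating the model can be sketched with a deliberately simple stand-in model. The logistic regression and per-sample gradient update below are assumptions used only to make the cycle concrete; the specification does not restrict the disease prediction model to this form, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> float:
    """Output value: estimated probability that the disease occurs for one event."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def update(weights: Sequence[float], bias: float, features: Sequence[float],
           answer: float, learning_rate: float = 0.1) -> Tuple[List[float], float]:
    """Compare the output with the correct answer and adjust the model."""
    error = predict(weights, bias, features) - answer  # the comparison result
    new_weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
    return new_weights, bias - learning_rate * error


def train(samples: Sequence[Tuple[Sequence[float], float]], epochs: int = 1000,
          learning_rate: float = 0.1) -> Tuple[List[float], float]:
    """Repeat the predict, compare, update cycle over all (features, answer) pairs."""
    weights, bias = [0.0] * len(samples[0][0]), 0.0
    for _ in range(epochs):
        for features, answer in samples:
            weights, bias = update(weights, bias, features, answer, learning_rate)
    return weights, bias
```

Each call to `update` moves the output for that event toward its correct answer, which is the behaviour the comparison and update steps describe.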

Hereinafter, FIG. 2 is also referred to for a more detailed description of the method for training a disease onset probability prediction model as performed in the disease onset probability prediction model training apparatus 600 that implements the disease prediction model.

FIG. 2 is a block diagram showing a schematic configuration of an apparatus for training a disease onset probability prediction model according to an embodiment of the present invention. For convenience of explanation, the description also refers to FIG. 1.

Referring to FIG. 2, the disease onset probability prediction model training apparatus 600 includes a communication unit 610, a processor 620, and a storage unit 630.

The communication unit 610 of the disease onset probability prediction model training apparatus 600 is configured to receive original data comprising a plurality of items from at least one external database. Here, the external data may be data from the health checkup cohort database of the national health insurance corporation or from the consultation and checkup databases of medical institutions. The health checkup cohort database contains data on statements of treatment, treatment details, diagnosis details, prescription details, and the like for all health insurance subscribers and medical aid recipients. The communication unit 610 may also provide the computed cardiovascular disease onset probability to medical institutions, insurers, and individuals.

The processor 620 of the disease onset probability prediction model training apparatus 600 is configured to generate, based on the original data and according to a predetermined criterion, processed data in which one consultation or one health checkup is represented as one event. In doing so, the processor 620 may generate the processed data by deleting some events that could imply cardiovascular disease. The processor 620 may also generate additional processed data having a partial time interval selected within the time interval of the processed data. Accordingly, the processor 620 generates a disease prediction model for each of the plurality of events included not only in the processed data but also in the additional processed data, and inputs the processed data into the disease prediction model to compute an output value. Furthermore, the processor 620 is configured to compare the output value with the correct answer and update the disease prediction model according to the comparison result.

The storage unit 630 of the disease onset probability prediction model training apparatus 600 stores the received data and the generated data. Specifically, the storage unit 630 stores the original data received from the external database and the processed data generated from the original data, and also stores the computed output values. The storage unit 630 may also store the correct answers, which may be received from an external database through the communication unit 610 or entered by a user.

Hereinafter, FIG. 3 is also referred to for a more detailed description of the method for training a disease onset probability prediction model.

FIG. 3 is a flowchart showing a procedure for generating and updating a disease prediction model in accordance with the method for training a disease onset probability prediction model according to an embodiment of the present invention. For convenience of explanation, a method for training a cardiovascular disease onset prediction model is described below. The invention is not limited to this, however, and the training method according to an embodiment of the present invention may be used to train prediction models for the onset of various diseases. Also for convenience of explanation, the description refers to the components and reference numerals of FIG. 2.

The communication unit 610 of the disease onset probability prediction model learning device 600 receives original data including a plurality of items from at least one external database (S310).

Specifically, the communication unit 610 receives one or more of the following kinds of original data: sociological data, medical record data including at least one medical treatment, and health examination data including at least one health examination. Here, the sociological data is health coverage eligibility information on health insurance subscribers and medical aid beneficiaries. It includes sociodemographic information such as sex, age, and region of residence; death-related information including the date and cause of death; the type of health coverage, such as whether the person is enrolled in health insurance or receives medical aid; socioeconomic level, including income quantile and disability registration information; and other information. The medical record data refers to the medical service usage and medical expense records on the medical care benefit cost statement. It includes detailed treatment records such as medical institution usage information, medical care benefit costs, department of treatment, diagnosed illness information, consultations, procedures, surgeries, other benefit items for medical acts, and treatment materials. The specific features of the original data and their field names in the external database are shown in Table 1.

[Table 1]

Feature | Health examination cohort DB field name | Remarks
Time | NHIS_HEALS_HC.HME_DT, NHIS_HEALS_GY.RECU_FR_DT, NHIS_HEALS_GY.DTH_MDY | Difference between the event time and January 1, 2002
Sex | NHIS_HEALS_JK.SEX |
Age | NHIS_HEALS_JK.AGE |
Income quantile | NHIS_HEALS_JK.CTRB_PT_TYPE_CD | Categorical type with 9 features
Disability severity classification | NHIS_HEALS_JK.DFAB_GRD_CD |
Disability type code | NHIS_HEALS_JK.DFAB_PTN_CD |
Examination institution type code | NHIS_HEALS_JK.YKIHO_GUBUN_CD |
Body mass index (BMI) | NHIS_HEALS_HC.BMI |
Waist circumference | NHIS_HEALS_HC.WAIST |
Systolic blood pressure | NHIS_HEALS_HC.BP_HIGH |
Diastolic blood pressure | NHIS_HEALS_HC.BP_LWST |
Fasting blood glucose | NHIS_HEALS_HC.BLDS |
Total cholesterol | NHIS_HEALS_HC.TOT_CHOLE |
Triglyceride | NHIS_HEALS_HC.TRIGLYCERIDE |
HDL cholesterol | NHIS_HEALS_HC.HDL_CHOLE |
LDL cholesterol | NHIS_HEALS_HC.LDL_CHOLE |
Hemoglobin | NHIS_HEALS_HC.HMG |
Urine protein | NHIS_HEALS_HC.OLIG_PROTE_CD |
Serum creatinine | NHIS_HEALS_HC.CREATININE |
Serum GOT (AST) | NHIS_HEALS_HC.SGOT_AST |
Serum GPT (ALT) | NHIS_HEALS_HC.SGPT_ALT |
Gamma-GTP | NHIS_HEALS_HC.GAMMA_GTP |
Family history of liver disease | NHIS_HEALS_HC.FMLY_LIVER_DISE_PATIEN_YN |
Family history of stroke | NHIS_HEALS_HC.FMLY_APOP_PATIEN_YN |
Family history of heart disease | NHIS_HEALS_HC.FMLY_HDISE_PATIEN_YN |
Family history of hypertension | NHIS_HEALS_HC.FMLY_HPRTS_PATIEN_YN |
Family history of diabetes | NHIS_HEALS_HC.FMLY_DIABML_PATIEN_YN |
Family history of cancer | NHIS_HEALS_HC.FMLY_CANCER_PATIEN_YN |
Smoking status | NHIS_HEALS_HC.SMK_STAT_TYPE_RSPS_CD |
Alcohol consumed per occasion | NHIS_HEALS_HC.TM1_DRKQTY_RSPS_CD |
Past history of stroke | NHIS_HEALS_HC.HCHK_APOP_PMH_YN |
Past history of heart disease | NHIS_HEALS_HC.HCHK_HDISE_PMH_YN |
Past history of hypertension | NHIS_HEALS_HC.HCHK_HPRTS_PMH_YN |
Past history of diabetes | NHIS_HEALS_HC.HCHK_DIABML_PMH_YN |
Past history of hyperlipidemia | NHIS_HEALS_HC.HCHK_HPLPDM_PMH_YN |
Past history of pulmonary tuberculosis | NHIS_HEALS_HC.HCHK_PHSS_PMH_YN |
Past history of other diseases (including cancer) | NHIS_HEALS_HC.HCHK_ETCDSE_PMH_YN |
(Past) smoking duration | NHIS_HEALS_HC.PAST_SMK_TERM_RSPS_CD |
(Past) average daily smoking amount | NHIS_HEALS_HC.PAST_DSQTY_RSPS_CD |
(Current) smoking duration | NHIS_HEALS_HC.CUR_SMK_TERM_RSPS_CD |
(Current) average daily smoking amount | NHIS_HEALS_HC.CUR_DSQTY_RSPS_CD |
Vigorous exercise of 20 minutes or more, per week | NHIS_HEALS_HC.MOV20_WEK_FREQ_ID |
Vigorous exercise of 30 minutes or more, per week | NHIS_HEALS_HC.MOV30_WEK_FREQ_ID |
Walking for 30 minutes or more, per week | NHIS_HEALS_HC.WLK30_WEK_FREQ_ID |
Cognitive impairment | NHIS_HEALS_HC.KDSQ_C |
Cognitive function / compared with peers | NHIS_HEALS_HC.KDSQ_C_1 |
Cognitive function / compared with one year ago | NHIS_HEALS_HC.KDSQ_C_2 |
Cognitive function / interference with important tasks | NHIS_HEALS_HC.KDSQ_C_3 |
Cognitive function / others' awareness of the person's symptoms | NHIS_HEALS_HC.KDSQ_C_4 |
Cognitive function / interference with daily life | NHIS_HEALS_HC.KDSQ_C_5 |
Number of exercise sessions per week | NHIS_HEALS_HC.EXERCI_FREQ_RSPS_CD |

Furthermore, from the health examination cohort database among the external databases, the original data uses only data on persons under 80 years of age who have no history of cardiovascular disease or cancer. Because a variety of original data is received, the loss of disease onset prediction accuracy caused by environmental factors that vary by region, culture, and era can be mitigated by methods such as collecting additional data or generating a plurality of region-specific disease prediction models.

Subsequently, the processor 620 generates, based on the original data and according to a predetermined criterion, processed data that represents one medical treatment or one health examination as a single event (S320).

Here, the processed data is a single event, representing one medical treatment or one health examination, obtained by processing one or more of sociological data, medical record data including at least one medical treatment, and health examination data including at least one health examination. For example, the processor 620 classifies items such as the personal serial number, drug classification code, and drug dosage by the date on which care started, that is, by one medical treatment or one health examination, thereby composing them into one event and generating the processed data according to the predetermined criterion. The processed data necessarily includes the correct answer for the event.

The processed data may be generated so that events of patients and events of non-patients appear in equal proportion. The processor 620 also corrects the average length of the events of patients and non-patients. That is, by equalizing the number and length of events for patients and non-patients, the processor 620 gives equal weight to patient and non-patient data and can thereby improve the accuracy of the calculated output value.

At this time, the processor 620 may generate the processed data by filtering the treatment and health examination events so as to exclude, from the medical record data included in the original data, data from which cardiovascular disease can be identified directly. For example, the processor 620 may generate the processed data by filtering out treatment data related to arteriosclerosis, angina pectoris, and myocardial infarction.

In some embodiments, the processor 620 lists the drug classification codes and drug dosages among the plurality of items included in an event, and determines whether each drug classification code is associated with the cardiovascular disease to be predicted. For example, if a listed drug classification code and drug dosage correspond to a drug that dissolves blood clots, the processor 620 determines that the drug classification code is associated with cardiovascular disease. When a drug classification code is associated with cardiovascular disease, the processor 620 deletes that drug classification code and its dosage. Generating the processed data without data that is highly correlated with cardiovascular disease keeps the model from concluding unconditionally that cardiovascular disease has occurred merely because data implying it was input. An embodiment that generates processed data by excluding highly correlated data is described in detail below with reference to FIGS. 5A to 5C.

In various embodiments, the processor 620 extracts the unit of each of the plurality of items. For example, the processor 620 extracts m and kg, the units of height and weight. The processor 620 then converts each unit into the unit defined for the processed data. For example, if the units defined for the processed data are ft and lb, the processor 620 converts the units of the height and weight items from m to ft and from kg to lb. That is, by converting the units of the plurality of items, the processor 620 can unify the units when a single item has been recorded in different units. A processed data table in which the values of the plurality of items have been converted into the defined units is described in detail below with reference to FIGS. 8A and 8B.
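The unit unification described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the item names, the `FACTORS` table, and the helper functions are hypothetical, and only the m-to-ft and kg-to-lb conversions mentioned in the text are covered.

```python
# Conversion factors into the target units (exact by definition of the foot and pound).
M_TO_FT = 1 / 0.3048        # 1 ft = 0.3048 m
KG_TO_LB = 1 / 0.45359237   # 1 lb = 0.45359237 kg

# Hypothetical lookup table: (source unit, target unit) -> multiplier.
FACTORS = {("m", "ft"): M_TO_FT, ("kg", "lb"): KG_TO_LB}


def convert(value, unit, target_unit):
    """Return value expressed in target_unit; unchanged when the units already match."""
    if unit == target_unit:
        return value
    return value * FACTORS[(unit, target_unit)]


def unify_units(event, target_units):
    """event maps item name -> (value, unit); returns item name -> value in the target unit."""
    return {
        item: convert(value, unit, target_units[item])
        for item, (value, unit) in event.items()
    }


# One event with height in m and weight in kg, converted to ft and lb.
event = {"height": (1.75, "m"), "weight": (70.0, "kg")}
unified = unify_units(event, {"height": "ft", "weight": "lb"})
```

Keeping the factors in a single table means that supporting another unit pair only requires one more entry rather than a new code path.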

Meanwhile, in various embodiments, the processor 620 calculates the mean and standard deviation of the data of each of the plurality of items included in the events. The processor 620 then uses the calculated mean and standard deviation to convert the data of each item into z-scores and stores the z-scores as the item data. By converting the data of the plurality of items included in the events into z-scores, the processor 620 can normalize the data of each item. A processed data table in which the values of the plurality of items have been normalized is described in detail below with reference to FIGS. 7A and 7B.
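A minimal sketch of the z-score normalization described above, assuming the population standard deviation is used (the text does not say whether the population or sample form is intended). The function name and the `bp_high` item are illustrative only.

```python
from statistics import mean, pstdev


def zscore_columns(events, items):
    """Normalize each listed item across events: z = (x - mean) / standard deviation."""
    stats = {}
    for item in items:
        values = [e[item] for e in events]
        stats[item] = (mean(values), pstdev(values))

    normalized = []
    for e in events:
        row = dict(e)
        for item in items:
            mu, sigma = stats[item]
            # A constant column has sigma == 0; map it to 0.0 instead of dividing by zero.
            row[item] = 0.0 if sigma == 0 else (e[item] - mu) / sigma
        normalized.append(row)
    return normalized, stats


# Three events with a systolic blood pressure reading each.
events = [{"bp_high": 110.0}, {"bp_high": 120.0}, {"bp_high": 130.0}]
normalized, stats = zscore_columns(events, ["bp_high"])
```

Returning `stats` alongside the normalized rows lets the same mean and standard deviation be reused later on validation or test data.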

In another embodiment, the processor 620 may generate additional processed data covering a selected part of the time interval of the processed data. Specifically, when the processed data is data on a patient, the processor 620 generates, as additional processed data, the processed data that falls within a time interval selected from among intervals that include the time of disease onset. When the processed data is data on a non-patient, the processor 620 generates, as additional processed data, the processed data that falls within the selected time interval. An embodiment of generating additional processed data is described in detail below with reference to FIGS. 6A and 6B.

In addition, of the original data on 510,000 persons received from the health examination cohort database, cardiovascular disease patients account for about 7.9% of the total, so the data is imbalanced, with more negative (non-patient) data than patient data. The processor 620 may therefore take all of the patient data and a randomly selected sample of the non-patient data and split each into a training set, a validation set, and a test set at a 6:2:2 ratio. The validation set is used to decide when to stop training the disease prediction model, and the test set is used to confirm the final performance of the model. If the training set and validation set are used as split at the 6:2 ratio, training is biased toward the non-patient data in both sets, and the model can report high accuracy and a low loss value despite that bias. Therefore, the processor 620 may equalize the proportions of non-patient data and patient data in the training set and validation set by a method such as under-sampling or over-sampling, while keeping the patient proportion in the test set at 7.9%, the proportion in the full sample.
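The splitting and balancing steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the cohort is a toy list of 10,000 records with 790 patients (7.9%), the split is applied per class, under-sampling is chosen over over-sampling, and all function names are hypothetical.

```python
import random


def split_622(records, rng):
    """Shuffle and split records into train/validation/test at a 6:2:2 ratio."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def undersample(records, rng):
    """Randomly drop majority-class records until both classes have the same size."""
    pos = [r for r in records if r["label"] == 1]
    neg = [r for r in records if r["label"] == 0]
    n = min(len(pos), len(neg))
    return rng.sample(pos, n) + rng.sample(neg, n)


rng = random.Random(0)

# Toy cohort: 10,000 subjects, 790 of them patients (7.9%).
cohort = [{"id": i, "label": 1 if i < 790 else 0} for i in range(10000)]
patients = [r for r in cohort if r["label"] == 1]
non_patients = [r for r in cohort if r["label"] == 0]

# Split each class separately so every subset starts with the cohort's patient ratio.
p_train, p_val, p_test = split_622(patients, rng)
n_train, n_val, n_test = split_622(non_patients, rng)

# Balance only the training and validation sets; the test set keeps the 7.9% ratio.
train = undersample(p_train + n_train, rng)
val = undersample(p_val + n_val, rng)
test = p_test + n_test
```

Leaving the test set unbalanced means the reported test performance reflects the class ratio the model would face on the full cohort.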

Subsequently, the processor 620 generates a disease prediction model for each of the plurality of events included in the processed data (S330).

That is, the processor 620 generates a disease prediction model for each of the plurality of events so that output values can be calculated for the plurality of events included in the processed data. The disease prediction model may consist of multiple layers. The disease prediction model generated by the processor 620 may calculate a single output value for the plurality of events included in the processed data, or it may calculate an output value for each of the plurality of events.

Subsequently, the processor 620 inputs the processed data into the disease prediction model and calculates an output value indicating whether the disease has occurred (S340).

Here, the disease prediction model is trained on the input processed data by machine learning and calculates the disease onset probability by applying the parameters determined as a result of the training. The processor 620 may calculate a disease onset probability for each of the plurality of events included in the processed data, or it may calculate a single disease onset probability that integrates the plurality of events. Furthermore, the processor 620 may calculate an onset probability for each type of disease. Specifically, the processor 620 inputs processed data including a plurality of events into the disease prediction model and calculates one output value corresponding to the plurality of events. The processor 620 may also calculate an output value indicating whether the disease has occurred for each of the plurality of events included in the processed data. The output value is close to 1 when the disease is likely to occur and close to 0 when the disease is unlikely to occur.

Subsequently, the processor 620 compares the output value with the correct answer (S350).

The correct answer is a value of 0 or 1 indicating whether the disease has occurred. That is, the correct answer is not a value calculated by the disease prediction model but the actual disease status for the event, and it serves as the reference for checking whether the disease prediction model calculated the output value correctly. The correct answer may be set to 1 from the time the primary diagnosis is assigned onward, or only at the time of onset. The correct answer may also be set to a value between 0 and 1 inclusive starting shortly before the time of onset. The processor 620 thus compares the predetermined correct answer with the output value calculated by the disease prediction model. The types of correct answer are described in detail below with reference to FIGS. 4A to 4C.

Subsequently, the processor 620 updates the disease prediction model according to the result of the comparison (S360).

For example, when the output value differs from the correct answer, the processor 620 updates the disease prediction model so that it can produce an output value equal to the correct answer. When the output value equals the correct answer, the processor 620 may update the disease prediction model so that it continues to produce an output value equal to the correct answer.
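The compare-and-update cycle of steps S340 to S360 can be illustrated with a deliberately simple stand-in model. The sketch below uses a single-layer logistic model trained by stochastic gradient descent on cross-entropy loss; this is an assumption made for illustration, since the text does not fix the update rule and describes a multi-layer model. The feature values, learning rate, and function names are hypothetical.

```python
import math


def predict(weights, bias, features):
    """Output value in (0, 1): closer to 1 means onset is more likely."""
    s = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-s))


def update(weights, bias, features, answer, lr=0.5):
    """Compare the output with the correct answer and move the parameters toward it."""
    output = predict(weights, bias, features)
    error = output - answer  # gradient of the cross-entropy loss w.r.t. the pre-activation
    new_weights = [w - lr * error * x for w, x in zip(weights, features)]
    return new_weights, bias - lr * error


# Toy events: (features, correct answer). Values are made up for illustration.
data = [([1.0, 0.2], 1), ([0.1, 0.9], 0), ([0.8, 0.4], 1), ([0.2, 0.7], 0)]

weights, bias = [0.0, 0.0], 0.0
for _ in range(200):                      # repeated passes over the training events
    for features, answer in data:
        weights, bias = update(weights, bias, features, answer)

outputs = [predict(weights, bias, f) for f, _ in data]
```

Note that the update still runs when the output is already on the correct side: the error term is small but nonzero, which matches the idea of updating the model so that it keeps producing the correct answer.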

In various embodiments, the processor 620 may calculate a plurality of other output values before the output value 300 for one or more events is calculated by the disease prediction model 200. Specifically, the processor 620 may denote the internal calculation values that the disease prediction model 200 computes for T events as o = (o_1, o_2, ..., o_T), and may convert the value o_T of the last event into a value z. The processor 620 then converts the value z to calculate the output value 300. In doing so, the processor 620 calculates z by converting o_T so that its size matches the number of outputs in the output value 300. The processor 620 uses Equation 1 below to convert o_T into z. For convenience of explanation, a single onset probability is calculated here by selecting the internal calculation value o_T, which corresponds to one of the one or more events; applying Equations 1 and 2 to each event's element of the internal calculation values o yields an onset probability for each event.

[Equation 1]

The processor 620 also converts the value z into the output value 300 using Equation 2 below. The processor 620 can thereby calculate the probability of cardiovascular disease onset using Equation 2.

[Equation 2]

Here, z denotes the output value and y denotes the probability of cardiovascular disease onset. K denotes the number of outputs that make up the output value. For example, if the output value consists of two outputs, the probability that the disease occurs and the probability that it does not occur, K is 2.

Here, o_T, an element of o, is a vector whose size equals the size of the hidden nodes defined inside the neural network, which can be chosen arbitrarily. For example, if the output value 300 is defined as disease and non-disease probabilities, so that its size is 2, and the size of the hidden nodes is defined as 6, the sizes do not match, so the processor 620 can introduce a matrix W to be optimized in order to match the sizes. In addition, because the range of the calculated output value 300 is not that of a probability, the processor 620 can use Equation 2 to convert it into a probability value between 0 and 1.
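The two conversions described above can be sketched numerically. The equation bodies are not reproduced in this text, so the forms used here are assumptions consistent with the description: Equation 1 is taken to be a linear map z = W o_T (no bias term is assumed), and Equation 2 is taken to be a softmax over the K outputs. The numeric values of `o_T` and `W` are made up; the sizes (6 hidden nodes, K = 2) follow the example in the text.

```python
import math


def linear(W, o_T):
    """Assumed form of Equation 1: z = W · o_T, mapping H hidden values to K raw outputs."""
    return [sum(w * x for w, x in zip(row, o_T)) for row in W]


def softmax(z):
    """Assumed form of Equation 2: y_k = exp(z_k) / sum_j exp(z_j), for k = 1..K."""
    m = max(z)  # subtracting the maximum avoids overflow and does not change the result
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hidden vector of the last event, size 6 (illustrative values).
o_T = [0.5, -1.0, 0.3, 0.8, 0.0, -0.2]

# W has K = 2 rows and 6 columns, so it maps size 6 to size 2.
W = [[0.2, -0.1, 0.4, 0.3, 0.0, 0.1],
     [-0.3, 0.2, 0.1, -0.2, 0.5, 0.0]]

z = linear(W, o_T)   # raw outputs, not yet probabilities
y = softmax(z)       # [P(disease), P(no disease)] under the assumed output ordering
```

In training, W would be optimized together with the rest of the model; here it is fixed only to show how the shapes line up.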

Accordingly, the disease onset probability prediction model learning device 600 compares the correct answer corresponding to the processed data with the output value calculated by the disease prediction model and continually updates the disease prediction model according to the result, so that the model calculates accurate output values and, in turn, a highly accurate probability of cardiovascular disease onset. The disease onset probability prediction model learning device 600 can also provide the calculated probability of cardiovascular disease onset to patients, non-patients, insurers, medical institutions, the health insurance corporation, and the like, so that the onset of cardiovascular disease can be predicted earlier. Patients can thus receive treatment quickly, and non-patients can take steps to prevent the onset of cardiovascular disease.

FIGS. 4A to 4C are graphs showing correct answers according to an embodiment of the present invention.

Referring to FIG. 4A, the first correct answer graph 710 sets the correct answer to 1 from the time the primary diagnosis is assigned according to the diagnosis result. That is, the correct answer at the time point 711 before the primary diagnosis is assigned is 0, and the correct answer at the time point 712 when the primary diagnosis is assigned and at the time point 713 after it is assigned is 1.

Referring to FIG. 4B, the second correct answer graph 720 sets the correct answer to 1 only at the time of disease onset. That is, the correct answer at the time point 721 before onset is 0, and the correct answer at the time point 722 of onset is 1. If the disease improves due to some factor after the onset time point 722, for example if the blood pressure readings of a patient with a blood pressure-related disease improve, the correct answer at the time point 723 after onset is 0.

Referring to FIG. 4C, the third correct answer graph 730 sets the correct answer to a value between 0 and 1 inclusive from before the time of disease onset up to the time of onset. Specifically, the correct answer at the time point 731 before onset is 0, and it increases proportionally from 0 to 1 between the time point 731 before onset and the onset time point 732. That is, the disease is regarded as developing between the time point 731 and the onset time point 732, so the correct answer rises proportionally from 0 to 1. If the disease does not improve between the onset time point 732 and the time point 733 after onset, the correct answer remains 1.

Accordingly, the disease onset probability prediction model learning device 600 defines the correct answer according to various criteria, such as the time the primary diagnosis is assigned and the time of disease onset, so that the disease prediction model can take a variety of cases into account.
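The three correct-answer schemes of FIGS. 4A to 4C can be sketched as label sequences over event indices. This is an illustrative sketch: events are assumed to be indexed 0, 1, 2, ... in time order, the FIG. 4B variant assumes the disease improves after onset, the FIG. 4C variant assumes it does not, and the function names are hypothetical.

```python
def labels_after_diagnosis(n_events, onset):
    """FIG. 4A style: 0 before the primary diagnosis, 1 from that event onward."""
    return [1.0 if t >= onset else 0.0 for t in range(n_events)]


def labels_at_onset_only(n_events, onset):
    """FIG. 4B style: 1 only at the onset event (assumes improvement afterward)."""
    return [1.0 if t == onset else 0.0 for t in range(n_events)]


def labels_ramp(n_events, start, onset):
    """FIG. 4C style: 0 up to `start`, a linear rise to 1 at `onset`, then 1."""
    labels = []
    for t in range(n_events):
        if t <= start:
            labels.append(0.0)
        elif t >= onset:
            labels.append(1.0)
        else:
            labels.append((t - start) / (onset - start))
    return labels


step = labels_after_diagnosis(5, onset=2)
spike = labels_at_onset_only(5, onset=2)
ramp = labels_ramp(6, start=1, onset=5)
```

The ramp variant gives the model a graded target in the period leading up to onset, which corresponds to treating the disease as already developing during that period.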

FIGS. 5A to 5C are schematic diagrams showing processed data tables from which data associated with cardiovascular disease has been excluded, according to an embodiment of the present invention.

Referring to FIG. 5A, the original data table 810 includes a plurality of events for a single treatment date 811, 812. For example, the original data table 810 includes two drug classification codes 821 and drug dosages 831 for the treatment date 811 of December 7, 2002. The original data table 810 therefore contains two rows for the treatment date 811 of December 7, 2002, one for each of the drug classification codes 821, A043016 and A054502. Each row for the treatment date 811 also contains the drug dosage 831. Likewise, the original data table 810 contains two rows for the treatment date 812 of December 21, 2002, one for each of the drug classification codes 822, A166503 and A037008. Each row for the treatment date 812 also contains the drug dosage 832.

Referring to FIG. 5B, the first processed data table 820 includes one event per treatment date. For example, the first processed data table 820 holds, in a single row, the data for a treatment date, that is, the drug dosage corresponding to each drug classification code. Specifically, the first processed data table 820 includes the drug classification codes 821 and drug dosages 831 under the single treatment date 811 of December 7, 2002. The first processed data table 820 also includes the drug classification codes 822 and drug dosages 832 under the treatment date 812 of December 21, 2002. That is, the first processed data table 820 contains one row per event, where each event merges the plurality of events that share a treatment date. At this point, the association between the drug classification codes 821, 822 and cardiovascular disease may be determined. For example, A166503, which is among the drug classification codes 822, and the drug dosage 832 of A166503 may be determined to be associated with cardiovascular disease.

Referring to FIG. 5C, the second processed data table 830 includes the data that remains after the drug classification codes and drug dosages associated with cardiovascular disease are excluded. Specifically, the second processed data table 830 excludes A166503, the drug classification code associated with cardiovascular disease, together with the drug dosage of A166503, and includes only the drug classification codes 821, 822 and drug dosages 831, 832 that are not associated with cardiovascular disease.

Accordingly, the disease onset probability prediction model learning apparatus 600 generates the processed data by excluding data that is highly correlated with cardiovascular disease, so that the disease prediction model does not unconditionally conclude that cardiovascular disease has occurred merely because data associated with cardiovascular disease was input.
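
The processing described for FIGS. 5A to 5C can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the dictionary layout, and the dosage values are hypothetical, and the set of disease-associated codes is simply assumed to have been determined beforehand. Only the dates and drug classification codes are taken from the description.

```python
from collections import defaultdict

# Rows shaped like FIG. 5A: (treatment date, drug classification code, dosage).
# The dosage numbers are illustrative placeholders, not values from the figures.
ORIGINAL_ROWS = [
    ("2002-12-07", "A043016", 1.0),
    ("2002-12-07", "A054502", 2.0),
    ("2002-12-21", "A166503", 1.0),
    ("2002-12-21", "A037008", 3.0),
]

# ASSUMPTION: codes already judged to be associated with the target disease.
DISEASE_ASSOCIATED_CODES = {"A166503"}


def merge_events_by_date(rows):
    """Collapse all rows that share a treatment date into one event (FIG. 5B)."""
    events = defaultdict(dict)
    for date, code, dosage in rows:
        events[date][code] = dosage
    return dict(events)


def drop_associated_codes(events, associated_codes):
    """Remove disease-associated codes together with their dosages (FIG. 5C)."""
    return {
        date: {code: dose for code, dose in drugs.items()
               if code not in associated_codes}
        for date, drugs in events.items()
    }


first_table = merge_events_by_date(ORIGINAL_ROWS)
second_table = drop_associated_codes(first_table, DISEASE_ASSOCIATED_CODES)
```

In this sketch each treatment date maps to a single event, and the second table keeps only the codes and dosages that are not associated with the disease.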

FIG. 6A is a schematic diagram illustrating additional processed data for a diseased person according to an embodiment of the present invention.

Referring to FIG. 6A, the event graph 910 of a diseased person includes processed data 911, first additional processed data 912, and second additional processed data 913. The processed data 911 may include one event or a plurality of events. Here, the time interval of the processed data 911 may be expressed as 1. The first additional processed data 912 is data for a time interval corresponding to one third of the processed data 911, and the second additional processed data 913 is data for a time interval corresponding to two thirds of the processed data 911. Both the first additional processed data 912 and the second additional processed data 913 include the disease onset time point. The time intervals are described as one third and two thirds for convenience of explanation, but the time intervals are not limited to these values.

FIG. 6B is a schematic diagram illustrating additional processed data for a non-diseased person according to an embodiment of the present invention.

Referring to FIG. 6B, the event graph 920 of a non-diseased person includes processed data 921, first additional processed data 922, and second additional processed data 923. The processed data 921 may include one event or a plurality of events. Here, the time interval of the processed data 921 may be expressed as 1. The first additional processed data 922 is data for a time interval corresponding to one third of the processed data 921, and the second additional processed data 923 is data for a time interval corresponding to two thirds of the processed data 921. The first additional processed data 922 and the second additional processed data 923 are data for randomly selected time intervals before the last examination time point or the last treatment time point of the processed data 921. The time intervals are described as one third and two thirds for convenience of explanation, but the time intervals are not limited to these values.

FIG. 6C is a schematic diagram illustrating a table of the composition of the additional processed data for diseased persons and the additional processed data for non-diseased persons according to an embodiment of the present invention.

Referring to FIG. 6C, the data configuration table 930 includes the counts for the training set, validation set, and test set that contain only processed data, and the counts for the training set, validation set, and test set that contain both processed data and additional processed data. The training set, validation set, and test set for the processed data each include processed data for diseased persons and for non-diseased persons. Likewise, the training set, validation set, and test set for the processed data and additional processed data each include processed data for diseased persons and for non-diseased persons.
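
As an illustration of how such sets might be formed, the following Python sketch applies the 6:2:2 ratio recited in the claims. It is a sketch under stated assumptions: the function name `split_6_2_2`, the identifiers, and the fixed random seed are hypothetical, and splitting diseased and non-diseased samples separately is one possible reading of the claim language rather than a confirmed detail.

```python
import random


def split_6_2_2(samples, rng):
    """Shuffle samples and split them into training, validation, and test sets."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    train_count = len(shuffled) * 6 // 10
    val_count = len(shuffled) * 2 // 10
    train = shuffled[:train_count]
    val = shuffled[train_count:train_count + val_count]
    test = shuffled[train_count + val_count:]  # remainder, so nothing is dropped
    return train, val, test


rng = random.Random(0)  # fixed seed only to keep the sketch reproducible

# Hypothetical identifiers standing in for diseased and sampled non-diseased persons.
diseased_ids = [f"D{i}" for i in range(10)]
non_diseased_ids = [f"N{i}" for i in range(20)]

# ASSUMPTION: each group is split on its own so every set contains both groups.
d_train, d_val, d_test = split_6_2_2(diseased_ids, rng)
nd_train, nd_val, nd_test = split_6_2_2(non_diseased_ids, rng)

training_set = d_train + nd_train
validation_set = d_val + nd_val
test_set = d_test + nd_test
```

Splitting each group separately keeps diseased and non-diseased data present in every set, which matches the statement that each set includes data for both groups.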

FIG. 6D is a schematic diagram illustrating a performance data table for the case where the disease prediction model is trained with the processed data only and the case where it is trained with both the processed data and the additional processed data.

Referring to FIG. 6D, the performance data table 940 includes performance data for the disease prediction model trained with the processed data and for the disease prediction model trained with both the processed data and the additional processed data. In the performance data table 940, Precision is the proportion of actually positive data among the data classified as positive, Recall is the proportion of all positive data that was found, Accuracy is the proportion of positive and negative data that was classified correctly, and F1 Score is an index that combines Precision and Recall. The F1 Score values of the two models are similar, but the Precision of the model trained with the processed data and the additional processed data is slightly higher than that of the model trained with the processed data only, while its Recall is slightly lower.
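
The four indices can be computed from confusion-matrix counts as in the following Python sketch. The function name and the example counts are hypothetical and are not taken from the performance data table 940; F1 is computed with its standard definition as the harmonic mean of Precision and Recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Precision, Recall, Accuracy, and F1 from confusion-matrix counts.

    tp / fp: positive predictions that were correct / incorrect
    tn / fn: negative predictions that were correct / incorrect
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Illustrative counts only; they do not reproduce any figure in the patent.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

Because F1 depends on both Precision and Recall, a model can trade a small gain in one for a small loss in the other while its F1 stays nearly unchanged, which is the pattern described for the two models above.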

Accordingly, the disease onset probability prediction model learning apparatus 600 generates, as additional processed data, processed data for a selected time interval that includes the disease onset time point in the case of a diseased person, and processed data for a randomly selected time interval in the case of a non-diseased person, so that the disease prediction model can take events from a variety of cases into account.
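
One way to picture the selection of these time intervals is the Python sketch below. It is only a sketch under assumptions: time is modeled as a plain number, the function names are hypothetical, and placing the diseased-person window so that it ends at the onset time point (when possible) is an illustrative choice, since the description only requires the window to include the onset.

```python
import random


def diseased_window(start, end, onset, fraction):
    """Return a sub-interval of [start, end] that includes the onset time point.

    ASSUMPTION: the window ends at the onset when there is room for it,
    otherwise it starts at the beginning of the processed data.
    """
    length = (end - start) * fraction
    window_end = max(onset, start + length)
    return window_end - length, window_end


def non_diseased_window(start, last_visit, fraction, rng):
    """Return a randomly placed sub-interval that ends no later than the last visit."""
    length = (last_visit - start) * fraction
    window_start = rng.uniform(start, last_visit - length)
    return window_start, window_start + length


def select_events(events, window):
    """Keep the events whose time stamp falls inside the window."""
    low, high = window
    return [event for event in events if low <= event["t"] <= high]


rng = random.Random(0)  # fixed seed only to keep the sketch reproducible
first_additional = diseased_window(0.0, 90.0, onset=60.0, fraction=1 / 3)
second_additional = diseased_window(0.0, 90.0, onset=60.0, fraction=2 / 3)
random_additional = non_diseased_window(0.0, 90.0, fraction=1 / 3, rng=rng)
```

The fractions one third and two thirds mirror the example in FIGS. 6A and 6B; any other fraction between 0 and 1 could be passed instead.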

FIGS. 7A and 7B illustrate a processed data table in which the values of a plurality of items are normalized and entered according to an embodiment of the present invention.

Referring to FIG. 7A, the original data table 1010 includes a plurality of events for each personal serial number. Each event includes a plurality of items such as BMI, systolic blood pressure, and diastolic blood pressure, and the items are entered as numerical values in different units. For example, BMI is entered as a value in kg/m^2, and systolic and diastolic blood pressure are entered as values in mmHg.

Referring to FIG. 7B, the processed data table 1020 includes, for the plurality of items, numerical values converted to z-scores. The z-score values are calculated from the mean and standard deviation of the numerical values, which are in different units. That is, the processed data table 1020 may hold, for the plurality of items, z-score values that behave as if the numerical values in different units had been expressed in a single common unit.

Accordingly, the disease onset probability prediction model learning apparatus 600 converts the plurality of items, which are in different units, to z-scores, thereby applying the same reference scale to the items so that items affecting the disease onset probability can be recognized more easily.
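
A minimal Python sketch of the z-score conversion follows, using z = (x - mean) / standard deviation per item. The function names and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the description does not state which form is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert one item's values to z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # ASSUMPTION: population standard deviation
    if sigma == 0:
        # All values are identical, so every value sits exactly at the mean.
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def normalize_table(table):
    """Apply the z-score conversion to every item (column) independently."""
    return {item: z_scores(values) for item, values in table.items()}


# Illustrative values only; they are not taken from FIG. 7A.
original = {
    "bmi": [21.0, 24.0, 27.0],            # kg/m^2
    "systolic_bp": [110, 120, 130],       # mmHg
    "diastolic_bp": [70, 80, 90],         # mmHg
}
processed = normalize_table(original)
```

After the conversion every item has mean 0 and unit spread, so values originally recorded in kg/m^2 and in mmHg become directly comparable.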

FIGS. 8A and 8B illustrate a processed data table in which the values of a plurality of items are converted to defined units and entered according to an embodiment of the present invention.

Referring to FIG. 8A, the original data table 1110 includes a plurality of events for each personal serial number. Each event includes a plurality of items: height, weight, current smoking duration, current average daily smoking amount, and alcohol consumption per occasion. The numerical values for a single item may be entered in different units. For example, height may be entered in cm or ft, weight in kg or lb, current smoking duration in 5-year or 1-year units, current average daily smoking amount in half-packs or individual cigarettes, and alcohol consumption per occasion in half-bottles of soju or soju glasses.

Referring to FIG. 8B, the processed data table 1120 includes numerical values in the same unit for each item. For example, the processed data table 1120 includes height in cm, weight in kg, current smoking duration in 1-year units, current average daily smoking amount in individual cigarettes, and alcohol consumption per occasion in soju glasses.

Accordingly, the disease onset probability prediction model learning apparatus 600 converts numerical values that were entered in different units for a single item into numerical values in the same unit, so that the cardiovascular disease onset prediction model can also accept original data recorded in different units and can therefore calculate a more accurate disease onset probability from a wider variety of data.
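
The unit conversion can be sketched in Python as a lookup of conversion factors per item and source unit. The table and function names are hypothetical. The length and mass factors are exact definitions, whereas the factor for half-packs rests on an assumed pack size of 20 cigarettes, which the description does not specify.

```python
# (item, source unit) -> factor that converts a value into the target unit.
TO_TARGET_UNIT = {
    ("height", "cm"): 1.0,
    ("height", "ft"): 30.48,            # 1 ft = 30.48 cm (exact)
    ("weight", "kg"): 1.0,
    ("weight", "lb"): 0.45359237,       # 1 lb = 0.45359237 kg (exact)
    ("daily_smoking", "cigarette"): 1.0,
    ("daily_smoking", "half_pack"): 10.0,  # ASSUMPTION: 20 cigarettes per pack
}


def to_target_unit(item, value, unit):
    """Convert a value recorded in `unit` into the single unit used for `item`."""
    try:
        factor = TO_TARGET_UNIT[(item, unit)]
    except KeyError:
        raise ValueError(f"no conversion defined for {item!r} in {unit!r}") from None
    return value * factor


# Illustrative records only: (item, value, unit as recorded in the original data).
records = [("height", 6.0, "ft"), ("weight", 100.0, "lb"), ("daily_smoking", 1.0, "half_pack")]
converted = [(item, to_target_unit(item, value, unit)) for item, value, unit in records]
```

Raising an error for an unknown unit, rather than passing the value through, keeps values in unrecognized units from silently entering the processed data.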

In this specification, each block or step may represent a module, segment, or portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions noted in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially concurrently, or may sometimes be performed in reverse order, depending on the functionality involved.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the present invention is not necessarily limited to these embodiments, and various modifications may be made without departing from the technical idea of the present invention. Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

100, 911, 921: processed data
200: disease prediction model
300: output value
400: correct answer
500: result
600: disease onset probability prediction model learning apparatus
610: communication unit
620: processor
630: storage unit
710: first correct-answer graph
711: time point before the primary diagnosis is assigned
712: time point at which the primary diagnosis is assigned
713: time point after the primary diagnosis is assigned
720: second correct-answer graph
721, 731: time point before disease onset
722, 732: disease onset time point
723, 733: time point after disease onset
730: third correct-answer graph
810: original data table
811, 812: treatment date
820: first processed data table
821, 822: drug classification code
830: second processed data table
831, 832: drug dosage
910: event graph of a diseased person
912, 922: first additional processed data
913, 923: second additional processed data
920: event graph of a non-diseased person
1000: disease onset prediction learning system
1010, 1110: original data table
1020, 1120: processed data table

Claims

A disease onset probability prediction model learning method, comprising:
receiving, through a communication unit, original data including a plurality of items from at least one external database;
generating, through a processor and based on the original data, processed data representing one medical treatment or one health examination as one event, such that the number and the length of the events of each of diseased persons and non-diseased persons are the same;
generating, through the processor, additional processed data having a selected time interval within the time interval of the processed data;
generating, through the processor, a plurality of disease prediction models for each of a plurality of events included in the processed data and the additional processed data;
inputting, through the processor, the processed data and the additional processed data into the disease prediction models to calculate an output value indicating whether the disease has occurred;
comparing, through the processor, the output value with a correct answer; and
updating, through the processor, the disease prediction models according to a result of the comparison,
wherein generating the additional processed data comprises:
generating, as additional processed data for a diseased person, processed data for a time interval that starts at a time point selected within a time interval including the disease onset time point and that includes the disease onset time point; and
generating, as additional processed data for a non-diseased person, processed data for a selected time interval before the last examination or treatment time point of the processed data,
and wherein the processor divides a sample set of all diseased-person data and arbitrarily selected non-diseased-person data into a training set, a validation set, and a test set at a ratio of 6:2:2, respectively.

The method according to claim 1,
wherein the processed data includes at least one of sociological data, medical treatment record data including the one medical treatment, and health examination data including the one health examination, to form the one event.

(Deleted)

The method according to claim 1,
wherein the correct answer represents the disease onset probability as 0 or 1.

The method according to claim 4,
wherein the correct answer sets the disease onset probability to 1 from the time point at which the primary diagnosis is assigned.

The method according to claim 4,
wherein the correct answer sets the disease onset probability to 1 only at the disease onset time point.

The method according to claim 4,
wherein the correct answer sets the disease onset probability to a value of 0 or more and 1 or less from the disease onset time point.

The method according to claim 1,
wherein generating the processed data comprises:
listing a drug classification code and a drug dosage among the plurality of items included in the one event;
determining whether the drug classification code is associated with the disease to be predicted; and
deleting the drug classification code and the drug dosage when the drug classification code is associated with the disease.

(Deleted)

The method according to claim 1,
wherein generating the processed data comprises:
filtering the events so as to exclude, from the medical treatment record data included in the original data, data from which the disease can be directly diagnosed.

The method according to claim 1,
wherein generating the processed data further comprises:
correcting the average length of the events of each of the diseased persons and the non-diseased persons.

The method according to claim 1,
wherein generating the processed data comprises:
extracting the unit corresponding to each of the plurality of items; and
converting each unit into the unit required for the processed data.

The method according to claim 1,
wherein generating the processed data comprises:
calculating the mean and the standard deviation of the values corresponding to each of the plurality of items; and
converting the values into z-scores using the mean and the standard deviation.

A disease onset probability prediction model learning apparatus, comprising:
a communication unit configured to receive original data including a plurality of items from at least one external database;
a processor configured to generate, based on the original data, processed data representing one medical treatment or one health examination as one event such that the number and the length of the events of each of diseased persons and non-diseased persons are the same, and to generate additional processed data having a time interval that is a part of the time interval of the processed data; and
a storage unit configured to store the original data and the processed data,
wherein the processor is configured to:
generate a plurality of disease prediction models for each of a plurality of events included in the processed data and the additional processed data,
input the processed data into the disease prediction models to calculate an output value indicating whether the disease has occurred,
compare the output value with a correct answer, and
update the disease prediction models according to a result of the comparison,
wherein the processor is further configured to:
generate, as additional processed data for a diseased person, processed data for a time interval that starts at a time point selected within a time interval including the disease onset time point and that includes the disease onset time point, and generate, as additional processed data for a non-diseased person, processed data for a selected time interval before the last examination or treatment time point of the processed data,
and wherein the processor is further configured to:
divide a sample set of all diseased-person data and arbitrarily selected non-diseased-person data from the original data into a training set, a validation set, and a test set at a ratio of 6:2:2, respectively.

The apparatus according to claim 16,
wherein the processor is configured to:
list a drug classification code and a drug dosage among the plurality of items included in the one event,
determine whether the drug classification code is associated with the disease to be predicted, and
delete the drug classification code and the drug dosage data when the drug classification code is associated with the disease.

(Deleted)

The apparatus according to claim 16,
wherein the processor is configured to:
filter the events so as to exclude, from the medical treatment record data included in the original data, data from which the disease can be directly diagnosed.

The apparatus according to claim 16,
wherein the processor is configured to:
correct the average length of the events of each of the diseased persons and the non-diseased persons.