KR102442873B1

KR102442873B1 - A disease prediction system for planning insurance

Info

Publication number: KR102442873B1
Application number: KR1020200125198A
Authority: KR
Inventors: 김재원; 이찬중; 손기혁; 송승재
Original assignee: 주식회사 라이프시맨틱스
Priority date: 2019-12-27
Filing date: 2020-09-25
Publication date: 2022-09-14
Also published as: KR20210084224A

Abstract

The present invention relates to a disease prediction service system for insurance design that collects a customer's basic health-related information, such as physical information, medical history, family history, and lifestyle habits, through a questionnaire, collects past health checkup information and sociodemographic variables, corrects missing data among the collected data, and predicts the customer's expected diseases and the probability of onset of each. The system comprises: a sample data collection unit that collects sociodemographic information, health checkup data, and disease onset data of past patients as sample data; a questionnaire input unit that receives the customer's sociodemographic information through a questionnaire; a health checkup collection unit that collects the customer's health checkup data; a missing value correction unit that extracts from the sample data a similar group resembling the customer and corrects missing data in the customer's health checkup data by substituting a representative value of the extracted similar group; and a disease prediction unit that predicts the customer's disease using the customer's sociodemographic information and the corrected health checkup data. Because the customer's missing data is corrected using the sample data of past patients, the probability of onset of the customer's expected diseases can be predicted even when part of the customer's health checkup results is missing.

Description

A disease prediction system for planning insurance

The present invention relates to a disease prediction service system for insurance design that collects a customer's basic health-related information, such as physical information, medical history, family history, and lifestyle habits, through a questionnaire, collects past health checkup information and sociodemographic variables, corrects missing data among the collected data, and predicts the customer's expected diseases and the probability of onset of each.

In general, insurance is a system in which a policyholder pays a fixed amount in advance to prepare for events that may occur in the future, such as accidents or illness, and receives a corresponding payment when the agreed conditions are met. Health insurance is a typical example. As living standards rise and medical technology advances, life expectancy is increasing, and so is the burden of medical expenses. To reduce the medical cost burden that comes with longer life expectancy, people prepare for disease by purchasing products such as cancer insurance and indemnity medical insurance. For insurers, the more diseases their customers develop, the more they pay out in claims.

Meanwhile, the types of disease a person may contract and the probability of onset differ from individual to individual, depending on factors such as health status, constitution, and lifestyle habits. Nevertheless, people typically buy health insurance based only on sex, age, and occupation. As a result, they may pay high premiums for diseases they are relatively unlikely to contract, or fail to prepare for the diseases for which they most need cover.

Conversely, from the insurer's point of view, if more diseases occur than were expected at the time of enrollment, losses such as mortality loss may occur. Accordingly, the insurance industry is making various attempts to meet customers' needs as fully as possible while reducing the likelihood of mortality loss (that is, mortality risk).

To this end, techniques have been proposed in which, after a person receives basic medical care at a medical facility, the person is enrolled in insurance for the expected related diseases according to the facility's findings and the insured's current condition [Patent Documents 1 and 2]. This prior art aims to give both the insured and the insurer an accurate premium calculation based on the initial treatment stage.

However, the prior art assumes that the insured's health information is collected in full, whereas in practice the insured's health data cannot be obtained completely.

In addition, insurance products are selected by considering only the insured's current health status, without considering the medical costs that the insured's condition may entail.

For example, the four blood lipid values (total cholesterol, HDL cholesterol, LDL cholesterol, and triglycerides) have, since a certain point in time, been tested intermittently rather than every year. Consequently, even when a customer's health checkup data is retrieved, results for some of these blood lipid items may be absent.

Furthermore, examinees sometimes skip some items at the screening center during an actual health checkup, so complete checkup results are not always available. Even when customers enter their checkup results themselves, they may be unable to enter some of the values.

Therefore, a technique for correcting the missing values of such items is strongly needed for cases where part of the customer's health-related information is missing.

Patent Document 1: Korean Laid-open Patent Publication No. 10-2003-0023667 (published 2003-03-19)
Patent Document 2: Korean Laid-open Patent Publication No. 10-2015-0049993 (published 2015-05-08)

An object of the present invention is to solve the problems described above by providing a disease prediction service system for insurance design that collects a customer's basic health-related information, such as physical information, medical history, family history, and lifestyle habits, through a questionnaire, collects past health checkup information and sociodemographic variables, corrects missing data among the collected data, and predicts the customer's expected diseases and the probability of onset of each.

Another object of the present invention is to provide a disease prediction service system for insurance design that collects patients' health checkup information and disease onset records as sample data organized by sociodemographic variables, and applies the sample data on the basis of the customer's sociodemographic factors to correct missing data in that customer's health checkup data.

To achieve these objects, the present invention relates to a disease prediction service system for insurance design, comprising: a sample data collection unit that collects sociodemographic information, health checkup data, and disease onset data of past patients as sample data; a questionnaire input unit that receives the customer's sociodemographic information through a questionnaire; a health checkup collection unit that collects the customer's health checkup data; a missing value correction unit that extracts from the sample data a similar group resembling the customer and corrects missing data in the customer's health checkup data by substituting a representative value of the extracted similar group; and a disease prediction unit that predicts the customer's disease using the customer's sociodemographic information and the corrected health checkup data.

Further, in the disease prediction service system for insurance design according to the present invention, the system further comprises a similar variable selection unit that selects similar group extraction variables, and the missing value correction unit extracts, as the similar group, the sample records whose values for the similar group extraction variables are the same as the customer's.
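The exact-match extraction described here can be sketched as follows. This is a minimal illustration, not the patented implementation: the record layout (plain dictionaries) and the field names `sex`, `age_band`, and `total_cholesterol` are assumptions introduced for the example.

```python
def extract_similar_group(samples, customer, extraction_vars):
    """Return sample records whose extraction variables all equal the customer's.

    `samples` and `customer` are plain dicts; the field names are hypothetical.
    """
    key = tuple(customer[v] for v in extraction_vars)
    return [s for s in samples
            if tuple(s.get(v) for v in extraction_vars) == key]


# Hypothetical sample data for illustration only.
samples = [
    {"sex": "F", "age_band": "40-49", "total_cholesterol": 190},
    {"sex": "F", "age_band": "40-49", "total_cholesterol": 210},
    {"sex": "M", "age_band": "40-49", "total_cholesterol": 230},
]
customer = {"sex": "F", "age_band": "40-49", "total_cholesterol": None}

group = extract_similar_group(samples, customer, ["sex", "age_band"])
print(len(group))
```

Exact matching only works for discrete values, so a continuous variable would first have to be grouped into bands; how that is done is not stated in the text.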

Further, in the disease prediction service system for insurance design according to the present invention, the similar variable selection unit forms a plurality of combinations of predetermined candidate variables; for a predetermined test set, estimates the customer's checkup data for each combination; obtains the difference between the estimated checkup data and the actual data of the test set; sets a statistic of the differences over the whole test set as the error of each combination; extracts the combinations whose error falls within a predetermined error range; and selects, as the similar group extraction variables, the variables of the extracted combination that has the fewest variables.
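The selection procedure above can be sketched in a few lines. This is a hedged illustration under several assumptions that the text does not fix: the estimate is taken as the median of the matching sample records, the error statistic is the mean absolute difference, ties between equally small combinations are broken by lower error, and all function and field names are hypothetical.

```python
from itertools import combinations
from statistics import mean, median


def estimate(samples, record, variables, target):
    """Estimate `target` as the median over samples matching `record` on `variables`."""
    values = [s[target] for s in samples
              if all(s[v] == record[v] for v in variables)]
    return median(values) if values else None


def select_extraction_vars(samples, test_set, candidates, target, max_error):
    """Pick the smallest variable combination whose error is within `max_error`."""
    accepted = []
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            diffs = []
            for rec in test_set:
                est = estimate(samples, rec, combo, target)
                if est is None:          # no matching samples: reject this combination
                    diffs = None
                    break
                diffs.append(abs(est - rec[target]))
            if diffs is not None and mean(diffs) <= max_error:
                accepted.append((len(combo), mean(diffs), combo))
    # Fewest variables first; lower error as an assumed tie-breaker.
    return min(accepted)[2] if accepted else None


# Hypothetical data: fasting glucose by sex, age band, and region.
samples = [
    {"sex": "F", "age_band": "40s", "region": "A", "glucose": 90},
    {"sex": "F", "age_band": "40s", "region": "B", "glucose": 92},
    {"sex": "M", "age_band": "40s", "region": "A", "glucose": 110},
    {"sex": "M", "age_band": "40s", "region": "B", "glucose": 112},
    {"sex": "F", "age_band": "50s", "region": "A", "glucose": 100},
    {"sex": "M", "age_band": "50s", "region": "B", "glucose": 120},
]
test_set = [
    {"sex": "F", "age_band": "40s", "region": "A", "glucose": 91},
    {"sex": "M", "age_band": "40s", "region": "B", "glucose": 111},
]
candidates = ["sex", "age_band", "region"]

print(select_extraction_vars(samples, test_set, candidates, "glucose", 2))
print(select_extraction_vars(samples, test_set, candidates, "glucose", 0.5))
```

With the looser tolerance a single variable suffices; tightening the tolerance forces a larger combination, which is the trade-off the selection step is meant to resolve.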

Further, in the disease prediction service system for insurance design according to the present invention, the similar group extraction variables consist of sex, age, and BMI.
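As a small illustration, the three extraction variables could be derived from questionnaire answers as below. BMI is computed with the standard formula (weight in kilograms divided by the square of height in metres). The ten-year age bands and the BMI cut-offs are assumptions made only so that records can be compared for equality; the text does not specify any banding, and the field names are hypothetical.

```python
def bmi(height_cm: float, weight_kg: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


def age_band(age: int) -> str:
    """Ten-year age band, e.g. 47 -> '40-49' (banding is an assumption)."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


# Illustrative cut-offs only; not taken from the source.
BMI_CUTOFFS = [(18.5, "<18.5"), (23.0, "18.5-22.9"), (25.0, "23.0-24.9")]


def bmi_band(value: float) -> str:
    for upper, label in BMI_CUTOFFS:
        if value < upper:
            return label
    return ">=25.0"


def extraction_key(person: dict) -> tuple:
    """Build the (sex, age band, BMI band) key used to match sample records."""
    return (person["sex"],
            age_band(person["age"]),
            bmi_band(bmi(person["height_cm"], person["weight_kg"])))


customer = {"sex": "F", "age": 47, "height_cm": 170, "weight_kg": 65}
print(extraction_key(customer))
```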

Further, in the disease prediction service system for insurance design according to the present invention, the missing value correction unit forms the similar group so that it contains a predetermined number of records, extracting them at random from the sample data having the same values.

Further, in the disease prediction service system for insurance design according to the present invention, the missing value correction unit obtains a patient group and a control group separately and forms the similar group by combining equal numbers of patient-group and control-group records, the control group being extracted from the sample data of patients who have not been diagnosed with the disease and the patient group being extracted from the sample data of patients who have been diagnosed with the disease.
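Random extraction of a fixed-size group with equal numbers of patients and controls might look like the sketch below. The `diagnosed` flag and the function name are hypothetical, and shrinking both arms when too few matching records exist is an assumption; the text does not say how that case is handled.

```python
import random


def build_similar_group(matched, size_per_arm, rng):
    """Randomly draw equal numbers of diagnosed (patient) and undiagnosed (control) records.

    `matched` holds sample records that already share the customer's
    extraction-variable values. `diagnosed` is a hypothetical boolean field.
    """
    patients = [s for s in matched if s["diagnosed"]]
    controls = [s for s in matched if not s["diagnosed"]]
    # Assumed fallback: shrink both arms equally if either pool is too small.
    n = min(size_per_arm, len(patients), len(controls))
    return rng.sample(patients, n) + rng.sample(controls, n)


# Hypothetical matched records: 5 diagnosed, 8 not diagnosed.
matched = ([{"id": i, "diagnosed": True} for i in range(5)]
           + [{"id": 100 + i, "diagnosed": False} for i in range(8)])

group = build_similar_group(matched, 3, random.Random(0))
print(len(group), sum(s["diagnosed"] for s in group))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which helps when comparing estimates across runs.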

Further, in the disease prediction service system for insurance design according to the present invention, the representative value of the similar group is set to the mode of the values for categorical data and to the median of the values for continuous data.
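A minimal sketch of this substitution rule, assuming dictionary records and hypothetical field names, with `urine_protein` treated as the only categorical item purely for illustration:

```python
from statistics import median, mode

# Assumption for the example: which fields are categorical.
CATEGORICAL_FIELDS = {"urine_protein"}


def impute_missing(customer, similar_group, fields):
    """Fill the customer's missing fields with the group's mode (categorical) or median (continuous)."""
    filled = dict(customer)
    for field in fields:
        if filled.get(field) is not None:
            continue                      # value present: leave it untouched
        values = [s[field] for s in similar_group if s.get(field) is not None]
        if not values:
            continue                      # nothing to impute from
        filled[field] = mode(values) if field in CATEGORICAL_FIELDS else median(values)
    return filled


# Hypothetical similar group and customer record.
similar_group = [
    {"ldl_cholesterol": 100, "urine_protein": "negative", "systolic_bp": 118},
    {"ldl_cholesterol": 140, "urine_protein": "negative", "systolic_bp": 131},
    {"ldl_cholesterol": 120, "urine_protein": "trace", "systolic_bp": 125},
]
customer = {"ldl_cholesterol": None, "urine_protein": None, "systolic_bp": 122}

result = impute_missing(customer, similar_group,
                        ["ldl_cholesterol", "urine_protein", "systolic_bp"])
print(result)
```

The median is less sensitive than the mean to a few extreme values in the similar group, which is one plausible reason for preferring it for continuous measurements.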

Further, in the disease prediction service system for insurance design according to the present invention, the sociodemographic information includes sex, age, region of residence, insurance enrollment type, income bracket, presence or absence of disability, height, and weight, and the checkup data includes waist circumference, systolic blood pressure, diastolic blood pressure, fasting blood glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, hemoglobin, urine protein, serum creatinine, serum GOT, serum GPT, and gamma-GTP.

As described above, the disease prediction service system for insurance design according to the present invention predicts the likelihood of onset by considering not only personal physical information, family history, and habits but also sociodemographic variables such as region of residence, disability, income, and insurance enrollment, and therefore has the effect of predicting more accurately the diseases a user is likely to develop in the future.

In addition, the disease prediction service system for insurance design according to the present invention corrects the customer's missing data using the sample data of past patients, and therefore has the effect that the probability of onset of the customer's expected diseases can be predicted even when part of the customer's health checkup results is missing.

FIG. 1 is a block diagram of the overall system for implementing the present invention.
FIG. 2 is a block diagram of the configuration of a disease prediction service system for insurance design according to an embodiment of the present invention.
FIG. 3 is a table showing the input variables of a disease prediction model according to an embodiment of the present invention.

Hereinafter, specific details for carrying out the present invention are described with reference to the drawings.

In describing the present invention, identical parts are given identical reference numerals, and repeated descriptions of them are omitted.

First, the configuration of the overall system for implementing the present invention is described with reference to FIG. 1.

As shown in FIG. 1, the overall system for implementing the present invention consists of a smart mobile terminal 10, such as an insurance planner's smartphone; an insurance design client 20 installed on the smart mobile terminal 10 to support insurance design; an insurance design server 30 that provides the insurance design service; and a prediction server 50 that predicts the probability of onset of expected diseases. The smart mobile terminal 10 and the insurance design server 30 are connected to each other through a network 80 such as the Internet. A database 40 that stores health checkup results, statistical information on health checkup results organized by sociodemographic variables, disease prediction information, and the like may additionally be provided.

First, the smart mobile terminal 10 is a mobile terminal used by an insurance planner and is a smart terminal with ordinary computing functions, such as a smartphone, phablet, or tablet PC. In particular, the smart mobile terminal 10 is a terminal on which applications, including mobile applications (apps), can be installed and run.

Next, the insurance design client 20 is a mobile application (app) installed and run on the smart mobile terminal 10, and is a program system that provides the insurance design service in cooperation with the insurance design server 30.

That is, the insurance design client 20 provides services that support insurance design, such as collecting the customer's health status through a questionnaire, or predicting and presenting the diseases the customer may develop and their probability of onset.

Next, the insurance design server 30 is an ordinary application server that collects the customer's past checkup data and the like, predicts the customer's diseases, and provides an insurance design according to the prediction results.

In particular, the insurance design server 30 requests a disease prediction for the customer from the prediction server 50 and receives the prediction information from the prediction server 50.

The prediction server 50 collects sample data, such as patients' sociodemographic data and their past checkup, diagnosis, and disease onset data, and uses it to predict the customer's diseases. In particular, on receiving a customer's sociodemographic information and health checkup information, it predicts the diseases for that customer.

In addition, when the prediction server 50 determines that part of the customer's health checkup information is missing, it corrects the missing data using the sample data.
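The check for missing checkup items could be as simple as the sketch below. The item keys are hypothetical English names for the checkup measurements listed in this document, and treating both an absent key and a `None` value as missing is an assumption.

```python
# Hypothetical keys for the checkup items named in this document.
CHECKUP_ITEMS = [
    "waist_circumference", "systolic_bp", "diastolic_bp", "fasting_glucose",
    "total_cholesterol", "hdl_cholesterol", "ldl_cholesterol", "triglycerides",
    "hemoglobin", "urine_protein", "serum_creatinine",
    "serum_got", "serum_gpt", "gamma_gtp",
]


def find_missing_items(checkup: dict) -> list:
    """List the expected checkup items that are absent or None in `checkup`."""
    return [item for item in CHECKUP_ITEMS if checkup.get(item) is None]


# Example record: everything present except two lipid values.
checkup = {item: 1.0 for item in CHECKUP_ITEMS}
checkup["ldl_cholesterol"] = None      # explicitly empty
del checkup["triglycerides"]           # not returned at all

print(find_missing_items(checkup))
```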

The insurance design server 30 and the prediction server 50 are functionally separate servers, but they may also be implemented as a single server. For convenience of explanation, the following description assumes that the functions of the prediction server 50 are implemented within the insurance design server 30.

The client 20 and the insurance design server 30 (and the prediction server 50) may be implemented according to ordinary client-server design practice. That is, the functions of the overall system can be divided between client and server according to the client's performance, the volume of communication with the server, and so on. As one example, the client 20 may only collect data such as questionnaire answers and display analysis results, while the insurance design server 30 collects and analyzes the sample data and predicts the customer's diseases. As another example, the insurance design server 30 may collect the customer's past checkup data, while the client 20 directly performs all or part of the analysis or prediction. The following description refers to the disease prediction service system as a whole, but it may be implemented with various divisions of work depending on the server-client configuration.

Next, the database 40 includes a sample data DB 41 that stores sample data such as patients' sociodemographic information and health checkup results, a customer health information DB 42 that stores customers' sociodemographic information and health checkup results, and a prediction information DB 43 that stores the diseases a customer may develop together with their probability and risk level. However, this configuration of the database 40 is only a preferred embodiment; when developing a specific apparatus, the database may be given a different structure according to database design theory, taking into account ease and efficiency of access and retrieval.

Next, a disease prediction service system 300 for insurance design according to an embodiment of the present invention is described with reference to FIG. 2.

As described above, the disease prediction service system 300 according to the present invention is implemented in a server-client manner by the insurance design client 20 and the insurance design server 30 (or the prediction server 50).

As shown in FIG. 2, the disease prediction service system 300 according to an embodiment of the present invention consists of a sample data collection unit 31 that collects sociodemographic information, health checkup information, disease onset data, and the like of past patients as sample data; a questionnaire input unit 32 that receives the customer's sociodemographic information and the like through a questionnaire; a health checkup collection unit 33 that collects the customer's health checkup results; a missing value correction unit 34 that corrects missing data in the customer's health checkup results; and a disease prediction unit 35 that predicts the customer's diseases using the customer's sociodemographic information and the corrected health checkup results. The system may further include a similar variable selection unit 36 that selects the similar group extraction variables.

First, the sample data collection unit 31 collects sociodemographic information, health checkup information, disease onset data, and the like of past patients as sample data.

The sociodemographic information is data representing the patient's health status and consists of age, sex, height, weight, presence or absence of disability, lifestyle habits, and the like.

The health checkup information is the patient's health checkup data, that is, data measured when a health checkup is performed, such as blood pressure, cholesterol levels, hemoglobin, and urine protein levels.

The disease onset data is data on the diseases the patient has developed and indicates whether or not the patient has developed a given disease.

The sample data is classified or identified by sociodemographic information when it is collected. That is, personal information that identifies the patient, such as the patient's name, is excluded, and the medical data is collected on the basis of health status information such as age, sex, height, weight, presence or absence of disability, and lifestyle habits.

Preferably, a sample cohort DB is used as the sample data. The full data set making up the sample cohort DB covers one million members of the population. Because these one million subjects were selected by stratified sampling based on the national distribution of sex, age, and region of residence, results derived from this data can be regarded as representative of the whole population.

Next, the questionnaire input unit 32 receives the customer's sociodemographic information through a questionnaire.

As described above, the customer's sociodemographic information consists of age, sex, height, weight, presence or absence of disability, lifestyle habits, income bracket, personal medical history, family medical history, and the like. The customer's sociodemographic information is obtained through the questionnaire. Accordingly, the terms questionnaire data and sociodemographic data are used interchangeably below.

As an example, an interface that asks the sociodemographic questions is provided through the insurance planner's smart mobile terminal 10, and the customer answers each question directly, so that the customer's information is entered.

Preferably, the questionnaire consists of 26 items. That is, the questionnaire data includes sex, age, region of residence, insurance enrollment type, income bracket, presence or absence of disability, type of screening institution, height, weight, personal medical history (stroke, heart disease, hypertension, diabetes, dyslipidemia, pulmonary tuberculosis, and other diseases including cancer), family medical history (stroke, heart disease, hypertension, diabetes, liver disease, cancer), smoking status, smoking duration, daily smoking amount, drinking habits, amount of alcohol per occasion, weekly amount of exercise, and the like.

Next, the health checkup collection unit 33 collects the customer's health checkup data.

The customer's health checkup data consists of the data measured during a health checkup, such as systolic blood pressure, diastolic blood pressure, pre-meal blood glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, and urine protein.

At this time, the customer's health checkup data is either entered directly by the customer or, after the customer authenticates with an accredited certificate, the most recent health checkup data is retrieved from the National Health Insurance Service, a medical data institution (the 건강인 site), or the like. The smart mobile terminal 10 can access the site, submit the customer's authentication information, and collect the data directly.

In general, the items collected by the pre-examination survey in a National Health Insurance Service general health checkup are commonly called questionnaire data. However, when the customer's health checkup data is retrieved, the questionnaire data from the checkup process cannot be retrieved. That data is therefore received through the questionnaire input unit 32 described above. The user's height and weight are also health checkup items, but for the same reason they are included in the questionnaire in this service.

The checkup data normally measured during a health checkup includes waist circumference, systolic blood pressure, diastolic blood pressure, fasting blood glucose, total cholesterol, HDL cholesterol, LDL cholesterol, triglycerides, hemoglobin, urine protein, serum creatinine, serum GOT, serum GPT, gamma-GTP, and the like.

Among the retrieved health checkup results, the four blood lipid values (total cholesterol, HDL cholesterol, LDL cholesterol, and triglycerides) have, since a certain point in time, been tested intermittently rather than every year. These values may therefore be missing when the most recent health checkup data is retrieved.

Some items may also be omitted by the examination center during the actual checkup. In addition, when the customer enters the data directly, missing data can arise because the customer is unable to enter some of the checkup values.

Next, the similar variable selection unit 36 selects the extraction variables (or similar group extraction variables) used to extract a similar group from the sample data.

Obtaining an accurate disease occurrence prediction basically requires not only the user's demographic information and a variety of health checkup information, but also the user's past disease history, family disease history, and everyday lifestyle habits. In practice, however, a user cannot remember all of this information, and in many cases it is difficult for the user to obtain their own health information, or the disease prediction model is hard to use because of issues such as personal information protection.

Therefore, even at some cost in accuracy, the user should be able to use the disease prediction model more easily with only simple questionnaire and survey items that can be recalled on the spot (age, weight, height, everyday lifestyle habits, personal and family disease history, etc.). In other words, since it is impractical to feed every indicator related to the patient's health into a future disease occurrence prediction model, it would be very convenient if results could be derived from only the questionnaire and survey items that are easy to answer, or if all of the remaining health-related values that the user could not provide could be estimated from those simple items.

The greater the influence that a missing value to be estimated has on performance within the disease prediction model, the more it becomes a major cause of degraded performance in actual classification or prediction. Accurate estimation of such missing values is therefore an important factor in overall model performance and in the output.

In the present invention, a group similar to the customer in its characteristics is extracted from the sample data (sample cohort DB), and the customer's missing values are corrected by replacing them with a representative value (mean, median, etc.) of the extracted similar group.

When extracting the similar group, however, the following factors must be considered in choosing the extraction variables.

First, the data of the extracted group should have as few missing values as possible, because a group whose data has many missing values cannot form an accurate and objective similar group.

In addition, the extraction variables should be those that change the least (even over time) among the training variables used to develop the disease prediction model. For example, demographic variables (region of residence, income level, insurance subscription type, etc.), health checkup results, and lifestyle information can change considerably with time and circumstances. Such variables should therefore be excluded as far as possible when estimating replacement values.

In addition, the number of extraction variables should be minimized to reduce the load of computing the estimates. Using all of the training variables from the development of the actual disease prediction model as the basis for estimating missing values has the practical limitation of excessively increasing the running time of the technique. Since the load grows with the number of variables, it is efficient to keep that number to a minimum.

The similar variable selection unit 36 receives predefined candidate variables. That is, an administrator or the like defines the candidate variables in advance by choosing variables that change little or are not missing. Preferably, the candidate variables consist of gender, age, height, weight, BMI, past medical history, and the like.

Next, the similar variable selection unit 36 forms combinations of the candidate variables, for example {gender, age}, {gender, age, height}, {gender, age, weight}, and {gender, age, BMI}.

Next, the similar variable selection unit 36 receives a predetermined test set. The test set consists of the data of a large number of prospective customers.

Next, for each combination, the similar variable selection unit 36 estimates the checkup data of the customers in the test set and calculates the differences between the estimated data and the actual data. The test set includes each customer's demographic information, checkup data, onset data, and the like. A particular item (variable) of the checkup data is therefore treated as if it were missing, and its value is estimated. The estimated value is then compared with the actual value in the test set to obtain the difference.

Next, the similar variable selection unit 36 averages the differences over the test set for each combination and sets that average as the error of the combination.

Next, the similar variable selection unit 36 screens for the combinations whose error falls within a predetermined error range and, among the screened combinations, selects the combination with the smallest error relative to the actual values.

Finally, the similar variable selection unit 36 selects the variables belonging to the smallest combination as the similar group extraction variables.
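
The selection procedure described above can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function names, the record layout (dictionaries with already-binned variables), the use of the group median as the estimate, the mean absolute difference as the error statistic, and the tie-breaking rule are all hypothetical. The final step follows the reading in claim 1, where the combination with the fewest variables is chosen among those within the error range; ties are broken here by the lower error.

```python
from itertools import combinations
from statistics import mean, median


def estimate(sample, record, combo, target):
    """Median of `target` over sample rows that match `record` on every variable in `combo`."""
    values = [
        row[target]
        for row in sample
        if row.get(target) is not None and all(row[v] == record[v] for v in combo)
    ]
    return median(values) if values else None


def combo_error(sample, test_set, combo, target):
    """Mean absolute difference between estimated and actual values over the test set."""
    diffs = []
    for record in test_set:
        est = estimate(sample, record, combo, target)
        if est is not None:
            diffs.append(abs(est - record[target]))
    return mean(diffs) if diffs else float("inf")


def select_extraction_variables(sample, test_set, candidates, target, max_error, min_size=2):
    """Keep combinations whose error is within max_error, then pick the one
    with the fewest variables (assumed tie-break: lower error)."""
    scored = []
    for size in range(min_size, len(candidates) + 1):
        for combo in combinations(candidates, size):
            err = combo_error(sample, test_set, combo, target)
            if err <= max_error:
                scored.append((len(combo), err, combo))
    return min(scored)[2] if scored else None
```

In this sketch the target item is treated as missing for every test record, which mirrors the step of assuming a checkup variable is missing and comparing the estimate with the actual value.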

Preferably, gender, age, and BMI are selected as the similar group extraction variables. When factors other than gender, age, and BMI were added as missing-value estimation criteria, neither the importance of the added variables nor the statistics and performance of the resulting output improved markedly. Moreover, when more variables than these three are considered, the number of variables of each type (continuous and categorical) becomes too large, so that some of them already contain missing values and cannot contribute to the estimate, or the time needed to estimate the missing values becomes very large.

Next, the missing value correction unit 34 corrects the missing data in the customer's health checkup results. In particular, the missing value correction unit 34 corrects the missing values among the data for the input variables of the disease prediction model.

That is, the missing value correction unit 34 extracts a similar group from the sample data and corrects the customer's missing data by setting it to the representative value of the similar group.

The similar group is the set of data extracted from the sample data according to the similar group extraction variables. In other words, the similar group is a subset of the sample data. The data extracted are those whose values of the extraction variables (similar group extraction variables) are the same as the customer's values (that is, fall in the same range).

The size of the similar group may also be fixed in advance at a limited number. In that case, data having the same values are extracted at random. For example, the similar group may be limited to 10,000 records, or all of the data in the database may be used.

The similar group extraction variables consist of a plurality of predetermined variables. Preferably, they consist of gender, age, and BMI. BMI is an index for assessing the degree of obesity and is calculated as weight divided by the square of height. A record is extracted when all of its similar group extraction variables have the same values as the customer's (that is, fall in the same ranges).

That is, the sample data whose gender, age, and BMI are the same as the customer's are extracted as the similar group. Whether an extraction variable is "the same" is decided by dividing its values into ranges or categories and treating values that fall in the same range or category as identical. For example, age can be divided into 5-year bands from 19 years to 85 years or older, and gender into two categories, male and female. BMI can be divided into six stages in total, from underweight to stage 3 obesity, with underweight defined as below 18.5 and stage 3 obesity as 35 or above. In this case, the data of a 43-year-old woman with a BMI of 31 and the data of a 41-year-old woman with a BMI of 33 are classified into the same group (similar group).
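
As a rough illustration of this range-based matching, the sketch below maps a person to a (gender, age band, BMI stage) key. Only the BMI limits 18.5 and 35, the six-stage count, and the 5-year bands from 19 to 85+ come from the text. The intermediate BMI cut-offs (23, 25, 30) and the alignment of age bands to multiples of five are assumptions made for the example, and all function names are hypothetical.

```python
# Upper bounds (exclusive) of the first five BMI stages.
# 18.5 and 35.0 are from the text; 23.0, 25.0 and 30.0 are assumed cut-offs.
BMI_CUTS = [18.5, 23.0, 25.0, 30.0, 35.0]


def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by the square of height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)


def bmi_stage(value):
    """Stage 0 (underweight) to stage 5 (stage 3 obesity), six stages in total."""
    for stage, cut in enumerate(BMI_CUTS):
        if value < cut:
            return stage
    return len(BMI_CUTS)


def age_band(age):
    """Lower edge of an assumed 5-year band; ages of 85 and above share one band."""
    if age < 19:
        raise ValueError("the description only covers ages 19 and above")
    return min(age, 85) // 5 * 5


def group_key(sex, age, bmi_value):
    """Records with equal keys are treated as having 'the same' extraction variables."""
    return (sex, age_band(age), bmi_stage(bmi_value))
```

Under these assumed cut-offs, the two women in the example above receive the same key and therefore fall into the same similar group.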

In addition, when the similar group is extracted, a patient group and a control group are obtained separately to form it. Preferably, the similar group is composed by combining equal numbers of patient-group and control-group records.

The patient group is the set of data in the sample data for patients who were diagnosed with the disease, and the control group is the set of data for people who were not. Specifically, the current-year questionnaire data and the health checkup data from two years earlier of patients who were first diagnosed with the disease code of the disease to be predicted (e.g., type 2 diabetes: E11) as the primary diagnosis are set as the 'patient group' dataset. The 'control group' dataset consists of the data of people who, over the 12-year total follow-up period of the sample data (or sample cohort DB), were never diagnosed with the disease or related diseases as either a primary or a secondary diagnosis (e.g., everyone with any code in E00-99, the metabolic syndrome range that includes diabetes, is excluded). The controls are sampled at random so that their number equals the number of records in the patient group dataset. The control data likewise consists of current-year questionnaire data and health checkup data from two years earlier.

In particular, the similar group preferably sets the ratio of patient group to control group in the sample data to 5:5. That is, the training data for each disease has a 5:5 ratio of patients to controls. However, if the record with the missing value is patient-group data, the replacement value is computed only from patient-group data, and conversely, if it is control-group data, the replacement value is computed only from control-group data.
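
A minimal sketch of this balanced extraction is given below, assuming each sample record is a dictionary with a precomputed "group" key and a boolean "diagnosed" flag. Both field names, the function names, and the per-arm size limit are hypothetical and chosen only for illustration.

```python
import random


def build_similar_group(sample, customer_group, per_arm, seed=None):
    """Randomly draw equal numbers of patients and controls that share the customer's group key."""
    rng = random.Random(seed)
    matches = [r for r in sample if r["group"] == customer_group]
    patients = [r for r in matches if r["diagnosed"]]
    controls = [r for r in matches if not r["diagnosed"]]
    # Keep the two arms the same size (the 5:5 ratio described above).
    n = min(per_arm, len(patients), len(controls))
    return rng.sample(patients, n), rng.sample(controls, n)


def imputation_pool(patients, controls, record_is_patient):
    """Replacement values are computed only from the arm the incomplete record belongs to."""
    return patients if record_is_patient else controls
```

Passing a fixed seed makes the random draw repeatable, which can be convenient when comparing runs.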

The representative value of the similar group is set to the mode of the corresponding values for categorical data and to the median of the corresponding values for continuous data.

That is, the basic principle of missing value correction is to insert the value that is most representative of the population. If a missing value occurs in the data entered by the user, it is replaced, over the whole training dataset built for each disease, by the median for continuous variables (e.g., height, weight, blood pressure, blood glucose) and by the mode for categorical variables (e.g., region of residence, income level, drinking, smoking, exercise).

For example, suppose that the similar group consists of 10,000 records in total and that the customer's missing items are low-density cholesterol and triglycerides. Since both are continuous values, the customer's missing data is corrected by setting each item to the median of that item over the 10,000 records of the similar group.
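
The replacement rule can be sketched as follows, assuming records are dictionaries in which a missing item is stored as None. The field names, the set of categorical fields, and the function names are hypothetical and serve only as an example.

```python
from statistics import median, mode

# Assumed set of categorical items; everything else is treated as continuous.
CATEGORICAL = {"smoking", "drinking", "exercise"}


def representative(values, categorical):
    """Mode for categorical data, median for continuous data."""
    return mode(values) if categorical else median(values)


def impute(customer, similar_group, fields):
    """Return a copy of `customer` with missing fields filled from the similar group."""
    filled = dict(customer)
    for field in fields:
        if filled.get(field) is None:
            observed = [r[field] for r in similar_group if r.get(field) is not None]
            if observed:  # leave the gap if the group has no observed value either
                filled[field] = representative(observed, field in CATEGORICAL)
    return filled
```

Items the customer did provide are left untouched, and the original record is not modified.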

In summary, the missing value correction unit 34 extracts the customer's similar group from the sample data, obtains the representative value of the similar group, and corrects the missing data by setting it to that representative value.

Next, the disease prediction unit 35 predicts the customer's disease using the customer's demographic information and the corrected health checkup data.

Preferably, the disease prediction unit 35 predicts the customer's disease using a disease prediction model. Given the values of predetermined input variables, the disease prediction model outputs the onset probability for each predetermined disease variable.

In particular, the disease prediction model is implemented as a neural network or the like, whose internal parameters are learned from training data. Once trained, the model outputs the onset probability of each disease when it receives the values of the input variables.

The disease prediction model is the product of an artificial neural network trained by machine learning on millions of records, including health checkup results, demographic factors, and lifestyle habits, from thousands to tens of thousands of people selected to be representative of domestic patients for each disease. Its output can change substantially as the user steadily improves their health behavior.

FIG. 3 shows the input variables of the disease prediction model. As shown in FIG. 3, there are 44 input variables in total. Among them, variables 15 to 18 (the blood lipid items) have a high proportion of missing values.

Preferably, the output variables consist of the onset probabilities of 12 diseases (the 12 major diseases). The diseases include breast cancer, the five major cancers, all cancers combined, cerebrovascular disease, osteoporosis, cataract, hypertension, obesity, diabetes, COPD (chronic obstructive pulmonary disease), joint disease, and dyslipidemia.

In addition, when the output value for a disease is greater than or equal to a predetermined reference probability, the disease prediction unit 35 selects that disease as a disease the customer may develop.
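
The thresholding step can be illustrated with a short sketch. The description does not give the reference probability, so the default of 0.5 used here, like the function name and the dictionary format of the model output, is purely an assumption.

```python
def flag_diseases(probabilities, reference=0.5):
    """Return (disease, probability) pairs at or above the reference probability,
    ordered from most to least likely."""
    flagged = [(d, p) for d, p in probabilities.items() if p >= reference]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A disease whose probability equals the reference value is included, matching the "greater than or equal to" wording above.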

The invention made by the present inventors has been described above in detail with reference to the embodiments. The present invention is not, however, limited to those embodiments, and it can of course be modified in various ways without departing from its gist.

10: smart mobile terminal 20: insurance design client
30: insurance design server 31: sample data collection unit
32: questionnaire input unit 33: health checkup collection unit
34: missing value correction unit 35: disease prediction unit
36: similar variable selection unit
40: database 41: sample data DB
42: customer health information DB 43: prediction information DB
50: prediction server 80: network

Claims

A disease prediction service system for insurance design, comprising:
a sample data collection unit that collects demographic information, health checkup data, and onset data of past patients as sample data;
a questionnaire input unit that receives the customer's demographic information through a questionnaire;
a health checkup collection unit that collects the customer's health checkup data;
a missing value correction unit that extracts, from the sample data, a similar group similar to the customer and corrects missing data in the customer's health checkup data by replacing it with a representative value of the extracted similar group; and
a disease prediction unit that predicts the customer's disease using the customer's demographic information and the corrected health checkup data,
wherein the system further includes a similar variable selection unit that selects similar group extraction variables,
wherein the missing value correction unit extracts, as the similar group, data having the same values as the customer's values of the similar group extraction variables, and
wherein the similar variable selection unit forms a plurality of combinations of predetermined candidate variables, estimates, for each combination, the customers' examination data for a predetermined test set, obtains the difference values between the estimated examination data and the actual data of the test set, sets a statistic of the difference values over the whole test set as the error of each combination, extracts the combinations whose error is within a predetermined error range, and selects the variables of the combination having the fewest variables among the extracted combinations as the similar group extraction variables.

delete

The system of claim 1,
wherein the similar group extraction variables consist of gender, age, and BMI.

The system of claim 1,
wherein the missing value correction unit forms the similar group so that it has a predetermined number of data items, randomly extracting from the sample data the data having the same values as the customer's values of the similar group extraction variables.

6. The system of claim 5,
wherein the missing value correction unit forms the similar group by obtaining a patient group and a control group separately and combining equal numbers of each, the control group being extracted from the sample data of patients who have not been diagnosed with the disease, and the patient group being extracted from the sample data of patients who have been diagnosed with the disease.

The system of claim 1,
wherein the representative value of the similar group is set to the mode of the corresponding values for categorical data and to the median of the corresponding values for continuous data.

The system of claim 1,
wherein the demographic information includes gender, age, region of residence, insurance subscription type, income decile, disability, height, and weight, and
the health checkup data includes waist circumference, systolic blood pressure, diastolic blood pressure, fasting blood glucose, total cholesterol, high-density cholesterol, low-density cholesterol, triglycerides, hemoglobin, urine protein, serum creatinine, serum GOT, serum GPT, and gamma-GTP.