CN110993103B - Method for establishing disease risk prediction model and method for recommending disease insurance product

Info

Publication number: CN110993103B
Application number: CN201911193197.4A
Authority: CN
Inventors: 王培�; 郭子颢; 郭小川; 高惠庭; 李春萌
Original assignee: Sunshine Life Insurance Co ltd
Current assignee: Sunshine Life Insurance Co ltd
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2023-06-02
Anticipated expiration: 2039-11-28
Also published as: CN110993103A

Abstract

The invention relates to a method for establishing a disease risk prediction model and a method for recommending a disease insurance product. The method comprises: obtaining historical diagnosis and treatment data of medical insurance participants in a preset area; classifying and sampling the historical diagnosis and treatment data to obtain a sample data set, wherein each sample data comprises historical disease diagnosis coding information of the sample within a preset time range; preprocessing the sample data set to remove invalid data; clustering the historical disease diagnosis coding information of all samples in the preprocessed sample data set according to disease attributes and lesion sites to obtain disease clustering feature labels; screening the disease clustering feature labels with a preset feature selection algorithm to obtain serious disease clustering feature labels; and establishing, by means of an extreme gradient boosting (XGBoost) algorithm, a disease risk prediction model corresponding to the preset serious disease according to the serious disease clustering feature labels, gender, age and diagnosis behavior information.

Description

Method for establishing disease risk prediction model and method for recommending disease insurance product

Technical Field

The invention relates to the field of insurance, in particular to a method for establishing a disease risk prediction model and a method for recommending disease insurance products.

Background

With the development of society, consumers' awareness of insurance has gradually improved. Consumer demand for insurance is also becoming more fine-grained, and the simple approach of pricing products along only two dimensions, age and gender, is too mechanical.

At present, in the insurance industry, the risk models or rules used to judge a customer's health risk are usually compiled from the industry's traditional experience. They cannot rule out adverse selection, in which an applicant conceals a health condition when applying, so insurance products recommended to customers on the basis of such traditional models often suffer from low accuracy.

Disclosure of Invention

In view of the above, a method for establishing a disease risk prediction model and a method for recommending a disease insurance product are provided. Historical diagnosis and treatment data of medical insurance participants in a preset area are classified, sampled and processed to extract the serious disease clustering feature labels corresponding to a preset serious disease, and a disease risk prediction model corresponding to the preset serious disease is then established by means of an extreme gradient boosting (XGBoost) algorithm according to the serious disease clustering feature labels, gender, age and diagnosis behavior information. The risk of a disease insurance applicant can thus be evaluated accurately, and a method for recommending a disease insurance product can further be provided on the basis of the disease risk prediction model, which greatly improves the accuracy of insurance product promotion.

A method for establishing a disease risk prediction model comprises the following steps:

acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;

classifying and sampling historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is the ratio between the number of the positive samples and the number of the negative samples of the preset serious disease, and each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range;

preprocessing the sample data set to remove invalid data, and clustering the historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to the corresponding disease attributes and lesion sites to obtain corresponding disease clustering feature labels;

screening the disease clustering feature labels by adopting a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease;

and establishing a disease risk prediction model corresponding to the preset serious disease by means of an extreme gradient boosting (XGBoost) algorithm according to the serious disease clustering feature labels, the gender, the age and the diagnosis behavior information.

In one embodiment, the step of classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set comprises the following steps:

respectively classifying the historical diagnosis and treatment data according to the rule that the gender and the preset age interval are the same, so as to obtain an initial data set;

respectively screening first preset number of positive sample data and second preset number of negative sample data of preset serious diseases from the initial data set according to a preset comparison proportion, wherein the ratio of the first preset number to the second preset number is equal to the preset comparison proportion;

and obtaining a corresponding sample data set according to the positive sample data and the negative sample data.

In one embodiment, the establishing method further comprises:

screening the serious disease clustering feature labels again in combination with the related precursor diseases corresponding to the preset serious disease;

and establishing a disease risk prediction model corresponding to the preset serious disease by means of an extreme gradient boosting (XGBoost) algorithm according to the re-screened serious disease clustering feature labels, gender, age and diagnosis behavior information.

In one embodiment, the preset contrast ratio is set to

In addition, a recommendation method of the disease insurance product is provided, and the recommendation method comprises the following steps of:

designing a corresponding questionnaire according to a disease risk prediction model of a preset area;

acquiring basic data of a disease insurance applicant according to the questionnaire;

predicting the disease insurance applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result;

and recommending corresponding disease insurance products for the disease insurance applicant according to the disease risk prediction result.

In addition, a design method of a disease insurance product is provided, and the design method comprises the following steps of:

according to the disease risk prediction model, disease risk prediction is respectively carried out on medical insurance participants in a preset area, and corresponding disease risk prediction probability is obtained;

and generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex and age, and designing a corresponding disease insurance product according to the disease insurance product rate table.

In one embodiment, the step of generating a corresponding disease insurance product rate table based on the disease risk prediction probability, gender, and age comprises:

dividing medical insurance participants in a preset area into a plurality of risk level groups according to the disease risk prediction probability;

and further dividing each risk level group by sex and age, according to the disease occurrence probability distribution of that risk level group, to generate a corresponding disease insurance product rate table.
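The two steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the cut points in `RISK_THRESHOLDS`, the 5-year age buckets, and the use of the mean predicted probability as the value of each rate-table cell are all hypothetical choices that the text does not specify, and a real rate table would involve further actuarial adjustment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

# Hypothetical cut points separating low / medium / high risk levels.
RISK_THRESHOLDS = [0.05, 0.20]


def risk_level(probability: float,
               thresholds: Sequence[float] = RISK_THRESHOLDS) -> int:
    """Return 0 for the lowest risk level, len(thresholds) for the highest."""
    return sum(probability >= t for t in thresholds)


def rate_table(people: Iterable[Tuple[str, int, float]]
               ) -> Dict[Tuple[int, str, int], float]:
    """Group (sex, age, predicted probability) rows into rate-table cells.

    Each cell is keyed by (risk level, sex, lower bound of the 5-year age
    bucket, with ages of 80 and above merged) and holds the mean predicted
    probability of that cell (an assumed stand-in for a rate).
    """
    cells: Dict[Tuple[int, str, int], List[float]] = defaultdict(list)
    for sex, age, probability in people:
        bucket = min(age // 5, 16) * 5
        cells[(risk_level(probability), sex, bucket)].append(probability)
    return {key: sum(values) / len(values) for key, values in cells.items()}
```

Calling `rate_table` on the predicted probabilities of all medical insurance participants in the preset area yields one value per (risk level, sex, age bucket) cell, which could then be used as the starting point for pricing.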

In addition, a device for establishing a disease risk prediction model is provided, the device comprising:

the data acquisition unit is used for acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;

the data set generation unit is used for carrying out classified sampling processing on the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is the ratio between the number of the positive samples and the number of the negative samples of the preset serious disease, and each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range;

the clustering feature label generation unit is used for preprocessing the sample data set to remove invalid data, and clustering the historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels;

the serious disease clustering feature label generation unit is used for screening the disease clustering feature labels by adopting a preset feature selection algorithm so as to obtain the serious disease clustering feature labels corresponding to the preset serious disease;

the prediction model generation unit is used for establishing a disease risk prediction model corresponding to the preset serious disease by means of an extreme gradient boosting (XGBoost) algorithm according to the serious disease clustering feature labels, the gender, the age and the diagnosis behavior information.

Furthermore, there is provided a device terminal comprising a memory for storing a computer program and a processor for running the computer program to cause the device terminal to perform the above-described establishing method.

Further, a readable storage medium is provided, which stores a computer program which, when run by a processor, performs the above-described establishing method.

According to the method for establishing the disease risk prediction model, historical diagnosis and treatment data of medical insurance participants in the preset area are obtained and then classified and sampled according to gender, the preset age interval and the preset comparison proportion to obtain the sample data set. The sample data set comprises positive sample data and negative sample data of the preset serious disease, the preset comparison proportion is the ratio between the number of positive samples and the number of negative samples of the preset serious disease, and each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of the sample within a preset time range. The sample data set is preprocessed to remove invalid data, and the historical disease diagnosis coding information of all samples in the preprocessed sample data set is clustered according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels. The disease clustering feature labels are screened with a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease, and the disease risk prediction model corresponding to the preset serious disease is established by means of the extreme gradient boosting (XGBoost) algorithm according to the serious disease clustering feature labels, gender, age and diagnosis behavior information. The risk of a disease insurance applicant can thus be evaluated accurately, which gives insurance companies a sound basis for designing insurance products, and a suitable method for recommending disease insurance products can be established on the basis of the disease risk prediction model, improving the accuracy and suitability of product promotion.

Drawings

In order to more clearly illustrate the technical solutions of the present invention, the drawings that are required for the embodiments will be briefly described, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope of the present invention. Like elements are numbered alike in the various figures.

FIG. 1 is a flow chart of a method for establishing a disease risk prediction model according to one embodiment;

FIG. 2 is a graph of receiver operating characteristics of a disease risk prediction model provided in one embodiment;

FIG. 3 is a flow chart of a method of obtaining a sample dataset provided in one embodiment;

FIG. 4 is a flowchart of a method for establishing a disease risk prediction model according to another embodiment;

FIG. 5 is a flow chart of a method of recommending a disease insurance product according to one embodiment;

FIG. 6 is a flow chart of a method of designing a disease insurance product according to one embodiment;

FIG. 7 is a flow chart of a method of generating a disease insurance product tariff table provided in one embodiment;

FIG. 8 is a block diagram of a device for establishing a disease risk prediction model according to an embodiment.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.

The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.

Hereinafter, various embodiments of the present disclosure will be more fully described. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that: there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein, but rather the disclosure is to be interpreted to cover all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.

The terms "comprises," "comprising," "including," or any other variation thereof, as used in the various embodiments of the present invention, are intended only to denote a specific feature, number, step, operation, element, component, or combination of the foregoing, and are not to be understood as excluding the presence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.

Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the invention belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having a meaning that is the same as the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in connection with the various embodiments of the invention.

As shown in fig. 1, a method for establishing a disease risk prediction model is provided, where the method includes:

step S110, historical diagnosis and treatment data of medical insurance participants in a preset area are obtained.

Because the environment, eating habits and medical level differ between areas, the health conditions of their populations differ considerably. When processing the data, it is therefore necessary to obtain the historical diagnosis and treatment data of the population of a specific area, and the historical diagnosis and treatment data of the medical insurance participants in the preset area are generally accurate.

The historical diagnosis and treatment data generally comprise the gender, age and diagnosis behavior information of each patient, where the diagnosis behavior information generally comprises the grade of the hospital visited, the visit frequency, the number of visits, the visit time and the accumulated visit cost.

For a positive patient, the age of the extracted positive sample is the age at which the serious disease was first diagnosed, and the hospital grade, visit frequency, number of hospitalizations, visit time, visit cost and corresponding historical disease diagnosis coding information are the data from the two years before the date of that first diagnosis. For example, if a patient was first diagnosed with cancer on December 5, 2018, the diagnosis and treatment data extracted for that positive sample cover the period from December 5, 2016 to December 4, 2018.

For a negative sample (a patient in the database who has never been diagnosed with a serious disease), the cut-off year of the current database is pushed back two years to obtain a starting point, and the diagnosis and treatment data of the two years before that starting point are taken. For example, if the database currently in use contains data up to December 31, 2018, the extracted data are generally the diagnosis and treatment data of 2015 and 2016.
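The two extraction windows described above can be expressed as a small date calculation. The following is a minimal sketch; the function names are hypothetical, and the handling of February 29 is an assumption that the text does not address.

```python
from datetime import date, timedelta
from typing import Tuple


def shift_years(d: date, years: int) -> date:
    """Shift a date by whole years; February 29 falls back to February 28."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def positive_window(first_diagnosis: date) -> Tuple[date, date]:
    """Two years ending on the day before the first serious-disease diagnosis."""
    return shift_years(first_diagnosis, -2), first_diagnosis - timedelta(days=1)


def negative_window(database_cutoff: date) -> Tuple[date, date]:
    """Two years of data ending two years before the database cut-off date."""
    end = shift_years(database_cutoff, -2)
    start = shift_years(end, -2) + timedelta(days=1)
    return start, end


print(positive_window(date(2018, 12, 5)))
print(negative_window(date(2018, 12, 31)))
```

With the dates used in the examples above, the positive window runs from December 5, 2016 to December 4, 2018, and the negative window covers calendar years 2015 and 2016.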

The historical diagnosis and treatment data are usually data from which the patient's sensitive information (personal information such as identity card number and address) has been removed.

Step S120, classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is the ratio between the number of the positive samples and the number of the negative samples of the preset serious disease, and each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range.

The historical diagnosis and treatment data are classified and sampled according to gender, the preset age interval and the preset comparison proportion to obtain the sample data set, which comprises positive sample data and negative sample data of the preset serious disease in numbers set according to the preset comparison proportion.

In addition to the historical disease diagnosis coding information of the sample within the preset time range, each sample data generally includes gender, age and diagnosis behavior information. The diagnosis behavior information generally includes the grade of the hospital visited, the visit frequency, the number of hospitalizations, the visit time and the accumulated visit cost, and the historical disease diagnosis coding information generally uses ICD-10 codes.

An ICD-10 code is a standardized representation of the physician's diagnostic description of the patient, which avoids the same disease being described with different wording.

For positive sample data, the preset time range matches the time range of the historical diagnosis and treatment data extracted for a positive patient, that is, the two years before the date on which the serious disease was first diagnosed.

For negative sample data, the preset time range matches the time range of the historical diagnosis and treatment data extracted for a negative sample (a patient in the database who has never been diagnosed with a serious disease): the cut-off year of the current database is pushed back two years to obtain a starting point, and the range covers the two years before that starting point.

The preset age intervals may each span 5 years, for example 0-4 years, 5-9 years, …, 80+ years.
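As a minimal sketch of this bucketing, an age can be mapped to its interval label as follows. The function name and the label format are hypothetical; only the 5-year width and the open-ended 80+ interval come from the example above.

```python
def age_interval(age: int, width: int = 5, top: int = 80) -> str:
    """Map an age in years to a label such as '0-4', '5-9', or '80+'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print([age_interval(a) for a in (0, 4, 5, 37, 79, 80, 93)])
```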

Step S130, preprocessing the sample data set to remove invalid data, and clustering the historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels.

The historical disease diagnosis coding information of some sample data in the sample data set may be an empty set; such samples are removed directly during preprocessing. The historical disease diagnosis coding information of each remaining sample is then clustered according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels, which reduces the sparsity of the sample data.
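One way to picture this step is to map each ICD-10 code to a coarser group and drop samples that have no codes. The sketch below assumes that the clusters coincide with standard ICD-10 blocks, which the text does not state; the block table is a small hypothetical excerpt, and the function names are illustrative only.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Assumed cluster table: (letter, first number, last number, label).
ICD10_BLOCKS: List[Tuple[str, int, int, str]] = [
    ("D", 60, 64, "Aplastic and other anaemias"),
    ("D", 65, 69, "Coagulation defects, purpura and other haemorrhagic conditions"),
    ("D", 70, 77, "Other diseases of blood and blood-forming organs"),
    ("D", 80, 89, "Certain disorders involving the immune mechanism"),
    ("I", 80, 89, "Diseases of veins, lymphatic vessels and lymph nodes, "
                  "not elsewhere classified"),
    ("M", 30, 36, "Systemic connective tissue disorders"),
    ("R", 20, 23, "Symptoms and signs involving the skin and subcutaneous tissue"),
    ("R", 70, 79, "Abnormal findings on examination of blood, without diagnosis"),
]


def cluster_label(icd10_code: str) -> Optional[str]:
    """Return the cluster label for a code such as 'D61.9', or None if unmapped."""
    code = icd10_code.strip().upper()
    if len(code) < 3 or not code[1:3].isdigit():
        return None
    letter, number = code[0], int(code[1:3])
    for block_letter, low, high, label in ICD10_BLOCKS:
        if letter == block_letter and low <= number <= high:
            return label
    return None


def build_cluster_features(samples: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Drop samples with an empty code set, then collapse codes into cluster labels."""
    features: Dict[str, Set[str]] = {}
    for sample_id, codes in samples.items():
        codes = list(codes)
        if not codes:  # preprocessing: remove samples with no diagnosis codes
            continue
        labels = (cluster_label(code) for code in codes)
        features[sample_id] = {label for label in labels if label is not None}
    return features
```

Because many distinct codes collapse into one label, each sample ends up with a short set of labels instead of a long, mostly empty vector of individual codes, which is the sparsity reduction mentioned above.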

Step S140, screening the disease clustering feature labels by adopting a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease.

The preset feature selection algorithm is generally any one of a mutual information algorithm, a p-value algorithm and an information gain algorithm, and is used to screen the disease clustering feature labels to obtain the serious disease clustering feature labels corresponding to the preset serious disease.
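To illustrate the mutual information option, the sketch below scores each binary cluster label against the binary serious-disease label and keeps the labels whose score reaches a threshold. The threshold value and the function names are hypothetical; the text does not say how the cut-off is chosen.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def mutual_information(feature: Sequence[int], label: Sequence[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    feature_counts = Counter(feature)
    label_counts = Counter(label)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = feature_counts[x] / n
        p_y = label_counts[y] / n
        total += p_xy * math.log(p_xy / (p_x * p_y))
    return total


def select_features(columns: Dict[str, List[int]], label: Sequence[int],
                    threshold: float) -> List[str]:
    """Keep the cluster labels whose mutual information meets the threshold."""
    return [name for name, column in columns.items()
            if mutual_information(column, label) >= threshold]


label = [1, 1, 0, 0]
columns = {
    "Aplastic and other anaemias": [1, 1, 0, 0],  # tracks the label exactly
    "unrelated cluster": [1, 0, 1, 0],            # independent of the label
}
print(select_features(columns, label, threshold=0.1))
```

In this toy data the first column determines the label completely, so its score is ln 2, while the second column is statistically independent of the label and scores 0.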

Step S150, establishing a disease risk prediction model corresponding to the preset serious disease by means of an extreme gradient boosting (XGBoost) algorithm according to the serious disease clustering feature labels, the gender, the age and the diagnosis behavior information.

After the serious disease clustering feature labels are obtained, they are combined with the gender, age and diagnosis behavior information of each sample in the preprocessed sample data set to form the model feature factors used for building the model, and the disease risk prediction model corresponding to the preset serious disease is established by means of the extreme gradient boosting (XGBoost) algorithm.

In one embodiment, 70% of the data in the sample data set is first randomly extracted as a training sample data set, and the remaining 30% is used as a test sample data set. An extreme gradient boosting (XGBoost) algorithm is adopted as the training model, in which the objective function is binary logistic regression, the booster type is set to gradient boosted trees, the learning rate ranges from 0.001 to 0.3, and the maximum number of iterations ranges from 50 to 3000. A grid search algorithm traverses the hyperparameters set in the extreme gradient boosting algorithm, the training sample data set is used for pre-training, and the training effect of the model is evaluated by K-fold cross-validation, so that suitable model parameters are obtained through training and screening. The test sample data set is then predicted with the selected model parameters to obtain the corresponding disease risk prediction results, which are compared with the actual serious disease positive sample data, and the model is corrected continuously until training yields the corresponding disease risk prediction model.

In one embodiment, the learning rate in the above-mentioned extreme gradient boosting algorithm is set to any one of 0.001, 0.003, 0.01, 0.03, 0.1 and 0.3, and the maximum number of iterations is set to any one of 50, 100, 300, 500, 1000 and 3000.
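The 70/30 split, the hyperparameter grid and the K-fold evaluation described above can be sketched without committing to a particular library. In the sketch below, `fit_and_score` is a hypothetical callback that trains a model on the training fold and returns a validation score; in practice it might wrap an XGBoost classifier configured with a binary logistic objective and tree booster, but that wiring is an assumption and is not shown. The parameter names `learning_rate` and `n_estimators` and the toy scorer are likewise illustrative.

```python
import random
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

# Candidate values listed in the embodiment above.
LEARNING_RATES = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
MAX_ITERATIONS = [50, 100, 300, 500, 1000, 3000]

Params = Dict[str, float]
Scorer = Callable[[Params, List[int], List[int]], float]


def train_test_split(n_samples: int, train_fraction: float = 0.7,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split sample indices into training and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_folds(indices: Sequence[int], k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (training, validation) index pairs for cross-validation."""
    folds = [list(indices[i::k]) for i in range(k)]
    pairs = []
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((training, validation))
    return pairs


def grid_search(train_indices: Sequence[int], k: int,
                fit_and_score: Scorer) -> Tuple[Params, float]:
    """Try every (learning rate, iteration count) pair; keep the best mean score."""
    best_params: Params = {}
    best_score = float("-inf")
    for learning_rate, n_estimators in product(LEARNING_RATES, MAX_ITERATIONS):
        params = {"learning_rate": learning_rate, "n_estimators": n_estimators}
        scores = [fit_and_score(params, training, validation)
                  for training, validation in k_folds(train_indices, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_scorer(params: Params, training: List[int], validation: List[int]) -> float:
    """Stand-in for real training; it simply peaks at (0.1, 300)."""
    return (-abs(params["learning_rate"] - 0.1)
            - abs(params["n_estimators"] - 300) / 1000)


train_idx, test_idx = train_test_split(100)
best_params, best_score = grid_search(train_idx, k=5, fit_and_score=toy_scorer)
print(best_params)
```

The test indices are deliberately left untouched by the grid search, so that the final comparison against the actual positive samples is made on data the selected parameters have not seen.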

In one embodiment, the receiver operating characteristic (ROC) curve obtained by predicting the sample data set with the disease risk prediction model is shown in FIG. 2. The true positive rate on the ordinate is the number of positive samples predicted as positive divided by the actual number of positive samples, and the false positive rate on the abscissa is the number of negative samples predicted as positive divided by the actual number of negative samples. The AUC (Area Under the ROC Curve) corresponding to the ROC curve in FIG. 2 is 0.86, which is clearly greater than 0.5, the area under the dashed straight line in FIG. 2, indicating that the disease risk prediction model performs well.
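The two rates and the AUC defined above can be computed directly from labels and predicted scores. The following is a minimal sketch with hypothetical function names and made-up toy data; it computes the AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which equals the area under the ROC curve.

```python
from typing import Sequence, Tuple


def rates_at_threshold(labels: Sequence[int], scores: Sequence[float],
                       threshold: float) -> Tuple[float, float]:
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    true_pos = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    false_pos = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return true_pos / positives, false_pos / negatives


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(rates_at_threshold(labels, scores, 0.5), auc(labels, scores))
```

A model that ranks samples at random scores about 0.5 on this measure, which is why the dashed diagonal serves as the baseline.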

The method for establishing the disease risk prediction model can accurately predict the risk of a disease insurance applicant and thereby gives insurance companies a sound basis for designing insurance products. When disease insurance products are subsequently recommended, a suitable recommendation method can be established on the basis of the disease risk prediction model, which improves the overall accuracy and suitability of disease insurance product promotion.

In one embodiment, the model feature factors corresponding to the disease risk prediction model are as follows in table 1:

Model feature factor
Sex
Age
Number of visits
Accumulated visit cost
Number of hospitalizations
Symptoms and signs involving the skin and subcutaneous tissue
Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified
Aplastic and other anaemias
Other diseases of blood and blood-forming organs
Abnormal findings on examination of blood
Coagulation defects, purpura and other haemorrhagic conditions
Systemic connective tissue disorders
Certain disorders involving the immune mechanism

TABLE 1

In one embodiment, as shown in fig. 3, step S120 includes:

Step S122, classifying the historical diagnosis and treatment data by the rule that gender and preset age interval are the same, to obtain an initial data set.

The historical diagnosis and treatment data are divided into two parts according to gender, and then each part is further divided according to a preset age interval, so that initial data sets corresponding to the parts can be obtained.

Step S124, screening out from the initial data set, according to the preset comparison proportion, a first preset number of positive sample data and a second preset number of negative sample data of the preset serious disease, wherein the ratio of the first preset number to the second preset number is equal to the preset comparison proportion.

After the initial data set is obtained, a first preset number of positive sample data of preset serious diseases can be further screened out from the initial data set, and a second preset number of negative sample data can be screened out from the initial data set, wherein the ratio of the first preset number to the second preset number is equal to a preset comparison proportion.

In one embodiment, for a preset severe disease, positive data corresponding to all patients suffering from the preset severe disease may be obtained from the historical diagnosis and treatment data first, then a first preset number of positive sample data are selected from the positive data, and then a second preset number of negative sample data are screened and extracted from the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion.

Wherein the preset contrast ratio setting range is usually

In one embodiment, the preset comparison proportion is 1:4.

Step S126, obtaining a corresponding sample data set according to the positive sample data and the negative sample data.
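Steps S122 to S126 can be sketched as stratified sampling. The sketch below groups records by gender and 5-year age interval and draws four negatives for every positive in each group, following the 1:4 embodiment. The record layout, the function name, and the rule of keeping only as many positives as the available negatives allow are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def sample_case_control(records: List[dict], negatives_per_positive: int = 4,
                        seed: int = 0) -> List[dict]:
    """Sample positives and negatives per (gender, age interval) at a fixed ratio.

    Each record is assumed to be a dict with the keys 'gender', 'age' and
    'positive' (True if the patient was diagnosed with the preset serious disease).
    """
    rng = random.Random(seed)
    strata: Dict[Tuple[str, int], Dict[str, List[dict]]] = defaultdict(
        lambda: {"pos": [], "neg": []})
    for record in records:
        # 5-year age buckets; ages of 80 and above share the last bucket.
        key = (record["gender"], min(record["age"] // 5, 16))
        strata[key]["pos" if record["positive"] else "neg"].append(record)

    sample: List[dict] = []
    for key in sorted(strata):
        positives, negatives = strata[key]["pos"], strata[key]["neg"]
        n_pos = min(len(positives), len(negatives) // negatives_per_positive)
        sample.extend(rng.sample(positives, n_pos))
        sample.extend(rng.sample(negatives, n_pos * negatives_per_positive))
    return sample
```

Sampling inside each gender and age group keeps the positive and negative samples comparable on those two dimensions, so the model is pushed to learn from diagnosis history and visit behavior instead of from demographics alone.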

In one embodiment, as shown in fig. 4, the above-mentioned establishing method further includes:

Step S160, screening the serious disease clustering feature labels again in combination with the related precursor diseases corresponding to the preset serious disease.

Each preset serious disease usually has symptoms corresponding to a certain related precursor disease before diagnosis, so that the cluster characteristic labels of the serious diseases can be further screened according to the related precursor disease corresponding to the preset serious disease.

Step S170, establishing a disease risk prediction model corresponding to the preset serious disease by means of an extreme gradient boosting (XGBoost) algorithm according to the re-screened serious disease clustering feature labels, gender, age and diagnosis behavior information.

After the re-screened serious disease clustering feature labels are obtained, they are combined with the gender, age and diagnosis behavior information of each sample in the preprocessed sample data set to form the model feature factors used for model training, and the disease risk prediction model corresponding to the preset serious disease is established by means of the extreme gradient boosting (XGBoost) algorithm.

In addition, as shown in fig. 5, a recommendation method of a disease insurance product is provided, where the recommendation method uses the disease risk prediction model, and the recommendation method includes:

step S210, basic data of a disease insurance applicant is acquired.

A targeted questionnaire is designed according to the disease risk prediction model of the preset area, which makes subsequent prediction with the disease risk prediction model more accurate.

For example, the related questionnaire can be designed through the model feature factors in the table 1, so that the information of the potential applicant can be specifically mined.

Of course, in addition to the questionnaires described above, the basic data of the disease insurance applicant may be obtained through other channels, such as interview recordings.

In one embodiment, the applicant fills in a questionnaire designed around the model feature factors in Table 1 above, so that the basic data of the disease insurance applicant corresponding to each model feature factor is obtained in a targeted manner.

Among these, the above questionnaires include, but are not limited to, internet questionnaires, weChat questionnaires, QQ questionnaires, and paper questionnaires.

Step S220, performing prediction for the disease insurance applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result.

After the basic data is obtained, features are extracted from it and input into the disease risk prediction model to perform prediction for the disease insurance applicant, yielding the corresponding disease risk prediction result.

Step S230, recommending corresponding disease insurance products for the disease insurance applicant according to the disease risk prediction result.

In this recommendation method, the basic data of the disease insurance applicant is acquired in a targeted manner, prediction is performed for the applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result, and a corresponding disease insurance product is then recommended to the applicant according to that result. This greatly improves the accuracy of insurance product recommendation and promotion and strengthens the marketing capability of the insurance company.
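The text does not specify how a disease risk prediction result is mapped to a product. One possible reading, sketched below, is a threshold lookup on the predicted probability. The tier bounds and product names are hypothetical and exist only to make the example concrete.

```python
def recommend_product(risk_probability, tiers):
    """Return the product of the first tier whose upper bound covers the probability."""
    for upper_bound, product in tiers:
        if risk_probability <= upper_bound:
            return product
    raise ValueError("risk probability exceeds every tier bound")


# Hypothetical tiers: (upper bound of predicted probability, product name).
TIERS = [
    (0.05, "standard plan"),
    (0.20, "enhanced screening plan"),
    (1.00, "specialist severe disease plan"),
]

# In practice the probability would come from the disease risk prediction model.
print(recommend_product(0.12, TIERS))
```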

In addition, as shown in Fig. 6, a method for designing a disease insurance product is also provided. The design method uses the disease risk prediction model described above and includes:

Step S310, performing disease risk prediction on each medical insurance participant in a preset area according to the disease risk prediction model to obtain a corresponding disease risk prediction probability.

Different areas, such as northern and southern areas or coastal and plain areas, differ in environment, eating habits, medical level and other respects, so the populations of these areas differ markedly in climate and diet. For example, the incidence of thyroid cancer among residents of coastal areas is higher than in non-coastal areas, and the incidence of intestinal cancer in southern areas is higher than in northern areas. In view of these regional differences, modeling is divided by area, and different disease risk prediction models are constructed for the populations of different areas.

In one embodiment, the preset area is the Beijing area, and risk prediction for the severe disease lymphoma can be performed on all medical insurance participants in the Beijing area according to the disease risk prediction model, so as to obtain their respective disease risk prediction probabilities.

Step S320, generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex and age, and designing a corresponding disease insurance product according to the disease insurance product rate table.

The disease risk prediction probability and age are important factors positively correlated with the disease insurance product rate, and gender also has an important influence. The corresponding disease insurance product rate table is therefore generated by jointly considering the disease risk prediction probability, gender and age, and the corresponding disease insurance product is designed according to that rate table.

With this design method, different disease risk prediction models can be adopted for different areas, and a disease insurance product suited to the preset area can be designed accordingly. The product thus matches the actual situation of the preset area, which reduces the risk of the disease insurance product, greatly improves its accuracy and adaptability, and improves the market competitiveness of the insurance company.

In one embodiment, as shown in Fig. 7, step S320 includes:

Step S322, dividing medical insurance participants in a preset area into a plurality of risk level groups according to the disease risk prediction probability.

After the disease risk prediction probability is obtained, the disease risk prediction probability can be further divided into a plurality of levels, and then medical insurance participants in a preset area are divided into a plurality of risk level groups.
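Dividing the probabilities into levels can be done with a sorted list of cut points. The sketch below assumes two hypothetical cut points (0.1 and 0.3), giving three levels; the text does not state how many levels are used or where the boundaries lie.

```python
from bisect import bisect_right


def risk_level(probability, cut_points):
    """Return the index of the level that the probability falls into."""
    # cut_points must be sorted ascending; level 0 is the lowest risk.
    return bisect_right(cut_points, probability)


def group_by_risk(probabilities, cut_points):
    """Map each risk level to the list of participant ids at that level."""
    groups = {}
    for person_id, probability in probabilities.items():
        groups.setdefault(risk_level(probability, cut_points), []).append(person_id)
    return groups


CUT_POINTS = [0.1, 0.3]  # hypothetical level boundaries
predicted = {"a": 0.05, "b": 0.2, "c": 0.5, "d": 0.02}  # hypothetical predictions
groups = group_by_risk(predicted, CUT_POINTS)
print(groups)
```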

Step S324, generating a corresponding disease insurance product rate table according to the disease occurrence probability distribution corresponding to each risk level group and the division of each risk level group into intervals by gender and age.

Each risk level group corresponds to a disease occurrence probability distribution, which refers to the probability distribution of actual occurrence of the preset severe disease within that risk level group.

Therefore, the corresponding disease insurance product rate table can be designed and generated according to the disease occurrence probability distribution corresponding to each risk level group and the division of each risk level group into intervals by gender and age.
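A rough sketch of the tabulation behind such a rate table follows. It computes the observed incidence of the preset severe disease for each cell of risk level, gender and age interval. The ten-year interval width and the record layout are assumptions, and converting an incidence into an actual premium rate would require further steps that the text does not describe.

```python
from collections import defaultdict


def age_band(age, width=10):
    """Return a label such as '40-49' for the interval containing the age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def incidence_table(records):
    """Observed incidence per (risk level, gender, age band) cell.

    records: iterable of (risk_level, gender, age, had_disease) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [cases, total]
    for level, gender, age, had_disease in records:
        cell = (level, gender, age_band(age))
        counts[cell][0] += int(had_disease)
        counts[cell][1] += 1
    return {cell: cases / total for cell, (cases, total) in counts.items()}


# Hypothetical records for illustration only.
records = [(2, "M", 45, True), (2, "M", 48, False), (0, "F", 33, False)]
table = incidence_table(records)
print(table)
```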

In addition, as shown in Fig. 8, there is also provided an apparatus for establishing a disease risk prediction model, the apparatus comprising:

The data acquisition unit 410 is configured to acquire historical diagnosis and treatment data of medical insurance participants in a preset area.

The data set generating unit 420 is configured to perform classification sampling processing on the historical diagnosis and treatment data according to gender, a preset age interval, and a preset contrast ratio to obtain a sample data set, where the sample data set includes positive sample data and negative sample data of a preset serious disease, the preset contrast ratio is a ratio between the number of positive samples and the number of negative samples of the preset serious disease, and each sample data includes historical disease diagnosis coding information and diagnosis behavior information of each sample within a preset time range.

The first feature tag generating unit 430 is configured to preprocess the sample data set to remove invalid data, and to cluster the historical disease diagnosis coding information of all samples in the preprocessed sample data set according to the corresponding disease attributes and focus positions, so as to obtain corresponding disease cluster feature tags.

The second feature tag generating unit 440 is configured to screen the disease cluster feature tags by using a preset feature selection algorithm to obtain a severe disease cluster feature tag corresponding to the preset severe disease.

The prediction model generating unit 450 is configured to establish a disease risk prediction model corresponding to the preset severe disease according to the severe disease cluster feature tag, the gender, the age and the diagnosis behavior information, in combination with an extreme gradient boosting (XGBoost) algorithm.
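As one possible illustration of the classification sampling performed by the data set generating unit 420, the sketch below groups records by gender and age interval, keeps every positive sample, and draws negatives at a fixed number per positive within each group. The record fields, the ten-year interval and the value of `negatives_per_positive` are hypothetical; the ratio used in the actual method is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_sample(records, negatives_per_positive, seed=0):
    """Sample negatives per positive within each (gender, age interval) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(lambda: {True: [], False: []})
    for rec in records:
        key = (rec["gender"], rec["age"] // 10)  # hypothetical ten-year interval
        strata[key][rec["positive"]].append(rec)

    sample = []
    for groups in strata.values():
        positives, negatives = groups[True], groups[False]
        # Never ask for more negatives than the stratum contains.
        wanted = min(len(negatives), len(positives) * negatives_per_positive)
        sample.extend(positives)
        sample.extend(rng.sample(negatives, wanted))
    return sample


records = [
    {"gender": "M", "age": 45, "positive": True},
    {"gender": "M", "age": 41, "positive": False},
    {"gender": "M", "age": 42, "positive": False},
    {"gender": "M", "age": 43, "positive": False},
    {"gender": "F", "age": 30, "positive": False},
    {"gender": "F", "age": 31, "positive": False},
]
sample = stratified_sample(records, negatives_per_positive=2)
print(len(sample))
```

A stratum with no positive samples contributes nothing under this scheme, which is why the two female records above are dropped.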

In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other manners as well. The apparatus embodiments described above are merely illustrative. For example, the flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules or units in various embodiments of the invention may be integrated together to form a single part, or the modules may exist alone, or two or more modules may be integrated to form a single part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a smart phone, a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto. Any variation or substitution that a person skilled in the art can readily conceive of falls within the scope of protection of the present invention.

Claims

1. A method for building a disease risk prediction model, the method comprising:

classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is the ratio between the number of positive samples and the number of negative samples of the preset serious disease, each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range, and the preset comparison proportion is set as

screening the disease cluster feature labels by adopting a preset feature selection algorithm to obtain the severe disease cluster feature labels corresponding to the preset severe diseases;

2. The method according to claim 1, wherein the step of classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set comprises:

classifying the historical diagnosis and treatment data according to the rule that the gender and the preset age interval are the same, so as to obtain an initial data set;

3. The method of establishing according to claim 1, further comprising:

screening the severe disease cluster feature labels again in combination with the related precursor diseases corresponding to the preset severe disease;

and establishing a disease risk prediction model corresponding to the preset severe disease according to the re-screened severe disease cluster feature labels, gender, age and the diagnosis behavior information, in combination with an extreme gradient boosting algorithm.

4. A method of recommending a disease insurance product, characterized by employing the disease risk prediction model of any one of claims 1 to 3, the method comprising:

acquiring basic data of a disease insurance applicant;

5. A method of designing a disease insurance product, characterized by using the disease risk prediction model according to any one of claims 1 to 3, the method comprising:

according to the disease risk prediction model, disease risk prediction is respectively carried out on medical insurance participants in the preset area, and corresponding disease risk prediction probability is obtained;

and generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex and age, and designing a corresponding disease insurance product according to the disease insurance product rate table.

6. The design method according to claim 5, wherein the step of generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex, and age comprises:

dividing medical insurance participants in the preset area into a plurality of risk level groups according to the disease risk prediction probability;

7. A disease risk prediction model building apparatus, wherein the building apparatus includes:

a data set generating unit, configured to perform classification sampling processing on the historical diagnosis and treatment data according to gender, a preset age interval, and a preset comparison proportion to obtain a sample data set, where the sample data set includes positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is a ratio between the number of positive samples and the number of negative samples of the preset serious disease, each sample data includes historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range, and the preset comparison proportion is set as follows

the cluster feature label generating unit is used for preprocessing the sample data set to remove invalid data, and clustering historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to corresponding disease attributes and focus positions to obtain corresponding disease cluster feature labels;

the severe disease cluster feature label generating unit is used for screening the disease cluster feature labels by adopting a preset feature selection algorithm to obtain severe disease cluster feature labels corresponding to the preset severe disease;

and the prediction model generation unit is used for establishing a disease risk prediction model corresponding to the preset serious disease according to the serious disease cluster characteristic label, the gender, the age and the diagnosis behavior information, in combination with an extreme gradient boosting algorithm.

8. A device terminal comprising a memory for storing a computer program and a processor that runs the computer program to cause the device terminal to perform the establishing method of any one of claims 1 to 3.

9. A readable storage medium, characterized in that it stores a computer program which, when executed by a processor, performs the establishing method of any one of claims 1 to 3.