CN110993103B - Method for establishing disease risk prediction model and method for recommending disease insurance product - Google Patents

Info

Publication number
CN110993103B
CN110993103B CN201911193197.4A CN201911193197A CN110993103B CN 110993103 B CN110993103 B CN 110993103B CN 201911193197 A CN201911193197 A CN 201911193197A CN 110993103 B CN110993103 B CN 110993103B
Authority
CN
China
Prior art keywords
disease
preset
sample data
risk prediction
diagnosis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911193197.4A
Other languages
Chinese (zh)
Other versions
CN110993103A (en)
Inventor
王培�
郭子颢
郭小川
高惠庭
李春萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Life Insurance Co ltd
Original Assignee
Sunshine Life Insurance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Life Insurance Co ltd
Priority to CN201911193197.4A
Publication of CN110993103A
Application granted
Publication of CN110993103B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Public Health (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Finance (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Accounting & Taxation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a method for establishing a disease risk prediction model and a method for recommending a disease insurance product. The method comprises: obtaining historical diagnosis and treatment data of medical insurance participants in a preset area; classifying and sampling the historical diagnosis and treatment data to obtain a sample data set, wherein each sample in the sample data set comprises the historical disease diagnosis coding information of that sample within a preset time range; preprocessing the sample data set to eliminate invalid data; clustering the historical disease diagnosis coding information of all samples in the preprocessed sample data set according to disease attributes and lesion sites to obtain disease clustering feature labels; screening the disease clustering feature labels with a preset feature selection algorithm to obtain serious disease clustering feature labels; and establishing a disease risk prediction model corresponding to the preset serious disease from the serious disease clustering feature labels, gender, age and visit behavior information, using an extreme gradient boosting (XGBoost) algorithm.

Description

Method for establishing disease risk prediction model and method for recommending disease insurance product
Technical Field
The invention relates to the field of insurance, in particular to a method for establishing a disease risk prediction model and a method for recommending disease insurance products.
Background
As society develops, consumers' awareness of insurance is gradually improving. Consumer demand for insurance is also becoming more fine-grained, and simple product pricing based only on the two dimensions of age and gender is comparatively mechanical.
At present, in the insurance industry, the risk models or rules used to judge a customer's health risk are often compiled from the industry's traditional experience. Such models cannot rule out applicants who conceal their health condition when applying (adverse selection), so insurance products recommended to customers on the basis of a traditional model often have low accuracy.
Disclosure of Invention
In view of the above, a method for establishing a disease risk prediction model and a method for recommending a disease insurance product are provided. Historical diagnosis and treatment data of medical insurance participants in a preset area are classified, sampled and processed to extract the serious disease clustering feature labels corresponding to a preset serious disease. A disease risk prediction model corresponding to the preset serious disease is then established from the serious disease clustering feature labels, gender, age and visit behavior information using an extreme gradient boosting (XGBoost) algorithm, so that the risk of a disease insurance applicant can be evaluated accurately. A method for recommending a disease insurance product can further be provided on the basis of the disease risk prediction model, which greatly improves the accuracy with which insurance products are promoted.
A method for establishing a disease risk prediction model comprises the following steps:
acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;
classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison ratio to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison ratio is the ratio of the number of positive samples to the number of negative samples of the preset serious disease, and each sample comprises the historical disease diagnosis coding information and visit behavior information of that sample within a preset time range;
preprocessing the sample data set to remove invalid data, and clustering the historical disease diagnosis coding information of all samples in the preprocessed sample data set according to the corresponding disease attributes and lesion sites to obtain corresponding disease clustering feature labels;
screening the disease clustering feature labels with a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease;
and establishing a disease risk prediction model corresponding to the preset serious disease from the serious disease clustering feature labels, gender, age and visit behavior information, using an extreme gradient boosting (XGBoost) algorithm.
In one embodiment, the step of classifying and sampling the historical diagnosis and treatment data according to gender, the preset age interval and the preset comparison ratio to obtain a sample data set comprises:
classifying the historical diagnosis and treatment data so that records with the same gender and the same preset age interval are grouped together, to obtain an initial data set;
screening a first preset number of positive sample data and a second preset number of negative sample data of the preset serious disease from the initial data set according to the preset comparison ratio, wherein the ratio of the first preset number to the second preset number is equal to the preset comparison ratio;
and obtaining a corresponding sample data set according to the positive sample data and the negative sample data.
In one embodiment, the establishing method further comprises:
re-screening the serious disease clustering feature labels in combination with the related precursor diseases corresponding to the preset serious disease;
and establishing a disease risk prediction model corresponding to the preset serious disease from the re-screened serious disease clustering feature labels, gender, age and visit behavior information, using an extreme gradient boosting (XGBoost) algorithm.
In one embodiment, the preset comparison ratio is set to [formula image BDA0002294078930000031, not reproduced].
In addition, a method for recommending a disease insurance product is provided, the recommendation method comprising:
designing a corresponding questionnaire according to the disease risk prediction model of a preset area;
acquiring basic data of a disease insurance applicant through the questionnaire;
performing prediction for the disease insurance applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result;
and recommending corresponding disease insurance products to the disease insurance applicant according to the disease risk prediction result.
In addition, a method for designing a disease insurance product is provided, the design method comprising:
performing disease risk prediction for each medical insurance participant in a preset area according to the disease risk prediction model to obtain corresponding disease risk prediction probabilities;
and generating a corresponding disease insurance product premium rate table according to the disease risk prediction probabilities, sex and age, and designing a corresponding disease insurance product according to the premium rate table.
In one embodiment, the step of generating a corresponding disease insurance product premium rate table according to the disease risk prediction probabilities, sex and age comprises:
dividing the medical insurance participants in the preset area into a plurality of risk level groups according to the disease risk prediction probabilities;
and subdividing each risk level group by sex and age according to the distribution of disease occurrence probability within that group, to generate the corresponding disease insurance product premium rate table.
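The two-stage grouping described above can be sketched as follows. This is only an illustration under stated assumptions: the cut points `(0.1, 0.3)`, the 5-year age bands, the record fields `sex`, `age` and `prob`, and the use of the mean predicted probability per cell are all hypothetical choices, since the text does not specify how risk levels are bounded or how a probability is converted into a premium.

```python
from collections import defaultdict


def risk_level(prob, cut_points=(0.1, 0.3)):
    """Return 0 for the lowest risk level, counting how many cut points
    (hypothetical values) the predicted probability reaches."""
    return sum(prob >= c for c in cut_points)


def rate_table(people, cut_points=(0.1, 0.3), width=5, top=80):
    """Group people by (risk level, sex, age band start) and return the
    mean predicted probability of each cell."""
    cells = defaultdict(list)
    for person in people:
        level = risk_level(person["prob"], cut_points)
        age = person["age"]
        band = top if age >= top else (age // width) * width
        cells[(level, person["sex"], band)].append(person["prob"])
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}


people = [
    {"sex": "F", "age": 42, "prob": 0.05},
    {"sex": "F", "age": 44, "prob": 0.07},
    {"sex": "M", "age": 61, "prob": 0.40},
]
table = rate_table(people)
```

Each cell value would still need to be turned into an actual premium by an actuarial step that the text does not describe.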
In addition, a device for establishing a disease risk prediction model is provided, the device comprising:
a data acquisition unit for acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;
a data set generation unit for classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison ratio to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison ratio is the ratio of the number of positive samples to the number of negative samples of the preset serious disease, and each sample comprises the historical disease diagnosis coding information and visit behavior information of that sample within a preset time range;
a clustering feature label generation unit for preprocessing the sample data set to remove invalid data, and clustering the historical disease diagnosis coding information of all samples in the preprocessed sample data set according to the corresponding disease attributes and lesion sites to obtain corresponding disease clustering feature labels;
a serious disease clustering feature label generation unit for screening the disease clustering feature labels with a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease;
and a prediction model generation unit for establishing a disease risk prediction model corresponding to the preset serious disease from the serious disease clustering feature labels, gender, age and visit behavior information, using an extreme gradient boosting (XGBoost) algorithm.
Furthermore, there is provided a device terminal comprising a memory for storing a computer program and a processor for running the computer program to cause the device terminal to perform the above-described establishing method.
Further, a readable storage medium is provided, which stores a computer program which, when run by a processor, performs the above-described establishing method.
In the above method for establishing a disease risk prediction model, historical diagnosis and treatment data of medical insurance participants in a preset area are obtained and are classified and sampled according to gender, a preset age interval and a preset comparison ratio to obtain a sample data set. The sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison ratio is the ratio of the number of positive samples to the number of negative samples of the preset serious disease, and each sample comprises the historical disease diagnosis coding information and visit behavior information of that sample within a preset time range. The sample data set is preprocessed to remove invalid data, and the historical disease diagnosis coding information of all samples in the preprocessed sample data set is clustered according to the corresponding disease attributes and lesion sites to obtain corresponding disease clustering feature labels. The disease clustering feature labels are screened with a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease, and a disease risk prediction model corresponding to the preset serious disease is established from the serious disease clustering feature labels, gender, age and visit behavior information using an extreme gradient boosting (XGBoost) algorithm. The risk of a disease insurance applicant can thus be predicted accurately, which gives insurance companies a sound basis for designing insurance products and allows a suitable method for recommending disease insurance products to be built on the disease risk prediction model, improving the precision and suitability of product promotion.
Drawings
In order to more clearly illustrate the technical solutions of the present invention, the drawings that are required for the embodiments will be briefly described, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope of the present invention. Like elements are numbered alike in the various figures.
FIG. 1 is a flow chart of a method for establishing a disease risk prediction model according to one embodiment;
FIG. 2 is a graph of receiver operating characteristics of a disease risk prediction model provided in one embodiment;
FIG. 3 is a flow chart of a method of obtaining a sample dataset provided in one embodiment;
FIG. 4 is a flowchart of a method for establishing a disease risk prediction model according to another embodiment;
FIG. 5 is a flow chart of a method of recommending a disease insurance product according to one embodiment;
FIG. 6 is a flow chart of a method of designing a disease insurance product according to one embodiment;
FIG. 7 is a flow chart of a method of generating a disease insurance product premium rate table provided in one embodiment;
FIG. 8 is a block diagram of a device for establishing a disease risk prediction model according to an embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present invention.
Hereinafter, various embodiments of the present disclosure will be more fully described. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that: there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein, but rather the disclosure is to be interpreted to cover all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.
The terms "comprises," "comprising," "including," or any other variation thereof, as used in the various embodiments of the present invention, are intended only to denote a specific feature, number, step, operation, element, component, or combination of the foregoing, and should not be understood as excluding the presence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the invention belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having a meaning that is the same as the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in connection with the various embodiments of the invention.
As shown in FIG. 1, a method for establishing a disease risk prediction model is provided, where the method includes:
Step S110, historical diagnosis and treatment data of medical insurance participants in a preset area are obtained.
Because environment, eating habits and medical standards differ between areas, the health conditions of people in different areas differ considerably. When processing data, the historical diagnosis and treatment data of people in a specific area therefore need to be acquired, and the historical diagnosis and treatment data of medical insurance participants in the preset area are generally accurate.
The historical diagnosis and treatment data generally include the patient's gender, age and visit behavior information, where the visit behavior information generally includes the grade of the hospital visited, visit frequency, number of visits, visit time and cumulative visit cost.
For positive patients, the age of the extracted positive sample is the age at first diagnosis of a serious disease. The hospital grade, visit frequency, number of hospitalizations, visit time, visit cost and the corresponding historical disease diagnosis coding information are the data from the two years before the date on which the serious disease was first confirmed. For example, if a patient was first confirmed as having cancer on December 5, 2018, the visit data extracted for that positive sample cover the period from December 5, 2016 to December 4, 2018.
For a negative sample (a patient in the database who has never been diagnosed with a serious disease), the cut-off year of the current database is pushed back two years to give a starting point, and the diagnosis and treatment data for the two years up to that starting point are taken. For example, if the database currently in use is cut off at December 31, 2018, the extracted data are generally the diagnosis and treatment data for 2015 and 2016.
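The two extraction windows can be written out as a short sketch. The function names are hypothetical, and treating the negative window as whole calendar years is an assumption drawn from the "2015 and 2016" example rather than something the text states outright.

```python
from datetime import date, timedelta


def years_before(d, years):
    """Same calendar day `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def positive_window(first_confirmed, years=2):
    """Window of `years` years ending the day before first confirmation."""
    return years_before(first_confirmed, years), first_confirmed - timedelta(days=1)


def negative_window(cutoff_year, lag_years=2, years=2):
    """Whole calendar years ending `lag_years` before the cut-off year."""
    end_year = cutoff_year - lag_years
    return date(end_year - years + 1, 1, 1), date(end_year, 12, 31)


print(positive_window(date(2018, 12, 5)))  # 2016-12-05 to 2018-12-04
print(negative_window(2018))               # 2015-01-01 to 2016-12-31
```

Leaving a two-year gap before the cut-off for negative samples presumably keeps their observation period comparable to the pre-diagnosis period of positive samples, although the text does not state the reason.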
The historical diagnosis and treatment data are usually data from which the patient's sensitive information (personal information such as identification card number and address) has been removed.
Step S120, classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset comparison proportion to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset serious disease, the preset comparison proportion is the ratio between the number of the positive samples and the number of the negative samples of the preset serious disease, and each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range.
The historical diagnosis and treatment data are classified and sampled according to gender, the preset age interval and the preset comparison ratio to obtain the sample data set, which contains positive sample data and negative sample data of the preset serious disease in the proportion set by the comparison ratio.
In addition to the historical disease diagnosis coding information within the preset time range, each sample generally includes gender, age and visit behavior information. The visit behavior information generally includes the grade of the hospital visited, visit frequency, number of hospitalizations, visit time and cumulative visit cost, and the historical disease diagnosis coding information is generally coded using ICD-10.
ICD-10 coding is a standardized representation of the physician's diagnostic description of the patient, which avoids different textual descriptions being used for the same disease.
For positive sample data, the preset time range matches the time range of the historical diagnosis and treatment data extracted for positive patients: it is the two years counted back from the date on which the serious disease was first confirmed.
For negative sample data (patients in the database who have never been diagnosed with a serious disease), the preset time range matches the time range of the historical diagnosis and treatment data extracted for negative samples: the cut-off year of the current database is pushed back two years to give a starting point, and the range is the two years counted back from that starting point.
The preset age interval may be 5 years wide, for example 0-4 years, 5-9 years, ..., 80+ years.
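As a small illustration of the 5-year banding, the helper below maps an age to its band label. The function name `age_band` and its parameters are hypothetical; only the 0-4, 5-9, ..., 80+ scheme comes from the text.

```python
def age_band(age, width=5, top=80):
    """Return the label of the age band containing `age`.

    Ages of `top` and above share a single open-ended band.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age >= top:
        return f"{top}+"
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print([age_band(a) for a in (0, 4, 5, 79, 80, 93)])
```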
Step S130, the sample data set is preprocessed to eliminate invalid data, and the historical disease diagnosis coding information of all samples in the preprocessed sample data set is clustered according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels.
The historical disease diagnosis coding information of some samples in the sample data set may be an empty set; such samples are removed directly during preprocessing. The historical disease diagnosis coding information of each remaining sample is then clustered according to the corresponding disease attributes and lesion sites to obtain the corresponding disease clustering feature labels, which reduces the sparsity of the sample data.
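One way to picture this step is to map each ICD-10 code to a coarser group and to drop samples with no codes. The sketch below assumes, purely for illustration, that the groups are standard ICD-10 blocks; the text does not say which grouping the authors used, and the block list here is deliberately partial.

```python
# (chapter letter, first category, last category, label) for a few ICD-10 blocks.
ICD10_BLOCKS = [
    ("D", 60, 64, "Aplastic and other anaemias"),
    ("D", 65, 69, "Coagulation defects, purpura and other haemorrhagic conditions"),
    ("D", 80, 89, "Certain disorders involving the immune mechanism"),
    ("M", 30, 36, "Systemic connective tissue disorders"),
    ("R", 70, 79, "Abnormal findings on examination of blood, without diagnosis"),
]


def cluster_label(code):
    """Map an ICD-10 code such as 'D61.9' to a block label, or None if unmapped."""
    code = code.strip().upper()
    if len(code) < 3 or not code[1:3].isdigit():
        return None
    letter, number = code[0], int(code[1:3])
    for block_letter, low, high, label in ICD10_BLOCKS:
        if letter == block_letter and low <= number <= high:
            return label
    return None


def cluster_features(samples):
    """Drop samples with an empty code list; return a set of labels per sample."""
    features = {}
    for sample_id, codes in samples.items():
        if not codes:  # invalid sample: no diagnosis codes in the window
            continue
        features[sample_id] = {cluster_label(c) for c in codes} - {None}
    return features


samples = {"a": ["D61.9", "M32.1", "D61.0"], "b": [], "c": ["R73.0"]}
print(cluster_features(samples))
```

Collapsing many individual codes into a few labels is what reduces sparsity: the two distinct `D61` codes of sample `a` become a single feature.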
Step S140, the disease clustering feature labels are screened with a preset feature selection algorithm to obtain the serious disease clustering feature labels corresponding to the preset serious disease.
The preset feature selection algorithm is generally any one of a mutual information algorithm, a p-value algorithm and an information gain algorithm, and is used to screen the disease clustering feature labels to obtain the serious disease clustering feature labels corresponding to the preset serious disease.
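Of the options named, mutual information is the simplest to sketch. The example below scores each binary cluster label against the binary disease label and keeps the highest-scoring ones. Keeping a fixed `top_k` is an assumption, since the text gives no selection threshold, and the feature names are made up.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum((c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in joint.items())


def select_features(columns, labels, top_k):
    """Return the names of the `top_k` columns with the highest mutual information."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


labels = [1, 1, 0, 0]
columns = {
    "anaemia_cluster": [1, 1, 0, 0],  # perfectly aligned with the label
    "skin_cluster": [1, 0, 1, 0],     # independent of the label
}
print(select_features(columns, labels, top_k=1))
```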
Step S150, a disease risk prediction model corresponding to the preset serious disease is established from the serious disease clustering feature labels, gender, age and visit behavior information, using an extreme gradient boosting (XGBoost) algorithm.
After the serious disease clustering feature labels are obtained, they are combined with the gender, age and visit behavior information of each sample in the preprocessed sample data set to form the model feature factors used in building the model, and the disease risk prediction model corresponding to the preset serious disease is established using the extreme gradient boosting algorithm.
In one embodiment, 70% of the data in the sample data set are first randomly extracted as a training sample data set, and the remaining 30% are used as a test sample data set. An extreme gradient boosting (XGBoost) algorithm is used as the training model: the objective function is binary logistic regression, the booster type is set to gradient boosted trees, the learning rate ranges from 0.001 to 0.3, and the maximum number of iterations ranges from 50 to 3000. A grid search algorithm traverses the hyperparameters set in the extreme gradient boosting algorithm, the training sample data set is used for pre-training, and K-fold cross-validation is used to evaluate the training effect of the model, so that suitable model parameters are obtained through training and screening. The test sample data set is then predicted with the selected model parameters to obtain the corresponding disease risk prediction results, which are compared with the actual serious disease positive sample data; the model is corrected continually in this way until training yields the corresponding disease risk prediction model.
In one embodiment, the learning rate in the above extreme gradient boosting algorithm is set to any one of 0.001, 0.003, 0.01, 0.03, 0.1 and 0.3, and the maximum number of iterations is set to any one of 50, 100, 300, 500, 1000 and 3000.
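The search procedure can be outlined without committing to a particular training library. The sketch below builds the 6 x 6 grid from the values listed, does the 70/30 split, and runs K-fold cross-validation around a caller-supplied `score_fn`. That callable, the choice `k=5`, and the parameter names (picked to resemble the XGBoost scikit-learn wrapper) are assumptions, not details given in the text.

```python
import random
from itertools import product

LEARNING_RATES = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]
MAX_ITERATIONS = [50, 100, 300, 500, 1000, 3000]


def parameter_grid():
    """Yield one parameter dict per (learning rate, iteration count) pair."""
    for lr, n in product(LEARNING_RATES, MAX_ITERATIONS):
        yield {"objective": "binary:logistic", "booster": "gbtree",
               "learning_rate": lr, "n_estimators": n}


def train_test_split(indices, train_fraction=0.7, seed=0):
    """Shuffle sample indices and split them into training and test parts."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_folds(indices, k=5):
    """Yield (fit, validation) index lists for K-fold cross-validation."""
    indices = list(indices)
    for i in range(k):
        val = indices[i::k]
        held_out = set(val)
        yield [j for j in indices if j not in held_out], val


def grid_search(train_idx, score_fn, k=5):
    """Return (mean score, params) of the best grid point.

    `score_fn(params, fit_idx, val_idx) -> float` is a hypothetical callable
    that trains on fit_idx and scores on val_idx (for example with AUC).
    """
    best = None
    for params in parameter_grid():
        scores = [score_fn(params, fit, val) for fit, val in k_folds(train_idx, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, params)
    return best
```

In practice `score_fn` would wrap the actual model fitting; the test set returned by `train_test_split` is held back until the grid search has picked its parameters.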
In one embodiment, the receiver operating characteristic (ROC) curve obtained when the disease risk prediction model is used to predict the sample data set is shown in FIG. 2. The true positive rate on the ordinate is the number of positive samples predicted as positive divided by the actual number of positive samples; the false positive rate on the abscissa is the number of negative samples predicted as positive divided by the actual number of negative samples. The area under the ROC curve (AUC) in FIG. 2 is 0.86, which is greater than 0.5, the area under the dashed diagonal line in FIG. 2, indicating that the disease risk prediction model performs well.
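To make the two rates and the AUC concrete, here is a minimal sketch on made-up scores. It computes AUC through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting one half), which equals the area under the ROC curve. The data are illustrative and unrelated to the 0.86 reported for FIG. 2.

```python
def rates(labels, scores, threshold):
    """Return (true positive rate, false positive rate) at a score threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(rates(labels, scores, threshold=0.5))  # (0.5, 0.5)
print(auc(labels, scores))                   # 0.75
```

A model that ranks samples at random gives an AUC of about 0.5, which is why the diagonal serves as the baseline.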
The above method for establishing a disease risk prediction model can accurately predict the risk of a disease insurance applicant and thus give insurance companies a sound basis for designing insurance products. When disease insurance products are subsequently recommended, a suitable recommendation method can be built on the disease risk prediction model, improving the overall precision and suitability of disease insurance product promotion.
In one embodiment, the model feature factors corresponding to the disease risk prediction model are listed in Table 1:
TABLE 1
Model feature factor
Sex
Age
Number of visits
Cumulative amount spent
Number of hospitalizations
Symptoms and signs involving the skin and subcutaneous tissue
Diseases of veins, lymphatic vessels and lymph nodes, not elsewhere classified
Aplastic and other anaemias
Other diseases of blood and blood-forming organs
Abnormal findings on examination of blood
Coagulation defects, purpura and other haemorrhagic conditions
Systemic connective tissue disorders
Certain disorders involving the immune mechanism
In one embodiment, as shown in Fig. 3, step S120 includes:
Step S122, classifying the historical diagnosis and treatment data so that records with the same gender and the same preset age interval are grouped together, to obtain an initial data set.
The historical diagnosis and treatment data is first divided into two parts by gender, and each part is then further divided by preset age interval, which yields the initial data set corresponding to each part.
Step S124, screening out a first preset number of positive sample data and a second preset number of negative sample data of the preset severe disease from the initial data set according to a preset control ratio, where the ratio of the first preset number to the second preset number equals the preset control ratio.
After the initial data set is obtained, a first preset number of positive sample data of the preset severe disease and a second preset number of negative sample data are screened out from it, such that the ratio of the first preset number to the second preset number equals the preset control ratio.
In one embodiment, for a preset severe disease, the positive data of all patients suffering from the preset severe disease may first be obtained from the historical diagnosis and treatment data, and a first preset number of positive sample data is selected from this positive data. A second preset number of negative sample data is then screened and extracted from the historical diagnosis and treatment data according to gender, the preset age interval and the preset control ratio.
The preset control ratio is usually set within the following range:
[Formula image BDA0002294078930000131]
In one embodiment, the preset control ratio is 1:4.
Step S126, obtaining the corresponding sample data set from the positive sample data and the negative sample data.
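Steps S122 to S126 can be illustrated with a small sampling sketch. It reads the text as matched case-control sampling: each selected positive sample receives negatives drawn from the same gender and age interval at the 1:4 ratio of the embodiment. That reading, the ten-year interval width and the record keys `gender`, `age` and `positive` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def age_interval(age, width=10):
    """Map an age to a preset age interval, e.g. 37 -> (30, 39)."""
    low = (age // width) * width
    return (low, low + width - 1)


def build_sample_set(records, n_positive, controls_per_case=4, seed=0):
    """Draw positives, then matched negatives at a 1:controls_per_case ratio.

    Each record is a dict with hypothetical keys "gender", "age" and
    "positive" (True if diagnosed with the preset severe disease).
    """
    rng = random.Random(seed)
    positives = [r for r in records if r["positive"]]
    cases = rng.sample(positives, min(n_positive, len(positives)))

    # Group negatives by (gender, age interval) so controls match cases.
    strata = defaultdict(list)
    for r in records:
        if not r["positive"]:
            strata[(r["gender"], age_interval(r["age"]))].append(r)

    controls = []
    for case in cases:
        pool = strata[(case["gender"], age_interval(case["age"]))]
        picked = rng.sample(pool, min(controls_per_case, len(pool)))
        for control in picked:
            pool.remove(control)  # sample without replacement across cases
        controls.extend(picked)
    return cases, controls
```

If a stratum runs out of negatives, this sketch returns fewer controls than requested instead of borrowing from a neighbouring stratum, so the realised ratio should be checked before training.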
In one embodiment, as shown in Fig. 4, the above establishing method further includes:
Step S160, re-screening the severe disease cluster feature labels in combination with the related precursor diseases corresponding to the preset severe disease.
Before a preset severe disease is diagnosed, the patient usually shows symptoms of certain related precursor diseases, so the severe disease cluster feature labels can be further screened according to the related precursor diseases corresponding to the preset severe disease.
Step S170, establishing a disease risk prediction model corresponding to the preset severe disease according to the re-screened severe disease cluster feature labels, the gender, the age and the diagnosis behavior information, in combination with the extreme gradient boosting algorithm.
After the re-screened severe disease cluster feature labels are obtained, they are combined with the gender, age and diagnosis behavior information of each sample in the preprocessed sample data set to form the model feature factors used in model training, and the disease risk prediction model corresponding to the preset severe disease is established with the extreme gradient boosting algorithm.
In addition, as shown in Fig. 5, a method for recommending a disease insurance product is provided. The recommendation method uses the above disease risk prediction model and includes:
Step S210, acquiring basic data of a disease insurance applicant.
A targeted questionnaire is designed around the disease risk prediction model adopted for the preset area, which makes subsequent prediction with the disease risk prediction model more accurate.
For example, the questionnaire can be designed around the model feature factors in Table 1, so that information about potential applicants is collected in a targeted way.
Of course, besides the questionnaire described above, the basic data of the disease insurance applicant may also be obtained through other channels, such as interview recordings.
In one embodiment, the applicant fills in a questionnaire designed around the model feature factors in Table 1 above, so that the basic data corresponding to each model feature factor is obtained in a targeted manner.
The questionnaire may be, without limitation, an Internet questionnaire, a WeChat questionnaire, a QQ questionnaire or a paper questionnaire.
Step S220, predicting for the disease insurance applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result.
After the basic data is obtained, the relevant features are extracted from it and input into the disease risk prediction model, which yields the disease risk prediction result for the disease insurance applicant.
Step S230, recommending a corresponding disease insurance product to the disease insurance applicant according to the disease risk prediction result.
With this recommendation method, the basic data of the disease insurance applicant is acquired in a targeted manner, the applicant's risk is predicted from the basic data and the disease risk prediction model, and a corresponding disease insurance product is finally recommended according to the disease risk prediction result. This greatly improves the accuracy of recommending and promoting insurance products and strengthens the marketing capability of the insurance company.
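Steps S210 to S230 can be sketched as a small pipeline. Everything specific here is hypothetical: the feature names, the risk thresholds and the product names are placeholders, and `predict_risk` stands in for the trained disease risk prediction model. The patent does not say how a prediction result maps to a product, so the tier lookup is only one possible rule.

```python
# Hypothetical subset of the Table 1 factors, in the order the model expects.
FEATURE_ORDER = ["gender", "age", "visit_count", "total_spend", "hospitalization_count"]


def to_feature_vector(answers):
    """Turn questionnaire answers into the model's input order; missing -> 0."""
    return [float(answers.get(name, 0)) for name in FEATURE_ORDER]


def recommend(answers, predict_risk, tiers):
    """Return (risk, product) for one applicant.

    predict_risk: callable mapping a feature vector to a probability in [0, 1];
                  stands in for the trained disease risk prediction model.
    tiers: list of (upper_bound, product_name), sorted by upper_bound.
    """
    risk = predict_risk(to_feature_vector(answers))
    for upper_bound, product in tiers:
        if risk <= upper_bound:
            return risk, product
    return risk, tiers[-1][1]
```

Keeping the model behind a plain callable means the same routine works whether the basic data came from a questionnaire or from another channel, as long as it is reduced to the same feature names.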
In addition, as shown in Fig. 6, a method for designing a disease insurance product is provided. The design method uses the above disease risk prediction model and includes:
Step S310, performing disease risk prediction for each medical insurance participant in a preset area according to the disease risk prediction model to obtain the corresponding disease risk prediction probabilities.
Different regions, such as northern and southern regions or coastal and plain regions, differ in environment, eating habits, medical level and other respects, so their populations differ markedly. For example, the incidence of thyroid cancer among residents of coastal areas is higher than in non-coastal areas, and the incidence of intestinal cancer in southern regions is higher than in northern regions. Based on these regional differences, modeling is carried out separately by region, and a different disease risk prediction model is constructed for each regional group.
In one embodiment, the preset area is the Beijing area, and all medical insurance participants in the Beijing area can be assessed for the severe disease lymphoma with the disease risk prediction model, giving each participant's disease risk prediction probability.
Step S320, generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex and age, and designing a corresponding disease insurance product according to the disease insurance product rate table.
The disease risk prediction probability and age are both important factors positively correlated with the disease insurance product rate, and gender also has an important influence. The disease insurance product rate table is therefore generated from the disease risk prediction probability, gender and age taken together, and the corresponding disease insurance product is designed according to that rate table.
With this design method, a different disease risk prediction model can be adopted for each area, and a disease insurance product suited to the preset area can then be designed. The product therefore matches the actual situation of the preset area, which reduces its risk, greatly improves its accuracy and adaptability, and improves the market competitiveness of the insurance company.
In one embodiment, as shown in Fig. 7, step S320 includes:
Step S322, dividing the medical insurance participants in the preset area into a plurality of risk level groups according to the disease risk prediction probability.
After the disease risk prediction probabilities are obtained, they can be divided into several levels, and the medical insurance participants in the preset area are divided into the corresponding risk level groups.
Step S324, generating a corresponding disease insurance product rate table according to the disease occurrence probability distribution corresponding to each risk level group and the division of each risk level group into intervals by gender and age.
Each risk level group corresponds to a disease occurrence probability distribution, that is, the distribution of the probability with which the preset severe disease actually occurs in that group.
Therefore, the corresponding disease insurance product rate table can be designed and generated according to the disease occurrence probability distribution corresponding to each risk level group and the division of each risk level group into intervals by gender and age.
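Steps S322 and S324 amount to grouping participants by risk level, gender and age interval, then pricing each cell from its observed incidence. The sketch below assumes hypothetical cut points (0.2 and 0.5), ten-year age bands and a premium that scales a base amount by relative incidence. None of these values or the pricing rule comes from the patent; they only show the shape of the computation.

```python
from collections import defaultdict


def risk_level(probability, cut_points=(0.2, 0.5)):
    """Map a predicted probability to a risk level index (0 = lowest)."""
    return sum(probability > c for c in cut_points)


def build_rate_table(participants, base_premium=100.0):
    """Group by (risk level, gender, age band) and price each cell.

    Each participant is a dict with hypothetical keys "risk" (predicted
    probability), "gender", "age" and "diseased" (observed outcome).
    The premium rule (base premium scaled by relative incidence) is an
    illustrative assumption, not the patent's pricing formula.
    """
    cells = defaultdict(lambda: [0, 0])  # key -> [diseased count, total count]
    for p in participants:
        band = (p["age"] // 10) * 10
        key = (risk_level(p["risk"]), p["gender"], band)
        cells[key][0] += int(p["diseased"])
        cells[key][1] += 1

    total_cases = sum(c[0] for c in cells.values())
    total_people = sum(c[1] for c in cells.values())
    overall = total_cases / total_people if total_people else 0.0

    table = {}
    for key, (cases, count) in sorted(cells.items()):
        incidence = cases / count
        relative = incidence / overall if overall else 1.0
        table[key] = {"incidence": incidence, "premium": round(base_premium * relative, 2)}
    return table
```

Cells with few participants give unstable incidence estimates, so a practical rate table would likely merge sparse cells or smooth the estimates before pricing.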
In addition, as shown in Fig. 8, an apparatus for establishing a disease risk prediction model is provided, the apparatus comprising:
The data acquisition unit 410 is configured to acquire historical diagnosis and treatment data of medical insurance participants in a preset area.
The data set generating unit 420 is configured to perform classification sampling processing on the historical diagnosis and treatment data according to gender, a preset age interval and a preset control ratio to obtain a sample data set, where the sample data set includes positive sample data and negative sample data of a preset severe disease, the preset control ratio is the ratio between the number of positive samples and the number of negative samples of the preset severe disease, and each sample data includes historical disease diagnosis coding information and diagnosis behavior information of each sample within a preset time range.
The first feature tag generating unit 430 is configured to preprocess the sample data set to remove invalid data, and to cluster the historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to the corresponding disease attribute and lesion site, so as to obtain corresponding disease cluster feature tags.
The second feature tag generating unit 440 is configured to screen the disease cluster feature tags by using a preset feature selection algorithm to obtain the severe disease cluster feature tags corresponding to the preset severe disease.
The prediction model generating unit 450 is configured to establish a disease risk prediction model corresponding to the preset severe disease according to the severe disease cluster feature tags, the gender, the age and the diagnosis behavior information, in combination with an extreme gradient boosting algorithm.
Furthermore, a device terminal is provided, comprising a memory for storing a computer program and a processor for running the computer program to cause the device terminal to perform the above establishing method.
Further, a readable storage medium is provided, which stores a computer program that, when run by a processor, performs the above establishing method.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules or units in various embodiments of the invention may be integrated together to form a single part, or the modules may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a smart phone, a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention.

Claims (9)

1. A method for building a disease risk prediction model, the method comprising:
acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;
classifying and sampling the historical diagnosis and treatment data according to gender, a preset age interval and a preset control ratio to obtain a sample data set, wherein the sample data set comprises positive sample data and negative sample data of a preset severe disease, the preset control ratio is the ratio between the number of positive samples and the number of negative samples of the preset severe disease, each sample data comprises historical disease diagnosis coding information and diagnosis behavior information of each sample within a preset time range, and the preset control ratio is set as
[Formula image FDA0004097113080000011]
preprocessing the sample data set to remove invalid data, and clustering historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to corresponding disease attributes and lesion sites to obtain corresponding disease cluster feature labels;
screening the disease cluster feature labels by adopting a preset feature selection algorithm to obtain the severe disease cluster feature labels corresponding to the preset severe diseases;
and establishing a disease risk prediction model corresponding to the preset severe disease by combining an extreme gradient boosting algorithm according to the severe disease cluster feature labels, the gender, the age and the diagnosis behavior information.
2. The method according to claim 1, wherein the step of classifying and sampling the historical diagnosis and treatment data according to gender, the preset age interval and the preset control ratio to obtain a sample data set comprises:
classifying the historical diagnosis and treatment data so that records with the same gender and the same preset age interval are grouped together, to obtain an initial data set;
screening out a first preset number of positive sample data and a second preset number of negative sample data of the preset severe disease from the initial data set according to the preset control ratio, wherein the ratio of the first preset number to the second preset number is equal to the preset control ratio;
and obtaining a corresponding sample data set according to the positive sample data and the negative sample data.
3. The method of establishing according to claim 1, further comprising:
re-screening the severe disease cluster feature labels in combination with the related precursor diseases corresponding to the preset severe disease;
and establishing a disease risk prediction model corresponding to the preset severe disease by combining an extreme gradient boosting algorithm according to the re-screened severe disease cluster feature labels, the gender, the age and the diagnosis behavior information.
4. A method of recommending a disease insurance product, characterized by employing the disease risk prediction model of any one of claims 1 to 3, the method comprising:
acquiring basic data of a disease insurance applicant;
predicting the disease insurance applicant according to the basic data and the disease risk prediction model to obtain a corresponding disease risk prediction result;
and recommending corresponding disease insurance products for the disease insurance applicant according to the disease risk prediction result.
5. A method of designing a disease insurance product, characterized by using the disease risk prediction model according to any one of claims 1 to 3, the method comprising:
according to the disease risk prediction model, disease risk prediction is respectively carried out on medical insurance participants in the preset area, and corresponding disease risk prediction probability is obtained;
and generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex and age, and designing a corresponding disease insurance product according to the disease insurance product rate table.
6. The design method according to claim 5, wherein the step of generating a corresponding disease insurance product rate table according to the disease risk prediction probability, sex, and age comprises:
dividing medical insurance participants in the preset area into a plurality of risk level groups according to the disease risk prediction probability;
and generating a corresponding disease insurance product rate table according to the disease occurrence probability distribution corresponding to each risk level group and the division of each risk level group into intervals by sex and age.
7. A disease risk prediction model building apparatus, wherein the building apparatus includes:
the data acquisition unit is used for acquiring historical diagnosis and treatment data of medical insurance participants in a preset area;
a data set generating unit, configured to perform classification sampling processing on the historical diagnosis and treatment data according to gender, a preset age interval, and a preset control ratio to obtain a sample data set, where the sample data set includes positive sample data and negative sample data of a preset severe disease, the preset control ratio is the ratio between the number of positive samples and the number of negative samples of the preset severe disease, each sample data includes historical disease diagnosis coding information and diagnosis behavior information of each sample in a preset time range, and the preset control ratio is set as follows
[Formula image FDA0004097113080000031]
the cluster feature label generating unit is used for preprocessing the sample data set to remove invalid data, and clustering historical disease diagnosis coding information corresponding to all samples in the preprocessed sample data set according to corresponding disease attributes and lesion sites to obtain corresponding disease cluster feature labels;
the severe disease cluster feature label generating unit is used for screening the disease cluster feature labels by adopting a preset feature selection algorithm to obtain severe disease cluster feature labels corresponding to the preset severe disease;
and the prediction model generation unit is used for establishing a disease risk prediction model corresponding to the preset severe disease according to the severe disease cluster feature labels, the gender, the age and the diagnosis behavior information, in combination with an extreme gradient boosting algorithm.
8. A device terminal comprising a memory for storing a computer program and a processor that runs the computer program to cause the device terminal to perform the establishing method of any one of claims 1 to 3.
9. A readable storage medium, characterized in that it stores a computer program which, when executed by a processor, performs the establishing method of any one of claims 1 to 3.
CN201911193197.4A 2019-11-28 2019-11-28 Method for establishing disease risk prediction model and method for recommending disease insurance product Active CN110993103B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911193197.4A CN110993103B (en) 2019-11-28 2019-11-28 Method for establishing disease risk prediction model and method for recommending disease insurance product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911193197.4A CN110993103B (en) 2019-11-28 2019-11-28 Method for establishing disease risk prediction model and method for recommending disease insurance product

Publications (2)

Publication Number Publication Date
CN110993103A CN110993103A (en) 2020-04-10
CN110993103B true CN110993103B (en) 2023-06-02

Family

ID=70087865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911193197.4A Active CN110993103B (en) 2019-11-28 2019-11-28 Method for establishing disease risk prediction model and method for recommending disease insurance product

Country Status (1)

Country Link
CN (1) CN110993103B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant