CN115601183A

CN115601183A - Claims data processing analysis method and system

Info

Publication number: CN115601183A
Application number: CN202211246824.8A
Authority: CN
Inventors: 李翔; 李飞龙; 陆培
Original assignee: Jinwei Medical Insurance Information Management China Co ltd
Current assignee: Jinwei Medical Insurance Information Management China Co ltd
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2023-01-13

Abstract

The embodiments of the invention provide a method and a system for processing and analyzing claim data, and relate to the field of data processing. The claim data processing and analyzing method comprises the following steps: reading a plurality of historical claim data; obtaining the correlation between each characteristic variable and the claim amount variable, and taking each characteristic variable that is correlated with the claim amount variable as a relevant variable; constructing a regression model with a plurality of characteristic variables as independent variables and the claim amount variable as the dependent variable; and training the regression model with the constructed training set to obtain a target regression model for evaluating the claim amount. The method compiles statistics on the historical claim data along multiple data dimensions, so that an insurance company can conveniently follow changes in the claim data in real time, and it introduces the screened relevant fields into the regression model, which improves the accuracy of the regression model in evaluating the claim amount.

Description

Claims data processing and analyzing method and system

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a system for processing and analyzing claim settlement data.

Background

In the health insurance claim settlement process, image data such as medical records, diagnosis certificates, original medical expense invoices, examination reports and itemized expense lists are collected. With the development of the internet, more and more insurance business is handled online. As OCR technology has gradually matured, the accuracy of recognizing uploaded images such as invoices, medical records and examination reports online during underwriting has greatly improved, which accelerates the accumulation of claim data. Processing and analyzing this data effectively is therefore very important for developing better products and services.

Given these factors, the core technical problem is how to improve the service and the benefit of insurance companies by modeling and analyzing the claim data. To address this technical problem, the applicant proposes the technical solution of this application.

Disclosure of Invention

The invention aims to provide a method and a system for processing and analyzing claim data, which compile statistics on historical claim data along multiple data dimensions and display them visually, so that an insurance company can conveniently follow changes in the claim data in real time. The method and system also judge the correlation among the data, screen out the relevant fields and introduce them into a regression model, thereby improving the accuracy of the regression model in evaluating the claim amount.

In order to achieve the above object, the present invention provides a method for processing and analyzing claim data, comprising: reading a plurality of historical claim data, wherein each historical claim data comprises the values of a plurality of characteristic variables and the value of a claim amount variable; obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim data, and taking each characteristic variable that is correlated with the claim amount variable as a relevant variable; constructing a regression model with the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the relevant variables and of the claim amount variable in the historical claim data; and training the regression model with the constructed training set to obtain a target regression model for evaluating the claim amount.

The invention also provides a claim data processing and analyzing system, which comprises: a data reading module, used for reading a plurality of historical claim data, each historical claim data comprising the values of a plurality of characteristic variables and the value of a claim amount variable; a correlation determination module, used for obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim data, and taking each characteristic variable that is correlated with the claim amount variable as a relevant variable; a model construction module, used for constructing a regression model with the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the relevant variables and of the claim amount variable in the historical claim data; and a model training module, used for training the regression model with the constructed training set to obtain a target regression model for evaluating the claim amount.

In the embodiments of the invention, the claim data processing and analyzing method visually displays the historical claim data through the selected characteristic variables and displays how the historical claim data changes over time, so that an insurance company can find problems in its insurance business in time; after the correlation between the characteristic variables in the historical claim data and the claim amount variable is obtained, the correlation is used to further optimize and train the regression model, thereby obtaining a target regression model that evaluates the claim amount relatively accurately.

In one embodiment, after reading the historical claim data, the method for processing and analyzing the claim data further comprises: and dividing the historical claim settlement data into different data dimensions, and performing data statistics under each data dimension to obtain a statistical calculation result for display.

In one embodiment, after the dividing the historical claim data into different data dimensions, and performing data statistics in each data dimension to obtain a statistical calculation result for presentation, the method for processing and analyzing the claim data further includes: converting the application data statistical result, the claim payment data statistical result and the disease payment statistical result into characteristic visualization charts respectively, and generating an evaluation conclusion for the data in the characteristic visualization charts based on set comparison thresholds.

In one embodiment, the dividing the historical claim data into different data dimensions, and performing data statistics in each data dimension to obtain a statistical calculation result for presentation includes: reading insurance application data, claim payment data and disease data in the historical claim data; dividing age groups based on the age distribution of the applicants in the insurance application data, and counting the number of applicants and the corresponding gender in each age group to form an application data statistical result; dividing age groups based on the age distribution of the claimants at the time of claim settlement in the claim payment data, counting the number of claimants and the corresponding gender in each age group, and counting the claim amount and the average claim amount for each quarter of a single year in the claim payment data to form a claim payment data statistical result; and counting the annual number of persons paid, the annual payment amount and the annual number of paid medical visits for each disease in a single year based on the treatment records in the disease data to form a disease payment statistical result.

In one embodiment, the calculating the correlation between the characteristic variables uses any one or any combination of the following: Pearson correlation coefficient, rank correlation coefficient, Kendall correlation coefficient, Kappa consistency coefficient, chi-square test, Fisher's exact test, and one-way analysis of variance (ANOVA).

In one embodiment, after the building of the training set including the values of the relevant variables and the values of the claim amount variable in the historical claim data, the method for processing and analyzing the claim data further includes: preprocessing the values of the variables in the training set, wherein the preprocessing comprises any one or more of: null value processing, abnormal value processing and data standardization processing.

In one embodiment, the regression model is any one of: linear regression models, polynomial regression models, ridge regression models, lasso regression models, and support vector regression models.

In one embodiment, after reading the historical claim data, the method for processing and analyzing the claim data further comprises: calling the Sqoop tool to import the historical claim data into HDFS, and calling the Hive tool to perform data cleaning on the historical claim data.

Drawings

FIG. 1 is a detailed flowchart of a method for processing and analyzing claim data according to a first embodiment of the present invention;

FIG. 2 is a detailed flowchart of step 102 of the claim data processing and analysis method of FIG. 1.

Detailed Description

The embodiments of the present invention will be described in detail below with reference to the accompanying drawings in order to more clearly understand the objects, features and advantages of the present invention. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but are merely intended to illustrate the spirit of the technical solution of the present invention.

In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various disclosed embodiments. One skilled in the relevant art will recognize, however, that an embodiment can be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.

Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.

Reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

As used in this specification and the appended claims, the singular forms "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. It should be noted that the term "or" is generally employed in its sense including "or/and" unless the context clearly dictates otherwise.

In the following description, for the purposes of clearly illustrating the structure and operation of the present invention, directional terms will be used, but terms such as "front", "rear", "left", "right", "outer", "inner", "outward", "inward", "upper", "lower", etc. should be construed as words of convenience and should not be construed as limiting terms.

The first embodiment of the invention relates to a claim data processing and analyzing method, which is applied to a claim data processing and analyzing system. By judging the correlation between each characteristic variable in the historical claim data and the claim amount variable, relevant fields are screened out and introduced into a regression model, which improves the accuracy of the regression model in evaluating the claim amount. Data statistics are also performed on the historical claim data along multiple data dimensions and an intuitive chart display is generated, so that an insurance company can conveniently follow changes in the claim data in real time.

Fig. 1 shows a specific flow of the claim data processing and analyzing method according to the present embodiment.

Step 101, reading a plurality of historical claim data, wherein each historical claim data comprises: the values of a plurality of characteristic variables and the value of a claim amount variable.

Specifically, after the claim data is collected, it is stored centrally in a back-end database, integrated and classified according to the insurance application dimension, the claim payment dimension and the disease dimension, and exposed through different data interfaces. The insurance application data, the claim payment data and the disease data in the historical claim data are read by calling the corresponding data interfaces. For example, a Web interface is adopted, and the data interface is implemented with the Spring Boot framework in Java. After the historical claim data is read, the Sqoop tool is called to import the claim data into HDFS, and the Hive tool is called to perform data cleaning on the historical claim data.

Step 102, dividing the historical claim data into different data dimensions, and performing data statistics under each data dimension to obtain a statistical calculation result for display.

In one example, referring to fig. 2, step 102 includes the following sub-steps:

Substep 1021, reading the insurance application data, the claim payment data and the disease data in the historical claim data.

Specifically, the insurance application data, the claim payment data and the disease data in the claim data are respectively read by calling the data interfaces.

Substep 1022, dividing age groups based on the age distribution of the applicants in the insurance application data, and counting the number of applicants and the corresponding gender in each age group to form an application data statistical result.

Specifically, after the insurance application data is read, the age of each applicant at the time of application is calculated from the value of the policy effective start date field in the insurance application data, and the applicant ages are divided into age groups, for example the 5 age groups 20-30, 30-40, 40-50, 50-60 and 60-70. Within each age group, the number of applicants is counted according to the policy effective start date and expiration date in the insurance application data, and the number of applicants in each age group is associated with the gender distribution in the insurance application data to form the application data statistical result.
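The age grouping of substep 1022 can be sketched as follows. This is only an illustration under stated assumptions: the field names `birth_date`, `effective_date` and `gender` are hypothetical, the age is assumed to be derived from a birth date and the policy effective start date, and each age group is assumed to include its lower bound and exclude its upper bound (the text does not say how the boundary ages 30, 40, ... are assigned).

```python
from collections import Counter
from datetime import date

# Age groups from the example in the text; the half-open convention
# (lower bound inclusive, upper bound exclusive) is an assumption.
AGE_GROUPS = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]


def age_on(birth_date, on_date):
    """Completed years between birth_date and on_date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_group(age):
    """Return the label of the group containing age, or None if outside all groups."""
    for low, high in AGE_GROUPS:
        if low <= age < high:
            return f"{low}-{high}"
    return None


def application_statistics(policies):
    """Count applicants per (age group, gender) at the policy effective start date."""
    counts = Counter()
    for policy in policies:
        group = age_group(age_on(policy["birth_date"], policy["effective_date"]))
        if group is not None:
            counts[(group, policy["gender"])] += 1
    return counts


# Hypothetical sample records.
policies = [
    {"birth_date": date(1990, 5, 1), "effective_date": date(2020, 1, 1), "gender": "F"},
    {"birth_date": date(1985, 3, 15), "effective_date": date(2021, 6, 1), "gender": "M"},
    {"birth_date": date(1988, 12, 31), "effective_date": date(2021, 1, 1), "gender": "M"},
]
stats = application_statistics(policies)
```

The resulting counter maps each (age group, gender) pair to a head count, which is the shape a bar chart or pie chart would consume.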

Substep 1023, dividing age groups based on the age distribution of the claimants at the time of claim settlement in the claim payment data, counting the number of claimants and the corresponding gender in each age group, and counting the claim amount and the average claim amount for each quarter of a single year in the claim payment data to form a claim payment data statistical result.

Specifically, after the claim payment data is read, the age of each claimant at the time of the claim is calculated from the claim date in the claim payment data; the ages are divided into age groups and the gender is counted, where the age groups can follow the 5-group division described for substep 1022. The claim amount and the average claim amount for each quarter of a single year are then calculated, and the number of claimants in each age group, the gender distribution, and the quarterly claim amount and average claim amount are associated to form the claim payment data statistical result.
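The quarterly part of substep 1023 might look like the sketch below. The field names `claim_date` and `amount` are hypothetical, and the average claim amount is assumed to mean the average per claim within the quarter, which the text does not specify.

```python
from collections import defaultdict
from datetime import date


def quarter_of(d):
    """Map a date to its calendar quarter (1 to 4)."""
    return (d.month - 1) // 3 + 1


def quarterly_claim_statistics(claims, year):
    """Total and average claim amount for each quarter of the given year."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for claim in claims:
        if claim["claim_date"].year == year:
            q = quarter_of(claim["claim_date"])
            totals[q] += claim["amount"]
            counts[q] += 1
    return {
        q: {"total": totals[q], "average": totals[q] / counts[q]}
        for q in sorted(totals)
    }


# Hypothetical sample records; the 2020 claim is ignored for year 2021.
claims = [
    {"claim_date": date(2021, 1, 10), "amount": 100.0},
    {"claim_date": date(2021, 3, 5), "amount": 300.0},
    {"claim_date": date(2021, 7, 20), "amount": 50.0},
    {"claim_date": date(2020, 7, 20), "amount": 999.0},
]
result = quarterly_claim_statistics(claims, 2021)
```

Quarters without any claim are simply absent from the result, so a caller that needs all four quarters would have to fill the gaps itself.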

Substep 1024, counting the annual number of persons paid, the annual payment amount and the annual number of paid medical visits for each disease in a single year based on the treatment records in the disease data to form a disease payment statistical result.

Specifically, according to the treatment records in the disease data, the annual number of persons paid, the annual payment amount and the annual number of paid medical visits are counted for each disease in a single year, and the results are sorted. The diseases are also classified, and within each disease class the diseases are counted and sorted by annual payment amount. The three annual figures of each disease are associated with one another, and the diseases in each class are associated with their annual payment amounts, to form the disease payment statistical table.
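A minimal sketch of the per-disease counting in substep 1024 is given below. The record fields `year`, `disease`, `patient_id` and `paid_amount` are hypothetical names, and one record is assumed to correspond to one paid medical visit.

```python
from collections import defaultdict


def disease_statistics(visits, year):
    """Per-disease persons paid, amount paid and paid visits, sorted by amount."""
    persons = defaultdict(set)
    amount = defaultdict(float)
    visit_count = defaultdict(int)
    for visit in visits:
        if visit["year"] != year:
            continue
        disease = visit["disease"]
        persons[disease].add(visit["patient_id"])  # distinct persons only
        amount[disease] += visit["paid_amount"]
        visit_count[disease] += 1
    rows = [
        {
            "disease": disease,
            "persons": len(persons[disease]),
            "amount": amount[disease],
            "visits": visit_count[disease],
        }
        for disease in visit_count
    ]
    rows.sort(key=lambda row: row["amount"], reverse=True)
    return rows


# Hypothetical sample records.
visits = [
    {"year": 2021, "disease": "hypertension", "patient_id": "p1", "paid_amount": 200.0},
    {"year": 2021, "disease": "hypertension", "patient_id": "p1", "paid_amount": 150.0},
    {"year": 2021, "disease": "hypertension", "patient_id": "p2", "paid_amount": 100.0},
    {"year": 2021, "disease": "fracture", "patient_id": "p3", "paid_amount": 900.0},
    {"year": 2020, "disease": "fracture", "patient_id": "p4", "paid_amount": 500.0},
]
table = disease_statistics(visits, 2021)
```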

Substep 1025, converting the application data statistical result, the claim payment data statistical result and the disease payment statistical result into characteristic visualization charts respectively, and generating an evaluation conclusion for the data in the characteristic visualization charts based on set comparison thresholds.

Specifically, the application data statistical result, the claim payment data statistical result and the disease payment statistical result are each converted into characteristic visualization charts, which may be of various types. In some examples, bar charts, pie charts, line charts and the like are used to reflect simple comparisons or linear changes between characteristic variables. In some examples, tree diagrams, radar charts and the like are used to reflect complex distribution relationships among multiple characteristic variables. For example, in the insurance application dimension, the number of applicants in each age group is displayed as a bar chart, and the gender distribution is displayed as a pie chart. The charts are rendered as pictures and shown on a display screen. Presenting the historical claim data as charts improves its readability and the data reading experience, and makes it easy to see clearly how the historical claim data changes from quarter to quarter or from year to year, which helps the insurance company find problems in its business in time and make targeted adjustments.

After the characteristic visualization chart is obtained, comparison thresholds are set to generate an analysis conclusion for the chart. A threshold range is set for each item of characteristic data and is used to judge whether the characteristic data at a given time node falls outside the range; if so, the data at that time node is judged to be abnormal data and is removed during data cleaning. The data of consecutive time nodes is also compared to generate an analysis conclusion on the data change. For example, listing or ranking the diseases with the highest number of persons paid or the highest payment amount in the disease dimension can help the product design department of the insurance company adjust and improve its insurance products.
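The threshold check and the comparison of consecutive time nodes can be illustrated as follows. The threshold bounds and the sample series are hypothetical, and the text does not define how the thresholds are chosen.

```python
def flag_out_of_range(series, lower, upper):
    """Return the labels of time nodes whose value lies outside [lower, upper]."""
    return [label for label, value in series if not (lower <= value <= upper)]


def period_changes(series):
    """Difference between each time node and the one before it."""
    return [
        (label, value - previous_value)
        for (_, previous_value), (label, value) in zip(series, series[1:])
    ]


# Hypothetical quarterly claim totals.
quarterly_totals = [
    ("2021Q1", 400.0),
    ("2021Q2", 420.0),
    ("2021Q3", 5000.0),
    ("2021Q4", 410.0),
]
abnormal = flag_out_of_range(quarterly_totals, lower=100.0, upper=1000.0)
changes = period_changes(quarterly_totals)
```

Here only the third quarter falls outside the assumed range, so it would be the node treated as abnormal data during cleaning.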

Step 103, obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim data, and taking each characteristic variable that is correlated with the claim amount variable as a relevant variable.

Specifically, according to the distribution of the characteristic variables in the historical claim data, a correlation checking tool is called to judge correlation. Characteristic variables that are correlated with the claim amount variable are set as relevant variables, and variables that are not correlated with the claim amount variable are set as interference variables. Taking two characteristic variables as an example: if the correlation between one of them and the claim amount variable is greater than a preset correlation threshold, while the correlation between the other and the claim amount variable is less than or equal to the threshold, the first is marked as correlated with the claim amount variable and identified as a relevant variable, and the second is marked as uncorrelated and treated as an interference variable. The relevant variables and the interference variables are stored separately so that they can later be used as different data input types for the regression model.
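The threshold rule described above can be sketched as a small helper. The correlation scores, variable names and threshold value are hypothetical, and comparing the absolute value of the score is an assumption made here so that strong negative correlations also count as relevant.

```python
def split_variables(correlations, threshold):
    """Split variable names into relevant and interference lists.

    A variable is relevant when |correlation| > threshold; a score less than
    or equal to the threshold makes it an interference variable.
    """
    relevant, interference = [], []
    for name, score in correlations.items():
        if abs(score) > threshold:
            relevant.append(name)
        else:
            interference.append(name)
    return relevant, interference


# Hypothetical correlation of each characteristic variable with the claim amount.
correlations = {
    "average_age": 0.62,
    "region": 0.08,
    "male_ratio": -0.35,
    "insured_count": 0.30,
}
relevant, interference = split_variables(correlations, threshold=0.30)
```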

Step 104, constructing a regression model with the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the relevant variables and of the claim amount variable in the historical claim data.

Step 105, training the regression model with the constructed training set to obtain a target regression model for evaluating the claim amount.

Specifically, the regression model is any one of: a linear regression model, a polynomial regression model, a ridge regression model, a lasso regression model, and a support vector regression model. Any of these regression models is constructed with a plurality of characteristic variables as independent variables and the claim amount variable as the dependent variable. When the regression model is trained, in order to improve its accuracy in evaluating the claim amount, the selected training sample data includes the values of the relevant variables and of the claim amount variable in the historical claim data, and does not include the values of the interference variables.
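As one possible illustration of steps 104 and 105, the sketch below fits a linear (or ridge) regression on the relevant variables only, leaving the interference variable out of the training set. It solves the normal equations directly, which is a choice made for this example and not a method stated in the text; the record fields and values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(rows, targets, lam=0.0):
    """Return [intercept, w1, ...]; lam=0 gives ordinary least squares.

    The intercept is not penalized.
    """
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    for i in range(1, k):
        xtx[i][i] += lam
    return solve_linear_system(xtx, xty)


def predict(weights, row):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Hypothetical records; region_code plays the role of an interference variable.
records = [
    {"average_age": 30.0, "male_ratio": 0.5, "region_code": 3.0, "claim_amount": 650.0},
    {"average_age": 40.0, "male_ratio": 0.4, "region_code": 1.0, "claim_amount": 700.0},
    {"average_age": 50.0, "male_ratio": 0.6, "region_code": 2.0, "claim_amount": 900.0},
    {"average_age": 35.0, "male_ratio": 0.7, "region_code": 3.0, "claim_amount": 800.0},
]
relevant = ["average_age", "male_ratio"]
rows = [[record[name] for name in relevant] for record in records]
targets = [record["claim_amount"] for record in records]
weights = fit_ridge(rows, targets)
estimate = predict(weights, [45.0, 0.5])
```

The sample amounts were generated from 100 + 10 * average_age + 500 * male_ratio, so the fitted weights recover those coefficients; real claim data would of course not be exactly linear.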

In the sample training process, the values of all variables in the training set are preprocessed, and the preprocessing comprises any one or more of: null value processing, abnormal value processing, and data standardization processing. The preprocessing removes dirty data from the characteristic variables and produces standardized data: dirty data impairs the training of the regression model and the accuracy of the model when it is later used for prediction, while standardized data improves the convergence rate of the regression model. For example, the characteristic variables of the historical claim data include the industry of the company to which the claim object belongs, the average age, the number of insured persons, the region, the type of insurance product and the proportion of males. Among these, the industry, the region and the type of insurance product are discrete features and need to be encoded separately, for example by One-Hot encoding; the average age, the number of insured persons of the company and the proportion of males are continuous features and need to be subjected to binning, dummy variable coding and standardization processing so as to be converted into data with discrete characteristics. For example, according to the data characteristics, Min-Max standardization, log function conversion or Z-Score standardization is selected for processing.
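Some of the preprocessing operations named above can be sketched with the standard library alone. Filling missing values with the mean is an assumption made for this example (the text does not say how null values are handled), and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(values):
    """Replace None with the mean of the present values (assumed strategy)."""
    present = [v for v in values if v is not None]
    average = mean(present)
    return [average if v is None else v for v in values]


def min_max(values):
    """Min-Max standardization to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def z_score(values):
    """Z-Score standardization using the population standard deviation."""
    average, spread = mean(values), pstdev(values)
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - average) / spread for v in values]


def one_hot(values):
    """One-Hot encode a discrete feature; columns follow sorted category order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


ages = fill_missing([30.0, None, 50.0])
scaled = min_max(ages)
standardized = z_score(ages)
categories, encoded = one_hot(["east", "west", "east"])
```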

In the sample training process, one of the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE) and the coefficient of determination (R squared) can be selected as the evaluation index for training. After the training of the regression model is completed, candidate hyperparameter values are traversed for the regression model, the optimal hyperparameters are searched for during the traversal, and the regression model is optimized over multiple traversals to obtain the target regression model for evaluating the claim amount. The optimal hyperparameters are searched for by cross validation, and the obtained association relations between the relevant fields and the claim amount field are output together with their ranking by importance.
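The three evaluation indexes and a cross-validated hyperparameter search can be sketched as below. To keep the example short it uses a one-variable ridge model through the origin as a stand-in for the regression model, with the ridge penalty as the hyperparameter; the model, the candidate values and the fold layout are all assumptions for illustration.

```python
from math import sqrt


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    average = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - average) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def fit_ridge_1d(xs, ys, lam):
    """Stand-in model: y = w * x with ridge penalty lam on w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cross_validated_rmse(xs, ys, lam, k):
    """Average RMSE over k folds; fold f holds indices f, f + k, f + 2k, ..."""
    n = len(xs)
    scores = []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        w = fit_ridge_1d(train_x, train_y, lam)
        truth = [ys[i] for i in sorted(held_out)]
        preds = [w * xs[i] for i in sorted(held_out)]
        scores.append(rmse(truth, preds))
    return sum(scores) / k


def select_hyperparameter(xs, ys, candidates, k=3):
    """Pick the candidate with the lowest cross-validated RMSE."""
    return min(candidates, key=lambda lam: cross_validated_rmse(xs, ys, lam, k))


y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_lam = select_hyperparameter(xs, ys, candidates=[0.0, 1.0, 10.0])
```

Because the sample targets are exactly twice the inputs, any positive penalty only shrinks the weight away from the true value, so the search settles on the smallest candidate.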

In one example, in step 103, the correlation between the characteristic variables is obtained by having the correlation checking tool call any one of the following interfaces: Pearson correlation coefficient, rank correlation coefficient, Kendall correlation coefficient, Kappa consistency coefficient, chi-square test, Fisher's exact test, and one-way analysis of variance (ANOVA).

The Pearson correlation coefficient interface is described below. The Pearson correlation coefficient measures linear correlation, and its application conditions are: the two variables are linearly related and are continuous data; the populations of both variables are normally distributed, or unimodal and close to normal; the observations of the two variables are paired, and the pairs of observations are independent of each other. The calculation formula is shown as the following formula (1):

$$r = \frac{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)\left(Y_i-\bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_i-\bar{X}\right)^2}\,\sqrt{\sum_{i=1}^{n}\left(Y_i-\bar{Y}\right)^2}} \tag{1}$$

wherein $r$ represents the Pearson correlation coefficient; $r = 0$ indicates no linear correlation between the data field X and the data field Y, and the closer the absolute value of $r$ is to 1 the stronger the correlation, while the closer it is to 0 the weaker the correlation. $X_i$ represents the i-th data value in data field X, $Y_i$ represents the i-th data value in data field Y, $\bar{X}$ represents the mean value of data field X, $\bar{Y}$ represents the mean value of data field Y, and $n$ represents the number of data values in data field X, which is a positive integer.
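A direct transcription of the Pearson correlation coefficient into code may help when checking values by hand. The function name and the sample data are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return numerator / (spread_x * spread_y)


perfect = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])        # exactly linear, increasing
inverse = pearson_r([1, 2, 3], [3, 2, 1])              # exactly linear, decreasing
unrelated = pearson_r([1, 2, 3, 4], [1, -1, -1, 1])    # symmetric, no linear trend
```

The function raises a division-by-zero error when one of the sequences is constant, since the coefficient is undefined in that case.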

The application conditions of the rank correlation coefficient (Spearman) are: the random variables are paired ordered categorical variables; the data distribution is not restricted; the data are in a monotonic relationship. It can measure nonlinear relationships between random variables.

The application conditions of the Kendall correlation coefficient (Kendall Rank) are: the random variables are paired ordered categorical variables; the data distribution is not restricted; the data are in a monotonic relationship. It can measure nonlinear relationships between random variables.
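The two rank-based coefficients can be sketched as follows. Both functions assume that no value is repeated within a sequence: the Spearman shortcut formula and the simple Kendall tau-a variant used here are only exact without ties, and tie handling is left out of this illustration.

```python
def ranks(values):
    """Rank values from 1 to n in ascending order; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# A monotonic but nonlinear relationship still gives a rank correlation of 1.
monotonic = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
rho = spearman_rho([1, 2, 3, 4], [10, 30, 20, 40])
tau = kendall_tau([1, 2, 3, 4], [10, 30, 20, 40])
```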

The Kappa consistency factor application conditions are as follows: random variables are pairs of categorical variables.

The chi-square test application conditions are as follows: the random variables are paired categorical variables; large samples are preferred, and typically each case should appear at least once and one quarter should appear at least five times; if the data do not satisfy these conditions, the corrected chi-square test is applied.

Fisher's exact test applies under the following conditions: on the basis of the chi-square test, the sample size is less than 40 or the minimum theoretical frequency is less than 5; Fisher's exact test is also used if the chi-square test gives a p-value around 0.05.
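The choice between the chi-square test and Fisher's exact test can be illustrated for a contingency table of counts. The sketch computes the theoretical (expected) frequencies, the chi-square statistic, and a two-sided Fisher p-value for 2x2 tables only; the selection rule follows the sample-size and minimum-frequency conditions stated above, and the p-value condition is not modelled. Function names and tables are hypothetical.

```python
from math import comb


def expected_counts(table):
    """Theoretical frequency of each cell: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]


def chi_square_statistic(table):
    """Pearson chi-square statistic (no continuity correction)."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )


def fisher_exact_2x2(table):
    """Two-sided Fisher exact p-value for a 2x2 table."""
    (a, b), (c, d) = table
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    return sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )


def choose_test(table):
    """Apply the sample-size and minimum theoretical frequency conditions."""
    n = sum(sum(row) for row in table)
    smallest = min(min(row) for row in expected_counts(table))
    return "fisher" if n < 40 or smallest < 5 else "chi-square"


large = [[10, 20], [20, 10]]
small = [[3, 1], [1, 3]]
statistic = chi_square_statistic(large)
p_value = fisher_exact_2x2(small)
```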

The application conditions of one-way analysis of variance (ANOVA) are as follows: each sample is a random sample, the samples are independent of each other, and each follows a normal distribution; the population variances of the samples being compared are equal, that is, the samples have homogeneity of variance. A suitable method is selected to test the correlation according to the characteristics of the characteristic variables and the applicable conditions of each test method.
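For the one-way analysis of variance, the F statistic can be computed as in the sketch below, for example with claim amounts grouped by a categorical characteristic variable. The groups are hypothetical, and turning the statistic into a p-value requires the F distribution, which is outside this example.

```python
def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, means)
    )
    ss_within = sum(
        (value - m) ** 2 for group, m in zip(groups, means) for value in group
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical claim amounts (in thousands) for three regions.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = one_way_anova_f(groups)
```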

A second embodiment of the present invention relates to a claim data processing and analyzing system, including: a data reading module, used for reading historical claim data, the claim data comprising the values of a plurality of characteristic variables and the value of a claim amount variable; a result display module, used for dividing the historical claim data into different data dimensions, and performing data statistics under each data dimension to obtain a statistical calculation result for display; a correlation determination module, used for obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim data, and taking each characteristic variable that is correlated with the claim amount variable as a relevant variable; a model construction module, used for constructing a regression model with the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the relevant variables and of the claim amount variable in the historical claim data; and a model training module, used for training the regression model with the constructed training set to obtain a target regression model for evaluating the claim amount.

Since the first embodiment corresponds to the present embodiment, the present embodiment can be implemented in cooperation with the first embodiment. The related technical details mentioned in the first embodiment remain valid in this embodiment, and the technical effects achieved in the first embodiment can also be achieved in this embodiment; they are not repeated here. Accordingly, the related technical details mentioned in the present embodiment can also be applied to the first embodiment.

While the preferred embodiments of the present invention have been described in detail above, it should be understood that aspects of the embodiments can be modified, if necessary, to employ aspects, features and concepts of the various patents, applications and publications to provide yet further embodiments.

These and other changes can be made to the embodiments in light of the above detailed description. In general, in the claims, the terms used should not be construed to be limited to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled.

Claims

1. A method for processing and analyzing claim data is characterized by comprising the following steps:

reading a plurality of historical claim data, wherein each historical claim data comprises: the values of the plurality of characteristic variables and the value of the claim amount variable;

obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim settlement data, and taking the characteristic variable which has the correlation with the claim amount variable as a correlation variable;

constructing a regression model with the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the correlation variables and the values of the claim amount variable in the historical claim data;

and training the regression model by using the constructed training set to obtain a target regression model for evaluating the claim amount.

2. The claims data processing analysis method of claim 1, wherein after the reading of the plurality of historical claim data, the method further comprises:

and dividing the historical claim settlement data into different data dimensions, and performing data statistics under each data dimension to obtain a statistical calculation result for display.

3. The method for processing and analyzing claim data as claimed in claim 2, wherein the dividing of the historical claim data into different data dimensions and the performing of data statistics in each data dimension to obtain statistical calculation results for presentation comprises:

reading insurance data, claim data and disease data in the historical claim settlement data;

dividing age groups based on the age distribution of the insured persons in the insurance data, and counting the number of insured persons in each age group and their gender, to form an insurance data statistical result;

dividing age groups based on the age distribution of the claimants at the time of claim settlement in the claim data, counting the number of claimants in each age group and their gender, and counting the claim amount and the average claim amount for each quarter of a single year in the claim data, to form a claim data statistical result;

and counting, based on the treatment records in the disease data, the number of people reimbursed, the amount reimbursed and the number of reimbursed treatment visits for each disease in a single year, to form a disease reimbursement statistical result.
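
The age-group and quarterly statistics recited in claim 3 can be sketched as follows. This is an illustrative example only: the record fields (age, gender, settled_on, amount), the ten-year width of the age groups and the sample values are assumptions and are not specified by the claim.

```python
from collections import Counter, defaultdict
from datetime import date

def age_band(age, width=10):
    """Label the age group an age falls into, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def count_by_age_and_gender(people):
    """Count people per (age group, gender) pair."""
    return Counter((age_band(p["age"]), p["gender"]) for p in people)

def quarterly_claims(claims, year):
    """Total and average claim amount for each quarter of one year."""
    totals = defaultdict(float)
    counts = Counter()
    for c in claims:
        settled = c["settled_on"]
        if settled.year != year:
            continue
        quarter = (settled.month - 1) // 3 + 1
        totals[quarter] += c["amount"]
        counts[quarter] += 1
    return {q: {"total": totals[q], "average": totals[q] / counts[q]}
            for q in sorted(totals)}

# Hypothetical sample data.
insured = [
    {"age": 34, "gender": "F"},
    {"age": 37, "gender": "F"},
    {"age": 52, "gender": "M"},
]
claims = [
    {"settled_on": date(2021, 2, 10), "amount": 1000},
    {"settled_on": date(2021, 3, 5), "amount": 3000},
    {"settled_on": date(2021, 8, 1), "amount": 500},
    {"settled_on": date(2020, 1, 1), "amount": 999},
]
insured_stats = count_by_age_and_gender(insured)
claim_stats = quarterly_claims(claims, 2021)
```

The per-disease annual statistics can be produced in the same way by grouping the treatment records by disease instead of by quarter.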

4. The method for processing and analyzing claims data as claimed in claim 3, wherein after the dividing the historical claims data into different data dimensions and performing data statistics in each data dimension to obtain statistical calculation results for presentation, the method further comprises:

and respectively converting the insurance data statistical result, the claim data statistical result and the disease reimbursement statistical result into feature visualization charts, and generating an evaluation conclusion on the data in the feature visualization charts based on a set comparison threshold.

5. The method for processing and analyzing claim data as claimed in claim 1, wherein the manner of obtaining the correlation between each characteristic variable and the claim amount variable comprises any one or any combination of the following: Pearson correlation coefficient, rank correlation coefficient, Kendall correlation coefficient, Kappa consistency coefficient, chi-square test, Fisher's exact test, and one-way analysis of variance (ANOVA).
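
Three of the measures listed in claim 5 can be compared with a short Python sketch. It is illustrative only: the rank correlation coefficient is taken here to be the Spearman coefficient, the Kendall coefficient is computed in its simple tau-a form without tie correction, and the sample values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient (linear association)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical feature values and claim amounts with a monotonic,
# non-linear relationship.
feature = [1, 2, 3, 4, 5]
amount = [1, 4, 9, 16, 25]
r_pearson = pearson(feature, amount)
r_spearman = spearman(feature, amount)
r_kendall = kendall_tau_a(feature, amount)
```

For a monotonic but non-linear relationship such as this one, the two rank-based measures reach 1 while the Pearson coefficient stays below 1, which is one reason a combination of measures may be preferred when screening variables.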

6. The claims data processing analysis method of claim 1, wherein after the training set containing the values of the correlation variables and the values of the claim amount variable in the historical claim data is constructed, the method further comprises:

preprocessing the values of the variables in the training set, wherein the preprocessing comprises any one or more of: null value processing, abnormal value processing and data standardization processing.
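
The three preprocessing steps of claim 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the first step is read here as the handling of missing (null) values, and the specific techniques chosen (median imputation, clipping at 1.5 times the interquartile range, z-score standardization) and the sample values are assumptions.

```python
from statistics import mean, median, pstdev, quantiles

def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    present = [v for v in values if v is not None]
    m = median(present)
    return [m if v is None else v for v in values]

def clip_outliers(values, k=1.5):
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [min(max(v, low), high) for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

# Hypothetical raw values of one variable: one missing, one extreme.
raw = [10, 12, None, 11, 13, 500]
filled = fill_missing(raw)
clipped = clip_outliers(filled)
scaled = standardize(clipped)
```

The steps are applied in this order so that the extreme value is limited before the mean and standard deviation used for standardization are computed.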

7. The claims data processing analysis method of claim 1, wherein the regression model is any one of: linear regression models, polynomial regression models, ridge regression models, lasso regression models, and support vector regression models.
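
The difference between two of the model types listed in claim 7, linear regression and ridge regression, can be shown with a single-variable Python sketch. It is illustrative only: the closed-form solution below assumes one independent variable and an unpenalized intercept, and the parameter name lam and the sample values are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam=0.0):
    """Fit y = w * x + b by minimizing squared error plus lam * w**2.

    With lam = 0 this reduces to ordinary linear regression.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / (sxx + lam)   # the penalty shrinks the slope toward zero
    b = my - w * mx         # the intercept is not penalized
    return w, b

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w_ols, b_ols = fit_ridge_1d(xs, ys)
w_ridge, b_ridge = fit_ridge_1d(xs, ys, lam=5.0)
```

A larger penalty gives a flatter fitted line, which trades some fit on the training set for lower sensitivity to noise in the historical claim data.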

8. The claims data processing analysis method of claim 1, wherein after the reading of the plurality of historical claim data, the method further comprises:

calling a Sqoop tool to import the historical claim data into HDFS, and calling a Hive tool to perform data cleaning on the historical claim data.

9. A claims data processing and analysis system, comprising:

the data reading module is used for reading a plurality of historical claim data, and each historical claim data comprises: the values of the plurality of characteristic variables and the value of the claim amount variable;

the correlation determination module is used for obtaining the correlation between each characteristic variable and the claim amount variable according to the historical claim settlement data, and taking the characteristic variable which has the correlation with the claim amount variable as a correlation variable;

the model construction module is used for constructing a regression model by taking the characteristic variables as independent variables and the claim amount variable as the dependent variable, and constructing a training set containing the values of the correlation variables and the values of the claim amount variable in the historical claim data;

and the model training module is used for training the regression model by using the constructed training set to obtain a target regression model for evaluating the claim amount.

10. The claims data processing and analysis system of claim 9, further comprising a result presentation module configured to divide the historical claims data into different data dimensions, and perform data statistics in each data dimension to obtain statistical calculation results for presentation.