CN113642672B

CN113642672B - Feature processing method and device of medical insurance data, computer equipment and storage medium

Info

Publication number: CN113642672B
Application number: CN202111009455.6A
Authority: CN
Inventors: 李佳秀
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Ping An Medical and Healthcare Management Co Ltd
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2024-05-14
Anticipated expiration: 2041-08-31
Also published as: CN113642672A

Abstract

The application discloses a feature processing method and device for medical insurance data, computer equipment and a storage medium, relates to the field of medical big data processing, and is used to improve the comprehensiveness of model training data. The feature processing method of the medical insurance data comprises the following steps: acquiring original medical insurance data with preset dimensions from a medical insurance data source, wherein the medical insurance data source is a database openly shared by medical insurance institutions; performing data preprocessing on the original medical insurance data to generate basic data; performing feature classification, summarization and grouping on the basic data according to the preset dimensions to obtain derivative data; and integrating the original medical insurance data, the basic data and the derivative data to generate a training data set. With this feature processing method, a large number of model indexes are generated from medical insurance data of multiple dimensions, which increases the comprehensiveness of the data, so that the developed model is comprehensive and highly accurate.

Description

Feature processing method and device of medical insurance data, computer equipment and storage medium

Technical Field

The embodiment of the invention relates to the field of medical big data processing, in particular to a feature processing method and device of medical insurance data, computer equipment and a storage medium.

Background

With the rapid growth of medical expenses, the task of supervising the medical insurance fund grows ever heavier. Supervision requires medical data, such as information on the claimant, the medical institution visited, the treatment plan, medication and rehabilitation treatment. Traditional supervision of the medical insurance fund relies on experience-based audits without specific audit standards, so audit accuracy is low and many violations slip through the net. In addition, medical insurance fraud is increasingly concealed, complex and changeable, so the difficulty of risk control keeps rising. Medical big data and machine learning techniques are therefore gradually being used to mine complex and concealed violations, which can greatly improve audit accuracy and enable refined, professional management of the medical insurance fund.

There are many machine learning methods, but feature engineering cannot be avoided before any model is trained. Data and features define the upper limit of a machine learning model, so feature selection and feature engineering play a vital role in the whole modeling flow. However, current modeling flows focus mainly on the modeling method and ignore the processing of medical insurance data, so the trained model has strong bias and low accuracy.

Disclosure of Invention

The embodiment of the invention provides a feature processing method, a device, computer equipment and a storage medium for medical insurance data, which can improve the comprehensiveness of model training data.

In order to achieve the above purpose, the embodiment of the application adopts the following technical scheme:

In a first aspect, a method for processing features of medical insurance data is provided, including:

Acquiring original medical insurance data with preset dimensionality in a medical insurance data source, wherein the medical insurance data source is a database which is shared by medical insurance institutions in an open mode;

Performing data preprocessing on the original medical insurance data to generate basic data;

performing feature classification, summarization and grouping on the basic data according to the preset dimension to obtain derivative data;

And integrating the original medical insurance data, the basic data and the derivative data to generate a training data set.

Optionally, after the step of integrating the raw medical insurance data, the base data, and the derivative data to generate a training data set, the method further comprises the steps of:

inputting the training data set into a preset neural network for training, and generating a medical insurance detection model;

inputting the medical insurance data to be detected into the medical insurance detection model, and detecting the type of the medical insurance data to be detected.

Optionally, after the step of inputting the medical insurance data to be detected into the medical insurance detection model and detecting the type of the medical insurance data to be detected, the method further includes the following steps:

obtaining an audit result from rechecking the medical insurance data to be detected whose detected type is abnormal;

And adding the auditing result as a marked training sample into the training data set.

Optionally, after the step of obtaining the audit result from rechecking the medical insurance data to be detected whose detected type is abnormal, the method further includes the following steps:

Acquiring the insured person information of the medical insurance data to be detected whose audit result is a violation;

and acquiring all medical insurance history records matched with the insured person information, and rechecking the medical insurance history records.

Optionally, after the step of inputting the training data set to a preset neural network for training and generating a medical insurance detection model, the method further includes the steps of:

Acquiring performance parameters of the medical insurance detection model;

and adjusting model parameters of the medical insurance detection model according to the performance parameters.

Optionally, the step of adjusting the model parameters of the medical insurance detection model according to the performance parameters includes the steps of:

judging whether the performance parameters accord with preset model standards or not;

And deleting the medical insurance detection model when the performance parameter is judged to be not in accordance with the model standard.

Performing over-fitting detection on the medical insurance detection model, and judging whether the medical insurance detection model is over-fitted according to a detection result;

And when the medical insurance detection model is judged to be over-fitted, retraining the medical insurance detection model according to a preset adjustment strategy until the medical insurance detection model is not over-fitted.

In a second aspect, there is provided a feature processing apparatus of medical insurance data, the feature processing apparatus of medical insurance data including:

the medical insurance data acquisition module is used for acquiring original medical insurance data with preset dimensionality in a medical insurance data source, wherein the medical insurance data source is a database which is shared by medical insurance institutions in an open mode;

the data preprocessing module is used for preprocessing the data of the original medical insurance data to generate basic data;

the data deriving module is used for classifying, summarizing and grouping the characteristics of the basic data according to the preset dimension to obtain derived data;

And the data integration module is used for integrating the original medical insurance data, the basic data and the derivative data to generate a training data set.

Optionally, the apparatus further comprises:

the model training module is used for inputting the training data set into a preset neural network for training and generating a medical insurance detection model;

the data detection module is used for inputting the medical insurance data to be detected into the medical insurance detection model and detecting the type of the medical insurance data to be detected.

Optionally, the apparatus further comprises:

The data review module is used for obtaining an audit result from rechecking the medical insurance data to be detected whose detected type is abnormal;

And the training sample adding module is used for adding the auditing result into the training data set as a marked training sample.

Optionally, the apparatus further comprises:

the insured person information acquisition module is used for acquiring the insured person information of the medical insurance data to be detected whose audit result is a violation;

And the medical insurance history record auditing module is used for acquiring all medical insurance history records matched with the insured person information and rechecking the medical insurance history records.

Optionally, the apparatus further comprises:

the model performance acquisition module is used for acquiring performance parameters of the medical insurance detection model;

And the model parameter adjustment module is used for adjusting the model parameters of the medical insurance detection model according to the performance parameters.

Optionally, the apparatus further comprises:

The model performance judging module is used for judging whether the performance parameters accord with preset model standards or not;

And the model deleting module is used for deleting the medical insurance detection model when the performance parameter is judged to be not in accordance with the model standard.

Optionally, the apparatus further comprises:

the overfitting detection module is used for carrying out overfitting detection on the medical insurance detection model and judging whether the medical insurance detection model is overfitted according to a detection result;

And the model circulation training module is used for retraining the medical insurance detection model according to a preset training strategy when the medical insurance detection model is judged to be over-fitted until the medical insurance detection model is not over-fitted.

In a third aspect, in order to solve the foregoing technical problem, an embodiment of the present invention further provides a computer device, including a memory and a processor, where the memory stores computer readable instructions, where the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the feature processing method of medical insurance data described above.

The computer device may be a network device or may be a part of an apparatus in a network device, such as a chip system in a network device. The system-on-a-chip is configured to support the network device to implement the functions involved in the first aspect and any one of its possible implementations, for example, to receive, determine, and shunt data and/or information involved in the feature processing method of the medical insurance data. The chip system includes a chip, and may also include other discrete devices or circuit structures.

In a fourth aspect, in order to solve the above technical problem, an embodiment of the present invention further provides a storage medium storing computer readable instructions, where the computer readable instructions when executed by one or more processors cause the one or more processors to execute the steps of the feature processing method of medical insurance data.

In a fifth aspect, there is provided a computer program product which, when run on a computer, causes the computer to perform the feature processing method of medical insurance data as described in the first aspect and any one of its possible designs.

It should be noted that, the above-mentioned computer instructions may be stored in whole or in part on the first computer storage medium. The first computer storage medium may be packaged together with the processor of the feature processing device of the medical insurance data, or may be packaged separately from the processor of the feature processing device of the medical insurance data, which is not limited in the embodiment of the present application.

The description of the second, third, fourth and fifth aspects of the present invention may refer to the detailed description of the first aspect; the advantages of the second aspect, the third aspect, the fourth aspect and the fifth aspect may be referred to as analysis of the advantages of the first aspect, and will not be described here.

In the embodiment of the present application, the name of the feature processing apparatus of the medical insurance data does not limit the devices or functional modules themselves; in actual implementation, these devices or functional modules may appear under other names. Insofar as the function of each device or functional module is similar to that of the present application, it falls within the scope of the claims of the present application and the equivalents thereof.

These and other aspects of the invention will be more readily apparent from the following description.

The embodiment of the invention has the beneficial effects that: after the original medical insurance data in the medical insurance data source is obtained, the original medical insurance data is subjected to preprocessing operations such as data cleaning, conversion and the like to obtain basic data, the basic data is subjected to feature classification summarization and grouping to obtain derivative data, then the original medical insurance data, the basic data and the derivative data are integrated into a training data set, a large amount of index processing time is saved through one-time generation of the universal training data set, and labor time is saved.

Drawings

FIG. 1 is a schematic flow chart of a feature processing method of medical insurance data provided by an embodiment of the application;

FIG. 2 is a schematic flow chart of modeling and detecting medical insurance data anomalies by the feature processing method of the medical insurance data provided by the embodiment of the application;

FIG. 3 is a schematic flow chart of rechecking the medical insurance data to be detected according to the feature processing method of the medical insurance data provided by the embodiment of the application;

FIG. 4 is a schematic flow chart of rechecking the medical insurance history records of an insured person with a violation according to the feature processing method of medical insurance data provided by the embodiment of the application;

FIG. 5 is a schematic flow chart of model parameter adjustment in the feature processing method of medical insurance data provided by the embodiment of the application;

FIG. 6 is a schematic flow chart of a processing model of a feature processing method of medical insurance data provided by an embodiment of the application;

FIG. 7 is a schematic flow chart of model over-fitting detection in the feature processing method of medical insurance data provided by an embodiment of the application;

FIG. 8 is a schematic structural view of an embodiment of a feature processing device for medical insurance data according to an embodiment of the present application;

FIG. 9 is a basic structural block diagram of a computer device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As described in the background art, the existing modeling flow for medical insurance data focuses mainly on the modeling method and ignores the processing of the medical insurance data, so the trained model has strong bias and low accuracy.

In view of the above problems, an embodiment of the present application provides a feature processing method for medical insurance data. After original medical insurance data is obtained from a medical insurance data source, preprocessing operations such as data cleaning and conversion are performed on it to obtain basic data. Feature classification, summarization and grouping are performed on the basic data to obtain derivative data, and the original medical insurance data, the basic data and the derivative data are integrated into a training data set. Generating a general training data set in one pass saves a large amount of index processing time and labor time. Because the original medical insurance data is data of preset dimensions taken from a database openly shared by medical insurance institutions, a large number of model indexes are generated from medical insurance data of multiple dimensions, which increases the comprehensiveness of the data, so that the developed model is comprehensive and highly accurate.

The feature processing method of the medical insurance data can be applied to computer equipment. The computer equipment can be equipment for medical insurance risk control supervision, a chip in the equipment, or a system on chip in the equipment.

Alternatively, the device may be a physical machine, for example a desktop computer (desktop), cell phone, tablet computer, notebook computer, ultra-mobile personal computer (UMPC), netbook, personal digital assistant (PDA) or other terminal device.

Alternatively, the above-mentioned computer device may implement the functions to be implemented by the above-mentioned computer device through a Virtual Machine (VM) deployed on a physical machine.

The following describes the feature processing method of the medical insurance data provided by the embodiment of the application in detail with reference to the accompanying drawings. As shown in fig. 1, the feature processing method of the medical insurance data includes: S101-S104.

S101, acquiring original medical insurance data with preset dimensions in a medical insurance data source, wherein the medical insurance data source is a database which is shared by medical insurance institutions in an open mode.

Alternatively, the medical insurance organization may be a national medical insurance organization, or an organization or platform capable of acquiring medical insurance data from the national medical insurance organization, for example, an enterprise website for regularly issuing medical insurance statistical gazettes, etc., where the data source of the enterprise website is the national medical insurance organization.

In one possible implementation, the original medical insurance data is medical data such as personal health files, prescriptions and inspection reports. Specifically, when acquiring the original medical insurance data, the computer device first determines the dimensions of the required data. The preset dimensions include, but are not limited to, a personnel number ID dimension, a hospital dimension, a visit number dimension, a department dimension, a doctor dimension and an insured unit dimension. Taking these six dimensions as an example: the personnel number ID dimension is the ID number of the claimant, such as an identification card number; the hospital dimension is the name or abbreviated name of the hospital the claimant visited; the visit number dimension is the number of the medical card used when the claimant visits; the department dimension is the name of the hospital department the claimant visited; the doctor dimension is the attending physician; and the insured unit dimension is the unit through which the claimant is enrolled in medical insurance.

The computer equipment extracts data from the medical insurance data source in these six dimensions, namely the personnel number ID dimension, hospital dimension, visit number dimension, department dimension, doctor dimension and insured unit dimension. Taking the hospital dimension as an example, the computer equipment extracts the data related to a hospital from the medical insurance data source as original medical insurance data.

S102, carrying out data preprocessing on the original medical insurance data to generate basic data.

Specifically, after the original medical insurance data is obtained, the original medical insurance data can be subjected to preprocessing operations such as cleaning and conversion to obtain basic data.

For example, the cleaning operations performed on the original medical insurance data include, but are not limited to, dimensionless processing, missing value processing, outlier processing and discrete data processing. Data preprocessing requires the data samples to meet certain standards. Dimensionless processing converts data of different specifications to the same specification, so that data with different attributes (different units) become comparable.

Dimensionless processing methods include linear dimensionless methods, nonlinear dimensionless methods and dimensionless methods for qualitative indexes. A linear dimensionless method converts an actual index value into an index evaluation value that is not affected by dimension. It assumes a linear relation between the two, so that a change in the actual index value causes a proportional change in the index evaluation value. Linear dimensionless methods include, but are not limited to, min-max normalization and the Z-score method.

The min-max normalization method applies a linear transformation to the original data and maps it into [0, 1]. The formula of the min-max normalization method is as follows:

$$x' = \frac{x - \min}{\max - \min} \qquad ①$$

In formula ①, min is the minimum value of the sample, max is the maximum value of the sample, and x is the original data. Taking a group of height data ([1.70], [1.71], [1.72], [1.70], [1.73]) as an example, min-max normalization yields ([0], [0.3333], [0.6667], [0], [1]). Min-max normalization amplifies the differences between the data, which facilitates model learning.
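The height example can be checked with a minimal Python sketch. It assumes the standard min-max form (x - min) / (max - min); the function name `min_max_normalize` and the handling of a constant column are illustrative choices, not part of the described method.

```python
def min_max_normalize(values):
    """Linearly map values onto [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no spread, so map it to zeros
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


heights = [1.70, 1.71, 1.72, 1.70, 1.73]
normalized = [round(v, 4) for v in min_max_normalize(heights)]
print(normalized)
```

The rounded result reproduces the values quoted in the text for the height data.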

In other embodiments, the formula of the Z-score (zero-mean normalization) method is as follows:

$$x' = \frac{x - u}{\sigma} \qquad ②$$

In formula ②, x is the original data, u is the sample mean, and σ is the sample standard deviation. The Z-score method removes dimensions and avoids the influence that the choice of different dimensions has on distance calculation.
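As a hedged illustration, the Z-score transformation can be sketched in Python as (x - u) / σ. The sketch assumes σ is computed over all collected values (`statistics.pstdev`); if an unbiased sample estimate is intended, `statistics.stdev` could be substituted. The function name `z_score` is hypothetical.

```python
import statistics


def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    # Assumption: population standard deviation over the collected values.
    sigma = statistics.pstdev(values)
    if sigma == 0:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


heights = [1.70, 1.71, 1.72, 1.70, 1.73]
standardized = z_score(heights)
print([round(v, 4) for v in standardized])
```

After the transformation the values have a mean of approximately 0 and a standard deviation of approximately 1, regardless of the original unit.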

In some embodiments, data may be missing. The cause may be a failure of data collection or storage, such as a storage failure, memory damage or mechanical failure, or a subjective factor, such as an interviewee refusing to answer related questions or data entry personnel omitting data by mistake. Missing value processing handles such missing data and, in implementation, includes but is not limited to missing value completion, deleting features that contain missing values, and directly using features that contain missing values; it is not particularly limited herein.

Optionally, missing value completion includes, but is not limited to, mean interpolation, homogeneous mean interpolation, median interpolation and mode interpolation. In some embodiments, mean interpolation fills missing values with the average of the valid values of the sample attribute. Taking the visit number dimension as an example, the visit numbers are ([123], [124], [125], [126], [xxx], [122]), where xxx represents missing data. The average of the valid values is 124, so the visit numbers after mean interpolation are ([123], [124], [125], [126], [124], [122]).
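The visit number example can be reproduced with a short Python sketch. Missing entries are represented here by `None`, and the helper name `impute` is hypothetical; passing `statistics.median` instead of the default gives median interpolation with the same helper.

```python
import statistics


def impute(values, statistic=statistics.mean):
    """Replace None entries with a statistic computed over the valid entries."""
    valid = [v for v in values if v is not None]
    if not valid:
        # Nothing to compute the statistic from; leave the data unchanged.
        return list(values)
    fill = statistic(valid)
    return [fill if v is None else v for v in values]


visit_numbers = [123, 124, 125, 126, None, 122]
mean_filled = impute(visit_numbers)
median_filled = impute(visit_numbers, statistic=statistics.median)
print(mean_filled)
print(median_filled)
```

For this data the mean and the median of the valid values coincide, so both calls fill the gap with 124.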

Further, mean interpolation is applicable when the distance of the sample attribute is measurable. When the distance of the sample attribute is not measurable (non-numerical values), the missing value may be interpolated using the mode of the valid values of the sample attribute, that is, the value that occurs most often. Taking the department dimension as an example, the department values are ([orthopaedics], [surgery], [internal medicine], [neurology], [none], [surgery]), where [none] indicates missing data. Because surgery occurs most frequently, it is interpolated into the position of the missing value, giving ([orthopaedics], [surgery], [internal medicine], [neurology], [surgery], [surgery]).
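A minimal Python sketch of mode interpolation for a non-numerical attribute follows. `None` marks the missing entry, and the function name `impute_mode` is an illustrative assumption; ties between equally frequent values are resolved by first occurrence, which the text does not specify.

```python
from collections import Counter


def impute_mode(values):
    """Replace None entries with the most frequent valid value."""
    valid = [v for v in values if v is not None]
    if not valid:
        return list(values)
    # most_common(1) returns the value with the highest count; on a tie,
    # the value encountered first wins (an assumption of this sketch).
    mode_value = Counter(valid).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


departments = ["orthopaedics", "surgery", "internal medicine",
               "neurology", None, "surgery"]
filled_departments = impute_mode(departments)
print(filled_departments)
```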

Further, homogeneous mean interpolation first needs to classify sample data, and then interpolates missing values with the mean of samples in the class.
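As a rough sketch of homogeneous mean interpolation, the Python example below groups records by a class field and fills each missing value with the mean of its own class. The record layout, the field names `department` and `cost`, and the sample figures are hypothetical; how samples are classified in practice is not specified here.

```python
import statistics
from collections import defaultdict


def impute_class_mean(records, class_key, value_key):
    """Fill missing values with the mean of the valid values in the same class."""
    by_class = defaultdict(list)
    for record in records:
        if record[value_key] is not None:
            by_class[record[class_key]].append(record[value_key])
    class_means = {c: statistics.mean(vs) for c, vs in by_class.items()}

    filled = []
    for record in records:
        new_record = dict(record)
        if new_record[value_key] is None:
            # A class with no valid values keeps None (nothing to average).
            new_record[value_key] = class_means.get(new_record[class_key])
        filled.append(new_record)
    return filled


# Hypothetical sample data for illustration only.
records = [
    {"department": "surgery", "cost": 100.0},
    {"department": "surgery", "cost": None},
    {"department": "surgery", "cost": 300.0},
    {"department": "internal medicine", "cost": 50.0},
    {"department": "internal medicine", "cost": None},
]
filled_records = impute_class_mean(records, "department", "cost")
print([r["cost"] for r in filled_records])
```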

Further, median interpolation sorts a group of data by size and uses the valid value at the middle position to fill the missing value. For example, for the visit numbers above, ([123], [124], [125], [126], [xxx], [122]), the median is 124, so the visit numbers after median interpolation are ([123], [124], [125], [126], [124], [122]).

It is understood that in other embodiments, missing value completion may also use other methods capable of completing missing data, such as hot-deck interpolation, regression interpolation and multiple interpolation.

In some embodiments, the original medical insurance data may contain abnormal values. An abnormal value, also referred to as an outlier, is an unreasonable value in the data set. For example, the personnel number dimension is an identification card number, which has 18 digits; when a collected personnel number does not have 18 digits, it is confirmed to be an abnormal value.

Optionally, methods of determining outliers include, but are not limited to, box plot analysis, the 3σ principle and simple statistical analysis. Simple statistical analysis computes descriptive statistics of an attribute to see which values are unreasonable. For example, the identification card number attribute is defined as having 18 digits; if an identification card number in the sample data does not have 18 digits, that sample data is an abnormal value.
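The 18-digit rule lends itself to a very small Python sketch. It checks length only, as in the text; the function name `find_invalid_ids` and the placeholder numbers are hypothetical, and a real validator would likely also check the character set and checksum.

```python
def find_invalid_ids(id_numbers, expected_length=18):
    """Return the ID numbers whose length differs from the expected length."""
    return [n for n in id_numbers if len(str(n)) != expected_length]


# Placeholder values for illustration only, not real identification numbers.
collected_ids = ["1" * 18, "2" * 17, "3" * 18, "4" * 19]
invalid_ids = find_invalid_ids(collected_ids)
print(invalid_ids)
```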

Further, when the data obeys a normal distribution, the 3σ principle can be used. By the definition of the normal distribution, the probability of a value lying more than 3σ from the mean is about 0.003, which is an extremely small probability event, so sample data more than 3σ from the mean can be considered an abnormal value. Of course, in other embodiments, when the data does not follow a normal distribution, outliers can be described by how many standard deviations a value lies from the mean, with the multiple chosen according to the actual situation. For example, if the probability of lying more than 5 standard deviations from the mean is 0.001, sample data more than 5 standard deviations from the mean can be determined to be an outlier.
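A hedged Python sketch of this rule flags values whose distance from the mean exceeds k standard deviations, with k = 3 by default. The function name `sigma_outliers`, the use of the population standard deviation, and the sample figures are assumptions made for illustration.

```python
import statistics


def sigma_outliers(values, k=3.0):
    """Return values lying more than k standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]


# Hypothetical costs: twenty ordinary values and one extreme value.
costs = [50, 52, 49, 51, 50, 48, 53, 50, 49, 51,
         50, 52, 48, 50, 51, 49, 50, 52, 51, 50, 500]
flagged = sigma_outliers(costs)
print(flagged)
```

Note that the mean and standard deviation are themselves pulled toward extreme values, so on very small samples a single outlier may not exceed the 3σ threshold; the multiple `k` can be adjusted as the text describes.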

Further, to improve the accuracy of outlier determination, box plot analysis may be used. A box plot describes data using five statistics: the minimum, the first quartile (Q1), the median, the third quartile (Q3) and the maximum. To calculate Q1, the median and Q3, a group of data is sorted from small to large; the median is the value at the 50% position, and Q1 and Q3 are the values at the 25% and 75% positions of the sorted data. Let IQR = Q3 - Q1. Values between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR are within the acceptable range, and values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.

Further, outlier processing methods include, but are not limited to, deleting samples that contain outliers and treating outliers as missing values; they are not particularly limited herein.

Optionally, when an outlier is regarded as a missing value, it may be processed by following the missing value completion step, for example replacing the outlier by mean interpolation, homogeneous mean interpolation, median interpolation or mode interpolation.
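The box plot rule can be sketched in Python with `statistics.quantiles`. The choice of the "inclusive" quantile method is an assumption, since the text does not say how the 25% and 75% positions are interpolated; the function name `iqr_outliers` and the sample data are hypothetical.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in values if x < lower or x > upper]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
box_outliers = iqr_outliers(data)
print(box_outliers)
```

With this data Q1 = 3.25 and Q3 = 7.75, so the acceptable range is [-3.5, 14.5] and only 100 falls outside it.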

In some alternative embodiments, the original medical insurance data may be further processed in a discrete data process, where the discrete data process refers to converting the type data into a numerical value that can be put into a model for calculation, where the original medical insurance data is sometimes not always a continuous value, possibly some sort values, and the value of some sort values has a big meaning, for example, size: [ X, XL, XXL ], can be mapped with values { X:1, XL:2, XXL:3}. While other classification values have no significance in terms of magnitude, such as hospitals: one-hot-encoding (one-hot-encoding) can be used for this type of data, one-hot-encoding being a method of converting a classification variable into several binary columns. Taking department dimensions including orthopedics, internal medicine, surgery, neurology and ophthalmology as examples, the single-heat codes of orthopedics are: 10000, the single hot code of internal medicine is: 01000, surgical single-heat coding: 00100, the monothermal coding of the neurology department is: 00010 ophthalmic monothermal coding is: 00001. the data after the single thermal coding can be directly used for the classifier, so that the problem that the classifier cannot benefit attribute data is solved.
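Both encodings can be illustrated with a short Python sketch: an ordinal mapping for ordered categories and a one-hot encoder for unordered ones. The names `SIZE_ORDER`, `DEPARTMENTS` and `make_one_hot_encoder` are hypothetical, and the category order is taken from the example in the text.

```python
# Ordered categories: map each label to its rank.
SIZE_ORDER = {"X": 1, "XL": 2, "XXL": 3}

# Unordered categories: position in this list defines the one-hot column.
DEPARTMENTS = ["orthopedics", "internal medicine", "surgery",
               "neurology", "ophthalmology"]


def make_one_hot_encoder(categories):
    """Build a function that maps a category to a binary indicator vector."""
    index = {category: i for i, category in enumerate(categories)}

    def encode(value):
        vector = [0] * len(categories)
        vector[index[value]] = 1  # raises KeyError for an unseen category
        return vector

    return encode


encode_department = make_one_hot_encoder(DEPARTMENTS)
encoded_sizes = [SIZE_ORDER[s] for s in ["XL", "X", "XXL"]]
print(encoded_sizes)
print(encode_department("surgery"))
```

Each department becomes a vector with exactly one 1, so no artificial ordering between departments is introduced.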

The data cleaning described above uses dimensionless processing, missing value processing, outlier processing and discrete data processing, and the execution order of these four processes is not fixed. For example, the missing value processing may be performed first, followed by the dimensionless processing, outlier processing and discrete data processing; or the discrete data processing may be performed first, followed by the missing value processing, outlier processing and dimensionless processing.

And S103, carrying out feature classification, summarization and grouping on the basic data according to the preset dimension to obtain derivative data.

The computer device performs feature classification, feature grouping and summarization, and feature grouping on the cleaned data according to different dimensions, and then performs statistics and derives interpretable summary fields. Optionally, the feature classification is performed according to the preset dimensions in the medical insurance data. For example, classification is performed according to six dimensions (the personnel number ID dimension, hospital dimension, visit number dimension, department dimension, doctor dimension and participation unit dimension), so that the features corresponding to each of these six dimensions are obtained.

Further, feature grouping and summarization means that the features generated for the different dimensions are summarized into N dimension features, for example the six dimension features above. Taking the personnel number ID dimension as an example, the data of a personnel number ID can be looked up in the cleaned data, and the data found can be summarized under the personnel number ID dimension.

Specifically, the feature classification may be performed on the data by using a variance selection method, chi-square test, or recursive feature elimination method, where the variance selection method first calculates the variance of each feature, and then selects features with variances greater than a threshold according to the threshold.
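The variance selection method can be sketched as follows: the variance of each feature column is computed and only the columns whose variance exceeds a threshold are kept. The column names, values and threshold are hypothetical, and the population variance is an assumed choice.

```python
import statistics


def select_by_variance(columns, threshold):
    """Keep the features whose population variance is greater than `threshold`."""
    variances = {name: statistics.pvariance(values) for name, values in columns.items()}
    selected = [name for name, variance in variances.items() if variance > threshold]
    return selected, variances


# Hypothetical feature columns; "flag" is constant and carries no information.
columns = {
    "visits": [1, 5, 9, 3],
    "flag": [1, 1, 1, 1],
    "fee": [100, 120, 90, 400],
}
selected, variances = select_by_variance(columns, threshold=0.5)
print(selected)
```

Because variance depends on the scale of each feature, this filter is usually applied after the dimensionless processing step.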

In some embodiments, the chi-square test is used to test the correlation between a qualitative independent variable and a qualitative dependent variable. Assuming that the independent variable has N values and the dependent variable has M values, the feature can be selected by considering the gap between the observed and expected frequencies of samples for which the independent variable equals i and the dependent variable equals j. Specifically, the SelectKBest class of the feature_selection library can be used in combination with the chi-square test.
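To make the comparison of observed and expected frequencies concrete, the sketch below computes the classic Pearson chi-square statistic by hand and keeps the k highest-scoring features. It is an illustration under stated assumptions, not the SelectKBest implementation named above: the function names, feature names and labels are hypothetical.

```python
from collections import Counter


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic between two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))  # count of samples with (i, j)
    feature_totals = Counter(feature)
    target_totals = Counter(target)
    statistic = 0.0
    for i, count_i in feature_totals.items():
        for j, count_j in target_totals.items():
            expected = count_i * count_j / n  # frequency expected under independence
            statistic += (observed[(i, j)] - expected) ** 2 / expected
    return statistic


def select_k_best(features, target, k):
    """Rank features by chi-square score and return the top k names plus all scores."""
    scores = {name: chi_square_statistic(values, target) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical labels and features: "hospital_level" separates the labels, "gender" does not.
target = ["normal", "normal", "violation", "violation"]
features = {
    "hospital_level": ["A", "A", "B", "B"],
    "gender": ["M", "F", "M", "F"],
}
best, scores = select_k_best(features, target, k=1)
print(best, scores)
```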

Alternatively, the recursive feature elimination method performs multiple rounds of training using a base model. After each round of training, the features corresponding to several weight coefficients are eliminated, and the next round of training is performed on the new feature set. Specifically, the RFE class of the feature_selection library can be used to select the features.

Further, the statistics performed after feature grouping include a statistical analysis method and a proportion (duty ratio) analysis method. The statistical analysis method performs statistical analysis on the features within each group and obtains their statistical indexes, including the maximum value, minimum value, average value, sum, median, first quartile, third quartile and/or variance.

Further, the proportion (duty ratio) analysis method refers to calculating various proportions for each dimension. Taking the hospital dimension as an example, the proportion of the number of patients of hospital a to the total number of hospital patients in the time period can be calculated, or the proportion of the patient fees of hospital b to the total hospital patient fees in the time period can be calculated.
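The two kinds of derived indexes can be sketched for the hospital dimension as follows. The record layout, field names and values are hypothetical; the sketch groups fees by hospital, computes the statistical indexes listed above for one group, and computes each hospital's share of the total fee.

```python
import statistics
from collections import defaultdict

# Hypothetical cleaned records for one time period.
records = [
    {"hospital": "a", "fee": 100.0},
    {"hospital": "a", "fee": 300.0},
    {"hospital": "a", "fee": 200.0},
    {"hospital": "b", "fee": 400.0},
]


def group_values(rows, key, field):
    """Collect the values of `field` for each distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[field])
    return dict(groups)


def summarize(values):
    """Statistical indexes of one group."""
    if len(values) >= 2:
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    else:
        q1 = median = q3 = values[0]  # quartiles are undefined for a single value
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "sum": sum(values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "variance": statistics.pvariance(values),
    }


def shares(groups):
    """Proportion of each group's total within the overall total."""
    total = sum(sum(values) for values in groups.values())
    return {name: sum(values) / total for name, values in groups.items()}


fees_by_hospital = group_values(records, "hospital", "fee")
stats_a = summarize(fees_by_hospital["a"])
fee_share = shares(fees_by_hospital)
print(stats_a, fee_share)
```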

S104, integrating the original medical insurance data, the basic data and the derivative data to generate a training data set.

Specifically, after the derivative data is obtained, the computer device gathers the original medical insurance data, the basic data and the derivative data. In implementation, data gathering and integration can be performed according to the different dimensions to generate a training data set. The generated training data set can be used for subsequent model development. Because a large number of model indexes (data) are generated through the feature library, data modeling and analysis can be performed in multiple dimensions, which increases the comprehensiveness of the data.
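One possible way to integrate the three kinds of data is sketched below. The join keys (`visit_id`, `hospital`), the field prefixes and the sample rows are assumptions introduced for illustration; the application does not specify how the records are keyed.

```python
def build_training_rows(raw_rows, base_rows, derived_by_hospital):
    """Merge raw, basic and derived fields into one row per visit."""
    base_by_visit = {row["visit_id"]: row for row in base_rows}
    training_rows = []
    for raw in raw_rows:
        # Prefix the fields so the origin of every column stays visible.
        row = {f"raw_{key}": value for key, value in raw.items()}
        base = base_by_visit.get(raw["visit_id"], {})
        row.update({f"base_{key}": value for key, value in base.items() if key != "visit_id"})
        derived = derived_by_hospital.get(raw["hospital"], {})
        row.update({f"derived_{key}": value for key, value in derived.items()})
        training_rows.append(row)
    return training_rows


# Hypothetical inputs: one raw visit, its cleaned version and a hospital-level index.
raw_rows = [{"visit_id": "v1", "hospital": "a", "fee": "100.0"}]
base_rows = [{"visit_id": "v1", "fee": 100.0}]
derived_by_hospital = {"a": {"fee_share": 0.6}}
training_set = build_training_rows(raw_rows, base_rows, derived_by_hospital)
print(training_set)
```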

According to the embodiment of the application, after the original medical insurance data in the medical insurance data source is obtained, preprocessing operations such as data cleaning and conversion are performed on it to obtain the basic data, and feature classification, summarization and grouping are performed on the basic data to obtain the derivative data. The original medical insurance data, the basic data and the derivative data are then integrated into the training data set. A universal training data set is thus generated at one time, which saves a large amount of index processing time and labor. Because the original medical insurance data is medical insurance data with preset dimensions from a database openly shared by medical insurance institutions, a large number of model indexes are generated from medical insurance data with multiple dimensions, which increases the comprehensiveness of the data and makes the developed model comprehensive and highly accurate.

In some alternative embodiments, referring to FIG. 2, FIG. 2 is a flow chart illustrating modeling and detection of medical insurance data anomalies according to one embodiment of the present application.

As shown in fig. 2, after the step of integrating the original medical insurance data, the basic data and the derivative data to generate a training data set, the feature processing method of the medical insurance data provided by the application further includes the following steps:

S105, inputting the training data set into a preset neural network for training, and generating a medical insurance detection model;

S106, inputting the medical insurance data to be detected into the medical insurance detection model, and detecting the type of the medical insurance data to be detected, wherein the type comprises normal, suspected and illegal.

The neural network is a computing system having interconnected nodes, optionally including, but not limited to, a feedforward neural network (FNN), a convolutional neural network (CNN), a recurrent neural network (RNN), and the like. In implementation, taking a deep learning algorithm as an example, vectors of M dimensions can be used to sample the training data set. Taking as an example a training data set that includes the six dimensions of personnel number ID, hospital, visit number, department, doctor and participation unit, the default step length is set to 1 and 2m+1 vectors of M dimensions are generated. The 2m+1 vectors of M dimensions are processed by forests of a preset type to generate 2m+1 vectors of 3 dimensions, and the 2m+1 vectors of 3 dimensions are connected together to generate a G-dimensional vector, where m=2 and G = 3 × (2m+1) × the number of forests.

For the generated G-dimensional vector, each layer receives feature information contained in the feature vector in a cascading mode, performs feature learning through multi-layer and multi-type forest combinations, splices the output result of each layer with the generated G-dimensional vector, performs feature learning through the forest combinations of each layer again, and keeps parameters of each layer as a medical insurance detection model according to training times and convergence indexes.

The type of the medical insurance data to be detected can be detected through the medical insurance detection model, and optionally, the type of the medical insurance data is the type of the medical insurance behavior contained in the medical insurance data, including normal, suspected and illegal. The medical insurance violation behavior in the medical process can be accurately, timely and effectively detected in real time, so that the accuracy and the high efficiency of medical insurance fund supervision are improved.

In some alternative embodiments, referring to fig. 3, fig. 3 is a schematic flow chart of a re-audit of medical insurance data to be tested according to an embodiment of the present application.

As shown in fig. 3, after the step of inputting the medical insurance data to be detected into the medical insurance detection model and detecting the type of the medical insurance data to be detected, the feature processing method of the medical insurance data provided by the application further includes the following steps:

S107, obtaining an auditing result of rechecking the medical insurance data to be detected whose type is abnormal;

S108, taking the auditing result as a marked training sample and adding the labeled training sample into the training data set.

After the medical insurance data to be detected is input into the medical insurance detection model for detection, the data detected as abnormal (suspected or illegal) can be rechecked. Optionally, the recheck may be performed manually, which can effectively improve the efficiency and accuracy of the medical insurance data audit. The rechecked result can be added to the training data set for training the model, forming a self-learning closed loop for detecting medical insurance violations. This improves the sensitivity of the medical insurance detection model, so that newly appearing medical insurance violations can be detected more quickly. Early warning before the event, alerts during the event, and analysis and control after the event can be provided for different violations, which ensures the real-time performance of medical insurance behavior supervision.

In some alternative embodiments, referring to fig. 4, fig. 4 is a flow chart of a review of the medical insurance history data of an offending participant in accordance with one embodiment of the present application.

As shown in fig. 4, after the step of obtaining the auditing result of rechecking the to-be-detected medical insurance data with abnormal type, the feature processing method of the medical insurance data provided by the application further includes the following steps:

S109, obtaining the underwriting person information of the medical insurance data to be detected, of which the auditing result is illegal;

S110, acquiring all medical insurance history records matched with the underwriting person information, and rechecking the medical insurance history records.

After the medical insurance data to be detected whose type is abnormal has been rechecked, the illegal medical insurance data and the corresponding underwriting person information are identified, where the underwriting person information is the identity information of the person who bought the medical insurance. When the medical insurance data of an underwriting person is determined to be illegal, the medical insurance history data of that person may also contain illegal data. The system can therefore acquire all medical insurance history records of the underwriting person and recheck them, so as to find all illegal medical insurance behaviors of that person. This facilitates subsequent adjustments for the underwriting person, such as refusing to maintain the person's medical insurance or increasing the auditing strength applied to the person's medical insurance data, which is not particularly limited herein.

In some alternative embodiments, referring to fig. 5, fig. 5 is a schematic flow chart of model tuning according to an embodiment of the present application.

As shown in fig. 5, after the training data set is input to a preset neural network to perform training, and a medical insurance detection model is generated, the feature processing method of medical insurance data provided by the application further includes the following steps:

S111, acquiring performance parameters of the medical insurance detection model;

S112, adjusting model parameters of the medical insurance detection model according to the performance parameters.

After generating the medical insurance detection model, the computer device may also evaluate the performance of the model to obtain the performance parameters of the medical insurance detection model, optionally including, but not limited to, accuracy, precision, recall and the F1 value.

Further, the accuracy, as an evaluation index of the medical insurance detection model, can be calculated from the output of the confusion matrix. The accuracy calculation formula is as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{\text{all data}}$$ ③

TP and TN in equation ③ represent the correctly predicted samples and all data represents all samples, so the accuracy represents the proportion of correctly predicted samples among all samples.

Further, the precision represents the proportion of samples whose real category is a given attribute among the samples predicted as that attribute; the recall represents the proportion of samples successfully predicted by the model among the samples that actually have the attribute; and the F1 value is the harmonic mean of the precision and the recall. The specific type of performance evaluation is not limited.
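The four performance parameters can be computed from the cells of a binary confusion matrix as sketched below. The function name and the sample counts are hypothetical, and treating the result as a two-class problem is an assumption, since the detection model described here distinguishes three types.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 8 violations found, 2 false alarms, 2 missed, 88 normal records.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
print(metrics)
```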

Further, after the performance parameters of the model are obtained, the parameters of the model, such as the tuning parameters, may be adjusted according to the performance parameters, and in implementation, the tuning parameters may be selected near default parameters of the learning algorithm, or parameter tuning spaces including a plurality of parameters may be configured, and parameters in the parameter tuning spaces may be traversed to select optimal parameters as parameters of the model, so that the performance of the model may be optimized, which is not limited herein.
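Traversing a parameter tuning space can be sketched as an exhaustive search over all parameter combinations. The parameter names `n_trees` and `max_depth`, the candidate values and the scoring function `toy_evaluate` are placeholders; in practice the score would come from training and evaluating the medical insurance detection model.

```python
from itertools import product


def grid_search(param_space, evaluate):
    """Try every combination in `param_space` and return the best one with its score."""
    names = list(param_space)
    best_params, best_score = None, float("-inf")
    for combination in product(*(param_space[name] for name in names)):
        params = dict(zip(names, combination))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for a real train-and-validate run; peaks at n_trees=100, max_depth=6."""
    return 1.0 - abs(params["n_trees"] - 100) / 1000 - abs(params["max_depth"] - 6) / 100


# Hypothetical tuning space chosen around assumed default parameters.
space = {"n_trees": [50, 100, 200], "max_depth": [4, 6, 8]}
best_params, best_score = grid_search(space, toy_evaluate)
print(best_params, best_score)
```

The number of evaluations grows as the product of the candidate counts, so the space is usually kept small and centered on the defaults.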

In some alternative embodiments, referring to fig. 6, fig. 6 is a flow chart of a process model according to an embodiment of the present application.

As shown in fig. 6, after the step of obtaining the performance parameters of the medical insurance detection model, the feature processing method of the medical insurance data provided by the application further includes the following steps:

S113, judging whether the performance parameters accord with preset model standards;

After obtaining the performance parameters of the medical insurance detection model, the computer device may further determine whether the performance parameters meet a preset model standard, where the model standard is a pre-configured parameter threshold. For example, the parameter threshold of the accuracy includes a minimum value and a maximum value, where the minimum value is 99%. When the calculated accuracy reaches 99%, the parameters of the model are adjusted to optimize the model, and the model is saved. If the accuracy is lower than 99%, the model performs poorly; when the model standard is judged not to be met, step S114 is performed.

S114, deleting the medical insurance detection model.

When the performance of the medical insurance detection model is poor, the computer device can delete it, so that poorly performing models are removed and well-performing models are retained, which can effectively improve the accuracy of data detection.
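The check against the model standard can be sketched as a comparison of each performance parameter with its minimum threshold. The 99% accuracy minimum comes from the example above; the model names, the stored performance values and the helper names are hypothetical.

```python
def meets_standard(performance, minimums):
    """True when every thresholded parameter reaches its minimum value."""
    return all(performance.get(name, 0.0) >= floor for name, floor in minimums.items())


def keep_qualified_models(models, performance_by_model, minimums):
    """Return only the models that meet the standard; the others are dropped."""
    return {
        name: model
        for name, model in models.items()
        if meets_standard(performance_by_model[name], minimums)
    }


MINIMUMS = {"accuracy": 0.99}  # minimum accuracy from the example in the text
models = {"model_a": object(), "model_b": object()}  # placeholders for trained models
performance = {"model_a": {"accuracy": 0.993}, "model_b": {"accuracy": 0.95}}
kept = keep_qualified_models(models, performance, MINIMUMS)
print(list(kept))
```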

In some alternative embodiments, referring to fig. 7, fig. 7 is a schematic flow chart of the model overfitting detection according to an embodiment of the present application.

As shown in fig. 7, after the training data set is input to a preset neural network to perform training, and a medical insurance detection model is generated, the feature processing method of medical insurance data provided by the application further includes the following steps:

S115, performing fitting detection on the medical insurance detection model;

S116, judging whether the medical insurance detection model is over-fitted according to the detection result;

After the medical insurance detection model is generated, fitting detection can be performed on it. Over-fitting means that the model performs well on the training set but poorly on the verification set. Whether over-fitting has occurred can be judged from the prediction results; when over-fitting occurs, step S117 is executed, and otherwise step S118 is executed.

S117, retraining the medical insurance detection model according to a preset training strategy until the model is no longer over-fitted.

S118, saving the medical insurance detection model.

When the model is over-fitted, the model parameters can be adjusted and the model retrained to correct the over-fitting, for example by repeatedly training the model with additional data and regularization terms until the model is no longer over-fitted.
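Steps S115 to S118 can be sketched as a loop that compares training and verification scores and retrains with stronger regularization while the gap stays too large. The gap threshold of 0.05, the regularization step and the `toy_train` function are assumptions made for illustration; a real training strategy could equally add data instead.

```python
def is_overfitted(train_score, validation_score, max_gap=0.05):
    """Treat the model as over-fitted when training beats verification by more than max_gap."""
    return train_score - validation_score > max_gap


def retrain_until_fitted(train, step=0.1, max_rounds=10):
    """Retrain with increasing regularization until the model is no longer over-fitted."""
    for round_index in range(max_rounds):
        regularization = step * round_index
        train_score, validation_score = train(regularization)
        if not is_overfitted(train_score, validation_score):
            return regularization, round_index + 1
    raise RuntimeError("model is still over-fitted after max_rounds")


def toy_train(regularization):
    """Stand-in for real training: the score gap shrinks as regularization grows."""
    train_score = 0.99 - 0.1 * regularization
    validation_score = min(0.80 + 0.4 * regularization, train_score)
    return train_score, validation_score


regularization, rounds = retrain_until_fitted(toy_train)
print(regularization, rounds)
```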

The foregoing description of the solution provided by the embodiments of the present application has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

According to the embodiment of the application, the function modules of the feature processing device of the medical insurance data can be divided according to the method example, for example, each function module can be divided corresponding to each function, and two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. Optionally, the division of the modules in the embodiment of the present application is schematic, which is merely a logic function division, and other division manners may be implemented in practice.

Referring specifically to fig. 8, fig. 8 is a schematic diagram illustrating a basic structure of a feature processing device for medical insurance data according to this embodiment.

As shown in fig. 8, a feature processing device for medical insurance data includes:

The medical insurance data acquisition module 1110 is configured to acquire original medical insurance data with a preset dimension in a medical insurance data source, where the medical insurance data source is a database that is shared by medical insurance institutions in an open manner;

the data preprocessing module 1120 is configured to perform data preprocessing on the original medical insurance data to generate basic data;

The data deriving module 1130 is configured to perform feature classification, summarization and grouping on the basic data according to the preset dimension, so as to obtain derived data;

The data integration module 1140 is configured to integrate the original medical insurance data, the base data, and the derivative data to generate a training data set.

Optionally, the feature processing device of medical insurance data provided by the application further comprises:

In order to solve the technical problems, the embodiment of the invention also provides computer equipment. Referring specifically to fig. 9, fig. 9 is a basic structural block diagram of a computer device according to the present embodiment.

As shown in fig. 9, the internal structure of the computer device is schematically shown. The computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected by a system bus. The nonvolatile storage medium of the computer device stores an operating system, a database and computer readable instructions, the database can store a control information sequence, and the computer readable instructions can enable the processor to realize a feature processing method of medical insurance data when the computer readable instructions are executed by the processor. The processor of the computer device is used to provide computing and control capabilities, supporting the operation of the entire computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, cause the processor to perform a method of characterizing medical insurance data. The network interface of the computer device is for communicating with a terminal connection. It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.

The processor in this embodiment is configured to perform the specific functions of the medical insurance data acquisition module 1110, the data preprocessing module 1120, the data deriving module 1130 and the data integration module 1140 in fig. 8, and the memory stores the program codes and various types of data required for executing the above modules. The network interface is used for data transmission to and from the user terminal or the server. The memory in this embodiment stores the program codes and data required for executing all the sub-modules in the feature processing device of medical insurance data, and the server can call these program codes and data to execute the functions of all the sub-modules.

The computer device acquires the original medical insurance data in the medical insurance data source, performs preprocessing operations such as data cleaning and conversion on it to obtain the basic data, and performs feature classification, summarization and grouping on the basic data to obtain the derivative data. It then integrates the original medical insurance data, the basic data and the derivative data into a training data set. A universal training data set is thus generated at one time, which saves a large amount of index processing time and labor. Because the original medical insurance data is medical insurance data with preset dimensions from a database openly shared by medical insurance institutions, a large number of model indexes are generated from the multi-dimensional medical insurance data, which increases the comprehensiveness of the data and makes the developed model comprehensive and highly accurate.

The present invention also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the feature processing method of medical insurance data of any of the embodiments described above.

Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or may be a Random Access Memory (RAM).

Those of skill in the art will appreciate that the various operations, methods, steps in the flow, acts, schemes, and alternatives discussed in the present application may be alternated, altered, combined, or eliminated. Further, other steps, means, or steps in a process having various operations, methods, or procedures discussed herein may be alternated, altered, rearranged, disassembled, combined, or eliminated. Further, steps, measures, schemes in the prior art with various operations, methods, flows disclosed in the present application may also be alternated, altered, rearranged, decomposed, combined, or deleted.

The foregoing is only a partial embodiment of the present application, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims

1. A method for characterizing medical insurance data, comprising:

Performing feature classification, summarization and grouping on the basic data according to the preset dimension to obtain derivative data, wherein the feature classification of the data can use a variance selection method, chi-square test or a recursive feature elimination method;

integrating the original medical insurance data, the basic data and the derivative data to generate a training data set;

After the step of integrating the raw medical insurance data, the base data, and the derivative data to generate a training dataset, the method further comprises the steps of:

Inputting medical insurance data to be detected into the medical insurance detection model, and detecting the type of the medical insurance data to be detected;

After the step of inputting the medical insurance data to be detected into the medical insurance detection model and detecting the type of the medical insurance data to be detected, the method further comprises the following steps:

Adding the auditing result as a marked training sample into the training data set;

After the step of obtaining the auditing result of rechecking the medical insurance data to be detected, which is abnormal in the type of the medical insurance data to be detected, the method further comprises the following steps:

2. The method for processing features of medical insurance data according to claim 1, wherein after said step of inputting said training data set into a preset neural network for training to generate a medical insurance detection model, said method further comprises the steps of:

Acquiring performance parameters of the medical insurance detection model;

3. The method of claim 2, wherein after the step of obtaining performance parameters of the medical insurance detection model, the method further comprises the steps of:

4. The method for processing features of medical insurance data according to claim 1, wherein after said step of inputting said training data set into a preset neural network for training to generate a medical insurance detection model, said method further comprises the steps of:

and when the medical insurance detection model is judged to be over-fitted, retraining the medical insurance detection model according to a preset training strategy until the medical insurance detection model is not over-fitted.

5. A device for characterizing medical insurance data, comprising:

the data deriving module is used for carrying out feature classification, summarization and grouping on the basic data according to the preset dimension to obtain derived data, wherein the feature classification of the data can be carried out by using a variance selection method, a chi-square test or a recursive feature elimination method;

The data integration module is used for integrating the original medical insurance data, the basic data and the derivative data to generate a training data set;

The apparatus further comprises:

the data detection module is used for inputting medical insurance data to be detected into the medical insurance detection model and detecting the type of the medical insurance data to be detected;

The apparatus further comprises:

the training sample adding module is used for adding the auditing result serving as a marked training sample into the training data set;

The apparatus further comprises:

6. A computer device comprising a memory and a processor, wherein the memory has stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the method of characterizing medical insurance data according to any one of claims 1 to 4.

7. A non-volatile storage medium, characterized in that it stores a computer program implemented according to the method of characterizing medical insurance data according to any one of claims 1 to 4, which, when invoked by a computer, executes the steps comprised by the method.