CN113920384A

CN113920384A - Feature validity evaluation method, device, equipment and storage medium

Info

Publication number: CN113920384A
Application number: CN202111187176.9A
Authority: CN
Inventors: 满天龙; 张俊杰
Original assignee: Ping An Medical and Healthcare Management Co Ltd
Current assignee: Shenzhen Ping An Medical Health Technology Service Co Ltd
Priority date: 2021-10-12
Filing date: 2021-10-12
Publication date: 2022-01-11

Abstract

The invention discloses a feature validity evaluation method, device, equipment and storage medium. The method is applied to a risk control model of a medical insurance bureau. Specifically, a first valid feature data set is obtained through a tree structure model, a second valid feature data set is obtained through a deep learning neural network model, and a third valid feature data set is obtained based on a target business scenario; the union of the first, second and third valid feature data sets is then used as the target valid feature data set. High-quality features can thus be selected for the risk control model from multiple dimensions, providing intelligent assistance for big-data risk control of medical insurance.

Description

Feature validity evaluation method, device, equipment and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a feature validity evaluation method, a feature validity evaluation device, feature validity evaluation equipment and a storage medium.

Background

Feature engineering is an important link in big data risk control. The richness of the features it generates determines, to a certain degree, the quality of machine learning and deep learning models. Feature engineering can produce tens of thousands of different kinds of features, and extracting the useful ones is an important part of the process. Useless features interfere with model training, prolong the training time, and sometimes even cause problems such as model under-fitting.

At present, feature validity evaluation in feature engineering relies largely on manual screening by business personnel. This places high demands on their business capability, incurs high labor cost, and makes it easy to omit valid features. The automated feature validity evaluation methods that do exist are scattered and unsystematic.

Therefore, the prior art still needs to be improved.

Disclosure of Invention

The present invention provides a feature validity evaluation method, apparatus, device and storage medium, aiming to solve the technical problem that selecting valid features with the prior-art feature validity evaluation methods is time-consuming, labor-intensive and prone to omission.

In a first aspect, the present application provides a feature validity assessment method, including:

acquiring the importance of each feature data in the initial feature data set by using a tree structure model, and taking the feature data with the importance greater than an importance threshold value as effective feature data to obtain a first effective feature data set;

inputting the initial feature data set into a deep learning neural network model to obtain a first result, obtaining a first subset of initial feature data and a second subset of the initial feature data based on the initial feature data set, inputting the first subset of the initial feature data into the deep learning neural network model to obtain a second result, and taking the second subset of the initial feature data as effective feature data to obtain a second effective feature data set under the condition that the error between the first result and the second result is greater than an error threshold value, wherein the initial feature data set comprises the first subset of the initial feature data and the second subset of the initial feature data;

obtaining the initial feature data set in business dimensions and, based on a target business scenario, taking the feature data in the target business dimensions related to the target business scenario as valid feature data to obtain a third valid feature data set, wherein the business dimensions comprise one or more of a personnel number dimension, a hospital dimension, a visit number dimension, a department dimension, a physician dimension, an insured unit dimension and a multi-dimensional feature;

taking the union of the first valid feature data set, the second valid feature data set and the third valid feature data set as the target valid feature data set.

Optionally, the tree structure model includes a random forest model, a composite tree model xgboost and a decision tree model lightgbm, the importance of each feature data in the initial feature data set is obtained by using the tree structure model, and the feature data whose importance is greater than an importance threshold is used as effective feature data to obtain a first effective feature data set, where the method includes:

acquiring a first importance of each feature data in the initial feature data set by using a random forest model, and taking the feature data of which the first importance is greater than an importance threshold value as effective feature data to obtain a fourth effective feature data set;

acquiring a second importance of each feature data in the initial feature data set by using a composite tree model xgboost, and taking the feature data with the second importance greater than an importance threshold as effective feature data to obtain a fifth effective feature data set;

acquiring a third importance of each feature data in the initial feature data set by using a decision tree model lightgbm, and taking the feature data with the third importance greater than an importance threshold as effective feature data to obtain a sixth effective feature data set;

determining an intersection of the fourth valid feature data set, the fifth valid feature data set and the sixth valid feature data set as the first valid feature data set, or determining a union of the fourth valid feature data set, the fifth valid feature data set and the sixth valid feature data set as the first valid feature data set.
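The combination of the three per-model selections can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `combine_model_selections`, the `mode` parameter and the feature names are all hypothetical, and the three input sets stand in for the fourth, fifth and sixth valid feature data sets.

```python
from typing import Iterable, Set


def combine_model_selections(selections: Iterable[Set[str]],
                             mode: str = "intersection") -> Set[str]:
    """Combine per-model feature selections by intersection or union."""
    sets = [set(s) for s in selections]
    if not sets:
        return set()
    if mode == "intersection":
        return set.intersection(*sets)
    if mode == "union":
        return set.union(*sets)
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical selections from the random forest, xgboost and lightgbm steps.
fourth = {"max_annual_cost", "tertiary_visits", "disease_count"}
fifth = {"max_annual_cost", "tertiary_visits"}
sixth = {"max_annual_cost", "hospital_distance"}

strict = combine_model_selections([fourth, fifth, sixth], mode="intersection")
broad = combine_model_selections([fourth, fifth, sixth], mode="union")
```

The intersection keeps only features that every model rates as important, while the union keeps any feature that at least one model rates as important.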

Optionally, the deep learning neural network model includes three hidden layers, the inputting the initial feature data set into the deep learning neural network model to obtain a first result, obtaining a first subset of initial feature data and a second subset of initial feature data based on the initial feature data set, inputting the first subset of initial feature data into the deep learning neural network model to obtain a second result, and taking the second subset of initial feature data as valid feature data to obtain a second valid feature data set when an error between the first result and the second result is greater than an error threshold, where the obtaining includes:

inputting the initial characteristic data set into a deep learning neural network model to obtain a first result;

selecting three quarters of the initial feature data as a first subset of the initial feature data and one quarter of the initial feature data as a second subset of the initial feature data;

inputting the first subset of the initial feature data into the deep learning neural network model to obtain a second result;

under the condition that the error between the first result and the second result is larger than an error threshold value, reselecting feature data less than the last selected proportion from the initial feature data as the initial feature data first subset, and selecting feature data more than or equal to the last selected proportion from the initial feature data as the initial feature data second subset;

inputting the initial characteristic data first subset into the deep learning neural network model again to obtain a new first result;

and under the condition that the error between the first result and the second result is greater than an error threshold value, repeatedly selecting less than the last selection proportion of feature data from the initial feature data as a first subset of the initial feature data, and selecting more than or equal to the last selection proportion of feature data from the initial feature data as a second subset of the initial feature data, until under the condition that the error between the first result and the second result is less than or equal to the error threshold value, using the second subset of the initial feature data selected last as effective feature data to obtain a second effective feature data set.

Optionally, after the first subset of the initial feature data is input into the deep learning neural network model again to obtain a new first result, the method further includes:

and under the condition that the error between the first result and the second result is smaller than or equal to an error threshold value, removing the second subset of the initial characteristic data from the initial characteristic data subset to obtain a new data set so as to update the initial characteristic data subset.

Optionally, the taking a union of the first valid feature data set, the second valid feature data set, and the third valid feature data set as the target valid feature data set includes:

acquiring an intersection of the first valid feature data set, the second valid feature data set and the third valid feature data set;

and carrying out duplication elimination processing on the intersection to obtain the target effective characteristic data set.

Optionally, the target business scenario includes back-and-forth hospitalization at low-grade hospitals, frequent visitors to low-grade hospitals, group formation drug dispensing, and relatively slow qualification monitoring, and the obtaining, based on the target business scenario, of feature data in a target business dimension related to the target business scenario as valid feature data to obtain a third valid feature data set includes:

in the case that the target business scenario includes back-and-forth hospitalization at low-grade hospitals, taking as valid feature data the maximum annual hospitalization cost in the hospital dimension, the number of non-tertiary hospital hospitalization institutions in the department dimension, the number of non-tertiary hospitals in the department dimension, the number of tertiary hospital hospitalizations in the visit number dimension, and the number of times that the hospitalization cost in the hospital dimension is smaller than the average hospitalization cost of insured persons across the city, to obtain the third valid feature data set;

in the case that the target business scenario includes frequent visitors to low-grade hospitals, taking as valid feature data the number of first-grade hospital hospitalizations in the visit number dimension, the number of patient diseases in the personnel number dimension, the patient inpatient-to-outpatient ratio in the insured unit dimension, and, among the multi-dimensional features, the total patient hospitalization cost, the average number of patient hospitalizations, the number of first-grade hospital hospitalizations, the use of auxiliary medication and the first-grade hospital distance, to obtain the third valid feature data set;

in the case that the target business scenario includes group formation drug dispensing, taking as valid feature data whether suspected group formation drug dispensing violations exist in the multi-dimensional feature, the number of group formation drug dispensing occurrences within the last year in the hospital dimension, the average time interval between violations in the physician dimension, and the data on the medical institutions visited in the multi-dimensional feature, to obtain the third valid feature data set;

and in the case that the target business scenario includes relatively slow qualification monitoring, taking as valid feature data the first-grade hospital hospitalization distance in the multi-dimensional feature, the necessary medication cost in the visit number dimension, the number of times necessary medication is dispensed in the visit number dimension, the medication dispensing time interval in the visit number dimension, the total medication cost in the visit number dimension, and the medication cost of other medical types in the multi-dimensional feature, to obtain the third valid feature data set.

Optionally, after the deduplication processing is performed on the intersection to obtain the target valid feature data set, the method further includes:

and performing derivative feature processing on the target effective feature set to obtain a derivative effective feature data set, and merging the derivative effective feature data set into the target effective feature data set to obtain an updated target effective feature data set.

In a second aspect, the present application provides a feature validity evaluation apparatus, including:

the first obtaining module 1 is configured to obtain importance of each feature data in an initial feature data set by using a tree structure model, and obtain a first valid feature data set by using feature data of which the importance is greater than an importance threshold as valid feature data;

a second obtaining module 2, configured to input the initial feature data set into a deep learning neural network model to obtain a first result, obtain a first subset of initial feature data and a second subset of initial feature data based on the initial feature data set, input the first subset of initial feature data into the deep learning neural network model to obtain a second result, and obtain a second valid feature data set by using the second subset of initial feature data as valid feature data when an error between the first result and the second result is greater than an error threshold, where the initial feature data set includes the first subset of initial feature data and the second subset of initial feature data;

a third obtaining module 3, configured to obtain the initial feature data set in business dimensions and, based on a target business scenario, take the feature data in the target business dimensions related to the target business scenario as valid feature data to obtain a third valid feature data set, where the business dimensions include one or more of a personnel number dimension, a hospital dimension, a visit number dimension, a department dimension, a physician dimension, an insured unit dimension, and a multi-dimensional feature;

and a processing module 4, configured to use a union of the first valid feature data set, the second valid feature data set, and the third valid feature data set as the target valid feature data set.

In a third aspect, the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the feature validity evaluation method according to any one of the above technical solutions.

In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the feature validity evaluation method according to any one of the above technical solutions.

Advantageous effects: the invention provides a feature validity evaluation method, which comprises the following steps: acquiring the importance of each feature data in the initial feature data set by using a tree structure model, and taking the feature data with the importance greater than an importance threshold as valid feature data to obtain a first valid feature data set; inputting the initial feature data set into a deep learning neural network model to obtain a first result, obtaining a first subset of the initial feature data and a second subset of the initial feature data based on the initial feature data set, inputting the first subset into the deep learning neural network model to obtain a second result, and, when the error between the first result and the second result is greater than an error threshold, taking the second subset as valid feature data to obtain a second valid feature data set, wherein the initial feature data set comprises the first subset and the second subset; obtaining the initial feature data set in business dimensions and, based on a target business scenario, taking the feature data in the target business dimensions related to the target business scenario as valid feature data to obtain a third valid feature data set, wherein the business dimensions comprise one or more of a personnel number dimension, a hospital dimension, a visit number dimension, a department dimension, a physician dimension, an insured unit dimension and a multi-dimensional feature; and taking the union of the first valid feature data set, the second valid feature data set and the third valid feature data set as the target valid feature data set.
In this scheme, the feature validity evaluation method is applied to a risk control model of a medical insurance bureau. Specifically, a first valid feature data set is obtained through a tree structure model, a second valid feature data set is obtained through a deep learning neural network model, and a third valid feature data set is obtained based on a target business scenario; the union of the three sets is then used as the target valid feature data set, so that high-quality features can be selected for the risk control model from multiple dimensions, providing intelligent assistance for big-data risk control of medical insurance.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or the background art of the present application, the drawings required to be used in the embodiments or the background art of the present application will be described below.

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic flow chart of a feature validity assessment method according to the present application;

FIG. 2 is a schematic diagram illustrating the principle of a feature validity assessment method of the present application;

FIG. 3 is a schematic structural diagram of a feature validity assessment apparatus according to the present application;

FIG. 4 is a schematic diagram of a hardware structure of a feature validity evaluation apparatus according to the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Based on this, the present application intends to provide a solution to the above technical problem, the details of which will be explained in the following embodiments.

FIG. 1 is a schematic flow chart of a feature validity evaluation method. As shown in FIG. 1, the feature validity evaluation method provided by the embodiment of the present invention is applied to a risk control model of a medical insurance bureau, and is intended to select high-quality features for a big data risk control model, to lay a solid foundation for the model, and to provide intelligent assistance for the big data risk control of the medical insurance bureau.

The method can be applied to a feature validity evaluation device, which can be a terminal device, a server or other processing devices. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like.

As shown in fig. 1, the feature validity evaluation method includes:

S100, acquiring importance of each feature data in an initial feature data set by using a tree structure model, and taking the feature data with the importance greater than an importance threshold value as effective feature data to obtain a first effective feature data set;

In this embodiment, the initial feature data set is a set containing a plurality of initial features. The data type of an initial feature may be numerical, categorical, time-series, statistical, multi-dimensional, or the like. The tree structure model is set in advance; when the feature data in the initial feature data set are input into the tree structure model, the importance of each feature is obtained, and the feature data whose importance is greater than the importance threshold are selected to form the first valid feature data set.

Specifically, the importance threshold may be set according to actual requirements. Its value affects the accuracy of the first valid feature data set; when that accuracy needs to be improved, the preset importance threshold may be raised.
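The thresholding in step S100 can be sketched as below. This is only an illustration under stated assumptions: the importance scores are hypothetical placeholders for whatever a fitted tree structure model reports, and the function name `select_by_importance` and the feature names do not come from the application.

```python
from typing import Dict, Set


def select_by_importance(importances: Dict[str, float],
                         threshold: float) -> Set[str]:
    """Keep the features whose importance is strictly greater than the threshold."""
    return {name for name, score in importances.items() if score > threshold}


# Hypothetical importance scores, e.g. as reported by a fitted tree model.
importances = {
    "annual_max_hospitalization_cost": 0.42,
    "tertiary_hospital_visits": 0.31,
    "record_serial_number": 0.02,
}

first_valid = select_by_importance(importances, threshold=0.10)
stricter = select_by_importance(importances, threshold=0.35)
```

Raising the threshold from 0.10 to 0.35 shrinks the selected set, which mirrors the trade-off described above: a higher threshold keeps fewer features, each with stronger evidence of importance.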

S200, inputting the initial feature data set into a deep learning neural network model to obtain a first result, obtaining a first subset of initial feature data and a second subset of the initial feature data based on the initial feature data set, inputting the first subset of the initial feature data into the deep learning neural network model to obtain a second result, and taking the second subset of the initial feature data as effective feature data to obtain a second effective feature data set under the condition that the error between the first result and the second result is greater than an error threshold value, wherein the initial feature data set comprises the first subset of the initial feature data and the second subset of the initial feature data;

In the present embodiment, the deep learning neural network model is obtained by training on sample feature data; the sample feature data may be numerical, categorical, time-series, statistical, multi-dimensional, or the like. Specifically, step S200 includes:

A. Inputting the initial feature data in the initial feature data set into the deep learning neural network model to obtain a first result;

In one embodiment, when the initial feature data set is information related to a patient, for example personal information, historical medication information, historical visit information, medical insurance information, work situation and educational background information, the deep learning neural network model may be a neural network model for recommending medication to the patient, so that the first result may be a first patient medication recommendation result.

B. Dividing the initial feature data set into a first subset of the initial feature data and a second subset of the initial feature data, where both subsets are part of the initial feature data set.

In one embodiment, the first subset and the second subset together form the full initial feature data set.

In a specific embodiment, when the initial feature data set includes personal information, historical medication information, historical visit information, medical insurance information, work situation and educational background information, and the first subset includes the educational background information and the work situation, the second subset includes the personal information, the historical medication information, the historical visit information and the medical insurance information, so that the first subset and the second subset together form the full initial feature data set.

C. Inputting the first subset of the initial characteristic data into the deep learning neural network model to obtain a second result; and under the condition that the error between the first result and the second result is larger than an error threshold value, taking the second subset of the initial characteristic data as effective characteristic data to obtain a second effective characteristic data set.

The first result is the output of the deep learning neural network model on the full initial feature data set. After the second subset of the initial feature data is removed, only the first subset is input into the model to obtain the second result, and comparing the first result with the second result reflects the validity of the second subset. If removing the second subset has a large influence on the output of the model, that is, the error between the first result and the second result is greater than the error threshold, the removed data strongly affect the output and are therefore valid. If the error between the first result and the second result is smaller than the error threshold, removing the second subset has little influence on the output of the model; the second subset is then determined to be irrelevant data, and the first subset is determined as the second valid feature data set.

In a specific embodiment, the initial feature data set includes personal information, historical medication information, historical visit information, medical insurance information, work situation and educational background information, and the deep learning neural network model is a neural network model for recommending medication to a patient. Inputting the full initial feature data set into the model yields a first patient medication recommendation result of medication A. The first subset includes the educational background information and the work situation, and the second subset includes the personal information, the historical medication information, the historical visit information and the medical insurance information. When only the educational background information and the work situation are input into the model, the second patient medication recommendation result is also medication A. Comparing the two results shows that the error is smaller than the threshold, which indicates that the first subset of the initial feature data, namely the educational background information and the work situation, is valid information.

In another embodiment, the initial feature data set, the model and the two subsets are the same as above, and inputting the full initial feature data set into the model again yields a first patient medication recommendation result of medication A. When only the educational background information and the work situation are input into the model, however, the second patient medication recommendation result is medication B. Comparing the two results shows that the error is larger than the threshold, which indicates that the second subset of the initial feature data, namely the personal information, the historical medication information, the historical visit information and the medical insurance information, is valid information.
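The comparison between the first result and the second result can be sketched as a small ablation check. The sketch rests on several assumptions that are not in the application: the model is a toy weighted sum rather than a deep learning neural network, removed features are masked with a neutral `fill_value` instead of being dropped from the input, the error is the absolute difference of two numeric scores, and all names and numbers are hypothetical.

```python
from typing import Callable, Dict, Set

Model = Callable[[Dict[str, float]], float]


def removed_subset_is_valid(model: Model,
                            sample: Dict[str, float],
                            removed: Set[str],
                            error_threshold: float,
                            fill_value: float = 0.0) -> bool:
    """Return True when removing `removed` changes the output by more than the threshold."""
    first_result = model(sample)  # output on the full feature set
    # Assumption: "removing" a feature is approximated by masking it.
    ablated = {k: (fill_value if k in removed else v) for k, v in sample.items()}
    second_result = model(ablated)  # output on the first subset only
    return abs(first_result - second_result) > error_threshold


# Toy stand-in for a trained model: a fixed weighted sum of the inputs.
WEIGHTS = {"historical_medication": 0.5, "historical_visits": 0.3,
           "education": 0.0, "work_situation": 0.01}


def toy_model(features: Dict[str, float]) -> float:
    return sum(WEIGHTS[name] * value for name, value in features.items())


sample = {"historical_medication": 2.0, "historical_visits": 1.0,
          "education": 3.0, "work_situation": 1.0}

medical_valid = removed_subset_is_valid(
    toy_model, sample, {"historical_medication", "historical_visits"}, 0.1)
background_valid = removed_subset_is_valid(
    toy_model, sample, {"education", "work_situation"}, 0.1)
```

In this toy setting, removing the medication and visit history shifts the output well beyond the threshold, so that subset counts as valid, while removing the education and work fields barely moves the output.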

S300, obtaining the initial feature data set in business dimensions and, based on a target business scenario, taking the feature data in the target business dimensions related to the target business scenario as valid feature data to obtain a third valid feature data set, wherein the business dimensions comprise one or more of a personnel number dimension, a hospital dimension, a visit number dimension, a department dimension, a physician dimension, an insured unit dimension and a multi-dimensional feature;

In this embodiment, a business dimension refers to a category of features related to the target business scenario, such as the personnel number dimension, the hospital dimension, the visit number dimension, the department dimension, the physician dimension and the insured unit dimension. Various features can be generated based on these business dimensions, and staff can manually select some of them according to actual needs as the third valid feature data set. Manual screening is more targeted, can supplement valid features, and avoids omitting valid features.
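One way to picture the scenario-driven selection is a lookup from a business scenario to the business dimensions considered relevant to it, followed by a filter over features tagged with their dimension. The mapping below is purely illustrative: the scenario keys, dimension labels and feature names are hypothetical, and in practice the choice is made by staff rather than by a fixed table.

```python
from typing import Dict, List, Set, Tuple

Feature = Tuple[str, str]  # (business dimension, feature name)

# Hypothetical mapping from a business scenario to its relevant dimensions.
SCENARIO_DIMENSIONS: Dict[str, Set[str]] = {
    "low_grade_hospital_repeat_admission": {"hospital", "department", "visit_number"},
    "group_drug_dispensing": {"hospital", "physician", "multi_dimensional"},
}


def select_for_scenario(features: List[Feature], scenario: str) -> List[Feature]:
    """Keep the features whose dimension is relevant to the given scenario."""
    wanted = SCENARIO_DIMENSIONS.get(scenario, set())
    return [feature for feature in features if feature[0] in wanted]


features = [
    ("hospital", "max_annual_hospitalization_cost"),
    ("visit_number", "tertiary_hospital_admissions"),
    ("physician", "avg_violation_interval"),
    ("insured_unit", "inpatient_outpatient_ratio"),
]

third_valid = select_for_scenario(features, "low_grade_hospital_repeat_admission")
```

An unknown scenario yields an empty selection in this sketch, which makes a missing mapping visible rather than silently selecting everything.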

S400, taking the union of the first valid feature data set, the second valid feature data set and the third valid feature data set as the target valid feature data set.

In this embodiment, first, a first valid feature data set is obtained through the tree structure model; second, a second valid feature data set is obtained through the deep learning neural network model; third, a third valid feature data set is supplemented based on the target business scenario. The union of the first, second and third valid feature data sets is then used as the target valid feature data set, so that high-quality features can be selected for the risk control model from multiple dimensions.
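Step S400 amounts to a deduplicated merge of the three selections. The following sketch assumes each valid feature data set is represented as a list of feature names; the function name `merge_valid_features` and the example names are hypothetical.

```python
from typing import Iterable, List


def merge_valid_features(*feature_sets: Iterable[str]) -> List[str]:
    """Union of several feature selections, keeping first-seen order without duplicates."""
    merged: List[str] = []
    seen = set()
    for feature_set in feature_sets:
        for name in feature_set:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


first_set = ["max_annual_cost", "tertiary_visits"]          # from the tree structure model
second_set = ["tertiary_visits", "historical_medication"]   # from the neural network check
third_set = ["avg_violation_interval", "max_annual_cost"]   # from the business scenario

target_valid = merge_valid_features(first_set, second_set, third_set)
```

A plain `set` union would give the same members; the ordered list is used here only so that the result is reproducible from run to run.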

As an optional implementation, the tree structure model includes a random forest model, a composite tree model xgboost, and a decision tree model lightgbm, and the step S100 includes:

s102, acquiring first importance of each feature data in an initial feature data set by using a random forest model, and taking the feature data of which the first importance is greater than an importance threshold value as effective feature data to obtain a fourth effective feature data set;

Specifically, a random forest is a classifier that uses a plurality of trees for training and prediction. After the random forest model is obtained by training, the initial feature data is input to obtain a first importance for each item of initial feature data, and an item whose first importance is greater than the preset importance threshold is taken as fourth valid feature data. After every item in the initial feature data set has been input into the random forest model in turn, the fourth valid feature data set is obtained.

S104, acquiring a second importance of each feature data in the initial feature data set by using a composite tree model xgboost, and taking the feature data with the second importance larger than an importance threshold as effective feature data to obtain a fifth effective feature data set;

In this embodiment, after the composite tree model is obtained through training, the initial feature data is input to the composite tree model to obtain a second importance for each item of initial feature data, and an item whose second importance is greater than the preset importance threshold is taken as fifth valid feature data. After every item in the initial feature data set has been input into the composite tree model in turn, the fifth valid feature data set is obtained.

S106, obtaining a third importance of each feature data in the initial feature data set by using a decision tree model lightgbm, and obtaining a sixth effective feature data set by using the feature data of which the third importance is greater than an importance threshold as effective feature data;

In this embodiment, after the decision tree model lightgbm is obtained through training, the initial feature data is input to the decision tree model to obtain a third importance for each item of initial feature data, and an item whose third importance is greater than the preset importance threshold is taken as sixth valid feature data. After every item in the initial feature data set has been input into the decision tree model in turn, the sixth valid feature data set is obtained.

And S108, determining the intersection of the fourth valid feature data set, the fifth valid feature data set and the sixth valid feature data set as the first valid feature data set, or taking the union of the fourth valid feature data set, the fifth valid feature data set and the sixth valid feature data set as the first valid feature data set.

In this embodiment, three tree structure models are selected: random forest, xgboost and lightgbm. Because the algorithms of the models differ, the same initial feature data may have a different importance in each of the three models. Therefore, after the fourth, fifth and sixth valid feature data sets are obtained, using their union as the first valid feature data set increases the comprehensiveness of the first valid feature data set and avoids omitting valid features, whereas using their intersection as the first valid feature data set increases its accuracy and makes it easier to select valid features of higher quality and stronger relevance.
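Steps S102 to S108 can be sketched as follows, assuming that each trained model has already produced a mapping from feature name to importance score. The scores, feature names and threshold below are invented for illustration, and the helper names `select_by_importance` and `combine` are hypothetical; how the importances are computed by each library is outside this sketch.

```python
def select_by_importance(importances, threshold):
    """Keep the features whose importance is greater than the threshold."""
    return {name for name, score in importances.items() if score > threshold}


def combine(selected_sets, mode="intersection"):
    """Combine per-model selections by intersection or by union (step S108)."""
    sets = [set(s) for s in selected_sets]
    if not sets:
        return set()
    if mode == "intersection":
        return set.intersection(*sets)
    if mode == "union":
        return set.union(*sets)
    raise ValueError("mode must be 'intersection' or 'union'")


# Invented importance scores standing in for the output of the three models.
random_forest_scores = {"visit_count": 0.5, "disease_count": 0.1, "total_cost": 0.3}
xgboost_scores = {"visit_count": 0.4, "disease_count": 0.3, "total_cost": 0.05}
lightgbm_scores = {"visit_count": 0.6, "disease_count": 0.25, "total_cost": 0.1}

threshold = 0.2  # assumed importance threshold
fourth = select_by_importance(random_forest_scores, threshold)
fifth = select_by_importance(xgboost_scores, threshold)
sixth = select_by_importance(lightgbm_scores, threshold)

print(sorted(combine([fourth, fifth, sixth], "intersection")))
print(sorted(combine([fourth, fifth, sixth], "union")))
```

The intersection keeps only the features that all three models rate above the threshold, while the union keeps every feature that at least one model rates above it.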

As an alternative embodiment, the deep learning neural network model includes three hidden layers, and the step S200 includes:

s202, inputting the initial characteristic data set into a deep learning neural network model to obtain a first result;

In one embodiment, when the initial feature data set is information related to a patient, for example personal information, historical medication information, historical visit information, medical insurance information, work conditions and educational background information, the deep learning neural network model may be a neural network model for recommending medication for the patient, so that the first result may be a first medication recommendation result for the patient.

S204, selecting three quarters of the initial characteristic data as a first subset of the initial characteristic data, and selecting one quarter of the initial characteristic data as a second subset of the initial characteristic data;

in one embodiment, when the initial feature data set includes personal information, historical medication information, historical visit information, and medical insurance information, the first subset of initial feature data is the personal information, historical medication information, and historical visit information, and the second subset of initial feature data is the medical insurance information.

S206, inputting the first subset of the initial characteristic data into the deep learning neural network model to obtain a second result;

In one embodiment, the first subset of the initial feature data, namely the personal information, historical medication information and historical visit information, is input into the deep learning neural network model to obtain a second result, which may be a second medication recommendation result for the patient.

S208, when the error between the first result and the second result is larger than the error threshold, re-selecting the feature data with the proportion smaller than the last selection proportion from the initial feature data as the first subset of the initial feature data, and selecting the feature data with the proportion larger than or equal to the last selection proportion from the initial feature data as the second subset of the initial feature data;

When the error between the first result and the second result is greater than the error threshold, the second subset of the initial feature data affects the result, that is, the second subset of the initial feature data contains valid features; at the same time, some valid features may also exist in the first subset of the initial feature data. The allocation of the initial feature data between the first subset and the second subset is therefore reselected. Specifically, the selection proportion of the newly allocated first subset is smaller than the proportion of the previously selected first subset, and the selection proportion of the newly allocated second subset is greater than or equal to the proportion of the previously selected second subset. Optionally, one half of the initial feature data is selected as the first subset of the initial feature data, and the other half is selected as the second subset of the initial feature data.

In one embodiment, when the initial feature data set includes personal information, historical medication information, historical visit information and medical insurance information, the first subset of the initial feature data is the personal information, historical medication information and historical visit information, and the second subset of the initial feature data is the medical insurance information. Inputting the personal information, historical medication information, historical visit information and medical insurance information into the deep learning neural network model yields a first result C, and inputting the personal information, historical medication information and historical visit information yields a second result D. When the error between the first result C and the second result D is greater than the error threshold, the medical insurance information affects the output result. At this point, the personal information in the original first subset can be moved into the second subset, so that the newly allocated first subset is the historical medication information and historical visit information, and the newly allocated second subset is the personal information and medical insurance information.

S210, inputting the initial characteristic data first subset into the deep learning neural network model again to obtain a new second result;

in one embodiment, when the first newly allocated subset of initial features is the historical medication information and the historical visit information, and the second newly allocated subset of initial features is the personal information and the medical insurance information, the historical medication information and the historical visit information are input into the deep learning neural network model to obtain a new second result.

S212, under the condition that an error between the first result and the second result is greater than an error threshold, repeatedly selecting, as the first subset of initial feature data, feature data less than a last selection ratio from the initial feature data, and selecting, as the second subset of initial feature data, feature data greater than or equal to the last selection ratio from the initial feature data, until, under the condition that an error between the first result and the second result is less than or equal to the error threshold, obtaining a second valid feature data set by using the second subset of initial feature data selected last as valid feature data.

The error between the first result and the new second result is then compared with the error threshold. When this error is greater than the error threshold, valid features still exist in the newly selected first subset of the initial feature data, so steps S208 and S210 are executed again. This continues until the error between the first result and the second result is less than or equal to the error threshold, which indicates that no valid features remain in the last selected first subset of the initial feature data, that is, all valid features are located in the second subset of the initial feature data. At this point, the last selected second subset of the initial feature data is taken as valid feature data to obtain the second valid feature data set.

In one embodiment, when the first newly allocated subset of initial features is the historical medication information and the historical visit information, and the second newly allocated subset of initial features is the personal information and the medical insurance information, the historical medication information and the historical visit information are input into the deep learning neural network model, and a new second result E is obtained.

The first result C is compared with the second result E. If the error between the first result C and the second result E is greater than the error threshold, valid feature data still exists in the historical medication information and historical visit information. The subsets are then reallocated so that the new first subset is the historical medication information and the new second subset is the historical visit information, personal information and medical insurance information. The historical medication information is input into the deep learning neural network model to obtain a new second result F. If the error between the second result F and the first result C is now smaller than the error threshold, the historical medication information is invalid feature data, and the historical visit information, personal information and medical insurance information are taken as the second valid feature data set.
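The control flow of steps S202 to S212 can be sketched as below. This is a minimal sketch under stated assumptions, not the implementation of the application: the model is represented by a hypothetical callable that takes the set of feature names it is given and returns a single number, the first subset is shrunk by one feature per round from the end of an assumed ordering, and the stub scores are invented solely to replay the results C, D, E and F of the example above.

```python
def second_valid_feature_set(model, features, error_threshold, initial_ratio=0.75):
    """Shrink the first subset until its result matches the full-set result.

    `features` is an ordered list of feature names; features are moved into
    the second subset starting from the end of that list (an assumption).
    """
    first_result = model(frozenset(features))          # S202: all features
    n_first = int(len(features) * initial_ratio)        # S204: three quarters
    while n_first > 0:
        first_subset = frozenset(features[:n_first])
        second_result = model(first_subset)             # S206 / S210
        if abs(first_result - second_result) <= error_threshold:
            break                                       # S212: stop condition
        n_first -= 1                                    # S208: smaller first subset
    return list(features[n_first:])                     # last second subset


features = ["historical_medication", "historical_visit", "personal", "medical_insurance"]

# Stub model: invented scores keyed by the set of features supplied.
stub_scores = {
    frozenset(features): 0.90,       # first result C
    frozenset(features[:3]): 0.60,   # second result D
    frozenset(features[:2]): 0.70,   # second result E
    frozenset(features[:1]): 0.88,   # second result F
}

selected = second_valid_feature_set(lambda subset: stub_scores[subset], features, 0.05)
print(selected)
```

With these stub scores the loop stops on the third round, when only the historical medication information remains in the first subset, and the remaining three features are returned as the second valid feature data set, as in the example. If no round satisfies the threshold, this sketch returns all features; the application does not state what should happen in that case.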

Optionally, the step S200 further includes:

s214, under the condition that the error between the first result and the second result is smaller than or equal to an error threshold value, removing the second subset of the initial characteristic data from the initial characteristic data subset to obtain a new data set so as to update the initial characteristic data subset.

In this embodiment, when the error between the first result and the second result is less than or equal to the error threshold, it indicates that the second subset of the initial feature data does not affect the determination result of the deep learning neural network model, that is, the second subset of the initial feature data is an insignificant feature, and at this time, the second subset of the initial feature data is removed from the initial feature data subset to obtain a new initial feature data subset.

As an alternative implementation, the step S400 includes:

s402, acquiring an intersection of the first valid feature data set, the second valid feature data set and the third valid feature data set;

In this embodiment, in order to further screen out a more valuable valid feature data set, after the first valid feature data set, the second valid feature data set and the third valid feature data set are obtained, intersection processing is performed on the three sets.

S404, carrying out duplication elimination processing on the intersection to obtain the target effective characteristic data set.

After the intersection of the first valid feature data set, the second valid feature data set and the third valid feature data set is obtained, the intersection is deduplicated, because the same feature data may exist in all three sets, so that duplicate data is avoided. The target valid feature data obtained through deduplication has been judged valid in all three ways and therefore better reflects the accuracy of the valid feature data.
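Steps S402 and S404 can be sketched as follows, assuming each valid feature data set arrives as a list of feature names that may contain repeats. The function name and the sample names are hypothetical; the order of the first list is preserved purely as a convenience of this sketch.

```python
def intersect_and_dedupe(first, second, third):
    """Return the features present in all three lists, each listed once."""
    common = set(first) & set(second) & set(third)  # S402: intersection
    seen = set()
    result = []
    for name in first:                              # S404: drop repeats
        if name in common and name not in seen:
            seen.add(name)
            result.append(name)
    return result


print(intersect_and_dedupe(
    ["visit_count", "total_cost", "visit_count", "disease_count"],
    ["visit_count", "disease_count", "hospital_distance"],
    ["disease_count", "visit_count"],
))
```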

As an optional implementation manner, the step S300 includes:

S302, when the target business scenario includes back-and-forth hospitalization in low-grade hospitals, obtaining a third valid feature data set by taking as valid feature data the feature data on the maximum annual hospitalization cost in the hospital dimension, the number of non-tertiary hospital hospitalization institutions in the department dimension, the non-tertiary hospitals in the department dimension, the number of tertiary hospital hospitalizations in the visit number dimension, and the number of times the hospitalization cost in the hospital dimension is less than the average hospitalization cost of city-wide insured persons;

In this embodiment, the valid feature data is chosen in light of the actual application scenario, and valid features are selected by manual screening. Specifically, in the scenario of back-and-forth hospitalization in low-grade hospitals, the maximum annual hospitalization cost, the number of non-tertiary hospital hospitalization institutions, the number of tertiary hospital hospitalizations, and the number of times the hospitalization cost is less than the average hospitalization cost of city-wide insured persons are added to the third valid feature data set as valid feature data.

S304, under the condition that the target business scene comprises the low-grade hospital frequent visitors, obtaining a third effective characteristic data set based on the characteristic data of the first-grade hospital inpatient times in the visit number dimension, the patient disease number in the staff number dimension, the patient inpatient outpatient ratio in the participation unit dimension, the total patient inpatient cost of the multi-dimensional characteristic, the average number of the patient inpatient times in the multi-dimensional characteristic, the first-grade hospital number in the visit of the multi-dimensional characteristic, the auxiliary medication usage in the multi-dimensional characteristic and the first-grade hospital distance in the multi-dimensional characteristic as effective characteristic data;

Specifically, in the scenario of low-grade hospital frequent visitors, the following feature data can be added to the third valid feature data set as valid feature data by manual screening: the number of first-level hospital hospitalizations, a large number of patient diseases in the personnel number dimension, a high patient inpatient-to-outpatient ratio in the insured unit dimension, a high total patient hospitalization cost in the multi-dimensional feature, a small average number of patient hospitalizations in the multi-dimensional feature, a large number of first-level hospitals visited in the multi-dimensional feature, heavy usage of auxiliary medication in the multi-dimensional feature, and the hospitalization first-level hospital distance in the multi-dimensional feature.

S306, when the target business scenario includes group formation for drug dispensing, obtaining a third valid feature data set by taking as valid feature data the feature data on whether suspected illegal group formation for drug dispensing exists in the multi-dimensional feature, the number of times of group formation for drug dispensing in the hospital dimension in the last year, the average time interval between violations in the physician dimension, and the data of medical institutions visited for treatment in the multi-dimensional feature;

under the condition of group formation for drug dispensing, whether illegal behaviors suspected of group formation for drug dispensing exist in the multi-dimensional features, the times of group formation for drug dispensing in the hospital dimension in the last year, the average time interval of violation of each time in the physician dimension, and feature data on data of medical institutions for medical treatment in the multi-dimensional features can be added into the third effective feature data set as effective feature data sets in a manual screening mode.

S308, when the target business scenario includes relatively slow qualification monitoring, obtaining a third valid feature data set by taking as valid feature data the feature data on the hospitalization first-level hospital distance in the multi-dimensional feature, the necessary medication cost in the visit number dimension, the number of times necessary medication is dispensed in the visit number dimension, the medication dispensing time interval in the visit number dimension, the total medication cost in the visit number dimension, the total number of times medication is dispensed in the visit number dimension, and the medicine cost of other medical types in the multi-dimensional feature.

When the target business scenario includes relatively slow qualification monitoring, the feature data on the hospitalization first-level hospital distance, the necessary medication cost in the visit number dimension, the number of times necessary medication is dispensed in the visit number dimension, the medication dispensing time interval in the visit number dimension, the total medication cost in the visit number dimension, the total number of times medication is dispensed in the visit number dimension, and the medicine cost of other medical types in the multi-dimensional feature are added to the third valid feature data set as valid feature data.

In the implementation mode, under various different target service scenes, effective characteristics are supplemented through experienced service personnel and service knowledge, and omission of finally obtained target effective characteristic data is avoided.
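One way to record such manually curated selections is a lookup table from scenario to feature names, sketched below. The scenario keys and feature identifiers are hypothetical labels invented for this sketch, loosely following the scenarios described in steps S302 and S306; they are not identifiers from the application.

```python
# Hypothetical scenario keys and feature identifiers, chosen by business staff.
SCENARIO_FEATURES = {
    "back_and_forth_hospitalization": [
        "max_annual_inpatient_cost",
        "non_tertiary_inpatient_institution_count",
        "tertiary_inpatient_count",
        "below_city_average_inpatient_cost_count",
    ],
    "group_dispensing": [
        "suspected_group_dispensing_flag",
        "group_dispensing_count_last_year",
        "average_violation_interval",
    ],
}


def third_valid_feature_set(scenarios, initial_features):
    """Collect the curated features of each scenario that exist in the initial set."""
    selected = []
    for scenario in scenarios:
        for name in SCENARIO_FEATURES.get(scenario, []):
            if name in initial_features and name not in selected:
                selected.append(name)
    return selected


initial = {"max_annual_inpatient_cost", "tertiary_inpatient_count", "age"}
print(third_valid_feature_set(["back_and_forth_hospitalization"], initial))
```

Keeping the table separate from the selection function lets business staff adjust the curated lists without touching the selection logic.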

As an optional implementation manner, after performing deduplication processing on the intersection to obtain the target valid feature data set, the method further includes:

s500, carrying out derivative feature processing on the target effective feature set to obtain a derivative effective feature data set, and merging the derivative effective feature data set into the target effective feature data set to obtain an updated target effective feature data set.

In this embodiment, after the target valid feature data set is obtained, the valid features in it may be further processed to derive new features, yielding a derived valid feature data set that is also merged into the target valid feature data set. In addition, when derived feature data is derived from valid features, the valid features it was derived from can be deleted to avoid duplicated data. For example, BMI = weight ÷ height²; when weight and height are both valid features, BMI is also a valid feature, and since it is likely the BMI, rather than height and weight individually, that affects the result, height and weight can be deleted.
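The BMI example of step S500 can be sketched as follows. The field names `weight_kg` and `height_m`, the units, and the `drop_sources` flag are assumptions made for this sketch; the application only states that BMI is weight divided by the square of height and that the source features can be deleted.

```python
def add_bmi(record, drop_sources=True):
    """Derive BMI from weight and height and optionally drop the two sources."""
    derived = dict(record)  # leave the caller's record unchanged
    derived["bmi"] = record["weight_kg"] / record["height_m"] ** 2
    if drop_sources:
        del derived["weight_kg"]
        del derived["height_m"]
    return derived


patient = {"weight_kg": 70.0, "height_m": 2.0, "age": 40}
print(add_bmi(patient))
```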

Based on the same inventive concept, as shown in fig. 3, an embodiment of the present application provides a feature validity evaluation apparatus, including:

As an optional implementation manner, the tree structure model includes a random forest model, a composite tree model xgboost, and a decision tree model lightgbm, and the first obtaining module 1 includes:

the first sub-acquisition module is used for acquiring the first importance of each feature data in the initial feature data set by using a random forest model, and acquiring a fourth effective feature data set by using the feature data of which the first importance is greater than an importance threshold as effective feature data;

a second sub-obtaining module, configured to obtain a second importance of each feature data in the initial feature data set by using a composite tree model xgboost, and obtain a fifth valid feature data set by using feature data of which the second importance is greater than an importance threshold as valid feature data;

a third sub-obtaining module, configured to obtain a third importance of each feature data in the initial feature data set by using a decision tree model lightgbm, and obtain a sixth valid feature data set by using feature data with the third importance greater than an importance threshold as valid feature data;

a first obtaining module, configured to determine an intersection of the fourth valid feature data set, the fifth valid feature data set, and the sixth valid feature data set as the first valid feature data set, or use a union of the fourth valid feature data set, the fifth valid feature data set, and the sixth valid feature data set as the first valid feature data set.

As an optional implementation, the deep learning neural network model includes three hidden layers, and the second obtaining module 2 includes:

the first input module is used for inputting the initial characteristic data set into a deep learning neural network model to obtain a first result;

a first selecting module, configured to select three-quarters of the initial feature data as a first subset of the initial feature data, and select one-quarter of the initial feature data as a second subset of the initial feature data;

the second input module is used for inputting the first subset of the initial characteristic data into the deep learning neural network model to obtain a second result;

a second selecting module, configured to, when an error between the first result and the second result is greater than an error threshold, reselect from the initial feature data a proportion of feature data smaller than the last selected proportion as the first subset of the initial feature data, and select from the initial feature data a proportion of feature data greater than or equal to the last selected proportion as the second subset of the initial feature data;

a third input module, configured to input the first subset of the initial feature data into the deep learning neural network model again to obtain a new second result;

and a second obtaining module, configured to, when an error between the first result and the second result is greater than an error threshold, repeatedly select, as the first subset of initial feature data, feature data less than a last selection ratio from the initial feature data, and select, as the second subset of initial feature data, feature data greater than or equal to the last selection ratio from the initial feature data, until, when the error between the first result and the second result is less than or equal to the error threshold, obtaining a second valid feature data set by using, as valid feature data, the previously selected second subset of initial feature data.

In an optional implementation manner, the second obtaining module 2 further includes:

and the updating module is used for removing the second subset of the initial characteristic data from the initial characteristic data subset to obtain a new data set so as to update the initial characteristic data subset under the condition that the error between the first result and the second result is less than or equal to an error threshold value.

As an alternative implementation, the processing module 4 includes:

a fourth obtaining module, configured to obtain an intersection of the first valid feature data set, the second valid feature data set, and the third valid feature data set;

and the third obtaining module is used for carrying out duplication removal processing on the intersection to obtain the target effective characteristic data set.

As an optional implementation, the third obtaining module 3 includes:

a fourth obtaining module, configured to, when the target business scenario includes back-and-forth hospitalization in low-grade hospitals, obtain a third valid feature data set by taking as valid feature data the feature data on the maximum annual hospitalization cost in the hospital dimension, the number of non-tertiary hospital hospitalization institutions in the department dimension, the non-tertiary hospitals in the department dimension, the number of tertiary hospital hospitalizations in the visit number dimension, and the number of times the hospitalization cost in the hospital dimension is less than the average hospitalization cost of city-wide insured persons;

a fifth obtaining module, configured to, when the target business scenario includes the low-level hospital frequent visitors, obtain a third effective feature data set based on feature data of a number of hospital hospitalizations in the visit number dimension, a number of patient diseases in the staff number dimension, a patient hospitalization outpatient ratio in the participation unit dimension, a total patient hospitalization cost of the multidimensional feature, a number of average patient hospitalization times in the multidimensional feature, a number of first-level hospitals visited in the multidimensional feature, usage of auxiliary medications in the multidimensional feature, and a first-level hospital hospitalization distance in the multidimensional feature as effective feature data;

a sixth obtaining module, configured to, when the target business scenario includes relatively slow qualification monitoring, obtain a third valid feature data set by taking as valid feature data the feature data on the hospitalization first-level hospital distance in the multi-dimensional feature, the necessary medication cost in the visit number dimension, the number of times necessary medication is dispensed in the visit number dimension, the medication dispensing time interval in the visit number dimension, the total number of times medication is dispensed in the visit number dimension, and the medicine cost of other medical types in the multi-dimensional feature.

As an optional implementation, the apparatus further comprises:

and the derivation module is used for carrying out derivation characteristic processing on the target effective characteristic set to obtain a derivation effective characteristic data set, and merging the derivation effective characteristic data set into the target effective characteristic data set to obtain an updated target effective characteristic data set.

Specifically, when the feature validity evaluation apparatus is executed, the steps of the feature validity evaluation method of the present invention are implemented, and the specific steps and corresponding beneficial effects of the feature validity evaluation method have been described in detail above, and are not repeated herein.

Based on the same inventive concept, the embodiment of the present invention provides a computer apparatus, as shown in fig. 4, including a memory 22, a processor 21, an input device 23, and an output device 24. The processor 21, the memory 22, the input device 23 and the output device 24 are coupled by a connector, which includes various interfaces, transmission lines or buses, etc., and the embodiment of the present application is not limited thereto. It should be appreciated that in various embodiments of the present application, coupled refers to being interconnected in a particular manner, including being directly connected or indirectly connected through other devices, such as through various interfaces, transmission lines, buses, and the like.

The computer device may be a notebook computer, a tablet computer, a desktop computer, a mobile phone or a workstation.

The processor 21 may be one or more Graphics Processing Units (GPUs), and in the case that the processor 21 is one GPU, the GPU may be a single-core GPU or a multi-core GPU. Alternatively, the processor 21 may be a processor group composed of a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. Alternatively, the processor may be other types of processors, and the like, and the embodiments of the present application are not limited.

The memory 22 may be used to store computer program instructions, including the program code for executing the solutions of the present application. Optionally, the memory includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used for the associated instructions and data.

The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.

It is understood that, in the embodiments of the present application, the memory 22 may be used to store not only the relevant instructions but also relevant data. For example, the memory 22 may store the target feature data acquired through the input device 23, or the comparison result obtained by the processor 21. The embodiments of the present application do not limit the data specifically stored in the memory.

It will be appreciated that fig. 4 shows only a simplified design of the computer device. In practical applications, the computer device may further include other necessary components, including but not limited to any number of input/output devices, processors, memories, etc., and all computer devices that can implement the embodiments of the present application are within the scope of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for evaluating feature validity according to any of the above technical solutions is implemented.

Computer-readable storage media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. The computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM).

In summary, the present invention provides a feature validity evaluation method which, as shown in fig. 2, includes the following steps. First, the importance of each feature data in the initial feature data set is acquired by using a tree structure model, and the feature data whose importance is greater than an importance threshold is taken as valid feature data to obtain a first valid feature data set. Second, the initial feature data set is input into a deep learning neural network model to obtain a first result; a first subset and a second subset of the initial feature data are obtained based on the initial feature data set, which comprises both subsets; the first subset is input into the deep learning neural network model to obtain a second result; and, in the case that the error between the first result and the second result is greater than an error threshold, the second subset is taken as valid feature data to obtain a second valid feature data set. Third, the initial feature data set is acquired on business dimensions, and, based on a target business scenario, the feature data on a target business dimension related to the target business scenario is taken as valid feature data to obtain a third valid feature data set, wherein the business dimensions comprise one or more of a personnel number dimension, a hospital dimension, a visit number dimension, a department dimension, a doctor dimension, an insured unit dimension and a multi-dimensional feature. Finally, the union of the first valid feature data set, the second valid feature data set and the third valid feature data set is taken as the target valid feature data set.
In this scheme, the feature validity evaluation method is applied to a risk control model of a medical insurance bureau. Specifically, a first valid feature data set is obtained through a tree structure model, a second valid feature data set is obtained through a deep learning neural network model, and a third valid feature data set is obtained through a target business scenario, and the union of the three sets is then used as the target valid feature data set. High-quality features can thus be selected for the risk control model from multiple dimensions, providing intelligent risk control assistance for medical insurance big data.
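
The three selection steps and the final union summarized above can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the function names, the toy feature names, the toy model, the thresholds, and the choice to "remove" the second subset by zeroing its values are all assumptions made for illustration, since the text does not specify how the first subset is fed to the network or how the error is computed (a mean absolute difference is assumed here).

```python
from typing import Callable, Dict, List, Sequence, Set

Row = Dict[str, float]


def select_by_importance(importance: Dict[str, float], threshold: float) -> Set[str]:
    """First set: features whose tree-model importance exceeds the threshold."""
    return {name for name, score in importance.items() if score > threshold}


def select_by_ablation(
    predict: Callable[[List[Row]], List[float]],
    rows: List[Row],
    second_subset: Sequence[str],
    error_threshold: float,
) -> Set[str]:
    """Second set: keep `second_subset` if withholding it changes the output enough."""
    first_result = predict(rows)
    # Assumption: the "first subset" input is built by zeroing the withheld features.
    ablated = [
        {k: (0.0 if k in second_subset else v) for k, v in row.items()} for row in rows
    ]
    second_result = predict(ablated)
    # Assumption: error is the mean absolute difference between the two results.
    error = sum(abs(a - b) for a, b in zip(first_result, second_result)) / len(rows)
    return set(second_subset) if error > error_threshold else set()


def select_by_business(feature_dimension: Dict[str, str], target_dims: Set[str]) -> Set[str]:
    """Third set: features whose business dimension is relevant to the scenario."""
    return {name for name, dim in feature_dimension.items() if dim in target_dims}


def toy_model(rows: List[Row]) -> List[float]:
    """Hypothetical stand-in for the trained neural network."""
    return [0.5 * r["visit_count"] + 2.0 * r["doctor_id"] for r in rows]


rows = [
    {"visit_count": 2.0, "total_cost": 100.0, "doctor_id": 1.0},
    {"visit_count": 4.0, "total_cost": 50.0, "doctor_id": 3.0},
]
importance = {"visit_count": 0.40, "total_cost": 0.35, "doctor_id": 0.05}
dimensions = {
    "visit_count": "visit",
    "total_cost": "multi",
    "doctor_id": "doctor",
    "dept_load": "department",
}

first_set = select_by_importance(importance, threshold=0.1)
second_set = select_by_ablation(toy_model, rows, ["doctor_id"], error_threshold=1.0)
third_set = select_by_business(dimensions, {"department"})

# The target valid feature set is the union; a Python set is already de-duplicated.
target_set = first_set | second_set | third_set
print(sorted(target_set))
```

In this toy run, `doctor_id` is rejected by the importance threshold but recovered by the ablation step, and `dept_load` enters only through the business dimension, which is the situation the union of the three sets is meant to cover.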

The technical principle of the present invention is described above in connection with specific embodiments. The description is made for the purpose of illustrating the principles of the invention and should not be construed in any way as limiting the scope of the invention. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive effort, which would fall within the scope of the present invention.

Claims

1. A feature validity assessment method, the method comprising:

2. The method according to claim 1, wherein the tree structure model includes a random forest model, an ensemble tree model XGBoost and a decision tree model LightGBM, and the obtaining, by using the tree structure model, importance of each feature data in the initial feature data set, and obtaining a first valid feature data set by using the feature data with the importance greater than an importance threshold as valid feature data includes:

3. The method according to claim 1, wherein the deep learning neural network model comprises three hidden layers, and the inputting the initial feature data set into the deep learning neural network model to obtain a first result, obtaining a first subset of the initial feature data and a second subset of the initial feature data based on the initial feature data set, inputting the first subset of the initial feature data into the deep learning neural network model to obtain a second result, and taking the second subset of the initial feature data as valid feature data to obtain a second valid feature data set in the case that an error between the first result and the second result is greater than an error threshold comprises:

4. The method of claim 3, wherein after re-inputting the first subset of initial feature data into the deep-learning neural network model to obtain a new first result, the method further comprises:

5. The method according to any one of claims 1 to 4, wherein the taking a union of the first valid feature data set, the second valid feature data set, and the third valid feature data set as the target valid feature data set comprises:

6. The method according to claim 5, wherein the target business scenario includes repeated hospitalization in low-grade hospitals, frequent visitors to low-grade hospitals, group drug prescribing, and outpatient chronic disease qualification monitoring, and the obtaining, based on the target business scenario, feature data in a target business dimension related to the target business scenario in the business dimension as valid feature data to obtain a third valid feature data set includes:

in the case that the target business scenario includes the low-grade hospital frequent visitors, obtaining a third effective feature data set based on the number of hospital stays of the first-grade hospital in the visit number dimension, the number of patient diseases in the staff number dimension, the patient outpatient clinic proportion in the participation unit dimension, the total patient hospitalization cost of the multi-dimensional feature, the number of average patient stays of the multi-dimensional feature, the number of hospital stays of the first-grade hospital in the multi-dimensional feature, the usage of auxiliary drugs in the multi-dimensional feature, and feature data of the multi-dimensional feature, which is far away from the hospital stay of the first-grade hospital, as effective feature data;

7. The method of claim 6, wherein after performing the de-duplication on the intersection to obtain the target valid feature data set, the method further comprises:

8. A feature validity evaluation device characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the feature validity assessment method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the feature validity evaluation method according to any one of claims 1 to 7.