CN116226629B

CN116226629B - Multi-model feature selection method and system based on feature contribution

Info

Publication number: CN116226629B
Application number: CN202211357878.1A
Authority: CN
Inventors: 陈超; 宋彪; 张瑞环
Original assignee: Inner Mongolia Weishu Data Technology Co ltd
Current assignee: Inner Mongolia Weishu Data Technology Co ltd
Priority date: 2022-11-01
Filing date: 2022-11-01
Publication date: 2024-03-22
Anticipated expiration: 2042-11-01
Also published as: CN116226629A

Abstract

The invention discloses a multi-model feature selection method and system based on feature contribution. The method comprises the following steps: S1, extracting blood routine test data and biochemical test data, and obtaining the respective sample feature sets; S2, obtaining a test set and a training set by k-fold cross-validation; S3, training a plurality of machine learning classifier models on the sample sets, performing embedded feature selection, obtaining the average accuracy of each model, outputting the feature importance, and assigning each feature a weight value; S4, ranking the models by average accuracy and assigning each a corresponding weight, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature across the different models, and ranking the features to select an optimal feature subset; S5, training on the optimal feature subset with a model having high average accuracy, comparing the result with that of S4, and determining the effect of the selected optimal feature subset. Training complexity is greatly reduced while prediction accuracy is preserved, and the optimal feature subset is found more efficiently and quickly in a high-dimensional data set.

Description

Multi-model feature selection method and system based on feature contribution

Technical Field

The invention relates to the technical fields of inspection medicine and disease screening, in particular to a multi-model feature selection method and system based on feature contribution.

Background

With the development of artificial intelligence technology, machine learning is gradually being applied in the medical field, mainly to build models for disease screening and auxiliary diagnosis. Feature engineering must be carried out before such models are built and is an indispensable part of machine learning: it screens relatively good data features out of the original data by a series of methods in order to improve the training effect of the model. Feature selection is an important link in feature engineering. Its aim is to find an optimal feature subset by removing features with a lower contribution degree, so that the number of features is reduced, the accuracy of the model is improved, and the running time is shortened.

In 2022, Wiharto et al. used a genetic algorithm and a support vector machine for feature selection, combined with a deep neural network, to diagnose coronary heart disease, and the accuracy reached 87.7%. In 2021, Dharmalingam et al. used a hybrid multi-objective particle swarm optimization algorithm with a local tabu search algorithm for pulmonary disease classification, with accuracy up to 90.588%. In 2020, Shifa et al. extracted 902 behavioral cues from a dataset, covering voice prosody, eye movement, and head pose in speech behavior, and then aggregated and selected optimal features using 38 feature selection algorithms of different classes to explain a depression detection model; the results indicate that voice behavioral features are the most significant features of the depression detection model. In 2017, Li et al. reviewed feature selection research from the data perspective and classified feature selection algorithms for conventional data into similarity-based, information-theory-based, sparse-learning-based, and statistics-based methods, while explaining how to evaluate feature selection algorithms. In summary, these works demonstrate that feature selection can be applied to disease screening.

However, in the method proposed by Wiharto et al., the genetic algorithm converges slowly, has poor local search capability, and has many control variables, while the support vector machine is difficult to apply to large-scale training samples and takes a long time, so combining the two methods does not improve the overall speed. The method proposed by Dharmalingam et al. works well only locally, and its convergence is slow in the later stage of iteration. The feature dimensions in the method proposed by Shifa et al. are relatively difficult to obtain except for an institution or hospital dedicated to the study of depression. Li et al. only classified and evaluated feature selection methods from the data perspective.

Therefore, how to provide a multi-model feature selection method and system based on feature contribution is a problem that needs to be solved by those skilled in the art.

Disclosure of Invention

In view of the above, the invention provides a multi-model feature selection method and system based on feature contribution, which builds models on routine test data and performs feature selection according to feature contribution degree, so as to achieve a better effect with fewer features. Moreover, routine test data is relatively easy to obtain, and the overall implementation is convenient and rapid.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A multi-model feature selection method based on feature contribution, comprising the following steps:

S1, extracting a blood routine test data set and a biochemical test data set of patients and healthy individuals, and obtaining a sample feature set for each data dimension;

S2, dividing the sample feature set of each data dimension into K parts by K-fold cross-validation to obtain a test set and a training set, where K is an arbitrary constant greater than 1;

S3, selecting a plurality of machine learning classifier models, training them on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance for the sample feature set of each data dimension, and assigning weight values;

S4, ranking the machine learning classifier models by average accuracy and assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature across the different models, and ranking the features by the result to select an optimal feature subset;

S5, training on the optimal feature subset with the machine learning model having the highest average accuracy, comparing the result with that of S4, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.

Preferably, S1 further includes data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patients and healthy individuals, the data dimension feature sets comprising the blood routine data set, the biochemical data set, and the data set combining the blood routine test and the biochemical test.

Preferably, the specific content of S3 includes:

S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR, and classification accuracy ACC of each classifier, and drawing PR and ROC curves to obtain the AP and AUC values;

S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;

S33, interpreting, for each machine learning classifier model, the contribution of each feature to the model as its feature importance, and assigning weights σ_j according to the ranking of feature contribution degree, where j is the feature number.

Preferably, in S31, the TPR, TNR, and ACC are specifically:

TPR＝TP/(TP+FN)

TNR＝TN/(FP+TN)

ACC＝(TP+TN)/(TP+FP+FN+TN)

wherein TP represents the number of patients that the classifier correctly identifies as patients, FP represents the number of healthy individuals that the classifier misidentifies as patients, FN represents the number of patients that the classifier misidentifies as healthy individuals, and TN represents the number of healthy individuals that the classifier correctly identifies as healthy individuals;

k-fold cross-validation training is used for each classifier model to obtain TPR, TNR and ACC, wherein the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;

in S32, an ROC curve is drawn with TPR as the ordinate and FPR as the abscissa, where FPR = FP/(FP+TN) = 1 - TNR, and a PR curve is drawn with TPR as the abscissa and precision as the ordinate; the AUC value is the area enclosed by the ROC curve and the coordinate axis, and AP is the area enclosed by the PR curve and the X axis;

wherein the precision calculation formula is:

Precision＝TP/(TP+FP)；

and the average accuracy of each model is obtained by averaging the classification accuracy of the classifier over the K rounds of the K-fold cross-validation.
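The ROC construction and the fold-averaged accuracy described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names (`roc_points`, `auc_trapezoid`, `mean_accuracy`), the toy labels and scores, and the use of the trapezoid rule for the area are assumptions, not details given in the patent. Labels are assumed to be 1 for patients and 0 for healthy individuals, with at least one sample of each class.

```python
def roc_points(y_true, scores):
    """Return ROC points (FPR, TPR), one per distinct score threshold."""
    pos = sum(y_true)             # number of positive (patient) samples
    neg = len(y_true) - pos       # number of negative (healthy) samples
    points = [(0.0, 0.0)]         # the curve starts at the origin
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area between the ROC curve and the FPR axis (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def mean_accuracy(fold_accuracies):
    """Average accuracy over the K rounds of K-fold cross-validation."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Toy data (hypothetical): 3 patients and 3 healthy individuals.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(y_true, scores))
avg_acc = mean_accuracy([0.80, 0.85, 0.90, 0.85, 0.80])
```

The PR curve and AP value could be computed in the same way by collecting (TPR, precision) pairs per threshold instead of (FPR, TPR) pairs.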

Preferably, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.

Preferably, the specific content of S4 includes:

S41, ranking the models by the average accuracy obtained on the sample feature set and setting the model weights β_i accordingly; the models are ordered by average accuracy as RF, NN, SVM, LR, KNN, and the weight of a model changes with its position in this ordering;

s42, the total weight of each feature under different models is as follows:

wherein σ_ij represents the weight value of the j-th feature under the i-th model;

S43, ranking the total weights of the features within the sample feature set of each data dimension, and selecting the features in the first half of the ranking for each data dimension as the optimal feature subset.
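Steps S41 to S43 can be illustrated with a short Python sketch. The total-weight formula itself is not reproduced in this text, so the sketch assumes the simplest combination consistent with the description, a sum over models of β_i multiplied by σ_ij. The rank-based β_i values, the function names, the feature names `f1` to `f4`, and the accuracy figures are likewise hypothetical and serve only to show the flow.

```python
def model_weights(avg_accuracy):
    """Assumed scheme for beta_i: the best model gets m, the worst gets 1."""
    ordered = sorted(avg_accuracy, key=avg_accuracy.get, reverse=True)
    m = len(ordered)
    return {model: m - pos for pos, model in enumerate(ordered)}


def total_feature_weights(beta, sigma):
    """Assumed combination: W_j = sum over models i of beta_i * sigma_ij."""
    features = next(iter(sigma.values())).keys()
    return {f: sum(beta[m] * sigma[m][f] for m in beta) for f in features}


def select_top_half(total):
    """Rank features by total weight and keep the first half (floor)."""
    ordered = sorted(total, key=total.get, reverse=True)
    return ordered[: len(ordered) // 2]


# Hypothetical average accuracies and per-model feature weights sigma_ij.
avg_accuracy = {"RF": 0.90, "NN": 0.88, "SVM": 0.85}
sigma = {
    "RF": {"f1": 4, "f2": 3, "f3": 2, "f4": 1},
    "NN": {"f1": 3, "f2": 4, "f3": 1, "f4": 2},
    "SVM": {"f1": 4, "f2": 2, "f3": 3, "f4": 1},
}
beta = model_weights(avg_accuracy)
total = total_feature_weights(beta, sigma)
best = select_top_half(total)
```

With these toy numbers, `f1` and `f2` have the largest totals and form the selected subset. A different β_i scheme or a different combining formula would change the totals but not the overall flow of rank, combine, and keep the first half.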

A multi-model feature selection system based on feature contribution, comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module, and an optimal feature subset determination module;

the data acquisition module is used for respectively extracting a blood routine test data set and a biochemical test data set of a patient and a healthy individual and respectively acquiring a sample feature set of each data dimension;

the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts by K-fold cross-validation, taking one part as the test set and the remaining K-1 parts as the training set, where K is an arbitrary constant greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance for the sample feature set of each data dimension;

the feature weight assignment module is used for assigning a feature weight value according to the output feature importance;

the feature total weight calculation module is used for sorting according to the average accuracy of the machine learning classifier models, assigning corresponding weights to the models, combining the model weights with feature weight values to construct a formula, and calculating the total weight of each feature under different models;

the optimal feature subset acquisition module is used for sorting and selecting an optimal feature subset according to the total weight result of each feature under different models;

and the optimal feature subset determination module is used for training on the optimal feature subset with the machine learning model having high average accuracy, comparing the result with the result of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.

Preferably, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, wherein the data dimension feature sets are selected according to blood routine test data sets and biochemical test data sets of patients and healthy individuals respectively, and the data dimension feature sets comprise the blood routine test data sets, the biochemical test data sets and data sets of blood routine test and biochemical test combination.

Compared with the prior art, the multi-model feature selection method and system based on feature contribution provided by the invention have the following advantages. The abnormal data set contains a plurality of related features. Machine learning classification models are used to analyze the contribution degree of the features and assign weight values to them; the models are then ranked by accuracy and assigned weights; the total weight of each feature across the different models is calculated by a formula constructed from the model weights and the feature weight values; the features are ranked by that value and the first half is selected as the optimal feature subset; finally, the optimal feature subset is trained with a machine learning model and the result is compared with the previous result. This makes full use of the advantages of feature selection: only feature information with a high feature contribution degree is selected as the features of the machine learning model. Compared with traditional methods, training complexity is greatly reduced while prediction accuracy is preserved, and the optimal feature subset is found more efficiently and quickly in a high-dimensional data set.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a multi-model feature selection method based on feature contribution provided by the invention;

FIG. 2 is a schematic diagram of preliminary statistics of feature importance under each model provided in an embodiment of the present invention;

FIG. 3 is a diagram showing the comparison between the training results of all features and the training results of the optimal feature subset of the random forest model according to the embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The embodiment of the invention discloses a multi-model feature selection method based on feature contribution, as shown in fig. 1, comprising the following steps:

In practical application, k takes a value of 5 or 10; one of the k parts is taken as the test set, and the remaining k-1 parts are taken as the training set.
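The split described above can be sketched in plain Python as follows. This is a minimal illustration, not the embodiment's code: the function name `k_fold_indices`, the shuffling with a fixed seed, and the sample count of 20 are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    # Split the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]  # one part is the test set
        # The remaining k-1 parts together form the training set.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Example with k = 5 on 20 hypothetical samples.
splits = list(k_fold_indices(n_samples=20, k=5))
```

Each sample appears in exactly one test set across the k rounds, so every classifier is evaluated once on every sample.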

In this embodiment, the sample feature set includes 22 blood routine features and 14 biochemical features.

In order to further implement the above technical solution, S1 further includes data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patients and healthy individuals, the data dimension feature sets comprising blood routine data, biochemical data, and combined blood routine and biochemical data.

In this embodiment, the blood routine test data set and the biochemical test data set in S1 are data sets of patients with a single selected disease and of healthy individuals.

In order to further implement the above technical solution, the specific content of S3 includes:

S33, interpreting, for each machine learning classifier model, the contribution of each feature to the model as its feature importance, and assigning weights σ_j according to the ranking of feature contribution degree, where j is the feature number, as shown in fig. 2.

In this embodiment, j = 1, 2, ..., 36.
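One way to turn a model's feature importance ranking into weights σ_j is sketched below. The text does not state the exact assignment rule, so the sketch assumes that the most important of n features receives weight n and the least important receives weight 1. The function name `rank_weights`, the four feature names, and the importance scores are hypothetical.

```python
def rank_weights(importances):
    """Map feature name -> rank-based weight (most important gets n, least gets 1)."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    n = len(ordered)
    return {name: n - position for position, name in enumerate(ordered)}


# Hypothetical importance scores output by one classifier.
importances = {"WBC": 0.30, "HGB": 0.10, "ALT": 0.45, "GLU": 0.15}
sigma = rank_weights(importances)
```

Using ranks rather than raw importance scores puts the outputs of different model families on a common scale, which is one plausible reason for weighting by position in the ordering.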

In order to further implement the above technical solution, in S31, the TPR, TNR, and ACC are specifically:

TPR＝TP/(TP+FN)

TNR＝TN/(FP+TN)

ACC＝(TP+TN)/(TP+FP+FN+TN)

wherein the precision calculation formula is:

Precision＝TP/(TP+FP)；
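The four formulas above can be checked on a small example. The sketch below is illustrative only: the function names and the toy labels are assumptions, labels are taken as 1 for patients and 0 for healthy individuals, and all denominators are assumed to be non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = patient, 0 = healthy)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1   # patient correctly identified
        elif t == 0 and p == 1:
            fp += 1   # healthy individual flagged as patient
        elif t == 1 and p == 0:
            fn += 1   # patient missed
        else:
            tn += 1   # healthy individual correctly identified
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Evaluate TPR, TNR, ACC and Precision from the confusion counts."""
    return {
        "TPR": tp / (tp + fn),
        "TNR": tn / (fp + tn),
        "ACC": (tp + tn) / (tp + fp + fn + tn),
        "Precision": tp / (tp + fp),
    }


# Toy example (hypothetical): 4 patients and 6 healthy individuals.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Here three of the four patients are detected and four of the six healthy individuals are cleared, giving a sensitivity of 0.75 and an overall accuracy of 0.7.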

In order to further implement the above technical solution, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.

In this embodiment, the specific usage parameters of the five classifiers of random forest, neural network, support vector machine, K nearest neighbor, and logistic regression are respectively:

the random forest algorithm uses the RandomForestClassifier function, where n_estimators is set to 100, min_samples_split is set to 2, and min_samples_leaf is set to 1;

the neural network algorithm uses a Sequential model in Keras with two layers of 512 and 1 units, whose activation is set to relu and sigmoid respectively; the loss function is the binary cross-entropy loss; the optimization algorithm is Adam; epochs is set to 200; batch_size is set to 32;

the support vector machine algorithm uses the SVC function, where kernel is set to linear, class_weight is set to balanced, and probability is set to True;

the K nearest neighbor algorithm uses the KNeighborsClassifier function, where n_neighbors is set to 5 and the other parameters are left at their defaults;

the logistic regression algorithm uses the LogisticRegression function, where class_weight is set to {0:1,1:2} and the other parameters are left at their defaults.
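For reference, the settings listed above can be collected as plain Python dictionaries. This is a hedged configuration sketch: it assumes the four classical classifiers correspond to the scikit-learn classes `RandomForestClassifier`, `SVC`, `KNeighborsClassifier` and `LogisticRegression`, and that the neural network is a two-layer Keras `Sequential` model. The dictionary names `CLASSIFIER_PARAMS` and `KERAS_NN` are hypothetical.

```python
# Keyword arguments per assumed scikit-learn class; unspecified
# parameters are left at the library defaults.
CLASSIFIER_PARAMS = {
    "RandomForestClassifier": {
        "n_estimators": 100,
        "min_samples_split": 2,
        "min_samples_leaf": 1,
    },
    "SVC": {
        "kernel": "linear",
        "class_weight": "balanced",
        "probability": True,   # needed to obtain scores for ROC/PR curves
    },
    "KNeighborsClassifier": {"n_neighbors": 5},
    "LogisticRegression": {"class_weight": {0: 1, 1: 2}},
}

# Assumed layout of the Keras Sequential network and its training settings.
KERAS_NN = {
    "layers": [
        {"units": 512, "activation": "relu"},
        {"units": 1, "activation": "sigmoid"},
    ],
    "loss": "binary_crossentropy",
    "optimizer": "adam",
    "epochs": 200,
    "batch_size": 32,
}
```

Under these assumptions, each entry would be passed to its constructor as keyword arguments, for example `SVC(**CLASSIFIER_PARAMS["SVC"])`.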

In order to further implement the above technical solution, the specific content of S4 includes:

s42, the total weight of each feature under different models is as follows:

wherein σ_ij represents the weight value of the j-th feature under the i-th model, i = 1, 2, ..., 5;

In this embodiment, the model with the highest average accuracy is the random forest RF model. The result RF of training on all features with the random forest model is compared with the result jwrf of training on the optimal feature subset with the random forest model, to determine that the effect of the selected optimal feature subset is better than or equal to the effect of the full feature set, as shown in fig. 3.

In order to further implement the technical scheme, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, wherein the data dimension feature sets are selected according to blood routine test data sets and biochemical test data sets of patients and healthy individuals respectively, and the data dimension feature sets comprise the blood routine test data sets, the biochemical test data sets and data sets of blood routine test and biochemical test combination.

In the present specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of multi-model feature selection based on feature contribution, comprising the steps of:

s1, respectively extracting a blood routine test data set and a biochemical test data set of a patient and a healthy individual, and obtaining a sample feature set of each data dimension;

s2, dividing a sample feature set of each data dimension into K parts based on a K-fold cross validation method, and respectively obtaining a test set and a training set, wherein K is a constant which is arbitrarily larger than 1;

s5, training the optimal feature subset by using a machine learning model with highest average accuracy, comparing the optimal feature subset with the result of S4, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the feature full set;

the specific content of S3 comprises:

S33, interpreting, for each machine learning classifier model, the contribution of each feature to the model as its feature importance, and assigning weights σ_j according to the ranking of feature contribution degree, where j is the feature number;

TPR, TNR and ACC are specifically:

TPR＝TP/(TP+FN)

TNR＝TN/(FP+TN)

ACC＝(TP+TN)/(TP+FP+FN+TN)

drawing a ROC curve by taking TPR as an ordinate and FPR as an abscissa, drawing a PR curve by taking TPR as an abscissa and precision as an ordinate, wherein an AUC value is an area enclosed by the ROC curve and a coordinate axis, and AP is a graph area enclosed by the PR curve and an X axis;

wherein the precision calculation formula is:

Precision＝TP/(TP+FP)；

obtaining the average accuracy of each model by averaging the classification accuracy of the classifier over the K rounds of the K-fold cross-validation;

the specific content of S4 comprises:

s42, the total weight of each feature under different models is as follows:

wherein σ_ij represents the weight value of the j-th feature under the i-th model;

2. The multi-model feature selection method based on feature contribution according to claim 1, wherein S1 further comprises data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patients and healthy individuals, the data dimension feature sets comprising the blood routine test data set, the biochemical test data set, and a data set combining the blood routine test and the biochemical test.

3. A multi-model feature selection method based on feature contribution as recited in claim 1, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.

4. A multi-model feature selection system based on feature contribution, characterized by applying the multi-model feature selection method based on feature contribution according to any one of claims 1-3, and comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module, and an optimal feature subset determination module;

the multiple machine learning classification models are used for dividing a sample feature set of each data dimension into K parts based on a K-fold cross validation method to respectively obtain a test set and a training set, wherein K is a constant which is arbitrarily larger than 1; selecting a plurality of machine learning classifier models to train sample sets of sample characteristics of each data dimension, performing embedded characteristic selection to obtain average accuracy of each model, and outputting characteristic importance of the sample characteristic sets of each data dimension;

and the optimal feature subset determining module is used for comparing the training of the optimal feature subset by using the machine learning model with the highest average accuracy with the result of the feature total weight calculating module, and determining that the effect of the selected optimal feature subset is better than or equal to the effect of the feature total set.

5. The multi-model feature selection system based on feature contribution of claim 4, further comprising a data preprocessing module for selecting data dimension feature sets from a blood routine test data set and a biochemical test data set of the patient and the healthy individual, respectively, including the blood routine test data set, the biochemical test data set, and a blood routine test combined with biochemical test data set.

6. A multi-model feature selection system based on feature contribution as recited in claim 4, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.