CN116226629B - Multi-model feature selection method and system based on feature contribution
- Publication number
- CN116226629B (application number CN202211357878.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention discloses a multi-model feature selection method and system based on feature contribution, comprising the following steps: S1, extracting blood routine test data and biochemical test data, and obtaining the respective sample feature sets; S2, obtaining a test set and a training set by k-fold cross-validation; S3, training a plurality of machine learning classifier models on the sample sets, performing embedded feature selection, obtaining the average accuracy of each model, outputting the feature importance, and assigning feature weight values; S4, ranking the models by average accuracy and assigning corresponding model weights, combining the model weights with the feature weight values in a formula, calculating the total weight of each feature across the different models, and sorting to select an optimal feature subset; S5, training on the optimal feature subset with the model of highest average accuracy, comparing with the result of S4, and determining the effect of the selected optimal feature subset. The method greatly reduces training complexity while maintaining prediction accuracy, and finds the optimal feature subset in a high-dimensional data set more efficiently and quickly.
Description
Technical Field
The invention relates to the technical fields of laboratory medicine and disease screening, and in particular to a multi-model feature selection method and system based on feature contribution.
Background
With the development of artificial intelligence technology, machine learning is increasingly applied in the medical field, mainly for disease screening and auxiliary diagnosis models. Feature engineering must be carried out before a model is built and is an indispensable part of machine learning: it uses a series of methods to screen relatively good data features from the original data so as to improve the training effect of the model. Feature selection is an important link in feature engineering. Its aim is to find an optimal feature subset by removing features with lower contribution, so that reducing the number of features improves model accuracy and shortens running time.
In 2022, Wiharto et al. used a genetic algorithm and a support vector machine for feature selection, combined with a deep neural network, to diagnose coronary heart disease, reaching an accuracy of 87.7%. In 2021, Dharmalingam et al. used a hybrid multi-objective particle swarm optimization algorithm with a local tabu search algorithm for pulmonary disease classification, with accuracy up to 90.588%. In 2020, Shifa et al. extracted 902 behavioral cues (speech behavior, voice prosody, eye movement, and head pose) from a dataset, then aggregated and selected optimal features from 38 feature selection algorithms of different classes to explain a depression detection model; the results indicate that voice behavioral features are the most significant features of the depression detection model. In 2017, Li et al. reviewed feature selection research from the data perspective, classified the feature selection algorithms for conventional data into similarity-based, information-theory-based, sparse-learning-based, and statistics-based methods, and explained how to evaluate feature selection algorithms. In summary, these works demonstrate that feature selection can be applied to disease screening.
However, in the method proposed by Wiharto et al., the genetic algorithm converges slowly, has poor local search capability, and has many control variables, while the support vector machine is difficult to apply to large-scale training samples and is time-consuming, so combining the two methods does not improve overall speed. The method proposed by Dharmalingam et al. searches only locally, and its convergence is slow in the later iterations. The feature dimensions used in the method proposed by Shifa et al. are relatively difficult to obtain except by an institution or hospital dedicated to the study of depression. Li et al. only classified and evaluated feature selection methods from the data perspective.
Therefore, how to provide a multi-model feature selection method and system based on feature contribution is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-model feature selection method and system based on feature contribution, which builds models from routine test data and performs feature selection according to the degree of feature contribution, so as to achieve a better effect with fewer features. Routine test data is also relatively easy to obtain, so the whole implementation is relatively convenient and rapid.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multi-model feature selection method based on feature contribution, comprising the steps of:
S1, respectively extracting a blood routine test data set and a biochemical test data set of patients and healthy individuals, and respectively obtaining a sample feature set for each data dimension;
S2, dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, wherein K is any integer greater than 1;
S3, selecting a plurality of machine learning classifier models to train on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance for the sample feature set of each data dimension, and assigning weight values;
S4, ranking the machine learning classifier models by average accuracy, assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature across the different models, and sorting by the results to select an optimal feature subset;
S5, training on the optimal feature subset using the machine learning model with the highest average accuracy, comparing with the result of S4, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.
Preferably, S1 further includes data preprocessing, specifically: selecting data-dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, wherein the data-dimension feature sets comprise the blood routine data set, the biochemical data set, and the data set combining the blood routine test and the biochemical test.
Preferably, the specific content of S3 includes:
S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR and classification accuracy ACC of the classifier, and drawing PR and ROC curves to obtain the AUC and AP values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, interpreting, through the feature importance, the contribution of each feature when each machine learning classifier model is built, and assigning a weight σ_j according to the ranking of feature contributions, where j is the feature number.
Preferably, in S31, the TPR, TNR, and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
wherein TP represents the number of patients that the classifier correctly identifies as patients, FP represents the number of healthy individuals that the classifier misidentifies as patients, FN represents the number of patients that the classifier misidentifies as healthy individuals, and TN represents the number of healthy individuals that the classifier correctly identifies as healthy individuals;
each classifier model is trained with k-fold cross-validation to obtain TPR, TNR and ACC, wherein the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
in S32, the ROC curve is drawn with TPR as the ordinate and FPR = FP/(FP+TN) as the abscissa, and the PR curve is drawn with TPR as the abscissa and precision as the ordinate; the AUC value is the area enclosed by the ROC curve and the coordinate axis, and the AP is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
and the average accuracy of each model is obtained as the mean of the K classification accuracies produced by the K rounds of K-fold cross-validation.
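The metric definitions above can be illustrated with a minimal Python sketch. The function name `classification_metrics` and the fold counts are hypothetical and chosen only for illustration; patients are assumed to be the positive class, which is consistent with TPR being the sensitivity.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the metrics defined above from confusion-matrix counts.

    Assumption: patients are the positive class and healthy individuals
    are the negative class.
    """
    return {
        "TPR": tp / (tp + fn),                   # sensitivity
        "TNR": tn / (fp + tn),                   # specificity
        "FPR": fp / (fp + tn),                   # equals 1 - TNR (ROC abscissa)
        "ACC": (tp + tn) / (tp + fp + fn + tn),  # classification accuracy
        "Precision": tp / (tp + fp),             # PR-curve ordinate
    }


# Hypothetical counts for a single test fold (illustrative only).
metrics = classification_metrics(tp=40, fp=5, fn=10, tn=45)
```

Averaging the `ACC` values over the K folds would then give the per-model average accuracy used in the later steps.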
Preferably, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
Preferably, the specific content of S4 includes:
S41, ranking the models by the average accuracy obtained on the sample feature set and setting model weights β_i accordingly, wherein the models ranked from highest to lowest average accuracy are RF, NN, SVM, LR, KNN, and the weight changes with the position in the ranking;
S42, calculating the total weight of each feature across the different models by a formula combining β_i and σ_ij [formula not reproduced in this text], wherein σ_ij represents the weight value of the j-th feature under the i-th model;
S43, ranking the total weights of the features for the sample feature set of each data dimension, and selecting the top half of the ranked features of each sample feature set as the optimal feature subset.
A multi-model feature selection system based on feature contribution, comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module and an optimal feature subset determination module;
the data acquisition module is used for respectively extracting a blood routine test data set and a biochemical test data set of patients and healthy individuals and respectively obtaining a sample feature set for each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method, taking one part as the test set and the remaining K-1 parts as the training set, K being any integer greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance for the sample feature set of each data dimension;
the feature weight assignment module is used for assigning feature weight values according to the output feature importance;
the feature total weight calculation module is used for ranking the machine learning classifier models by average accuracy, assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, and calculating the total weight of each feature across the different models;
the optimal feature subset acquisition module is used for sorting by the total weight of each feature across the different models and selecting an optimal feature subset;
and the optimal feature subset determination module is used for training on the optimal feature subset with the machine learning model having the highest average accuracy, comparing the result with that of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.
Preferably, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, wherein the data dimension feature sets are selected according to blood routine test data sets and biochemical test data sets of patients and healthy individuals respectively, and the data dimension feature sets comprise the blood routine test data sets, the biochemical test data sets and data sets of blood routine test and biochemical test combination.
Preferably, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
Compared with the prior art, the multi-model feature selection method and system based on feature contribution provided by the invention work as follows. The data set contains a plurality of related features. Machine learning classification models are used to analyze the contribution of the features and assign feature weight values. The models are then ranked by accuracy and assigned model weights, and the total value of each feature across the different models is calculated by a formula constructed from the model weights. The features are sorted by this value and the top half is selected as the optimal feature subset. Finally, the optimal feature subset is trained with a machine learning model and the result is compared with the previous result. This fully exploits the advantage of feature selection, since only features with a high contribution are used as inputs to the machine learning model. Compared with the traditional method, training complexity is greatly reduced while prediction accuracy is maintained, and the optimal feature subset is found more efficiently and quickly in a high-dimensional data set.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a multi-model feature selection method based on feature contribution provided by the invention;
FIG. 2 is a schematic diagram of preliminary statistics of feature importance under each model provided in an embodiment of the present invention;
FIG. 3 is a diagram showing the comparison between the training results of all features and the training results of the optimal feature subset of the random forest model according to the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a multi-model feature selection method based on feature contribution, as shown in fig. 1, comprising the following steps:
S1, respectively extracting a blood routine test data set and a biochemical test data set of patients and healthy individuals, and respectively obtaining a sample feature set for each data dimension;
S2, dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, wherein K is any integer greater than 1;
S3, selecting a plurality of machine learning classifier models to train on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance for the sample feature set of each data dimension, and assigning weight values;
S4, ranking the machine learning classifier models by average accuracy, assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature across the different models, and sorting by the results to select an optimal feature subset;
S5, training on the optimal feature subset using the machine learning model with the highest average accuracy, comparing with the result of S4, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.
In practical application, k takes a value of 5 or 10; in each round, one of the k parts is taken as the test set and the remaining k-1 parts are taken as the training set.
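The split described above can be sketched in plain Python as follows. This is only an illustrative sketch: the helper name `k_fold_splits`, the shuffling, and the fixed seed are assumptions not stated in the source, and no class stratification is attempted.

```python
import random


def k_fold_splits(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: shuffle before splitting
    # Striding distributes the samples so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 23 hypothetical samples split into k = 5 folds.
splits = list(k_fold_splits(n_samples=23, k=5))
```

Each sample appears in exactly one test set, so every model is evaluated once on every sample across the k rounds.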
In this embodiment, the sample feature set includes 22 blood routine features and 14 biochemical features.
In order to further implement the above technical solution, S1 further includes data preprocessing, specifically: selecting data-dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, wherein the data-dimension feature sets comprise blood routine data, biochemical data, and combined blood routine and biochemical data.
In this example, the blood routine test data set and the biochemical test data set in S1 are data sets of patients with a single selected disease and of healthy individuals.
In order to further implement the above technical solution, the specific content of S3 includes:
S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR and classification accuracy ACC of the classifier, and drawing PR and ROC curves to obtain the AUC and AP values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, interpreting, through the feature importance, the contribution of each feature when each machine learning classifier model is built, and assigning a weight σ_j according to the ranking of feature contributions, where j is the feature number, as in fig. 2.
In this embodiment, j = 1, 2, ..., 36.
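The text does not state how σ_j is derived from the ranking of feature contributions. The following Python sketch shows one possible rank-based assignment under a stated assumption: with n features, the most important feature receives weight n and the least important receives weight 1. The function name `rank_weights`, the feature names, and the importance values are all hypothetical.

```python
def rank_weights(importances: dict) -> dict:
    """Map feature importances to rank-based weights.

    Assumption (not from the source): with n features, the most important
    feature receives weight n, the next n - 1, and so on down to 1.
    Ties are broken by feature name so the result is deterministic.
    """
    ordered = sorted(importances, key=lambda name: (-importances[name], name))
    n = len(ordered)
    return {name: n - position for position, name in enumerate(ordered)}


# Hypothetical importances of three features under one model.
sigma = rank_weights({"WBC": 0.12, "HGB": 0.30, "ALT": 0.05})
```

A rank-based weight makes importances from different model families comparable, since raw importance scores are not on a common scale.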
In order to further implement the above technical solution, in S31, the TPR, TNR, and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
wherein TP represents the number of patients that the classifier correctly identifies as patients, FP represents the number of healthy individuals that the classifier misidentifies as patients, FN represents the number of patients that the classifier misidentifies as healthy individuals, and TN represents the number of healthy individuals that the classifier correctly identifies as healthy individuals;
each classifier model is trained with k-fold cross-validation to obtain TPR, TNR and ACC, wherein the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
in S32, the ROC curve is drawn with TPR as the ordinate and FPR = FP/(FP+TN) as the abscissa, and the PR curve is drawn with TPR as the abscissa and precision as the ordinate; the AUC value is the area enclosed by the ROC curve and the coordinate axis, and the AP is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
and the average accuracy of each model is obtained as the mean of the K classification accuracies produced by the K rounds of K-fold cross-validation.
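As an illustration of the AUC described above, the following Python sketch builds the ROC points by sweeping the decision threshold and integrates them with the trapezoidal rule. The function names, labels, and scores are hypothetical; the sketch assumes label 1 denotes a patient, label 0 a healthy individual, and that both classes are present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold.

    labels: 1 for patient (positive), 0 for healthy (negative).
    scores: classifier scores, higher meaning more likely positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    i = 0
    while i < len(ranked):
        # Samples sharing a score are processed together so ties form one step.
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and scores for six test samples.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_points(labels, scores))
```

For these six samples, eight of the nine patient/healthy pairs are ordered correctly by score, so the area equals 8/9.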
In order to further implement the above technical solution, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
In this embodiment, the specific parameters used for the five classifiers (random forest, neural network, support vector machine, K nearest neighbor, and logistic regression) are respectively:
the random forest algorithm uses the RandomForestClassifier function, where n_estimators is set to 100, min_samples_split is set to 2, and min_samples_leaf is set to 1;
the neural network algorithm uses a Sequential model in Keras with two layers of 512 and 1 units, whose activations are set to relu and sigmoid respectively; the loss function is the binary cross-entropy loss; the optimization algorithm is Adam; the number of epochs is set to 200; batch_size is set to 32;
the support vector machine algorithm uses the SVC function, where the parameter kernel is set to linear, class_weight is set to balanced, and probability is set to True;
the K nearest neighbor algorithm uses the KNeighborsClassifier function, where n_neighbors is set to 5 and the other parameters are defaults;
the logistic regression algorithm uses the LogisticRegression function, where class_weight is set to {0:1,1:2} and the other parameters are defaults.
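The parameter settings listed above could be written as follows, assuming the named functions are the scikit-learn estimators of the same names (the source does not name the library, and "balanced" is the spelling scikit-learn accepts for class_weight). The Keras network is left out of this sketch; the dictionary name `classifiers` is illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumed mapping of the listed settings onto scikit-learn estimators.
classifiers = {
    "RF": RandomForestClassifier(
        n_estimators=100, min_samples_split=2, min_samples_leaf=1
    ),
    "SVM": SVC(kernel="linear", class_weight="balanced", probability=True),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    # Class 1 is weighted twice as heavily as class 0.
    "LR": LogisticRegression(class_weight={0: 1, 1: 2}),
}
```

Setting `probability=True` on the SVC makes `predict_proba` available, which is convenient for drawing the ROC and PR curves of S31.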
In order to further implement the above technical solution, the specific content of S4 includes:
S41, ranking the models by the average accuracy obtained on the sample feature set and setting model weights β_i accordingly, wherein the models ranked from highest to lowest average accuracy are RF, NN, SVM, LR, KNN, and the weight changes with the position in the ranking;
S42, calculating the total weight of each feature across the different models by a formula combining β_i and σ_ij [formula not reproduced in this text], wherein σ_ij represents the weight value of the j-th feature under the i-th model, i = 1, 2, ..., 5;
S43, ranking the total weights of the features for the sample feature set of each data dimension, and selecting the top half of the ranked features of each sample feature set as the optimal feature subset.
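Because the formula of S42 is not reproduced in this text, the following Python sketch of S41 to S43 rests on two explicit assumptions: the total weight of feature j is the weighted sum W_j = Σ_i β_i·σ_ij, and the model weights β_i are rank-based, with the most accurate of m models receiving weight m. All function names, accuracies, and σ values are hypothetical.

```python
def model_weights(avg_accuracy: dict) -> dict:
    """Assign rank-based model weights beta_i.

    Assumption (not from the source): with m models, the most accurate
    model receives weight m, the next m - 1, and so on down to 1.
    """
    ordered = sorted(avg_accuracy, key=lambda m: (-avg_accuracy[m], m))
    return {m: len(ordered) - pos for pos, m in enumerate(ordered)}


def total_weights(beta: dict, sigma: dict) -> dict:
    """Combine weights as W_j = sum_i beta_i * sigma_ij (assumed form)."""
    features = next(iter(sigma.values())).keys()
    return {f: sum(beta[m] * sigma[m][f] for m in beta) for f in features}


def top_half(weights: dict) -> list:
    """Return the first half of the features, sorted by total weight."""
    ordered = sorted(weights, key=lambda f: (-weights[f], f))
    return ordered[: len(ordered) // 2]  # rounds down for odd counts


# Hypothetical accuracies and per-model feature weights sigma[model][feature].
accuracy = {"RF": 0.91, "NN": 0.89, "SVM": 0.86}
sigma = {
    "RF": {"f1": 4, "f2": 3, "f3": 2, "f4": 1},
    "NN": {"f1": 3, "f2": 4, "f3": 1, "f4": 2},
    "SVM": {"f1": 4, "f2": 2, "f3": 3, "f4": 1},
}
beta = model_weights(accuracy)
totals = total_weights(beta, sigma)
selected = top_half(totals)
```

Under these assumptions a feature ranked highly by the more accurate models dominates the total, which matches the stated intent that the weight changes with the model's position in the ranking.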
In this embodiment, the model with the highest average accuracy is the random forest RF model. The result rf of training on all features with the random forest model is compared with the result jwrf of training on the optimal feature subset with the random forest model, to determine that the effect of the selected optimal feature subset is better than or equal to the effect of the full feature set, as shown in fig. 3.
A multi-model feature selection system based on feature contribution, comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module and an optimal feature subset determination module;
the data acquisition module is used for respectively extracting a blood routine test data set and a biochemical test data set of patients and healthy individuals and respectively obtaining a sample feature set for each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method, taking one part as the test set and the remaining K-1 parts as the training set, K being any integer greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance for the sample feature set of each data dimension;
the feature weight assignment module is used for assigning feature weight values according to the output feature importance;
the feature total weight calculation module is used for ranking the machine learning classifier models by average accuracy, assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, and calculating the total weight of each feature across the different models;
the optimal feature subset acquisition module is used for sorting by the total weight of each feature across the different models and selecting an optimal feature subset;
and the optimal feature subset determination module is used for training on the optimal feature subset with the machine learning model having the highest average accuracy, comparing the result with that of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.
In order to further implement the technical scheme, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, wherein the data dimension feature sets are selected according to blood routine test data sets and biochemical test data sets of patients and healthy individuals respectively, and the data dimension feature sets comprise the blood routine test data sets, the biochemical test data sets and data sets of blood routine test and biochemical test combination.
In order to further implement the above technical solution, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
In the present specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be understood by reference to one another. Since the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (6)
1. A method of multi-model feature selection based on feature contribution, comprising the steps of:
S1, respectively extracting a blood routine test data set and a biochemical test data set of a patient and a healthy individual, and obtaining a sample feature set of each data dimension;
S2, dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, respectively, wherein K is any constant greater than 1;
S3, selecting a plurality of machine learning classifier models to train on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance of the sample feature set of each data dimension, and assigning weight values;
S4, sorting according to the average accuracy of the machine learning classifier models, assigning corresponding weights to the models, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature under different models, and sorting according to the results to select an optimal feature subset;
S5, training on the optimal feature subset by using the machine learning model with the highest average accuracy, comparing the outcome with the result of S4, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set;
the specific content of S3 comprises:
S31, classifying the sample feature set of each data dimension by using five machine learning classifiers respectively, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR and classification accuracy ACC of each classifier, and drawing PR and ROC curves to obtain the AUC and AP values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, using feature importance to explain the contribution degree of each feature to the modeling of each machine learning classifier model, and assigning weights σ_j according to the ranking of the feature contribution degrees, where j is the feature index;
TPR, TNR and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
wherein TP represents the number of patients correctly identified by the classifier as patients, FP represents the number of healthy individuals misidentified by the classifier as patients, FN represents the number of patients misidentified by the classifier as healthy individuals, and TN represents the number of healthy individuals correctly identified by the classifier as healthy individuals;
K-fold cross-validation training is used for each classifier model to obtain TPR, TNR and ACC, wherein the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
an ROC curve is drawn with TPR as the ordinate and FPR as the abscissa, and a PR curve is drawn with TPR as the abscissa and precision as the ordinate, wherein the AUC value is the area enclosed by the ROC curve and the coordinate axis, and AP is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
obtaining the average accuracy of each model from the K classification accuracies produced by the K-fold cross-validation;
the specific content of S4 comprises:
S41, ranking the models by the average accuracy obtained on the sample feature set under the different models and setting model weights β_i accordingly, wherein the ranking of the different models by average accuracy is RF, NN, SVM, LR, KNN, and the weight changes with the position in the ranking;
S42, the total weight of each feature under different models is as follows:
wherein σ_ij represents the weight value of the j-th feature under the i-th model;
S43, ranking the total weights of the features under the sample feature set of each data dimension, and selecting the first half of the ranked features of the sample feature set of each data dimension as the optimal feature subset.
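The evaluation metrics recited in claim 1 (TPR, TNR, ACC, precision, and the average accuracy over the K folds) can be illustrated with a short Python sketch. It is only an illustration under assumptions: patients are taken as the positive class, all denominators are assumed to be non-zero, and the function names and the confusion-matrix counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics of claim 1 from confusion-matrix counts.

    Assumption: patients are the positive class and no denominator is zero.
    """
    return {
        "TPR": tp / (tp + fn),                    # sensitivity
        "TNR": tn / (fp + tn),                    # specificity
        "ACC": (tp + tn) / (tp + fp + fn + tn),   # classification accuracy
        "Precision": tp / (tp + fp),
    }


def mean_accuracy(fold_counts):
    """Average ACC over K folds; fold_counts holds one (tp, fp, fn, tn) per fold."""
    accuracies = [classification_metrics(*counts)["ACC"] for counts in fold_counts]
    return sum(accuracies) / len(accuracies)


# Hypothetical counts for a single fold and for a 2-fold run.
single_fold = classification_metrics(tp=40, fp=5, fn=10, tn=45)
average_acc = mean_accuracy([(40, 5, 10, 45), (45, 5, 5, 45)])
```

In this hypothetical fold, 40 of 50 patients are detected (TPR 0.8) and 45 of 50 healthy individuals are cleared (TNR 0.9); the per-model average accuracy used for ranking in S41 is the mean of such per-fold ACC values.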
2. The multi-model feature selection method based on feature contribution according to claim 1, wherein S1 further comprises data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patient and the healthy individual, respectively, wherein the data dimension feature sets comprise the blood routine test data set, the biochemical test data set, and a data set combining the blood routine test and the biochemical test.
3. A multi-model feature selection method based on feature contribution as recited in claim 1, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
4. A multi-model feature selection system based on feature contribution, characterized in that it is based on the multi-model feature selection method based on feature contribution according to any one of claims 1-3 and comprises: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module and an optimal feature subset determination module;
the data acquisition module is used for respectively extracting a blood routine test data set and a biochemical test data set of a patient and a healthy individual and respectively acquiring a sample feature set of each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, respectively, wherein K is any constant greater than 1; the plurality of machine learning classifier models are trained on the sample feature set of each data dimension, embedded feature selection is performed to obtain the average accuracy of each model, and the feature importance of the sample feature set of each data dimension is output;
the feature weight assignment module is used for assigning a feature weight value according to the output feature importance;
the feature total weight calculation module is used for sorting according to the average accuracy of the machine learning classifier models, assigning corresponding weights to the models, combining the model weights with feature weight values to construct a formula, and calculating the total weight of each feature under different models;
the optimal feature subset acquisition module is used for sorting and selecting an optimal feature subset according to the total weight result of each feature under different models;
and the optimal feature subset determination module is used for training on the optimal feature subset with the machine learning model having the highest average accuracy, comparing the outcome with the result of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is superior to or equal to the effect of the full feature set.
5. The multi-model feature selection system based on feature contribution of claim 4, further comprising a data preprocessing module for selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patient and the healthy individual, respectively, the data dimension feature sets including the blood routine test data set, the biochemical test data set, and a data set combining the blood routine test and the biochemical test.
6. A multi-model feature selection system based on feature contribution as recited in claim 4, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
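The K-fold partitioning recited in claims 1 and 4 can be sketched as follows. This is a minimal pure-Python illustration, not the patented implementation: the function name `k_fold_splits`, the shuffling seed, and the restriction of K to an integer between 2 and the number of samples are assumptions made here so that the split is well defined.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation.

    Assumption: k is an integer with 2 <= k <= n_samples; the claims only
    require K to be greater than 1.
    """
    if not isinstance(k, int) or k < 2 or k > n_samples:
        raise ValueError("k must be an integer with 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k near-equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage: 10 samples split into 5 folds.
splits = list(k_fold_splits(10, 5))
```

Each sample appears in exactly one test set, so the K per-fold accuracies that are averaged for each model are computed on disjoint held-out data.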
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211357878.1A CN116226629B (en) | 2022-11-01 | 2022-11-01 | Multi-model feature selection method and system based on feature contribution |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116226629A (en) | 2023-06-06 |
CN116226629B (en) | 2024-03-22 |
Family ID=86583109
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211357878.1A Active CN116226629B (en) | 2022-11-01 | 2022-11-01 | Multi-model feature selection method and system based on feature contribution |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116226629B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||