CN116226629B - Multi-model feature selection method and system based on feature contribution - Google Patents

Multi-model feature selection method and system based on feature contribution

Info

Publication number
CN116226629B
CN116226629B CN202211357878.1A CN202211357878A CN116226629B CN 116226629 B CN116226629 B CN 116226629B CN 202211357878 A CN202211357878 A CN 202211357878A CN 116226629 B CN116226629 B CN 116226629B
Authority
CN
China
Prior art keywords
feature
model
machine learning
sample
models
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211357878.1A
Other languages
Chinese (zh)
Other versions
CN116226629A (en)
Inventor
陈超
宋彪
张瑞环
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia Weishu Data Technology Co ltd
Original Assignee
Inner Mongolia Weishu Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia Weishu Data Technology Co ltd
Priority to CN202211357878.1A
Publication of CN116226629A
Application granted
Publication of CN116226629B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-model feature selection method and system based on feature contribution, comprising the following steps: S1, extracting blood routine test data and biochemical test data, and obtaining a sample feature set for each; S2, obtaining a test set and a training set by k-fold cross-validation; S3, selecting a plurality of machine learning classifier models, training them on the sample sets, performing embedded feature selection, obtaining the average accuracy of each model, outputting the feature importance and assigning weight values; S4, ranking the models by average accuracy and assigning them corresponding weights, combining the model weights with the feature weight values in a formula, calculating the total weight of each feature across the models, and ranking the features to select an optimal feature subset; S5, training the model with the highest average accuracy on the optimal feature subset, comparing the result with the result of S4, and determining the effect of the selected optimal feature subset. Training complexity is greatly reduced while prediction accuracy is maintained, and the optimal feature subset is found more efficiently in a high-dimensional data set.

Description

Multi-model feature selection method and system based on feature contribution
Technical Field
The invention relates to the technical fields of inspection medicine and disease screening, in particular to a multi-model feature selection method and system based on feature contribution.
Background
With the development of artificial intelligence, machine learning is increasingly applied in the medical field, mainly for disease screening and auxiliary diagnosis models. Feature engineering must be carried out before a model is built and is an indispensable part of machine learning: it applies a series of methods to select relatively good features from the raw data so as to improve the training effect of the model. Feature selection is an important step in feature engineering. Its goal is to find an optimal feature subset by removing features with a low contribution, which reduces the number of features, improves model accuracy and shortens running time.
In 2022, Wiharto et al. combined a genetic algorithm and a support vector machine for feature selection with a deep neural network to diagnose coronary heart disease, reaching an accuracy of 87.7%. In 2021, Dharmalingam et al. used a hybrid multi-objective particle swarm optimization algorithm with a local tabu search algorithm for pulmonary disease classification, reaching an accuracy of 90.588%. In 2020, Shifa et al. extracted 902 behavioral cues, including voice prosody, eye movement and head pose, from speech behaviors in a data set, then aggregated and selected the optimal features using 38 feature selection algorithms of different categories to explain a depression detection model; the results indicated that voice behavior features were the most significant features for the model. In 2017, Li et al. reviewed feature selection research from a data perspective, classified feature selection algorithms for conventional data into similarity-based, information-theory-based, sparse-learning-based and statistics-based methods, and explained how to evaluate feature selection algorithms. Taken together, these studies show that feature selection can be applied to disease screening.
However, in the method proposed by Wiharto et al., the genetic algorithm converges slowly, has poor local search capability and involves many control variables, while the support vector machine is difficult to apply to large-scale training samples and is time-consuming, so combining the two does not improve overall speed. The method proposed by Dharmalingam et al. is effective only within a local range and converges slowly in the later iterations. The feature dimensions used by Shifa et al. are difficult to obtain outside an institution or hospital dedicated to the study of depression. Li et al. only classified and evaluated feature selection methods from a data perspective.
Therefore, how to provide a multi-model feature selection method and system based on feature contribution is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-model feature selection method and system based on feature contribution. Models are built from routine test data and features are selected according to their contribution, so that a better effect is achieved with fewer features. Routine test data is also relatively easy to obtain, which makes the overall implementation convenient and fast.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A multi-model feature selection method based on feature contribution, comprising the following steps:
S1, extracting a blood routine test data set and a biochemical test data set for patients and for healthy individuals, and obtaining a sample feature set for each data dimension;
S2, dividing the sample feature set of each data dimension into K parts by K-fold cross-validation to obtain a test set and a training set, where K is an arbitrary constant greater than 1;
S3, selecting a plurality of machine learning classifier models, training them on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance for the sample feature set of each data dimension, and assigning weight values;
S4, ranking the machine learning classifier models by average accuracy and assigning a corresponding weight to each model, combining the model weights with the feature weight values in a formula, calculating the total weight of each feature across the models, and ranking the features by the result to select an optimal feature subset;
S5, training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the result with the result of S4, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set.
Preferably, S1 further includes data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, where the data dimension feature sets comprise the blood routine data set, the biochemical data set, and the combined blood routine and biochemical data set.
Preferably, the specific content of S3 includes:
S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR and classification accuracy ACC of each classifier, and drawing PR and ROC curves to obtain the AP and AUC values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, using feature importance to explain the contribution of each feature during the modeling of each machine learning classifier model, and assigning a weight σ_j according to the ranking of feature contributions, where j is the feature index.
Preferably, in S31, the TPR, TNR, and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
where TP is the number of patients that the classifier correctly identifies as patients, FP is the number of healthy individuals that the classifier misidentifies as patients, FN is the number of patients that the classifier misidentifies as healthy individuals, and TN is the number of healthy individuals that the classifier correctly identifies as healthy individuals;
each classifier model is trained with k-fold cross-validation to obtain TPR, TNR and ACC, where the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
in S32, the ROC curve is drawn with TPR as the ordinate and FPR as the abscissa, where FPR=FP/(FP+TN)=1-TNR, and the PR curve is drawn with TPR (recall) as the abscissa and precision as the ordinate; the AUC value is the area enclosed by the ROC curve and the coordinate axis, and AP is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
the average accuracy of each model is obtained from the K classification accuracies produced by the K rounds of K-fold cross-validation.
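The formulas above can be illustrated with a minimal Python sketch. It assumes binary labels in which 1 marks a patient (the positive class) and 0 a healthy individual; the function names and the zero-denominator guard are illustrative assumptions and do not come from the patent.

```python
from statistics import mean


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN; `positive` marks the patient class (assumption)."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def fold_metrics(y_true, y_pred, positive=1):
    """Return TPR, TNR, ACC and Precision for one fold."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)

    def ratio(num, den):
        # Guard against empty denominators (illustrative choice).
        return num / den if den else 0.0

    return {
        "TPR": ratio(tp, tp + fn),
        "TNR": ratio(tn, fp + tn),
        "ACC": ratio(tp + tn, tp + fp + fn + tn),
        "Precision": ratio(tp, tp + fp),
    }


def average_accuracy(folds):
    """folds: list of (y_true, y_pred) pairs, one pair per cross-validation fold."""
    return mean(fold_metrics(t, p)["ACC"] for t, p in folds)
```

Calling `average_accuracy` with the K per-fold prediction lists of one classifier gives the per-model average accuracy that S4 later uses for ranking.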
Preferably, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
Preferably, the specific content of S4 includes:
S41, ranking the models by the average accuracy obtained on the sample feature set and setting a model weight β_i for each model; the order of the models by average accuracy is RF, NN, SVM, LR, KNN, and the weight changes with the position in this order;
S42, the total weight of each feature across the models is given by the following formula:
[formula not reproduced in the source text]
where σ_ij is the weight value of the j-th feature under the i-th model;
S43, ranking the features by total weight within the sample feature set of each data dimension, and selecting the first half of the ranked features of each sample feature set as the optimal feature subset.
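The S42 formula is not reproduced in this text, so the following Python sketch is only one plausible reading of S41 to S43. It assumes the total weight of feature j is the weighted sum of β_i times σ_ij over the models i; that assumption, the function names, the tie-breaking rule and the rounding down for odd feature counts are all illustrative and not taken from the patent.

```python
def total_feature_weights(model_weights, feature_weights):
    """Assumed total weight: sum over models i of beta_i * sigma_ij.

    model_weights:   {model name: beta_i}
    feature_weights: {model name: {feature name: sigma_ij}}
    """
    totals = {}
    for model, beta in model_weights.items():
        for feature, sigma in feature_weights[model].items():
            totals[feature] = totals.get(feature, 0.0) + beta * sigma
    return totals


def select_top_half(totals):
    """Rank features by total weight (descending) and keep the first half.

    Ties are broken by feature name and odd counts are rounded down;
    both choices are assumptions.
    """
    ranked = sorted(totals, key=lambda f: (-totals[f], f))
    return ranked[: len(ranked) // 2]
```

For example, with hypothetical model weights `{"RF": 2, "NN": 1}` and hypothetical feature weights `{"RF": {"WBC": 4, "HGB": 3, "ALT": 2, "PLT": 1}, "NN": {"WBC": 3, "HGB": 4, "ALT": 1, "PLT": 2}}`, the totals are 11, 10, 5 and 4, and the selected subset is `["WBC", "HGB"]`.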
A multi-model feature selection system based on feature contribution, comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module and an optimal feature subset determination module;
the data acquisition module is used for extracting a blood routine test data set and a biochemical test data set for patients and for healthy individuals, and obtaining a sample feature set for each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts by K-fold cross-validation, taking one part as the test set and the remaining K-1 parts as the training set, where K is an arbitrary constant greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance for the sample feature set of each data dimension;
the feature weight assignment module is used for assigning feature weight values according to the output feature importance;
the feature total weight calculation module is used for ranking the machine learning classifier models by average accuracy, assigning a corresponding weight to each model, combining the model weights with the feature weight values in a formula, and calculating the total weight of each feature across the models;
the optimal feature subset acquisition module is used for ranking the features by their total weight across the models and selecting an optimal feature subset;
the optimal feature subset determination module is used for training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the result with the result of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set.
Preferably, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, which selects data dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, where the data dimension feature sets comprise the blood routine test data set, the biochemical test data set, and the combined blood routine and biochemical test data set.
Preferably, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
Compared with the prior art, the multi-model feature selection method and system based on feature contribution provided by the invention have the following advantages. An abnormal data set contains many correlated features. Machine learning classification models are used to analyze the contribution of each feature and to assign feature weight values; the models are then ranked by accuracy and assigned model weights; the value of each feature across the models is calculated with a formula built from the model weights; the features are ranked by this value and the first half is selected as the optimal feature subset; finally, a machine learning model is trained on the optimal feature subset and its result is compared with the earlier result. This makes full use of feature selection: only features with a high contribution are used as features of the machine learning model. Compared with conventional methods, training complexity is greatly reduced while prediction accuracy is maintained, and the optimal feature subset is found more efficiently in a high-dimensional data set.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a multi-model feature selection method based on feature contribution provided by the invention;
FIG. 2 is a schematic diagram of preliminary statistics of feature importance under each model provided in an embodiment of the present invention;
FIG. 3 is a diagram showing the comparison between the training results of all features and the training results of the optimal feature subset of the random forest model according to the embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention discloses a multi-model feature selection method based on feature contribution, as shown in fig. 1, comprising the following steps:
S1, extracting a blood routine test data set and a biochemical test data set for patients and for healthy individuals, and obtaining a sample feature set for each data dimension;
S2, dividing the sample feature set of each data dimension into K parts by K-fold cross-validation to obtain a test set and a training set, where K is an arbitrary constant greater than 1;
S3, selecting a plurality of machine learning classifier models, training them on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance for the sample feature set of each data dimension, and assigning weight values;
S4, ranking the machine learning classifier models by average accuracy and assigning a corresponding weight to each model, combining the model weights with the feature weight values in a formula, calculating the total weight of each feature across the models, and ranking the features by the result to select an optimal feature subset;
S5, training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the result with the result of S4, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set.
In practical applications, k is 5 or 10; one of the k parts is used as the test set and the remaining k-1 parts are used as the training set.
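The split described above can be sketched in plain Python as follows. The shuffling, the fixed seed and the function name `k_fold_splits` are assumptions made for illustration; the patent does not state how samples are assigned to folds.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Each fold serves once as the test set; the other k-1 folds form
    the training set. Shuffling with a fixed seed is an assumption.
    """
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With `k=5` and 10 samples, each iteration yields 8 training indices and 2 test indices, and every sample appears in exactly one test set.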
In this embodiment, the sample feature set includes 22 blood routine features and 14 biochemical features.
In order to further implement the above technical solution, S1 further includes data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, where the data dimension feature sets comprise blood routine data, biochemical data, and combined blood routine and biochemical data.
In this embodiment, the blood routine test data set and the biochemical test data set in S1 are data sets of patients with a single selected disease and of healthy individuals.
In order to further implement the above technical solution, the specific content of S3 includes:
S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification effect on the test set to obtain the sensitivity TPR, specificity TNR and classification accuracy ACC of each classifier, and drawing PR and ROC curves to obtain the AP and AUC values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, using feature importance to explain the contribution of each feature during the modeling of each machine learning classifier model, and assigning a weight σ_j according to the ranking of feature contributions, where j is the feature index, as shown in FIG. 2.
In this embodiment, j = 1, 2, ..., 36.
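S33 assigns each feature a weight based on its position in the contribution ranking, but the text does not give the mapping from rank to weight. The Python sketch below assumes the simplest scheme, in which the most important of n features receives weight n and the least important receives 1; this scheme and the name `rank_weights` are hypothetical.

```python
def rank_weights(importances):
    """Map {feature: importance score} to {feature: rank-based weight}.

    Assumed scheme: with n features, the top-ranked feature gets n and
    the bottom-ranked feature gets 1. Ties are broken by feature name.
    """
    ranked = sorted(importances, key=lambda f: (-importances[f], f))
    n = len(ranked)
    return {feature: n - position for position, feature in enumerate(ranked)}
```

The same helper could be applied to per-model average accuracies to derive position-dependent model weights, again only under the assumed scheme.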
In order to further implement the above technical solution, in S31, the TPR, TNR, and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
where TP is the number of patients that the classifier correctly identifies as patients, FP is the number of healthy individuals that the classifier misidentifies as patients, FN is the number of patients that the classifier misidentifies as healthy individuals, and TN is the number of healthy individuals that the classifier correctly identifies as healthy individuals;
each classifier model is trained with k-fold cross-validation to obtain TPR, TNR and ACC, where the sensitivity TPR is the proportion of correctly identified positive samples among all positive samples, the specificity TNR is the proportion of correctly identified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
in S32, the ROC curve is drawn with TPR as the ordinate and FPR as the abscissa, where FPR=FP/(FP+TN)=1-TNR, and the PR curve is drawn with TPR (recall) as the abscissa and precision as the ordinate; the AUC value is the area enclosed by the ROC curve and the coordinate axis, and AP is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
the average accuracy of each model is obtained from the K classification accuracies produced by the K rounds of K-fold cross-validation.
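As an illustration of how the ROC and PR curves and their areas can be computed, the following Python sketch sweeps a decision threshold over the predicted scores. It assumes labels of 0 and 1 with at least one sample of each class; AUC is computed with the trapezoidal rule, and AP with the common step-wise sum of precision times recall increment, which approximates the area under the PR curve. The function names are illustrative.

```python
def roc_pr_points(y_true, scores):
    """Sweep thresholds from high to low; return (roc, pr) point lists.

    roc holds (FPR, TPR) pairs starting at (0, 0);
    pr holds (recall, precision) pairs.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    roc = [(0.0, 0.0)]
    pr = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        roc.append((fp / neg, tp / pos))
        pr.append((tp / pos, tp / (tp + fp)))
    return roc, pr


def auc_trapezoid(roc):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))


def average_precision(pr):
    """Step-wise area under the PR curve: sum of recall increment times precision."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in pr:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

For labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, this gives an AUC of 0.75 and an AP of 5/6.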
In order to further implement the above technical solution, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
In this embodiment, the parameters used for the five classifiers (random forest, neural network, support vector machine, K nearest neighbor and logistic regression) are as follows:
the random forest algorithm uses the RandomForestClassifier function, with n_estimators set to 100, min_samples_split set to 2 and min_samples_leaf set to 1;
the neural network algorithm uses a Sequential model in Keras with two layers of 512 and 1 units, whose activation functions are relu and sigmoid respectively; the loss function is binary cross-entropy; the optimizer is Adam; epochs is set to 200; batch_size is set to 32;
the support vector machine algorithm uses the SVC function, with kernel set to linear, class_weight set to balanced and probability set to True;
the K nearest neighbor algorithm uses the KNeighborsClassifier function, with n_neighbors set to 5 and other parameters left at their defaults;
the logistic regression algorithm uses the LogisticRegression function, with class_weight set to {0:1,1:2} and other parameters left at their defaults.
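The settings listed above can be collected in one configuration mapping, sketched below in Python. The RF, SVM, KNN and LR entries use the keyword-argument names of the scikit-learn classes with matching names, which is an assumption since the text does not name the library; the `layer_units` and `activations` keys for the neural network are hypothetical and are not Keras arguments.

```python
# Hyperparameters per classifier, as stated in this embodiment.
CLASSIFIER_PARAMS = {
    "RF": {"n_estimators": 100, "min_samples_split": 2, "min_samples_leaf": 1},
    "NN": {
        "layer_units": [512, 1],              # hypothetical key: units per layer
        "activations": ["relu", "sigmoid"],   # hypothetical key: one per layer
        "loss": "binary_crossentropy",
        "optimizer": "adam",
        "epochs": 200,
        "batch_size": 32,
    },
    "SVM": {"kernel": "linear", "class_weight": "balanced", "probability": True},
    "KNN": {"n_neighbors": 5},
    "LR": {"class_weight": {0: 1, 1: 2}},
}
```

Under the scikit-learn assumption, an entry could be unpacked into the matching constructor, for example `SVC(**CLASSIFIER_PARAMS["SVM"])`.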
In order to further implement the above technical solution, the specific content of S4 includes:
S41, ranking the models by the average accuracy obtained on the sample feature set and setting a model weight β_i for each model; the order of the models by average accuracy is RF, NN, SVM, LR, KNN, and the weight changes with the position in this order;
S42, the total weight of each feature across the models is given by the following formula:
[formula not reproduced in the source text]
where σ_ij is the weight value of the j-th feature under the i-th model, i = 1, 2, ..., 5;
S43, ranking the features by total weight within the sample feature set of each data dimension, and selecting the first half of the ranked features of each sample feature set as the optimal feature subset.
In this embodiment, the model with the highest average accuracy is the random forest (RF) model. The result RF of training the random forest model on all features is compared with the result jwrf of training the random forest model on the optimal feature subset, to determine that the effect of the selected optimal feature subset is better than or equal to that of the full feature set, as shown in FIG. 3.
A multi-model feature selection system based on feature contribution, comprising: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module and an optimal feature subset determination module;
the data acquisition module is used for extracting a blood routine test data set and a biochemical test data set for patients and for healthy individuals, and obtaining a sample feature set for each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts by K-fold cross-validation, taking one part as the test set and the remaining K-1 parts as the training set, where K is an arbitrary constant greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance for the sample feature set of each data dimension;
the feature weight assignment module is used for assigning feature weight values according to the output feature importance;
the feature total weight calculation module is used for ranking the machine learning classifier models by average accuracy, assigning a corresponding weight to each model, combining the model weights with the feature weight values in a formula, and calculating the total weight of each feature across the models;
the optimal feature subset acquisition module is used for ranking the features by their total weight across the models and selecting an optimal feature subset;
the optimal feature subset determination module is used for training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the result with the result of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set.
In order to further implement the above technical solution, the multi-model feature selection system based on feature contribution further comprises a data preprocessing module, which selects data dimension feature sets from the blood routine test data sets and biochemical test data sets of the patients and healthy individuals, where the data dimension feature sets comprise the blood routine test data set, the biochemical test data set, and the combined blood routine and biochemical test data set.
In order to further implement the above technical solution, the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
In this specification, the embodiments are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts of the embodiments may be understood by cross-reference. Because the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A method of multi-model feature selection based on feature contribution, comprising the steps of:
S1, extracting a blood routine test data set and a biochemical test data set of patients and of healthy individuals, respectively, and obtaining a sample feature set for each data dimension;
S2, dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, wherein K is an arbitrary constant greater than 1;
S3, selecting a plurality of machine learning classifier models to train on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, outputting the feature importance of the sample feature set of each data dimension, and assigning weight values;
S4, ranking the machine learning classifier models by average accuracy, assigning a corresponding weight to each model, combining the model weights with the feature weight values to construct a formula, calculating the total weight of each feature across the different models, and ranking the results to select an optimal feature subset;
S5, training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the outcome with the result of S4, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set;
the specific content of S3 comprises:
S31, classifying the sample feature set of each data dimension with each of five machine learning classifiers, testing the classification performance on the test set to obtain the sensitivity TPR, specificity TNR, and classification accuracy ACC of each classifier, and plotting PR and ROC curves to obtain the AP and AUC values;
S32, obtaining the average accuracy of the sample feature set of each data dimension under each machine learning classifier model;
S33, using feature importance to interpret how much each feature contributes when each machine learning classifier model is built, and assigning a weight σ_j according to the ranking of feature contributions, where j is the feature index;
TPR, TNR and ACC are specifically:
TPR=TP/(TP+FN)
TNR=TN/(FP+TN)
ACC=(TP+TN)/(TP+FP+FN+TN)
wherein TP represents the number of patients that the classifier correctly identifies as patients, FP represents the number of healthy individuals that the classifier misidentifies as patients, FN represents the number of patients that the classifier misidentifies as healthy individuals, and TN represents the number of healthy individuals that the classifier correctly identifies as healthy individuals;
each classifier model is trained with K-fold cross-validation to obtain TPR, TNR, and ACC, wherein the sensitivity TPR is the proportion of correctly classified positive samples among all positive samples, the specificity TNR is the proportion of correctly classified negative samples among all negative samples, and ACC is the classification accuracy of the classifier;
the ROC curve is drawn with TPR as the ordinate and FPR as the abscissa, and the PR curve is drawn with TPR as the abscissa and precision as the ordinate, wherein the AUC value is the area enclosed by the ROC curve and the coordinate axis, and the AP value is the area enclosed by the PR curve and the X axis;
wherein the precision calculation formula is:
Precision=TP/(TP+FP);
the average accuracy of each model is obtained by averaging the classifier's classification accuracy over the K rounds of the K-fold cross-validation;
the specific content of S4 comprises:
S41, ranking the models by the average accuracy obtained on the sample feature set and setting a model weight β_i for each model, where i is the model index; the order of the models by average accuracy is RF, NN, SVM, LR, KNN, and the weight varies with the position in the ranking;
S42, the total weight of each feature across the different models is calculated by a formula combining the model weights β_i with the feature weights σ_ij, wherein σ_ij represents the weight value of the j-th feature under the i-th model;
S43, ranking the features by total weight within the sample feature set of each data dimension, and selecting the top-ranked half of the features of each sample feature set as the optimal feature subset.
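Steps S41 to S43 can be illustrated with a minimal Python sketch. The claim text does not reproduce the total-weight formula or the exact rank-to-weight mapping, so two assumptions are made here and are not taken from the patent: weights decrease linearly with rank position (the best of n items gets n, the worst gets 1), and the total weight of feature j is the weighted sum W_j = Σ_i β_i · σ_ij. The function names, feature names, and accuracy values are hypothetical.

```python
from typing import Dict, List


def rank_weights(items_best_first: List[str]) -> Dict[str, float]:
    """Map a best-first ranking to weights n, n-1, ..., 1 (assumed scheme)."""
    n = len(items_best_first)
    return {name: float(n - pos) for pos, name in enumerate(items_best_first)}


def total_feature_weights(
    model_accuracy: Dict[str, float],
    feature_rankings: Dict[str, List[str]],
) -> Dict[str, float]:
    """Combine model weights beta_i with per-model feature weights sigma_ij.

    model_accuracy: average K-fold accuracy per model (S32).
    feature_rankings: per model, features ordered by importance, best first (S33).
    """
    # S41: rank models by average accuracy (best first) and assign beta_i.
    models_best_first = sorted(model_accuracy, key=model_accuracy.get, reverse=True)
    beta = rank_weights(models_best_first)

    # S42 (assumed form): W_j = sum over models i of beta_i * sigma_ij.
    totals: Dict[str, float] = {}
    for model, ranking in feature_rankings.items():
        sigma = rank_weights(ranking)
        for feature, weight in sigma.items():
            totals[feature] = totals.get(feature, 0.0) + beta[model] * weight
    return totals


def select_top_half(totals: Dict[str, float]) -> List[str]:
    """S43: keep the top-ranked half of the features (floor for odd counts)."""
    ordered = sorted(totals, key=lambda f: (-totals[f], f))  # ties broken by name
    return ordered[: len(ordered) // 2]


if __name__ == "__main__":
    # Hypothetical inputs: two models and four features.
    accuracy = {"RF": 0.90, "NN": 0.80}
    rankings = {
        "RF": ["f1", "f2", "f3", "f4"],
        "NN": ["f2", "f1", "f4", "f3"],
    }
    totals = total_feature_weights(accuracy, rankings)
    print(totals)
    print(select_top_half(totals))
```

With these inputs, RF receives β = 2 and NN receives β = 1, so f1 scores 2·4 + 1·3 = 11 and f2 scores 2·3 + 1·4 = 10, and the selected subset is f1 and f2. A different rank-to-weight mapping would change the totals but not the overall procedure.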
2. The multi-model feature selection method based on feature contribution according to claim 1, wherein S1 further comprises data preprocessing, specifically: selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patients and the healthy individuals, the data dimension feature sets comprising the blood routine test data set, the biochemical test data set, and a data set combining the blood routine test and the biochemical test.
3. A multi-model feature selection method based on feature contribution as recited in claim 1, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
4. A feature contribution based multi-model feature selection system, characterized in that it implements the feature contribution based multi-model feature selection method according to any one of claims 1-3 and comprises: a data acquisition module, a plurality of machine learning classifier models, a feature weight assignment module, a feature total weight calculation module, an optimal feature subset acquisition module, and an optimal feature subset determination module;
the data acquisition module is used for respectively extracting a blood routine test data set and a biochemical test data set of a patient and a healthy individual and respectively acquiring a sample feature set of each data dimension;
the plurality of machine learning classifier models are used for dividing the sample feature set of each data dimension into K parts based on a K-fold cross-validation method to obtain a test set and a training set, wherein K is an arbitrary constant greater than 1; and for training on the sample feature set of each data dimension, performing embedded feature selection to obtain the average accuracy of each model, and outputting the feature importance of the sample feature set of each data dimension;
the feature weight assignment module is used for assigning a feature weight value according to the output feature importance;
the feature total weight calculation module is used for sorting according to the average accuracy of the machine learning classifier models, assigning corresponding weights to the models, combining the model weights with feature weight values to construct a formula, and calculating the total weight of each feature under different models;
the optimal feature subset acquisition module is used for sorting and selecting an optimal feature subset according to the total weight result of each feature under different models;
and the optimal feature subset determination module is used for training the machine learning model with the highest average accuracy on the optimal feature subset, comparing the outcome with the result of the feature total weight calculation module, and determining that the effect of the selected optimal feature subset is better than or equal to that of the full feature set.
5. The multi-model feature selection system based on feature contribution of claim 4, further comprising a data preprocessing module for selecting data dimension feature sets from the blood routine test data set and the biochemical test data set of the patients and the healthy individuals, the data dimension feature sets comprising the blood routine test data set, the biochemical test data set, and a data set combining the blood routine test and the biochemical test.
6. A multi-model feature selection system based on feature contribution as recited in claim 4, wherein the plurality of machine learning classifier models includes a random forest RF model, a neural network NN model, a support vector machine SVM model, a K nearest neighbor KNN model, and a logistic regression LR model.
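The evaluation quantities recited in claim 1 (TPR, TNR, ACC, Precision, and the average accuracy over the K folds) can be computed directly from confusion-matrix counts. The sketch below is illustrative only: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than something the claims specify.

```python
from typing import Dict, List, Tuple


def _ratio(numerator: int, denominator: int) -> float:
    """Safe division; returns 0.0 for an empty denominator (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Metrics from claim 1, with patients as the positive class."""
    return {
        "TPR": _ratio(tp, tp + fn),                  # sensitivity
        "TNR": _ratio(tn, fp + tn),                  # specificity
        "ACC": _ratio(tp + tn, tp + fp + fn + tn),   # classification accuracy
        "Precision": _ratio(tp, tp + fp),
    }


def average_accuracy(folds: List[Tuple[int, int, int, int]]) -> float:
    """Mean ACC over the K test folds; each fold is a (tp, fp, fn, tn) tuple."""
    accuracies = [confusion_metrics(*fold)["ACC"] for fold in folds]
    return sum(accuracies) / len(accuracies)


if __name__ == "__main__":
    # Hypothetical counts for two folds of 100 test samples each.
    folds = [(45, 5, 15, 35), (40, 10, 10, 40)]
    print(confusion_metrics(*folds[0]))
    print(average_accuracy(folds))
```

For the first hypothetical fold, TPR = 45/60, TNR = 35/40, ACC = 80/100, and Precision = 45/50. The AUC and AP values additionally require the classifier's scores at multiple thresholds and are not covered by this sketch.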
CN202211357878.1A 2022-11-01 2022-11-01 Multi-model feature selection method and system based on feature contribution Active CN116226629B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211357878.1A CN116226629B (en) 2022-11-01 2022-11-01 Multi-model feature selection method and system based on feature contribution

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211357878.1A CN116226629B (en) 2022-11-01 2022-11-01 Multi-model feature selection method and system based on feature contribution

Publications (2)

Publication Number Publication Date
CN116226629A CN116226629A (en) 2023-06-06
CN116226629B true CN116226629B (en) 2024-03-22

Family

ID=86583109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211357878.1A Active CN116226629B (en) 2022-11-01 2022-11-01 Multi-model feature selection method and system based on feature contribution

Country Status (1)

Country Link
CN (1) CN116226629B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105184316A (en) * 2015-08-28 2015-12-23 国网智能电网研究院 Support vector machine power grid business classification method based on feature weight learning
CN110177112A (en) * 2019-06-05 2019-08-27 华东理工大学 The network inbreak detection method deviated based on dibaryon spatial sampling and confidence
CN110379521A (en) * 2019-06-24 2019-10-25 南京理工大学 Medical data collection feature selection approach based on information theory
CN112381787A (en) * 2020-11-12 2021-02-19 福州大学 Steel plate surface defect classification method based on transfer learning
CN113535694A (en) * 2021-06-18 2021-10-22 北方民族大学 Stacking frame-based feature selection method
CN114724715A (en) * 2022-04-12 2022-07-08 南京邮电大学 Multi-machine learning model feature selection method based on optimal AUC

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220293266A1 (en) * 2021-03-09 2022-09-15 The Hong Kong Polytechnic University System and method for detection of impairment in cognitive function

Also Published As

Publication number Publication date
CN116226629A (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant