CN109545372B

CN109545372B - Patient physiological data feature selection method based on greedy-of-distance strategy

Info

Publication number: CN109545372B
Application number: CN201811313953.8A
Authority: CN
Inventors: 钮焱; 李军; 童坤; 刘宇强; 李星
Original assignee: Hubei University of Technology
Current assignee: Hubei University of Technology
Priority date: 2018-11-06
Filing date: 2018-11-06
Publication date: 2021-07-06
Anticipated expiration: 2038-11-06
Also published as: CN109545372A

Abstract

The invention discloses a patient physiological data feature selection method based on a distance greedy strategy, which addresses the low performance of existing feature selection algorithms.

Description

Patient physiological data feature selection method based on greedy-of-distance strategy

Technical Field

The invention belongs to the technical field of medical treatment, relates to a patient physiological data feature selection method, and particularly relates to a gray wolf feature selection method based on a distance greedy strategy.

Background

Science and technology are developing rapidly, and medical detection systems are continuously updated and increasingly mature. Heart disease is a major threat to human health, so detecting it before onset is of great significance. The physiological data of a patient contain a large number of features, many of them redundant; the redundant features greatly increase the workload of heart disease detection and degrade its results. The gray wolf optimization algorithm (GWO) is a swarm intelligence algorithm currently in use. It determines the position of the prey, that is, the optimal solution of the optimization problem, by simulating the way a wolf pack hunts, and it is widely used for feature selection, but the algorithm itself converges slowly and searches inefficiently. The invention provides an improved gray wolf algorithm for feature selection that replaces the position updating part of the general gray wolf algorithm with a greedy strategy. This improves the efficiency of the search for the optimum, so that a better feature set can be extracted, which facilitates the detection of samples.

The purpose of feature selection is to extract the important features from the data and remove the redundant ones. Feature selection can reduce data dimensionality, improve prediction performance, reduce overfitting, and improve the understanding of features and feature values. Real-world data to be classified often contain many redundant features, meaning that some features can be replaced by others and can therefore be removed during classification. Furthermore, the relationships between features strongly influence the classification output, and finding these relationships reveals a large amount of information hidden in the data.

All feature selection algorithms can be classified into three categories: filter, embedded and wrapper methods. A filter method first selects features from the data set and then trains a classifier, so feature selection is decoupled from the classifier. The key is to find a measure of feature importance, such as the Pearson correlation coefficient or mutual information. The features are then sorted by this measure, and the top-ranked features are selected as the features used for classification. The disadvantage of this method is that it ignores the interdependence between features. On the one hand, if some top-ranked features are strongly correlated with one another, selecting them introduces redundancy. On the other hand, a lower-ranked feature whose measure is small and whose value is not obvious on its own may predict well when combined with other features, so valuable features are lost. An embedded method integrates feature selection into the training of the learner, and the two are completed in a single process, as in lasso and ridge regression. The core idea of a wrapper method is that, given a training model and a method for evaluating prediction performance, the prediction performance of different feature subsets of the feature space is evaluated, and the subset with the best prediction performance is selected as the final training subset. Because a wrapper method takes the interdependence between features into account, the feature subset it selects predicts better than that of a filter method, but the amount of computation is large because the number of feature subsets grows exponentially. Different algorithms have therefore been developed to search the whole feature space efficiently.
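The weakness of the filter method described above can be made concrete with a small sketch. The function names and the toy data are illustrative assumptions, not part of the patent; the example ranks features by the absolute Pearson correlation with the label and keeps the top two.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def filter_rank(samples, labels, top_k):
    """Rank features by |correlation with the label| and keep top_k indices."""
    scores = []
    for j in range(len(samples[0])):
        column = [row[j] for row in samples]
        scores.append((abs(pearson(column, labels)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # best score first
    return [j for _, j in scores[:top_k]]

# Toy data (hypothetical): feature 2 is almost a copy of feature 0.
samples = [[1.0, 5.0, 1.1], [2.0, 3.0, 2.1], [3.0, 6.0, 2.9], [4.0, 2.0, 4.2]]
labels = [0, 0, 1, 1]
selected = filter_rank(samples, labels, top_k=2)
```

In this toy case the ranking keeps features 0 and 2, which carry nearly the same information, so the selected set is redundant. This is the behavior that motivates wrapper methods, which score whole subsets instead of single features.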

The genetic algorithm was the first intelligent algorithm used to solve this problem. Its idea is derived from the reproductive and genetic processes of natural biological populations: a solution of the optimization problem is regarded as a gene, and genetic exchange, including crossover and mutation, is carried out across the whole population. The natural environment can be regarded as the objective function, and genes that are well adapted to the environment are retained and passed on to the next generation. Genetic algorithms can solve complex nonlinear optimization problems. However, they have disadvantages such as low computational efficiency and a tendency to fall into local optimal solutions.

The concept of Particle Swarm Optimization (PSO) stems from the study of the foraging behavior of flocks of birds. Each potential solution of the optimization problem can be thought of as a point in a d-dimensional search space, called a "particle". Every particle has an adaptive (fitness) value determined by the objective function and a velocity that determines the direction and distance of its flight, and the particles follow the current optimal particle to search the solution space. Compared with traditional multi-objective optimization methods, particle swarm optimization has great advantages in solving multi-objective problems. However, it has disadvantages such as low precision and a tendency to diverge.

Disclosure of Invention

The invention aims to solve the problems that existing patient physiological data feature selection algorithms converge slowly, search inefficiently and easily fall into local optimal solutions, and provides a gray wolf feature selection method based on a distance greedy strategy that improves classification accuracy and reduces data feature redundancy.

The technical scheme adopted by the invention is as follows: a patient physiological data characteristic selection method based on a greedy-of-distance strategy is characterized by comprising the following steps of:

step 1: inputting data captured from physiological data of a patient, and forming the labeled sample data into a training set; wherein the label indicates the disease state represented by the physiological data of the patient, and the disease state is either diseased or not diseased;

step 2: for the captured data, selecting the physiological data features of the patient with a gray wolf feature selection method based on a distance greedy strategy;

step 2.1: initializing the current iteration number, the index of the wolf individual, the population size of the wolf pack and the position vector of each wolf individual; the position vector of each wolf individual represents a candidate solution of the feature selection problem;

step 2.2: calculating the coding vector of each wolf from its position vector, and calculating the adaptive value of each wolf from its coding vector;

step 2.3: setting the maximum iteration number maxiter, and selecting the three wolves with the best adaptive values as alpha, beta and delta;

step 2.4: calculating the distance mapping of each wolf;

step 2.5: updating the coding vectors of alpha, beta and delta according to the distance mapping of each wolf;

step 2.6: judging whether t is larger than maxiter;

if yes, executing the following step 3;

if not, setting t = t + 1 and returning to step 2.4;

step 3: outputting the feature subset corresponding to the coding vector of alpha.

The invention addresses the low performance of existing feature selection algorithms. It uses a greedy strategy to improve the position updating part of the original gray wolf algorithm, which strengthens the ability of the algorithm to exploit the optimal solution and increases the convergence speed, and thereby effectively improves classification accuracy and reduces data feature redundancy.

Drawings

FIG. 1 is a schematic flow chart of an embodiment of the present invention;

FIG. 2 is a graph comparing the detection error rate of the present invention with three other feature selection algorithms;

FIG. 3 is a comparison graph of feature selection numbers after feature selection in the present invention versus three other feature selection algorithms.

Detailed Description

To help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the embodiments described herein merely illustrate and explain the present invention and do not restrict it.

The core of the technology of the invention is to treat the feature selection problem for medical data containing N features as a discrete combinatorial optimization problem in an N-dimensional binary space. Each feature subset can be represented by an N-dimensional binary vector, and the improved gray wolf optimization algorithm is used to search the N-dimensional binary space.

Referring to fig. 1, the method for selecting physiological data characteristics of a patient based on a greedy-of-distance strategy provided by the invention comprises the following steps:

Step 1: inputting data captured from physiological data of a patient, and forming the labeled sample data into a training set; the label indicates the disease state in the physiological data of the patient, which is either diseased or not diseased: 0 represents normal, and 1 to 4 represent the degree of vasoconstriction;

In this embodiment, Z pieces of captured medical data containing N features are input. Each piece of data in the data set is a sample, so the sample size is Z. Each input sample is represented by a feature vector in which each dimension represents one feature of the data, and all samples with category labels constitute the training sample set T.

Step 2.1: initializing the current iteration number, the index of the wolf individual, the population size of the wolf pack and the position vector of each wolf individual;

(1) initialize the current iteration number t = 1, the wolf index i = 1, and the population size of the wolf pack as K;

(2) for the wolf individuals i = 1, 2, …, K, randomly initialize the position vector of each wolf in the wolf pack within (0, max); the vector dimension is N, where max represents the maximum value of the position of a wolf individual and is taken as 1;
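The initialization in step 2.1 can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name, the seed parameter and the use of Python's `random` module are assumptions, while the population size of 12 is the value reported in the experiment section.

```python
import random

def init_positions(pop_size, n_features, max_pos=1.0, seed=None):
    """Draw K position vectors of dimension N inside the open interval (0, max)."""
    rng = random.Random(seed)

    def draw():
        value = rng.random() * max_pos
        while value == 0.0:  # random() can return 0.0; redraw to keep the interval open
            value = rng.random() * max_pos
        return value

    return [[draw() for _ in range(n_features)] for _ in range(pop_size)]

# K = 12 wolves; N = 13 is an assumed feature count for illustration.
wolves = init_positions(pop_size=12, n_features=13, seed=0)
```

Each row of `wolves` is one candidate solution in continuous form; it becomes a feature subset only after it is mapped to a binary coding vector.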

Step 2.2: calculating the coding vector of each wolf from its position vector, and calculating the adaptive value of each wolf from its coding vector;

(1) find a mapping function f that maps values in the interval (0, max) into the discrete set {0, 1} and guarantees that there is a number δ in (0, max) such that f(temp1) < f(temp2) for all temp1 ∈ (0, δ) and temp2 ∈ [δ, max), so that the continuous position vector of each wolf can be converted into a binary coding vector containing only 0 and 1.

The following function is selected as the mapping function in this embodiment:

where position(i, j) denotes the value of the j-th dimension of the position vector of wolf i, and the function value gives the j-th dimension of the corresponding coding vector. Using this function, the position of each gray wolf is converted from continuous values into binary coded values of 0 and 1, which can then be used in the feature selection algorithm.
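Any step function with a threshold δ inside (0, max) satisfies the stated condition f(temp1) < f(temp2). The sketch below assumes δ = 0.5, which is a hypothetical choice; the function actually used in the embodiment is not reproduced in this text.

```python
def to_code(position, threshold=0.5):
    """Map a continuous position vector to a binary coding vector.

    Values below the threshold map to 0 and values at or above it map to 1,
    so f(temp1) < f(temp2) holds for temp1 in (0, delta) and temp2 in [delta, max).
    The threshold of 0.5 is an assumption for illustration.
    """
    return [1 if value >= threshold else 0 for value in position]

code = to_code([0.12, 0.50, 0.93, 0.49])
```

Here `code` marks the second and third features as selected.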

(2) in the coding vector of a wolf, 1 indicates that the feature is selected and 0 indicates that it is not selected. The features of the training set T that are selected in the coding vector are retained and the unselected features are deleted, giving a new training set T_solution.

(3) the average precision (or classification error rate) obtained when a classifier classifies T_solution is calculated and used as the adaptive value P_i corresponding to the coding vector of the wolf. Different classifiers, such as an SVM (support vector machine) or an artificial neural network, can be chosen according to the actual situation; this embodiment uses a KNN classifier, and K in KNN takes a value of 5;
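The adaptive value computation can be sketched as a small wrapper around a KNN classifier. This is a hedged illustration: the function names, the held-out evaluation split, the squared Euclidean distance and the handling of an empty feature subset are all assumptions, since the text only states that the classification error rate of a KNN classifier with K = 5 is used.

```python
from collections import Counter

def apply_mask(rows, code):
    """Keep only the columns whose bit in the coding vector is 1 (T_solution)."""
    return [[v for v, bit in zip(row, code) if bit == 1] for row in rows]

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), idx)
        for idx, row in enumerate(train_x)
    )
    votes = Counter(train_y[idx] for _, idx in ranked[:k])
    return votes.most_common(1)[0][0]

def error_rate(code, train_x, train_y, test_x, test_y, k=5):
    """Adaptive value P_i of one coding vector: lower is better."""
    if sum(code) == 0:
        return 1.0  # assumption: an empty feature subset gets the worst score
    t_solution = apply_mask(train_x, code)
    queries = apply_mask(test_x, code)
    wrong = sum(
        knn_predict(t_solution, train_y, q, k) != y
        for q, y in zip(queries, test_y)
    )
    return wrong / len(test_y)

# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
train_x = [[0.0, 9.0], [0.0, 1.0], [1.0, 8.0], [1.0, 2.0]]
train_y = [0, 0, 1, 1]
test_x = [[0.1, 2.0], [0.9, 8.6]]
test_y = [0, 1]
informative = error_rate([1, 0], train_x, train_y, test_x, test_y, k=1)
noisy = error_rate([0, 1], train_x, train_y, test_x, test_y, k=1)
```

The toy call uses k = 1 only because the example has four training samples; with the heart disease data the default of 5 matches the embodiment.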

Step 2.3: setting the maximum iteration number maxiter, and selecting the three wolves with the best adaptive values as α, β and δ;

The maximum iteration number maxiter is set, and the coding vector of the wolf with the best adaptive value P_i is taken as the coding vector of α. Which adaptive value counts as best is relative and depends on the meaning of the chosen adaptive value function. The invention uses the classification error rate as the adaptive value of a wolf: the lower the classification error rate, the better the classification effect and the better the wolf. The initialization of α, β and δ is therefore divided into the following three substeps:

(1) select the wolf j with the lowest adaptive value P_i, and initialize the coding vector of α to the coding vector of wolf j;

(2) after removing j, select the wolf n with the lowest adaptive value P_i among the remaining wolves, and initialize the coding vector of β to the coding vector of wolf n;

(3) after removing n, select the wolf m with the lowest adaptive value P_i among the remaining wolves, and initialize the coding vector of δ to the coding vector of wolf m;
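The three substeps amount to taking the three wolves with the lowest error rates. A minimal sketch, with hypothetical function and variable names:

```python
def select_leaders(codes, adaptive_values):
    """Return copies of the coding vectors of the three wolves with the lowest
    adaptive values (classification error rates), as (alpha, beta, delta)."""
    order = sorted(range(len(adaptive_values)), key=lambda i: adaptive_values[i])
    j, n, m = order[:3]
    return list(codes[j]), list(codes[n]), list(codes[m])

codes = [[1, 0], [0, 1], [1, 1], [0, 0]]
adaptive_values = [0.3, 0.1, 0.2, 0.4]
alpha, beta, delta = select_leaders(codes, adaptive_values)
```

Sorting once is equivalent to the repeated "remove the best and select again" procedure in the text, because removing a wolf does not change the adaptive values of the others.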

Step 2.4: calculating the distance mapping of each wolf;

This step is the core and the innovation of the invention. It addresses the defects of the existing gray wolf optimization algorithm by improving the ability of the algorithm to exploit the optimal solution and increasing the convergence speed, which effectively improves classification accuracy and reduces data feature redundancy.

In this embodiment, a greedy strategy is used to calculate the distance mapping of each wolf; the specific implementation comprises the following substeps:

step 2.4.1: computing the continuous-coded distance vector from the selected α, β and δ;

where the coefficient vectors are three different random vectors of a parameter calculated in step 2.4.2, and the three distance terms, which represent the distances from the individual to α, β and δ, are defined as follows:

where the coefficient vectors are three different random vectors of a parameter calculated in step 2.4.2; the position vectors are those of α, β and δ in the t-th iteration; and the intermediate parameter represents the final position of each wolf moving toward α, β and δ at the t-th iteration. It is defined as follows:
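For orientation only, the sketch below shows the update of the classic continuous gray wolf optimizer, in which a wolf measures its distance to each leader and averages the three leader-guided candidate positions. It is an assumption that the step above has this general structure; the symbols, function names and the averaging rule are those of standard GWO, not formulas recovered from this document, and the greedy distance variant of the invention may differ.

```python
def distance_to_leader(c_vec, leader_pos, wolf_pos):
    """Standard GWO distance: |C * X_leader - X|, per dimension."""
    return [abs(c * l - x) for c, l, x in zip(c_vec, leader_pos, wolf_pos)]

def candidate_position(leader_pos, a_vec, dist):
    """Standard GWO candidate: X_leader - A * D, per dimension."""
    return [l - a * d for l, a, d in zip(leader_pos, a_vec, dist)]

def gwo_step(wolf_pos, leaders, a_vecs, c_vecs):
    """Average of the candidates guided by alpha, beta and delta."""
    candidates = []
    for leader, a_vec, c_vec in zip(leaders, a_vecs, c_vecs):
        dist = distance_to_leader(c_vec, leader, wolf_pos)
        candidates.append(candidate_position(leader, a_vec, dist))
    return [sum(vals) / len(candidates) for vals in zip(*candidates)]

# With all A coefficients at 0, the wolf lands on the mean of the leaders.
leaders = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
zeros = [[0.0, 0.0]] * 3
ones = [[1.0, 1.0]] * 3
moved = gwo_step([0.2, 0.8], leaders, zeros, ones)
```

The coefficient vectors passed in here are fixed for clarity; in the algorithm they are random and shrink over the iterations, which moves the search from exploration to exploitation.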

step 2.4.2: calculating the coefficient parameters and a, using the following formula:

where the random vectors take values in the range [0, 1]; a is a parameter variable that controls the exploitation and exploration capabilities of the algorithm and decreases linearly from 2 to 0 as the number of iterations increases; t is the current iteration number; and maxiter is the total number of iterations of the algorithm;
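A linearly decreasing control parameter and the two coefficient vectors of standard GWO can be sketched as follows. The linear schedule from 2 to 0 is stated in the text; the expressions for the coefficient vectors (2·a·r1 − a and 2·r2) are the standard GWO definitions and are assumed here, as are all function names.

```python
import random

def control_parameter(t, max_iter):
    """Parameter a: decreases linearly from 2 (t = 0) to 0 (t = max_iter)."""
    return 2.0 - 2.0 * t / max_iter

def coefficient_vectors(a, dim, rng):
    """Standard GWO coefficients (assumed): A = 2*a*r1 - a and C = 2*r2,
    with r1 and r2 drawn uniformly from [0, 1) for every dimension."""
    big_a = [2.0 * a * rng.random() - a for _ in range(dim)]
    big_c = [2.0 * rng.random() for _ in range(dim)]
    return big_a, big_c

rng = random.Random(0)
a = control_parameter(t=3, max_iter=6)
big_a, big_c = coefficient_vectors(a, dim=4, rng=rng)
```

Because every component of A lies in [-a, a], a large a early in the run lets wolves overshoot the leaders, while a small a late in the run keeps them close.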

step 2.4.3: computing the mapped value of each dimension of the distance vector;

where the argument is the value of the n-th dimension of the vector calculated in step 2.4.1, and the result is the value of the n-th dimension of the mapped vector; b represents the maximum value of the assumed problem search interval, and the mapping function takes different forms for different problem search intervals;

step 2.4.4: calculating X^d, change and hold;

where the first quantity is the value of the d-th dimension of the vector obtained in the previous step; X^d denotes the value of the d-th dimension of the binary coding vector of each individual; D^d is the value of the d-th dimension of the continuous coding vector; a random number is drawn from the interval [0, 1]; and hold and change denote the results of the two operations on the current bit, one of which is taken as the value of X^d.
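The selection rule of this step is not legible in the text, so the following is only an illustrative guess at how a hold or change decision could work: "hold" keeps the current bit, "change" flips it, and a bit is changed when a uniform random number falls below a per-dimension value in [0, 1]. The rule, the function name and the meaning given to `step` are all assumptions.

```python
import random

def update_code(code, step, rng):
    """Hypothetical hold/change update of a binary coding vector.

    code: current bits X^d; step: per-dimension values in [0, 1], assumed to
    come from the mapping of step 2.4.3; rng: a random.Random instance.
    """
    new_code = []
    for bit, p in zip(code, step):
        hold = bit          # keep the current value
        change = 1 - bit    # flip the current value
        new_code.append(change if rng.random() < p else hold)
    return new_code

rng = random.Random(0)
kept = update_code([1, 0, 1], [0.0, 0.0, 0.0], rng)
flipped = update_code([1, 0, 1], [1.0, 1.0, 1.0], rng)
```

Under this assumed rule a step value of 0 never changes a bit and a step value of 1 always changes it, so dimensions that are far from the leaders are the ones most likely to move.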

Step 2.5: updating the coding vectors of α, β and δ according to the distance mapping of each wolf;

The adaptive values of the updated wolves are sorted, and the adaptive values P_α', P_β' and P_δ' of the three best wolves are compared with the adaptive values P_α, P_β and P_δ of the original α, β and δ respectively. If a new adaptive value is better than the corresponding original adaptive value, the corresponding coding vector is updated to the coding vector associated with the new adaptive value; otherwise, it is not updated.
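The pairwise comparison in step 2.5 can be sketched as below, assuming that the adaptive value is the classification error rate (lower is better) and that each leader is stored together with its adaptive value. The data layout and the function name are illustrative.

```python
def update_leaders(leaders, candidates):
    """Compare alpha', beta', delta' with alpha, beta, delta position by position.

    leaders, candidates: lists of three (error_rate, code) pairs, best first.
    A leader is replaced only when the new error rate is strictly lower.
    """
    updated = []
    for (old_err, old_code), (new_err, new_code) in zip(leaders, candidates):
        if new_err < old_err:
            updated.append((new_err, list(new_code)))
        else:
            updated.append((old_err, list(old_code)))
    return updated

leaders = [(0.10, [1, 0, 1]), (0.15, [1, 1, 0]), (0.20, [0, 1, 1])]
candidates = [(0.12, [1, 1, 1]), (0.11, [0, 0, 1]), (0.25, [1, 0, 0])]
new_leaders = update_leaders(leaders, candidates)
```

Because the comparison is made position by position, as the text describes, the leaders never get worse from one iteration to the next.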

Step 2.6: judging whether t is larger than maxiter;

if yes, executing the following step 3;

if not, setting t = t + 1 and returning to step 2.4;

Step 3: outputting the feature subset corresponding to the coding vector of α;

The coding vector of α is a binary string representing the optimal feature subset, in which 1 indicates that a feature is selected and 0 indicates that it is not selected. The features corresponding to the dimensions whose value is 1 are extracted and output.
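Decoding the coding vector of α into a feature subset is a simple filter. In the sketch below the function name and the example coding vector are hypothetical, and the feature names are the 13 attribute names of the UCI heart disease data.

```python
def selected_features(alpha_code, feature_names):
    """Return the names of the features whose bit in the coding vector is 1."""
    return [name for name, bit in zip(feature_names, alpha_code) if bit == 1]

feature_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                 "thalach", "exang", "oldpeak", "slope", "ca", "thal"]
alpha_code = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]  # hypothetical result
subset = selected_features(alpha_code, feature_names)
```

For this example coding vector the output subset is cp, thalach and ca.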

The effects of the present invention will be further described below by comparative experiments.

(1) Simulation conditions;

The data set used in the experiment is a set of heart disease data from the UCI database, which is divided equally into two parts, one used as the training set and the other as the test set. All methods in the experiment are implemented in MATLAB.
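The equal split can be sketched as follows. The shuffling, the seed and the function name are assumptions, since the text does not say how the records are assigned to the two halves; with an odd number of records the two parts differ in size by one.

```python
import random

def split_in_half(rows, seed=0):
    """Shuffle a copy of the records and split it into two near-equal halves."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

records = list(range(303))  # stand-in for the 303 heart disease records
train_part, test_part = split_in_half(records)
```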

(2) Experimental content and results;

A group of heart disease data from the UCI database is used as the data set, and the KNN algorithm in MATLAB is used as the classifier for detection. The optimized GWO, the genetic algorithm (GA) and the particle swarm optimization algorithm (PSO) are used as the feature selection algorithms, with KNN as the sample classifier. The sample classification error rate and the final number of selected features are used as comparison indexes, and the average performance of the four different feature selection algorithms is compared under different numbers of runs.

The data set used in the experiment is a heart disease data set provided by the UCI database. It contains 303 records, each of which records the physiological indicators of a cardiac patient. Each record consists of 14 attributes: 13 features and a label. The population size of the wolf pack is set to 12, and the iteration number maxiter of the algorithm is set to 6. KNN is selected as the classifier in the experiment, with K = 5.

The 14 attributes are as follows:

age: the patient's age;
sex: the patient's sex, where 0 denotes female and 1 denotes male;
cp: the patient's chest pain type, one of four types: 1, 2, 3 and 4;
trestbps: the patient's resting blood pressure;
chol: the patient's cholesterol value;
fbs: the patient's fasting blood glucose level;
restecg: the patient's electrocardiogram result, where 0 means normal, 1 means mild and 2 means severe;
thalach: the patient's maximum heart rate;
exang: whether the patient has exercise-induced angina, where 0 indicates absent and 1 indicates present;
oldpeak: the ST depression induced by exercise;
slope: the slope of the ST segment during exercise;
ca: the number of vessels seen by fluoroscopy;
thal: the patient's defect type, one of 3, 6 and 7;
status: the disease status of the patient (the label), where 0 indicates normal and 1 to 4 indicate the degree of vasoconstriction.

The experiments compare the performance indexes of the four feature selection algorithms under different numbers of algorithm runs, which increase from 20 to 200. The abscissa of FIG. 2 and FIG. 3 represents the experiment index: 1 denotes the first experiment, with 20 runs, and 10 denotes the tenth experiment, with 200 runs. Errorb and count indicate the error rate and the number of selected features when the original gray wolf algorithm is used for feature selection, and Errore and count indicate the error rate and the number of selected features when the improved gray wolf algorithm is used for feature selection. As can be seen from FIG. 2, except for the 2nd experiment (40 runs), the classification accuracy of the improved algorithm is better than that of all the other algorithms; its average error rate is below 1.85%, its fluctuation is small, and its results are stable. As can be seen from FIG. 3, the average number of features selected by the improved algorithm is below 3.85 in all ten experiments, lower than that of both PSO and GA. Compared with the improved gray wolf algorithm, the number of selected features is greatly reduced, and the results are stable.

In conclusion, the experiments show that under the same conditions the algorithm achieves better feature selection. In the longitudinal comparison, the detection error rate after feature selection is better than that of the original BGWO, and the number of selected features is lower than that of the improved EBGWO, so the algorithm combines the advantages of both. In the transverse comparison, the algorithm is superior to PSO and GA in both the number of selected features and the detection error rate. The algorithm also converges quickly and achieves good results within few iterations.

It should be understood that parts of the specification not set forth in detail are well within the prior art.

It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A patient physiological data feature selection method based on a greedy-of-distance strategy, characterized by comprising the following steps:

wherein the iteration number is initialized to t = 1 and the population size of the wolf pack is K, and for the wolf individuals i = 1, 2, …, K, the position vector of each wolf in the wolf pack is initialized randomly within (0, max); the vector dimension is N, where max represents the maximum value of the position of a wolf individual;

step 2.4: calculating a distance map of each wolf;

wherein, the distance mapping of each wolf is calculated by utilizing a greedy strategy; the specific implementation comprises the following substeps:

wherein the coefficient vectors are three different random vectors of a parameter calculated in step 2.4.2;

and wherein the coefficient vectors are three different random vectors of a parameter calculated in step 2.4.2; the position vectors are those of α, β and δ in the t-th iteration; and the intermediate parameter represents the final position of each wolf moving toward α, β and δ at the t-th iteration; it is defined as follows:

step 2.4.2: calculating the coefficient parameters and a, using the following formula:

step 2.4.3: computing the mapped value of each dimension of the distance vector;

wherein the argument is the value of the n-th dimension of the vector calculated in step 2.4.1, and the mapping function takes different forms for different problem search intervals;

step 2.4.4: calculating X^d, change and hold;

wherein the first quantity is the value of the d-th dimension of the vector obtained in the previous step, and the value after the operation is taken as the value of X^d;

wherein updating the coding vectors of α, β and δ comprises sorting the adaptive values of the updated wolves, and comparing the adaptive values P_α', P_β' and P_δ' of the three best wolves with the adaptive values P_α, P_β and P_δ of the original α, β and δ respectively; if a new adaptive value is better than the corresponding original adaptive value, the corresponding coding vector is updated to the coding vector associated with the new adaptive value; otherwise, it is not updated;

step 2.6: judging whether t is larger than maxiter;

if yes, executing the following step 3;

if not, setting t = t + 1 and returning to step 2.4;

2. The greedy-of-distance-strategy-based patient physiological data feature selection method as recited in claim 1, wherein: in step 1, for the data captured and labeled in known manner, each piece of data is represented by a feature vector, and each dimension of the vector represents a feature of the data.

3. The greedy-of-distance-strategy-based patient physiological data feature selection method as recited in claim 1, wherein: in step 2.2, a mapping function f is found that maps the values in the interval (0, max) into the discrete set {0, 1} and ensures that there is a number δ in (0, max) such that f(temp1) < f(temp2) for all temp1 ∈ (0, δ) and temp2 ∈ [δ, max), so that the continuous position vector of each wolf becomes a binary coding vector containing only 0 and 1.

4. The greedy-of-distance-strategy-based patient physiological data feature selection method as recited in claim 1, wherein: in step 2.2, the adaptive value of each wolf is calculated from the binary coding vector of the wolf; in the coding vector of each wolf, 1 indicates that the feature is selected and 0 indicates that it is not selected; the training set T restricted to the features selected in the coding vector is denoted T_solution; and the average precision or the classification error rate P_i obtained by classifying T_solution with a classifier is calculated and used as the adaptive value corresponding to the coding vector of the wolf.