CN107845407A

CN107845407A - Human physiological feature selection algorithm based on combination of filtering and improved clustering

Info

Publication number: CN107845407A
Application number: CN201710733507.1A
Authority: CN
Inventors: 陈波; 俞洁; 高秀娥; 郑庆国; 白旭飞
Original assignee: Dalian University
Current assignee: Dalian University
Priority date: 2017-08-24
Filing date: 2017-08-24
Publication date: 2018-03-27

Abstract

The invention discloses a human physiological feature selection algorithm based on the combination of filtering and improved clustering, comprising: S1: selecting an impedance model, and collecting first and second characteristic parameter data to construct an initial feature set and a final optimal subset; S2: introducing a Filter algorithm and calculating an HSIC value for each feature in the collected data; S3: sorting the feature set in descending order of HSIC value; S4: adding the top-K ranked features to the feature set, filtering out parameters irrelevant to body composition with the Filter algorithm, and building an initial data set; S5: constructing a feature sparse graph from the data set according to a clustering algorithm; S6: screening out redundant features within clusters using an improved clustering algorithm. The human physiological feature selection algorithm established in this application can improve the prediction accuracy of human body composition and provides a more effective means of detection for body composition research and clinical application.

Description

Human physiological feature selection algorithm based on combination of filtering and improved clustering

Technical Field

The invention belongs to the field of bioinformatics, and particularly relates to a human physiological characteristic selection algorithm based on combination of filtering and improved clustering.

Background

The equilibrium state of body components plays an important role in maintaining the stability of the environment in the body, and is an important factor influencing the health of the human body. When a disease occurs, the body composition changes often earlier than the clinical symptoms of the disease. Therefore, the change of the body composition of the human body can be used for carrying out correlation prediction on diseases such as hypertension, dyslipidemia, metabolic syndrome and the like. However, there are many relevant parameters affecting body composition, and there are features of high non-linearity, redundancy, irrelevance, etc. among the parameters.

The existing Wrapper algorithm removes redundant features and can achieve good generalization performance, but its high complexity makes it unsuitable for large-scale data sets. The Filter algorithm assigns each feature a weight according to the result of a criterion calculation and is computationally efficient, but it does not fully consider redundancy among features, so the selected feature subset is likely to contain a large amount of redundancy. The clustering method divides the body composition parameter data, object by object, into several groups or clusters so that the objects within a cluster are highly similar, and judges features by the distance between each cluster and a central point; it therefore screens out redundant features effectively but cannot effectively screen out irrelevant features. In view of this, a new dimension-reduction method is needed for processing high-dimensional body composition data before it is analyzed.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a human physiological feature selection algorithm based on the combination of filtering and improved clustering, firstly, a Filter feature selection algorithm is used for removing features irrelevant to the body component classification, and then, an M-Chameleon feature clustering method is used for removing redundant features, so that the advantages of both the Filter feature selection algorithm and the feature clustering are brought into full play. The human body composition prediction model established in the way can improve the human body composition prediction precision and provide a more effective detection means for human body composition research and clinical application.

In order to achieve the above object, the present invention provides a human physiological feature selection algorithm based on a combination of filtering and improved clustering, comprising:

S1: selecting an impedance model, collecting the first and second characteristic parameter data, constructing an initial feature set and a final optimal subset, and initializing both as empty sets;

Further, body composition data measured by a human body composition analyzer (INBODY) is used as the data set, denoted T = (O, F, C), where O is the data sample set, F is the selected feature set, and C is the body composition classification. The parameters that have an important influence on human body composition, such as weight, height, age, sex and the impedance value of each body segment, are taken as the first characteristic parameters, and the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are taken as the second characteristic parameters. The impedance values are the 1 kHz impedance parameters of INBODY. The first characteristic parameters (R1, R2, R3, R4, R5, A, H, W, where A is age, H is height and W is weight) and the second characteristic parameters (1/R1, 1/R2, 1/R3, 1/R4, 1/R5, R1R2, R1R3, R1R4, R1R5, R2R3, R2R4, R2R5, R3R4, R3R5, R4R5 and the like) form the original characteristic parameter set, denoted F = {f₁, f₂, …, f_m}; the body composition classification set C includes body fat mass (BFM) and total body water (TBW).
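The construction of the original characteristic parameter set can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_feature_set` and the dictionary keys are hypothetical, and sex is omitted because it does not appear in the enumerated parameter list above.

```python
from itertools import combinations

def build_feature_set(impedances, age, height, weight):
    """Return first and second characteristic parameters for one subject.

    `impedances` holds the five segmental impedance values R1..R5
    (hypothetical input layout, assumed for illustration).
    """
    features = {"A": age, "H": height, "W": weight}
    # First characteristic parameters: the raw segmental impedances.
    for i, r in enumerate(impedances, start=1):
        features[f"R{i}"] = r
    # Second characteristic parameters: reciprocals and squares.
    for i, r in enumerate(impedances, start=1):
        features[f"1/R{i}"] = 1.0 / r
        features[f"R{i}^2"] = r * r
    # Second characteristic parameters: pairwise products R_i * R_j, i < j.
    indexed = list(enumerate(impedances, start=1))
    for (i, ri), (j, rj) in combinations(indexed, 2):
        features[f"R{i}R{j}"] = ri * rj
    return features
```

With five impedances this yields 3 + 5 + 5 + 5 + 10 = 28 candidate features per subject, which then form one row of the sample set O.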

S2: introducing a filtering algorithm (Filter), and calculating an HSIC value under a body composition class C for each characteristic in the collected data, wherein the HSIC value is used for representing the correlation size of the physiological characteristic and the body composition class;

Further, for each feature f₁, f₂, …, f_m ∈ F, a nonlinear feature mapping $\phi: F \to \mathcal{F}$ is defined, which maps the feature points f₁, f₂, …, f_m into a reproducing kernel Hilbert space $\mathcal{F}$ whose kernel function is:

$$k(f_i, f_j) = \langle \phi(f_i), \phi(f_j) \rangle$$

where $\langle \cdot, \cdot \rangle$ is the inner product of the space $\mathcal{F}$. Similarly, a body composition class mapping $\psi: C \to \mathcal{G}$ is defined, which maps the body composition index space C into a reproducing kernel Hilbert space denoted $\mathcal{G}$, whose kernel function is:

$$l(c_i, c_j) = \langle \psi(c_i), \psi(c_j) \rangle$$

In addition, the cross-covariance operator of the feature and body composition classes is defined as:

$$C_{fc} = E_{fc}\left[ \left( \phi(f) - E_f\,\phi(f) \right) \otimes \left( \psi(c) - E_c\,\psi(c) \right) \right]$$

where $\otimes$ denotes the tensor product and $E_f$ and $E_c$ denote expectations. For each feature f₁, f₂, …, f_m ∈ F, an HSIC value under the body composition class c is calculated. (HSIC is a kernel-based independence measure: a cross-covariance operator is defined on a reproducing kernel Hilbert space, and an independence criterion is obtained from an empirical estimate of the operator norm. It can measure the similarity between two data distributions and is widely used in feature selection and dimension reduction.) The value characterizes the degree of correlation between the physiological feature and the body composition class:

$$\mathrm{HSIC}(f, c) = \lVert C_{fc} \rVert_{HS}^{2}$$

for a certain feature f and body composition class c, a larger value of HSIC indicates a stronger dependency of c on f.
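A minimal sketch of how an HSIC value could be estimated from samples is given below. The source does not state which estimator or kernel it uses; this sketch assumes the commonly used biased empirical estimator tr(KHLH)/(n-1)² with Gaussian kernels, and the function names and bandwidth parameters `sigma_f`, `sigma_c` are hypothetical.

```python
import math

def gaussian_gram(values, sigma=1.0):
    """Gram matrix k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]

def center(gram):
    """Double-centre a symmetric Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_mean = [sum(row) / n for row in gram]
    total_mean = sum(row_mean) / n
    return [[gram[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

def hsic(feature, target, sigma_f=1.0, sigma_c=1.0):
    """Biased empirical HSIC between one feature column and one target column.

    Assumed estimator (not confirmed by the source): tr(K H L H) / (n - 1)^2,
    computed here as the elementwise sum of (H K H) and L.
    """
    n = len(feature)
    k_centered = center(gaussian_gram(feature, sigma_f))
    l_gram = gaussian_gram(target, sigma_c)
    trace = sum(k_centered[i][j] * l_gram[i][j]
                for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2
```

A feature that carries no information about the target, for example a constant column, gives a centred Gram matrix of zeros and therefore an HSIC of zero, while a feature on which the target depends gives a strictly positive value.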

S3: sorting the feature set from large to small according to the value of HSIC;

S4: adding the top-K ranked features to the feature set, filtering out parameters irrelevant to body composition with the Filter algorithm, and constructing an initial data set;

S5: constructing a feature sparse graph from the data set according to the clustering algorithm (M-Chameleon), where RI is the edge set of the interconnections among the features and RC is the similarity among the features, and initializing the expected number of clusters k;

Furthermore, Chameleon uses an agglomerative hierarchical clustering method and constructs the feature sparse graph as a K-nearest-neighbor graph: each vertex of the graph represents a data object, edges join vertices, and the edge weights reflect the similarity of the objects. The principle of the algorithm is shown in Fig. 1. The similarity of feature sub-clusters is evaluated on two points: 1) the interconnectivity of the objects in the clusters; 2) the closeness of the clusters. If two feature clusters are highly interconnected and close together, they are merged in preference to feature clusters that lie farther apart. The similarity between two feature clusters is determined from their relative interconnectivity RI and relative closeness RC. Given the normalized, Filter-filtered feature data set F = {f₁, f₂, …, f_m}, the data cluster F is divided into sub-clusters f₁ and f₂ such that the total weight of the edges cut in bisecting F into f₁ and f₂ is minimal. The relative interconnectivity RI(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute interconnectivity between f₁ and f₂ normalized with respect to the internal interconnectivity of the two clusters, namely:

$$RI(f_1, f_2) = \frac{\left| EC_{\{f_1, f_2\}} \right|}{\frac{1}{2}\left( \left| EC_{f_1} \right| + \left| EC_{f_2} \right| \right)}$$

where $EC_{\{f_1, f_2\}}$ is the edge cut of the cluster composed of f₁ and f₂, that is, the edges cut when that cluster is split into f₁ and f₂; similarly, $EC_{f_1}$ (or $EC_{f_2}$) is the minimum sum of the cut edges that divides f₁ (or f₂) into two approximately equal parts.

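The relative closeness RC(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute closeness between f₁ and f₂ normalized with respect to the internal closeness of the two clusters, namely:

$$RC(f_1, f_2) = \frac{\bar{S}_{EC_{\{f_1, f_2\}}}}{\frac{|f_1|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_1}} + \frac{|f_2|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_2}}}$$

where $\bar{S}_{EC_{\{f_1, f_2\}}}$ is the average weight of the edges connecting vertices of f₁ to vertices of f₂, and $\bar{S}_{EC_{f_1}}$ (or $\bar{S}_{EC_{f_2}}$) is the average weight of the edges cut by the minimum bisection of cluster f₁ (or f₂). The relative interconnectivity and relative closeness of the feature sub-clusters f₁ and f₂ together determine the similarity between the two sub-clusters.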
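The two measures defined above can be illustrated with a small sketch. It assumes the sparse graph is given as a symmetric weight matrix `weights` (zero meaning no edge) and that each cluster is a list of vertex indices containing at least two connected vertices; the minimum bisection is found by brute force, which is only practical for very small clusters. All function names are hypothetical.

```python
from itertools import combinations

def cut_edges(weights, part_a, part_b):
    """Weights of all edges joining a vertex in part_a to one in part_b."""
    return [weights[u][v] for u in part_a for v in part_b if weights[u][v] > 0]

def min_bisection(weights, cluster):
    """Edges cut by the lightest split of `cluster` into two near-equal halves.

    Brute force over all balanced splits (assumption: clusters are tiny).
    """
    members = list(cluster)
    best = None
    for half in combinations(members, len(members) // 2):
        rest = [v for v in members if v not in half]
        edges = cut_edges(weights, half, rest)
        if best is None or sum(edges) < sum(best):
            best = edges
    return best or []

def mean(values):
    return sum(values) / len(values) if values else 0.0

def relative_interconnectivity(weights, c1, c2):
    """RI: cut weight between c1 and c2 over the mean internal bisection weight."""
    between = sum(cut_edges(weights, c1, c2))
    internal = (sum(min_bisection(weights, c1))
                + sum(min_bisection(weights, c2))) / 2.0
    return between / internal

def relative_closeness(weights, c1, c2):
    """RC: mean connecting-edge weight over the size-weighted internal means."""
    n1, n2 = len(c1), len(c2)
    between = mean(cut_edges(weights, c1, c2))
    internal = (n1 / (n1 + n2)) * mean(min_bisection(weights, c1)) \
        + (n2 / (n1 + n2)) * mean(min_bisection(weights, c2))
    return between / internal
```

For example, in a four-vertex path graph with edge weights 4, 3 and 2, taking `c1 = [0, 1]` and `c2 = [2, 3]` gives a connecting cut of weight 3 and internal bisection weights 4 and 2, so both RI and RC equal 1.0.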

S6: screening redundant features in the clusters by using an improved clustering algorithm;

S61: calculating the distances between clusters and sorting them, and judging whether the number h of sample sub-clusters is equal to the initialized expected number of clusters k;

S62: if not equal, selecting the two sub-clusters with the maximum similarity function value and merging them; if equal, ending;

S63: recalculating the relative closeness RC of the new sub-cluster, traversing all sub-clusters, and judging whether every pair of sub-clusters has been tried for merging;

S64: if all pairs of sub-clusters have been tried, returning to S61; otherwise, merging the two sub-clusters with the minimum similarity function value and returning to S63;

S65: selecting the features with the maximum HSIC value for combination.

S7: selecting from each feature cluster the one feature with the highest HSIC value, and combining these features into the optimal feature set.
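The final selection step S7 can be sketched as below, assuming the clusters are available as lists of feature names and the HSIC values as a dictionary keyed by feature name. The function name is hypothetical, and any scores passed in for experimentation are illustrative, not values from the patent.

```python
def select_representatives(clusters, hsic_scores):
    """Pick, from each feature cluster, the feature with the highest HSIC value."""
    return [max(cluster, key=lambda name: hsic_scores[name])
            for cluster in clusters]
```

Because only one feature survives per cluster, the size of the resulting feature set equals the number of clusters k chosen in S5.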

By adopting the above technical solution, the invention achieves the following technical effects. In view of the characteristics of human physiological feature parameters, a human feature parameter selection algorithm based on the combination of Filter and clustering is provided. Features irrelevant to the classes are removed by the feature filtering method of the Hilbert-Schmidt independence criterion, and Chameleon clustering is applied to feature selection and improved, so that redundant features are removed well, an optimal feature parameter set for constructing the body composition model is effectively selected, and the problem of numerous and redundant human physiological feature parameters is solved. This provides a more effective means of detection for human body composition research and clinical application.

Drawings

FIG. 1 is a schematic diagram of a Chameleon clustering algorithm;

FIG. 2 is a schematic diagram of an improved Chameleon algorithm;

FIG. 3 is a human body feature parameter selection process;

FIG. 4 is the correlation between BFM and the characteristic parameters obtained using the filtering algorithm in the 1 kHz frequency band;

FIG. 5 is the correlation between BFM and the characteristic parameters obtained using the filtering algorithm in the 250 kHz frequency band;

FIG. 6 is the correlation between BFM and the characteristic parameters obtained using the filtering algorithm in the 500 kHz frequency band;

FIG. 7 is an analysis of the number of parameter clusters after using a filtering algorithm;

FIG. 8 shows the distance between the characteristic parameter and the BFM index when different sample numbers are grouped into four categories;

FIG. 9 shows the comparison between the predicted value and the actual value of the BFM model;

FIG. 10 shows the comparison of the BFM model predicted values with respect to the error.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

Body composition data measured by INBODY is taken as the data set, denoted T = (O, F, C). The parameters that have an important influence on human body composition, such as weight, height, age, sex and the impedance value of each body segment, are taken as the first characteristic parameters, and the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are taken as the second characteristic parameters. INBODY measures at three frequencies, 1 kHz, 250 kHz and 500 kHz, and the relationship between body composition and the characteristic parameters is studied in these three frequency bands and for different sample sizes. The first characteristic parameters (R₁, R₂, R₃, R₄, R₅, A, H, W) and the second characteristic parameters 1/R₁, 1/R₂, 1/R₃, 1/R₄, 1/R₅, R₁R₂, R₁R₃, R₁R₄, R₁R₅, R₂R₃, R₂R₄, R₂R₅, R₃R₄, R₃R₅, R₄R₅ are selected as the original feature parameter set, denoted F = {f₁, f₂, …, f_m}; the body composition classification set C includes body fat mass (BFM) and total body water (TBW). The following table lists some of the sample data sets.

However, many parameters affect body composition, and the parameters are highly nonlinear, redundant and partly irrelevant. A dimension-reduction method is therefore needed to deal with the redundant and irrelevant characteristic parameters. The clustering method divides the body composition parameter data, object by object, into several groups or clusters so that the objects within a cluster are highly similar, and judges features by the distance between each cluster and a central point, which screens out redundant features effectively. Meanwhile, before the high-dimensional body composition data is analyzed, attributes irrelevant to the required features are eliminated by a step that reduces the number of features.

Therefore, the filtering algorithm is applied to the data first. Given the original feature set F = {f₁, f₂, …, f_m} and the data sample set O = {o₁, o₂, …, o_n}, the filtering algorithm is run on the first 100 human samples in the three frequency bands 1 kHz, 250 kHz and 500 kHz, and the correlation between the filtered characteristic parameters and the body composition BFM is shown in the figures below.

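Here $\langle \cdot, \cdot \rangle$ is the inner product of the reproducing kernel Hilbert space $\mathcal{F}$ into which the feature mapping $\phi$ takes the features, so that $k(f_i, f_j) = \langle \phi(f_i), \phi(f_j) \rangle$. Similarly, a body composition class mapping $\psi: C \to \mathcal{G}$ is defined, which maps the body composition index space C into a reproducing kernel Hilbert space denoted $\mathcal{G}$; the corresponding kernel function is:

$$l(c_i, c_j) = \langle \psi(c_i), \psi(c_j) \rangle$$

The kernel function computes the inner product of the feature-space images of two points without explicitly evaluating the mapping $\phi$ and without paying the computational cost implied by the dimensionality. The cross-covariance operator of the feature and body composition classes can thus be defined as:

$$C_{fc} = E_{fc}\left[ \left( \phi(f) - E_f\,\phi(f) \right) \otimes \left( \psi(c) - E_c\,\psi(c) \right) \right]$$

In the above formula, $\otimes$ denotes the tensor product and $E_f$ and $E_c$ denote expectations [16]. The squared norm of this cross-covariance operator is called HSIC; its expression is [14]:

$$\mathrm{HSIC}(f, c) = \lVert C_{fc} \rVert_{HS}^{2}$$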

After the Filter algorithm is run on the body composition BFM in the different impedance frequency bands, the correlations shown in Figs. 4, 5 and 6 are obtained. As can be seen from these three figures, as the impedance frequency band increases, the impedance value decreases and the amount of BFM information contained in each characteristic parameter gradually decreases. Characteristic parameters are screened using an 80% confidence interval, and the features retained after running the Filter algorithm in the different frequency bands are summarized in Table 2:

Table 2: Features retained after running the Filter algorithm in different frequency bands

As can be seen from Table 2, the algorithm herein greatly reduces the size of the original feature set, and the features of the 250 kHz frequency band are more concentrated. Therefore, the filtered features of the intermediate impedance frequency band, 250 kHz, are selected for cluster analysis, and redundant information is screened out.
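The ranking and truncation performed by the filtering stage (steps S3 and S4: sort by HSIC in descending order, keep the top K) can be sketched as follows. The function name is hypothetical, and the choice of K is left to the caller because the source does not fix it.

```python
def top_k_features(hsic_scores, k):
    """Return the names of the k features with the largest HSIC values.

    `hsic_scores` maps feature name -> HSIC value under one body
    composition class (for example BFM).
    """
    ranked = sorted(hsic_scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

The features returned here form the initial data set that is handed to the clustering stage; features outside the top K are treated as irrelevant to the body composition class and discarded.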

Before clustering, the number of classes into which the feature parameters should be clustered is determined first. The amount of information contained under different numbers of clusters is calculated from the screened feature parameters, as shown in Fig. 7; the analysis shows that dividing the feature parameters and body composition into 4 classes represents the selected feature information well. As can be seen from Fig. 8, the clusters change little when the number of samples is 20, 40, 60, 80 or 100: 1/R₄ and 1/R₅ form one cluster; A, H, W, R₅ and R₄ form one cluster; R₄R₅, R₁R₂, R₂², R₁² and R₅² form one cluster; and R₂R₃ and R₁R₃ form one cluster. After the characteristic parameters obtained by the Filter algorithm are clustered into 4 classes, the parameters far from the cluster center BFM, namely 1/R₄, R₄, R₁² and R₁R₃, can be removed. Table 3 lists the feature parameters selected after the Filter and clustering algorithms.

Table 3: characteristic parameters after Filter and clustering algorithm

Table 4 lists the candidate feature sets and time complexity obtained when body composition BFM is predicted using the different feature selection methods.

Table 4: Optimal feature set and complexity comparison

As can be seen from Table 4, for data sets of the same dimension, both the number of candidate features obtained with the algorithm herein and its time complexity are smaller than those of the Filter, Wrapper and mRMR feature selection algorithms.

To verify the performance of the feature selection algorithm, features are selected for body composition (BFM) using mRMR, the combined Filter and Wrapper feature selection algorithm, and the feature selection algorithm herein. To measure accurately the quality of each candidate feature set for the given body composition BFM, the first 80 samples are taken as the training set, denoted T₁ = {(x₁, y₁), (x₂, y₂), …, (x₈₀, y₈₀)}, and the last 20 as the test set, denoted T₂ = {(x₈₁, y₈₁), (x₈₂, y₈₂), …, (x₁₀₀, y₁₀₀)}, where x_i ∈ R^l is the input characteristic parameter vector, serving as the independent variable, and y_i ∈ R is the actual body composition value, serving as the dependent variable. Multiple linear regression in SPSS is used for training on T₁. Table 5 summarizes the models obtained by regression modeling of BFM with the above feature sets:

Table 5: Model summary (modified)

a. Predictors: (Constant), W, S, A, R₃, 1/R₂, 1/R₁, 1/R₃, R₄², R₄R₅, R₅²

b. Predictors: (Constant), 1/R₃, W, S, R₂², R₄², R₄R₅, R₅², 1/R₁, R₅

c. Predictors: (Constant), A, H, W, R₅, R₁R₂, R₂R₃, R₄R₅, 1/R₅, R₂², R₅²

As can be seen from Table 5, the correlations between BFM and the physiological feature sets in models 1, 2 and 3 are 0.927, 0.906 and 0.978 respectively, so the feature set obtained with the algorithm herein correlates most strongly with body composition.

according to the obtained regression coefficients of the models, a prediction equation is listed:

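$$BFM_1 = 0.041\,W + 0.126\,S + 0.523\,A - 0.212\,R_3 + 0.171\cdot\frac{1}{R_1} + 0.126\cdot\frac{1}{R_2} + 0.179\cdot\frac{1}{R_3} + 0.132\,R_4^2 + 0.13\,R_4R_5 + 0.127\,R_5^2 - 8.56 \tag{1}$$

$$BFM_2 = 0.313\,W - 0.044\,S - 0.125\cdot\frac{1}{R_3} + 0.108\cdot\frac{1}{R_1} + 0.016\,R_4^2 - 0.01\,R_2^2 + 0.071\,R_5^2 + 0.072\,R_4R_5 - 0.526\,R_5 + 5.674 \tag{2}$$

$$BFM_3 = -0.464\,A - 0.15\,H + 0.122\,W - 0.143\,R_5 + 0.129\,R_1R_2 + 0.122\,R_2R_3 - 0.134\,R_4R_5 + 0.145\cdot\frac{1}{R_5} + 0.129\,R_2^2 - 0.141\,R_5^2 \tag{3}$$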
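The evaluation protocol that produces such equations (first 80 samples for training, last 20 for testing, multiple linear regression) can be sketched as follows. The source performs the regression in SPSS; this stand-in solves the ordinary least-squares normal equations directly in plain Python, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_linear_regression(xs, ys):
    """Least-squares coefficients [intercept, b1, ..., bl] via normal equations."""
    design = [[1.0] + list(row) for row in xs]
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(coeffs, row):
    """Evaluate the fitted linear model on one feature vector."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

def split_train_test(xs, ys, n_train=80):
    """First n_train samples for training, the remainder for testing."""
    return (xs[:n_train], ys[:n_train]), (xs[n_train:], ys[n_train:])
```

Solving the normal equations is adequate for a handful of well-scaled features; with strongly collinear features, such as impedances together with their squares and products, a QR- or SVD-based solver from a numerical library would be the more robust choice.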

The obtained prediction models are used to predict the test set T₂, and the predictions are compared with the actual values, giving Fig. 9, the comparison of predicted and actual values of the BFM models, and Fig. 10, the error analysis. As can be seen from Fig. 10, the prediction model built from the features obtained by the feature selection algorithm herein has high accuracy, with a relative prediction error below 0.12. The results show that the feature set obtained by the human physiological feature selection algorithm based on the combination of Filter and clustering correlates well with body composition, which improves the fitting accuracy of the body composition prediction model and reduces the prediction error.
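The relative prediction error reported above can be computed as in the following sketch. It assumes the usual definition |predicted - actual| / |actual| per test sample, which the source does not spell out; the function name is hypothetical.

```python
def relative_errors(predicted, actual):
    """Per-sample relative error |p - a| / |a| (assumes no actual value is zero)."""
    return [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
```

Checking a model against the stated bound then amounts to testing `max(relative_errors(predicted, actual)) < 0.12` on the 20 held-out samples.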

Compared with the prior art, the invention provides a human physiological feature selection algorithm based on the combination of Filter and clustering. Features irrelevant to the classes are removed by the feature filtering method of the Hilbert-Schmidt independence criterion, and Chameleon clustering is applied to feature selection and improved, so that redundant features are removed well, an optimal feature parameter set for constructing the body composition model is effectively selected, and the problem of numerous and redundant human physiological feature parameters is solved. The body composition prediction model established in this way can improve the prediction accuracy of human body composition and provides a more effective means of detection for human body composition research and clinical application.

The above describes only preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. Human physiological characteristic selection algorithm based on combination of filtering and improved clustering, which is characterized by comprising the following steps:

s2: introducing a filtering algorithm, and calculating an HSIC value under the body composition classification for each feature in the collected data, wherein the value represents the correlation size of the physiological feature and the body composition classification;

s3: sorting the feature set from large to small according to the value of HSIC;

S4: adding the top-K ranked features to the feature set, filtering out parameters irrelevant to body composition with the filtering algorithm, and constructing an initial data set;

s5: constructing a characteristic sparse graph from a data set according to a clustering algorithm, wherein RI is an edge set of characteristics which are mutually connected, and RC is the similarity between the characteristics, and initializing the number k of expected clusters;

2. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, wherein body composition data measured by a human body composition analyzer is used as the data set, denoted T = (O, F, C), where O is the data sample set, F is the selected feature set, and C is the body composition classification; the parameters having an important influence on human body composition are taken as the first characteristic parameters, and the reciprocal 1/R_i, the square R_i² and the product R_iR_j of the segment impedances are taken as the second characteristic parameters; wherein the impedance values are the 1 kHz impedance parameters of the human body composition analyzer, and the first characteristic parameters R1, R2, R3, R4, R5, A, H, W and the second characteristic parameters 1/R1, 1/R2, 1/R3, 1/R4, 1/R5, R1R2, R1R3, R1R4, R1R5, R2R3, R2R4, R2R5, R3R4, R3R5 and R4R5 form the original characteristic parameter set F = {f₁, f₂, …, f_m}, where f₁, f₂, …, f_m are the features, A is age, H is height and W is weight, and the body composition classification C includes body fat mass and total body water.

3. The human physiological feature selection algorithm based on a combination of filtering and improved clustering according to claim 1 or 2, wherein for each feature in the collected data, the HSIC values at the body composition class C are calculated, specifically:

for each feature f₁, f₂, …, f_m ∈ F, a nonlinear feature mapping $\phi: F \to \mathcal{F}$ is defined, which maps the feature points f₁, f₂, …, f_m into a reproducing kernel Hilbert space $\mathcal{F}$; the body composition index space C is mapped by a mapping $\psi$ into a reproducing kernel Hilbert space denoted $\mathcal{G}$, whose kernel function is $l(c_i, c_j) = \langle \psi(c_i), \psi(c_j) \rangle$; for each feature f₁, f₂, …, f_m ∈ F, the HSIC value under the body composition classification C is calculated.

4. The human physiological feature selection algorithm based on a combination of filtering and improved clustering according to claim 3, wherein the cross-covariance operator of the feature and body composition classes is defined as $C_{fc} = E_{fc}\left[ \left( \phi(f) - E_f\,\phi(f) \right) \otimes \left( \psi(c) - E_c\,\psi(c) \right) \right]$, where $\otimes$ denotes the tensor product and $E_f$ and $E_c$ denote expectations; the HSIC value characterizes the degree of correlation between the physiological features and the body composition classes: $\mathrm{HSIC}(f, c) = \lVert C_{fc} \rVert_{HS}^{2}$

5. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, wherein a feature sparse graph is constructed according to a K-nearest neighbor graph method by using a Chameleon agglomerative hierarchical clustering method, each vertex in the graph represents a data object, an edge exists between the two vertices, the similarity of the objects can be reflected by using the weighting of the edges, and the similarity of feature sub-clusters is evaluated according to two points: 1) interconnection of objects in the cluster; 2) proximity of clusters; and determining the similarity between the two features according to the relative interconnection degree RI and the relative approximation degree RC of the two feature clusters.

6. The human physiological feature selection algorithm based on a combination of filtering and improved clustering according to claim 4, wherein, given the normalized feature data set F = {f₁, f₂, …, f_m} filtered using the filtering algorithm, the data cluster F is divided into sub-clusters f₁ and f₂ such that the total weight of the edges cut in bisecting F into f₁ and f₂ is minimal; the relative interconnectivity RI(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute interconnectivity between f₁ and f₂ normalized with respect to the internal interconnectivity of the two clusters, namely:

$$RI(f_1, f_2) = \frac{\left| EC_{\{f_1, f_2\}} \right|}{\frac{1}{2}\left( \left| EC_{f_1} \right| + \left| EC_{f_2} \right| \right)}$$

where $EC_{\{f_1, f_2\}}$ is the edge cut of the cluster composed of f₁ and f₂; similarly, $EC_{f_1}$ or $EC_{f_2}$ is the minimum sum of the cut edges that divides f₁ or f₂ into two approximately equal parts.

7. The human physiological feature selection algorithm based on a combination of filtering and improved clustering according to claim 5,

the relative closeness RC(f₁, f₂) of two feature clusters f₁ and f₂ is defined as the absolute closeness between f₁ and f₂ normalized with respect to the internal closeness of the two clusters, namely:

$$RC(f_1, f_2) = \frac{\bar{S}_{EC_{\{f_1, f_2\}}}}{\frac{|f_1|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_1}} + \frac{|f_2|}{|f_1| + |f_2|}\,\bar{S}_{EC_{f_2}}}$$

where $\bar{S}_{EC_{\{f_1, f_2\}}}$ is the average weight of the edges connecting vertices of f₁ to vertices of f₂, and $\bar{S}_{EC_{f_1}}$ or $\bar{S}_{EC_{f_2}}$ is the average weight of the edges cut by the minimum bisection of cluster f₁ or f₂;

the relative interconnectivity and relative closeness of the feature sub-clusters f₁ and f₂ together determine the similarity between the two sub-clusters.

8. The human physiological feature selection algorithm based on the combination of filtering and improved clustering according to claim 1, wherein the improved clustering algorithm is characterized in that all feature sub-clusters are traversed and tried to be merged and replaced, the feature selection quality is evaluated after the sub-clusters are merged, and merging is tried to be performed between every two existing sub-clusters; the method comprises the following specific steps:

S61: calculating the distances between clusters and sorting them, and judging whether the number h of sample sub-clusters is equal to the initialized expected number of clusters k;

S62: if not equal, selecting the two sub-clusters with the maximum similarity function value and merging them; if equal, ending;

S63: recalculating the relative closeness RC of the new sub-cluster, traversing all sub-clusters, and judging whether every pair of sub-clusters has been tried for merging;

S64: if all pairs of sub-clusters have been tried, returning to S61; otherwise, merging the two sub-clusters with the minimum similarity function value and returning to S63;

S65: selecting the features with the maximum HSIC value for combination.