CN107679138B

CN107679138B - Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Info

Publication number: CN107679138B
Application number: CN201710868300.5A
Authority: CN
Inventors: 谢娟英; 周颖; 丁丽娟
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2021-08-27
Anticipated expiration: 2037-09-22
Also published as: CN107679138A

Abstract

The invention discloses a spectral feature selection method based on local scale parameters, entropy and cosine similarity, which adopts a Gaussian kernel function as a similarity measurement method, defines characteristic local scale parameters based on characteristic local standard deviation as kernel function parameters, and solves the problems that unified scale parameters cannot reflect data distribution information when a characteristic affinity matrix is calculated and the local scale parameters are influenced by outliers; the feature importance is measured by respectively adopting entropy and chord similarity sequencing, and a proper feature subset can be quickly selected; provides technical support for data analysis of diseases such as tumor and the like, and has important biomedical significance.

Description

Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Technical Field

The invention belongs to a gene microarray data and text data analysis technology, and relates to a spectral feature selection method based on local scale parameters, entropy and cosine similarity.

Background

Feature selection is the primary task of high-dimensional big data analysis such as gene microarray data and text data^[1,2]The method aims to eliminate irrelevant or redundant features from all features and select a feature subset with good distinguishing capability so as to retain all classification information of an original feature set as much as possible. The feature selection algorithm is divided into a supervised method and an unsupervised method according to whether the feature selection process uses sample class mark information or not^[3]. The supervised feature selection method carries out feature selection by calculating the correlation between features and class mark columns, and the unsupervised feature selection method considers the internal structure of data and does not need to use class mark information. In practical application, a large amount of data which is difficult to acquire is existed, so that the unsupervised feature selection research is particularly important.

The cluster analysis is used as an unsupervised learning technology, can discover knowledge from data and reveal hidden patterns and rules^[4]The clustering idea is introduced into unsupervised feature selection, so that high-quality feature subsets can be ensured^[5]. Liu Tao and the like^[6]Providing an unsupervised feature selection algorithm for text clustering, and adopting χ for a K-means clustering result²And calculating the feature importance by using the statistics or the information entropy to select the features, thereby obviously improving the text clustering performance. The traditional partitioning type clustering algorithm such as K-means is suitable for finding spherical clusters and is often converged in a local optimal solution. Spectral clustering is established on the spectrogram theory, the clustering problem is converted into the optimization problem of a graph, and data is utilizedClustering feature vectors of a similarity matrix^[7,8～11]Convergence to a global optimum solution^[7]. Spectral clustering is classified into two-way spectral clustering and multi-way spectral clustering according to different partition criteria. Two-path spectral clustering can only perform two types of division, has high calculation complexity, and only comprises one characteristic vector, so that useful information is lost^[10]. The multi-path spectral clustering algorithm adopts a plurality of characteristic vectors simultaneously, can comprehensively acquire information and reduce instability^[12,13]。

In recent years, spectral feature selection has become one of the hot spots in the field of machine learning and pattern recognition. Zhao et al^[14]Considering the correlation between the features and the class marks and the correlation between the features, a unified feature selection framework based on spectrogram theory is provided and is generally used for supervised feature selection (such as a Relieff algorithm)^[15]) And unsupervised feature selection (e.g., Laplace score feature selection method)^[16]). But this framework lacks redundant control and does not result in a compact subset of features. Garcia-Garcia etc^[17]On the basis of the framework, a general framework for selecting the spectral features based on mutual information is provided, and the low redundancy of the obtained feature subsets is ensured. Zhou et al^[18]The spectral feature selection method comprehensively considering the local structure and the global structure not only ensures the stability of the local geometric structure of data, but also ensures that different clusters have obvious dissimilarity, thereby selecting representative features as far as possible.

Traditional spectral clustering algorithm NJW^[11]The medium similarity measurement mostly adopts a radial basis kernel function, and a scale parameter sigma in the function is usually given subjectively and is globally uniform, so that the data distribution cannot be reflected correctly, and the real distribution of the data cannot be found. The NJW-based spectral feature selection is very sensitive to scale parameters, making scale parameter selection the key to finding the optimal feature subset. In addition, the result of spectrum feature selection also depends on a feature importance measurement method, the existing spectrum feature selection algorithm mostly adopts different normalized cut-set criteria to order features or continuously iterates, and a feature subset is selected from the result of feature spectrum clustering, so that the optimal feature subset cannot be quickly selected. Reference to the literature

[1]Guyon I,Elisseeff A.An introduction to variable and feature selection[J].The Journal of Machine Learning Research,2003,3:1157-1182

[2]Hua Jianping,Tembe W D,Dougherty E R.Performance of feature selection methods in the classification of high-dimension data[J].Pattern Recognition,2009,42(3):409-424

[3]Xie Juanying,Xie Weixin.Supervised Feature Selection and its Application[M].Xi’an,Shaanxi Normal University publishing CNS LTD,2012,11:1-3

[4]Gao Hongchao.The Research of DPC Algorithm and Gene Selection Algorithm based on Clustering[D].Xi’an:Shaanxi Normal University,2015

[5]Xie Juanying,Gao Hongchao.Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms[J].Journal of Software,2014,25(9):2050-2075

[6]Liu Tao,Wu Gongyi,Chen Zheng.An Effective Unsupervised Feature Selection Method for Text Clustering[J].Journal of Computer Research and Development,2005,42(3):381-386

[7]Gao Yan,Gu Shiwen,Tang Jin et.al.Research on Spectral Clustering in Machine Learning[J].Journal of Computer Science,2007，34(2):201-203

[8]Ulrike von Luxburg.A Tutorial on spectral clustering[J].Statistics and Computing,2007,17(4):

[9]S.A.Toussi,H.S Yazdi.Feature Selection in Spectral Clustering[J].International Journal of Signal Processing,Image Processing and Pattern Recongnition,2011,4(3):179-194

[10]J.Shi,J.Malik.Normalized cuts and image segmentation,IEEE Transactions on Pattern Analysis and Machine Intelligence[J].2000,22(8):888-905

[11].Y.Ng,M.Jardan,Y.Weiss.On spectral clustering:analysis and an algorithm[J].Advances in neural information processing systems,2002,2:849-856

[12]C.Alpert,A.Kahng,S.Yao.Spectral partitioning:the more eigenvectors,the better[J].Discrete Applied Math,1999,90:3-26

[13]Y.Weiss.Segmetation using Eigenvectors:a unigying view[J].In Proceedings of International Conference on Computer Vision.1999,9:975-982

[14]Z.Zhao,H.Liu.Spectral feature selection for supervised and unsupervised learning[C].In ICML’07:Proceedings of the 24th international conference on Machine learning,pages 1151–1157,New York,NY,USA,2007.ACM.

[15]Igor Kononenko.Estimating Attribute:Analysis and Extensions of Relief[J].Proceedings of the European conference on machine learning,1994

[16]He X,Cai D,Niyogi P.Laplacian score for feature selection[C].Advances in neural information processing systems.2005:507-514

[17]D GarciaGarcia,R SantosRodriguez.Spectral Clustering and Feature Selection for Microarray Data[C].International Conference on Machine Learning and Applications,2009:425-428

[18]Sihang Zhou,Xinwang Liu,Chengzhang Zhu et al.Spectral Clustering-based Local and Global Structure Preservation for Feature Selection[C].International Joint Conference on Neural Networks,2014

[19]Dash M,Liu H.Feature selection for clustering[M].Knowledge Discovery and Data Mining.Current Issues and New Applications.Springer Berlin Heidelberg,2000:110-121

[20]Dash M,Choi K,Scheuermann P,et al.Feature selection for clustering-a filter solution[C].Data Mining,2002.ICDM 2003.Proceedings.2002IEEE International Conference on.IEEE,2002:115-122

[21]Vapnik V.The nature of statistical learning theory[M].Springer Science&Business Media,2013

[22]Miranda J,Montoya R,Weber R.Linear penalization support vector machines for feature selection[C].Proceedings of International Conference on Pattern Recognition and Machine Intelligence.Berlin:Springer-Verlag,2005:188-192

[23]Cai D,Zhang C,He X.Unsupervised feature selection for multi-cluster data[C].Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.ACM,2010:333-342

[24]Zelnik-Manor L,Perona P.Self-tuning spectral clustering[C].Advances in neural information processing systems.2004:1601-1608

[25]Jiawei Han，KAMBER M.Data Mining Concepts and Techniques[M].Second Edition.Beijing:China Machine Press，2006

[26]Chang C C,Lin C J.LIBSVM:A library for support vector machines[J].ACM Transactions on Intelligent Systems and Technology,2011,2(3):1-27

[27]Hsu C W,Chang C C,Lin C J.A practical guide to support vector classification[R].Taibei:National Taiwan University,Department of Computer Science,2003

[28]http:.datam.i2r.a-star.edu.sg/datasets/krbd/

[29]Davis,J,Goadrich,M.The relationship between Precision-Recall and ROC curves[C].In Proceedings of the 23rd international conference on Machine learning ACM,2006,6:233-240

[30]Fawcett,T.An introduction to ROC analysis[J].Pattern recognition letters,2006,27(8):861-874.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a spectral feature selection method based on local scale parameters, entropy and cosine similarity, overcomes the defect that the uniform scale parameter sigma cannot completely and accurately reflect data distribution information when a feature affinity matrix is calculated, and simultaneously overcomes the defect that the feature local scale parameter sigma cannot completely and accurately reflect data distribution information when a self-tuning method calculates the feature affinity matrix_iProblems affected by outliers.

In order to achieve the purpose, the invention adopts the following scheme:

the spectral feature selection method based on the local scale parameters, entropy and cosine similarity comprises a feature spectral clustering method based on the feature local standard deviation and a feature spectral clustering method based on the self-tuning algorithm;

let X be X ═ X₁,x₂,L,x_n}∈R^m×n，x_i(i is 1, L, n) is a column vector, which is an original feature column for spectral feature selection, and m is the number of samples;

the characteristic spectrum clustering method based on the characteristic local standard deviation comprises the following steps:

1) equation (1) defines the reflection characteristic x_i(i ═ 1,2, L, n) local standard deviation scale parameter definition σ_{std_i}Calculating a new scale parameter σ according to equation (1)_{std_i}；

Wherein, the characteristic x_rIs a characteristic x_iThe r nearest neighbor, the neighbor measurement basis is Euclidean distance;

2) formula (2) defines an affinity matrix a ═ a (a) expressing the similarity between features_ij)_{i,j＝1,2,L,n}∈R^n×n；

Wherein d (x)_i,x_j) Is a characteristic x_i,x_jThe Euclidean distance between;

3) constructing a feature degree matrix D ═ D (D) according to equation (3)_ij)_{i,j＝1,2,L,n}∈R^n×nI.e. a matrix with only major diagonal elements, the ith diagonal element being the sum of the corresponding elements in the ith row of the characteristic affinity matrix a;

4) calculating a normalized Laplace matrix L according to the formula (4);

5) solving eigenvalues of the normalized Laplace matrix L, sorting the eigenvalues in descending order, and selecting K eigenvectors corresponding to the top K (equal to m, sample number) large eigenvalues to form a matrix V, namely V equal to [ V ═ V [ V ]₁,v₂,L,v_K]∈R^n×K，v_i(i ═ 1,2, L, K) is the eigenvector corresponding to the ith large eigenvalue, which is the column vector;

6) standardizing the matrix V according to the formula (5), and recording the standardized matrix as U ═ U (U)_ij)_{i＝1,L,n；j＝1,L,K}∈R^n×K；

7) Performing K-means (K is m) clustering on the matrix U, and clustering n features into K feature cluster;

8) measuring the importance of the features by respectively using entropy sorting and cosine similarity sorting, sorting the features, selecting the most important feature of the cluster from each feature cluster to represent the cluster, and obtaining a feature subset consisting of the representative features of K feature clusters;

the self-tuning algorithm based feature spectrum clustering method comprises the following steps:

1) equation (6) defines the reflection characteristic x_i(i ═ 1,2, L, n) local scale parameter σ of local information_iCalculating the feature x according to equation (6)_iLocal scale parameter σ of_i；

σ_i＝d(x_i,x_p) (6)

Wherein, the characteristic x_pIs a characteristic x_iP-th neighbor of (c), d (x)_i,x_p) Is a characteristic x_iTo feature x_pThe Euclidean distance of;

2) formula (7) defines a feature affinity matrix a ═ a (a) expressing the similarity between features_ij)_{i,j＝1,2,L,n}∈R^n×n；

Wherein, d (x)_i,x_j) Is a characteristic x_i,x_jThe Euclidean distance between;

4) calculating a normalized Laplace matrix L according to the formula (4);

5) solving eigenvalues of the normalized Laplace matrix L, sorting the eigenvalues in descending order, and selecting K eigenvectors corresponding to the top K (═ m) large eigenvalues to form a matrix V, namely V ═ V [ V ] m₁,v₂,L,v_K]∈R^n×K，v_i(i ═ 1,2, L, K) is the eigenvector corresponding to the ith large eigenvalue, which is the column vector;

7) Performing K-means (K is m) clustering on the matrix U, and clustering n features into K feature clusters;

8) the feature importance is measured by respectively using entropy sorting and cosine similarity sorting, the features are sorted, one feature with the most importance in the cluster is selected from each feature class cluster to represent the cluster, and a feature subset consisting of the representative features of K feature class clusters is obtained.

Further, step 8) measures the feature importance based on entropy sorting, selects the most important feature of each feature cluster from the K feature clusters, forms a feature subset containing the K features, and realizes feature selection, specifically comprising the following steps:

1) the characteristic matrix U after characteristic spectrum clustering is equal to (U)_ij)_{i＝1,L,n；j＝1,L,K}∈R^n×KWherein U is_iRepresenting the ith characteristic, and defining the entropy of a characteristic matrix U as an expression (8) according to the entropy theory;

wherein, p (U)_i) Represents U_iBecause the prior probability of the features is often difficult to obtain, the prior probability is replaced by the similarity in the calculation, and therefore the formula (8) is replaced by the formula (9);

wherein S is_ijRepresentation characteristic U_iAnd U_jThe similarity of (a) is defined as formula (10);

wherein, Distance_ijRepresentation characteristic U_iAnd U_jThe distance between the two electrodes is calculated according to the formula (11);

therein, max_kAnd min_kRespectively representing the maximum value and the minimum value of all the characteristics of the kth sample;

2) let formula (12) E_-UsIndicating the removal of a feature U from a feature set U_sLater feature set U- { U_sEntropy of, if E_-Usf

Explanation deletion feature U_sWill cause greater disorder of the feature set U, and therefore, feature U_sThan U_tMore importantly, thus, all features are ordered;

wherein, U/U_s＝U-{U_s}；

3) And 7) respectively selecting the most important features of the feature clusters from the K feature clusters obtained in the step 7) to form a feature subset, so as to obtain the optimal feature subset to be selected.

Further, step 8) measures feature importance based on cosine similarity ranking, and selects the most important feature in each feature cluster from the K feature clusters, that is, the feature with the largest sum of cosine similarities with other features of the cluster, as the representative feature of the feature cluster, so as to obtain a feature subset including K features, thereby implementing feature selection, specifically including the following steps:

1) the characteristic matrix U after characteristic spectrum clustering is equal to (U)_ij)_{i＝1,L,n；j＝1,L,K}∈R^n×KWherein U is_iRepresenting the ith feature, then we define equation (14) to measure the importance, i.e. representativeness, of each feature, the more important features are more representative;

wherein, | U_igU_jI represents a feature U_i，U_jAbsolute value of inner product, | U_i||×||U_j| represents a feature U_i，U_jProduct of modes, N_iRepresentation characteristic U_iThe number of the characteristics of the characteristic cluster;

2) according to the definition of formula (14) in step 1) of the part, obtaining the importance value of each feature, and sorting the features of each feature cluster according to the importance;

3) from the K (K ═ m) feature class clusters obtained in step 7) of claim 1 (or 2), the most important features of each feature class cluster are selected, respectively, and feature subsets are formed, so that the optimal feature subset to be selected is obtained.

The invention has the following beneficial effects:

aiming at the problems existing in the existing spectrum feature selection, the self-tuning algorithm is firstly introduced, feature spectrum clustering based on feature local scale parameters is provided, then, aiming at the problem that the local scale parameters of the algorithm are influenced by outliers, a feature spectrum clustering algorithm based on feature local standard deviation is provided, and the two feature spectrum clustering methods based on the feature local scale parameters are respectively used for carrying out spectrum clustering on the features to obtain K (m, sample number) feature clusters; on the basis, a feature sorting method based on entropy and cosine similarity is provided, various cluster features are sorted, the most important features of various clusters are respectively selected from the various feature clusters to serve as representative features of the feature cluster, and the representative features of the various clusters form an optimal feature subset, so that feature selection is realized, irrelevant and redundant features are provided, and the recognition rate and the stability of a system are improved.

The invention has good effects in the aspects of diagnosis of tumor patients and identification and application of tumor gene markers, and specifically comprises the following steps:

(1) the method adopts a Gaussian kernel function as a similarity measurement method, applies self-tuning thought to feature spectral clustering, and overcomes the defect that when the traditional spectral clustering is applied to feature clustering, the unified scale parameter sigma used for calculating a feature affinity matrix can not accurately reflect data distribution information, thereby influencing experimental results;

(2) defining the local standard deviation of the characteristics as a kernel function scale parameter, realizing characteristic spectrum clustering, overcoming the defect that the uniform scale parameter sigma cannot accurately reflect data distribution information when calculating the characteristic affinity matrix, and overcoming the local scale parameter sigma when calculating the characteristic affinity matrix by the self-tuning method_iProblems affected by outliers;

(2) the entropy is adopted to define the feature importance, the features are sequenced, the most important features of each feature cluster obtained from the feature spectrum clustering result can be quickly selected, and an ideal feature subset is formed;

(3) the cosine similarity ranking is adopted to measure the feature importance, the most important feature of each feature cluster obtained by the feature spectrum clustering result can be quickly selected, and a proper feature subset is formed;

(3) the characteristic selection method provided by the invention can select an effective characteristic distinguishing subset to discover the gene marker of the tumor, has a higher classification effect in the analysis of tumor gene expression profile data, provides technical support for the data analysis of diseases such as the tumor and the like, and has important biomedical significance.

Drawings

FIG. 1 is a flow chart of the application of the novel method for selecting the profile characteristics in the tumor gene expression profile data according to the present invention

FIG. 2 is a graph of the average classification accuracy of the feature selection method of the present invention on a Colon dataset

FIG. 3 is a graph of the mean AUC values of the feature selection method of the present invention on a Colon data set

FIG. 4 is a graph of the average MCC values of the feature selection method of the present invention on a Colon dataset

Detailed Description

The invention is further described below with reference to the accompanying drawings.

The invention provides a spectral feature selection method based on a characteristic local scale parameter of self-tuning algorithm thought, a characteristic local scale parameter based on a characteristic local standard deviation, entropy and cosine similarity sorting. Defining local scale parameter sigma of characteristic i based on self-tuning algorithm local scale parameter by using Gaussian kernel function as characteristic similarity measurement method_iLocal scale parameter sigma of characteristic i based on characteristic standard deviation_{std_i}Respectively using a characteristic local scale parameter sigma_iAnd σ_{std_i}As kernel function scale parameters, the feature is subjected to spectral clustering, so that the problem that the unified scale parameter sigma of the affinity matrix of the calculated feature can not accurately reflect data information and the local scale parameter sigma_iProblems affected by outliers; then, the feature importance is measured by respectively adopting entropy and cosine similarity, and the features are sortedAnd selecting one most important feature from each feature class cluster as a representative feature of the cluster to form a feature subset.

Wherein d (x)_i,x_j) Is a characteristic x_i,x_jThe Euclidean distance between;

4) calculating a normalized Laplace matrix L according to the formula (4);

σ_i＝d(x_i,x_p) (6)

2) formula (II)(7) Defining a feature affinity matrix a ═ that expresses similarity between features (a)_ij)_{i,j＝1,2,L,n}∈R^n×n；

4) calculating a normalized Laplace matrix L according to the formula (4);

The invention provides a local scale parameter sigma based on characteristics respectively_iAnd σ_{std_i}The feature spectrum clustering of the invention is combined with feature sorting based on entropy and cosine similarity to obtain 4 unsupervised feature selection algorithms based on spectrum clustering. The algorithm firstly carries out spectral clustering on the features to obtain K (m, sample number) feature clusters, then measures feature importance by respectively using entropy sorting and cosine similarity sorting, selects one most important feature of the cluster from the various clusters to represent the feature clusters, and obtains an optimal feature subset containing K (m, sample number) features.

The proposed 4 unsupervised Feature Selection algorithms (The unsupervised Feature Selection algorithm based on Spectral Clustering, FSSC) are:

a) a spectrum feature selection algorithm FSSC _ OE (FSSC based on Original sigma and entropy) adopting local scale and entropy sorting of self-tuning algorithm;

b) a spectrum feature selection algorithm FSSC _ OC (FSSC based on Original sigma and Cosine similarity) using local scale and Cosine similarity ordering of self-tuning algorithm;

c) a spectral feature selection algorithm FSSC _ SE (FSSC based on Standard development and Encopy) which adopts new scale parameters of feature local Standard Deviation and entropy sorting;

d) the spectral feature selection algorithm FSSC _ SC (FSSC based on Standard development and Cosine similarity) using the new scale parameter of the feature local Standard Deviation and Cosine similarity ranking.

After the characteristics are selected, a Support Vector Machine (SVM) is adopted as a classification tool for training samples only containing K (m, sample number) characteristics, an SVM classification model based on the selected characteristic subset is constructed, and the performance of the selected characteristic subset is evaluated according to classification indexes of the SVM classification model on a test set, such as classification accuracy, sensitivity, specificity and the like. Feature selection based on spectral clustering of 4 kinds of experimental designSelection method in 7 gene data set experimental results, and with Multi-Cluster Feature Selection method MCFS (Multi-Cluster Feature Selection)^[23]And Laplacian score feature selection method Laplacian (Laplacian score for feature selection)^[16]The experimental comparison shows that the 4 methods involved in the invention have good performance, wherein the spectrum feature selection method FSSC _ SC based on feature local standard deviation and cosine similarity feature sorting has the best effect.

The present invention relates to the definition:

new local scale parameter sigma based on characteristic local standard deviation_{std_i}: the self-adaptive characteristic spectrum clustering algorithm based on the characteristic local standard deviation is provided, the algorithm firstly transposes an original data matrix to obtain a matrix taking the characteristic as a row, and a new scale parameter sigma based on the characteristic local standard deviation is defined by adopting the formula (1) in the claims_{std_i}Based on the new scale parameter, an affinity matrix of the features is constructed using equation (2) in the claims.

Wherein, the characteristic x_rIs a characteristic x_iThe r-th nearest neighbor, the neighbor metric being in terms of euclidean distance.

Wherein d (x)_i,x_j) Is a characteristic x_i,x_jThe euclidean distance between them.

Feature selection based on entropy ordering: entropy is a measure of the degree of misordering of a system. The more disordered the system, the greater the entropy; conversely, the more ordered the system, the smaller the entropy.

The characteristic matrix U after characteristic spectrum clustering is equal to (U)_ij)_{i＝1,L,n；j＝1,L,K}∈R^n×KWherein U is_iThe ith feature is represented. According to the entropy theory, the entropy of the feature matrix U is defined as formula (8) in the claims.

Wherein, p (U)_i) Represents U_iSince the prior probability of the features is often difficult to obtain, in the calculation, we replace the prior probability with the similarity, and therefore, the formula (8) of the claim is replaced with the formula (9) of the claim.

Wherein S is_ijRepresentation characteristic U_iAnd U_jIs defined as formula (10) of the claims.

Wherein, Distance_ijRepresentation characteristic U_iAnd U_jThe calculation method of the distance is shown in formula (11) of the claims.

according to the formulas (10) to (11) in the claims, the closer the two features are, the similarity S_ijThe larger the two features are, the more similar the two features are, i.e. there is redundancy in the two features. Similar features are grouped together and one of the most representative features is selected. All representative features constitute a suitable subset of features to be selected.

Claim as follows, formula (12) E_-UsIndicating the removal of a feature U from a feature set U_sAfter thatFeature set U- { U_sEntropy of, if E_-Us f

Explanation deletion feature U_sWill cause greater disorder of the feature set U, and therefore, feature U_sSpecific characteristic U_tAnd more importantly. All features are thereby ordered according to equation (12) of the claims.

Feature selection based on cosine similarity ranking:

cosine similarity measures the difference between two samples by using the cosine value of the included angle between two sample vectors in a sample space, and is usually used for comparing documents or sequencing the documents according to a given query word vector, and the calculation method is shown as a formula (15).

Wherein cos theta is belonged to < -1,1 >, when the included angle of the vectors a and b is 0 degree, the cosine value is 1, when the included angle of the vectors a and b is 180 degrees, the cosine value of the included angle is-1, the included angle of the vectors a and b is closer to 0 degree, and the cosine value of the included angle is closer to 1. When the cosine of the included angle of the features is used for judging the feature similarity, the feature has no directivity, so that the absolute value of cos theta is taken, and the modified cosine similarity is defined as shown in a formula (16).

Therefore, in the claims we define the metric characteristic U of equation (14)_iThe importance of (c).

Wherein, | U_igU_jI represents a feature U_i，U_jAbsolute value of inner product, | U_i||×||U_j| represents a feature U_i，U_jProduct of modes, N_iRepresentation characteristic U_iThe number of features in the cluster.

After the importance of the features is obtained according to the formula (14), the features of each feature cluster obtained by clustering the feature spectrum are sorted according to the importance. Then, from the K (K ═ m, sample number) feature clusters obtained in step 7) of claim 1 (or 2), the most important features of each class cluster are selected, respectively, and feature subsets are formed, so that the optimal feature subset to be selected is obtained.

The method implemented by the invention comprises the following steps:

Inputting: data X, X₁,x₂,L,x_n}∈R^n×m，x_i(i ═ 1, L, n) is the original feature column for spectral feature selection, X is the training sample;

and (3) outputting: a selected subset of features S.

1) Calculating the characteristic x according to formula (1) of claim 1_iNew scale parameter sigma based on characteristic standard deviation_{std_i}；

2) Calculating the characteristic x according to equation (6) of claim 2_iLocal scale parameter σ of_i；

3) An affinity matrix characterized by the calculation of formula (2) according to claim 1, a ═ a (a)_ij)_{i,j＝1,2,L,n}∈R^n×n；

4) An affinity matrix a ═ of the calculated features of formula (7) according to claim 2 (a)_ij)_{i,j＝1,2,L,n}∈R^n×n；

5) Constructing a matrix degree matrix D ═ D (D) according to equation (3) of claim 1_ij)_{i,j＝1,2,L,n}∈R^n×nI.e. a matrix with only major diagonal elements, the ith diagonal element being the sum of the corresponding elements in the ith row of the characteristic affinity matrix a;

6) calculating a normalized laplacian matrix L according to equation (4) of claim 1;

7) solving eigenvalues of the normalized Laplace matrix L, selecting K eigenvectors corresponding to the top K (═ m, sample number) large eigenvalues to form a matrix V, namely V ═[v₁,v₂,L,v_K]∈R^n×K，v_i(i ═ 1,2, L, K) is the eigenvector corresponding to the ith large eigenvalue, which is the column vector;

8) equation (5) according to claim 1, standardizing the matrix V by rows, and recording the standardized matrix as U ═ U (U ═ U)_ij)_{i＝1,L,n；j＝1,L,K}∈R^n×K；

9) Performing K-means (K is m, the number of samples) clustering on the matrix U, and clustering n features into K feature clusters;

10) respectively using the entropy sorting of claim 3 and the cosine similarity sorting of claim 4 to measure the feature importance, sorting the features, selecting the most important feature from each feature cluster to represent the cluster, and obtaining a feature subset consisting of the representative features of K feature clusters;

11) the resulting 4 feature subsets are output.

The invention has the following characteristics:

1. feature spectrum clustering is proposed, features are clustered, and feature selection is achieved by combining entropy sorting and cosine similarity sorting;

2. a characteristic spectrum clustering method based on self-tuning algorithm idea is provided;

3. the Gaussian kernel function is used as a characteristic similarity measurement method, the characteristic local standard deviation is defined as a kernel function scale parameter, and the problem that the characteristic local scale parameter of the self-tuning method is influenced by outliers when a characteristic affinity matrix is calculated is solved;

4. cosine similarity sorting is adopted to measure feature importance, and a proper feature subset is quickly selected;

5. defining a characteristic entropy concept, measuring the importance of characteristics by adopting characteristic entropy, sequencing the characteristics, popularizing the entropy sequencing of samples by Dash and the like to the entropy sequencing of the characteristics, and realizing the selection of the most representative characteristics from each characteristic cluster to form an ideal characteristic subset.

Example (b):

the invention verifies the effectiveness of the newly proposed 4 feature selection methods based on spectral clustering, and compares the performance of the 4 feature selection methods with the performance of a multi-cluster feature selection method MCFS and a Laplacian score feature selection method. Therefore, feature selection is respectively realized by 4 proposed method spectrum feature selection methods of FSSC _ OE, FSSC _ OC, FSSC _ SE and FSSC _ SC, and comparison algorithms of MCFS and Laplacian score feature selection methods to obtain corresponding feature subsets, an SVM classifier is constructed for training samples only containing feature subset features, and indexes of the classifier such as accuracy, sensitivity and specificity are compared.

By adopting a 10-fold cross validation experiment, firstly, samples of each type are sequentially added into 10 sample sets one by one (the sample sets are empty initially) until each sample of the type is added, so that the samples are uniformly divided into 10 parts. Each 1 part is taken as a test sample, and the other 9 parts are taken as training samples to realize a 10-fold cross validation experiment. And sequentially selecting features on each training set, evaluating the selected feature subsets by using the corresponding test set, and evaluating the performance of the feature subsets by taking the average classification accuracy of the 10-fold experimental result.

Experimental use of Linzhiren, etc^[26]The developed SVM tool box Libsvm, the kernel Function adopts Radial Basis Function (RBF)^[27]The RBF parameter adopts a default value, and the penalty factor C is 10. The parameters of the adjacent comparison algorithms MCFS and Laplacian are set to be 5, and the similarity between vectors in the Laplacian algorithm adopts cosine similarity.

To avoid different dimensions affecting the experimental results, the data were normalized using the maximum minimization method of equation (22).

Wherein x is_ijDenotes the jth characteristic value, min (x), of the ith sample_gj) Denotes the minimum value of the jth feature, max (x)_gj) The maximum value of the jth feature is indicated.

7 Gene datasets used in the practice of the invention^[28]Described in table 1: colon cancer, CNS (central Nervous System Embryonal tumor), CNS Outcome, leukemia Leukemia, Lung adenocarcinoma Lung cancer-Michigan, lymphoma DLBCL Tumor Harvard, DLBCLOutcom.

TABLE 1 Gene data set description

Gene data set	Base factor	Number of samples (positive + negative)
			Colon	2000	62(40+22)
Leukemia	7129	72(47+25)
			DLBCL Tumor Harvard	7129	77(58+19)
DLBCL Outcome	6817	58(32+26)
			LungCancer-Michigan	7129	96(86+10)
CNS Outcome	7129	60(39+21)
			CNS	7129	90(60+30)

And (3) evaluating the feature subset by an experiment, evaluating the capability of the selected feature subset based on the performance (classification accuracy, sensitivity, specificity and MCC coefficient) of the SVM classifier of the feature subset, and further evaluating a feature selection algorithm for obtaining the feature subset. The basis of the classifier performance evaluation index is a confusion matrix (see table 2). The row corresponds to the class for which the sample actually belongs to positive (P) or negative (N), and the column indicates whether the classification result is of the positive (P ') or negative (N') class. The classifier performance evaluation index can be defined as follows based on table 2.

TABLE 2 confusion matrix

	Predicted to be of type P'	Predicted to be negative type N'
			Actually P of positive type	TP	FN
Actually negative class N	FP	TN

1) Accuracy (Accuracy):

2) sensitivity (Sensitivity):

3) specificity (Specificity):

4) matthews Correlation coefficient MCC (Matthews Correlation coefficient):

wherein, P ═ TP + FN, P ═ TP + FP, N ═ TN + FN

5)AUC^[29,30](Area Under Curve) is the Area Under the ROC (Receiver Operating Characteristic Curve, ROC Curve for short), and the AUC value is not more than 1. AUC and ROC are often used to evaluate the goodness of a binary classification model. The ROC curve is plotted on the abscissa for FPR and on the ordinate for TPR.

When the distribution of positive and negative samples in the test set is changed, the ROC curve can be kept unchanged. The phenomenon of class imbalance often occurs in actual data, and the classification result can be more truly reflected by evaluating the performance of the classification model by using the AUC, and the calculation method of the AUC is as follows:

wherein n is₀，n₁Respectively positive and negative sample numbers, n is total number of samples, r_iThe minimum sequence number is 1 for the descending sequence number of the ith sample.

As can be seen from fig. 2, the classification accuracy of FSSC _ SC and FSSC _ OC algorithms is significantly higher than that of the other four methods, and the classification accuracy curves of FSSC _ SE method and MCFS method are approximately the same, but the FSSC _ SE accuracy is slightly lower than that of MCFS overall. Since Colon is an unbalanced data set, it is more accurate to use the two indicators of MCC and AUC, and it is apparent from FIG. 3 and FIG. 4 that the curve of MCC and AUC of FSSC _ SE is higher than that of other methods. In combination, FSSC _ SC is the best performance of these several methods.

While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. The spectral feature selection method based on the local scale parameters, entropy and cosine similarity is characterized by comprising the following steps: the method comprises a characteristic spectrum clustering method based on characteristic local standard deviation and a characteristic spectrum clustering method based on self-tuning algorithm;

let X be X ═ X₁,x₂,…,x_n}∈R^m×n，x_i(i is 1, …, n) is a column vector, which is the original feature column for selecting spectral features, and m is the number of samples;

1) equation (1) defines the reflection characteristic x_i(i ═ 1,2, …, n) local standard deviation metric parameter σ_{std_i}Calculating the scale parameter σ according to equation (1)_{std_i}；

2) formula (2) defines an affinity matrix a ═ a (a) expressing the similarity between features_ij)_{i,j＝1,2,…,n}∈R^n×n；

Wherein d (x)_i,x_j) Is a characteristic x_i,x_jThe Euclidean distance between;

3) constructing a feature degree matrix D ═ D (D) according to equation (3)_ij)_{i,j＝1,2,…,n}∈R^n×nI.e. a matrix with only major diagonal elements, the ith diagonal element being the sum of the corresponding elements in the ith row of the characteristic affinity matrix a;

4) calculating a normalized Laplace matrix L according to the formula (4);

5) solving eigenvalues of a normalized Laplace matrix L, sequencing in a descending order, and selecting K eigenvectors corresponding to the first K large eigenvalues to form a matrix V, wherein K is m and the number of samples; i.e. V ═ V₁,v₂,…,v_K]∈R^n×K，v_i(i ═ 1,2, …, K) is the eigenvector for the ith large eigenvalue, a column vector;

6) standardizing the matrix V according to the formula (5), and recording the standardized matrix as U ═ U (U)_ij)_{i＝1,…,n；j＝1,…,K}∈R^n×K；

7) Performing K-means clustering on the matrix U, wherein K is m, and clustering n features into K feature clusters;

1) equation (6) defines the reflection characteristic x_i(i ═ 1,2, …, n) local scale parameter σ for local information_iCalculating the feature x according to equation (6)_iLocal scale parameter σ of_i；

σ_i＝d(x_i,x_p) (6)

2) formula (7) defines a feature affinity matrix a ═ a (a) expressing the similarity between features_ij)_{i,j＝1,2,…,n}∈R^n×n；

Wherein d (x)_i,x_j) Is a characteristic x_i,x_jThe Euclidean distance between;

4) calculating a normalized Laplace matrix L according to the formula (4);

5) solving eigenvalues of the normalized Laplace matrix L, sequencing in a descending order, and selecting K eigenvectors corresponding to the top K big eigenvalues to form a matrix V, wherein K is m, namely V is [ V ═ m₁,v₂,…,v_K]∈R^n×K，v_i(i ═ 1,2, …, K) is the eigenvector for the ith large eigenvalue, a column vector;

7) Performing K-means clustering on the matrix U, wherein the n characteristics are clustered into K characteristic clusters by K-m;

step 8) of the feature local standard deviation-based feature spectral clustering method and the self-tuning algorithm-based feature spectral clustering method measures feature importance based on entropy sorting, selects the most important feature of each feature cluster from the K feature clusters, and forms a feature subset containing the K features to realize feature selection, and specifically comprises the following steps:

1) normalized matrix U ═ U_ij)_{i＝1,…,n；j＝1,…,K}∈R^n×KWherein U is_iRepresenting the ith characteristic, and defining the entropy of a characteristic matrix U as an expression (8) according to the entropy theory;

2) let formula (12) E_-UsIndicating the removal of a feature U from a feature set U_sLater feature set U- { U_sEntropy of, if

Explanation deletion feature U_sWill cause greater disorder of the feature set U, and therefore, feature U_sThan U_tMore importantly, therefore,sorting all the features;

wherein, U/U_s＝U-{U_s}；

3) Respectively selecting the most important features of each feature cluster from the K feature clusters obtained in the step 7) of the feature local standard deviation-based feature spectral clustering method and the self-tuning algorithm-based feature spectral clustering method to form a feature subset, and obtaining the optimal feature subset to be selected;

step 8) in the feature spectral clustering method based on the feature local standard deviation and the feature spectral clustering method based on the self-tuning algorithm measures the feature importance based on cosine similarity sequencing, and for the feature with the most important feature in each feature cluster, namely the feature with the maximum sum of cosine similarities with other features of the feature cluster, is selected from the K feature clusters to serve as the representative feature of the feature cluster, so that a feature subset containing the K features is obtained, and feature selection is realized, and the specific steps are as follows:

1) normalized matrix U ═ U_ij)_{i＝1,…,n；j＝1,…,K}∈R^n×KWherein U is_iRepresenting the ith feature, defining formula (14) to measure the importance, i.e., representativeness, of each feature, the more important features are more representative;

wherein, | U_i·U_jI represents a feature U_i，U_jAbsolute value of inner product, | U_i||×||U_j| represents a feature U_i，U_jProduct of modes, N_iRepresentation characteristic U_iThe number of the characteristics of the characteristic cluster;

2) obtaining the importance value of each feature according to the formula (14), and sorting the features of each feature cluster according to the importance;

3) respectively selecting the most important features of each feature cluster from the K feature clusters obtained in the step 7) of the feature local standard deviation-based feature spectrum clustering method and the self-tuning algorithm-based feature spectrum clustering method to form a feature subset, and obtaining the optimal feature subset to be selected;

the optimal feature subset is the result of gene microarray data and text data analysis, and the optimal feature subset is used for finding gene markers of tumors.