CN107679138B - Spectral feature selection method based on local scale parameters, entropy and cosine similarity - Google Patents

Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Info

Publication number
CN107679138B
CN107679138B
Authority
CN
China
Prior art keywords
feature
characteristic
matrix
features
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710868300.5A
Other languages
Chinese (zh)
Other versions
CN107679138A (en)
Inventor
谢娟英
周颖
丁丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University filed Critical Shaanxi Normal University
Priority to CN201710868300.5A priority Critical patent/CN107679138B/en
Publication of CN107679138A publication Critical patent/CN107679138A/en
Application granted granted Critical
Publication of CN107679138B publication Critical patent/CN107679138B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification


Abstract

The invention discloses a spectral feature selection method based on local scale parameters, entropy and cosine similarity. The method adopts a Gaussian kernel function as the similarity measure and defines a feature local scale parameter based on the feature local standard deviation as the kernel function parameter, solving the problems that a unified scale parameter cannot reflect the data distribution when the feature affinity matrix is calculated and that the local scale parameter is affected by outliers. Feature importance is measured by entropy ranking and by cosine similarity ranking respectively, so that a suitable feature subset can be selected quickly. The method provides technical support for the data analysis of diseases such as tumors and has important biomedical significance.

Description

Spectral feature selection method based on local scale parameters, entropy and cosine similarity
Technical Field
The invention belongs to a gene microarray data and text data analysis technology, and relates to a spectral feature selection method based on local scale parameters, entropy and cosine similarity.
Background
Feature selection is the primary task of high-dimensional big data analysis such as gene microarray data and text data [1,2]. It aims to eliminate irrelevant or redundant features from the full feature set and to select a feature subset with good discriminating ability, so as to retain as much of the classification information of the original feature set as possible. Feature selection algorithms are divided into supervised and unsupervised methods according to whether the selection process uses sample class label information [3]. Supervised feature selection methods select features by computing the correlation between features and the class label column, whereas unsupervised feature selection methods consider the internal structure of the data and do not need class label information. In practical applications a large amount of data exists whose class labels are difficult to acquire, so research on unsupervised feature selection is particularly important.
Cluster analysis, as an unsupervised learning technique, can discover knowledge from data and reveal hidden patterns and rules [4]; introducing the clustering idea into unsupervised feature selection helps guarantee high-quality feature subsets [5]. Liu Tao et al. [6] proposed an unsupervised feature selection algorithm for text clustering that computes feature importance from a K-means clustering result using the χ² statistic or the information entropy, significantly improving text clustering performance. Traditional partitioning clustering algorithms such as K-means are suited to finding spherical clusters and often converge to a local optimum. Spectral clustering is built on spectral graph theory; it converts the clustering problem into a graph optimization problem and clusters the eigenvectors of the data similarity matrix [7,8-11], converging to a globally optimal solution [7]. According to the partition criterion, spectral clustering is divided into two-way and multi-way spectral clustering. Two-way spectral clustering can only produce a two-class partition, has high computational complexity, and uses only one eigenvector, so useful information is lost [10]. Multi-way spectral clustering algorithms use several eigenvectors simultaneously, which gathers information more comprehensively and reduces instability [12,13].
In recent years, spectral feature selection has become one of the hot topics in machine learning and pattern recognition. Zhao et al. [14], considering both the correlation between features and class labels and the correlation among features, proposed a unified feature selection framework based on spectral graph theory that covers supervised feature selection (such as the ReliefF algorithm [15]) and unsupervised feature selection (such as the Laplacian score feature selection method [16]). However, this framework lacks redundancy control and does not yield a compact feature subset. García-García et al. [17] built on this framework a general mutual-information-based spectral feature selection framework that guarantees low redundancy of the obtained feature subsets. Zhou et al. [18] proposed a spectral feature selection method that jointly considers local and global structure; it preserves the local geometric structure of the data while ensuring that different clusters are clearly dissimilar, so that representative features are selected as far as possible.
In the traditional spectral clustering algorithm NJW [11], similarity is mostly measured with a radial basis kernel function whose scale parameter σ is usually given subjectively and is globally uniform, so it cannot correctly reflect the data distribution or discover the true structure of the data. NJW-based spectral feature selection is very sensitive to the scale parameter, making scale parameter selection the key to finding the optimal feature subset. In addition, the result of spectral feature selection also depends on the feature importance measure: most existing spectral feature selection algorithms rank features with different normalized-cut criteria or iterate continuously, selecting a feature subset from the feature spectral clustering result, so the optimal feature subset cannot be selected quickly.
References
[1]Guyon I,Elisseeff A.An introduction to variable and feature selection[J].The Journal of Machine Learning Research,2003,3:1157-1182
[2]Hua Jianping,Tembe W D,Dougherty E R.Performance of feature selection methods in the classification of high-dimension data[J].Pattern Recognition,2009,42(3):409-424
[3]Xie Juanying,Xie Weixin.Supervised Feature Selection and its Application[M].Xi’an,Shaanxi Normal University publishing CNS LTD,2012,11:1-3
[4]Gao Hongchao.The Research of DPC Algorithm and Gene Selection Algorithm based on Clustering[D].Xi’an:Shaanxi Normal University,2015
[5]Xie Juanying,Gao Hongchao.Statistical Correlation and K-Means Based Distinguishable Gene Subset Selection Algorithms[J].Journal of Software,2014,25(9):2050-2075
[6]Liu Tao,Wu Gongyi,Chen Zheng.An Effective Unsupervised Feature Selection Method for Text Clustering[J].Journal of Computer Research and Development,2005,42(3):381-386
[7]Gao Yan,Gu Shiwen,Tang Jin et al.Research on Spectral Clustering in Machine Learning[J].Journal of Computer Science,2007,34(2):201-203
[8]Ulrike von Luxburg.A Tutorial on spectral clustering[J].Statistics and Computing,2007,17(4)
[9]S.A.Toussi,H.S.Yazdi.Feature Selection in Spectral Clustering[J].International Journal of Signal Processing,Image Processing and Pattern Recognition,2011,4(3):179-194
[10]J.Shi,J.Malik.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905
[11]A.Y.Ng,M.I.Jordan,Y.Weiss.On spectral clustering:analysis and an algorithm[J].Advances in neural information processing systems,2002,2:849-856
[12]C.Alpert,A.Kahng,S.Yao.Spectral partitioning:the more eigenvectors,the better[J].Discrete Applied Math,1999,90:3-26
[13]Y.Weiss.Segmentation using eigenvectors:a unifying view[C].In Proceedings of International Conference on Computer Vision,1999,9:975-982
[14]Z.Zhao,H.Liu.Spectral feature selection for supervised and unsupervised learning[C].In ICML’07:Proceedings of the 24th international conference on Machine learning,pages 1151–1157,New York,NY,USA,2007.ACM.
[15]Igor Kononenko.Estimating Attributes:Analysis and Extensions of RELIEF[C].Proceedings of the European Conference on Machine Learning,1994
[16]He X,Cai D,Niyogi P.Laplacian score for feature selection[C].Advances in neural information processing systems.2005:507-514
[17]D.García-García,R.Santos-Rodríguez.Spectral Clustering and Feature Selection for Microarray Data[C].International Conference on Machine Learning and Applications,2009:425-428
[18]Sihang Zhou,Xinwang Liu,Chengzhang Zhu et al.Spectral Clustering-based Local and Global Structure Preservation for Feature Selection[C].International Joint Conference on Neural Networks,2014
[19]Dash M,Liu H.Feature selection for clustering[M].Knowledge Discovery and Data Mining.Current Issues and New Applications.Springer Berlin Heidelberg,2000:110-121
[20]Dash M,Choi K,Scheuermann P,et al.Feature selection for clustering-a filter solution[C].Proceedings of the 2002 IEEE International Conference on Data Mining(ICDM 2002),2002:115-122
[21]Vapnik V.The nature of statistical learning theory[M].Springer Science&Business Media,2013
[22]Miranda J,Montoya R,Weber R.Linear penalization support vector machines for feature selection[C].Proceedings of International Conference on Pattern Recognition and Machine Intelligence.Berlin:Springer-Verlag,2005:188-192
[23]Cai D,Zhang C,He X.Unsupervised feature selection for multi-cluster data[C].Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining.ACM,2010:333-342
[24]Zelnik-Manor L,Perona P.Self-tuning spectral clustering[C].Advances in neural information processing systems.2004:1601-1608
[25]Jiawei Han,KAMBER M.Data Mining Concepts and Techniques[M].Second Edition.Beijing:China Machine Press,2006
[26]Chang C C,Lin C J.LIBSVM:A library for support vector machines[J].ACM Transactions on Intelligent Systems and Technology,2011,2(3):1-27
[27]Hsu C W,Chang C C,Lin C J.A practical guide to support vector classification[R].Taibei:National Taiwan University,Department of Computer Science,2003
[28]http://datam.i2r.a-star.edu.sg/datasets/krbd/
[29]Davis,J,Goadrich,M.The relationship between Precision-Recall and ROC curves[C].In Proceedings of the 23rd international conference on Machine learning ACM,2006,6:233-240
[30]Fawcett,T.An introduction to ROC analysis[J].Pattern recognition letters,2006,27(8):861-874.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a spectral feature selection method based on local scale parameters, entropy and cosine similarity; it overcomes the defect that a uniform scale parameter σ cannot completely and accurately reflect the data distribution when the feature affinity matrix is calculated, and at the same time overcomes the problem that the feature local scale parameter σ_i of the self-tuning method is affected by outliers when the feature affinity matrix is calculated.
In order to achieve the purpose, the invention adopts the following scheme:
the spectral feature selection method based on the local scale parameters, entropy and cosine similarity comprises a feature spectral clustering method based on the feature local standard deviation and a feature spectral clustering method based on the self-tuning algorithm;
let X = {x_1, x_2, …, x_n} ∈ R^{m×n}, where x_i (i = 1, …, n) is a column vector, namely an original feature column for spectral feature selection, and m is the number of samples;
the characteristic spectrum clustering method based on the characteristic local standard deviation comprises the following steps:
1) formula (1) defines the local standard deviation scale parameter σ_std_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the new scale parameter σ_std_i according to formula (1):
σ_std_i = sqrt( (1/k) · Σ_{r=1}^{k} d²(x_i, x_r) )    (1)
where feature x_r is the r-th nearest neighbor of feature x_i, k is the number of nearest neighbors considered, and neighbors are measured by Euclidean distance;
2) formula (2) defines the affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_std_i · σ_std_j) ) for i ≠ j, and a_ii = 0    (2)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the K largest eigenvalues (K = m, the number of samples) to form the matrix V = [v_1, v_2, …, v_K] ∈ R^{n×K}, where v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) performing K-means clustering (K = m) on the matrix U, clustering the n features into K feature clusters;
8) measuring feature importance with entropy ranking and with cosine similarity ranking respectively, ranking the features, selecting the most important feature of each feature cluster to represent that cluster, and obtaining a feature subset composed of the representative features of the K feature clusters;
the self-tuning algorithm based feature spectrum clustering method comprises the following steps:
1) formula (6) defines the local scale parameter σ_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the local scale parameter σ_i of feature x_i according to formula (6):
σ_i = d(x_i, x_p)    (6)
where feature x_p is the p-th nearest neighbor of feature x_i, and d(x_i, x_p) is the Euclidean distance from feature x_i to feature x_p;
2) formula (7) defines the feature affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_i · σ_j) ) for i ≠ j, and a_ii = 0    (7)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the K largest eigenvalues (K = m) to form the matrix V = [v_1, v_2, …, v_K] ∈ R^{n×K}, where v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) performing K-means clustering (K = m) on the matrix U, clustering the n features into K feature clusters;
8) measuring feature importance with entropy ranking and with cosine similarity ranking respectively, ranking the features, selecting the most important feature of each feature cluster to represent that cluster, and obtaining a feature subset composed of the representative features of the K feature clusters.
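The construction of the two local scale parameters and of the feature affinity matrix can be illustrated with a short NumPy sketch. This is only an illustrative sketch under assumptions — the neighborhood size k, the function name and the use of SciPy's pdist are choices made for the example, and the local-standard-deviation form shown is one plausible reading of formula (1) — not the patented implementation.

```python
# Illustrative sketch (not the patented implementation): feature affinity
# matrices built with the two local scale parameters described above.
# Assumption: features are the columns of X (m samples x n features),
# and k is a hypothetical neighborhood size.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def feature_affinity(X, k=7, use_local_std=True):
    F = X.T                                            # rows = features
    D = squareform(pdist(F, metric="euclidean"))       # pairwise feature distances
    knn_d = np.sort(D, axis=1)[:, 1:k + 1]             # distances to the k nearest neighbours (self excluded)
    if use_local_std:
        # sigma_std_i: local standard deviation of the nearest-neighbour distances (one reading of formula (1))
        sigma = np.sqrt(np.mean(knn_d ** 2, axis=1))
    else:
        # sigma_i: distance to the p-th (here k-th) nearest neighbour, the self-tuning scale of formula (6)
        sigma = knn_d[:, -1]
    A = np.exp(-(D ** 2) / (np.outer(sigma, sigma) + 1e-12))   # formulas (2)/(7)
    np.fill_diagonal(A, 0.0)                           # a_ii = 0
    return A
```

With use_local_std=False the sketch reproduces the self-tuning scale of formula (6); with use_local_std=True it averages over the whole neighborhood, which is less sensitive to a single outlying neighbor.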
Further, step 8) measures the feature importance based on entropy sorting, selects the most important feature of each feature cluster from the K feature clusters, forms a feature subset containing the K features, and realizes feature selection, specifically comprising the following steps:
1) let U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}, be the feature matrix after feature spectral clustering, where U_i denotes the i-th feature (the i-th row of U); according to entropy theory, the entropy of the feature matrix U is defined as formula (8):
E(U) = −Σ_{i=1}^{n} p(U_i) · log p(U_i)    (8)
where p(U_i) denotes the prior probability of feature U_i; because the prior probability of a feature is often difficult to obtain, the similarity is used in its place in the calculation, so formula (8) is replaced by formula (9):
E = −Σ_{i=1}^{n} Σ_{j=1}^{n} ( S_ij · log S_ij + (1 − S_ij) · log(1 − S_ij) )    (9)
where S_ij denotes the similarity between features U_i and U_j, defined as formula (10):
S_ij = exp( −α · Distance_ij )    (10)
where α is a positive constant;
where Distance_ij denotes the distance between features U_i and U_j, calculated according to formula (11):
Distance_ij = sqrt( Σ_{k=1}^{K} ( (u_ik − u_jk) / (max_k − min_k) )² )    (11)
where max_k and min_k denote, respectively, the maximum and minimum values over all features at the k-th sample (the k-th column of U);
2) let E_{−Us}, defined by formula (12), denote the entropy of the feature set U − {U_s} obtained after removing feature U_s from the feature set U; if
E_{−Us} > E_{−Ut}
then deleting feature U_s causes greater disorder of the feature set U, and therefore feature U_s is more important than feature U_t; all features are thereby ranked according to formula (12):
E_{−Us} = −Σ_{U_i ∈ U/U_s} Σ_{U_j ∈ U/U_s, j ≠ i} ( S_ij · log S_ij + (1 − S_ij) · log(1 − S_ij) )    (12)
where U/U_s = U − {U_s};
3) from the K feature clusters obtained in step 7), selecting the most important feature of each feature cluster to form the feature subset, thereby obtaining the optimal candidate feature subset.
Further, step 8) measures feature importance based on cosine similarity ranking, and selects the most important feature in each feature cluster from the K feature clusters, that is, the feature with the largest sum of cosine similarities with other features of the cluster, as the representative feature of the feature cluster, so as to obtain a feature subset including K features, thereby implementing feature selection, specifically including the following steps:
1) let U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}, be the feature matrix after feature spectral clustering, where U_i denotes the i-th feature; formula (14) is defined to measure the importance, i.e. the representativeness, of each feature (the more important a feature is, the more representative it is):
Importance(U_i) = (1/N_i) · Σ_{U_j ∈ C(U_i)} |U_i · U_j| / ( ||U_i|| × ||U_j|| )    (14)
where |U_i · U_j| denotes the absolute value of the inner product of features U_i and U_j, ||U_i|| × ||U_j|| denotes the product of their norms, C(U_i) denotes the feature cluster containing U_i, and N_i is the number of features in that cluster;
2) obtaining the importance value of each feature according to formula (14), and ranking the features of each feature cluster by importance;
3) from the K (K = m) feature clusters obtained in step 7) of claim 1 (or 2), selecting the most important feature of each feature cluster to form the feature subset, thereby obtaining the optimal candidate feature subset.
The invention has the following beneficial effects:
Aiming at the problems of existing spectral feature selection, the self-tuning idea is first introduced to propose feature spectral clustering based on feature local scale parameters; then, to address the problem that the local scale parameter of that algorithm is affected by outliers, a feature spectral clustering algorithm based on the feature local standard deviation is proposed. The two feature spectral clustering methods based on feature local scale parameters are each used to spectrally cluster the features into K (K = m, the number of samples) feature clusters. On this basis, feature ranking methods based on entropy and on cosine similarity are proposed: the features of each cluster are ranked, the most important feature of each cluster is selected as the representative feature of that cluster, and the representative features of all clusters form the optimal feature subset. Feature selection is thus realized, irrelevant and redundant features are removed, and the recognition rate and stability of the system are improved.
The invention has good effects in the aspects of diagnosis of tumor patients and identification and application of tumor gene markers, and specifically comprises the following steps:
(1) the method adopts a Gaussian kernel function as the similarity measure and applies the self-tuning idea to feature spectral clustering, overcoming the defect that, when traditional spectral clustering is applied to feature clustering, the uniform scale parameter σ used to calculate the feature affinity matrix cannot accurately reflect the data distribution and thus distorts the experimental results;
(2) the feature local standard deviation is defined as the kernel scale parameter to realize feature spectral clustering, overcoming both the defect that the uniform scale parameter σ cannot accurately reflect the data distribution when the feature affinity matrix is calculated and the problem that the local scale parameter σ_i of the self-tuning method is affected by outliers;
(3) entropy is adopted to define feature importance and rank the features, so that the most important feature of each feature cluster obtained from the feature spectral clustering result can be selected quickly to form an ideal feature subset;
(4) cosine similarity ranking is adopted to measure feature importance, so that the most important feature of each feature cluster obtained from the feature spectral clustering result can be selected quickly to form a suitable feature subset;
(5) the feature selection method provided by the invention can select an effective discriminative feature subset to discover tumor gene markers; it achieves high classification performance in the analysis of tumor gene expression profile data, provides technical support for the data analysis of diseases such as tumors, and has important biomedical significance.
Drawings
FIG. 1 is a flow chart of the application of the new spectral feature selection method of the invention to tumor gene expression profile data
FIG. 2 is a graph of the average classification accuracy of the feature selection method of the present invention on a Colon dataset
FIG. 3 is a graph of the mean AUC values of the feature selection method of the present invention on a Colon data set
FIG. 4 is a graph of the average MCC values of the feature selection method of the present invention on a Colon dataset
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a spectral feature selection method based on a feature local scale parameter following the self-tuning idea, a feature local scale parameter based on the feature local standard deviation, and entropy and cosine similarity ranking. With a Gaussian kernel function as the feature similarity measure, the local scale parameter σ_i of feature i is defined from the self-tuning local scale parameter, and the local scale parameter σ_std_i of feature i is defined from the feature local standard deviation. Using σ_i and σ_std_i respectively as the kernel scale parameter, the features are spectrally clustered, which solves the problem that the uniform scale parameter σ used to calculate the feature affinity matrix cannot accurately reflect the data distribution, as well as the problem that the local scale parameter σ_i is affected by outliers. Then feature importance is measured by entropy and by cosine similarity respectively, the features are ranked, and the most important feature of each feature cluster is selected as the representative feature of that cluster to form the feature subset.
The spectral feature selection method based on the local scale parameters, entropy and cosine similarity comprises a feature spectral clustering method based on the feature local standard deviation and a feature spectral clustering method based on the self-tuning algorithm;
let X = {x_1, x_2, …, x_n} ∈ R^{m×n}, where x_i (i = 1, …, n) is a column vector, namely an original feature column for spectral feature selection, and m is the number of samples;
the characteristic spectrum clustering method based on the characteristic local standard deviation comprises the following steps:
1) formula (1) defines the local standard deviation scale parameter σ_std_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the new scale parameter σ_std_i according to formula (1):
σ_std_i = sqrt( (1/k) · Σ_{r=1}^{k} d²(x_i, x_r) )    (1)
where feature x_r is the r-th nearest neighbor of feature x_i, k is the number of nearest neighbors considered, and neighbors are measured by Euclidean distance;
2) formula (2) defines the affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_std_i · σ_std_j) ) for i ≠ j, and a_ii = 0    (2)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the K largest eigenvalues (K = m, the number of samples) to form the matrix V = [v_1, v_2, …, v_K] ∈ R^{n×K}, where v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) performing K-means clustering (K = m) on the matrix U, clustering the n features into K feature clusters;
8) measuring feature importance with entropy ranking and with cosine similarity ranking respectively, ranking the features, selecting the most important feature of each feature cluster to represent that cluster, and obtaining a feature subset composed of the representative features of the K feature clusters;
the self-tuning algorithm based feature spectrum clustering method comprises the following steps:
1) formula (6) defines the local scale parameter σ_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the local scale parameter σ_i of feature x_i according to formula (6):
σ_i = d(x_i, x_p)    (6)
where feature x_p is the p-th nearest neighbor of feature x_i, and d(x_i, x_p) is the Euclidean distance from feature x_i to feature x_p;
2) formula (7) defines the feature affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_i · σ_j) ) for i ≠ j, and a_ii = 0    (7)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the K largest eigenvalues (K = m) to form the matrix V = [v_1, v_2, …, v_K] ∈ R^{n×K}, where v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) performing K-means clustering (K = m) on the matrix U, clustering the n features into K feature clusters;
8) measuring feature importance with entropy ranking and with cosine similarity ranking respectively, ranking the features, selecting the most important feature of each feature cluster to represent that cluster, and obtaining a feature subset composed of the representative features of the K feature clusters.
The invention combines the feature spectral clustering based on the feature local scale parameters σ_i and σ_std_i with feature ranking based on entropy and on cosine similarity, obtaining 4 unsupervised feature selection algorithms based on spectral clustering. Each algorithm first spectrally clusters the features into K (K = m, the number of samples) feature clusters, then measures feature importance with entropy ranking and with cosine similarity ranking respectively, selects the most important feature of each cluster as its representative, and obtains an optimal feature subset containing K (K = m) features.
The 4 proposed unsupervised feature selection algorithms (The unsupervised Feature Selection algorithm based on Spectral Clustering, FSSC) are:
a) the spectral feature selection algorithm FSSC_OE (FSSC based on Original sigma and Entropy), which adopts the self-tuning local scale and entropy ranking;
b) the spectral feature selection algorithm FSSC_OC (FSSC based on Original sigma and Cosine similarity), which adopts the self-tuning local scale and cosine similarity ranking;
c) the spectral feature selection algorithm FSSC_SE (FSSC based on Standard deviation and Entropy), which adopts the new scale parameter based on the feature local standard deviation and entropy ranking;
d) the spectral feature selection algorithm FSSC_SC (FSSC based on Standard deviation and Cosine similarity), which adopts the new scale parameter based on the feature local standard deviation and cosine similarity ranking.
After feature selection, a support vector machine (SVM) is adopted as the classification tool: an SVM classification model is trained on the training samples containing only the K (K = m, the number of samples) selected features, and the performance of the selected feature subset is evaluated by the classification indexes of this SVM model on the test set, such as classification accuracy, sensitivity and specificity. The 4 designed spectral-clustering-based feature selection methods were tested on 7 gene datasets and compared with the multi-cluster feature selection method MCFS [23] and the Laplacian score feature selection method [16]. The experiments show that the 4 methods of the invention perform well, among which the spectral feature selection method FSSC_SC based on the feature local standard deviation and cosine similarity feature ranking performs best.
Definitions involved in the invention:
new local scale parameter sigma based on characteristic local standard deviationstd_i: the self-adaptive characteristic spectrum clustering algorithm based on the characteristic local standard deviation is provided, the algorithm firstly transposes an original data matrix to obtain a matrix taking the characteristic as a row, and a new scale parameter sigma based on the characteristic local standard deviation is defined by adopting the formula (1) in the claimsstd_iBased on the new scale parameter, an affinity matrix of the features is constructed using equation (2) in the claims.
Figure BDA0001416610000000131
Wherein, the characteristic xrIs a characteristic xiThe r-th nearest neighbor, the neighbor metric being in terms of euclidean distance.
2) Formula (2) defines an affinity matrix a ═ a (a) expressing the similarity between featuresij)i,j=1,2,L,n∈Rn×n
Figure BDA0001416610000000132
Wherein d (x)i,xj) Is a characteristic xi,xjThe euclidean distance between them.
Feature selection based on entropy ranking: entropy is a measure of the degree of disorder of a system. The more disordered the system, the greater the entropy; conversely, the more ordered the system, the smaller the entropy.
The feature matrix after feature spectral clustering is U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}, where U_i denotes the i-th feature. According to entropy theory, the entropy of the feature matrix U is defined as formula (8) of the claims:
E(U) = −Σ_{i=1}^{n} p(U_i) · log p(U_i)    (8)
Wherein, p (U)i) Represents UiSince the prior probability of the features is often difficult to obtain, in the calculation, we replace the prior probability with the similarity, and therefore, the formula (8) of the claim is replaced with the formula (9) of the claim.
Figure BDA0001416610000000142
Wherein S isijRepresentation characteristic UiAnd UjIs defined as formula (10) of the claims.
Figure BDA0001416610000000143
Wherein, DistanceijRepresentation characteristic UiAnd UjThe calculation method of the distance is shown in formula (11) of the claims.
Figure BDA0001416610000000144
Therein, maxkAnd minkRespectively representing the maximum value and the minimum value of all the characteristics of the kth sample;
According to formulas (10)–(11) of the claims, the closer two features are, the larger their similarity S_ij, i.e. the more similar the two features and hence the more redundant they are. Similar features are grouped into the same cluster and one most representative feature is selected from it; all representative features constitute the candidate feature subset.
Claim as follows, formula (12) E-UsIndicating the removal of a feature U from a feature set UsAfter thatFeature set U- { UsEntropy of, if E-Us f
Figure BDA0001416610000000145
Explanation deletion feature UsWill cause greater disorder of the feature set U, and therefore, feature UsSpecific characteristic UtAnd more importantly. All features are thereby ordered according to equation (12) of the claims.
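A small NumPy sketch of this entropy-based ranking is given below for illustration. It is a sketch under assumptions — the similarity constant alpha, the use of base-2 logarithms and the helper names are choices made for the example — and not the patented code.

```python
# Illustrative sketch of the entropy-based feature ranking (Dash-Liu style
# entropy [19,20]); alpha and the function names are assumptions.
import numpy as np

def entropy_of(U, alpha=0.5):
    """Entropy of a feature set whose rows are features (cf. formulas (9)-(11))."""
    rng = U.max(axis=0) - U.min(axis=0) + 1e-12          # max_k - min_k per coordinate
    diff = (U[:, None, :] - U[None, :, :]) / rng          # normalised coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=2))                # Distance_ij, formula (11)
    S = np.clip(np.exp(-alpha * dist), 1e-12, 1 - 1e-12)   # S_ij, formula (10)
    return -np.sum(S * np.log2(S) + (1 - S) * np.log2(1 - S))   # formula (9)

def entropy_importance(U):
    """Score each feature by the entropy of the set after removing it (formula (12));
    a larger entropy after removal means the feature is more important."""
    n = U.shape[0]
    return np.array([entropy_of(np.delete(U, s, axis=0)) for s in range(n)])
```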
Feature selection based on cosine similarity ranking:
Cosine similarity measures the difference between two samples by the cosine of the angle between the two sample vectors in the sample space; it is usually used for comparing documents or ranking documents with respect to a given query word vector, and is computed as formula (15):
cos θ = (a · b) / ( ||a|| × ||b|| )    (15)
where cos θ ∈ [−1, 1]; when the angle between vectors a and b is 0°, the cosine is 1, and when the angle is 180°, the cosine is −1; the closer the angle between a and b is to 0°, the closer the cosine is to 1. When the cosine of the angle between features is used to judge feature similarity, features have no directionality, so the absolute value of cos θ is taken, and the modified cosine similarity is defined as formula (16):
|cos θ| = |a · b| / ( ||a|| × ||b|| )    (16)
Therefore, in the claims we define the metric characteristic U of equation (14)iThe importance of (c).
Figure BDA0001416610000000153
Wherein, | UigUjI represents a feature Ui,UjAbsolute value of inner product, | Ui||×||Uj| represents a feature Ui,UjProduct of modes, NiRepresentation characteristic UiThe number of features in the cluster.
After the importance of each feature is obtained from formula (14), the features of each cluster produced by feature spectral clustering are ranked by importance. Then, from the K (K = m, the number of samples) feature clusters obtained in step 7) of claim 1 (or 2), the most important feature of each cluster is selected to form the feature subset, giving the optimal candidate feature subset.
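For illustration, the cosine-similarity importance of formula (14) and the selection of one representative feature per cluster can be sketched as follows; the argument labels (the K-means cluster label of each feature) and the function names are assumptions made for the example, not the patented code.

```python
# Illustrative sketch of the cosine-similarity importance of formula (14)
# and of picking one representative feature per cluster.
import numpy as np

def cosine_importance(U, labels):
    norms = np.linalg.norm(U, axis=1) + 1e-12
    C = np.abs(U @ U.T) / np.outer(norms, norms)       # |U_i . U_j| / (||U_i|| * ||U_j||)
    importance = np.empty(U.shape[0])
    for i in range(U.shape[0]):
        members = np.where(labels == labels[i])[0]      # features in the same cluster as U_i
        importance[i] = C[i, members].mean()            # average modified cosine similarity
    return importance

def representatives(importance, labels):
    # the most important feature of each cluster represents that cluster
    return [int(np.argmax(np.where(labels == c, importance, -np.inf)))
            for c in np.unique(labels)]
```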
The method implemented by the invention comprises the following steps:
FIG. 1 is a flow chart of the application of the new spectral feature selection method of the invention to tumor gene expression profile data
Input: data X = {x_1, x_2, …, x_n} ∈ R^{m×n}, where x_i (i = 1, …, n) is an original feature column for spectral feature selection; X is the training sample set;
Output: the selected feature subset S.
1) calculating, according to formula (1) of claim 1, the new scale parameter σ_std_i of feature x_i based on the feature local standard deviation;
2) calculating, according to formula (6) of claim 2, the local scale parameter σ_i of feature x_i;
3) calculating, according to formula (2) of claim 1, the feature affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n};
4) calculating, according to formula (7) of claim 2, the feature affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n};
5) constructing, according to formula (3) of claim 1, the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A;
6) calculating the normalized Laplacian matrix L according to formula (4) of claim 1;
7) solving the eigenvalues of the normalized Laplacian matrix L and selecting the K eigenvectors corresponding to the K largest eigenvalues (K = m, the number of samples) to form the matrix V = [v_1, v_2, …, v_K] ∈ R^{n×K}, where v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
8) standardizing the matrix V by rows according to formula (5) of claim 1, and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K};
9) performing K-means clustering (K = m, the number of samples) on the matrix U, clustering the n features into K feature clusters;
10) measuring feature importance with the entropy ranking of claim 3 and with the cosine similarity ranking of claim 4 respectively, ranking the features, and selecting the most important feature of each feature cluster to represent that cluster, obtaining feature subsets composed of the representative features of the K feature clusters;
11) outputting the resulting 4 feature subsets.
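Steps 1)–11) can be sketched end to end as follows. This is a simplified illustration under assumptions — it reuses the helper sketches given earlier (feature_affinity, entropy_importance, cosine_importance, representatives), uses numpy.linalg.eigh for the eigen-decomposition and scikit-learn's KMeans for step 9) — and is not the exact patented implementation.

```python
# Illustrative end-to-end sketch of one FSSC variant: spectrally cluster the
# features, then keep one representative feature per cluster.
import numpy as np
from sklearn.cluster import KMeans

def fssc_select(X, k=7, use_local_std=True, ranking="cosine"):
    m, n = X.shape                                   # m samples, n features
    K = m                                            # number of feature clusters = number of samples
    A = feature_affinity(X, k=k, use_local_std=use_local_std)    # formulas (1)/(6) and (2)/(7)
    d = A.sum(axis=1)                                             # degree matrix, formula (3)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = D_inv_sqrt @ A @ D_inv_sqrt                               # formula (4)
    eigvals, eigvecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :K]                                   # eigenvectors of the K largest eigenvalues
    U = V / (np.linalg.norm(V, axis=1, keepdims=True) + 1e-12)    # row normalization, formula (5)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(U)       # step 9)
    if ranking == "cosine":
        importance = cosine_importance(U, labels)                 # formula (14)
    else:
        importance = entropy_importance(U)                        # formula (12)
    return sorted(representatives(importance, labels))            # one representative feature per cluster
```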
The invention has the following characteristics:
1. Feature spectral clustering is proposed: the features are clustered, and feature selection is achieved in combination with entropy ranking and cosine similarity ranking;
2. A feature spectral clustering method based on the self-tuning idea is proposed;
3. The Gaussian kernel function is used as the feature similarity measure and the feature local standard deviation is defined as the kernel scale parameter, solving the problem that the feature local scale parameter of the self-tuning method is affected by outliers when the feature affinity matrix is calculated;
4. Cosine similarity ranking is adopted to measure feature importance, so that a suitable feature subset is selected quickly;
5. A feature entropy concept is defined and feature importance is measured by feature entropy to rank the features, generalizing the sample entropy ranking of Dash et al. to entropy ranking of features, so that the most representative feature of each feature cluster is selected to form an ideal feature subset.
Example:
The effectiveness of the 4 newly proposed spectral-clustering-based feature selection methods is verified, and their performance is compared with that of the multi-cluster feature selection method MCFS and the Laplacian score feature selection method. Feature selection is performed with the 4 proposed spectral feature selection methods FSSC_OE, FSSC_OC, FSSC_SE and FSSC_SC and with the comparison algorithms MCFS and Laplacian score to obtain the corresponding feature subsets; an SVM classifier is then trained on the training samples containing only the selected features, and classifier indexes such as accuracy, sensitivity and specificity are compared.
A 10-fold cross-validation experiment is adopted. First, the samples of each class are added one by one to 10 sample sets in turn (the sets are initially empty) until every sample of that class has been assigned, so that the samples are evenly divided into 10 parts. Each part is used in turn as the test set while the other 9 parts serve as the training set, realizing 10-fold cross-validation. Feature selection is carried out on each training set, the selected feature subset is evaluated on the corresponding test set, and the performance of the feature subsets is assessed by the average classification accuracy over the 10 folds.
The experiments use the SVM toolbox LIBSVM developed by Lin Chih-Jen et al. [26]; the kernel function is the radial basis function (RBF) [27], the RBF parameter takes its default value, and the penalty factor is C = 10. The number-of-neighbors parameter of the comparison algorithms MCFS and Laplacian is set to 5, and the similarity between vectors in the Laplacian algorithm uses cosine similarity.
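A rough equivalent of this experimental setup — stratified 10-fold cross-validation with an RBF-kernel SVM and C = 10 — can be sketched with scikit-learn as follows; the use of scikit-learn instead of the LIBSVM toolbox, the gamma setting and the function names are assumptions made for the example.

```python
# Rough evaluation sketch: stratified 10-fold cross-validation with an
# RBF-kernel SVM and C = 10, mirroring the settings described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def evaluate(X, y, select_features, k_folds=10):
    accs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=k_folds, shuffle=True).split(X, y):
        cols = select_features(X[train_idx])            # feature selection on the training fold only
        clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X[train_idx][:, cols], y[train_idx])
        accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx][:, cols])))
    return float(np.mean(accs))                          # average classification accuracy over the folds
```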
To prevent different dimensions from affecting the experimental results, the data were normalized with the max-min method of formula (22):
x'_ij = ( x_ij − min(x_gj) ) / ( max(x_gj) − min(x_gj) )    (22)
Wherein x isijDenotes the jth characteristic value, min (x), of the ith samplegj) Denotes the minimum value of the jth feature, max (x)gj) The maximum value of the jth feature is indicated.
The 7 gene datasets [28] used in the practice of the invention are described in Table 1: colon cancer Colon, central nervous system embryonal tumor CNS, CNS Outcome, leukemia Leukemia, lung adenocarcinoma LungCancer-Michigan, lymphoma DLBCL Tumor Harvard, and DLBCL Outcome.
TABLE 1 Gene data set description
Gene dataset    Number of genes    Number of samples (positive + negative)
Colon 2000 62(40+22)
Leukemia 7129 72(47+25)
DLBCL Tumor Harvard 7129 77(58+19)
DLBCL Outcome 6817 58(32+26)
LungCancer-Michigan 7129 96(86+10)
CNS Outcome 7129 60(39+21)
CNS 7129 90(60+30)
The experiments evaluate the feature subsets: the capability of a selected feature subset is judged by the performance (classification accuracy, sensitivity, specificity and MCC) of the SVM classifier built on it, which in turn evaluates the feature selection algorithm that produced the subset. The classifier performance indexes are based on the confusion matrix (see Table 2), whose rows correspond to the actual class of a sample, positive (P) or negative (N), and whose columns indicate whether the classification result is positive (P') or negative (N'). Based on Table 2, the classifier performance indexes are defined as follows.
TABLE 2 Confusion matrix
                        Predicted positive P'    Predicted negative N'
Actually positive P     TP                       FN
Actually negative N     FP                       TN
1) Accuracy (Accuracy):
Accuracy = (TP + TN) / (TP + FP + TN + FN)
2) sensitivity (Sensitivity):
Sensitivity = TP / (TP + FN)
3) specificity (Specificity):
Specificity = TN / (TN + FP)
4) Matthews correlation coefficient MCC (Matthews Correlation Coefficient):
MCC = (TP × TN − FP × FN) / sqrt( P × P' × N × N' )
where P = TP + FN, P' = TP + FP, N = FP + TN, N' = FN + TN
5) AUC [29,30] (Area Under Curve) is the area under the ROC curve (Receiver Operating Characteristic curve); the AUC value is not greater than 1. AUC and ROC are often used to evaluate the quality of a binary classification model. The ROC curve plots FPR on the abscissa and TPR on the ordinate:
TPR = TP / (TP + FN) = TP / P
FPR = FP / (FP + TN) = FP / N
When the distribution of positive and negative samples in the test set changes, the ROC curve remains essentially unchanged. Class imbalance often occurs in real data, and evaluating the classification model with AUC reflects the classification result more faithfully; the AUC is calculated as follows:
AUC = ( n_0 · n_1 + n_0 · (n_0 + 1)/2 − Σ_{i ∈ positive class} r_i ) / (n_0 · n_1)
where n_0 and n_1 are the numbers of positive and negative samples respectively, n is the total number of samples, and r_i is the rank of the i-th sample when the samples are sorted in descending order of predicted score (the smallest rank is 1).
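The evaluation indexes above can be illustrated with the following sketch; the variable names are assumptions, and the rank-based AUC follows the descending-rank convention described above.

```python
# Illustrative computation of the evaluation indexes from a confusion matrix
# and of the rank-based AUC; a sketch, not the original experimental code.
import numpy as np

def confusion_metrics(tp, fn, fp, tn):
    acc  = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)                                  # sensitivity (TPR)
    spec = tn / (tn + fp)                                  # specificity
    mcc  = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn) + 1e-12)
    return acc, sens, spec, mcc

def auc_from_scores(scores, y):
    # ranks assigned in descending order of score (rank 1 = highest score)
    order = np.argsort(-scores)
    ranks = np.empty_like(order, dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n0, n1 = int((y == 1).sum()), int((y == 0).sum())
    r_pos = ranks[y == 1].sum()
    return (n0 * n1 + n0 * (n0 + 1) / 2 - r_pos) / (n0 * n1)
```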
As can be seen from FIG. 2, the classification accuracy of the FSSC_SC and FSSC_OC algorithms is significantly higher than that of the other four methods; the classification accuracy curves of the FSSC_SE method and the MCFS method are approximately the same, although the accuracy of FSSC_SE is slightly lower than that of MCFS overall. Since Colon is an imbalanced dataset, the MCC and AUC indicators are more informative, and FIG. 3 and FIG. 4 show that the MCC and AUC curves of FSSC_SE are higher than those of the other methods. Taken together, FSSC_SC gives the best performance among these methods.
While the invention has been described in further detail with reference to specific preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (1)

1. The spectral feature selection method based on the local scale parameters, entropy and cosine similarity is characterized by comprising the following steps: the method comprises a characteristic spectrum clustering method based on characteristic local standard deviation and a characteristic spectrum clustering method based on self-tuning algorithm;
let X = {x_1, x_2, …, x_n} ∈ R^{m×n}, where x_i (i = 1, …, n) is a column vector, namely an original feature column for spectral feature selection, and m is the number of samples;
the characteristic spectrum clustering method based on the characteristic local standard deviation comprises the following steps:
1) formula (1) defines the local standard deviation scale parameter σ_std_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the scale parameter σ_std_i according to formula (1):
σ_std_i = sqrt( (1/k) · Σ_{r=1}^{k} d²(x_i, x_r) )    (1)
where feature x_r is the r-th nearest neighbor of feature x_i, k is the number of nearest neighbors considered, and neighbors are measured by Euclidean distance;
2) formula (2) defines the affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_std_i · σ_std_j) ) for i ≠ j, and a_ii = 0    (2)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the first K largest eigenvalues to form a matrix V, where K = m, the number of samples; i.e. V = [v_1, v_2, …, v_K] ∈ R^{n×K}, and v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) Performing K-means clustering on the matrix U, wherein K is m, and clustering n features into K feature clusters;
8) measuring the importance of the features by respectively using entropy sorting and cosine similarity sorting, sorting the features, selecting the most important feature of the cluster from each feature cluster to represent the cluster, and obtaining a feature subset consisting of the representative features of K feature clusters;
the self-tuning algorithm based feature spectrum clustering method comprises the following steps:
1) formula (6) defines the local scale parameter σ_i reflecting the local information of feature x_i (i = 1, 2, …, n); calculating the local scale parameter σ_i of feature x_i according to formula (6):
σ_i = d(x_i, x_p)    (6)
where feature x_p is the p-th nearest neighbor of feature x_i, and d(x_i, x_p) is the Euclidean distance from feature x_i to feature x_p;
2) formula (7) defines the feature affinity matrix A = (a_ij), i, j = 1, 2, …, n, A ∈ R^{n×n}, expressing the similarity between features:
a_ij = exp( −d²(x_i, x_j) / (σ_i · σ_j) ) for i ≠ j, and a_ii = 0    (7)
where d(x_i, x_j) is the Euclidean distance between features x_i and x_j;
3) constructing the feature degree matrix D = (d_ij), i, j = 1, 2, …, n, D ∈ R^{n×n}, according to formula (3), i.e. a diagonal matrix whose i-th diagonal element is the sum of the elements in the i-th row of the feature affinity matrix A:
d_ii = Σ_{j=1}^{n} a_ij, and d_ij = 0 for i ≠ j    (3)
4) calculating the normalized Laplacian matrix L according to formula (4):
L = D^(−1/2) · A · D^(−1/2)    (4)
5) solving the eigenvalues of the normalized Laplacian matrix L, sorting them in descending order, and selecting the K eigenvectors corresponding to the first K largest eigenvalues to form a matrix V, where K = m; i.e. V = [v_1, v_2, …, v_K] ∈ R^{n×K}, and v_i (i = 1, 2, …, K) is the column eigenvector corresponding to the i-th largest eigenvalue;
6) standardizing the matrix V by rows according to formula (5), and recording the standardized matrix as U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}:
u_ij = v_ij / ( Σ_{k=1}^{K} v_ik² )^(1/2)    (5)
7) performing K-means clustering on the matrix U, where K = m, clustering the n features into K feature clusters;
8) measuring the importance of the features by respectively using entropy sorting and cosine similarity sorting, sorting the features, selecting the most important feature of the cluster from each feature cluster to represent the cluster, and obtaining a feature subset consisting of the representative features of K feature clusters;
step 8) of the feature local standard deviation-based feature spectral clustering method and the self-tuning algorithm-based feature spectral clustering method measures feature importance based on entropy sorting, selects the most important feature of each feature cluster from the K feature clusters, and forms a feature subset containing the K features to realize feature selection, and specifically comprises the following steps:
1) the normalized matrix U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}, where U_i denotes the i-th feature; according to entropy theory, the entropy of the feature matrix U is defined as formula (8):
E(U) = −Σ_{i=1}^{n} p(U_i) · log p(U_i)    (8)
where p(U_i) denotes the prior probability of feature U_i; because the prior probability of a feature is often difficult to obtain, the similarity is used in its place in the calculation, so formula (8) is replaced by formula (9):
E = −Σ_{i=1}^{n} Σ_{j=1}^{n} ( S_ij · log S_ij + (1 − S_ij) · log(1 − S_ij) )    (9)
where S_ij denotes the similarity between features U_i and U_j, defined as formula (10):
S_ij = exp( −α · Distance_ij )    (10)
where α is a positive constant;
where Distance_ij denotes the distance between features U_i and U_j, calculated according to formula (11):
Distance_ij = sqrt( Σ_{k=1}^{K} ( (u_ik − u_jk) / (max_k − min_k) )² )    (11)
where max_k and min_k denote, respectively, the maximum and minimum values over all features at the k-th sample (the k-th column of U);
2) let E_{−Us}, defined by formula (12), denote the entropy of the feature set U − {U_s} obtained after removing feature U_s from the feature set U; if
E_{−Us} > E_{−Ut}
then deleting feature U_s causes greater disorder of the feature set U, and therefore feature U_s is more important than feature U_t; all features are thereby ranked according to formula (12):
E_{−Us} = −Σ_{U_i ∈ U/U_s} Σ_{U_j ∈ U/U_s, j ≠ i} ( S_ij · log S_ij + (1 − S_ij) · log(1 − S_ij) )    (12)
where U/U_s = U − {U_s};
3) Respectively selecting the most important features of each feature cluster from the K feature clusters obtained in the step 7) of the feature local standard deviation-based feature spectral clustering method and the self-tuning algorithm-based feature spectral clustering method to form a feature subset, and obtaining the optimal feature subset to be selected;
step 8) in the feature spectral clustering method based on the feature local standard deviation and in the feature spectral clustering method based on the self-tuning algorithm measures feature importance based on cosine similarity ranking, and selects from each of the K feature clusters the most important feature of that cluster, namely the feature with the largest sum of cosine similarities with the other features of the cluster, as the representative feature of the feature cluster, so as to obtain a feature subset containing K features and realize feature selection, and the specific steps are as follows:
1) the normalized matrix U = (u_ij), i = 1, …, n; j = 1, …, K, U ∈ R^{n×K}, where U_i denotes the i-th feature; formula (14) is defined to measure the importance, i.e. the representativeness, of each feature (the more important a feature is, the more representative it is):
Importance(U_i) = (1/N_i) · Σ_{U_j ∈ C(U_i)} |U_i · U_j| / ( ||U_i|| × ||U_j|| )    (14)
where |U_i · U_j| denotes the absolute value of the inner product of features U_i and U_j, ||U_i|| × ||U_j|| denotes the product of their norms, C(U_i) denotes the feature cluster containing U_i, and N_i is the number of features in that cluster;
2) obtaining the importance value of each feature according to the formula (14), and sorting the features of each feature cluster according to the importance;
3) respectively selecting the most important features of each feature cluster from the K feature clusters obtained in the step 7) of the feature local standard deviation-based feature spectrum clustering method and the self-tuning algorithm-based feature spectrum clustering method to form a feature subset, and obtaining the optimal feature subset to be selected;
the optimal feature subset is the result of gene microarray data and text data analysis, and the optimal feature subset is used for finding gene markers of tumors.
CN201710868300.5A 2017-09-22 2017-09-22 Spectral feature selection method based on local scale parameters, entropy and cosine similarity Active CN107679138B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710868300.5A CN107679138B (en) 2017-09-22 2017-09-22 Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710868300.5A CN107679138B (en) 2017-09-22 2017-09-22 Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Publications (2)

Publication Number Publication Date
CN107679138A CN107679138A (en) 2018-02-09
CN107679138B true CN107679138B (en) 2021-08-27

Family

ID=61136640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710868300.5A Active CN107679138B (en) 2017-09-22 2017-09-22 Spectral feature selection method based on local scale parameters, entropy and cosine similarity

Country Status (1)

Country Link
CN (1) CN107679138B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409127B (en) * 2018-10-30 2022-04-26 北京天融信网络安全技术有限公司 Method and device for generating network data security policy and storage medium
CN109978007A (en) * 2019-02-25 2019-07-05 南京理工大学 A kind of disease risk factor extracting method based on attribute weight cluster
CN109978008A (en) * 2019-02-26 2019-07-05 杭州电子科技大学 The potential similitude optimization method of arest neighbors figure based on range conversion
CN110377798B (en) * 2019-06-12 2022-10-21 成都理工大学 Outlier detection method based on angle entropy
CN110728327B (en) * 2019-10-18 2021-11-23 中国科学技术大学 Interpretable direct-push learning method and system
CN114741048A (en) * 2022-05-20 2022-07-12 中译语通科技股份有限公司 Sample sorting method and device, computer equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968852A (en) * 2010-09-09 2011-02-09 西安电子科技大学 Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number
KR20140038838A (en) * 2012-09-21 2014-03-31 주식회사 메디칼써프라이 Spectral feature extraction method and system of biological tissue using back scattered light
CN104881671A (en) * 2015-05-21 2015-09-02 电子科技大学 High resolution remote sensing image local feature extraction method based on 2D-Gabor

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101968852A (en) * 2010-09-09 2011-02-09 西安电子科技大学 Entropy sequencing-based semi-supervision spectral clustering method for determining clustering number
KR20140038838A (en) * 2012-09-21 2014-03-31 주식회사 메디칼써프라이 Spectral feature extraction method and system of biological tissue using back scattered light
CN104881671A (en) * 2015-05-21 2015-09-02 电子科技大学 High resolution remote sensing image local feature extraction method based on 2D-Gabor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Feature Selection Algorithm Based on Feature Subset Discernibility and Support Vector Machines"; Xie Juanying et al.; Chinese Journal of Computers; 2014-08-31 (No. 8); pp. 1704-1718 *

Also Published As

Publication number Publication date
CN107679138A (en) 2018-02-09

Similar Documents

Publication Publication Date Title
CN107679138B (en) Spectral feature selection method based on local scale parameters, entropy and cosine similarity
Steinley K‐means clustering: a half‐century synthesis
Kamalov et al. Outlier detection in high dimensional data
Arora et al. Fuzzy c-means clustering strategies: A review of distance measures
Roffo et al. Feature selection via eigenvector centrality
Greene et al. Unsupervised learning and clustering
Ye et al. Robust similarity measure for spectral clustering based on shared neighbors
Chakraborty et al. Simultaneous variable weighting and determining the number of clusters—A weighted Gaussian means algorithm
Mohammed et al. Evaluation of partitioning around medoids algorithm with various distances on microarray data
Ghorai et al. Multicategory cancer classification from gene expression data by multiclass NPPC ensemble
Torkey et al. Machine learning model for cancer diagnosis based on RNAseq microarray
Zhang et al. Three-way clustering method for incomplete information system based on set-pair analysis
Carbonera et al. An entropy-based subspace clustering algorithm for categorical data
He et al. Estimation of optimal cluster number for fuzzy clustering with combined fuzzy entropy index
Hess et al. k is the magic number—inferring the number of clusters through nonparametric concentration inequalities
Lei et al. Automatic PAM Clustering Algorithm for Outlier Detection.
Jesus et al. Dynamic feature selection based on pareto front optimization
Baidari et al. A criterion for deciding the number of clusters in a dataset based on data depth
Zhou et al. A distance and density-based clustering algorithm using automatic peak detection
CN112800138B (en) Big data classification method and system
Vijay et al. Hamming distance based clustering algorithm
Barchiesi et al. Learning incoherent subspaces: classification via incoherent dictionary learning
Soleimani et al. A density-penalized distance measure for clustering
Kangane et al. A comprehensive survey of various clustering paradigms
De Amorim et al. Selecting the Minkowski exponent for intelligent K-Means with feature weighting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant