CN113177604B - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents

High-dimensional data feature selection method based on improved L1 regularization and clustering

Publication number: CN113177604B
Authority: CN (China)
Prior art keywords: feature, cluster, regularization, sample, clustering
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202110525604.8A
Other languages: Chinese (zh)
Other versions: CN113177604A
Inventors: 栗伟, 谢维冬, 王林洁, 闵新, 王珊珊, 于鲲
Original Assignee: 东北大学 (Northeastern University)
Priority date: 2021-05-14 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2021-05-14
Application filed by: 东北大学 (Northeastern University)
Publication of application CN113177604A: 2021-07-27
Grant of CN113177604B: 2024-04-16


Classifications

    • G06F18/2113 — Pattern recognition: selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/23213 — Pattern recognition: non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06F18/24 — Pattern recognition: classification techniques
    • G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, in the technical field of machine learning. The method is a hybrid feature selection algorithm for microarray data analysis that combines the K-Means clustering algorithm with an improved L1 regularization scheme: K-Means clustering is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection, improving both stability and classification accuracy.

Description

High-dimensional data feature selection method based on improved L1 regularization and clustering
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, many diseases have been shown to have a close relationship with genes. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and the discovery of biomarkers is of great importance for the early diagnosis and prevention of disease. Microarray data analysis techniques have been developed to find the most informative biomarkers and to remove redundant biomarkers unrelated to the target disease.
Raw microarray data has high feature dimensionality and a small sample size, and the actual number of disease-related features (genes) in it is relatively small: such data typically contains few samples and a large number of features unrelated to the disease of interest. In addition, microarray data is highly complex; features are directly or mutually correlated, with high redundancy, which causes many machine learning algorithms applied to such data to exhibit low robustness and poor classification accuracy. Therefore, finding a proper method to reduce the number of features before the model is built is of great significance for improving the classification accuracy and robustness of the model.
Feature selection is significant for mining large-scale high-dimensional datasets, such as those generated by microarray and mass spectrometry experiments, and for building statistical models: it identifies the significant features in the entire training dataset. Feature selection is an important step in selecting biomarkers from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper, and embedded methods, and the currently more advanced approaches are hybrid feature selection methods that improve and combine these three in different ways. Most such methods stack two or more feature selectors to improve classification accuracy. In microarray data analysis, however, researchers tend to pay more attention to the stability of the feature selection result and the non-redundancy of the selected feature subset, i.e., to having fewer redundant relationships among the selected features.
L1 regularization is an important technique in machine learning: adding the L1 norm to the cost function as a penalty term yields a sparse coefficient matrix and thereby achieves feature selection. The improved L1 regularization method combines subsampling with selection, which weakens the sensitivity of the feature selection result to the regularization coefficient, significantly improves the stability of the result, and controls false positives. Clustering is the process of grouping data into sets of similar members; the K-Means clustering algorithm, based on Euclidean distance, can divide the features into several weakly related subsets, enabling the clustering and screening of features. A small illustration of the L1 sparsity mechanism follows.
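To make the sparsity mechanism concrete, the following is a minimal sketch (not taken from the patent; the data, the value of alpha, and all names are invented for illustration) of how an L1 penalty drives the coefficients of uninformative features to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))   # 40 samples, 200 features: high-dimensional, small-sample
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=40)   # only features 0 and 3 carry signal

model = Lasso(alpha=0.3).fit(X, y)   # alpha is the L1 regularization coefficient
print(np.flatnonzero(model.coef_))   # indices of nonzero coefficients, typically a small set such as [0, 3]
```

Larger values of alpha shrink more coefficients to zero, which is why a single Lasso fit is sensitive to the regularization coefficient and why the improved method resamples and aggregates.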
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical scheme of the invention is a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprising the following steps:
step 1: according to a given gene microarray data set, cluster the gene microarray data features using the K-Means clustering algorithm;
step 1.1: take the gene microarray data sample set D = {x_1, x_2, …, x_m} and the number of clusters k as input to the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, …, μ_o, …, μ_k}, where μ_o denotes the o-th mean vector;
step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
step 1.3.1: define the clusters that store the grouped features, initializing C_b = ∅ for b = 1, 2, …, k;
step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as shown in the following formula;
d_jo = ||x_j − μ_o||_2 (1)
step 1.3.3: compute the cluster label λ_j of feature x_j, as shown in the following formula;
λ_j = argmin_{o ∈ {1, 2, …, k}} d_jo (2)
step 1.3.4: put feature x_j into the corresponding cluster, i.e., C_{λ_j} = C_{λ_j} ∪ {x_j};
step 1.3.5: let j = j + 1 and judge whether j is greater than n; if so, go to step 1.4, otherwise go to step 1.3.2;
step 1.4: for each mean vector μ_o, initialize o = 1 and perform the following operations:
step 1.4.1: compute the updated value of μ_o, denoted μ'_o, as shown in the following formula;
μ'_o = (1 / |C_o|) Σ_{x ∈ C_o} x (3)
where x ranges over all features of the set C_o;
step 1.4.2: judge whether the current μ_o equals μ'_o; if not, go to step 1.4.3; otherwise keep the current μ_o unchanged and go to step 1.4.4;
step 1.4.3: update the current mean vector μ_o to the value μ'_o;
step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
step 1.5: if any mean vector μ_o was updated, go to step 1.3; otherwise go to step 1.6;
step 1.6: for all obtained C_b, where b = 1, 2, …, k, let C = {C_1, C_2, …, C_k};
step 1.7: output the partitioned clusters C = {C_1, C_2, …, C_k};
step 2: for each cluster C_1–C_k generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
step 2.1: for the partitioned clusters C = {C_1, C_2, …, C_k}, let the parameter q = 1 and perform the following steps:
step 2.1.1: for C_q, compute the independent-sample t-test statistic and its P value P_j for each feature x_j, as shown in the following formula;
t_j = (x̄_j⁺ − x̄_j⁻) / √(S_1²/n_1 + S_2²/n_2) (4)
where x̄_j⁺ and x̄_j⁻ are the positive- and negative-sample means of feature x_j, S_1² and S_2² are the corresponding positive and negative sample variances, n_1 and n_2 are the positive and negative sample sizes corresponding to this feature, and n is the total number of features;
step 2.1.2: sort all the computed statistics; the feature x_j corresponding to the maximum value is the seed node x_s of cluster C_q;
step 2.1.3: compute the correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j in cluster C_q, as shown in the following formula:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s}) (5)
where E is the mathematical expectation and σ_{x_j}, σ_{x_s} are the standard deviations of x_j and x_s;
step 2.1.4: sort the correlation coefficients from large to small, and delete the nodes corresponding to the top 15% of correlation coefficients in each cluster;
step 2.1.5: retain the remaining nodes as the new cluster C'_q;
step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
step 2.2: for the updated clusters C' = {C'_1, C'_2, …, C'_k}, let the parameter r = 1 and perform the following steps:
step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster C'_r to select features, letting the weight of the j-th feature be w_j;
step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m denotes the number of samples and n denotes the total number of features, together with the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
step 2.2.1.2: randomly sample a subset of the samples in the sample space as the subspace X* and obtain the corresponding target variable y*;
step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the penalty-term coefficient vector;
step 2.2.1.4: denote by g the coefficient corresponding to x_j in the regression model; if g is nonzero, the feature is selected and its weight is updated as w_j = w_j + 1;
step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
step 2.2.1.6: output the feature weights w_j corresponding to all x_j;
step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
step 2.3: compute the accumulated weight W_j of each feature and sort all W_j from large to small;
step 2.4: according to the sorted W_j, output the top l features as the final feature set F = {f_1, f_2, …, f_l}, where f_1 corresponds to the largest accumulated weight;
step 3: for the resulting feature set F = {f_1, f_2, …, f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
The beneficial effects of adopting the above technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and the idea of improved L1 regularization, in which the K-Means clustering algorithm is used in data preprocessing to delete redundant features and the improved L1 regularization method is used for feature selection, thereby improving stability and classification accuracy.
Drawings
FIG. 1 is a flow chart of the overall process of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises the following steps:
step 1: according to the given gene microarray data set, cluster the gene microarray data features using the K-Means clustering algorithm;
step 1.1: take the gene microarray data sample set D = {x_1, x_2, …, x_m} and the number of clusters k as input to the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, …, μ_o, …, μ_k}, where μ_o denotes the o-th mean vector;
step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
step 1.3.1: define the clusters that store the grouped features, initializing C_b = ∅ for b = 1, 2, …, k;
step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as shown in the following formula;
d_jo = ||x_j − μ_o||_2 (1)
step 1.3.3: compute the cluster label λ_j of feature x_j, as shown in the following formula;
λ_j = argmin_{o ∈ {1, 2, …, k}} d_jo (2)
step 1.3.4: put feature x_j into the corresponding cluster, i.e., C_{λ_j} = C_{λ_j} ∪ {x_j};
step 1.3.5: let j = j + 1 and judge whether j is greater than n; if so, go to step 1.4, otherwise go to step 1.3.2;
step 1.4: for each mean vector μ_o, initialize o = 1 and perform the following operations:
step 1.4.1: compute the updated value of μ_o, denoted μ'_o, as shown in the following formula;
μ'_o = (1 / |C_o|) Σ_{x ∈ C_o} x (3)
where x ranges over all features of the set C_o;
step 1.4.2: judge whether the current μ_o equals μ'_o; if not, go to step 1.4.3; otherwise keep the current μ_o unchanged and go to step 1.4.4;
step 1.4.3: update the current mean vector μ_o to the value μ'_o;
step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
step 1.5: if any mean vector μ_o was updated, go to step 1.3; otherwise go to step 1.6;
step 1.6: for all obtained C_b, where b = 1, 2, …, k, let C = {C_1, C_2, …, C_k};
step 1.7: output the partitioned clusters C = {C_1, C_2, …, C_k};
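As a minimal sketch of step 1 (our illustration, not the patent's code; the function name and the use of scikit-learn are assumptions), the feature clustering can be performed by running K-Means on the transposed data matrix, so that each feature vector x_j is one point:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(X, k, seed=0):
    """Partition the n feature columns of X (m samples x n features) into k clusters,
    returned as a list of index arrays C = [C_1, ..., C_k]."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = km.fit_predict(X.T)   # transpose: each feature x_j becomes one point
    return [np.flatnonzero(labels == b) for b in range(k)]
```

scikit-learn's KMeans runs the same assign-then-update iteration as steps 1.3–1.5, terminating when the mean vectors stop changing.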
Step 2: for each cluster C generated in step 1 1 -C k Iteratively deleting redundant characteristics by using the pearson correlation coefficient, and updating each cluster;
step 2.1: for cluster c= { C after division 1 ,C 2 ,…,C k Let parameter q=1, perform the following steps:
step 2.1.1: for C q Calculate each feature x j The independent sample t-test statistic P value of (c) is shown in the following formula;
wherein the method comprises the steps ofAnd->Is the characteristic x j Corresponding positive and negative sample variances; n is n 1 And n 2 For positive and negative sample capacities corresponding to this feature,n is the total number of features;
step 2.1.2: for all ofOrdering, let->X corresponding to maximum value j Is cluster C q Seed node x of (a) s
Step 2.1.3: computing cluster C q Seed node x is removed in the middle s All nodes outside and x s Is of the correlation coefficient of (2)The formula is as follows:
wherein E is a mathematical expectation;
step 2.1.4: sorting the correlation coefficients from large to small, and deleting the nodes corresponding to the first 15% of correlation coefficients in each cluster;
step 2.1.5: reserving the remaining nodes as new clusters
Step 2.1.6: let q=q+1, judge q is greater than k, if yes, go to step 2.2, otherwise go to step 2.1.1;
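The redundancy filter of steps 2.1.1–2.1.5 can be sketched as follows (our reading of the elided formulas: the seed is the feature with the most significant t-test, i.e. the smallest P value, and the 15% of remaining features most correlated with it are deleted; the helper name and the use of absolute correlations are assumptions):

```python
import numpy as np
from scipy.stats import ttest_ind, pearsonr

def prune_cluster(X, y, idx, drop_frac=0.15):
    """Drop from cluster `idx` (an array of feature indices) the nodes most
    correlated with the cluster's seed node; return the retained indices."""
    pos, neg = X[y == 1], X[y == 0]
    _, p = ttest_ind(pos[:, idx], neg[:, idx], equal_var=False)   # P value per feature
    seed = idx[np.argmin(p)]                    # most significant feature = seed node x_s
    others = idx[idx != seed]
    corr = np.array([abs(pearsonr(X[:, j], X[:, seed])[0]) for j in others])
    n_drop = int(np.ceil(drop_frac * len(others)))
    keep = others[np.argsort(-corr)][n_drop:]   # delete the top 15% most correlated nodes
    return np.concatenate(([seed], keep))
```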
step 2.2: for the updated clusters C' = {C'_1, C'_2, …, C'_k}, let the parameter r = 1 and perform the following steps:
step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster C'_r to select features, letting the weight of the j-th feature be w_j;
step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m denotes the number of samples and n denotes the total number of features, together with the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
step 2.2.1.2: randomly sample a subset of the samples in the sample space as the subspace X* and obtain the corresponding target variable y*;
step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the penalty-term coefficient vector;
step 2.2.1.4: denote by g the coefficient corresponding to x_j in the regression model; if g is nonzero, the feature is selected and its weight is updated as w_j = w_j + 1;
step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
step 2.2.1.6: output the feature weights w_j corresponding to all x_j;
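Steps 2.2.1.1–2.2.1.6 amount to repeated Lasso fits on random subsamples with a selection-frequency weight, in the spirit of stability selection; in the sketch below, the half-size subsample and the weight update w_j = w_j + 1 are our assumptions where the source elides the exact formulas:

```python
import numpy as np
from sklearn.linear_model import Lasso

def improved_l1_weights(X, y, alpha=0.3, K=100, seed=0):
    """Weight w_j per feature: the number of the K random half-subsamples of the
    rows of X on which the feature's Lasso coefficient is nonzero (selected)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(K):
        rows = rng.choice(m, size=m // 2, replace=False)   # subspace X* and target y*
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[rows], y[rows]).coef_
        w[coef != 0] += 1                                   # selected -> weight + 1
    return w
```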
step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
step 2.3: compute the accumulated weight W_j of each feature and sort all W_j from large to small;
step 2.4: according to the sorted W_j, output the top l features as the final feature set F = {f_1, f_2, …, f_l}, where f_1 corresponds to the largest accumulated weight;
step 3: for the resulting feature set F = {f_1, f_2, …, f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
In this embodiment, tests were performed on 8 public microarray datasets using different classifiers. In the tests, the number of clusters k = 5, the number of repeated samplings K = 100, the penalty coefficient α = 0.3, and the number of selected features l = 10 (the results table of the original is not reproduced here). The sketch below shows how these parameters would drive the components sketched above.
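Under the parameters stated in this embodiment, the hypothetical helpers sketched above could be driven as follows; `load_microarray` is a placeholder for the user's own data loader, not an API from the patent:

```python
import numpy as np

X, y = load_microarray()              # placeholder: (m, n) data matrix and binary labels

clusters = cluster_features(X, k=5)   # step 1
pruned = [prune_cluster(X, y, np.asarray(c)) for c in clusters]   # step 2.1

W = np.zeros(X.shape[1])              # accumulated weight W_j over all clusters
for c in pruned:                      # steps 2.2-2.3
    W[c] += improved_l1_weights(X[:, c], y, alpha=0.3, K=100)

feature_set = np.argsort(-W)[:10]     # step 2.4: the top l = 10 features
```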
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions, which are defined by the scope of the appended claims.

Claims (2)

1. The high-dimensional data feature selection method based on the improved L1 regularization and clustering is characterized by comprising the following steps of:
step 1: according to a given gene microarray data set, cluster the gene microarray data features using the K-Means clustering algorithm;
step 2: for each cluster C_1–C_k generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
step 2.1: for the partitioned clusters C = {C_1, C_2, …, C_k}, let the parameter q = 1 and perform the following steps:
step 2.1.1: for C_q, compute the independent-sample t-test statistic and its P value P_j for each feature x_j, as shown in the following formula;
t_j = (x̄_j⁺ − x̄_j⁻) / √(S_1²/n_1 + S_2²/n_2) (4)
where x̄_j⁺ and x̄_j⁻ are the positive- and negative-sample means of feature x_j, S_1² and S_2² are the corresponding positive and negative sample variances, n_1 and n_2 are the positive and negative sample sizes corresponding to this feature, and n is the total number of features;
step 2.1.2: sort all the computed statistics; the feature x_j corresponding to the maximum value is the seed node x_s of cluster C_q;
step 2.1.3: compute the correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j in cluster C_q, as shown in the following formula:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s}) (5)
where E is the mathematical expectation and σ_{x_j}, σ_{x_s} are the standard deviations of x_j and x_s;
step 2.1.4: sort the correlation coefficients from large to small, and delete the nodes corresponding to the top 15% of correlation coefficients in each cluster;
step 2.1.5: retain the remaining nodes as the new cluster C'_q;
step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
step 2.2: for the updated clusters C' = {C'_1, C'_2, …, C'_k}, let the parameter r = 1 and perform the following steps:
step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster C'_r to select features, letting the weight of the j-th feature be w_j;
step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m denotes the number of samples and n denotes the total number of features, together with the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
step 2.2.1.2: randomly sample a subset of the samples in the sample space as the subspace X* and obtain the corresponding target variable y*;
step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the penalty-term coefficient vector;
step 2.2.1.4: denote by g the coefficient corresponding to x_j in the regression model; if g is nonzero, the feature is selected and its weight is updated as w_j = w_j + 1;
step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
step 2.2.1.6: output the feature weights w_j corresponding to all x_j;
step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
step 2.3: compute the accumulated weight W_j of each feature and sort all W_j from large to small;
step 2.4: according to the sorted W_j, output the top l features as the final feature set F = {f_1, f_2, …, f_l}, where f_1 corresponds to the largest accumulated weight;
step 3: for the resulting feature set F = {f_1, f_2, …, f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
2. The method for selecting high-dimensional data features based on improved L1 regularization and clustering of claim 1, wherein said step 1 specifically comprises the steps of:
step 1.1: take the gene microarray data sample set D = {x_1, x_2, …, x_m} and the number of clusters k as input to the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, …, μ_o, …, μ_k}, where μ_o denotes the o-th mean vector;
step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
step 1.3.1: define the clusters that store the grouped features, initializing C_b = ∅ for b = 1, 2, …, k;
step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as shown in the following formula;
d_jo = ||x_j − μ_o||_2 (1)
step 1.3.3: compute the cluster label λ_j of feature x_j, as shown in the following formula;
λ_j = argmin_{o ∈ {1, 2, …, k}} d_jo (2)
step 1.3.4: put feature x_j into the corresponding cluster, i.e., C_{λ_j} = C_{λ_j} ∪ {x_j};
step 1.3.5: let j = j + 1 and judge whether j is greater than n; if so, go to step 1.4, otherwise go to step 1.3.2;
step 1.4: for each mean vector μ_o, initialize o = 1 and perform the following operations:
step 1.4.1: compute the updated value of μ_o, denoted μ'_o, as shown in the following formula;
μ'_o = (1 / |C_o|) Σ_{x ∈ C_o} x (3)
where x ranges over all features of the set C_o;
step 1.4.2: judge whether the current μ_o equals μ'_o; if not, go to step 1.4.3; otherwise keep the current μ_o unchanged and go to step 1.4.4;
step 1.4.3: update the current mean vector μ_o to the value μ'_o;
step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
step 1.5: if any mean vector μ_o was updated, go to step 1.3; otherwise go to step 1.6;
step 1.6: for all obtained C_b, where b = 1, 2, …, k, let C = {C_1, C_2, …, C_k};
step 1.7: output the partitioned clusters C = {C_1, C_2, …, C_k}.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202110525604.8A | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering

Publications (2)

Publication Number | Publication Date
CN113177604A | 2021-07-27
CN113177604B | 2024-04-16

Family ID: 76929261


Patent Citations (10)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
WO2009067655A2 * | 2007-11-21 | 2009-05-28 | University of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN105372198A * | 2015-10-28 | 2016-03-02 | 中北大学 | Infrared spectrum wavelength selection method based on integrated L1 regularization
CN105740653A * | 2016-01-27 | 2016-07-06 | 北京工业大学 | Redundancy-removing feature selection method LLRFC score+ based on LLRFC and correlation analysis
CN107203787A * | 2017-06-14 | 2017-09-26 | 江西师范大学 | Unsupervised regularized matrix factorization feature selection method
CN108960341A * | 2018-07-23 | 2018-12-07 | 安徽师范大学 | Structured feature selection method for brain networks
CN109993214A * | 2019-03-08 | 2019-07-09 | 华南理工大学 | Multi-view clustering method based on Laplacian regularization and rank constraints
CN112232413A * | 2020-10-16 | 2021-01-15 | 东北大学 | High-dimensional data feature selection method based on graph neural network and spectral clustering
CN112327701A * | 2020-11-09 | 2021-02-05 | 浙江大学 | Slow feature network monitoring method for nonlinear dynamic industrial processes
CN112364902A * | 2020-10-30 | 2021-02-12 | 太原理工大学 | Feature selection learning method based on adaptive similarity
CN112417028A * | 2020-11-26 | 2021-02-26 | 国电南瑞科技股份有限公司 | Wind speed time-series feature mining method and short-term wind power prediction method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party

Deng Cai et al. Unsupervised Feature Selection for Multi-Cluster Data. KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, 333-342. *
Feiping Nie et al. Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization. Advances in Neural Information Processing Systems 23 (NIPS 2010), 2010, 1-9. *
Kun Yu et al. ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data. BMC Bioinformatics, vol. 22, 2021-10-22, 1-19. *
Dong Limei et al. Unsupervised feature selection based on sparse clustering (基于稀疏聚类的无监督特征选择). Journal of Nanjing University (Natural Science), vol. 54, no. 1, 2018-01, 107-115. *
Qian Youcheng. Improved unsupervised simultaneous orthogonal basis clustering feature selection (改进的无监督同时正交基聚类特征选择). Journal of Jilin Institute of Chemical Technology, vol. 36, no. 7, 2019-07, 80-85. *
Li Zifa. Research on efficient feature selection and classification methods for gene expression microarray data (面向基因表达微阵列数据的高效特征选择和分类方法研究). China Masters' Theses Full-text Database, Information Science and Technology, no. 01, 2019-01, I140-2420. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant