CN113177604B - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents
- Publication number: CN113177604B (application CN202110525604.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- cluster
- regularization
- sample
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, and relates to the technical field of machine learning. The method is a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and the idea of improved L1 regularization: the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is then used for feature selection, improving both stability and classification accuracy.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, many diseases have been shown to be closely related to genes. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and discovering biomarkers is of great importance for early diagnosis and disease prevention. Microarray data analysis techniques were developed to find the most informative biomarkers and to remove redundant biomarkers unrelated to the target disease.
Microarray data analysis techniques are used to identify biomarkers. Raw microarray data have high feature dimensionality and small sample size, and the number of truly disease-related features (genes) is comparatively small: such data typically contain few samples and a large number of features unrelated to the disease of interest. In addition, microarray data are highly complex, i.e. features are directly or indirectly interrelated and highly redundant, which makes many machine learning algorithms applied to them exhibit low robustness and poor classification accuracy. Finding a suitable method to reduce the number of features before building a model is therefore very important for improving the classification accuracy and robustness of the model.
Feature selection is significant for mining large-scale high-dimensional datasets, such as those generated by microarray and mass spectrometry experiments, and for building statistical models: it identifies the significant features in the whole training dataset, and it is a key step when selecting biomarkers from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper, and embedded methods; the currently more advanced approach is hybrid feature selection, which improves and combines the three in different ways. Most such methods stack two or more feature selectors in order to raise classification accuracy. In microarray data analysis, however, researchers tend to pay more attention to the stability of the feature selection result and to the non-redundancy of the selected subset, i.e. to there being few redundant relationships among the selected features.
L1 regularization is an important tool in machine learning: adding the L1 norm of the coefficients to the cost function as a penalty term yields a sparse coefficient vector and thereby achieves feature selection. The improved L1 regularization method is based on combining resampling with selection, which weakens the sensitivity of the feature selection result to the regularization coefficient, markedly improves the stability of the result, and controls false positives. Clustering is the process of grouping similar data members, and the K-Means clustering algorithm, using a Euclidean-distance-based computation, can divide the features into several weakly correlated subsets, enabling the grouping and screening of features.
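As a concrete illustration of the sparsity just described, the following sketch (our own, not from the patent; the data, the coefficient values, and the use of scikit-learn's Lasso are illustrative assumptions) shows how an L1 penalty zeroes out most regression coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative only: an L1 penalty on the cost function drives most
# regression coefficients to exactly zero, which is what makes it
# usable as a feature selector.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))                      # 60 samples, 100 features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=60)

model = Lasso(alpha=0.3).fit(X, y)
selected = np.flatnonzero(model.coef_)              # indices of nonzero coefficients
print(len(selected), "of", X.shape[1], "features kept")
```

With a larger α more coefficients are driven to zero; this sensitivity of the result to α is exactly what the resampling-based improved scheme is meant to mitigate.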
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical solution of the invention, a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprises the following steps:
Step 1: given a gene microarray data set, cluster the microarray features using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.1: initialize empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
Step 1.3.4: put feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
Step 1.3.5: let j = j + 1 and judge whether j is greater than n, the total number of features; if so, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_o, set o = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
Step 1.4.2: judge whether the current μ_o equals μ′_o; if not, go to step 1.4.3 to update it, otherwise keep the current μ_o unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_o to the value μ′_o;
Step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_o was updated in the current round, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, let C = {C_1, C_2, ..., C_k};
Step 1.7: output the partitioned clusters C = {C_1, C_2, ..., C_k};
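Steps 1.1 to 1.7 are the standard K-Means iteration applied to the features (columns) of the microarray matrix rather than to the samples. A minimal sketch using scikit-learn's KMeans in place of the hand-written loop (the matrix size and k are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1 clusters the *features* (genes), not the samples, so K-Means is
# run on the transposed matrix: each column becomes one point to cluster.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))      # 40 samples, 200 gene features
k = 5                               # number of clusters

labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X.T)
clusters = [np.flatnonzero(labels == b) for b in range(k)]   # C_1 .. C_k
print([len(c) for c in clusters])
```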
Step 2: for each cluster C_1 to C_k produced in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, set the parameter q = 1 and perform the following steps:
Step 2.1.1: for C_q, compute for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
Step 2.1.2: sort all the statistics; the feature x_j corresponding to the maximum value is taken as the seed node x_s of cluster C_q;
Step 2.1.3: compute the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients in descending order and delete, in each cluster, the nodes corresponding to the top 15% of the coefficients;
Step 2.1.5: keep the remaining nodes as the new cluster C′_q;
Step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
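For a single cluster, steps 2.1.1 to 2.1.5 can be sketched as follows. The data is synthetic, and taking the largest absolute t statistic as the most discriminative seed is our reading of the partly illegible formula:

```python
import numpy as np
from scipy import stats

# Sketch for ONE cluster: pick the feature with the strongest
# independent-samples t statistic as the seed node, then drop the 15% of
# remaining features most correlated (in absolute value) with it.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 30))        # 50 samples, 30 features in this cluster
y = rng.integers(0, 2, size=50)      # positive/negative class labels

t, _ = stats.ttest_ind(X[y == 1], X[y == 0], equal_var=False)
seed = int(np.argmax(np.abs(t)))     # seed node x_s

others = [j for j in range(X.shape[1]) if j != seed]
corr = np.array([abs(np.corrcoef(X[:, j], X[:, seed])[0, 1]) for j in others])
n_drop = int(0.15 * len(others))     # top 15% most correlated with the seed
drop = set(np.array(others)[np.argsort(corr)[::-1][:n_drop]])
new_cluster = [j for j in range(X.shape[1]) if j not in drop]
print(len(new_cluster))
```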
Step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, set the parameter r = 1 and perform the following steps:
Step 2.2.1: apply the improved-L1-regularization feature selection algorithm to each input cluster C′_r, and let W_j denote the weight of the j-th feature;
Step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
Step 2.2.1.2: randomly sample a subset of the sample space as the subspace X*, and obtain the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
Step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, the feature is regarded as selected and its weight W_j is incremented;
Step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights W_j of all x_j;
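Steps 2.2.1.1 to 2.2.1.6 amount to a stability-selection-style loop: subsample, fit a Lasso, and count how often each feature is selected. A sketch follows; subsampling half the rows each round is our assumption, since the exact sampling size is not legible in the text:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Repeat K times: subsample the rows, fit a Lasso with penalty alpha,
# and count how often each feature receives a nonzero coefficient.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 40))
y = 2 * X[:, 3] + rng.normal(scale=0.1, size=80)
alpha, K = 0.3, 100

weights = np.zeros(X.shape[1])
for _ in range(K):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
    weights += (coef != 0)           # increment weight of selected features
weights /= K                         # selection frequency in [0, 1]
print(weights[3])                    # the informative feature is selected often
```

Averaging selection counts over many subsamples is what makes the result far less sensitive to the exact choice of α than a single Lasso fit.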
Step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
Step 2.3: compute the accumulated weight of each feature and sort all accumulated weights in descending order;
Step 2.4: output, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
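Steps 2.3 and 2.4 then reduce to sorting the accumulated weights and keeping the top l features (toy values for illustration):

```python
import numpy as np

# Merge the per-cluster weights into one vector, sort descending,
# and keep the top l feature indices.
weights = np.array([0.2, 0.9, 0.1, 0.7, 0.4])   # accumulated weights for 5 features
l = 3
order = np.argsort(weights)[::-1]               # indices sorted by weight, descending
final_features = order[:l].tolist()
print(final_features)  # [1, 3, 4]
```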
Step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
The beneficial effects of adopting this technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and the idea of improved L1 regularization. The K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection, so that stability and classification accuracy are improved.
Drawings
FIG. 1 is a flow chart of the overall process of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises the following steps:
Step 1: given a gene microarray data set, cluster the microarray features using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.1: initialize empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
Step 1.3.4: put feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
Step 1.3.5: let j = j + 1 and judge whether j is greater than n, the total number of features; if so, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_o, set o = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
Step 1.4.2: judge whether the current μ_o equals μ′_o; if not, go to step 1.4.3 to update it, otherwise keep the current μ_o unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_o to the value μ′_o;
Step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_o was updated in the current round, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, let C = {C_1, C_2, ..., C_k};
Step 1.7: output the partitioned clusters C = {C_1, C_2, ..., C_k};
Step 2: for each cluster C_1 to C_k produced in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, set the parameter q = 1 and perform the following steps:
Step 2.1.1: for C_q, compute for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
Step 2.1.2: sort all the statistics; the feature x_j corresponding to the maximum value is taken as the seed node x_s of cluster C_q;
Step 2.1.3: compute the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients in descending order and delete, in each cluster, the nodes corresponding to the top 15% of the coefficients;
Step 2.1.5: keep the remaining nodes as the new cluster C′_q;
Step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, set the parameter r = 1 and perform the following steps:
Step 2.2.1: apply the improved-L1-regularization feature selection algorithm to each input cluster C′_r, and let W_j denote the weight of the j-th feature;
Step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
Step 2.2.1.2: randomly sample a subset of the sample space as the subspace X*, and obtain the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
Step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, the feature is regarded as selected and its weight W_j is incremented;
Step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights W_j of all x_j;
Step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
Step 2.3: compute the accumulated weight of each feature and sort all accumulated weights in descending order;
Step 2.4: output, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
Step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
In this embodiment, the method was tested on 8 public microarray data sets with different classifiers, as shown in the following table; in these tests the number of clusters was k = 5, the number of repeated samplings K = 100, the penalty coefficient α was 0.3, and the number of selected features was 10.
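Putting the pieces together, the whole procedure can be sketched end to end as below. The function name, the synthetic data, and the half-row subsampling are our assumptions; the parameters mirror the embodiment (k = 5, α = 0.3, 10 output features), with K reduced from 100 to keep the demo fast:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def select_features(X, y, k=5, K=100, alpha=0.3, n_out=10, seed=0):
    """Sketch of the full pipeline: K-Means on features, t-test seed +
    Pearson pruning per cluster, then resampled-Lasso selection counts."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X.T)
    weights = np.zeros(X.shape[1])
    for b in range(k):
        feats = np.flatnonzero(labels == b)
        # Step 2.1: seed via t statistic, drop top-15% features correlated with it
        t, _ = stats.ttest_ind(X[y == 1][:, feats], X[y == 0][:, feats],
                               equal_var=False)
        s = feats[np.argmax(np.abs(t))]
        corr = {j: abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for j in feats if j != s}
        drop = set(sorted(corr, key=corr.get, reverse=True)[: int(0.15 * len(corr))])
        feats = np.array([j for j in feats if j not in drop])
        # Step 2.2: improved L1 -- count Lasso selections over K subsamples
        for _ in range(K):
            idx = rng.choice(len(X), size=len(X) // 2, replace=False)
            coef = Lasso(alpha=alpha).fit(X[idx][:, feats], y[idx]).coef_
            weights[feats] += (coef != 0)
    # Steps 2.3-2.4: top n_out features by accumulated weight
    return np.argsort(weights)[::-1][:n_out]

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 120))
y = rng.integers(0, 2, size=60)
top = select_features(X, y, K=10)   # fewer resamplings to keep the demo fast
print(top)
```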
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the scope of the corresponding technical solutions, which is defined by the appended claims.
Claims (2)
1. A high-dimensional data feature selection method based on improved L1 regularization and clustering, characterized by comprising the following steps:
step 1: given a gene microarray data set, clustering the microarray features using the K-Means clustering algorithm;
step 2: for each cluster C_1 to C_k produced in step 1, iteratively deleting redundant features using the Pearson correlation coefficient and updating each cluster;
step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, setting the parameter q = 1 and performing the following steps:
step 2.1.1: for C_q, computing for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
step 2.1.2: sorting all the statistics, the feature x_j corresponding to the maximum value being taken as the seed node x_s of cluster C_q;
step 2.1.3: computing the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
step 2.1.4: sorting the correlation coefficients in descending order and deleting, in each cluster, the nodes corresponding to the top 15% of the coefficients;
step 2.1.5: keeping the remaining nodes as the new cluster C′_q;
step 2.1.6: letting q = q + 1 and judging whether q is greater than k; if so, going to step 2.2, otherwise going to step 2.1.1;
step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, setting the parameter r = 1 and performing the following steps:
step 2.2.1: applying the improved-L1-regularization feature selection algorithm to each input cluster C′_r, with W_j denoting the weight of the j-th feature;
step 2.2.1.1: inputting the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; defining the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
step 2.2.1.2: randomly sampling a subset of the sample space as the subspace X* and obtaining the corresponding target variable y*;
step 2.2.1.3: fitting a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, regarding the feature as selected and incrementing its weight W_j;
step 2.2.1.5: letting h = h + 1 and judging whether h is greater than K; if so, going to step 2.2.1.6, otherwise going to step 2.2.1.2;
step 2.2.1.6: outputting the feature weights W_j of all x_j;
step 2.2.2: letting r = r + 1 and judging whether r is greater than k; if so, performing step 2.3, otherwise performing step 2.2.1;
step 2.3: computing the accumulated weight of each feature and sorting all accumulated weights in descending order;
step 2.4: outputting, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, finding the corresponding gene names in the original microarray data to complete the gene feature analysis.
2. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, characterized in that said step 1 specifically comprises the following steps:
step 1.1: taking the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
step 1.2: randomly selecting k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
step 1.3: for each feature x_j in the sample set D, initializing j = 1 and performing the following operations:
step 1.3.1: initializing empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
step 1.3.2: computing the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
step 1.3.3: computing the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
step 1.3.4: putting feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
step 1.3.5: letting j = j + 1 and judging whether j is greater than n, the total number of features; if so, going to step 1.4, otherwise going to step 1.3.2;
step 1.4: for each mean vector μ_o, setting o = 1 and performing the following operations:
step 1.4.1: computing the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
step 1.4.2: judging whether the current μ_o equals μ′_o; if not, going to step 1.4.3 to update it, otherwise keeping the current μ_o unchanged and going to step 1.4.4;
step 1.4.3: updating the current mean vector μ_o to the value μ′_o;
step 1.4.4: letting o = o + 1 and judging whether o is greater than k; if so, going to step 1.5, otherwise going to step 1.4.1;
step 1.5: if any mean vector μ_o was updated in the current round, going to step 1.3, otherwise going to step 1.6;
step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, letting C = {C_1, C_2, ..., C_k};
step 1.7: outputting the partitioned clusters C = {C_1, C_2, ..., C_k}.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110525604.8A | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110525604.8A | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113177604A | 2021-07-27 |
| CN113177604B | 2024-04-16 |
Family
ID=76929261
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110525604.8A | High-dimensional data feature selection method based on improved L1 regularization and clustering | 2021-05-14 | 2021-05-14 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN113177604B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN105372198A (en) * | 2015-10-28 | 2016-03-02 | 中北大学 | Infrared spectrum wavelength selection method based on integrated L1 regularization |
CN105740653A (en) * | 2016-01-27 | 2016-07-06 | 北京工业大学 | Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis |
CN107203787A (en) * | 2017-06-14 | 2017-09-26 | 江西师范大学 | Unsupervised regularization matrix decomposition feature selection method |
CN108960341A (en) * | 2018-07-23 | 2018-12-07 | 安徽师范大学 | A kind of structured features selection method towards brain network |
CN109993214A (en) * | 2019-03-08 | 2019-07-09 | 华南理工大学 | Multiple view clustering method based on Laplace regularization and order constraint |
CN112232413A (en) * | 2020-10-16 | 2021-01-15 | 东北大学 | High-dimensional data feature selection method based on graph neural network and spectral clustering |
CN112327701A (en) * | 2020-11-09 | 2021-02-05 | 浙江大学 | Slow characteristic network monitoring method for nonlinear dynamic industrial process |
CN112364902A (en) * | 2020-10-30 | 2021-02-12 | 太原理工大学 | Feature selection learning method based on self-adaptive similarity |
CN112417028A (en) * | 2020-11-26 | 2021-02-26 | 国电南瑞科技股份有限公司 | Wind speed time sequence characteristic mining method and short-term wind power prediction method |
- 2021-05-14: application CN202110525604.8A filed; patent CN113177604B (en) granted, status Active
Non-Patent Citations (6)
Title |
---|
Unsupervised Feature Selection for Multi-Cluster Data; Deng Cai et al.; KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2010; 333-342 * |
Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization; Feiping Nie et al.; Advances in Neural Information Processing Systems 23 (NIPS 2010); 2010; 1-9 * |
ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data; Kun Yu et al.; BMC Bioinformatics; 22 Oct 2021; Vol. 22; 1-19 * |
Unsupervised feature selection based on sparse clustering; Dong Limei et al.; Journal of Nanjing University (Natural Science); 31 Jan 2018; Vol. 54, No. 1; 107-115 * |
Improved unsupervised simultaneous orthogonal basis clustering feature selection; Qian Youcheng; Journal of Jilin Institute of Chemical Technology; 31 Jul 2019; Vol. 36, No. 7; 80-85 * |
Research on efficient feature selection and classification methods for gene expression microarray data; Li Zifa; China Master's Theses Full-text Database (Information Science and Technology); 15 Jan 2019; No. 01; I140-2420 * |
Also Published As
Publication number | Publication date |
---|---|
CN113177604A (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977994B (en) | Representative image selection method based on multi-example active learning | |
CN110659207B (en) | Heterogeneous cross-project software defect prediction method based on nuclear spectrum mapping migration integration | |
CN108108762B (en) | Nuclear extreme learning machine for coronary heart disease data and random forest classification method | |
CN112085059B (en) | Breast cancer image feature selection method based on improved sine and cosine optimization algorithm | |
CN113298230B (en) | Prediction method based on unbalanced data set generated against network | |
CN113408605A (en) | Hyperspectral image semi-supervised classification method based on small sample learning | |
CN106203534A (en) | A kind of cost-sensitive Software Defects Predict Methods based on Boosting | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN112926640A (en) | Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium | |
Morovvat et al. | An ensemble of filters and wrappers for microarray data classification | |
Park et al. | Evolutionary fuzzy clustering algorithm with knowledge-based evaluation and applications for gene expression profiling | |
CN111444989A (en) | Network intrusion detection method | |
CN113177604B (en) | High-dimensional data feature selection method based on improved L1 regularization and clustering | |
CN111832645A (en) | Classification data feature selection method based on discrete crow difference collaborative search algorithm | |
CN108304546B (en) | Medical image retrieval method based on content similarity and Softmax classifier | |
CN110837853A (en) | Rapid classification model construction method | |
CN112801163B (en) | Multi-target feature selection method of mouse model hippocampal biomarker based on dynamic graph structure | |
CN115758462A (en) | Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment | |
CN114334168A (en) | Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy | |
CN108182347B (en) | Large-scale cross-platform gene expression data classification method | |
CN118053501A (en) | Biomarker identification method based on genetic algorithm | |
CN114512188B (en) | DNA binding protein recognition method based on improved protein sequence position specificity matrix | |
Uddin et al. | Practical analysis of macromolecule identity from cryo-electron tomography images using deep learning | |
Ranjan et al. | A Modified Binary Arithmetic Optimization Algorithm for Feature Selection | |
Walker | Iterative Random Forest Based High Performance Computing Methods Applied to Biological Systems and Human Health |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||