CN113177604A - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents
High-dimensional data feature selection method based on improved L1 regularization and clustering
- Publication number
- CN113177604A (application number CN202110525604.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- cluster
- regularization
- clustering
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, in the technical field of machine learning. A hybrid feature selection algorithm for microarray data analysis is proposed, built on the K-Means clustering algorithm and an improved L1 regularization idea: K-Means clustering is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection, improving both stability and classification accuracy.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, close relationships between many diseases and genes have been confirmed. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and the discovery of biomarkers is of great significance for early diagnosis and prevention of disease. Microarray data analysis techniques have been developed to find the most informative biomarkers and to remove redundant biomarkers unrelated to the target disease.
Microarray data analysis techniques are used to determine biomarkers. It is well known that, because of the high feature dimension and small sample size of raw microarray data, the actual number of disease-related features (genes) is relatively small. Such data typically contain few samples and a large number of features unrelated to the target disease. In addition, microarray data have high complexity: features are directly or indirectly interrelated, with a high degree of redundancy, which causes many machine learning algorithms applied to them to exhibit low robustness and poor classification accuracy. Therefore, finding an appropriate method to reduce the number of features before a model is constructed, so as to improve the classification accuracy and robustness of the model, is of great significance.
Feature selection is important for mining large-scale high-dimensional data sets, such as those generated by microarray and mass spectrometry experiments, and for building statistical models. Through feature selection, significant features in the entire training data set can be identified; it is a key step in the selection of biomarkers from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper and embedded methods, and the more advanced current methods are hybrid feature selection methods formed by improving and combining the three in different ways. Most such methods superpose two or more feature selectors to improve classification accuracy. In microarray data analysis, however, researchers tend to pay more attention to the stability of the feature selection results and to non-redundancy among the feature subsets, i.e., to there being few redundant relationships among the selected features.
L1 regularization is an important technique in machine learning: adding the L1 norm as a penalty term to the cost function yields a sparse coefficient matrix, which achieves the purpose of feature selection. The improved L1 regularization method combines sampling with selection, weakening the sensitivity of the feature selection result to the regularization coefficient; it can significantly improve the stability of the result and control false positives. Clustering is the process of grouping data members that are similar in some respect. The K-Means clustering algorithm, using a Euclidean-distance calculation, can divide the features into several weakly associated subsets, thereby realizing clustering and screening of features.
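As a minimal illustration of the L1-regularization idea described above (a generic Lasso sketch on synthetic data, not the patented method itself), features whose fitted coefficients are non-zero can be treated as "selected":

```python
# Generic L1 (Lasso) feature selection sketch on synthetic data.
# The alpha value and data shapes are illustrative choices only.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))            # 60 samples, 200 features (p >> n)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=60)

lasso = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
selected = np.flatnonzero(lasso.coef_)    # indices with non-zero coefficients
print(len(selected))
```

The sparsity induced by the L1 penalty is what turns a regression fit into a feature selector: only features the model actually uses keep non-zero weights.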
Disclosure of Invention
To address the defects of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical scheme of the invention is a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprising the following steps:
step 1: according to a given gene microarray data set, clustering of gene microarray data characteristics is achieved by using a K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, …, x_m} as the input of the K-Means clustering algorithm, where the number of clusters is k, x_j denotes the j-th feature in the sample set, and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, …, μ_k}, where μ_i denotes the i-th initial mean vector;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_i, denoted d_ji, as shown in the following formula;
d_ji = ||x_j − μ_i||_2 (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j, as shown in the following formula, and assign x_j to the corresponding cluster C_{λ_j};
λ_j = argmin_{i∈{1,2,…,k}} d_ji (2)
Step 1.3.5: let j = j + 1; if j is greater than m, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_i, let i = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_i, denoted μ'_i, as shown in the following formula;
μ'_i = (1/|C_i|) Σ_{x∈C_i} x
where x ranges over all features assigned to cluster C_i;
Step 1.4.2: judge whether the current μ_i is equal to μ'_i; if not, go to step 1.4.3 to update it; otherwise keep the current μ_i unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_i to μ'_i;
Step 1.4.4: let i = i + 1; if i is greater than k, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_i was updated in step 1.4, go to step 1.3; otherwise go to step 1.6;
Step 1.6: collect all the clusters C_i, i = 1, 2, …, k, and let C = {C_1, C_2, …, C_k};
Step 1.7: output the partitioned cluster set C = {C_1, C_2, …, C_k};
Step 2: for each cluster C_1–C_k generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 2.1: for the partitioned cluster set C = {C_1, C_2, …, C_k}, let parameter q = 1 and perform the following steps:
Step 2.1.1: for each feature x_i in C_q, compute the independent-samples t-test statistic P_i, as shown in the following formula;
P_i = |x̄_i^+ − x̄_i^−| / sqrt(S_+^2/n_1 + S_−^2/n_2)
where x̄_i^+, x̄_i^− and S_+^2, S_−^2 are the positive- and negative-class sample means and variances of feature x_i; n_1 and n_2 are the positive and negative sample sizes corresponding to the feature, n_1 + n_2 = n, and n is the total number of samples;
Step 2.1.2: sort all the P_i values; let the feature x_i corresponding to the maximum value be the seed node x_s of cluster C_q;
Step 2.1.3: compute the correlation coefficient ρ(x_s, x_i) between the seed node x_s and every other node x_i in cluster C_q, as shown in the following formula:
ρ(x_s, x_i) = E[(x_s − E[x_s])(x_i − E[x_i])] / (σ_{x_s} σ_{x_i})
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients from large to small, and delete the nodes corresponding to the largest 15% of correlation coefficients in each cluster;
Step 2.1.6: let q = q + 1; if q is greater than k, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: denote the updated cluster set by C' = {C'_1, C'_2, …, C'_k}; let parameter w = 1 and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster C'_w, where the weight of the t-th feature is denoted p_t;
Step 2.2.1.1: input the sample space X ∈ R^{n×p}, where n denotes the number of samples and p the number of features, and the target variable y ∈ R^n; define the regularization coefficient α and the number of repeated samplings K; set the counter i = 1;
Step 2.2.1.2: randomly draw a subset of samples from the sample space X to form the sample subspace X*, and take the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of coefficients penalized by the L1 norm;
Step 2.2.1.4: for each feature x_t whose coefficient returned by the model is non-zero, the feature is regarded as selected, and its feature weight p_t is incremented;
Step 2.2.1.5: let i = i + 1; if i is greater than K, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.2: let w = w + 1; if w is greater than k, execute step 2.3, otherwise execute step 2.2.1;
Step 2.3: calculate the cumulative weight p_w of each feature, and sort all the p_w values in descending order;
Step 2.4: according to the sorted p_w, output the first l features as the final feature set f = {f_1, f_2, …, f_l}, where f_1 corresponds to the feature with the largest cumulative weight;
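The repeated-sampling selection of steps 2.2.1.1–2.2.1.5 resembles stability selection; a sketch follows, in which the subsample size, α, and the weight-increment rule (+1 per selection) are illustrative assumptions rather than values fixed by the text:

```python
# Stability-selection-style sketch of steps 2.2.1.1-2.2.1.5:
# repeatedly subsample, fit a Lasso, count non-zero coefficients.
import numpy as np
from sklearn.linear_model import Lasso

def l1_selection_weights(X, y, alpha=0.3, K=50, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    weights = np.zeros(p)
    for _ in range(K):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # subspace X*
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
        weights += (coef != 0)          # selected feature -> weight +1
    return weights / K                  # selection frequency in [0, 1]

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 40))
y = 4.0 * X[:, 2] + rng.normal(scale=0.1, size=80)
w = l1_selection_weights(X, y)
print(float(w[2]))
```

Averaging selections over many subsamples is what weakens the sensitivity to the regularization coefficient: a feature must be chosen consistently, not just once, to earn a high cumulative weight.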
Step 3: for the resulting feature set f = {f_1, f_2, …, f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
The beneficial effects produced by adopting this technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis, built on the K-Means clustering algorithm and an improved L1 regularization idea, in which K-Means clustering is used in data preprocessing to delete redundant features and the improved L1 regularization method is used for feature selection, improving stability and classification accuracy.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. The following examples are intended to illustrate the invention, not to limit its scope.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises steps 1 to 3 exactly as described above in the Disclosure of Invention section.
In this embodiment, tests were performed on 8 public microarray datasets using different classifiers, as shown in the following table; in these tests the number of clusters k is 5, the number of repeated samplings K is 100, the penalty term coefficient α is 0.3, and the number of selected features is 10.
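Under the stated embodiment parameters, the overall pipeline might be sketched as follows. The function name is hypothetical, the Pearson-pruning step 2.1 is omitted for brevity, K is lowered from 100 to keep the demo fast, and the subsample size is an illustrative choice:

```python
# Hedged end-to-end sketch: K-Means feature clustering followed by
# per-cluster repeated-subsampling Lasso (steps 1 and 2.2; step 2.1 omitted).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def ilrc_weights(X, y, k=5, K=100, alpha=0.3, seed=0):
    """Cumulative selection weights: cluster features, then count how often
    each feature gets a non-zero Lasso coefficient within its cluster."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X.T).labels_
    weights = np.zeros(p)
    for c in range(k):
        cols = np.flatnonzero(labels == c)
        for _ in range(K):
            idx = rng.choice(n, size=n // 2, replace=False)
            coef = Lasso(alpha=alpha, max_iter=10000).fit(
                X[np.ix_(idx, cols)], y[idx]).coef_
            weights[cols] += (coef != 0)
    return weights

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 100))
y = 3.0 * X[:, 7] - 2.5 * X[:, 42] + rng.normal(scale=0.1, size=60)
w = ilrc_weights(X, y, K=20)          # K lowered from 100 for speed
top10 = np.argsort(w)[::-1][:10]      # analogue of the 10 selected features
print(int(w[7]), int(w[42]))
```

Running the Lasso only within each cluster keeps the per-fit dimensionality low, which is the point of the clustering preprocessing on high-dimensional microarray data.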
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (4)
1. A high-dimensional data feature selection method based on improved L1 regularization and clustering is characterized by comprising the following steps:
step 1: according to a given gene microarray data set, clustering of gene microarray data characteristics is achieved by using a K-Means clustering algorithm;
Step 2: for each cluster C_1–C_k generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 3: for the resulting feature set f = {f_1, f_2, …, f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
2. The method for selecting the high-dimensional data features based on the improved L1 regularization and clustering according to claim 1, wherein the step 1 specifically comprises the following steps:
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, …, x_m} as the input of the K-Means clustering algorithm, where the number of clusters is k, x_j denotes the j-th feature in the sample set, and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, …, μ_k}, where μ_i denotes the i-th initial mean vector;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_i, denoted d_ji, as shown in the following formula;
d_ji = ||x_j − μ_i||_2 (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j, as shown in the following formula, and assign x_j to the corresponding cluster C_{λ_j};
λ_j = argmin_{i∈{1,2,…,k}} d_ji (2)
Step 1.3.5: let j = j + 1; if j is greater than m, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_i, let i = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_i, denoted μ'_i, as shown in the following formula;
μ'_i = (1/|C_i|) Σ_{x∈C_i} x
where x ranges over all features assigned to cluster C_i;
Step 1.4.2: judge whether the current μ_i is equal to μ'_i; if not, go to step 1.4.3 to update it; otherwise keep the current μ_i unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_i to μ'_i;
Step 1.4.4: let i = i + 1; if i is greater than k, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_i was updated in step 1.4, go to step 1.3; otherwise go to step 1.6;
Step 1.6: collect all the clusters C_i, i = 1, 2, …, k, and let C = {C_1, C_2, …, C_k};
Step 1.7: output the partitioned cluster set C = {C_1, C_2, …, C_k}.
3. The method for selecting the high-dimensional data features based on the improved L1 regularization and clustering according to claim 1, wherein the step 2 specifically comprises the following steps:
Step 2.1: for the partitioned cluster set C = {C_1, C_2, …, C_k}, let parameter q = 1 and perform the following steps:
Step 2.1.1: for each feature x_i in C_q, compute the independent-samples t-test statistic P_i, as shown in the following formula;
P_i = |x̄_i^+ − x̄_i^−| / sqrt(S_+^2/n_1 + S_−^2/n_2)
where x̄_i^+, x̄_i^− and S_+^2, S_−^2 are the positive- and negative-class sample means and variances of feature x_i; n_1 and n_2 are the positive and negative sample sizes corresponding to the feature, n_1 + n_2 = n, and n is the total number of samples;
Step 2.1.2: sort all the P_i values; let the feature x_i corresponding to the maximum value be the seed node x_s of cluster C_q;
Step 2.1.3: compute the correlation coefficient ρ(x_s, x_i) between the seed node x_s and every other node x_i in cluster C_q, as shown in the following formula:
ρ(x_s, x_i) = E[(x_s − E[x_s])(x_i − E[x_i])] / (σ_{x_s} σ_{x_i})
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients from large to small, and delete the nodes corresponding to the largest 15% of correlation coefficients in each cluster;
Step 2.1.6: let q = q + 1; if q is greater than k, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: denote the updated cluster set by C' = {C'_1, C'_2, …, C'_k}; let parameter w = 1 and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster C'_w, where the weight of the t-th feature is denoted p_t;
Step 2.2.2: let w = w + 1; if w is greater than k, execute step 2.3, otherwise execute step 2.2.1;
Step 2.4: according to the sorted cumulative weights p_w, output the first l features as the final feature set f = {f_1, f_2, …, f_l}, where f_1 corresponds to the feature with the largest cumulative weight.
4. The method for selecting high-dimensional data features based on improved L1 regularization and clustering according to claim 3, wherein the step 2.2.1 specifically comprises the steps of:
Step 2.2.1.1: input the sample space X ∈ R^{n×p}, where n denotes the number of samples and p the number of features, and the target variable y ∈ R^n; define the regularization coefficient α and the number of repeated samplings K; set the counter i = 1;
Step 2.2.1.2: randomly draw a subset of samples from the sample space X to form the sample subspace X*, and take the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of coefficients penalized by the L1 norm;
Step 2.2.1.4: for each feature x_t whose coefficient returned by the model is non-zero, the feature is regarded as selected, and its feature weight p_t is incremented;
Step 2.2.1.5: let i = i + 1; if i is greater than K, go to step 2.2.1.6, otherwise go to step 2.2.1.2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525604.8A CN113177604B (en) | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110525604.8A CN113177604B (en) | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113177604A true CN113177604A (en) | 2021-07-27 |
CN113177604B CN113177604B (en) | 2024-04-16 |
Family
ID=76929261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110525604.8A Active CN113177604B (en) | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113177604B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN105372198A (en) * | 2015-10-28 | 2016-03-02 | 中北大学 | Infrared spectrum wavelength selection method based on integrated L1 regularization |
CN105740653A (en) * | 2016-01-27 | 2016-07-06 | 北京工业大学 | Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis |
CN107203787A (en) * | 2017-06-14 | 2017-09-26 | 江西师范大学 | Unsupervised regularization matrix decomposition feature selection method |
CN108960341A (en) * | 2018-07-23 | 2018-12-07 | 安徽师范大学 | A kind of structured features selection method towards brain network |
CN109993214A (en) * | 2019-03-08 | 2019-07-09 | 华南理工大学 | Multiple view clustering method based on Laplace regularization and order constraint |
CN112232413A (en) * | 2020-10-16 | 2021-01-15 | 东北大学 | High-dimensional data feature selection method based on graph neural network and spectral clustering |
CN112327701A (en) * | 2020-11-09 | 2021-02-05 | 浙江大学 | Slow characteristic network monitoring method for nonlinear dynamic industrial process |
CN112364902A (en) * | 2020-10-30 | 2021-02-12 | 太原理工大学 | Feature selection learning method based on self-adaptive similarity |
CN112417028A (en) * | 2020-11-26 | 2021-02-26 | 国电南瑞科技股份有限公司 | Wind speed time sequence characteristic mining method and short-term wind power prediction method |
- 2021-05-14: application CN202110525604.8A filed; patent CN113177604B granted (status: Active)
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN105372198A (en) * | 2015-10-28 | 2016-03-02 | 中北大学 | Infrared spectrum wavelength selection method based on integrated L1 regularization |
CN105740653A (en) * | 2016-01-27 | 2016-07-06 | 北京工业大学 | Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis |
CN107203787A (en) * | 2017-06-14 | 2017-09-26 | 江西师范大学 | Unsupervised regularization matrix decomposition feature selection method |
CN108960341A (en) * | 2018-07-23 | 2018-12-07 | 安徽师范大学 | A kind of structured features selection method towards brain network |
CN109993214A (en) * | 2019-03-08 | 2019-07-09 | 华南理工大学 | Multiple view clustering method based on Laplace regularization and order constraint |
CN112232413A (en) * | 2020-10-16 | 2021-01-15 | 东北大学 | High-dimensional data feature selection method based on graph neural network and spectral clustering |
CN112364902A (en) * | 2020-10-30 | 2021-02-12 | 太原理工大学 | Feature selection learning method based on self-adaptive similarity |
CN112327701A (en) * | 2020-11-09 | 2021-02-05 | 浙江大学 | Slow characteristic network monitoring method for nonlinear dynamic industrial process |
CN112417028A (en) * | 2020-11-26 | 2021-02-26 | 国电南瑞科技股份有限公司 | Wind speed time sequence characteristic mining method and short-term wind power prediction method |
Non-Patent Citations (6)
Title |
---|
DENG CAI et al.: "Unsupervised Feature Selection for Multi-Cluster Data", KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 25 July 2010 (2010-07-25), pages 333, XP058270591, DOI: 10.1145/1835804.1835848 * |
FEIPING NIE et al.: "Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization", Advances in Neural Information Processing Systems 23 (NIPS 2010), 31 December 2010 (2010-12-31), pages 1-9 * |
KUN YU et al.: "ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data", BMC Bioinformatics, vol. 22, 22 October 2021 (2021-10-22), pages 1-19, XP021297783, DOI: 10.1186/s12859-021-04443-7 * |
LI ZIFA: "Research on Efficient Feature Selection and Classification Methods for Gene Expression Microarray Data", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2019 (2019-01-15), pages 140-2420 * |
DONG LIMEI et al.: "Unsupervised Feature Selection Based on Sparse Clustering", Journal of Nanjing University (Natural Science), vol. 54, no. 1, 31 January 2018 (2018-01-31), pages 107-115 * |
QIAN YOUCHENG: "Improved Unsupervised Simultaneous Orthogonal Basis Clustering Feature Selection", Journal of Jilin Institute of Chemical Technology, vol. 36, no. 7, 31 July 2019 (2019-07-31), pages 80-85 * |
Also Published As
Publication number | Publication date |
---|---|
CN113177604B (en) | 2024-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sun et al. | Local-learning-based feature selection for high-dimensional data analysis | |
CN108108762B (en) | Nuclear extreme learning machine for coronary heart disease data and random forest classification method | |
EP3317823A1 (en) | Method and apparatus for large scale machine learning | |
Alomari et al. | A hybrid filter-wrapper gene selection method for cancer classification | |
Futschik et al. | Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue | |
CN112926640B (en) | Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium | |
CN113764034B (en) | Method, device, equipment and medium for predicting potential BGC in genome sequence | |
CN116226629B (en) | Multi-model feature selection method and system based on feature contribution | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN117520914A (en) | Single cell classification method, system, equipment and computer readable storage medium | |
Morovvat et al. | An ensemble of filters and wrappers for microarray data classification | |
CN112613391B (en) | Hyperspectral image waveband selection method based on reverse learning binary rice breeding algorithm | |
Chellamuthu et al. | Data mining and machine learning approaches in breast cancer biomedical research | |
CN115758462A (en) | Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment | |
CN112801163B (en) | Multi-target feature selection method of mouse model hippocampal biomarker based on dynamic graph structure | |
CN113838519B (en) | Gene selection method and system based on adaptive gene interaction regularization elastic network model | |
CN113177604B (en) | High-dimensional data feature selection method based on improved L1 regularization and clustering | |
CN115116619A (en) | Intelligent analysis method and system for stroke data distribution rule | |
CN114334168A (en) | Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy | |
Bazan et al. | Comparison of aggregation classes in ensemble classifiers for high dimensional datasets | |
CN113971984A (en) | Classification model construction method and device, electronic equipment and storage medium | |
CN115017125B (en) | Data processing method and device for improving KNN method | |
CN118053501A (en) | Biomarker identification method based on genetic algorithm | |
CN114462548B (en) | Method for improving accuracy of single-cell deep clustering algorithm | |
Guo et al. | A comparison between the wrapper and hybrid methods for feature selection on biology Omics datasets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||