CN113177604A - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents

High-dimensional data feature selection method based on improved L1 regularization and clustering

Info

Publication number
CN113177604A
Authority
CN
China
Prior art keywords
feature
cluster
regularization
clustering
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110525604.8A
Other languages
Chinese (zh)
Other versions
CN113177604B (en)
Inventor
栗伟
谢维冬
王林洁
闵新
王珊珊
于鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN202110525604.8A
Publication of CN113177604A
Application granted
Publication of CN113177604B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 - Selection of the most significant subset of features
    • G06F18/2113 - Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 - Classification techniques
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, relating to the technical field of machine learning. The invention proposes a hybrid feature selection algorithm for microarray data analysis, built on the K-Means clustering algorithm and an improved L1 regularization idea: the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection to improve stability and classification accuracy.

Description

High-dimensional data feature selection method based on improved L1 regularization and clustering
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, a close relationship between many diseases and genes has been confirmed. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and the discovery of biomarkers is of great significance for the early diagnosis and prevention of disease. Microarray data analysis techniques have been developed to find the most informative biomarkers and to remove biomarkers that are redundant or unrelated to the target disease.
Microarray data analysis techniques are used to identify biomarkers. It is well known that, owing to the high feature dimensionality and small sample size of raw microarray data, the actual number of disease-related features (genes) is relatively small: such data typically contain few samples and a large number of features unrelated to the target disease. In addition, microarray data are highly complex; features are directly or indirectly correlated with one another, giving a high degree of redundancy, which makes many machine learning algorithms applied to such data exhibit low robustness and poor classification accuracy. Therefore, finding an appropriate method to reduce the number of features before constructing a model, thereby improving the model's classification accuracy and robustness, is of great significance.
Feature selection is important for mining large-scale high-dimensional data sets, such as those generated by microarray and mass spectrometry experiments, and for establishing statistical models: it identifies the significant features of the entire training data set. Feature selection is thus a key step in biomarker selection from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper, and embedded methods; the more advanced current approaches are hybrid feature selection methods formed by improving and combining these three in different ways. Most such methods stack two or more feature selectors to improve classification accuracy. However, in microarray data analysis, researchers tend to pay more attention to the stability of the feature selection results and to non-redundancy within the selected feature subset, i.e., to there being few redundant relationships among the selected features.
L1 regularization is an important tool in machine learning: adding the L1 norm to the cost function as a penalty term yields a sparse coefficient matrix and thereby achieves feature selection. The improved L1 regularization method combines sampling with selection, which weakens the sensitivity of the feature selection result to the regularization coefficient, significantly improves the stability of the result, and controls false positives. Clustering is the process of grouping the members of a data set that are similar in some respect; the K-Means clustering algorithm, through a computation based on Euclidean distance, can divide the samples into several weakly associated subsets, and can therefore be used to cluster and screen features.
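For concreteness, the L1-regularized (Lasso) objective referred to above takes the standard form

$$\min_{w}\;\frac{1}{2n}\,\lVert y - Xw\rVert_2^2 + \alpha\,\lVert w\rVert_1,$$

where $X$ is the sample matrix, $y$ the target variable, $w$ the coefficient vector, and $\alpha$ the regularization coefficient; the L1 penalty drives many entries of $w$ to exactly zero, which is what makes feature selection possible.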
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical scheme of the invention is a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprising the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered (here each feature is treated as one clustering sample);
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$;
Step 2: for each cluster C generated in step 11-CkIteratively deleting redundant features by utilizing a Pearson correlation coefficient, and updating each cluster;
step 2.1: for cluster C after partitioning { C ═ C1,C2,…,CkLet parameter q be 1, perform the following steps:
step 2.1.1: for CqCalculating each feature xiThe value of the test statistic P of the independent sample t is shown as the following formula;
Figure BDA0003065601710000031
wherein
Figure BDA0003065601710000032
And
Figure BDA0003065601710000033
is a characteristic xiCorresponding positive and negative sample variances; n is1And n2For positive and negative sample volumes corresponding to the feature,
Figure BDA0003065601710000034
n is the total number of samples;
step 2.1.2: for all
Figure BDA0003065601710000035
Carry out sequencing, order
Figure BDA0003065601710000036
X corresponding to the maximum valueiIs a cluster CqSeed node x ofs
Step 2.1.3: computing cluster CqMiddle seed node xsAll nodes except for xsCorrelation coefficient of
Figure BDA0003065601710000037
The formula is as follows:
Figure BDA0003065601710000038
wherein E is a mathematical expectation;
step 2.1.4: sorting the correlation numbers from large to small, and deleting nodes corresponding to the first 15% of correlation coefficients in each cluster;
step 2.1.5: reserving remaining nodes as new clusters
Figure BDA0003065601710000039
Step 2.1.6: making q equal to q +1, judging whether q is larger than k, if so, turning to a step 2.2, otherwise, turning to a step 2.1.1;
step 2.2: order the updated cluster to be aggregated
Figure BDA00030656017100000310
And when the parameter w is 1, executing the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features;
Step 2.2.2: if w is greater than k, executing step 2.3, otherwise executing step 2.2.1;
step 2.3: calculating cumulative weights for each feature
Figure BDA0003065601710000041
All p arewSorting according to the sequence from big to small;
step 2.4: according to pwThe first l features are output as a final feature set f ═ f as a result of the sorting of (1)1,f2…, f), where f1Corresponds to pwThe term with the largest accumulated weight;
and step 3: for the resulting feature set f ═ f1,f2,…,flAnd finding out the corresponding gene name from the original microarray data to complete the characteristic analysis of the gene.
The beneficial effects produced by adopting the technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and an improved L1 regularization idea, in which the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection to improve stability and classification accuracy.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered (here each feature is treated as one clustering sample);
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$;
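For illustration only, step 1 can be sketched in Python using scikit-learn as follows, assuming the microarray matrix is arranged as samples × features; the function name cluster_features is illustrative and not part of the claimed method:

    # Minimal sketch of step 1 (illustrative; assumes scikit-learn is available).
    # Cluster the features (columns) of a microarray matrix with K-Means,
    # treating each feature vector as one clustering sample.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_features(X, k=5, seed=0):
        """X: (n_samples, n_features) array. Returns k arrays of feature
        indices, one per cluster (the sets C_1, ..., C_k of step 1.7)."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X.T)
        return [np.where(labels == c)[0] for c in range(k)]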
Step 2: for each cluster C generated in step 11-CkIteratively deleting redundant features by utilizing a Pearson correlation coefficient, and updating each cluster;
step 2.1: for cluster C after partitioning { C ═ C1,C2,…,CkLet parameter q be 1, perform the following steps:
step 2.1.1: for CqCalculating each feature xiThe value of the test statistic P of the independent sample t is shown as the following formula;
Figure BDA0003065601710000052
wherein
Figure BDA0003065601710000053
And
Figure BDA0003065601710000054
is a characteristic xiCorresponding positive and negative sample variances; n is1And n2For positive and negative sample volumes corresponding to the feature,
Figure BDA0003065601710000055
n is the total number of samples;
step 2.1.2: for all
Figure BDA0003065601710000056
Carry out sequencing, order
Figure BDA0003065601710000057
X corresponding to the maximum valueiIs a cluster CqSeed node x ofs
Step 2.1.3: computing cluster CqMiddle seed node xsAll nodes except for xsCorrelation coefficient of
Figure BDA0003065601710000058
The formula is as follows:
Figure BDA0003065601710000059
wherein E is a mathematical expectation;
step 2.1.4: sorting the correlation numbers from large to small, and deleting nodes corresponding to the first 15% of correlation coefficients in each cluster;
step 2.1.5: reserving remaining nodes as new clusters
Figure BDA00030656017100000510
Step 2.1.6: making q equal to q +1, judging whether q is larger than k, if so, turning to a step 2.2, otherwise, turning to a step 2.1.1;
Step 2.2: collect the updated clusters as $C'=\{C'_1,C'_2,\dots,C'_k\}$, let the parameter $w=1$, and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features;
Step 2.2.2: if w is greater than k, executing step 2.3, otherwise executing step 2.2.1;
step 2.3: calculating cumulative weights for each feature
Figure BDA0003065601710000068
All p arewSorting according to the sequence from big to small;
step 2.4: according to pwThe first l features are output as a final feature set f ═ f as a result of the sorting of (1)1,f2,…,flIn which f1Corresponds to pwThe term with the largest accumulated weight;
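The improved L1 regularization of steps 2.2 to 2.4 can be sketched in the spirit of stability selection: fit a Lasso on K random subsamples and use each feature's selection frequency as its weight. This is an illustrative reading of the description, not the authoritative implementation; the function name l1_feature_weights, the half-size subsample, and the use of scikit-learn's Lasso are assumptions:

    # Sketch of steps 2.2.1.1-2.2.1.6: repeated subsampling plus Lasso.
    import numpy as np
    from sklearn.linear_model import Lasso

    def l1_feature_weights(X, y, alpha=0.3, K=100, seed=0):
        """Returns p_t for each feature t: the number of the K rounds in
        which its Lasso coefficient was non-zero."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        weights = np.zeros(p)
        for _ in range(K):
            idx = rng.choice(n, size=max(2, n // 2), replace=False)  # subspace X*
            coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
            weights += (coef != 0)            # feature selected this round, step 2.2.1.4
        return weights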
Step 3: for the resulting feature set $f=\{f_1,f_2,\dots,f_l\}$, find the corresponding gene names in the original microarray data to complete the characteristic analysis of the genes.
In this embodiment, tests were performed on 8 public microarray datasets using different classifiers; as shown in the following table, in the tests the number of clusters k is 5, the number of repeated samplings K is 100, the penalty term coefficient α is 0.3, and the number of selected features is 10.
[Table: classification results of the method on the 8 public microarray datasets using different classifiers]
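Putting the three sketches above together with the parameters of this embodiment (k = 5, K = 100, α = 0.3, l = 10) gives the following illustrative pipeline; the synthetic data stand in for a real microarray matrix and its binary labels:

    # Illustrative end-to-end run with the embodiment's parameters
    # (synthetic stand-in data; a real run would use a microarray matrix).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))                       # 60 samples, 500 features
    y = (rng.random(60) > 0.5).astype(int)               # binary labels

    k, K, alpha, l = 5, 100, 0.3, 10
    clusters = cluster_features(X, k=k)                  # step 1
    pruned = [prune_cluster(X, y, c) for c in clusters]  # step 2.1
    weights = np.zeros(X.shape[1])
    for c in pruned:                                     # steps 2.2-2.3
        weights[c] += l1_feature_weights(X[:, c], y, alpha=alpha, K=K)
    selected = np.argsort(weights)[::-1][:l]             # step 2.4: top-l features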
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (4)

1. A high-dimensional data feature selection method based on improved L1 regularization and clustering, characterized by comprising the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 2: for each of the clusters $C_1$ to $C_k$ generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 3: for the resulting feature set $f=\{f_1,f_2,\dots,f_l\}$, find the corresponding gene names in the original microarray data to complete the characteristic analysis of the genes.
2. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, wherein step 1 specifically comprises the following steps:
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered;
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$.
3. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, wherein step 2 specifically comprises the following steps:
Step 2.1: for the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$, let the parameter $q=1$ and perform the following steps:
Step 2.1.1: for each feature $x_i$ in $C_q$, compute the value of the independent-samples t-test statistic $P_i$, as follows:
$$P_i=\frac{\bar{x}_{i1}-\bar{x}_{i2}}{\sqrt{S_{i1}^2/n_1+S_{i2}^2/n_2}} \qquad (4)$$
where $\bar{x}_{i1}$, $\bar{x}_{i2}$ and $S_{i1}^2$, $S_{i2}^2$ are the means and variances of the positive and negative samples corresponding to feature $x_i$; $n_1$ and $n_2$ are the positive and negative sample sizes corresponding to the feature, $i=1,\dots,n$, and $n$ is the total number of samples;
Step 2.1.2: sort all the $P_i$ values and take the feature $x_i$ corresponding to the maximum $P_i$ as the seed node $x_s$ of cluster $C_q$;
Step 2.1.3: compute the correlation coefficient $\rho_{x_s x_i}$ between the seed node $x_s$ and every other node $x_i$ in cluster $C_q$, as follows:
$$\rho_{x_s x_i}=\frac{E\big[(x_s-\mu_{x_s})(x_i-\mu_{x_i})\big]}{\sigma_{x_s}\,\sigma_{x_i}} \qquad (5)$$
where $E$ is the mathematical expectation;
Step 2.1.4: sort the correlation coefficients from largest to smallest and delete the nodes corresponding to the top 15% of the correlation coefficients in each cluster;
Step 2.1.5: retain the remaining nodes as the new cluster $C'_q$;
Step 2.1.6: let $q=q+1$ and judge whether $q$ is greater than $k$; if so, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: collect the updated clusters as $C'=\{C'_1,C'_2,\dots,C'_k\}$, let the parameter $w=1$, and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.2: let $w=w+1$; if $w$ is greater than $k$, execute step 2.3, otherwise execute step 2.2.1;
Step 2.3: compute the cumulative weight $p_w$ of each feature and sort all the $p_w$ from largest to smallest;
Step 2.4: according to the sorted $p_w$, output the first $l$ features as the final feature set $f=\{f_1,f_2,\dots,f_l\}$, where $f_1$ corresponds to the feature with the largest cumulative weight.
4. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 3, wherein step 2.2.1 specifically comprises the following steps:
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features.
CN202110525604.8A (priority date 2021-05-14, filing date 2021-05-14): High-dimensional data feature selection method based on improved L1 regularization and clustering; granted as CN113177604B; status: Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110525604.8A CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110525604.8A CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Publications (2)

Publication Number Publication Date
CN113177604A true CN113177604A (en) 2021-07-27
CN113177604B CN113177604B (en) 2024-04-16

Family

ID=76929261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110525604.8A Active CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Country Status (1)

Country Link
CN (1) CN113177604B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN105372198A (en) * 2015-10-28 2016-03-02 中北大学 Infrared spectrum wavelength selection method based on integrated L1 regularization
CN105740653A (en) * 2016-01-27 2016-07-06 北京工业大学 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis
CN107203787A (en) * 2017-06-14 2017-09-26 江西师范大学 A kind of unsupervised regularization matrix characteristics of decomposition system of selection
CN108960341A (en) * 2018-07-23 2018-12-07 安徽师范大学 A kind of structured features selection method towards brain network
CN109993214A (en) * 2019-03-08 2019-07-09 华南理工大学 Multiple view clustering method based on Laplace regularization and order constraint
CN112232413A (en) * 2020-10-16 2021-01-15 东北大学 High-dimensional data feature selection method based on graph neural network and spectral clustering
CN112364902A (en) * 2020-10-30 2021-02-12 太原理工大学 Feature selection learning method based on self-adaptive similarity
CN112327701A (en) * 2020-11-09 2021-02-05 浙江大学 Slow characteristic network monitoring method for nonlinear dynamic industrial process
CN112417028A (en) * 2020-11-26 2021-02-26 国电南瑞科技股份有限公司 Wind speed time sequence characteristic mining method and short-term wind power prediction method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DENG CAI et al.: "Unsupervised Feature Selection for Multi-Cluster Data", KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 25 July 2010, page 333, XP058270591, DOI: 10.1145/1835804.1835848 *
FEIPING NIE et al.: "Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization", Advances in Neural Information Processing Systems 23 (NIPS 2010), 31 December 2010, pages 1-9 *
KUN YU et al.: "ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data", BMC Bioinformatics, vol. 22, 22 October 2021, pages 1-19, XP021297783, DOI: 10.1186/s12859-021-04443-7 *
LI Zifa: "Research on efficient feature selection and classification methods for gene expression microarray data", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2019, pages 140-2420 *
DONG Limei et al.: "Unsupervised feature selection based on sparse clustering", Journal of Nanjing University (Natural Science), vol. 54, no. 1, 31 January 2018, pages 107-115 *
QIAN Youcheng: "Improved unsupervised simultaneous orthogonal basis clustering feature selection", Journal of Jilin Institute of Chemical Technology, vol. 36, no. 7, 31 July 2019, pages 80-85 *

Also Published As

Publication number Publication date
CN113177604B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
Sun et al. Local-learning-based feature selection for high-dimensional data analysis
EP3317823A1 (en) Method and apparatus for large scale machine learning
Futschik et al. Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue
Alomari et al. A hybrid filter-wrapper gene selection method for cancer classification
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114091603A (en) Spatial transcriptome cell clustering and analyzing method
CN113764034B (en) Method, device, equipment and medium for predicting potential BGC in genome sequence
Morovvat et al. An ensemble of filters and wrappers for microarray data classification
CN112613391B (en) Hyperspectral image waveband selection method based on reverse learning binary rice breeding algorithm
CN112801163B (en) Multi-target feature selection method of mouse model hippocampal biomarker based on dynamic graph structure
CN113838519B (en) Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN113177604B (en) High-dimensional data feature selection method based on improved L1 regularization and clustering
Chellamuthu et al. Data mining and machine learning approaches in breast cancer biomedical research
Bazan et al. Comparison of aggregation classes in ensemble classifiers for high dimensional datasets
CN113971984A (en) Classification model construction method and device, electronic equipment and storage medium
Wang et al. Semisupervised Bacterial Heuristic Feature Selection Algorithm for High-Dimensional Classification with Missing Labels
CN115017125B (en) Data processing method and device for improving KNN method
CN118053501A (en) Biomarker identification method based on genetic algorithm
CN116226629B (en) Multi-model feature selection method and system based on feature contribution
Guo et al. A comparison between the wrapper and hybrid methods for feature selection on biology Omics datasets
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
Kumar et al. Meta-heuristic search based gene selection and classification of microarray data
CN115758462A (en) Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment
Walker Iterative Random Forest Based High Performance Computing Methods Applied to Biological Systems and Human Health
Kowalski et al. Feature selection for regression tasks base on explainable artificial intelligence procedures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant