CN113177604A - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents

High-dimensional data feature selection method based on improved L1 regularization and clustering

Info

Publication number
CN113177604A
Authority
CN
China
Prior art keywords
feature
cluster
regularization
clustering
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110525604.8A
Other languages
Chinese (zh)
Other versions
CN113177604B (en)
Inventor
栗伟
谢维冬
王林洁
闵新
王珊珊
于鲲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN202110525604.8A
Publication of CN113177604A
Application granted
Publication of CN113177604B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 - Selection of the most significant subset of features
    • G06F18/2113 - Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G06F18/24 - Classification techniques
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, relating to the technical field of machine learning. The invention proposes a hybrid feature selection algorithm for microarray data analysis, built on the K-Means clustering algorithm and an improved L1 regularization idea: the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection to improve stability and classification accuracy.

Description

High-dimensional data feature selection method based on improved L1 regularization and clustering
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, a close relationship between many diseases and genes has been confirmed. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and the discovery of biomarkers is of great significance for the early diagnosis and prevention of disease. Microarray data analysis techniques have been developed to find the most informative biomarkers and to remove biomarkers that are redundant or unrelated to the target disease.
Microarray data analysis techniques are used to identify biomarkers. It is well known that, owing to the high feature dimensionality and small sample size of raw microarray data, the actual number of disease-related features (genes) is relatively small: such data typically contain few samples and a large number of features unrelated to the target disease. In addition, microarray data are highly complex; features are directly or indirectly correlated with one another, giving a high degree of redundancy, which makes many machine learning algorithms applied to such data exhibit low robustness and poor classification accuracy. Therefore, finding an appropriate method to reduce the number of features before constructing a model, thereby improving the model's classification accuracy and robustness, is of great significance.
Feature selection is important for mining large-scale high-dimensional data sets, such as those generated by microarray and mass spectrometry experiments, and for establishing statistical models: it identifies the significant features of the entire training data set. Feature selection is thus a key step in biomarker selection from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper, and embedded methods; the more advanced current approaches are hybrid feature selection methods formed by improving and combining these three in different ways. Most such methods stack two or more feature selectors to improve classification accuracy. However, in microarray data analysis, researchers tend to pay more attention to the stability of the feature selection results and to non-redundancy within the selected feature subset, i.e., to there being few redundant relationships among the selected features.
L1 regularization is an important tool in machine learning: adding the L1 norm to the cost function as a penalty term yields a sparse coefficient matrix and thereby achieves feature selection. The improved L1 regularization method combines sampling with selection, which weakens the sensitivity of the feature selection result to the regularization coefficient, significantly improves the stability of the result, and controls false positives. Clustering is the process of grouping the members of a data set that are similar in some respect; the K-Means clustering algorithm, through a computation based on Euclidean distance, can divide the samples into several weakly associated subsets, and can therefore be used to cluster and screen features.
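For concreteness, the L1-regularized (Lasso) objective referred to above takes the standard form

$$\min_{w}\;\frac{1}{2n}\,\lVert y - Xw\rVert_2^2 + \alpha\,\lVert w\rVert_1,$$

where $X$ is the sample matrix, $y$ the target variable, $w$ the coefficient vector, and $\alpha$ the regularization coefficient; the L1 penalty drives many entries of $w$ to exactly zero, which is what makes feature selection possible.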
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical scheme of the invention is a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprising the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered (here each feature is treated as one clustering sample);
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$;
Step 2: for each cluster C generated in step 11-CkIteratively deleting redundant features by utilizing a Pearson correlation coefficient, and updating each cluster;
step 2.1: for cluster C after partitioning { C ═ C1,C2,…,CkLet parameter q be 1, perform the following steps:
step 2.1.1: for CqCalculating each feature xiThe value of the test statistic P of the independent sample t is shown as the following formula;
Figure BDA0003065601710000031
wherein
Figure BDA0003065601710000032
And
Figure BDA0003065601710000033
is a characteristic xiCorresponding positive and negative sample variances; n is1And n2For positive and negative sample volumes corresponding to the feature,
Figure BDA0003065601710000034
n is the total number of samples;
step 2.1.2: for all
Figure BDA0003065601710000035
Carry out sequencing, order
Figure BDA0003065601710000036
X corresponding to the maximum valueiIs a cluster CqSeed node x ofs
Step 2.1.3: computing cluster CqMiddle seed node xsAll nodes except for xsCorrelation coefficient of
Figure BDA0003065601710000037
The formula is as follows:
Figure BDA0003065601710000038
wherein E is a mathematical expectation;
step 2.1.4: sorting the correlation numbers from large to small, and deleting nodes corresponding to the first 15% of correlation coefficients in each cluster;
step 2.1.5: reserving remaining nodes as new clusters
Figure BDA0003065601710000039
Step 2.1.6: making q equal to q +1, judging whether q is larger than k, if so, turning to a step 2.2, otherwise, turning to a step 2.1.1;
step 2.2: order the updated cluster to be aggregated
Figure BDA00030656017100000310
And when the parameter w is 1, executing the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features;
Step 2.2.2: if w is greater than k, executing step 2.3, otherwise executing step 2.2.1;
step 2.3: calculating cumulative weights for each feature
Figure BDA0003065601710000041
All p arewSorting according to the sequence from big to small;
step 2.4: according to pwThe first l features are output as a final feature set f ═ f as a result of the sorting of (1)1,f2…, f), where f1Corresponds to pwThe term with the largest accumulated weight;
and step 3: for the resulting feature set f ═ f1,f2,…,flAnd finding out the corresponding gene name from the original microarray data to complete the characteristic analysis of the gene.
The beneficial effects produced by adopting the technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and an improved L1 regularization idea, in which the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection to improve stability and classification accuracy.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered (here each feature is treated as one clustering sample);
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$;
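For illustration only, step 1 can be sketched in Python using scikit-learn as follows, assuming the microarray matrix is arranged as samples × features; the function name cluster_features is illustrative and not part of the claimed method:

    # Minimal sketch of step 1 (illustrative; assumes scikit-learn is available).
    # Cluster the features (columns) of a microarray matrix with K-Means,
    # treating each feature vector as one clustering sample.
    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_features(X, k=5, seed=0):
        """X: (n_samples, n_features) array. Returns k arrays of feature
        indices, one per cluster (the sets C_1, ..., C_k of step 1.7)."""
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X.T)
        return [np.where(labels == c)[0] for c in range(k)]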
Step 2: for each cluster C generated in step 11-CkIteratively deleting redundant features by utilizing a Pearson correlation coefficient, and updating each cluster;
step 2.1: for cluster C after partitioning { C ═ C1,C2,…,CkLet parameter q be 1, perform the following steps:
step 2.1.1: for CqCalculating each feature xiThe value of the test statistic P of the independent sample t is shown as the following formula;
Figure BDA0003065601710000052
wherein
Figure BDA0003065601710000053
And
Figure BDA0003065601710000054
is a characteristic xiCorresponding positive and negative sample variances; n is1And n2For positive and negative sample volumes corresponding to the feature,
Figure BDA0003065601710000055
n is the total number of samples;
step 2.1.2: for all
Figure BDA0003065601710000056
Carry out sequencing, order
Figure BDA0003065601710000057
X corresponding to the maximum valueiIs a cluster CqSeed node x ofs
Step 2.1.3: computing cluster CqMiddle seed node xsAll nodes except for xsCorrelation coefficient of
Figure BDA0003065601710000058
The formula is as follows:
Figure BDA0003065601710000059
wherein E is a mathematical expectation;
step 2.1.4: sorting the correlation numbers from large to small, and deleting nodes corresponding to the first 15% of correlation coefficients in each cluster;
step 2.1.5: reserving remaining nodes as new clusters
Figure BDA00030656017100000510
Step 2.1.6: making q equal to q +1, judging whether q is larger than k, if so, turning to a step 2.2, otherwise, turning to a step 2.1.1;
Step 2.2: collect the updated clusters as $C'=\{C'_1,C'_2,\dots,C'_k\}$, let the parameter $w=1$, and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features;
Step 2.2.2: if w is greater than k, executing step 2.3, otherwise executing step 2.2.1;
step 2.3: calculating cumulative weights for each feature
Figure BDA0003065601710000068
All p arewSorting according to the sequence from big to small;
step 2.4: according to pwThe first l features are output as a final feature set f ═ f as a result of the sorting of (1)1,f2,…,flIn which f1Corresponds to pwThe term with the largest accumulated weight;
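The improved L1 regularization of steps 2.2 to 2.4 can be sketched in the spirit of stability selection: fit a Lasso on K random subsamples and use each feature's selection frequency as its weight. This is an illustrative reading of the description, not the authoritative implementation; the function name l1_feature_weights, the half-size subsample, and the use of scikit-learn's Lasso are assumptions:

    # Sketch of steps 2.2.1.1-2.2.1.6: repeated subsampling plus Lasso.
    import numpy as np
    from sklearn.linear_model import Lasso

    def l1_feature_weights(X, y, alpha=0.3, K=100, seed=0):
        """Returns p_t for each feature t: the number of the K rounds in
        which its Lasso coefficient was non-zero."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        weights = np.zeros(p)
        for _ in range(K):
            idx = rng.choice(n, size=max(2, n // 2), replace=False)  # subspace X*
            coef = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx]).coef_
            weights += (coef != 0)            # feature selected this round, step 2.2.1.4
        return weights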
Step 3: for the resulting feature set $f=\{f_1,f_2,\dots,f_l\}$, find the corresponding gene names in the original microarray data to complete the characteristic analysis of the genes.
In this embodiment, tests were performed on 8 public microarray datasets using different classifiers; as shown in the following table, in the tests the number of clusters k is 5, the number of repeated samplings K is 100, the penalty term coefficient α is 0.3, and the number of selected features is 10.
[Table: classification results of the method on the 8 public microarray datasets using different classifiers]
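Putting the three sketches above together with the parameters of this embodiment (k = 5, K = 100, α = 0.3, l = 10) gives the following illustrative pipeline; the synthetic data stand in for a real microarray matrix and its binary labels:

    # Illustrative end-to-end run with the embodiment's parameters
    # (synthetic stand-in data; a real run would use a microarray matrix).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 500))                       # 60 samples, 500 features
    y = (rng.random(60) > 0.5).astype(int)               # binary labels

    k, K, alpha, l = 5, 100, 0.3, 10
    clusters = cluster_features(X, k=k)                  # step 1
    pruned = [prune_cluster(X, y, c) for c in clusters]  # step 2.1
    weights = np.zeros(X.shape[1])
    for c in pruned:                                     # steps 2.2-2.3
        weights[c] += l1_feature_weights(X[:, c], y, alpha=alpha, K=K)
    selected = np.argsort(weights)[::-1][:l]             # step 2.4: top-l features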
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions as defined in the appended claims.

Claims (4)

1. A high-dimensional data feature selection method based on improved L1 regularization and clustering, characterized by comprising the following steps:
Step 1: given a gene microarray data set, cluster the features of the gene microarray data using the K-Means clustering algorithm;
Step 2: for each of the clusters $C_1$ to $C_k$ generated in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 3: for the resulting feature set $f=\{f_1,f_2,\dots,f_l\}$, find the corresponding gene names in the original microarray data to complete the characteristic analysis of the genes.
2. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, wherein step 1 specifically comprises the following steps:
Step 1.1: take the gene microarray data sample set $D=\{x_1,x_2,\dots,x_m\}$ as the input of the K-Means clustering algorithm, with the number of clusters set to $k$, where $x_j$ denotes the j-th feature in the sample set and $m$ is the number of samples to be clustered;
Step 1.2: randomly select $k$ samples from the sample set $D$ as the initial mean vectors $\{\mu_1,\mu_2,\dots,\mu_k\}$, where $\mu_i$ denotes the mean vector corresponding to the i-th selected sample;
Step 1.3: initialize $j=1$ and perform the following operations for each feature $x_j$ in the sample set $D$:
Step 1.3.1: define the clusters that store the clustered samples, initializing $C_i=\varnothing$ for $i=1,2,\dots,k$;
Step 1.3.2: compute the distance between feature $x_j$ and each mean vector $\mu_i$, denoted $d_{ji}$, as follows:
$$d_{ji}=\lVert x_j-\mu_i\rVert_2 \qquad (1)$$
Step 1.3.3: compute the cluster label $\lambda_j$ of feature $x_j$ as follows:
$$\lambda_j=\underset{i\in\{1,2,\dots,k\}}{\arg\min}\; d_{ji} \qquad (2)$$
Step 1.3.4: put feature $x_j$ into the corresponding cluster, i.e., $C_{\lambda_j}=C_{\lambda_j}\cup\{x_j\}$;
Step 1.3.5: let $j=j+1$; if $j$ is greater than $m$, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: initialize $i=1$ and perform the following operations for each mean vector $\mu_i$:
Step 1.4.1: compute the updated value of $\mu_i$, denoted $\mu'_i$, as follows:
$$\mu'_i=\frac{1}{|C_i|}\sum_{x\in C_i} x \qquad (3)$$
where $x$ ranges over all the features in cluster $C_i$;
Step 1.4.2: judge whether the current $\mu_i$ is equal to $\mu'_i$; if not, go to step 1.4.3; otherwise keep the current $\mu_i$ unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector $\mu_i$ to $\mu'_i$;
Step 1.4.4: let $i=i+1$ and judge whether $i$ is greater than $k$; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any current mean vector $\mu_i$ was updated, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all the obtained $C_i$, where $i=1,2,\dots,k$, let $C=\{C_1,C_2,\dots,C_k\}$;
Step 1.7: output the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$.
3. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, wherein step 2 specifically comprises the following steps:
Step 2.1: for the partitioned clusters $C=\{C_1,C_2,\dots,C_k\}$, let the parameter $q=1$ and perform the following steps:
Step 2.1.1: for each feature $x_i$ in $C_q$, compute the value of the independent-samples t-test statistic $P_i$, as follows:
$$P_i=\frac{\bar{x}_{i1}-\bar{x}_{i2}}{\sqrt{S_{i1}^2/n_1+S_{i2}^2/n_2}} \qquad (4)$$
where $\bar{x}_{i1}$, $\bar{x}_{i2}$ and $S_{i1}^2$, $S_{i2}^2$ are the means and variances of the positive and negative samples corresponding to feature $x_i$; $n_1$ and $n_2$ are the positive and negative sample sizes corresponding to the feature, $i=1,\dots,n$, and $n$ is the total number of samples;
Step 2.1.2: sort all the $P_i$ values and take the feature $x_i$ corresponding to the maximum $P_i$ as the seed node $x_s$ of cluster $C_q$;
Step 2.1.3: compute the correlation coefficient $\rho_{x_s x_i}$ between the seed node $x_s$ and every other node $x_i$ in cluster $C_q$, as follows:
$$\rho_{x_s x_i}=\frac{E\big[(x_s-\mu_{x_s})(x_i-\mu_{x_i})\big]}{\sigma_{x_s}\,\sigma_{x_i}} \qquad (5)$$
where $E$ is the mathematical expectation;
Step 2.1.4: sort the correlation coefficients from largest to smallest and delete the nodes corresponding to the top 15% of the correlation coefficients in each cluster;
Step 2.1.5: retain the remaining nodes as the new cluster $C'_q$;
Step 2.1.6: let $q=q+1$ and judge whether $q$ is greater than $k$; if so, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: collect the updated clusters as $C'=\{C'_1,C'_2,\dots,C'_k\}$, let the parameter $w=1$, and perform the following steps:
Step 2.2.1: apply the feature selection algorithm with improved L1 regularization to each input cluster $C'_w$, selecting features, with the weight of the t-th feature denoted $p_t$;
Step 2.2.2: let $w=w+1$; if $w$ is greater than $k$, execute step 2.3, otherwise execute step 2.2.1;
Step 2.3: compute the cumulative weight $p_w$ of each feature and sort all the $p_w$ from largest to smallest;
Step 2.4: according to the sorted $p_w$, output the first $l$ features as the final feature set $f=\{f_1,f_2,\dots,f_l\}$, where $f_1$ corresponds to the feature with the largest cumulative weight.
4. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 3, wherein step 2.2.1 specifically comprises the following steps:
Step 2.2.1.1: input the sample space $X\in\mathbb{R}^{n\times p}$, where $n$ denotes the number of samples and $p$ the number of features, together with the target variable $y\in\mathbb{R}^{n}$; define the regularization coefficient $\alpha$ and the number of repeated samplings $K$ as parameters, and set a counter $i=1$;
Step 2.2.1.2: randomly draw samples from the sample space $X$ to form the sample subspace $X^*$, and obtain the corresponding target variable $y^*$;
Step 2.2.1.3: compute the loss function $E(X^*,y^*)+\alpha\lVert w\rVert_1$ using the Lasso regression model, where $w$ is the coefficient vector of the penalty term;
Step 2.2.1.4: if the coefficient returned by the model for the t-th feature is non-zero, the feature is selected; in that case update its feature weight as $p_t=p_t+1$;
Step 2.2.1.5: let $i=i+1$ and judge whether $i$ is greater than $K$; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights $p_t$ corresponding to all the features.
CN202110525604.8A (priority date 2021-05-14, filing date 2021-05-14): High-dimensional data feature selection method based on improved L1 regularization and clustering; granted as CN113177604B; status: Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110525604.8A CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110525604.8A CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Publications (2)

Publication Number Publication Date
CN113177604A true CN113177604A (en) 2021-07-27
CN113177604B CN113177604B (en) 2024-04-16

Family

ID=76929261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110525604.8A Active CN113177604B (en) 2021-05-14 2021-05-14 High-dimensional data feature selection method based on improved L1 regularization and clustering

Country Status (1)

Country Link
CN (1) CN113177604B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067655A2 (en) * 2007-11-21 2009-05-28 University Of Florida Research Foundation, Inc. Methods of feature selection through local learning; breast and prostate cancer prognostic markers
CN105372198A (en) * 2015-10-28 2016-03-02 中北大学 Infrared spectrum wavelength selection method based on integrated L1 regularization
CN105740653A (en) * 2016-01-27 2016-07-06 北京工业大学 Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis
CN107203787A (en) * 2017-06-14 2017-09-26 江西师范大学 A kind of unsupervised regularization matrix characteristics of decomposition system of selection
CN108960341A (en) * 2018-07-23 2018-12-07 安徽师范大学 A kind of structured features selection method towards brain network
CN109993214A (en) * 2019-03-08 2019-07-09 华南理工大学 Multiple view clustering method based on Laplace regularization and order constraint
CN112232413A (en) * 2020-10-16 2021-01-15 东北大学 High-dimensional data feature selection method based on graph neural network and spectral clustering
CN112364902A (en) * 2020-10-30 2021-02-12 太原理工大学 Feature selection learning method based on self-adaptive similarity
CN112327701A (en) * 2020-11-09 2021-02-05 浙江大学 Slow characteristic network monitoring method for nonlinear dynamic industrial process
CN112417028A (en) * 2020-11-26 2021-02-26 国电南瑞科技股份有限公司 Wind speed time sequence characteristic mining method and short-term wind power prediction method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
DENG CAI et al.: "Unsupervised Feature Selection for Multi-Cluster Data", KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 25 July 2010, page 333, XP058270591, DOI: 10.1145/1835804.1835848 *
FEIPING NIE et al.: "Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization", Advances in Neural Information Processing Systems 23 (NIPS 2010), 31 December 2010, pages 1-9 *
KUN YU et al.: "ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data", BMC Bioinformatics, vol. 22, 22 October 2021, pages 1-19, XP021297783, DOI: 10.1186/s12859-021-04443-7 *
LI Zifa: "Research on efficient feature selection and classification methods for gene expression microarray data", China Master's Theses Full-text Database, Information Science and Technology, no. 01, 15 January 2019, pages 140-2420 *
DONG Limei et al.: "Unsupervised feature selection based on sparse clustering", Journal of Nanjing University (Natural Science), vol. 54, no. 1, 31 January 2018, pages 107-115 *
QIAN Youcheng: "Improved unsupervised simultaneous orthogonal basis clustering feature selection", Journal of Jilin Institute of Chemical Technology, vol. 36, no. 7, 31 July 2019, pages 80-85 *

Also Published As

Publication number Publication date
CN113177604B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
Sun et al. Local-learning-based feature selection for high-dimensional data analysis
EP3317823A1 (en) Method and apparatus for large scale machine learning
Futschik et al. Evolving connectionist systems for knowledge discovery from gene expression data of cancer tissue
Alomari et al. A hybrid filter-wrapper gene selection method for cancer classification
CN112926640B (en) Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium
CN114091603A (en) Spatial transcriptome cell clustering and analyzing method
CN113764034B (en) Method, device, equipment and medium for predicting potential BGC in genome sequence
Morovvat et al. An ensemble of filters and wrappers for microarray data classification
CN112613391B (en) Hyperspectral image waveband selection method based on reverse learning binary rice breeding algorithm
CN112801163B (en) Multi-target feature selection method of mouse model hippocampal biomarker based on dynamic graph structure
CN113838519B (en) Gene selection method and system based on adaptive gene interaction regularization elastic network model
CN113177604B (en) High-dimensional data feature selection method based on improved L1 regularization and clustering
Chellamuthu et al. Data mining and machine learning approaches in breast cancer biomedical research
Bazan et al. Comparison of aggregation classes in ensemble classifiers for high dimensional datasets
CN113971984A (en) Classification model construction method and device, electronic equipment and storage medium
Wang et al. Semisupervised Bacterial Heuristic Feature Selection Algorithm for High-Dimensional Classification with Missing Labels
CN115017125B (en) Data processing method and device for improving KNN method
CN118053501A (en) Biomarker identification method based on genetic algorithm
CN116226629B (en) Multi-model feature selection method and system based on feature contribution
Guo et al. A comparison between the wrapper and hybrid methods for feature selection on biology Omics datasets
CN114334168A (en) Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy
Kumar et al. Meta-heuristic search based gene selection and classification of microarray data
CN115758462A (en) Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment
Walker Iterative Random Forest Based High Performance Computing Methods Applied to Biological Systems and Human Health
Kowalski et al. Feature selection for regression tasks base on explainable artificial intelligence procedures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant