CN113177604B - High-dimensional data feature selection method based on improved L1 regularization and clustering - Google Patents
- Publication number: CN113177604B (application CN202110525604.8A)
- Authority
- CN
- China
- Prior art keywords
- feature
- cluster
- regularization
- sample
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
- G06F18/2113—Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering, and relates to the technical field of machine learning. The method is a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and the idea of improved L1 regularization: the K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is then used for feature selection, improving both stability and classification accuracy.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a high-dimensional data feature selection method based on improved L1 regularization and clustering.
Background
Clinically, many diseases have been shown to be closely related to genes. In general, genes whose expression levels are highly correlated with the occurrence of a disease are called biomarkers, and discovering biomarkers is of great importance for early diagnosis and disease prevention. Microarray data analysis techniques were developed to find the most informative biomarkers and to remove redundant biomarkers unrelated to the target disease.
Microarray data analysis techniques are used to identify biomarkers. Raw microarray data have high feature dimensionality and small sample size, and the number of truly disease-related features (genes) is comparatively small: such data typically contain few samples and a large number of features unrelated to the disease of interest. In addition, microarray data are highly complex, i.e. features are directly or indirectly interrelated and highly redundant, which makes many machine learning algorithms applied to them exhibit low robustness and poor classification accuracy. Finding a suitable method to reduce the number of features before building a model is therefore very important for improving the classification accuracy and robustness of the model.
Feature selection is significant for mining large-scale high-dimensional datasets, such as those generated by microarray and mass spectrometry experiments, and for building statistical models: it identifies the significant features in the whole training dataset, and it is a key step when selecting biomarkers from high-dimensional, small-sample biological data. Common feature selection methods can be divided into filter, wrapper, and embedded methods; the currently more advanced approach is hybrid feature selection, which improves and combines the three in different ways. Most such methods stack two or more feature selectors in order to raise classification accuracy. In microarray data analysis, however, researchers tend to pay more attention to the stability of the feature selection result and to the non-redundancy of the selected subset, i.e. to there being few redundant relationships among the selected features.
L1 regularization is an important tool in machine learning: adding the L1 norm of the coefficients to the cost function as a penalty term yields a sparse coefficient vector and thereby achieves feature selection. The improved L1 regularization method is based on combining resampling with selection, which weakens the sensitivity of the feature selection result to the regularization coefficient, markedly improves the stability of the result, and controls false positives. Clustering is the process of grouping similar data members, and the K-Means clustering algorithm, using a Euclidean-distance-based computation, can divide the features into several weakly correlated subsets, enabling the grouping and screening of features.
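As a concrete illustration of the sparsity just described, the following sketch (our own, not from the patent; the data, the coefficient values, and the use of scikit-learn's Lasso are illustrative assumptions) shows how an L1 penalty zeroes out most regression coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative only: an L1 penalty on the cost function drives most
# regression coefficients to exactly zero, which is what makes it
# usable as a feature selector.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))                      # 60 samples, 100 features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=60)

model = Lasso(alpha=0.3).fit(X, y)
selected = np.flatnonzero(model.coef_)              # indices of nonzero coefficients
print(len(selected), "of", X.shape[1], "features kept")
```

With a larger α more coefficients are driven to zero; this sensitivity of the result to α is exactly what the resampling-based improved scheme is meant to mitigate.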
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering.
The technical solution of the invention, a high-dimensional data feature selection method based on improved L1 regularization and clustering, comprises the following steps:
Step 1: given a gene microarray data set, cluster the microarray features using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.1: initialize empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
Step 1.3.4: put feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
Step 1.3.5: let j = j + 1 and judge whether j is greater than n, the total number of features; if so, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_o, set o = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
Step 1.4.2: judge whether the current μ_o equals μ′_o; if not, go to step 1.4.3 to update it, otherwise keep the current μ_o unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_o to the value μ′_o;
Step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_o was updated in the current round, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, let C = {C_1, C_2, ..., C_k};
Step 1.7: output the partitioned clusters C = {C_1, C_2, ..., C_k};
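Steps 1.1 to 1.7 are the standard K-Means iteration applied to the features (columns) of the microarray matrix rather than to the samples. A minimal sketch using scikit-learn's KMeans in place of the hand-written loop (the matrix size and k are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Step 1 clusters the *features* (genes), not the samples, so K-Means is
# run on the transposed matrix: each column becomes one point to cluster.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))      # 40 samples, 200 gene features
k = 5                               # number of clusters

labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X.T)
clusters = [np.flatnonzero(labels == b) for b in range(k)]   # C_1 .. C_k
print([len(c) for c in clusters])
```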
Step 2: for each cluster C_1 to C_k produced in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, set the parameter q = 1 and perform the following steps:
Step 2.1.1: for C_q, compute for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
Step 2.1.2: sort all the statistics; the feature x_j corresponding to the maximum value is taken as the seed node x_s of cluster C_q;
Step 2.1.3: compute the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients in descending order and delete, in each cluster, the nodes corresponding to the top 15% of the coefficients;
Step 2.1.5: keep the remaining nodes as the new cluster C′_q;
Step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
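For a single cluster, steps 2.1.1 to 2.1.5 can be sketched as follows. The data is synthetic, and taking the largest absolute t statistic as the most discriminative seed is our reading of the partly illegible formula:

```python
import numpy as np
from scipy import stats

# Sketch for ONE cluster: pick the feature with the strongest
# independent-samples t statistic as the seed node, then drop the 15% of
# remaining features most correlated (in absolute value) with it.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 30))        # 50 samples, 30 features in this cluster
y = rng.integers(0, 2, size=50)      # positive/negative class labels

t, _ = stats.ttest_ind(X[y == 1], X[y == 0], equal_var=False)
seed = int(np.argmax(np.abs(t)))     # seed node x_s

others = [j for j in range(X.shape[1]) if j != seed]
corr = np.array([abs(np.corrcoef(X[:, j], X[:, seed])[0, 1]) for j in others])
n_drop = int(0.15 * len(others))     # top 15% most correlated with the seed
drop = set(np.array(others)[np.argsort(corr)[::-1][:n_drop]])
new_cluster = [j for j in range(X.shape[1]) if j not in drop]
print(len(new_cluster))
```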
Step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, set the parameter r = 1 and perform the following steps:
Step 2.2.1: apply the improved-L1-regularization feature selection algorithm to each input cluster C′_r, and let W_j denote the weight of the j-th feature;
Step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
Step 2.2.1.2: randomly sample a subset of the sample space as the subspace X*, and obtain the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
Step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, the feature is regarded as selected and its weight W_j is incremented;
Step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights W_j of all x_j;
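Steps 2.2.1.1 to 2.2.1.6 amount to a stability-selection-style loop: subsample, fit a Lasso, and count how often each feature is selected. A sketch follows; subsampling half the rows each round is our assumption, since the exact sampling size is not legible in the text:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Repeat K times: subsample the rows, fit a Lasso with penalty alpha,
# and count how often each feature receives a nonzero coefficient.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 40))
y = 2 * X[:, 3] + rng.normal(scale=0.1, size=80)
alpha, K = 0.3, 100

weights = np.zeros(X.shape[1])
for _ in range(K):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
    weights += (coef != 0)           # increment weight of selected features
weights /= K                         # selection frequency in [0, 1]
print(weights[3])                    # the informative feature is selected often
```

Averaging selection counts over many subsamples is what makes the result far less sensitive to the exact choice of α than a single Lasso fit.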
Step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
Step 2.3: compute the accumulated weight of each feature and sort all accumulated weights in descending order;
Step 2.4: output, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
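Steps 2.3 and 2.4 then reduce to sorting the accumulated weights and keeping the top l features (toy values for illustration):

```python
import numpy as np

# Merge the per-cluster weights into one vector, sort descending,
# and keep the top l feature indices.
weights = np.array([0.2, 0.9, 0.1, 0.7, 0.4])   # accumulated weights for 5 features
l = 3
order = np.argsort(weights)[::-1]               # indices sorted by weight, descending
final_features = order[:l].tolist()
print(final_features)  # [1, 3, 4]
```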
Step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
The beneficial effects of adopting this technical method are as follows:
The invention provides a high-dimensional data feature selection method based on improved L1 regularization and clustering: a hybrid feature selection algorithm for microarray data analysis built on the K-Means clustering algorithm and the idea of improved L1 regularization. The K-Means clustering algorithm is used in data preprocessing to delete redundant features, and the improved L1 regularization method is used for feature selection, so that stability and classification accuracy are improved.
Drawings
FIG. 1 is a flow chart of the overall process of the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
A high-dimensional data feature selection method based on improved L1 regularization and clustering, as shown in FIG. 1, comprises the following steps:
Step 1: given a gene microarray data set, cluster the microarray features using the K-Means clustering algorithm;
Step 1.1: take the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
Step 1.2: randomly select k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
Step 1.3: for each feature x_j in the sample set D, initialize j = 1 and perform the following operations:
Step 1.3.1: initialize empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
Step 1.3.2: compute the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
Step 1.3.3: compute the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
Step 1.3.4: put feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
Step 1.3.5: let j = j + 1 and judge whether j is greater than n, the total number of features; if so, go to step 1.4, otherwise go to step 1.3.2;
Step 1.4: for each mean vector μ_o, set o = 1 and perform the following operations:
Step 1.4.1: compute the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
Step 1.4.2: judge whether the current μ_o equals μ′_o; if not, go to step 1.4.3 to update it, otherwise keep the current μ_o unchanged and go to step 1.4.4;
Step 1.4.3: update the current mean vector μ_o to the value μ′_o;
Step 1.4.4: let o = o + 1 and judge whether o is greater than k; if so, go to step 1.5, otherwise go to step 1.4.1;
Step 1.5: if any mean vector μ_o was updated in the current round, go to step 1.3, otherwise go to step 1.6;
Step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, let C = {C_1, C_2, ..., C_k};
Step 1.7: output the partitioned clusters C = {C_1, C_2, ..., C_k};
Step 2: for each cluster C_1 to C_k produced in step 1, iteratively delete redundant features using the Pearson correlation coefficient and update each cluster;
Step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, set the parameter q = 1 and perform the following steps:
Step 2.1.1: for C_q, compute for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
Step 2.1.2: sort all the statistics; the feature x_j corresponding to the maximum value is taken as the seed node x_s of cluster C_q;
Step 2.1.3: compute the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
Step 2.1.4: sort the correlation coefficients in descending order and delete, in each cluster, the nodes corresponding to the top 15% of the coefficients;
Step 2.1.5: keep the remaining nodes as the new cluster C′_q;
Step 2.1.6: let q = q + 1 and judge whether q is greater than k; if so, go to step 2.2, otherwise go to step 2.1.1;
Step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, set the parameter r = 1 and perform the following steps:
Step 2.2.1: apply the improved-L1-regularization feature selection algorithm to each input cluster C′_r, and let W_j denote the weight of the j-th feature;
Step 2.2.1.1: input the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; define the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
Step 2.2.1.2: randomly sample a subset of the sample space as the subspace X*, and obtain the corresponding target variable y*;
Step 2.2.1.3: fit a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
Step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, the feature is regarded as selected and its weight W_j is incremented;
Step 2.2.1.5: let h = h + 1 and judge whether h is greater than K; if so, go to step 2.2.1.6, otherwise go to step 2.2.1.2;
Step 2.2.1.6: output the feature weights W_j of all x_j;
Step 2.2.2: let r = r + 1 and judge whether r is greater than k; if so, perform step 2.3, otherwise perform step 2.2.1;
Step 2.3: compute the accumulated weight of each feature and sort all accumulated weights in descending order;
Step 2.4: output, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
Step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, find the corresponding gene names in the original microarray data to complete the gene feature analysis.
In this embodiment, the method was tested on 8 public microarray data sets with different classifiers, as shown in the following table; in these tests the number of clusters was k = 5, the number of repeated samplings K = 100, the penalty coefficient α was 0.3, and the number of selected features was 10.
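Putting the pieces together, the whole procedure can be sketched end to end as below. The function name, the synthetic data, and the half-row subsampling are our assumptions; the parameters mirror the embodiment (k = 5, α = 0.3, 10 output features), with K reduced from 100 to keep the demo fast:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

def select_features(X, y, k=5, K=100, alpha=0.3, n_out=10, seed=0):
    """Sketch of the full pipeline: K-Means on features, t-test seed +
    Pearson pruning per cluster, then resampled-Lasso selection counts."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X.T)
    weights = np.zeros(X.shape[1])
    for b in range(k):
        feats = np.flatnonzero(labels == b)
        # Step 2.1: seed via t statistic, drop top-15% features correlated with it
        t, _ = stats.ttest_ind(X[y == 1][:, feats], X[y == 0][:, feats],
                               equal_var=False)
        s = feats[np.argmax(np.abs(t))]
        corr = {j: abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for j in feats if j != s}
        drop = set(sorted(corr, key=corr.get, reverse=True)[: int(0.15 * len(corr))])
        feats = np.array([j for j in feats if j not in drop])
        # Step 2.2: improved L1 -- count Lasso selections over K subsamples
        for _ in range(K):
            idx = rng.choice(len(X), size=len(X) // 2, replace=False)
            coef = Lasso(alpha=alpha).fit(X[idx][:, feats], y[idx]).coef_
            weights[feats] += (coef != 0)
    # Steps 2.3-2.4: top n_out features by accumulated weight
    return np.argsort(weights)[::-1][:n_out]

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 120))
y = rng.integers(0, 2, size=60)
top = select_features(X, y, K=10)   # fewer resamplings to keep the demo fast
print(top)
```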
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the scope of the corresponding technical solutions, which is defined by the appended claims.
Claims (2)
1. A high-dimensional data feature selection method based on improved L1 regularization and clustering, characterized by comprising the following steps:
step 1: given a gene microarray data set, clustering the microarray features using the K-Means clustering algorithm;
step 2: for each cluster C_1 to C_k produced in step 1, iteratively deleting redundant features using the Pearson correlation coefficient and updating each cluster;
step 2.1: for the partitioned clusters C = {C_1, C_2, ..., C_k}, setting the parameter q = 1 and performing the following steps:
step 2.1.1: for C_q, computing for each feature x_j the independent-samples t-test statistic and its P value P_j, where the statistic is
t_j = (x̄_j⁺ − x̄_j⁻) / sqrt(s_j⁺² / n_1 + s_j⁻² / n_2)    (4)
and x̄_j⁺, x̄_j⁻ and s_j⁺², s_j⁻² are the means and variances of feature x_j over the positive and negative samples, n_1 and n_2 are the corresponding positive and negative sample sizes, and n is the total number of features;
step 2.1.2: sorting all the statistics, the feature x_j corresponding to the maximum value being taken as the seed node x_s of cluster C_q;
step 2.1.3: computing the Pearson correlation coefficient ρ(x_j, x_s) between the seed node x_s and every other node x_j of cluster C_q:
ρ(x_j, x_s) = E[(x_j − E[x_j])(x_s − E[x_s])] / (σ_{x_j} σ_{x_s})    (5)
where E denotes the mathematical expectation and σ the standard deviation;
step 2.1.4: sorting the correlation coefficients in descending order and deleting, in each cluster, the nodes corresponding to the top 15% of the coefficients;
step 2.1.5: keeping the remaining nodes as the new cluster C′_q;
step 2.1.6: letting q = q + 1 and judging whether q is greater than k; if so, going to step 2.2, otherwise going to step 2.1.1;
step 2.2: for the updated clusters C′ = {C′_1, C′_2, ..., C′_k}, setting the parameter r = 1 and performing the following steps:
step 2.2.1: applying the improved-L1-regularization feature selection algorithm to each input cluster C′_r, with W_j denoting the weight of the j-th feature;
step 2.2.1.1: inputting the sample space X ∈ R^{m×n}, where m is the number of samples and n is the total number of features, and the target variable y ∈ R^m; defining the regularization coefficient α, the number of repeated samplings K, and a counter h = 1;
step 2.2.1.2: randomly sampling a subset of the sample space as the subspace X* and obtaining the corresponding target variable y*;
step 2.2.1.3: fitting a Lasso regression model by minimizing the loss function E(X*, y*) + α‖w‖_1, where w is the vector of penalized regression coefficients;
step 2.2.1.4: if the coefficient of x_j in the fitted regression model, denoted g, is nonzero, regarding the feature as selected and incrementing its weight W_j;
step 2.2.1.5: letting h = h + 1 and judging whether h is greater than K; if so, going to step 2.2.1.6, otherwise going to step 2.2.1.2;
step 2.2.1.6: outputting the feature weights W_j of all x_j;
step 2.2.2: letting r = r + 1 and judging whether r is greater than k; if so, performing step 2.3, otherwise performing step 2.2.1;
step 2.3: computing the accumulated weight of each feature and sorting all accumulated weights in descending order;
step 2.4: outputting, according to the sorted accumulated weights, the first l features as the final feature set F = {f_1, f_2, ..., f_l}, where f_1 corresponds to the largest accumulated weight;
step 3: for the resulting feature set F = {f_1, f_2, ..., f_l}, finding the corresponding gene names in the original microarray data to complete the gene feature analysis.
2. The high-dimensional data feature selection method based on improved L1 regularization and clustering according to claim 1, characterized in that said step 1 specifically comprises the following steps:
step 1.1: taking the gene microarray data sample set D = {x_1, x_2, ..., x_m} and the number of clusters k as the input of the K-Means clustering algorithm, where x_j denotes the j-th feature in the sample set and m is the number of samples;
step 1.2: randomly selecting k samples from the sample set D as the initial mean vectors {μ_1, μ_2, ..., μ_o, ..., μ_k}, where μ_o denotes the mean vector of the o-th cluster;
step 1.3: for each feature x_j in the sample set D, initializing j = 1 and performing the following operations:
step 1.3.1: initializing empty clusters C_b = ∅ (b = 1, 2, ..., k) to store the clustered features;
step 1.3.2: computing the distance between feature x_j and each mean vector μ_o, denoted d_jo, as
d_jo = ||x_j − μ_o||_2    (1)
step 1.3.3: computing the cluster label λ_j of feature x_j as
λ_j = argmin_{o ∈ {1, 2, ..., k}} d_jo    (2)
step 1.3.4: putting feature x_j into the corresponding cluster, i.e. C_{λ_j} = C_{λ_j} ∪ {x_j};
step 1.3.5: letting j = j + 1 and judging whether j is greater than n, the total number of features; if so, going to step 1.4, otherwise going to step 1.3.2;
step 1.4: for each mean vector μ_o, setting o = 1 and performing the following operations:
step 1.4.1: computing the updated value of μ_o, denoted μ′_o, as
μ′_o = (1 / |C_o|) Σ_{x ∈ C_o} x    (3)
where x ranges over all features of the data set assigned to cluster C_o;
step 1.4.2: judging whether the current μ_o equals μ′_o; if not, going to step 1.4.3 to update it, otherwise keeping the current μ_o unchanged and going to step 1.4.4;
step 1.4.3: updating the current mean vector μ_o to the value μ′_o;
step 1.4.4: letting o = o + 1 and judging whether o is greater than k; if so, going to step 1.5, otherwise going to step 1.4.1;
step 1.5: if any mean vector μ_o was updated in the current round, going to step 1.3, otherwise going to step 1.6;
step 1.6: for all obtained clusters C_b, b = 1, 2, ..., k, letting C = {C_1, C_2, ..., C_k};
step 1.7: outputting the partitioned clusters C = {C_1, C_2, ..., C_k}.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110525604.8A | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110525604.8A | 2021-05-14 | 2021-05-14 | High-dimensional data feature selection method based on improved L1 regularization and clustering |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113177604A | 2021-07-27 |
| CN113177604B | 2024-04-16 |
Family
ID=76929261
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110525604.8A | High-dimensional data feature selection method based on improved L1 regularization and clustering | 2021-05-14 | 2021-05-14 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN113177604B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009067655A2 (en) * | 2007-11-21 | 2009-05-28 | University Of Florida Research Foundation, Inc. | Methods of feature selection through local learning; breast and prostate cancer prognostic markers |
CN105372198A (en) * | 2015-10-28 | 2016-03-02 | 中北大学 | Infrared spectrum wavelength selection method based on integrated L1 regularization |
CN105740653A (en) * | 2016-01-27 | 2016-07-06 | 北京工业大学 | Redundancy removal feature selection method LLRFC score+ based on LLRFC and correlation analysis |
CN107203787A (en) * | 2017-06-14 | 2017-09-26 | 江西师范大学 | Unsupervised regularization matrix decomposition feature selection method |
CN108960341A (en) * | 2018-07-23 | 2018-12-07 | 安徽师范大学 | A kind of structured features selection method towards brain network |
CN109993214A (en) * | 2019-03-08 | 2019-07-09 | 华南理工大学 | Multiple view clustering method based on Laplace regularization and order constraint |
CN112232413A (en) * | 2020-10-16 | 2021-01-15 | 东北大学 | High-dimensional data feature selection method based on graph neural network and spectral clustering |
CN112327701A (en) * | 2020-11-09 | 2021-02-05 | 浙江大学 | Slow characteristic network monitoring method for nonlinear dynamic industrial process |
CN112364902A (en) * | 2020-10-30 | 2021-02-12 | 太原理工大学 | Feature selection learning method based on self-adaptive similarity |
CN112417028A (en) * | 2020-11-26 | 2021-02-26 | 国电南瑞科技股份有限公司 | Wind speed time sequence characteristic mining method and short-term wind power prediction method |
- 2021-05-14: application CN202110525604.8A filed; patent CN113177604B (en) granted, status Active
Non-Patent Citations (6)
Title |
---|
Unsupervised Feature Selection for Multi-Cluster Data; Deng Cai et al.; KDD '10: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2010; 333-342 * |
Efficient and Robust Feature Selection via Joint l2,1-Norms Minimization; Feiping Nie et al.; Advances in Neural Information Processing Systems 23 (NIPS 2010); 2010; 1-9 * |
ILRC: a hybrid biomarker discovery algorithm based on improved L1 regularization and clustering in microarray data; Kun Yu et al.; BMC Bioinformatics; 22 Oct 2021; Vol. 22; 1-19 * |
Unsupervised feature selection based on sparse clustering; Dong Limei et al.; Journal of Nanjing University (Natural Science); 31 Jan 2018; Vol. 54, No. 1; 107-115 * |
Improved unsupervised simultaneous orthogonal basis clustering feature selection; Qian Youcheng; Journal of Jilin Institute of Chemical Technology; 31 Jul 2019; Vol. 36, No. 7; 80-85 * |
Research on efficient feature selection and classification methods for gene expression microarray data; Li Zifa; China Master's Theses Full-text Database (Information Science and Technology); 15 Jan 2019; No. 01; I140-2420 * |
Also Published As
Publication number | Publication date |
---|---|
CN113177604A (en) | 2021-07-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977994B (en) | Representative image selection method based on multi-example active learning | |
CN110659207B (en) | Heterogeneous cross-project software defect prediction method based on nuclear spectrum mapping migration integration | |
CN108108762B (en) | Nuclear extreme learning machine for coronary heart disease data and random forest classification method | |
CN112085059B (en) | Breast cancer image feature selection method based on improved sine and cosine optimization algorithm | |
CN113298230B (en) | Prediction method based on unbalanced data set generated against network | |
CN113408605A (en) | Hyperspectral image semi-supervised classification method based on small sample learning | |
CN106203534A (en) | A kind of cost-sensitive Software Defects Predict Methods based on Boosting | |
CN114091603A (en) | Spatial transcriptome cell clustering and analyzing method | |
CN112926640A (en) | Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium | |
Morovvat et al. | An ensemble of filters and wrappers for microarray data classification | |
Park et al. | Evolutionary fuzzy clustering algorithm with knowledge-based evaluation and applications for gene expression profiling | |
CN111444989A (en) | Network intrusion detection method | |
CN113177604B (en) | High-dimensional data feature selection method based on improved L1 regularization and clustering | |
CN111832645A (en) | Classification data feature selection method based on discrete crow difference collaborative search algorithm | |
CN108304546B (en) | Medical image retrieval method based on content similarity and Softmax classifier | |
CN110837853A (en) | Rapid classification model construction method | |
CN112801163B (en) | Multi-target feature selection method of mouse model hippocampal biomarker based on dynamic graph structure | |
CN115758462A (en) | Method, device, processor and computer readable storage medium for realizing sensitive data identification in trusted environment | |
CN114334168A (en) | Feature selection algorithm of particle swarm hybrid optimization combined with collaborative learning strategy | |
CN108182347B (en) | Large-scale cross-platform gene expression data classification method | |
CN118053501A (en) | Biomarker identification method based on genetic algorithm | |
CN114512188B (en) | DNA binding protein recognition method based on improved protein sequence position specificity matrix | |
Uddin et al. | Practical analysis of macromolecule identity from cryo-electron tomography images using deep learning | |
Ranjan et al. | A Modified Binary Arithmetic Optimization Algorithm for Feature Selection | |
Walker | Iterative Random Forest Based High Performance Computing Methods Applied to Biological Systems and Human Health |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||