CN104268564B

CN104268564B - A sparse gene expression data analysis method based on truncated power

Info

Publication number: CN104268564B
Application number: CN201410472872.8A
Authority: CN
Inventors: 李静; 沈宁敏; 周培云
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2014-09-16
Filing date: 2014-09-16
Publication date: 2017-11-10
Anticipated expiration: 2034-09-16
Also published as: CN104268564A

Abstract

The invention discloses a sparse gene expression data analysis method based on truncated power, comprising four steps. Step 1: preprocess the gene data set, including regularization, using principal component analysis to determine the number of principal components, and using local iterative search to determine the cardinality of the principal components. Step 2: perform feature extraction on the gene data processed in step 1, reducing noise in the data and improving the accuracy of the subsequent clustering. Step 3: cluster the gene data whose features have been extracted. Step 4: compare the clustering result obtained in step 3 with the set clustering accuracy, and adjust the tuning parameters of the sparse dimensionality reduction by feedback to achieve the best clustering accuracy. The invention addresses the sparse eigenvalue decomposition problem. Applied to sparse principal component analysis, it yields highly interpretable principal components and runs fast, validates the sparse principal component method well, and improves the efficiency and accuracy of gene data analysis.

Description

A truncated power-based method for sparse gene expression data analysis

Technical field

The invention discloses a sparse gene expression data analysis method based on truncated power, and relates to the technical field of gene expression data analysis.

Background

With the rapid development of biomedicine, the wide use of DNA microarrays (DNA chips) makes it possible to measure gene expression levels quickly. Because the analysis of gene data can be used to identify cancer cells and to predict the probability that a disease will occur, it is of great significance to human life. Gene clustering has therefore become an active research topic.

Raw gene data have many attributes and few samples. Clustering them directly is often disturbed by a large amount of redundant data, and high-dimensional data is also a challenge for traditional clustering methods. To overcome these shortcomings, various dimensionality reduction and principal feature extraction methods have been proposed. Independent Component Analysis (ICA) decomposes a multidimensional data set into mutually independent components (ICs), eliminating high-order dependencies. Principal Component Analysis (PCA) is a classic dimensionality reduction method that reduces high-dimensional data and extracts its main features; its objective is to maximize variance, that is, to capture the largest joint variation among the attributes. However, because each principal component is a linear combination of all the original attributes, the components it produces are not interpretable: it is not known which specific genes determine a given symptom in the gene data. Therefore, by sparsifying the loadings on top of principal component analysis, both the expressive power of the principal component and the sparsity of the loadings can be taken into account during extraction, so that each principal component is determined by a small number of attributes. The number of non-zero loadings is then less than or equal to the number of genes, while the result is easier to interpret than that of ordinary principal component analysis.

Methods for solving sparse PCA fall into several classes, such as thresholding, regression, energy, and programming methods. Of these, the energy methods are very stable in the interpretability of the principal components, the running time of the algorithm, and the accuracy of clustering. The truncated power iteration method is a typical algorithm of this class: it solves the sparse eigenvalue decomposition problem well, and when used for sparse principal component analysis it yields highly interpretable principal components and runs fast, which makes it a good feature extraction method.

Combining sparse principal component analysis with a clustering algorithm gives a more efficient and accurate way to analyze gene expression data. Clustering has become one of the main methods of gene expression data analysis, and the probability of disease occurrence can be judged quickly and accurately from the cluster category. However, because gene data have many attributes and few samples, high-dimensional data contain a large amount of redundant data and interfering information, and clustering them directly gives low accuracy. Principal component analysis is a classic dimensionality reduction method that maps high-dimensional data to a low-dimensional space, but its results lack strong explanatory power.

Summary of the invention

The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a sparse gene expression data analysis method based on truncated power. Using sparse principal component analysis with the truncated power method, the data are preprocessed and their main expression data extracted, minimizing the number of non-zero loadings while ensuring that the gene principal components retain strong expressive power. In experiments on typical gene data sets, the gene data after feature extraction are clustered with the K-means method.

To solve the above technical problem, the invention adopts the following technical solution:

A sparse gene expression data analysis method based on truncated power, comprising the following steps:

Step 1. Preprocess the gene data set, including regularization, using principal component analysis to determine the number of principal components, and using local iterative search to determine the cardinality of the principal components;

Step 2. Using the sparse tuning parameters determined in step 1, perform truncated power sparse dimensionality reduction and feature extraction on the gene data, reducing noise in the data and improving the accuracy of the subsequent clustering;

Step 3. Cluster the gene data whose features have been extracted;

Step 4. Compare the clustering result obtained in step 3 with the set clustering accuracy, and adjust the tuning parameters of the sparse dimensionality reduction in step 1 by feedback to achieve the best clustering accuracy.

As a further preferred solution of the invention, the preprocessing in step 1 is specifically:

Given a gene data set A with n samples and p genes, where n << p, the covariance matrix $\Sigma$ is obtained after regularizing A, and the principal component model is expressed as follows:

$$x' = \arg\max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad x^{T}x = 1$$

where x is the independent variable, the vector of coefficients that maps the high-dimensional data to low-dimensional data, which is updated continually during optimization; x' is the target coefficient vector, that is, the optimal loading of the principal component after optimization; and T denotes transposition.

As a further preferred solution of the invention, the power iteration method is used to solve for the matrix eigenvalues in the principal component model. The iteration proceeds as:

$$v_{1} = S v_{0}$$

$$v_{2} = S v_{1} = S^{2} v_{0}$$

$$\vdots$$

$$v_{t} = S v_{t-1} = \cdots = S^{t} v_{0}$$

where S is the matrix to be solved; $v_i$ is the vector updated in each iteration, with initial vector $v_0$; i is the iteration count, starting from 0 and taking the value t when the iteration converges; and $\lambda$ is the greatest common factor of all entries of the vector $v_t$;

Let $v^{*}$ be the eigenvector to be solved; $v^{*}$ is obtained from $v_i$ by extracting the common factor $\lambda$.
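The power iteration described above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `mat_vec` and `power_iteration`, the tolerance, and the 2x2 test matrix are all hypothetical, and the sketch rescales the vector to unit length at every step (a common numerically safer variant) instead of extracting a common factor only at the end.

```python
import math

def mat_vec(S, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(s * c for s, c in zip(row, v)) for row in S]

def power_iteration(S, v0, tol=1e-10, max_iter=1000):
    """Estimate the dominant eigenvalue and unit eigenvector of S."""
    norm0 = math.sqrt(sum(c * c for c in v0))
    v = [c / norm0 for c in v0]
    for _ in range(max_iter):
        w = mat_vec(S, v)                        # v_{i+1} = S v_i
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]                # assumption: rescale every step
        converged = max(abs(a - b) for a, b in zip(w, v)) < tol
        v = w
        if converged:
            break
    # Rayleigh quotient v^T S v gives the eigenvalue estimate for unit v.
    lam = sum(a * b for a, b in zip(v, mat_vec(S, v)))
    return lam, v

# Made-up symmetric matrix with eigenvalues 3 and 1.
S = [[2.0, 1.0], [1.0, 2.0]]
lam, v = power_iteration(S, [1.0, 0.0])
```

For this toy matrix the estimate approaches the dominant eigenvalue 3, with both entries of the eigenvector equal. In practice a numerical library would replace the list-based arithmetic.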

As a further preferred solution of the invention, the sparse dimensionality reduction in step 1 must satisfy $\|x\|_{0} \le k$, where k is the cardinality of the principal component.

As a further preferred solution of the invention, truncation is used to control sparsity and is combined with the power iteration method to solve for the sparse principal components. The specific process includes:

(501) Define the truncation operator:

$$[\mathrm{Truncate}(x, F)]_{j} = \begin{cases} [x]_{j}, & j \in F \\ 0, & \text{otherwise} \end{cases}$$

where F is a set of k indices;

(502) Solve for the sparse principal component according to the following formula:

$$\lambda_{\max}(\Sigma, k) = \max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad \|x\|_{2} = 1,\ \|x\|_{0} \le k$$

The solution process is:

Step1: Initialize $x_0$ and the iteration counter t = 1, and set the cardinality $k_i$;

Step2: Compute $x_t = \Sigma x_{t-1} / \|\Sigma x_{t-1}\|$, and assign to $F_t$ the indices of the k entries of $x_t$ with the largest absolute values;

Step3: Compute $x_t' = \mathrm{Truncate}(x_t, F_t)$, normalize $x_t = x_t' / \|x_t'\|$, and set t ← t+1;

Step4: Stop when the result of Step3 converges; otherwise repeat Step2 and Step3.
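Step1 to Step4 can be sketched as follows in pure Python. This is an illustrative sketch under assumptions rather than the reference implementation: the names `truncated_power`, `truncate`, and `normalize`, the convergence tolerance, the tie-breaking of equal magnitudes (first index wins), and the 4x4 matrix `Sigma` are all hypothetical.

```python
import math

def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def normalize(v):
    """Scale v to unit Euclidean length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def truncate(x, F):
    """Truncation operator: keep entries indexed by F, zero the rest."""
    return [c if j in F else 0.0 for j, c in enumerate(x)]

def truncated_power(Sigma, x0, k, tol=1e-10, max_iter=500):
    """Sparse leading eigenvector of Sigma with at most k non-zero entries."""
    x = normalize(x0)                                      # Step1
    for _ in range(max_iter):
        y = normalize(mat_vec(Sigma, x))                   # Step2: power step
        order = sorted(range(len(y)), key=lambda j: abs(y[j]), reverse=True)
        F = set(order[:k])                                 # k largest magnitudes
        x_new = normalize(truncate(y, F))                  # Step3
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:  # Step4
            return x_new
        x = x_new
    return x

# Made-up covariance: variables 0 and 1 are strongly correlated, 2 and 3 weakly.
Sigma = [
    [4.0, 3.0, 0.1, 0.1],
    [3.0, 4.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
]
x_sparse = truncated_power(Sigma, [1.0, 1.0, 1.0, 1.0], k=2)
```

With k = 2 the sketch keeps only the two dominant variables, so the loading vector has exactly two non-zero entries of equal size. Raising k trades sparsity for explained variance, which is the quantity the feedback step tunes.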

As a further preferred solution of the invention, the K-means clustering algorithm is used for clustering in step 3.

Compared with the prior art, the above technical solution has the following technical effects: the invention validates the sparse principal component method well and improves the efficiency and accuracy of gene data analysis.

Description of drawings

Fig. 1 is a schematic diagram of the gene data processing flow of the invention.

Fig. 2 shows the relationship between the number of principal components of the gene data and their explanatory power.

Fig. 3 shows the relationship between cardinality and explanatory power for the leukemia data in one embodiment of the invention.

Fig. 4 shows the relationship between cardinality and explanatory power for the lymphoma data in one embodiment of the invention.

Fig. 5 is a three-dimensional visualization of the leukemia data in one embodiment of the invention.

Fig. 6 is a three-dimensional visualization of the lymphoma data in one embodiment of the invention.

Detailed description

The technical solution of the invention is described in further detail below with reference to the accompanying drawings:

The processing flow of the invention is shown in Fig. 1. The whole process comprises data preprocessing, feature extraction, and clustering. Because the sparsity of the sparse method must be specified manually, the flow contains a feedback loop to better balance clustering accuracy against sparsity. The specific steps are as follows:

The preprocessing is as follows: given a gene data set A with n samples and p genes, where n << p, the data set is regularized and its covariance matrix $\Sigma$ is computed. The principal component model can then be expressed in the following form:

$$x' = \arg\max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad x^{T}x = 1$$

where x is the independent variable, the vector of coefficients that maps the high-dimensional data to low-dimensional data, which is updated continually during optimization; x' is the target coefficient vector, that is, the optimal loading of the principal component after optimization; and T denotes transposition.

There are two ways to solve for the principal components. The first is a singular value decomposition of the data set, $A = UDV^{T}$, where D is the matrix of singular values of the data matrix, whose magnitudes, like the eigenvalues of a matrix, determine the order in which the principal components are extracted; U holds the left singular vectors of the data matrix; and V holds the right singular vectors, that is, the corresponding coefficient (loading) matrix. The new principal components are Z = UD. The second is eigenvalue decomposition: the covariance matrix of the attribute variables is computed and eigendecomposed, and the eigenvectors are selected as loadings according to the magnitudes of their eigenvalues.
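A short derivation shows why the two routes give the same loadings. It assumes that A is the centered n x p data matrix, that the decomposition is written $A = UDV^{T}$ with orthonormal columns in U and V, and that the covariance uses the factor $1/(n-1)$; none of these conventions is stated in the source.

$$A^{T}A = V D U^{T} U D V^{T} = V D^{2} V^{T}, \qquad \Sigma = \frac{1}{n-1} A^{T}A = V \, \frac{D^{2}}{n-1} \, V^{T}, \qquad Z = AV = U D V^{T} V = UD$$

Under these assumptions the columns of V are the eigenvectors of $\Sigma$, the eigenvalues are the squared singular values divided by $n-1$, and projecting A onto V reproduces Z = UD.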

Compared with the traditional eigenvalue decomposition, the power iteration method is another efficient way to solve for the eigenvalues of a matrix. Given a matrix S, the update $v_{i+1} = S v_i$ is iterated repeatedly; it can be shown that when the iteration converges, $v^{*}$ is an eigenvector of S, where $v^{*}$ is obtained from $v_i$ by extracting the common factor $\lambda$, and $v_i$ is the vector updated in each iteration, with initial vector $v_0$. The iteration proceeds as follows:

$$v_{1} = S v_{0}$$

$$v_{2} = S v_{1} = S^{2} v_{0}$$

$$\vdots$$

$$v_{t} = S v_{t-1} = \cdots = S^{t} v_{0}$$

where i is the iteration count, starting from 0 and taking the value t at convergence, and $\lambda$ is the greatest common factor of all entries of the vector $v_t$.

A shortcoming of PCA (principal component analysis) is that each extracted principal component is a linear combination of all the original attributes, so its results are not interpretable. The loadings are therefore sparsified on top of the original model so that $\|x\|_{0} \le k$, where k is the cardinality of the principal component.

The invention uses truncation to control sparsity and combines it with the power iteration method to solve for the sparse principal components efficiently. A truncation operator is first defined as follows:

$$[\mathrm{Truncate}(x, F)]_{j} = \begin{cases} [x]_{j}, & j \in F \\ 0, & \text{otherwise} \end{cases}$$

where F is a set of k indices. The model for solving the sparse principal component by power iteration with truncation is:

$$\lambda_{\max}(\Sigma, k) = \max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad \|x\|_{2} = 1,\ \|x\|_{0} \le k$$

where x is the loading vector of the corresponding principal component. It is solved with the iterative procedure of Step1 to Step4 given above.

The feature extraction and clustering are as follows. Gene expression data analysis can determine gene categories by clustering. Traditional clustering methods are neither very accurate nor very efficient on high-dimensional sample sets such as gene data. At the same time, although gene expression data are high-dimensional, their main information can be represented with only 1 to 3 principal components, so extracting features from the data reduces noise and improves the accuracy of the subsequent clustering. Once the $x_i$ have been obtained with the TPower method, the gene feature data are $z^{*} = AX$, where $X = [x_1, \ldots, x_m]$ and m is the number of extracted principal components. After the features have been extracted, the invention applies the classic K-means clustering algorithm. A drawback of this algorithm is that the number of clusters must be specified; however, the purpose of clustering here is to verify whether the data extracted by the truncated power method cluster better, so the subsequent experiments use classic gene data sets whose categories are known in advance.
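The projection $z^{*} = AX$ followed by K-means can be sketched in pure Python as below. This is a toy illustration under assumptions, not the experimental setup of the embodiment: the function names `project` and `kmeans`, the choice of initial centers, and the small matrices `A` and `X` are all made up, and the clustering is plain Lloyd iteration.

```python
def project(A, X):
    """Z = A X: rows of A are samples, columns of X are sparse loading vectors."""
    m = len(X[0])
    return [[sum(a * X[j][c] for j, a in enumerate(row)) for c in range(m)]
            for row in A]

def kmeans(points, init_centers, max_iter=100):
    """Plain Lloyd iterations from the given initial centers; returns labels."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
            for pt in points
        ]
        if new_labels == labels:       # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:                # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up data: 4 samples x 4 genes; the last gene is noise.
A = [
    [1.0, 1.0, 0.0, 5.0],
    [1.2, 0.9, 0.1, -3.0],
    [5.0, 5.0, 2.0, 4.0],
    [5.1, 4.8, 2.2, -2.0],
]
# Two hypothetical sparse loading vectors as columns; both ignore the noisy gene.
X = [
    [0.6, 0.0],
    [0.8, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]
Z = project(A, X)
labels = kmeans(Z, [Z[0], Z[2]])
```

Because the sparse loadings assign zero weight to the noisy fourth gene, the projected samples separate into two groups that K-means recovers. A library implementation of K-means would normally be used for real data.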

In one specific embodiment of the invention, two gene data sets, leukemia and lymphoma, are used as examples. Both diseases seriously affect people's lives. Owing to the rapid development of medical technology, collecting gene data is no longer difficult, and the classification and clustering of gene data are widely significant, so the number of samples is not fixed. To verify the efficiency of sparse principal component analysis, the two gene data sets used in the experiments are typical ones: leukemia and lymphoma.

Leukemia arises from the abnormal proliferation of, or damage to, hematopoietic stem cells, which affects the function of other tissues and organs. Leukemia is usually divided into acute lymphoblastic leukemia (ALL) and acute myelogenous leukemia (AML). By type of diseased cell, lymphocytes are further divided into T cells and B cells. The leukemia data set can therefore be divided broadly into two classes, ALL and AML, or more finely into three classes, ALL_T, ALL_B, and AML. The leukemia data set used in the experiments contains 38 samples and 5000 genes: 27 cases of ALL (19 ALL_B and 8 ALL_T) and 11 cases of AML.

Lymphoma is a malignant tumor of the lymphatic and hematopoietic system; once the disease is diagnosed, the lymphoma will be distributed throughout the body. The incidence of non-Hodgkin lymphoma (NHL) is far higher than that of Hodgkin lymphoma (HL). Diffuse large B-cell lymphoma (DLBCL) and follicular lymphoma (FL) are common NHLs with a relatively high incidence, while chronic lymphocytic leukemia (CLL) is a malignant tumor arising from hematopoietic tissue; although it develops slowly, it is difficult to cure if not treated in time. The lymphoma data set used in the experiments has 62 samples and 4026 genes in total: 42 cases of DLBCL, 9 of FL, and 11 of CLL.

The experiments on gene expression data start with principal component analysis, which roughly determines the number of principal components from their explanatory power. The truncated power method is then applied to the regularized data for truncated sparsification, and the expressive power of the extracted genes is analyzed. Finally, features are extracted by multiplying by the loadings, which transforms the high-dimensional gene data into its principal feature components; these are clustered with a clustering algorithm, and the clustering accuracy is analyzed in order to adjust the degree of sparsity. The results are compared with those of principal component analysis to verify that sparsifying the principal component coefficients allows better analysis of gene expression data.

When principal component analysis is used to determine the number of principal components, the relationship between the number of principal components of the gene data and their explanatory power is as shown in Fig. 2. As the number of principal components increases, the explanatory power of each additional component decreases; beyond 25 components the PEV value is almost 0. Therefore, when sparse principal component analysis is subsequently applied to extract feature data, the numbers of principal components for the leukemia and lymphoma data are set to 10 and 15, with total explanatory power of 81.7% and 66.8%, respectively.
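The choice of the number of components from explanatory power can be sketched as follows. The sketch assumes that PEV denotes the proportion of variance explained by a component, computed as its eigenvalue divided by the sum of all eigenvalues; the source does not define the acronym. The function names `pev` and `components_needed` and the eigenvalues are hypothetical.

```python
def pev(eigenvalues):
    """Proportion of explained variance of each component (assumed definition)."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

def components_needed(eigenvalues, target):
    """Smallest number of leading components whose cumulative PEV reaches target."""
    cumulative = 0.0
    shares = pev(sorted(eigenvalues, reverse=True))
    for count, share in enumerate(shares, start=1):
        cumulative += share
        if cumulative >= target:
            return count
    return len(eigenvalues)

# Made-up eigenvalues of a covariance matrix, not taken from the experiments.
eigs = [5.0, 3.0, 1.0, 0.5, 0.5]
shares = pev(eigs)
n_for_75 = components_needed(eigs, 0.75)
n_for_85 = components_needed(eigs, 0.85)
```

For these made-up eigenvalues the first two components already explain more than 75% of the variance, and a third is needed to pass 85%. The same cumulative curve, plotted against the component count, corresponds to the kind of relationship shown in Fig. 2.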

Once the number of principal components to extract is fixed, the cardinality in the truncated power method is set to a sequence of values in order to tune the number of non-zero loadings, and the sparsity of the loadings is determined from the differences in PEV. As shown in Fig. 3 and Fig. 4, for the first three principal components of both gene data sets the PEV basically rises as the number of non-zero loadings increases, and once a certain cardinality is reached the PEV stays essentially unchanged. This provides a good basis for choosing the number of non-zero entries in the principal component coefficients.

Once the number of principal components and the cardinality of the loadings are both fixed, the K-means method is applied to the principal feature data of the genes, and the clustering accuracy is used to evaluate the effectiveness of feature extraction by sparse principal component analysis. To show more clearly that the extracted principal components support better clustering, Fig. 5 and Fig. 6 plot the gene data in the three-dimensional space of three sparse principal components according to the true classes of the gene samples. The plots show that after sparsification the gene classes can be clearly distinguished.

The clustering results of the above embodiment are given in the following table:

Clustering experiment results

The embodiments of the invention have been described in detail above with reference to the accompanying drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the gist of the invention.

Claims

1. A sparse gene expression data analysis method based on truncated power, characterized in that the steps comprise:

Step 1: preprocessing the gene data set, including regularization, determining the number of principal components using principal component analysis, and determining the cardinality of the principal components using local iterative search;

Step 2: using the sparse tuning parameters determined in step 1, performing truncated power sparse dimensionality reduction and feature extraction on the gene data, reducing noise in the data and improving the accuracy of the subsequent clustering;

Step 3: clustering the gene data whose features have been extracted, using the K-means clustering algorithm;

Step 4: comparing the clustering result obtained in step 3 with the set clustering accuracy, and adjusting the tuning parameters of the sparse dimensionality reduction by feedback to achieve the best clustering accuracy;

wherein, in step 1, the preprocessing is specifically:

given a gene data set A with n samples and p genes, where n << p, the covariance matrix $\Sigma$ is obtained after regularizing A, and the principal component model is expressed as follows:

$$x' = \arg\max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad x^{T}x = 1$$

where x is the independent variable, the vector of coefficients that maps the high-dimensional data to low-dimensional data, which is updated continually during optimization; x' is the target coefficient vector, that is, the optimal loading of the principal component after optimization; and T denotes transposition;

the matrix eigenvalues in the principal component model are solved using the power iteration method, whose iteration proceeds as:

$$v_{1} = S v_{0}$$

$$v_{2} = S v_{1} = S^{2} v_{0}$$

$$\vdots$$

$$v_{t} = S v_{t-1} = \cdots = S^{t} v_{0}$$

where S is the matrix to be solved; $v_i$ is the vector updated in each iteration, with initial vector $v_0$; i is the iteration count, starting from 0 and taking the value t when the iteration converges; and $\lambda$ is the greatest common factor of all entries of the vector $v_t$;

let $v^{*}$ be the eigenvector to be solved; $v^{*}$ is obtained from $v_i$ by extracting the common factor $\lambda$.

2. The sparse gene expression data analysis method based on truncated power according to claim 1, characterized in that, in step 1, the sparse dimensionality reduction must satisfy $\|x\|_{0} \le k$, where k is the cardinality of the principal component.

3. The sparse gene expression data analysis method based on truncated power according to claim 2, characterized in that truncation is used to control sparsity and is combined with the power iteration method to solve for the sparse principal components, the specific process comprising:

(501) defining the truncation operator:

$$[\mathrm{Truncate}(x, F)]_{j} = \begin{cases} [x]_{j}, & j \in F \\ 0, & \text{otherwise} \end{cases}$$

where F is a set of k indices;

(502) solving for the sparse principal component according to the following formula:

$$\lambda_{\max}(\Sigma, k) = \max_{x} \; x^{T}\Sigma x \quad \text{subject to} \quad \|x\|_{2} = 1,\ \|x\|_{0} \le k$$

the solution process being:

Step1: initialize $x_0$ and the iteration counter t = 1, and set the cardinality $k_i$;

Step2: compute $x_t = \Sigma x_{t-1} / \|\Sigma x_{t-1}\|$, and assign to $F_t$ the indices of the k entries of $x_t$ with the largest absolute values;

Step3: compute $x_t' = \mathrm{Truncate}(x_t, F_t)$, normalize $x_t = x_t' / \|x_t'\|$, and set t ← t+1;

Step4: stop when the result of Step3 converges; otherwise repeat Step2 and Step3.