CN116246712B - Data subtype classification method with sparse constraint multi-mode matrix joint decomposition - Google Patents

Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Info

Publication number
CN116246712B
Authority
CN
China
Prior art keywords
jumping
value
data
initial value
setting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310104611.XA
Other languages
Chinese (zh)
Other versions
CN116246712A (en)
Inventor
何昆
尹晓尧
伯晓晨
王娜
陈河兵
董方霆
李卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA
Priority to CN202310104611.XA
Publication of CN116246712A
Application granted
Publication of CN116246712B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 - Unsupervised data analysis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 - Mutagenesis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 - Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Public Health (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, which can mine shared-specific data structures among any subset of modalities. Aiming at real data in which data structures may be shared among arbitrary subsets of modalities, the invention introduces a group sparse constraint, a special sparse constraint method typically implemented with the l_{1,2} norm or the l_{1,∞} norm, which constrains samples of the same group to rely on the same features while samples of different groups rely on more different features. By treating each data modality as a group, applying the group sparse constraint, and combining it with the shared/modality-specific concept in joint matrix decomposition, shared-specific data structures among any modalities can be mined; on this basis a more reasonable clustering result is obtained, and the effect is verified on simulated data and real breast cancer data.

Description

Data subtype classification method with sparse constraint multi-mode matrix joint decomposition
Technical Field
The invention relates to the technical field of tumor multi-omics data subtype classification, in particular to a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints.
Background
Complex diseases such as tumors exhibit diversity and heterogeneity at the genome, transcriptome, proteome, and epigenome levels. Recent technological advances have enabled the acquisition of multi-omics data that can be used to explore the pathological complexity of disease. Notably, The Cancer Genome Atlas (TCGA) has collected genomic and transcriptomic information for more than 20 cancers from thousands of patients, including about 2000 breast cancer samples. Based on TCGA data, integrated cluster analysis of gene expression profiles and DNA methylation data can identify new subgroups beyond the classical biomarker expression subtypes.
Therefore, there is a need to further develop computational methods for the integrated analysis of multi-omics data of cancer patients, particularly when the data exhibit heterogeneity among omics layers (each layer can be regarded as a modality). Most existing integrated analysis methods must address several issues inherent to biological data: small sample sizes with very high-dimensional features (the so-called curse of dimensionality), inconsistent data ranges, and omics-specific and shared structural patterns among multi-omics data that are easily ignored. Existing mathematical methods for multi-modal data integration can be divided into three major categories: early integration, late integration, and intermediate integration. Early integration is the simplest: different omics data are concatenated into a single matrix and a single-omics clustering technique is applied. However, this increases the data dimension and exacerbates the curse of dimensionality. Alternatively, a set of important features may be pre-selected from each data modality, and the modalities are then integrated using consensus clustering, non-negative matrix factorization (NMF), or independent component analysis (ICA); but the feature pre-selection process is extremely time-consuming and may discard important information. Late integration obtains its final result by clustering each omics dataset separately and then integrating the clustering results; such a brute-force integration scheme may produce confusing results when the clustering results of the different omics datasets are inconsistent.
Intermediate integration, the third type of multi-modal integration analysis, sits between early and late integration and can be further divided into sequential analysis and joint analysis. In sequential analysis, the model first analyzes one data modality and then adjusts the optimization results through subsequent analysis of the other modalities. Sequential analysis methods, such as multiple co-perturbation methods and trans- and cis-related gene analysis, assume a causal relationship between one omics dataset (e.g., the transcriptome) and another. However, such methods are sensitive to the order of analysis: the integrated analysis of the different modalities must be performed in a fixed order, changing the order may lead to different results, and these methods do not transfer to other types of datasets. Joint analysis has at least one of the following characteristics: 1) computing sample similarity; 2) combining the different omics datasets through dimension reduction; 3) statistically modeling the multi-modal data. Based on these properties, joint analysis methods can be further classified into similarity-based, dimension-reduction, statistical, and deep-learning-based methods. Statistical models such as iCluster and its variants, including iCluster+ and iClusterBayes, assume that the multi-omics data share latent Gaussian variables; however, the iterative expectation-maximization algorithm they use is computationally complex and does not necessarily converge to a deterministic or optimal solution. Similarity-based methods, including SNF (Similarity Network Fusion), CIMLR (Cancer Integration via Multikernel Learning), rMKL-LPP, mixKernel, and extended spectral clustering methods, have attracted considerable attention; they pre-construct a sample similarity matrix by multi-kernel learning and then group the samples using spectral clustering. Constructing the similarity matrix alleviates the dimensionality problem to some extent, essentially building a new similarity matrix from multiple similarity matrices; however, some features may disappear during its construction, and extracting cluster-related features becomes complicated. Dimension-reduction methods project the multi-omics data into shared and omics-specific sub-matrices in a low-dimensional space, using covariance between datasets or joint matrix/tensor factorization, and impose additional sparse constraints on the sub-matrices. However, these methods are generally not effective at extracting structures shared among arbitrary subsets of omics or specific to individual omics, because they only capture the information shared across all modalities. Deep learning models fuse the multiple data modalities before or after learning low-dimensional embeddings with an autoencoder, denoising autoencoder, variational autoencoder, or stacked variational autoencoder. Typical examples are intNMF (integrative NMF) and PintMF (Penalized Integrative Matrix Factorization); such methods do not perform well, owing to overfitting when embedding very high-dimensional raw data into a very low-dimensional space.
In summary, the existing methods have three problems:
1. These methods only consider data structures completely shared among all modalities and ignore sharing among arbitrary subsets of modalities, i.e., a clustering relation may be shared by only two or three modalities rather than all of them; consequently, the obtained shared matrix either cannot represent such structures at all or mixes them with the information shared by all modalities;
2. These methods either provide no cluster number estimation for multi-modal data, or the proposed estimation method is impractical and cannot accurately recover the expected cluster number even on artificially synthesized simulated data;
3. These methods provide no technical support for extracting cluster-related features, or they erase the original features by constructing a similarity matrix, so features can only be estimated by mutual information or tested one by one with a bootstrap method; given the high dimensionality of biomedical data, mutual-information and bootstrap approaches require running the corresponding cluster analysis P times (P being the number of features), which is extremely time- and labor-consuming.
Finally, none of the existing methods accounts for the modality-specific parts of the data, and therefore either cannot represent them at all or mixes them with the information shared by all modalities.
Disclosure of Invention
In view of the above, the invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, which can mine shared-specific data structures among any modalities.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a data subtype classification method with sparse constraint multi-modal matrix joint decomposition comprises the following steps:
finding a shared latent representation E among the multi-modal data and performing cluster analysis on its basis, where X = {X_1, X_2, …, X_N} is a set of observations of n samples in N modalities and X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N; finding a set of modality-specific basis matrices H = {H_1, H_2, …, H_N}, where H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i, where k is the number of clusters and m_i is the feature dimension of the samples in the i-th modality;
wherein each data modality is treated as a group and, on this basis, a group sparse constraint is applied to the basis matrix set {H_i}; this is realized by defining an objective function and solving the objective function.
Wherein the objective function is:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
where ||·||_F denotes the Frobenius norm and the second term is the group sparse constraint over the rows of the basis matrices, one group per modality; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k;
In the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input.
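For concreteness, the reconstruction-plus-penalty structure of J can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation; in particular, writing the group sparse term as a sum of per-row l_∞ norms (one group per modality and row index) is an assumption consistent with the abstract's l_{1,∞} formulation and with the clipping-based update in steps S46-S419 below.

    import numpy as np

    def objective(Xs, E, Hs, lam):
        """Sketch of J: reconstruction error plus a group sparse penalty.

        Xs  : list of (n, m_i) observation matrices, one per modality.
        E   : (n, k) shared latent representation.
        Hs  : list of (k, m_i) modality-specific basis matrices.
        lam : coefficient of the group sparse constraint term.
        """
        recon = sum(np.linalg.norm(X - E @ H, "fro") ** 2 for X, H in zip(Xs, Hs))
        # Assumed l_{1,inf}-type group penalty: one group per (modality, row) pair.
        group = sum(np.abs(H[j, :]).max() for H in Hs for j in range(H.shape[0]))
        return recon + lam * group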
The specific solving process of the objective function is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3, fixingThe updating E is unchanged, and the specific steps are as shown in the steps S31-S36:
step S31, setting the initial value of i1 to 1, i1=1, 2..n; an all 0 matrix with an initial value of n×k for the intermediate quantity XH and an all 0 matrix with an initial value of k×k for the intermediate quantity HH;
step S32, willThe result of (2) is assigned to XH, and the superscript T represents the transpose;
step S33, willIs assigned to HH;
step S34, if i1 is less than N, the value of i1+1 is assigned to i1, and the step S32 is skipped; otherwise, jumping to the step S35;
step S35, solving the matrix inverse revh= (HH) -1
Step S36, assigning the result of XH x revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification results.
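Steps S46-S419 amount to a proximal step for the l_∞-type group penalty computed via sorting and thresholding. A minimal NumPy sketch of this inner routine follows; it assumes, as the step logic suggests, that the clipping level τ is the standard l1-projection threshold, and the name linf_prox_row is illustrative rather than from the patent. Clipping the signed entries of V is the usual closed form; the patent's unrendered formulas may include an additional scaling that is not reproduced here.

    import numpy as np

    def linf_prox_row(V):
        """Steps S46-S419 in sketch form: clip V at a data-dependent level tau.

        V1 is abs(V) sorted in descending order (step S47); count is the largest
        m with (sum(V1[:m]) - 1)/m < V1[m-1] (steps S48-S413); tau is then
        (sum(V1[:count]) - 1)/count (step S415); the entries are finally clipped
        at +/-tau (steps S416-S418) and returned as the new row (step S419).
        """
        V1 = np.sort(np.abs(V))[::-1]
        csum = np.cumsum(V1)
        count = 0
        for m in range(1, len(V1) + 1):           # steps S48-S413
            if (csum[m - 1] - 1.0) / m < V1[m - 1]:
                count = m
        if count == 0:                            # step S414
            return np.zeros_like(V)
        tau = (csum[count - 1] - 1.0) / count     # step S415
        return np.clip(V, -tau, tau)              # steps S416-S419

In the full update, V = (E_{·,j})^T R_{i2}^j / λ and the returned vector is written to the j-th row of H_{i2}, as step S419 prescribes.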
Wherein the modularity is extended to a weighted modularity:
Q = (1/(2D)) Σ_{v,w} [A_{v,w} − (k_v k_w)/(2D)] δ(c_v, c_w) = Σ_{r=1}^{c} (e_{rr} − a_r^2)
wherein v and w are any two nodes in the network; δ(c_v, c_w) determines whether nodes v and w are in the same community: δ(c_v, c_w) = 1 if they are, otherwise δ(c_v, c_w) = 0; k_v and k_w denote the weights of node v and node w respectively; c denotes the total number of communities; e_{r,s} denotes the edge weight between nodes in community r and nodes in community s, and e_{rr} denotes the ratio of all edges within community r to all edges of the entire network; a_r denotes the ratio of the degree of the nodes in community r to the degree of the entire network; A_{v,w} denotes the weight between nodes v and w; and D denotes the sum of the weights of all edges in the network;
Let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1: given {X_i}, compute the similarity network set {S_i ∈ R^{n×n}, i = 1, 2, …, N} using the similarity calculation method in the R package SNFtool;
SS2: obtain the clustering result C_k at the current cluster number k;
SS3: set the initial value of i to 1;
SS4: set E_i ∈ R^{k×k} to an all-zero matrix and the initial value of Q_i^k to 0; compute the sum of the weights of all edges in the network S_i, D_i = Σ_{t=1}^{n} Σ_{t1=1}^{n} S_i^{t,t1}, where S_i^{t,t1} is the element in row t and column t1 of the network S_i, t = 1, 2, …, n, t1 = 1, 2, …, n;
SS5: set the initial value of r to 1; r = 1, 2, …, k;
SS6: obtain the index set Index_r of the samples belonging to class r in C_k;
SS7: set the initial value of s to 1; s = 1, 2, …, k;
SS8: obtain the index set Index_s of the samples belonging to class s in C_k;
SS9: let E_i^{r,s} = (Σ_{t∈Index_r} Σ_{t1∈Index_s} S_i^{t,t1}) / D_i;
SS10: if s is less than k, assign s+1 to s and jump to SS8; otherwise jump to SS11;
SS11: assign the result of Q_i^k + E_i^{r,r} − (Σ_{s=1}^{k} E_i^{r,s})^2 to Q_i^k;
SS12: if r is less than k, assign r+1 to r and jump to SS6; otherwise jump to SS13;
SS13: if i is less than N, assign i+1 to i and jump to SS4; otherwise jump to SS14;
SS14: let M3_k = (1/N) Σ_{i=1}^{N} Q_i^k.
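As a sketch, the weighted modularity and the M3 score of steps SS1-SS14 can be written directly from the block-sum definitions above; the helper names and the use of NumPy boolean indexing are illustrative assumptions, not code from the patent.

    import numpy as np

    def weighted_modularity(S, labels):
        """Q = sum_r (e_rr - a_r^2) on a weighted similarity network.

        S      : (n, n) symmetric nonnegative similarity matrix (steps SS1, SS4).
        labels : length-n integer cluster assignment with values 0..k-1 (step SS2).
        """
        D = S.sum()                                   # total edge weight (step SS4)
        k = int(labels.max()) + 1
        e = np.zeros((k, k))
        for r in range(k):
            for s in range(k):                        # block sums (steps SS5-SS10)
                e[r, s] = S[np.ix_(labels == r, labels == s)].sum() / D
        a = e.sum(axis=1)                             # per-community degree fraction
        return float(np.trace(e) - np.sum(a ** 2))    # accumulated as in step SS11

    def m3(S_list, labels):
        """Mean of modality modularity (the M3 score of step SS14)."""
        return float(np.mean([weighted_modularity(S, labels) for S in S_list]))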
Wherein, for each feature, samples with standard score z-score ≥ 1 for that feature are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of that gene; likewise, samples with z-score ≤ −1 are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for under-expression of that gene; after FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ −1, genes with p-value less than 0.05 are selected as significantly enriched features.
Wherein the selected genes are ranked by the standard deviation of gene expression, and only the top-ranked genes are selected as the cluster-related features of the category.
Wherein, for methylation, a gene is considered highly methylated when the methylation detection value β ≥ 0.25 and unmethylated when β ≤ −0.25; for each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data; for binary data, genes enriched for mutations are selected using the same criteria as for the expression data.
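The enrichment test described above is an ordinary one-sided hypergeometric test. A minimal sketch with SciPy follows, where the 0/1 flag vector and the cluster mask are the only inputs; the function name is illustrative.

    import numpy as np
    from scipy.stats import hypergeom

    def enrichment_pvalue(flag, in_cluster):
        """P(overlap >= observed) under the hypergeometric null.

        flag       : length-n 0/1 vector, e.g. 1 where a feature's z-score >= 1.
        in_cluster : length-n boolean mask selecting the samples of one class.
        """
        flag = np.asarray(flag)
        in_cluster = np.asarray(in_cluster, dtype=bool)
        n_total = len(flag)                      # population size
        n_flag = int(flag.sum())                 # flagged samples overall
        n_draw = int(in_cluster.sum())           # class size
        n_hit = int(flag[in_cluster].sum())      # flagged samples inside the class
        # sf(k-1, ...) gives P(X >= k) for the hypergeometric distribution.
        return float(hypergeom.sf(n_hit - 1, n_total, n_flag, n_draw))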
Advantageous effects
1. Aiming at real data in which data structures may be shared among arbitrary subsets of modalities, the invention introduces a group sparse constraint, a special sparse constraint method typically implemented with the l_{1,2} norm or the l_{1,∞} norm, which constrains samples of the same group to rely on the same features while samples of different groups rely on more different features. By treating each data modality as a group, applying the group sparse constraint, and combining it with the shared/modality-specific concept in joint matrix decomposition, shared-specific data structures among any modalities can be mined; on this basis a more reasonable clustering result is obtained, and the effect is verified on simulated data and real breast cancer data.
2. For the cluster number estimation problem of multi-modal data, the invention introduces the concept of modularity from network science. Traditional modularity estimates the number of communities in a network well from the data; the invention extends it to the multi-modal setting and designs the mean of modality modularity (M3) score for estimating the optimal cluster number of multi-modal data.
3. Since current biomedical data are high-dimensional and existing feature extraction methods are extremely time- and labor-consuming, the invention estimates the distribution of each feature across categories (over-expression and under-expression) with the hypergeometric distribution, testing whether the over-expression and under-expression of each feature differ significantly between categories, thereby selecting the category-specific over-expressed and under-expressed features.
4. The mean of modality modularity (M3) score of the invention estimates the cluster number of multi-modal data well; its performance is verified on the simulated data.
5. The invention rapidly extracts clustering-related multi-modal features by jointly considering the hypergeometric distribution test and the standard deviation of the data distribution, giving the extracted key features a clear decision criterion; the feature extraction heat maps of both the simulated data and the real breast cancer data verify the specificity of the extracted features.
Drawings
FIG. 1 is a schematic diagram of the results of analysis of simulated data of experimental verification 1 of the present invention;
FIG. 2 is a schematic diagram of the results of the experiment verification 1 of the invention for testing the ability of other methods in cluster number evaluation;
FIG. 3 is a schematic diagram of the cluster-related features extracted by the method of the invention from the second and third modalities of the three sets of simulation data in experimental verification 1;
FIG. 4 is a schematic diagram of the results of the clustering number estimation and subtype classification analysis of BRCA breast cancer data of experimental verification 2 of the invention;
FIG. 5 is a schematic diagram of the subtype classification results and related feature visualization results of BRCA breast cancer data of experimental verification 2 of the present invention;
FIG. 6 is a schematic representation of survival curves of BRCA breast cancer data analyzed by the different methods of experimental verification 2 of the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, in particular a tumor multi-omics data subtype classification method, which aims to find a shared latent representation E among multi-modal data and perform cluster analysis on its basis. Let X = {X_1, X_2, …, X_N} be the observation data of n samples in N modalities, where X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N, and m_i is the feature dimension of the samples in the i-th modality. Meanwhile, to maintain semantic consistency, a set of modality-specific basis matrices H = {H_1, H_2, …, H_N} must also be found, where H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i. The method is realized by defining an objective function and solving it; the specific steps are as follows:
Define the objective function J:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2
where ||·||_F denotes the Frobenius norm. Treating each data modality as a group and applying the group sparse constraint to the basis matrix set {H_i}, the objective function is rewritten as:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
where the second term is the group sparse constraint; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k.
In the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input. The specific solving process is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3: fix {H_i} unchanged and update E; the specific steps are S31-S36:
Step S31: set the initial value of i1 to 1, i1 = 1, 2, …, N; initialize the intermediate quantity XH as an n×k all-zero matrix and the intermediate quantity HH as a k×k all-zero matrix;
Step S32: assign the result of XH + X_{i1}(H_{i1})^T to XH, where the superscript T denotes the transpose;
Step S33: assign the result of HH + H_{i1}(H_{i1})^T to HH;
Step S34: if i1 is less than N, assign i1+1 to i1 and jump to step S32; otherwise jump to step S35;
Step S35: compute the matrix inverse revH = (HH)^{-1};
Step S36: assign the result of XH × revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification results.
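Putting steps S1-S8 together, a compact NumPy sketch of the whole solver follows. It reuses the linf_prox_row routine sketched earlier for steps S46-S419; the division guard in the step-S6 convergence test and the Ward linkage for the final hierarchical clustering are illustrative choices, not prescribed by the patent.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def m3jf_solve(Xs, k, lam, theta=1e-6, max_iter=500):
        """Sketch of steps S1-S8: alternate E and H updates until delta < theta."""
        N, n = len(Xs), Xs[0].shape[0]
        E = np.zeros((n, k))                      # step S11
        Hs = []
        for X in Xs:                              # steps S12-S15
            u, d, vt = np.linalg.svd(X, full_matrices=False)
            Hs.append(vt[:k, :].copy())           # step S13: first k rows of v_i
            E += (u * d)[:, :k]                   # step S14
        E /= N                                    # step S16
        pre_loss = 0.0                            # step S2
        for _ in range(max_iter):
            # Step S3: fix the H_i, update E in closed form.
            XH = sum(X @ H.T for X, H in zip(Xs, Hs))
            HH = sum(H @ H.T for H in Hs)
            E = XH @ np.linalg.inv(HH)            # steps S35-S36
            # Step S4: fix E, update each row of each H_i.
            for i, X in enumerate(Xs):
                for j in range(k):
                    R = X - E @ Hs[i] + np.outer(E[:, j], Hs[i][j, :])  # steps S43-S45
                    V = (E[:, j] @ R) / lam              # step S46
                    Hs[i][j, :] = linf_prox_row(V)       # steps S47-S419, sketched earlier
            # Steps S5-S8: loss, relative change, termination test.
            this_loss = sum(np.linalg.norm(X - E @ H, "fro") ** 2
                            for X, H in zip(Xs, Hs))
            this_loss += lam * sum(np.abs(H[j, :]).max()
                                   for H in Hs for j in range(k))
            delta = abs(this_loss - pre_loss) / max(pre_loss, 1e-12)
            pre_loss = this_loss
            if delta < theta:
                break
        labels = fcluster(linkage(E, method="ward"), k, criterion="maxclust")
        return E, Hs, labels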
In 2006, newman published in PNAS article [ Newman M E J. Modularity and community structure in networks [ J ]. Proceedings of the national academy of sciences,2006, 103 (23): 8577-8582) defines the manner of calculation of the modularity as:
wherein,v and w are any two nodes in the network, if v and w are connected by an edge, A v,w =1, otherwise a v,w =0, d is the number of all edges in the network. k (k) v And k w Representing the weights of nodes v and w, respectively, delta (c) v ,c w ) Is used to determine whether nodes v and w are in the same community, delta (c) v ,c w ) =1, otherwise δ (c v ,c w )=0。e r,s Representing the edges of one node within community r and another node within community s. Then e rr A ratio of all edges within community r to all edges of the entire network is represented. And a is r It represents the ratio of the degree of nodes within the r-community to the degree of the entire network. c represents the total number of communities.
To account for the similarity between nodes (i.e., samples), the invention extends the modularity to a weighted modularity, which differs from the traditional modularity in the definitions of A_{v,w} and D: A_{v,w} denotes the weight between nodes v and w (i.e., the similarity of the two samples), and D denotes the sum of the weights of all edges in the network. Let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1: given {X_i}, compute the similarity network (matrix) set {S_i ∈ R^{n×n}, i = 1, 2, …, N} using the similarity calculation method in the R package SNFtool;
SS2: obtain the clustering result C_k when the cluster number is k;
SS3: set the initial value of i to 1;
SS4: set E_i ∈ R^{k×k} to an all-zero matrix and the initial value of Q_i^k to 0; compute the sum of the weights of all edges in the network S_i, D_i = Σ_{t=1}^{n} Σ_{t1=1}^{n} S_i^{t,t1}, where S_i^{t,t1} is the element in row t and column t1 of the network S_i, t = 1, 2, …, n, t1 = 1, 2, …, n;
SS5: set the initial value of r to 1; r = 1, 2, …, k;
SS6: obtain the index set Index_r of the samples belonging to class r in C_k;
SS7: set the initial value of s to 1; s = 1, 2, …, k;
SS8: obtain the index set Index_s of the samples belonging to class s in C_k;
SS9: let E_i^{r,s} = (Σ_{t∈Index_r} Σ_{t1∈Index_s} S_i^{t,t1}) / D_i;
SS10: if s is less than k, assign s+1 to s and jump to SS8; otherwise jump to SS11;
SS11: assign the result of Q_i^k + E_i^{r,r} − (Σ_{s=1}^{k} E_i^{r,s})^2 to Q_i^k;
SS12: if r is less than k, assign r+1 to r and jump to SS6; otherwise jump to SS13;
SS13: if i is less than N, assign i+1 to i and jump to SS4; otherwise jump to SS14;
SS14: let M3_k = (1/N) Σ_{i=1}^{N} Q_i^k.
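In practice the M3 score is scanned over candidate cluster numbers and the maximizer is kept. A small sketch follows, reusing the m3 helper sketched earlier; the Ward hierarchical cut used to produce the labels at each k is an illustrative assumption.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def pick_k_by_m3(E, S_list, k_range=range(2, 11)):
        """Return the cluster number in k_range that maximizes the M3 score.

        E      : (n, k_max) shared representation from the joint decomposition.
        S_list : per-modality similarity networks (e.g. built as in SNFtool).
        """
        Z = linkage(E, method="ward")
        scores = {}
        for k in k_range:
            labels = fcluster(Z, k, criterion="maxclust") - 1   # 0-based labels
            scores[k] = m3(S_list, labels)
        best_k = max(scores, key=scores.get)
        return best_k, scores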
To select significantly enriched expression changes within a cluster, a gene is considered over-expressed when the standard score (z-score) of its expression profile is ≥ 1 and under-expressed when the z-score is ≤ −1. For each feature, samples with z-score ≥ 1 for that feature are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of that gene; likewise, samples with z-score ≤ −1 are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test evaluates whether each class is significantly enriched for under-expression of that gene. After FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ −1, genes with p-value less than 0.05 are selected as significantly enriched features. To select the features that best represent a single class, the selected genes are further ranked by the standard deviation of gene expression, and only the top-ranked genes are selected as the cluster-related features of that class.
For methylation, a gene is considered highly methylated when the methylation detection value β ≥ 0.25 and unmethylated when β ≤ −0.25. For each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data. For binary data such as gene mutations, genes enriched for mutations are selected using the same criteria as for the expression data. Let X = {X_1, X_2, …, X_N} be the observation data of n samples in N modalities, where X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality; let C_k be the clustering result when the cluster number is k, and let f be the number of features selected for each category. The specific calculation method is as follows:
SSS1: let i = 1;
SSS2: compute the z-score of the gene expression profile by subtracting the mean from the original value and dividing by the standard deviation, obtaining Z_i;
SSS3: let up_i ∈ R^{n×m_i} and down_i ∈ R^{n×m_i} be all-zero matrices, and let Fu_i ∈ R^{f×k} and Fd_i ∈ R^{f×k};
SSS4: let j1 = 1;
SSS5: let l1 = 1;
SSS6: if Z_i^{j1,l1} ≥ 1, let up_i^{j1,l1} = 1, otherwise up_i^{j1,l1} = 0; if Z_i^{j1,l1} ≤ −1, let down_i^{j1,l1} = 1, otherwise down_i^{j1,l1} = 0;
SSS7: if l1 is less than m_i, let l1 = l1+1 and jump to SSS6; otherwise jump to SSS8;
SSS8: if j1 is less than n, let j1 = j1+1 and jump to SSS5; otherwise jump to SSS9;
SSS9: let l2 = 1;
SSS10: initialize the enrichment significance vectors pu_i^{l2,·} ∈ R^k and pd_i^{l2,·} ∈ R^k;
SSS11: let k2 = 1;
SSS12: perform a hypergeometric distribution test based on the clustering result C_k and up_i^{·,l2} to obtain the enrichment significance pu_i^{l2,k2} of the current feature in the current class; perform a hypergeometric distribution test based on C_k and down_i^{·,l2} to obtain the enrichment significance pd_i^{l2,k2};
SSS13: compute the standard deviation sd_i^{l2} of the l2-th feature;
SSS14: if k2 is less than k, let k2 = k2+1 and jump to SSS12; otherwise jump to SSS15;
SSS15: if l2 is less than m_i, let l2 = l2+1 and jump to SSS10; otherwise jump to SSS16;
SSS16: let k3 = 1;
SSS17: apply FDR correction to pu_i^{·,k3} to obtain pau_i^{·,k3}, and to pd_i^{·,k3} to obtain pdu_i^{·,k3};
SSS18: select the features with pau_i^{·,k3} less than 0.05, sort the selected features by sd_i, take the first f as the significantly enriched up-regulated features of the current category k3, and store them in Fu_i^{·,k3}; select the features with pdu_i^{·,k3} less than 0.05, sort the selected features by sd_i, take the first f as the significantly enriched down-regulated features of category k3, and store them in Fd_i^{·,k3};
SSS19: if k3 is less than k, let k3 = k3+1 and jump to SSS17; otherwise jump to SSS20;
SSS20: if i is less than N, let i = i+1 and jump to SSS2; otherwise end the procedure, obtaining Fu_i, i = 1, …, N and Fd_i, i = 1, …, N.
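Steps SSS17-SSS18 are a Benjamini-Hochberg FDR correction followed by a standard-deviation ranking. A minimal sketch follows; the BH implementation is the textbook step-up procedure, not code from the patent, and the inputs are assumed to be NumPy arrays.

    import numpy as np

    def bh_fdr(pvals):
        """Benjamini-Hochberg adjusted p-values (the FDR correction of step SSS17)."""
        p = np.asarray(pvals, dtype=float)
        order = np.argsort(p)
        scaled = p[order] * len(p) / (np.arange(len(p)) + 1.0)
        adjusted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
        out = np.empty_like(p)
        out[order] = np.clip(adjusted, 0.0, 1.0)
        return out

    def top_f_features(p_adj, sd, f):
        """Step SSS18: keep features with adjusted p < 0.05, ranked by std. dev."""
        idx = np.where(p_adj < 0.05)[0]
        return idx[np.argsort(-sd[idx])][:f]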
Experiment verification 1: m3JF and other multi-set chemical integration analysis methods were tested on three sets of simulated reference data. The first group, called interger_data, is generated by R-pack interger based on the TCGA real ovarian cancer dataset. The dataset consisted of four classes, each class having 100, 150, 135 and 115 samples, respectively, modeling 367 DNA methylation signatures, 131 mRNA gene expression signatures and 160 protein expression signatures in total. The second group is called iNMF_data, the data generation code of which is disclosed on the gitsub, see https:// www.github.com/yangzi4/iNMF, which is derived by the iNMF method, wherein each data modality is built up from the sum of three matrices: one consisting of three shared diagonal blocks of the same dimension, one consisting of one or two data-specific off-diagonal blocks, and the other consisting of random uniform noise. The dataset consisted of four classes, each class having 25 samples each with a dimension of 100. The dataset contains data of three modalities, each generated under different disturbances and noise. The last set is called crimix_data, three modalities of data with different data distributions generated by R-packets crimix. The data contained four classes, each with 10, 20, 5 and 25 samples, simulating 1000 transcriptome features following the Guassian distribution, 5000 DNA methylation features with the β -like distribution, and 500 gene mutation features with the binary distribution.
To demonstrate the ability of the M3 value to evaluate the number of clusters without supervision, this experiment varied the cluster number k in the range 2-10, clustered the simulated data using the M3JF method, and computed the modularity value of each modality as well as the M3 value. FIGS. 1A, D and G show the corresponding modularity values and M3 values for the InterSIM_data, iNMF_data and crimix_data datasets. The M3 value evaluates the cluster number correctly in all cases.
To evaluate the effectiveness of M3JF and related methods, each dataset was generated 30 times; the datasets were clustered, and the adjusted Rand index (ARI) between the clustering results and the truth labels was computed. The results are shown in FIGS. 1B, E and H respectively. Among all methods, M3JF, SNF and intNMF gave the best results on all simulated datasets. PintMF drops slightly on crimix_data because the Gaussian noise of that dataset is relatively stronger. PINSPlus fails whenever the data structure is complex, whereas CIMLR does not work properly when the number of samples per cluster varies significantly. MoCluster drops sharply on iNMF_data because its strong hypothesis, which assumes the noise has the same variance across variables, fails. RGCCA performs poorly, possibly owing to the many parameters that need tuning. Although lfmmdVAE and efmmdVAE work best among the unsupervised deep learning models for integrative cancer data analysis, there remains a large gap between them and the non-deep methods. The experiment selected the first 20 features that differ most and show significant enrichment in each cluster (FDR < 0.05). FIGS. 1C, F and I depict heat maps of the selected features of the first data modality of the three simulated datasets; these features show significant differences between clusters.
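For reference, the ARI computation is a one-liner with scikit-learn; the labels below are dummies for illustration only.

    from sklearn.metrics import adjusted_rand_score

    truth = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth classes
    pred = [0, 0, 1, 2, 2, 2]    # hypothetical clustering result
    print(adjusted_rand_score(truth, pred))   # 1.0 only for a perfect match, up to relabeling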
In addition, this experiment tested the ability of other methods in cluster number evaluation, including the rotation cost value (RCV) of SNF, the cluster prediction index (CPI) of intNMF, the percentage of variation explained (PVE) of PintMF, and the separation cost value (SCV) of CIMLR. The results of testing these four values by varying k from 2 to 10 on the three sets of simulated data are shown in FIG. 2. SNF prefers the cluster number with the lowest RCV and selects 4, 4 for the InterSIM_data, iNMF_data and crimix_data datasets, consistent with the truth labels. intNMF prefers the cluster number that maximizes CPI, selecting 4, 2 respectively. PVE prefers the cluster number at which the value starts to plateau, taking 4, 4 as the best cluster number for each group. SCV is considered optimal where it exhibits its maximum drop over the candidate cluster numbers, selecting 4, 10 as the optimal cluster numbers. According to these results, the M3 value proposed by the invention can estimate the optimal cluster number on multiple simulated datasets.
The experiment further plots the selected features of the second and third modalities of the simulated datasets, as shown in FIG. 3. Visualizing the selected features of each cluster demonstrates the effectiveness of the feature selection method used by the invention.
Experiment verification 2: to facilitate comparison with other methods, this experiment analyzed mRNA (ID: tcga. BRCA. Samplemap/HiSeqV 2) data, miRNA (ID: tcga. BRCA. Samplemap/mirna_hiseq_gene) data, and DNA methylation data (two platform IDs: tcgap. BRCA. Amplimemap/HumanMethylation 27 and tcga. Bbrca. Samplemap/HumanMethylation 450) of breast invasive cancer (breast invasive cancer, BRCA). Specifically, the dataset included 1215 samples with 20530 mRNA features, 853 samples with 1046 miRNA features, 872 samples with 485577 DNA methylation features (human methylation 450) and 345 samples with 27578 DNA methylation features. Through data filtering, there were 826 samples with both mRNA, miRNA and DNA methylation characteristics, and this experiment retained 20073 DNA methylation characteristics shared between the two humanmethyl 450 and humanmethyl 27 platforms.
The experiment varied the cluster number k between 2 and 10 to estimate the optimal cluster number of the BRCA data based on the M3 value, RCV, CPI, PVE and SCV. The M3 results are shown in FIG. 4A, which gives the trends of the modularity values of the mRNA, miRNA and DNA methylation modalities and of the M3 value. As seen in FIG. 4, the M3 value selects 5 as the optimal cluster number. The cluster number evaluation results of the other four values are shown in FIGS. 4B-E; they select 4, 2, 5 and 3 as the optimal cluster numbers respectively.
Based on the performance of the cluster number evaluation methods, the experiment used all methods to classify the BRCA data into 5 clusters, and further tested SNF, intNMF, PintMF and CIMLR with their own preferred cluster numbers. FIG. 4F shows the -log10-transformed p-values of the Cox log-rank test for each method; the p-value of the M3JF method is lower than those of the other methods. Kaplan-Meier survival curves of the BRCA subtyping results were then plotted for each method: the survival curves of M3JF are shown in FIG. 5A, and those of the other methods, including at other cluster numbers, are shown in FIG. 6. Comparison of the Kaplan-Meier survival curves shows that the subtyping result of M3JF discriminates better between subtypes than the other methods.
In addition, the consistency index of each method was computed (M3JF 0.671, LRAcluster 0.543, PINSPlus 0.527, SNF 0.588, CIMLR 0.607, MoCluster 0.586, intNMF 0.475, PintMF 0.578, lfmmdVAE 0.545, efmmdVAE 0.496, together with SNF at 4 subtypes, intNMF at 2 subtypes and CIMLR at 0.519). The higher consistency index of M3JF compared with the other methods means that M3JF captures a better intrinsic data structure, whose sample-pair predictions are more consistent with the truth labels. By projecting the high-dimensional features of the samples in each subtype into two-dimensional space using t-SNE, the clustering result of M3JF is further visualized; as shown in FIG. 5B, samples of different subtypes are clearly separated under the M3JF classification.
The experiment selected the first 20 features that showed significant enrichment in each cluster (FDR < 0.05) and differed most. FIG. 5C plots the heat maps of the selected features of the mRNA, miRNA and DNA methylation data, totaling 186 mRNA features, 127 miRNA features and 104 DNA methylation features, which exhibit significant differences between subtypes.
All data used in the invention are downloaded from TCGA and are available from the UCSC Cancer Browser (https://genome-cancer.ucsc.edu).
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A data subtype classification method with sparse constraint multi-modal matrix joint decomposition is characterized by comprising the following steps:
finding a shared latent representation E among the multi-modal data and performing cluster analysis on its basis, wherein X = {X_1, X_2, …, X_N} is a set of observations of n samples in N modalities and X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N; finding a set of modality-specific basis matrices H = {H_1, H_2, …, H_N}, wherein H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i, wherein k is the number of clusters and m_i is the feature dimension of the samples in the i-th modality;
wherein each data modality is treated as a group and, on this basis, a group sparse constraint is applied to the basis matrix set {H_i} by defining an objective function and solving the objective function;
the objective function is:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
wherein ||·||_F denotes the Frobenius norm and the second term is the group sparse constraint; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k;
in the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input;
the specific solving process of the objective function is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3: fix {H_i} unchanged and update E; the specific steps are S31-S36:
Step S31: set the initial value of i1 to 1, i1 = 1, 2, …, N; initialize the intermediate quantity XH as an n×k all-zero matrix and the intermediate quantity HH as a k×k all-zero matrix;
Step S32: assign the result of XH + X_{i1}(H_{i1})^T to XH, where the superscript T denotes the transpose;
Step S33: assign the result of HH + H_{i1}(H_{i1})^T to HH;
Step S34: if i1 is less than N, assign i1+1 to i1 and jump to step S32; otherwise jump to step S35;
Step S35: compute the matrix inverse revH = (HH)^{-1};
Step S36: assign the result of XH × revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification result.
2. The method of claim 1, wherein the modularity is extended to a weighted modularity:
Q = (1/(2D)) Σ_{v,w} [A_{v,w} − (k_v k_w)/(2D)] δ(c_v, c_w) = Σ_{r=1}^{c} (e_{rr} − a_r^2)
wherein v and w are any two nodes in the network; δ(c_v, c_w) determines whether nodes v and w are in the same community: δ(c_v, c_w) = 1 if they are, otherwise δ(c_v, c_w) = 0; k_v and k_w denote the weights of node v and node w respectively; c denotes the total number of communities; e_{r,s} denotes the edge weight between nodes in community r and nodes in community s, and e_{rr} denotes the ratio of all edges within community r to all edges of the entire network; a_r denotes the ratio of the degree of the nodes in community r to the degree of the entire network; A_{v,w} denotes the weight between nodes v and w; and D denotes the sum of the weights of all edges in the network;
let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1, using the similarity calculation method in the R package SNFtool, computing the similarity network set $\{S_{1}, S_{2}, \ldots, S_{N}\}$ from the N modes of data;
SS2, obtaining the clustering result $C_{k}$ when the number of clusters is k;
SS3, setting the initial value of i to 1;
SS4, setting $E_{i} \in R^{k \times k}$ to an all-zero matrix and setting the initial value of $D_{i}$ to 0; computing the sum of the weights of all edges in network $S_{i}$, $D_{i} = \sum_{t=1}^{n}\sum_{t1=1}^{n} S_{i}^{t,t1}$, where $S_{i}^{t,t1}$ is the element in row t, column t1 of network $S_{i}$; t=1,2,...,n, t1=1,2,...,n;
SS5, setting the initial value of r to 1; r=1,2,...,k;
SS6, obtaining the index set $index_{r}$ of the samples in $C_{k}$ belonging to class r;
SS7, setting the initial value of s to 1; s=1,2,...,k;
SS8, obtaining the index set $index_{s}$ of the samples in $C_{k}$ belonging to class s;
SS9, letting $(E_{i})_{r,s} = \frac{1}{D_{i}}\sum_{t \in index_{r}}\sum_{t1 \in index_{s}} S_{i}^{t,t1}$;
SS10, if s is less than k, assigning the value of s+1 to s and jumping to SS8; otherwise jumping to SS11;
SS11, assigning $\sum_{s=1}^{k}(E_{i})_{r,s}$ to $a_{r}^{(i)}$;
SS12, if r is less than k, assigning the value of r+1 to r and jumping to SS6; otherwise jumping to SS13;
SS13, if i is less than N, assigning the value of i+1 to i and jumping to SS4; otherwise jumping to SS14;
SS14, letting $Q_{i}^{k} = \sum_{r=1}^{k}\left((E_{i})_{r,r} - (a_{r}^{(i)})^{2}\right)$ for each mode i, and letting $\bar{Q}^{k} = \frac{1}{N}\sum_{i=1}^{N} Q_{i}^{k}$.
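As a concrete reading of steps SS1 to SS14, the sketch below computes the weighted modularity $Q_{i}^{k}$ of one similarity network from a flat cluster assignment and then averages across modes; the helper name weighted_modularity and the random stand-in networks are illustrative, and in the method itself the networks would come from the SNFtool similarity computation of step SS1.

```python
import numpy as np

def weighted_modularity(S, labels, k):
    """Weighted modularity Q = sum_r (e_rr - a_r^2) of one network S.

    S      : (n, n) symmetric non-negative similarity (weight) matrix
    labels : length-n integer cluster assignment in {0, ..., k-1}
    k      : number of clusters
    """
    D = S.sum()                                   # total edge weight, step SS4
    E = np.zeros((k, k))                          # fraction-of-weight matrix E_i
    for r in range(k):
        idx_r = np.where(labels == r)[0]          # index_r of step SS6
        for s in range(k):
            idx_s = np.where(labels == s)[0]      # index_s of step SS8
            E[r, s] = S[np.ix_(idx_r, idx_s)].sum() / D   # step SS9
    a = E.sum(axis=1)                             # a_r, step SS11
    return float(np.sum(np.diag(E) - a ** 2))     # Q_i^k, step SS14

# Inter-modal mean over N random stand-in networks.
rng = np.random.default_rng(0)
nets = [(M + M.T) / 2 for M in (rng.random((20, 20)) for _ in range(3))]
labels = rng.integers(0, 4, size=20)
Q_mean = np.mean([weighted_modularity(S, labels, 4) for S in nets])
```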
3. The method of any one of claims 1-2, wherein, for each feature, a sample whose standard score (z-score) for that feature satisfies z-score ≥ 1 is coded as 1 and the remaining samples are coded as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of the gene; a sample with z-score ≤ -1 is coded as 1 and the remaining samples are coded as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for under-expression of the gene; after FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ -1, the genes with p-value < 0.05 are selected as significantly enriched features.
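A minimal sketch of the enrichment test in this claim, assuming SciPy's hypergeometric distribution and the Benjamini-Hochberg FDR correction from statsmodels; the function name enrich_pvalue and the toy data are illustrative and not prescribed by the claim.

```python
import numpy as np
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

def enrich_pvalue(flagged, in_class):
    """Hypergeometric test: is the flagged set over-represented in the class?

    flagged  : boolean array, True where z-score >= 1 (or <= -1)
    in_class : boolean array, True for samples in the class under test
    """
    N = flagged.size                              # population size
    K = int(flagged.sum())                        # flagged samples overall
    n = int(in_class.sum())                       # class size
    x = int((flagged & in_class).sum())           # flagged samples in the class
    return hypergeom.sf(x - 1, N, K, n)           # P(X >= x)

# Toy example: one gene binarized at z-score >= 1, tested against one class.
rng = np.random.default_rng(1)
z = rng.standard_normal(100)
p = enrich_pvalue(z >= 1, rng.integers(0, 2, size=100).astype(bool))

# FDR-correct a batch of p-values and keep genes with corrected p < 0.05.
pvals = rng.uniform(size=50)
keep = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
```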
4. The method according to claim 3, characterized in that the selected genes are ranked according to the standard deviation of their gene expression, and only the top-ranked genes are selected as the cluster-related features of the category.
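A one-step sketch of the ranking in this claim, assuming a genes-by-samples expression matrix; the cutoff top_n is an illustrative parameter that the claim leaves open.

```python
import numpy as np

def top_variable_genes(expr, top_n=50):
    """Rank genes (rows) by the standard deviation of their expression
    across samples (columns) and return the indices of the top_n genes."""
    stds = expr.std(axis=1)                       # per-gene standard deviation
    return np.argsort(stds)[::-1][:top_n]         # descending rank, top slice
```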
5. The method according to any one of claims 1-2 or 4, wherein, for methylation data, a gene is considered highly methylated when the methylation detection value β is ≥ 0.25 and non-methylated when β < 0.25; for each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data; for binary mutation data, genes enriched for mutations are selected using the same criteria as for the expression data.
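For the methylation rule of this claim, a small sketch of the β-value binarization, assuming β values in [0, 1]; the function name is illustrative.

```python
import numpy as np

def binarize_methylation(beta, threshold=0.25):
    """Code genes as highly methylated (1) when beta >= threshold, else 0;
    the resulting 0/1 vector feeds the same hypergeometric enrichment
    test used for the expression data."""
    return (np.asarray(beta) >= threshold).astype(int)

# Example: binarize a few beta values at the claim's 0.25 threshold.
flags = binarize_methylation([0.05, 0.30, 0.24, 0.80])   # -> [0, 1, 0, 1]
```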
CN202310104611.XA 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition Active CN116246712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310104611.XA CN116246712B (en) 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Publications (2)

Publication Number Publication Date
CN116246712A CN116246712A (en) 2023-06-09
CN116246712B true CN116246712B (en) 2024-03-26

Family

ID=86625634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310104611.XA Active CN116246712B (en) 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Country Status (1)

Country Link
CN (1) CN116246712B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017128799A1 (en) * 2016-01-27 2017-08-03 深圳大学 Hyperspectral remote sensing image classification method and system based on three-dimensional gabor feature selection
WO2018009887A1 (en) * 2016-07-08 2018-01-11 University Of Hawaii Joint analysis of multiple high-dimensional data using sparse matrix approximations of rank-1
CN109411019A (en) * 2018-12-12 2019-03-01 中国人民解放军军事科学院军事医学研究院 A kind of drug prediction technique, device, server and storage medium
CN109670418A (en) * 2018-12-04 2019-04-23 厦门理工学院 In conjunction with the unsupervised object identification method of multi-source feature learning and group sparse constraint
CN109670543A (en) * 2018-12-12 2019-04-23 中国人民解放军军事科学院军事医学研究院 A kind of data fusion method and device
CN110633732A (en) * 2019-08-15 2019-12-31 电子科技大学 Multi-modal image recognition method based on low-rank and joint sparsity
CN110826635A (en) * 2019-11-12 2020-02-21 曲阜师范大学 Sample clustering and feature identification method based on integration non-negative matrix factorization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zahir Noorie; Fatemeh Afsari. Regularized sparse feature selection with constraints embedded in graph Laplacian matrix. IEEE. 2018, full text. *
Joint dynamic sparse representation classification algorithm for multi-observation samples based on low-rank decomposition; Hu Zhengping; Gao Hongxiao; Zhao Shuhuan; Acta Electronica Sinica (03); full text *
Analysis of breast cancer microarray expression data based on an improved sparse non-negative matrix factorization method; Kong Wei; Wang Juan; Mou Xiaoyang; Journal of Anhui Medical University (07); full text *

Similar Documents

Publication Publication Date Title
Deb et al. Reliable classification of two-class cancer data using evolutionary algorithms
Brazma et al. Gene expression data analysis
Levine et al. Resampling method for unsupervised estimation of cluster validity
CN109326316B (en) Multilayer network model construction method and application of interaction of cancer-related SNP, gene, miRNA and protein
Frigyesi et al. Independent component analysis reveals new and biologically significant structures in micro array data
Li et al. Gene selection using genetic algorithm and support vectors machines
Pensa et al. Assessment of discretization techniques for relevant pattern discovery from gene expression data
CN112466404B (en) Metagenome contig unsupervised clustering method and system
CN112232413A (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
Montserrat et al. Lai-net: Local-ancestry inference with neural networks
Park et al. Evolutionary fuzzy clustering algorithm with knowledge-based evaluation and applications for gene expression profiling
US20090043718A1 (en) Evolutionary hypernetwork classifiers for microarray data analysis
CN116246712B (en) Data subtype classification method with sparse constraint multi-mode matrix joint decomposition
CN115662504A (en) Multi-angle fusion-based biological omics data analysis method
KR102376212B1 (en) Gene expression marker screening method using neural network based on gene selection algorithm
Tasoulis et al. Unsupervised clustering of bioinformatics data
Oh et al. Hybrid clustering of single-cell gene expression and spatial information via integrated NMF and k-means
Ye et al. Interactive gene identification for cancer subtyping based on multi-omics clustering
CN112768001A (en) Single cell trajectory inference method based on manifold learning and main curve
CN108182347B (en) Large-scale cross-platform gene expression data classification method
CN111739581A (en) Comprehensive screening method for genome variables
Zhang et al. Differential function analysis: identifying structure and activation variations in dysregulated pathways
Tang et al. Mining multiple phenotype structures underlying gene expression profiles
Brenerman et al. Random Forest Factorization Reveals Latent Structure in Single Cell RNA Sequencing Data
Wang et al. Clustering analysis of microarray gene expression data by splitting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant