CN116246712B - Data subtype classification method with sparse constraint multi-mode matrix joint decomposition - Google Patents

Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Info

Publication number
CN116246712B
Authority
CN
China
Prior art keywords
jumping
value
data
initial value
setting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310104611.XA
Other languages
Chinese (zh)
Other versions
CN116246712A (en)
Inventor
何昆
尹晓尧
伯晓晨
王娜
陈河兵
董方霆
李卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Academy of Military Medical Sciences AMMS of PLA
Original Assignee
Academy of Military Medical Sciences AMMS of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Academy of Military Medical Sciences AMMS of PLA
Priority to CN202310104611.XA
Publication of CN116246712A
Application granted
Publication of CN116246712B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/30 - Unsupervised data analysis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 - ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 - Mutagenesis
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 - Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Public Health (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Molecular Biology (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, which can mine shared-specific data structures among any subset of modalities. Aiming at real data in which data structures may be shared among arbitrary subsets of modalities, the invention introduces a group sparse constraint, a special sparse constraint method typically implemented with the l_{1,2} norm or the l_{1,∞} norm, which constrains samples of the same group to rely on the same features while samples of different groups rely on more different features. By treating each data modality as a group, applying the group sparse constraint, and combining it with the shared/modality-specific concept in joint matrix decomposition, shared-specific data structures among any modalities can be mined; on this basis a more reasonable clustering result is obtained, and the effect is verified on simulated data and real breast cancer data.

Description

Data subtype classification method with sparse constraint multi-mode matrix joint decomposition
Technical Field
The invention relates to the technical field of tumor multi-omics data subtype classification, in particular to a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints.
Background
Complex diseases such as tumors exhibit diversity and heterogeneity at the genome, transcriptome, proteome, and epigenome levels. Recent technological advances have enabled the acquisition of multi-omics data that can be used to explore the pathological complexity of disease. Notably, The Cancer Genome Atlas (TCGA) has collected genomic and transcriptomic information for more than 20 cancers from thousands of patients, including about 2000 breast cancer samples. Based on TCGA data, integrated cluster analysis of gene expression profiles and DNA methylation data can identify new subgroups beyond the classical biomarker expression subtypes.
Therefore, there is a need to further develop computational methods for the integrated analysis of multi-omics data of cancer patients, particularly when the data exhibit heterogeneity among omics layers (each layer can be regarded as a modality). Most existing integrated analysis methods must address several issues inherent to biological data: small sample sizes with very high-dimensional features (the so-called curse of dimensionality), inconsistent data ranges, and omics-specific and shared structural patterns among multi-omics data that are easily ignored. Existing mathematical methods for multi-modal data integration can be divided into three major categories: early integration, late integration, and intermediate integration. Early integration is the simplest: different omics data are concatenated into a single matrix and a single-omics clustering technique is applied. However, this increases the data dimension and exacerbates the curse of dimensionality. Alternatively, a set of important features may be pre-selected from each data modality, and the modalities are then integrated using consensus clustering, non-negative matrix factorization (NMF), or independent component analysis (ICA); but the feature pre-selection process is extremely time-consuming and may discard important information. Late integration obtains its final result by clustering each omics dataset separately and then integrating the clustering results; such a brute-force integration scheme may produce confusing results when the clustering results of the different omics datasets are inconsistent.
Intermediate integration, the third type of multi-modal integration analysis, sits between early and late integration and can be further divided into sequential analysis and joint analysis. In sequential analysis, the model first analyzes one data modality and then adjusts the optimization results through subsequent analysis of the other modalities. Sequential analysis methods, such as multiple co-perturbation methods and trans- and cis-related gene analysis, assume a causal relationship between one omics dataset (e.g., the transcriptome) and another. However, such methods are sensitive to the order of analysis: the integrated analysis of the different modalities must be performed in a fixed order, changing the order may lead to different results, and these methods do not transfer to other types of datasets. Joint analysis has at least one of the following characteristics: 1) computing sample similarity; 2) combining the different omics datasets through dimension reduction; 3) statistically modeling the multi-modal data. Based on these properties, joint analysis methods can be further classified into similarity-based, dimension-reduction, statistical, and deep-learning-based methods. Statistical models such as iCluster and its variants, including iCluster+ and iClusterBayes, assume that the multi-omics data share latent Gaussian variables; however, the iterative expectation-maximization algorithm they use is computationally complex and does not necessarily converge to a deterministic or optimal solution. Similarity-based methods, including SNF (Similarity Network Fusion), CIMLR (Cancer Integration via Multikernel Learning), rMKL-LPP, mixKernel, and extended spectral clustering methods, have attracted considerable attention; they pre-construct a sample similarity matrix by multi-kernel learning and then group the samples using spectral clustering. Constructing the similarity matrix alleviates the dimensionality problem to some extent, essentially building a new similarity matrix from multiple similarity matrices; however, some features may disappear during its construction, and extracting cluster-related features becomes complicated. Dimension-reduction methods project the multi-omics data into shared and omics-specific sub-matrices in a low-dimensional space, using covariance between datasets or joint matrix/tensor factorization, and impose additional sparse constraints on the sub-matrices. However, these methods are generally not effective at extracting structures shared among arbitrary subsets of omics or specific to individual omics, because they only capture the information shared across all modalities. Deep learning models fuse the multiple data modalities before or after learning low-dimensional embeddings with an autoencoder, denoising autoencoder, variational autoencoder, or stacked variational autoencoder. Typical examples are intNMF (integrative NMF) and PintMF (Penalized Integrative Matrix Factorization); such methods do not perform well, owing to overfitting when embedding very high-dimensional raw data into a very low-dimensional space.
In summary, the existing methods have three problems:
1. These methods only consider data structures completely shared among all modalities and ignore sharing among arbitrary subsets of modalities, i.e., a clustering relation may be shared by only two or three modalities rather than all of them; consequently, the obtained shared matrix either cannot represent such structures at all or mixes them with the information shared by all modalities;
2. These methods either provide no cluster number estimation for multi-modal data, or the proposed estimation method is impractical and cannot accurately recover the expected cluster number even on artificially synthesized simulated data;
3. These methods provide no technical support for extracting cluster-related features, or they erase the original features by constructing a similarity matrix, so features can only be estimated by mutual information or tested one by one with a bootstrap method; given the high dimensionality of biomedical data, mutual-information and bootstrap approaches require running the corresponding cluster analysis P times (P being the number of features), which is extremely time- and labor-consuming.
Finally, none of the existing methods accounts for the modality-specific parts of the data, and therefore either cannot represent them at all or mixes them with the information shared by all modalities.
Disclosure of Invention
In view of the above, the invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, which can mine shared-specific data structures among any modalities.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
a data subtype classification method with sparse constraint multi-modal matrix joint decomposition comprises the following steps:
finding a shared latent representation E among the multi-modal data and performing cluster analysis on its basis, where X = {X_1, X_2, …, X_N} is a set of observations of n samples in N modalities and X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N; finding a set of modality-specific basis matrices H = {H_1, H_2, …, H_N}, where H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i, where k is the number of clusters and m_i is the feature dimension of the samples in the i-th modality;
wherein each data modality is treated as a group and, on this basis, a group sparse constraint is applied to the basis matrix set {H_i}; this is realized by defining an objective function and solving the objective function.
Wherein the objective function is:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
where ||·||_F denotes the Frobenius norm and the second term is the group sparse constraint over the rows of the basis matrices, one group per modality; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k;
In the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input.
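For concreteness, the reconstruction-plus-penalty structure of J can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation; in particular, writing the group sparse term as a sum of per-row l_∞ norms (one group per modality and row index) is an assumption consistent with the abstract's l_{1,∞} formulation and with the clipping-based update in steps S46-S419 below.

    import numpy as np

    def objective(Xs, E, Hs, lam):
        """Sketch of J: reconstruction error plus a group sparse penalty.

        Xs  : list of (n, m_i) observation matrices, one per modality.
        E   : (n, k) shared latent representation.
        Hs  : list of (k, m_i) modality-specific basis matrices.
        lam : coefficient of the group sparse constraint term.
        """
        recon = sum(np.linalg.norm(X - E @ H, "fro") ** 2 for X, H in zip(Xs, Hs))
        # Assumed l_{1,inf}-type group penalty: one group per (modality, row) pair.
        group = sum(np.abs(H[j, :]).max() for H in Hs for j in range(H.shape[0]))
        return recon + lam * group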
The specific solving process of the objective function is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3, fixingThe updating E is unchanged, and the specific steps are as shown in the steps S31-S36:
step S31, setting the initial value of i1 to 1, i1=1, 2..n; an all 0 matrix with an initial value of n×k for the intermediate quantity XH and an all 0 matrix with an initial value of k×k for the intermediate quantity HH;
step S32, willThe result of (2) is assigned to XH, and the superscript T represents the transpose;
step S33, willIs assigned to HH;
step S34, if i1 is less than N, the value of i1+1 is assigned to i1, and the step S32 is skipped; otherwise, jumping to the step S35;
step S35, solving the matrix inverse revh= (HH) -1
Step S36, assigning the result of XH x revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification results.
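Steps S46-S419 amount to a proximal step for the l_∞-type group penalty computed via sorting and thresholding. A minimal NumPy sketch of this inner routine follows; it assumes, as the step logic suggests, that the clipping level τ is the standard l1-projection threshold, and the name linf_prox_row is illustrative rather than from the patent. Clipping the signed entries of V is the usual closed form; the patent's unrendered formulas may include an additional scaling that is not reproduced here.

    import numpy as np

    def linf_prox_row(V):
        """Steps S46-S419 in sketch form: clip V at a data-dependent level tau.

        V1 is abs(V) sorted in descending order (step S47); count is the largest
        m with (sum(V1[:m]) - 1)/m < V1[m-1] (steps S48-S413); tau is then
        (sum(V1[:count]) - 1)/count (step S415); the entries are finally clipped
        at +/-tau (steps S416-S418) and returned as the new row (step S419).
        """
        V1 = np.sort(np.abs(V))[::-1]
        csum = np.cumsum(V1)
        count = 0
        for m in range(1, len(V1) + 1):           # steps S48-S413
            if (csum[m - 1] - 1.0) / m < V1[m - 1]:
                count = m
        if count == 0:                            # step S414
            return np.zeros_like(V)
        tau = (csum[count - 1] - 1.0) / count     # step S415
        return np.clip(V, -tau, tau)              # steps S416-S419

In the full update, V = (E_{·,j})^T R_{i2}^j / λ and the returned vector is written to the j-th row of H_{i2}, as step S419 prescribes.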
Wherein the modularity is extended to a weighted modularity:
Q = (1/(2D)) Σ_{v,w} [A_{v,w} − (k_v k_w)/(2D)] δ(c_v, c_w) = Σ_{r=1}^{c} (e_{rr} − a_r^2)
wherein v and w are any two nodes in the network; δ(c_v, c_w) determines whether nodes v and w are in the same community: δ(c_v, c_w) = 1 if they are, otherwise δ(c_v, c_w) = 0; k_v and k_w denote the weights of node v and node w respectively; c denotes the total number of communities; e_{r,s} denotes the edge weight between nodes in community r and nodes in community s, and e_{rr} denotes the ratio of all edges within community r to all edges of the entire network; a_r denotes the ratio of the degree of the nodes in community r to the degree of the entire network; A_{v,w} denotes the weight between nodes v and w; and D denotes the sum of the weights of all edges in the network;
Let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1: given {X_i}, compute the similarity network set {S_i ∈ R^{n×n}, i = 1, 2, …, N} using the similarity calculation method in the R package SNFtool;
SS2: obtain the clustering result C_k at the current cluster number k;
SS3: set the initial value of i to 1;
SS4: set E_i ∈ R^{k×k} to an all-zero matrix and the initial value of Q_i^k to 0; compute the sum of the weights of all edges in the network S_i, D_i = Σ_{t=1}^{n} Σ_{t1=1}^{n} S_i^{t,t1}, where S_i^{t,t1} is the element in row t and column t1 of the network S_i, t = 1, 2, …, n, t1 = 1, 2, …, n;
SS5: set the initial value of r to 1; r = 1, 2, …, k;
SS6: obtain the index set Index_r of the samples belonging to class r in C_k;
SS7: set the initial value of s to 1; s = 1, 2, …, k;
SS8: obtain the index set Index_s of the samples belonging to class s in C_k;
SS9: let E_i^{r,s} = (Σ_{t∈Index_r} Σ_{t1∈Index_s} S_i^{t,t1}) / D_i;
SS10: if s is less than k, assign s+1 to s and jump to SS8; otherwise jump to SS11;
SS11: assign the result of Q_i^k + E_i^{r,r} − (Σ_{s=1}^{k} E_i^{r,s})^2 to Q_i^k;
SS12: if r is less than k, assign r+1 to r and jump to SS6; otherwise jump to SS13;
SS13: if i is less than N, assign i+1 to i and jump to SS4; otherwise jump to SS14;
SS14: let M3_k = (1/N) Σ_{i=1}^{N} Q_i^k.
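As a sketch, the weighted modularity and the M3 score of steps SS1-SS14 can be written directly from the block-sum definitions above; the helper names and the use of NumPy boolean indexing are illustrative assumptions, not code from the patent.

    import numpy as np

    def weighted_modularity(S, labels):
        """Q = sum_r (e_rr - a_r^2) on a weighted similarity network.

        S      : (n, n) symmetric nonnegative similarity matrix (steps SS1, SS4).
        labels : length-n integer cluster assignment with values 0..k-1 (step SS2).
        """
        D = S.sum()                                   # total edge weight (step SS4)
        k = int(labels.max()) + 1
        e = np.zeros((k, k))
        for r in range(k):
            for s in range(k):                        # block sums (steps SS5-SS10)
                e[r, s] = S[np.ix_(labels == r, labels == s)].sum() / D
        a = e.sum(axis=1)                             # per-community degree fraction
        return float(np.trace(e) - np.sum(a ** 2))    # accumulated as in step SS11

    def m3(S_list, labels):
        """Mean of modality modularity (the M3 score of step SS14)."""
        return float(np.mean([weighted_modularity(S, labels) for S in S_list]))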
Wherein, for each feature, samples with standard score z-score ≥ 1 for that feature are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of that gene; likewise, samples with z-score ≤ −1 are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for under-expression of that gene; after FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ −1, genes with p-value less than 0.05 are selected as significantly enriched features.
Wherein the selected genes are ranked by the standard deviation of gene expression, and only the top-ranked genes are selected as the cluster-related features of the category.
Wherein, for methylation, a gene is considered highly methylated when the methylation detection value β ≥ 0.25 and unmethylated when β ≤ −0.25; for each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data; for binary data, genes enriched for mutations are selected using the same criteria as for the expression data.
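The enrichment test described above is an ordinary one-sided hypergeometric test. A minimal sketch with SciPy follows, where the 0/1 flag vector and the cluster mask are the only inputs; the function name is illustrative.

    import numpy as np
    from scipy.stats import hypergeom

    def enrichment_pvalue(flag, in_cluster):
        """P(overlap >= observed) under the hypergeometric null.

        flag       : length-n 0/1 vector, e.g. 1 where a feature's z-score >= 1.
        in_cluster : length-n boolean mask selecting the samples of one class.
        """
        flag = np.asarray(flag)
        in_cluster = np.asarray(in_cluster, dtype=bool)
        n_total = len(flag)                      # population size
        n_flag = int(flag.sum())                 # flagged samples overall
        n_draw = int(in_cluster.sum())           # class size
        n_hit = int(flag[in_cluster].sum())      # flagged samples inside the class
        # sf(k-1, ...) gives P(X >= k) for the hypergeometric distribution.
        return float(hypergeom.sf(n_hit - 1, n_total, n_flag, n_draw))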
Advantageous effects
1. Aiming at real data in which data structures may be shared among arbitrary subsets of modalities, the invention introduces a group sparse constraint, a special sparse constraint method typically implemented with the l_{1,2} norm or the l_{1,∞} norm, which constrains samples of the same group to rely on the same features while samples of different groups rely on more different features. By treating each data modality as a group, applying the group sparse constraint, and combining it with the shared/modality-specific concept in joint matrix decomposition, shared-specific data structures among any modalities can be mined; on this basis a more reasonable clustering result is obtained, and the effect is verified on simulated data and real breast cancer data.
2. For the cluster number estimation problem of multi-modal data, the invention introduces the concept of modularity from network science. Traditional modularity estimates the number of communities in a network well from the data; the invention extends it to the multi-modal setting and designs the mean of modality modularity (M3) score for estimating the optimal cluster number of multi-modal data.
3. Since current biomedical data are high-dimensional and existing feature extraction methods are extremely time- and labor-consuming, the invention estimates the distribution of each feature across categories (over-expression and under-expression) with the hypergeometric distribution, testing whether the over-expression and under-expression of each feature differ significantly between categories, thereby selecting the category-specific over-expressed and under-expressed features.
4. The mean of modality modularity (M3) score of the invention estimates the cluster number of multi-modal data well; its performance is verified on the simulated data.
5. The invention rapidly extracts clustering-related multi-modal features by jointly considering the hypergeometric distribution test and the standard deviation of the data distribution, giving the extracted key features a clear decision criterion; the feature extraction heat maps of both the simulated data and the real breast cancer data verify the specificity of the extracted features.
Drawings
FIG. 1 is a schematic diagram of the results of analysis of simulated data of experimental verification 1 of the present invention;
FIG. 2 is a schematic diagram of the results of the experiment verification 1 of the invention for testing the ability of other methods in cluster number evaluation;
FIG. 3 is a schematic diagram of the cluster-related features extracted by the method of the invention from the second and third modalities of the three sets of simulation data in experimental verification 1;
FIG. 4 is a schematic diagram of the results of the clustering number estimation and subtype classification analysis of BRCA breast cancer data of experimental verification 2 of the invention;
FIG. 5 is a schematic diagram of the subtype classification results and related feature visualization results of BRCA breast cancer data of experimental verification 2 of the present invention;
FIG. 6 is a schematic representation of survival curves of BRCA breast cancer data analyzed by the different methods of experimental verification 2 of the present invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a data subtype classification method based on joint decomposition of multi-modal matrices with group sparse constraints, in particular a tumor multi-omics data subtype classification method, which aims to find a shared latent representation E among multi-modal data and perform cluster analysis on its basis. Let X = {X_1, X_2, …, X_N} be the observation data of n samples in N modalities, where X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N, and m_i is the feature dimension of the samples in the i-th modality. Meanwhile, to maintain semantic consistency, a set of modality-specific basis matrices H = {H_1, H_2, …, H_N} must also be found, where H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i. The method is realized by defining an objective function and solving it; the specific steps are as follows:
Define the objective function J:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2
where ||·||_F denotes the Frobenius norm. Treating each data modality as a group and applying the group sparse constraint to the basis matrix set {H_i}, the objective function is rewritten as:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
where the second term is the group sparse constraint; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k.
In the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input. The specific solving process is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3: fix {H_i} unchanged and update E; the specific steps are S31-S36:
Step S31: set the initial value of i1 to 1, i1 = 1, 2, …, N; initialize the intermediate quantity XH as an n×k all-zero matrix and the intermediate quantity HH as a k×k all-zero matrix;
Step S32: assign the result of XH + X_{i1}(H_{i1})^T to XH, where the superscript T denotes the transpose;
Step S33: assign the result of HH + H_{i1}(H_{i1})^T to HH;
Step S34: if i1 is less than N, assign i1+1 to i1 and jump to step S32; otherwise jump to step S35;
Step S35: compute the matrix inverse revH = (HH)^{-1};
Step S36: assign the result of XH × revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification results.
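Putting steps S1-S8 together, a compact NumPy sketch of the whole solver follows. It reuses the linf_prox_row routine sketched earlier for steps S46-S419; the division guard in the step-S6 convergence test and the Ward linkage for the final hierarchical clustering are illustrative choices, not prescribed by the patent.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def m3jf_solve(Xs, k, lam, theta=1e-6, max_iter=500):
        """Sketch of steps S1-S8: alternate E and H updates until delta < theta."""
        N, n = len(Xs), Xs[0].shape[0]
        E = np.zeros((n, k))                      # step S11
        Hs = []
        for X in Xs:                              # steps S12-S15
            u, d, vt = np.linalg.svd(X, full_matrices=False)
            Hs.append(vt[:k, :].copy())           # step S13: first k rows of v_i
            E += (u * d)[:, :k]                   # step S14
        E /= N                                    # step S16
        pre_loss = 0.0                            # step S2
        for _ in range(max_iter):
            # Step S3: fix the H_i, update E in closed form.
            XH = sum(X @ H.T for X, H in zip(Xs, Hs))
            HH = sum(H @ H.T for H in Hs)
            E = XH @ np.linalg.inv(HH)            # steps S35-S36
            # Step S4: fix E, update each row of each H_i.
            for i, X in enumerate(Xs):
                for j in range(k):
                    R = X - E @ Hs[i] + np.outer(E[:, j], Hs[i][j, :])  # steps S43-S45
                    V = (E[:, j] @ R) / lam              # step S46
                    Hs[i][j, :] = linf_prox_row(V)       # steps S47-S419, sketched earlier
            # Steps S5-S8: loss, relative change, termination test.
            this_loss = sum(np.linalg.norm(X - E @ H, "fro") ** 2
                            for X, H in zip(Xs, Hs))
            this_loss += lam * sum(np.abs(H[j, :]).max()
                                   for H in Hs for j in range(k))
            delta = abs(this_loss - pre_loss) / max(pre_loss, 1e-12)
            pre_loss = this_loss
            if delta < theta:
                break
        labels = fcluster(linkage(E, method="ward"), k, criterion="maxclust")
        return E, Hs, labels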
In 2006, newman published in PNAS article [ Newman M E J. Modularity and community structure in networks [ J ]. Proceedings of the national academy of sciences,2006, 103 (23): 8577-8582) defines the manner of calculation of the modularity as:
wherein,v and w are any two nodes in the network, if v and w are connected by an edge, A v,w =1, otherwise a v,w =0, d is the number of all edges in the network. k (k) v And k w Representing the weights of nodes v and w, respectively, delta (c) v ,c w ) Is used to determine whether nodes v and w are in the same community, delta (c) v ,c w ) =1, otherwise δ (c v ,c w )=0。e r,s Representing the edges of one node within community r and another node within community s. Then e rr A ratio of all edges within community r to all edges of the entire network is represented. And a is r It represents the ratio of the degree of nodes within the r-community to the degree of the entire network. c represents the total number of communities.
To account for the similarity between nodes (i.e., samples), the invention extends the modularity to a weighted modularity, which differs from the traditional modularity in the definitions of A_{v,w} and D: A_{v,w} denotes the weight between nodes v and w (i.e., the similarity of the two samples), and D denotes the sum of the weights of all edges in the network. Let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1: given {X_i}, compute the similarity network (matrix) set {S_i ∈ R^{n×n}, i = 1, 2, …, N} using the similarity calculation method in the R package SNFtool;
SS2: obtain the clustering result C_k when the cluster number is k;
SS3: set the initial value of i to 1;
SS4: set E_i ∈ R^{k×k} to an all-zero matrix and the initial value of Q_i^k to 0; compute the sum of the weights of all edges in the network S_i, D_i = Σ_{t=1}^{n} Σ_{t1=1}^{n} S_i^{t,t1}, where S_i^{t,t1} is the element in row t and column t1 of the network S_i, t = 1, 2, …, n, t1 = 1, 2, …, n;
SS5: set the initial value of r to 1; r = 1, 2, …, k;
SS6: obtain the index set Index_r of the samples belonging to class r in C_k;
SS7: set the initial value of s to 1; s = 1, 2, …, k;
SS8: obtain the index set Index_s of the samples belonging to class s in C_k;
SS9: let E_i^{r,s} = (Σ_{t∈Index_r} Σ_{t1∈Index_s} S_i^{t,t1}) / D_i;
SS10: if s is less than k, assign s+1 to s and jump to SS8; otherwise jump to SS11;
SS11: assign the result of Q_i^k + E_i^{r,r} − (Σ_{s=1}^{k} E_i^{r,s})^2 to Q_i^k;
SS12: if r is less than k, assign r+1 to r and jump to SS6; otherwise jump to SS13;
SS13: if i is less than N, assign i+1 to i and jump to SS4; otherwise jump to SS14;
SS14: let M3_k = (1/N) Σ_{i=1}^{N} Q_i^k.
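In practice the M3 score is scanned over candidate cluster numbers and the maximizer is kept. A small sketch follows, reusing the m3 helper sketched earlier; the Ward hierarchical cut used to produce the labels at each k is an illustrative assumption.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def pick_k_by_m3(E, S_list, k_range=range(2, 11)):
        """Return the cluster number in k_range that maximizes the M3 score.

        E      : (n, k_max) shared representation from the joint decomposition.
        S_list : per-modality similarity networks (e.g. built as in SNFtool).
        """
        Z = linkage(E, method="ward")
        scores = {}
        for k in k_range:
            labels = fcluster(Z, k, criterion="maxclust") - 1   # 0-based labels
            scores[k] = m3(S_list, labels)
        best_k = max(scores, key=scores.get)
        return best_k, scores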
To select significantly enriched expression changes within a cluster, a gene is considered over-expressed when the standard score (z-score) of its expression profile is ≥ 1 and under-expressed when the z-score is ≤ −1. For each feature, samples with z-score ≥ 1 for that feature are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of that gene; likewise, samples with z-score ≤ −1 are defined as 1 and the remaining samples as 0, and a hypergeometric distribution test evaluates whether each class is significantly enriched for under-expression of that gene. After FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ −1, genes with p-value less than 0.05 are selected as significantly enriched features. To select the features that best represent a single class, the selected genes are further ranked by the standard deviation of gene expression, and only the top-ranked genes are selected as the cluster-related features of that class.
For methylation, a gene is considered highly methylated when the methylation detection value β ≥ 0.25 and unmethylated when β ≤ −0.25. For each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data. For binary data such as gene mutations, genes enriched for mutations are selected using the same criteria as for the expression data. Let X = {X_1, X_2, …, X_N} be the observation data of n samples in N modalities, where X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality; let C_k be the clustering result when the cluster number is k, and let f be the number of features selected for each category. The specific calculation method is as follows:
SSS1: let i = 1;
SSS2: compute the z-score of the gene expression profile by subtracting the mean from the original value and dividing by the standard deviation, obtaining Z_i;
SSS3: let up_i ∈ R^{n×m_i} and down_i ∈ R^{n×m_i} be all-zero matrices, and let Fu_i ∈ R^{f×k} and Fd_i ∈ R^{f×k};
SSS4: let j1 = 1;
SSS5: let l1 = 1;
SSS6: if Z_i^{j1,l1} ≥ 1, let up_i^{j1,l1} = 1, otherwise up_i^{j1,l1} = 0; if Z_i^{j1,l1} ≤ −1, let down_i^{j1,l1} = 1, otherwise down_i^{j1,l1} = 0;
SSS7: if l1 is less than m_i, let l1 = l1+1 and jump to SSS6; otherwise jump to SSS8;
SSS8: if j1 is less than n, let j1 = j1+1 and jump to SSS5; otherwise jump to SSS9;
SSS9: let l2 = 1;
SSS10: initialize the enrichment significance vectors pu_i^{l2,·} ∈ R^k and pd_i^{l2,·} ∈ R^k;
SSS11: let k2 = 1;
SSS12: perform a hypergeometric distribution test based on the clustering result C_k and up_i^{·,l2} to obtain the enrichment significance pu_i^{l2,k2} of the current feature in the current class; perform a hypergeometric distribution test based on C_k and down_i^{·,l2} to obtain the enrichment significance pd_i^{l2,k2};
SSS13: compute the standard deviation sd_i^{l2} of the l2-th feature;
SSS14: if k2 is less than k, let k2 = k2+1 and jump to SSS12; otherwise jump to SSS15;
SSS15: if l2 is less than m_i, let l2 = l2+1 and jump to SSS10; otherwise jump to SSS16;
SSS16: let k3 = 1;
SSS17: apply FDR correction to pu_i^{·,k3} to obtain pau_i^{·,k3}, and to pd_i^{·,k3} to obtain pdu_i^{·,k3};
SSS18: select the features with pau_i^{·,k3} less than 0.05, sort the selected features by sd_i, take the first f as the significantly enriched up-regulated features of the current category k3, and store them in Fu_i^{·,k3}; select the features with pdu_i^{·,k3} less than 0.05, sort the selected features by sd_i, take the first f as the significantly enriched down-regulated features of category k3, and store them in Fd_i^{·,k3};
SSS19: if k3 is less than k, let k3 = k3+1 and jump to SSS17; otherwise jump to SSS20;
SSS20: if i is less than N, let i = i+1 and jump to SSS2; otherwise end the procedure, obtaining Fu_i, i = 1, …, N and Fd_i, i = 1, …, N.
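Steps SSS17-SSS18 are a Benjamini-Hochberg FDR correction followed by a standard-deviation ranking. A minimal sketch follows; the BH implementation is the textbook step-up procedure, not code from the patent, and the inputs are assumed to be NumPy arrays.

    import numpy as np

    def bh_fdr(pvals):
        """Benjamini-Hochberg adjusted p-values (the FDR correction of step SSS17)."""
        p = np.asarray(pvals, dtype=float)
        order = np.argsort(p)
        scaled = p[order] * len(p) / (np.arange(len(p)) + 1.0)
        adjusted = np.minimum.accumulate(scaled[::-1])[::-1]   # enforce monotonicity
        out = np.empty_like(p)
        out[order] = np.clip(adjusted, 0.0, 1.0)
        return out

    def top_f_features(p_adj, sd, f):
        """Step SSS18: keep features with adjusted p < 0.05, ranked by std. dev."""
        idx = np.where(p_adj < 0.05)[0]
        return idx[np.argsort(-sd[idx])][:f]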
Experiment verification 1: m3JF and other multi-set chemical integration analysis methods were tested on three sets of simulated reference data. The first group, called interger_data, is generated by R-pack interger based on the TCGA real ovarian cancer dataset. The dataset consisted of four classes, each class having 100, 150, 135 and 115 samples, respectively, modeling 367 DNA methylation signatures, 131 mRNA gene expression signatures and 160 protein expression signatures in total. The second group is called iNMF_data, the data generation code of which is disclosed on the gitsub, see https:// www.github.com/yangzi4/iNMF, which is derived by the iNMF method, wherein each data modality is built up from the sum of three matrices: one consisting of three shared diagonal blocks of the same dimension, one consisting of one or two data-specific off-diagonal blocks, and the other consisting of random uniform noise. The dataset consisted of four classes, each class having 25 samples each with a dimension of 100. The dataset contains data of three modalities, each generated under different disturbances and noise. The last set is called crimix_data, three modalities of data with different data distributions generated by R-packets crimix. The data contained four classes, each with 10, 20, 5 and 25 samples, simulating 1000 transcriptome features following the Guassian distribution, 5000 DNA methylation features with the β -like distribution, and 500 gene mutation features with the binary distribution.
To demonstrate the ability of the M3 value to evaluate the number of clusters without supervision, this experiment varied the cluster number k in the range 2-10, clustered the simulated data using the M3JF method, and computed the modularity value of each modality as well as the M3 value. FIGS. 1A, D and G show the corresponding modularity values and M3 values for the InterSIM_data, iNMF_data and crimix_data datasets. The M3 value evaluates the cluster number correctly in all cases.
To evaluate the effectiveness of M3JF and related methods, each dataset was generated 30 times; the datasets were clustered, and the adjusted Rand index (ARI) between the clustering results and the truth labels was computed. The results are shown in FIGS. 1B, E and H respectively. Among all methods, M3JF, SNF and intNMF gave the best results on all simulated datasets. PintMF drops slightly on crimix_data because the Gaussian noise of that dataset is relatively stronger. PINSPlus fails whenever the data structure is complex, whereas CIMLR does not work properly when the number of samples per cluster varies significantly. MoCluster drops sharply on iNMF_data because its strong hypothesis, which assumes the noise has the same variance across variables, fails. RGCCA performs poorly, possibly owing to the many parameters that need tuning. Although lfmmdVAE and efmmdVAE work best among the unsupervised deep learning models for integrative cancer data analysis, there remains a large gap between them and the non-deep methods. The experiment selected the first 20 features that differ most and show significant enrichment in each cluster (FDR < 0.05). FIGS. 1C, F and I depict heat maps of the selected features of the first data modality of the three simulated datasets; these features show significant differences between clusters.
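For reference, the ARI computation is a one-liner with scikit-learn; the labels below are dummies for illustration only.

    from sklearn.metrics import adjusted_rand_score

    truth = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth classes
    pred = [0, 0, 1, 2, 2, 2]    # hypothetical clustering result
    print(adjusted_rand_score(truth, pred))   # 1.0 only for a perfect match, up to relabeling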
In addition, this experiment tested the ability of other methods in cluster number evaluation, including the rotation cost value (RCV) of SNF, the cluster prediction index (CPI) of intNMF, the percentage of variation explained (PVE) of PintMF, and the separation cost value (SCV) of CIMLR. The results of testing these four values by varying k from 2 to 10 on the three sets of simulated data are shown in FIG. 2. SNF prefers the cluster number with the lowest RCV and selects 4, 4 for the InterSIM_data, iNMF_data and crimix_data datasets, consistent with the truth labels. intNMF prefers the cluster number that maximizes CPI, selecting 4, 2 respectively. PVE prefers the cluster number at which the value starts to plateau, taking 4, 4 as the best cluster number for each group. SCV is considered optimal where it exhibits its maximum drop over the candidate cluster numbers, selecting 4, 10 as the optimal cluster numbers. According to these results, the M3 value proposed by the invention can estimate the optimal cluster number on multiple simulated datasets.
The experiment further plots the selected features of the second and third modalities of the simulated datasets, as shown in FIG. 3. Visualizing the selected features of each cluster demonstrates the effectiveness of the feature selection method used by the invention.
Experiment verification 2: to facilitate comparison with other methods, this experiment analyzed mRNA (ID: tcga. BRCA. Samplemap/HiSeqV 2) data, miRNA (ID: tcga. BRCA. Samplemap/mirna_hiseq_gene) data, and DNA methylation data (two platform IDs: tcgap. BRCA. Amplimemap/HumanMethylation 27 and tcga. Bbrca. Samplemap/HumanMethylation 450) of breast invasive cancer (breast invasive cancer, BRCA). Specifically, the dataset included 1215 samples with 20530 mRNA features, 853 samples with 1046 miRNA features, 872 samples with 485577 DNA methylation features (human methylation 450) and 345 samples with 27578 DNA methylation features. Through data filtering, there were 826 samples with both mRNA, miRNA and DNA methylation characteristics, and this experiment retained 20073 DNA methylation characteristics shared between the two humanmethyl 450 and humanmethyl 27 platforms.
The experiment varied the cluster number k between 2 and 10 to estimate the optimal cluster number of the BRCA data based on the M3 value, RCV, CPI, PVE and SCV. The M3 results are shown in FIG. 4A, which gives the trends of the modularity values of the mRNA, miRNA and DNA methylation modalities and of the M3 value. As seen in FIG. 4, the M3 value selects 5 as the optimal cluster number. The cluster number evaluation results of the other four values are shown in FIGS. 4B-E; they select 4, 2, 5 and 3 as the optimal cluster numbers respectively.
Based on the performance of the cluster number evaluation methods, the experiment used all methods to classify the BRCA data into 5 clusters, and further tested SNF, intNMF, PintMF and CIMLR with their own preferred cluster numbers. FIG. 4F shows the -log10-transformed p-values of the Cox log-rank test for each method; the p-value of the M3JF method is lower than those of the other methods. Kaplan-Meier survival curves of the BRCA subtyping results were then plotted for each method: the survival curves of M3JF are shown in FIG. 5A, and those of the other methods, including at other cluster numbers, are shown in FIG. 6. Comparison of the Kaplan-Meier survival curves shows that the subtyping result of M3JF discriminates better between subtypes than the other methods.
In addition, the consistency index of each method was computed (M3JF 0.671, LRAcluster 0.543, PINSPlus 0.527, SNF 0.588, CIMLR 0.607, MoCluster 0.586, intNMF 0.475, PintMF 0.578, lfmmdVAE 0.545, efmmdVAE 0.496, together with SNF at 4 subtypes, intNMF at 2 subtypes and CIMLR at 0.519). The higher consistency index of M3JF compared with the other methods means that M3JF captures a better intrinsic data structure, whose sample-pair predictions are more consistent with the truth labels. By projecting the high-dimensional features of the samples in each subtype into two-dimensional space using t-SNE, the clustering result of M3JF is further visualized; as shown in FIG. 5B, samples of different subtypes are clearly separated under the M3JF classification.
The experiment selected the first 20 features that showed significant enrichment in each cluster (FDR < 0.05) and differed most. FIG. 5C plots the heat maps of the selected features of the mRNA, miRNA and DNA methylation data, totaling 186 mRNA features, 127 miRNA features and 104 DNA methylation features, which exhibit significant differences between subtypes.
All data used in the invention are downloaded from TCGA and are available from the UCSC Cancer Browser (https://genome-cancer.ucsc.edu).
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A data subtype classification method with sparse constraint multi-modal matrix joint decomposition is characterized by comprising the following steps:
finding a shared latent representation E among the multi-modal data and performing cluster analysis on its basis, wherein X = {X_1, X_2, …, X_N} is a set of observations of n samples in N modalities and X_i ∈ R^{n×m_i} consists of the feature vectors of the samples in the i-th modality, i = 1, 2, …, N; finding a set of modality-specific basis matrices H = {H_1, H_2, …, H_N}, wherein H_i ∈ R^{k×m_i} is the basis matrix of the i-th modality, so that the data X_i of the i-th modality are reconstructed by E H_i, wherein k is the number of clusters and m_i is the feature dimension of the samples in the i-th modality;
wherein each data modality is treated as a group and, on this basis, a group sparse constraint is applied to the basis matrix set {H_i} by defining an objective function and solving the objective function;
the objective function is:
J = Σ_{i=1}^{N} ||X_i − E H_i||_F^2 + λ Σ_{j=1}^{k} Σ_{i=1}^{N} ||H_i^{j,·}||_∞
wherein ||·||_F denotes the Frobenius norm and the second term is the group sparse constraint; j is an intermediate variable denoting the row index of the matrix, with a value range of 1 to k;
in the solving process, the observation data set {X_1, …, X_N} of n samples in N modalities, the coefficient λ of the group sparse constraint term, the model termination condition θ, and the cluster number k are input;
the specific solving process of the objective function is as follows:
Step S1: initialize the inter-modality shared submatrix E and the modality-specific basis matrix set {H_i} by singular value decomposition; the specific steps are S11-S16:
Step S11: set the initial value of i to 1 and initialize E as an n×k all-zero matrix;
Step S12: decompose X_i by SVD as X_i = u_i d_i v_i;
Step S13: let H_i = v_i^{1:k,·}, i.e., the first k rows of v_i;
Step S14: assign the result of E + (u_i d_i)_{·,1:k} to E;
Step S15: if i is less than N, assign i+1 to i and jump to step S12; otherwise jump to step S16;
Step S16: assign the result of E/N to E;
Step S2: initialize the other related variables: set the initial value of the change magnitude delta of the objective function loss to 1, the previous-iteration loss pre_loss to 0, and the current-iteration loss this_loss to 0;
Step S3: fix {H_i} unchanged and update E; the specific steps are S31-S36:
Step S31: set the initial value of i1 to 1, i1 = 1, 2, …, N; initialize the intermediate quantity XH as an n×k all-zero matrix and the intermediate quantity HH as a k×k all-zero matrix;
Step S32: assign the result of XH + X_{i1}(H_{i1})^T to XH, where the superscript T denotes the transpose;
Step S33: assign the result of HH + H_{i1}(H_{i1})^T to HH;
Step S34: if i1 is less than N, assign i1+1 to i1 and jump to step S32; otherwise jump to step S35;
Step S35: compute the matrix inverse revH = (HH)^{-1};
Step S36: assign the result of XH × revH to E;
Step S4: fix E unchanged and update {H_i}; the specific steps are S41-S421:
Step S41: set the initial value of i2 to 1, i2 = 1, 2, …, N;
Step S42: set the initial value of j to 1;
Step S43: set the initial value of l to 1, l = 1, 2, …, k; set the intermediate quantity R_{i2}^j = X_{i2};
Step S44: if l ≠ j, assign the result of R_{i2}^j − E_{·,l} H_{i2}^{l,·} to R_{i2}^j and go to step S45; otherwise jump directly to step S45;
Step S45: if l is less than k, assign l+1 to l and jump to step S44; otherwise jump to step S46;
Step S46: let the intermediate vector V = (E_{·,j})^T R_{i2}^j / λ, and let M = |V| denote the length of the vector V;
Step S47: take the absolute value of V and sort it in descending order to obtain the vector V1 = sort(abs(V));
Step S48: set the initial value of m to 1, m = 1, 2, …, M, and set the initial value of the intermediate variable count to 0;
Step S49: set the initial value of p to 1, p = 1, 2, …, m, and set the initial value of the intermediate variable S1 to 0;
Step S410: assign the result of S1 + V1_p to S1, where V1_p is the p-th element of the vector V1;
Step S411: if p is less than m, assign p+1 to p and jump to step S410; otherwise jump to step S412;
Step S412: if (S1−1)/m is less than V1_m, assign the value of m to count and go to step S413; otherwise jump directly to step S413;
Step S413: if m is less than M, assign m+1 to m and jump to step S49; otherwise jump to step S414;
Step S414: if count is 0, set the vector V2 to a zero vector of the same length as V1 and jump to step S419; otherwise jump to step S415;
Step S415: let the intermediate quantity τ = (Σ_{o=1}^{count} V1_o − 1)/count, where V1_o denotes the o-th element of the vector V1, o = 1, 2, …, count;
Step S416: set the initial value of m1 to 1 and set the vector V2 to be identical to V1; m1 = 1, 2, …, M;
Step S417: if V2_{m1} ≥ τ, let V2_{m1} = τ; if V2_{m1} ≤ −τ, let V2_{m1} = −τ; otherwise jump directly to step S418; V2_{m1} is the m1-th element of the vector V2;
Step S418: if m1 is less than M, assign m1+1 to m1 and jump to step S417; otherwise jump to step S419;
Step S419: assign the value of V2 to (H_{i2})_{j,·};
Step S420: if j is less than k, assign j+1 to j and jump to step S43; otherwise jump to step S421;
Step S421: if i2 is less than N, assign i2+1 to i2 and jump to step S42; otherwise jump to step S5;
Step S5: compute the loss this_loss of the current objective function;
Step S6: compute the change magnitude of the objective function loss, delta = abs(this_loss − pre_loss)/pre_loss;
Step S7: assign the value of this_loss to pre_loss;
Step S8: if delta ≥ θ, jump to step S3; otherwise terminate the computation to obtain the submatrix E and the set {H_i}, and cluster the submatrix E with a hierarchical clustering method to obtain the final subtype classification result.
2. The method of claim 1, wherein the modularity is extended to a weighted modularity:
Q = (1/(2D)) Σ_{v,w} [A_{v,w} − (k_v k_w)/(2D)] δ(c_v, c_w) = Σ_{r=1}^{c} (e_{rr} − a_r^2)
wherein v and w are any two nodes in the network; δ(c_v, c_w) determines whether nodes v and w are in the same community: δ(c_v, c_w) = 1 if they are, otherwise δ(c_v, c_w) = 0; k_v and k_w denote the weights of node v and node w respectively; c denotes the total number of communities; e_{r,s} denotes the edge weight between nodes in community r and nodes in community s, and e_{rr} denotes the ratio of all edges within community r to all edges of the entire network; a_r denotes the ratio of the degree of the nodes in community r to the degree of the entire network; A_{v,w} denotes the weight between nodes v and w; and D denotes the sum of the weights of all edges in the network;
let the cluster number be k and let Q_i^k be the modularity value computed from the data of the i-th modality; the inter-modality modularity mean for cluster number k is M3_k = (1/N) Σ_{i=1}^{N} Q_i^k, computed specifically as follows:
SS1, using the similarity calculation method in the R package SNFtool, computing the similarity network set $\{S_{1}, S_{2}, \ldots, S_{N}\}$ from the N modes of data;
SS2, obtaining the clustering result $C_{k}$ when the number of clusters is k;
SS3, setting the initial value of i to 1;
SS4, setting $E_{i} \in R^{k \times k}$ to an all-zero matrix and setting the initial value of $D_{i}$ to 0; computing the sum of the weights of all edges in network $S_{i}$, $D_{i} = \sum_{t=1}^{n}\sum_{t1=1}^{n} S_{i}^{t,t1}$, where $S_{i}^{t,t1}$ is the element in row t, column t1 of network $S_{i}$; t=1,2,...,n, t1=1,2,...,n;
SS5, setting the initial value of r to 1; r=1,2,...,k;
SS6, obtaining the index set $index_{r}$ of the samples in $C_{k}$ belonging to class r;
SS7, setting the initial value of s to 1; s=1,2,...,k;
SS8, obtaining the index set $index_{s}$ of the samples in $C_{k}$ belonging to class s;
SS9, letting $(E_{i})_{r,s} = \frac{1}{D_{i}}\sum_{t \in index_{r}}\sum_{t1 \in index_{s}} S_{i}^{t,t1}$;
SS10, if s is less than k, assigning the value of s+1 to s and jumping to SS8; otherwise jumping to SS11;
SS11, assigning $\sum_{s=1}^{k}(E_{i})_{r,s}$ to $a_{r}^{(i)}$;
SS12, if r is less than k, assigning the value of r+1 to r and jumping to SS6; otherwise jumping to SS13;
SS13, if i is less than N, assigning the value of i+1 to i and jumping to SS4; otherwise jumping to SS14;
SS14, letting $Q_{i}^{k} = \sum_{r=1}^{k}\left((E_{i})_{r,r} - (a_{r}^{(i)})^{2}\right)$ for each mode i, and letting $\bar{Q}^{k} = \frac{1}{N}\sum_{i=1}^{N} Q_{i}^{k}$.
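As a concrete reading of steps SS1 to SS14, the sketch below computes the weighted modularity $Q_{i}^{k}$ of one similarity network from a flat cluster assignment and then averages across modes; the helper name weighted_modularity and the random stand-in networks are illustrative, and in the method itself the networks would come from the SNFtool similarity computation of step SS1.

```python
import numpy as np

def weighted_modularity(S, labels, k):
    """Weighted modularity Q = sum_r (e_rr - a_r^2) of one network S.

    S      : (n, n) symmetric non-negative similarity (weight) matrix
    labels : length-n integer cluster assignment in {0, ..., k-1}
    k      : number of clusters
    """
    D = S.sum()                                   # total edge weight, step SS4
    E = np.zeros((k, k))                          # fraction-of-weight matrix E_i
    for r in range(k):
        idx_r = np.where(labels == r)[0]          # index_r of step SS6
        for s in range(k):
            idx_s = np.where(labels == s)[0]      # index_s of step SS8
            E[r, s] = S[np.ix_(idx_r, idx_s)].sum() / D   # step SS9
    a = E.sum(axis=1)                             # a_r, step SS11
    return float(np.sum(np.diag(E) - a ** 2))     # Q_i^k, step SS14

# Inter-modal mean over N random stand-in networks.
rng = np.random.default_rng(0)
nets = [(M + M.T) / 2 for M in (rng.random((20, 20)) for _ in range(3))]
labels = rng.integers(0, 4, size=20)
Q_mean = np.mean([weighted_modularity(S, labels, 4) for S in nets])
```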
3. The method of any one of claims 1-2, wherein, for each feature, a sample whose standard score (z-score) for that feature satisfies z-score ≥ 1 is coded as 1 and the remaining samples are coded as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for over-expression of the gene; a sample with z-score ≤ -1 is coded as 1 and the remaining samples are coded as 0, and a hypergeometric distribution test is used to evaluate whether each class is significantly enriched for under-expression of the gene; after FDR correction of the hypergeometric test results for the two cases z-score ≥ 1 and z-score ≤ -1, the genes with p-value < 0.05 are selected as significantly enriched features.
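A minimal sketch of the enrichment test in this claim, assuming SciPy's hypergeometric distribution and the Benjamini-Hochberg FDR correction from statsmodels; the function name enrich_pvalue and the toy data are illustrative and not prescribed by the claim.

```python
import numpy as np
from scipy.stats import hypergeom
from statsmodels.stats.multitest import multipletests

def enrich_pvalue(flagged, in_class):
    """Hypergeometric test: is the flagged set over-represented in the class?

    flagged  : boolean array, True where z-score >= 1 (or <= -1)
    in_class : boolean array, True for samples in the class under test
    """
    N = flagged.size                              # population size
    K = int(flagged.sum())                        # flagged samples overall
    n = int(in_class.sum())                       # class size
    x = int((flagged & in_class).sum())           # flagged samples in the class
    return hypergeom.sf(x - 1, N, K, n)           # P(X >= x)

# Toy example: one gene binarized at z-score >= 1, tested against one class.
rng = np.random.default_rng(1)
z = rng.standard_normal(100)
p = enrich_pvalue(z >= 1, rng.integers(0, 2, size=100).astype(bool))

# FDR-correct a batch of p-values and keep genes with corrected p < 0.05.
pvals = rng.uniform(size=50)
keep = multipletests(pvals, alpha=0.05, method="fdr_bh")[0]
```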
4. The method according to claim 3, characterized in that the selected genes are ranked according to the standard deviation of their gene expression, and only the top-ranked genes are selected as the cluster-related features of the category.
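A one-step sketch of the ranking in this claim, assuming a genes-by-samples expression matrix; the cutoff top_n is an illustrative parameter that the claim leaves open.

```python
import numpy as np

def top_variable_genes(expr, top_n=50):
    """Rank genes (rows) by the standard deviation of their expression
    across samples (columns) and return the indices of the top_n genes."""
    stds = expr.std(axis=1)                       # per-gene standard deviation
    return np.argsort(stds)[::-1][:top_n]         # descending rank, top slice
```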
5. The method according to any one of claims 1-2 or 4, wherein, for methylation data, a gene is considered highly methylated when the methylation detection value β is ≥ 0.25 and non-methylated when β < 0.25; for each cluster, genes enriched for high or low methylation are selected using the same criteria as for the expression data; for binary mutation data, genes enriched for mutations are selected using the same criteria as for the expression data.
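For the methylation rule of this claim, a small sketch of the β-value binarization, assuming β values in [0, 1]; the function name is illustrative.

```python
import numpy as np

def binarize_methylation(beta, threshold=0.25):
    """Code genes as highly methylated (1) when beta >= threshold, else 0;
    the resulting 0/1 vector feeds the same hypergeometric enrichment
    test used for the expression data."""
    return (np.asarray(beta) >= threshold).astype(int)

# Example: binarize a few beta values at the claim's 0.25 threshold.
flags = binarize_methylation([0.05, 0.30, 0.24, 0.80])   # -> [0, 1, 0, 1]
```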
CN202310104611.XA 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition Active CN116246712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310104611.XA CN116246712B (en) 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Publications (2)

Publication Number Publication Date
CN116246712A CN116246712A (en) 2023-06-09
CN116246712B true CN116246712B (en) 2024-03-26

Family

ID=86625634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310104611.XA Active CN116246712B (en) 2023-02-13 2023-02-13 Data subtype classification method with sparse constraint multi-mode matrix joint decomposition

Country Status (1)

Country Link
CN (1) CN116246712B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017128799A1 (en) * 2016-01-27 2017-08-03 深圳大学 Hyperspectral remote sensing image classification method and system based on three-dimensional gabor feature selection
WO2018009887A1 (en) * 2016-07-08 2018-01-11 University Of Hawaii Joint analysis of multiple high-dimensional data using sparse matrix approximations of rank-1
CN109411019A (en) * 2018-12-12 2019-03-01 中国人民解放军军事科学院军事医学研究院 A kind of drug prediction technique, device, server and storage medium
CN109670418A (en) * 2018-12-04 2019-04-23 厦门理工学院 In conjunction with the unsupervised object identification method of multi-source feature learning and group sparse constraint
CN109670543A (en) * 2018-12-12 2019-04-23 中国人民解放军军事科学院军事医学研究院 A kind of data fusion method and device
CN110633732A (en) * 2019-08-15 2019-12-31 电子科技大学 Multi-modal image recognition method based on low-rank and joint sparsity
CN110826635A (en) * 2019-11-12 2020-02-21 曲阜师范大学 Sample clustering and feature identification method based on integration non-negative matrix factorization

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zahir Noorie; Fatemeh Afsari. Regularized sparse feature selection with constraints embedded in graph Laplacian matrix. IEEE. 2018, full text. *
Joint dynamic sparse representation classification algorithm for multi-observation samples based on low-rank decomposition; Hu Zhengping; Gao Hongxiao; Zhao Shuhuan; Acta Electronica Sinica (03); full text *
Analysis of breast cancer microarray expression data based on an improved sparse non-negative matrix factorization method; Kong Wei; Wang Juan; Mou Xiaoyang; Journal of Anhui Medical University (07); full text *

Similar Documents

Publication Publication Date Title
Deb et al. Reliable classification of two-class cancer data using evolutionary algorithms
Brazma et al. Gene expression data analysis
Levine et al. Resampling method for unsupervised estimation of cluster validity
CN109326316B (en) Multilayer network model construction method and application of interaction of cancer-related SNP, gene, miRNA and protein
Frigyesi et al. Independent component analysis reveals new and biologically significant structures in micro array data
Li et al. Gene selection using genetic algorithm and support vectors machines
Pensa et al. Assessment of discretization techniques for relevant pattern discovery from gene expression data
CN112466404B (en) Metagenome contig unsupervised clustering method and system
CN112232413A (en) High-dimensional data feature selection method based on graph neural network and spectral clustering
Montserrat et al. Lai-net: Local-ancestry inference with neural networks
Park et al. Evolutionary fuzzy clustering algorithm with knowledge-based evaluation and applications for gene expression profiling
US20090043718A1 (en) Evolutionary hypernetwork classifiers for microarray data analysis
CN116246712B (en) Data subtype classification method with sparse constraint multi-mode matrix joint decomposition
CN115662504A (en) Multi-angle fusion-based biological omics data analysis method
KR102376212B1 (en) Gene expression marker screening method using neural network based on gene selection algorithm
Tasoulis et al. Unsupervised clustering of bioinformatics data
Oh et al. Hybrid clustering of single-cell gene expression and spatial information via integrated NMF and k-means
Ye et al. Interactive gene identification for cancer subtyping based on multi-omics clustering
CN112768001A (en) Single cell trajectory inference method based on manifold learning and main curve
CN108182347B (en) Large-scale cross-platform gene expression data classification method
CN111739581A (en) Comprehensive screening method for genome variables
Zhang et al. Differential function analysis: identifying structure and activation variations in dysregulated pathways
Tang et al. Mining multiple phenotype structures underlying gene expression profiles
Brenerman et al. Random Forest Factorization Reveals Latent Structure in Single Cell RNA Sequencing Data
Wang et al. Clustering analysis of microarray gene expression data by splitting algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant