CN113537358B

CN113537358B - Cancer subtype identification method and system based on multi-omics data sets

Info

Publication number: CN113537358B
Application number: CN202110813430.5A
Authority: CN
Inventors: 蔡宏民; 阿里
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2021-07-19
Filing date: 2021-07-19
Publication date: 2023-09-01
Anticipated expiration: 2041-07-19
Also published as: CN113537358A

Abstract

The invention discloses a cancer subtype identification method and system based on multi-omics data. The method comprises the following steps: acquiring sample data of each patient; performing dimension reduction on the sample data by principal component analysis; constructing similarity graphs based on the dimension-reduced data, each similarity graph representing the similarity between patients; projecting each similarity graph into a low-dimensional subspace; merging the subspaces on a Grassmann manifold; and identifying the cancer subtypes by a k-means clustering algorithm based on the merged subspace. The invention combines multi-omics molecular data (mRNA, microRNA and methylation), clinical data and pathway information to identify patient groups with different biological characteristics and different prognoses, thereby enabling rapid and accurate identification of cancer subtypes.

Description

Cancer subtype identification method and system based on multi-omics data sets

Technical Field

The invention relates to the technical field of cancer subtype identification, and in particular to a method and a system for identifying cancer subtypes based on multi-omics data sets.

Background

Most previous studies identified cancer subtypes from a single data type and rarely relied on integrative analysis. Integrative analysis is defined as the use of datasets from multiple sources to better understand a system. Although a great deal of research has been based on single-omics data, most of the etiology of complex traits remains unexplained. Single-omics data do not allow comprehensive observation of a biological system and perform poorly in identifying new subtypes.

Disclosure of Invention

The invention aims to provide a method and a system for identifying cancer subtypes based on multi-omics data sets, so that cancer subtypes can be identified quickly and accurately.

In order to achieve the above object, the present invention provides the following solutions:

a cancer subtype identification method based on multi-omics data sets, comprising:

acquiring sample data of each patient;

performing dimension reduction treatment on the sample data by adopting a principal component analysis method;

constructing a similarity graph based on the dimension-reduced data; the similarity graph is used for representing the similarity between patients;

projecting each similarity graph into a low-dimensional subspace;

merging the subspaces on a Grassmann manifold;

based on the combined subspaces, the cancer subtypes are identified through a k-means clustering algorithm.

Optionally, the sample data comprises gene expression, miRNA expression, and DNA methylation.

Optionally, the expression of the similarity graph is as follows:

$G^{(m)} = \{V^{(m)}, E^{(m)}\}$

wherein $G^{(m)}$ represents the m-th similarity graph, the nodes $V^{(m)}$ represent the patients, and the edges $E^{(m)}$ represent the connections between patients.

Optionally, after constructing the similarity graph based on the dimension-reduced data, the method further comprises:

calculating a similarity matrix of the similarity graph;

and, according to the similarity matrix, preserving the local structure of each similarity graph by a k-nearest neighbor algorithm.

The invention also provides a cancer subtype identification system based on multi-omics data sets, which comprises:

a sample acquisition module for acquiring sample data of each patient;

the dimension reduction module is used for carrying out dimension reduction processing on the sample data by adopting a principal component analysis method;

the similarity graph construction module is used for constructing a similarity graph based on the dimension-reduced data; the similarity graph is used for representing the similarity between patients;

the projection module is used for projecting each similarity graph into a low-dimensional subspace;

a merging module for merging the subspaces on the Grassmann manifold;

and the identification module is used for identifying the cancer subtype through a k-means clustering algorithm based on the combined subspaces.

Optionally, the expression of the similarity graph is as follows:

$G^{(m)} = \{V^{(m)}, E^{(m)}\}$

Optionally, the system further comprises:

the calculation module is used for calculating a similarity matrix of the similarity graph;

and the preservation module is used for preserving the local structure of each similarity graph by adopting a k-nearest neighbor algorithm according to the similarity matrix.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention provides a cancer subtype identification method based on multi-omics data sets, which comprises the following steps: acquiring sample data of each patient; performing dimension reduction on the sample data by principal component analysis; constructing similarity graphs based on the dimension-reduced data, each similarity graph representing the similarity between patients; projecting each similarity graph into a low-dimensional subspace; merging the subspaces on a Grassmann manifold; and identifying the cancer subtypes by a k-means clustering algorithm based on the merged subspace. The invention combines multi-omics molecular data (mRNA, microRNA and methylation), clinical data and pathway information to identify patient groups with different biological characteristics and different prognoses, thereby enabling rapid and accurate identification of cancer subtypes.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a cancer subtype identification method based on multi-omics data sets according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a cancer subtype identification method based on multi-omics data sets according to an embodiment of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.

As shown in FIGS. 1-2, the invention discloses a cancer subtype identification method based on multi-omics data sets, which comprises the following steps:

Step 101: sample data of each patient is acquired. The sample data includes gene expression, miRNA expression, and DNA methylation.

Step 102: dimension reduction is performed on the sample data by principal component analysis.

Step 103: constructing a similarity graph based on the dimension-reduced data; the similarity graph is used to represent the similarity between patients.

The expression of the similarity graph is as follows:

$G^{(m)} = \{V^{(m)}, E^{(m)}\}$

wherein $G^{(m)}$ represents the m-th similarity graph, the nodes $V^{(m)}$ represent the patients, and the edges $E^{(m)}$ represent the connections between patients.

Step 104: each similarity graph is projected into a low-dimensional subspace.

Step 105: the subspaces are merged on a Grassmann manifold.

Step 106: based on the combined subspaces, the cancer subtypes are identified through a k-means clustering algorithm.

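After step 103, the method further comprises:

calculating a similarity matrix of the similarity graph;

and, according to the similarity matrix, preserving the local structure of each similarity graph by a k-nearest neighbor algorithm.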

Specific examples are as follows:

(1) The data used in the present invention are downloaded from the TCGA website and include BIC (breast invasive carcinoma), COAD (colon adenocarcinoma), KRCCC (kidney renal clear cell carcinoma), GBM (glioblastoma multiforme) and LSCC (lung squamous cell carcinoma). Each cancer contains three data types (DNA methylation, gene expression, and miRNA expression).

(2) The present invention uses the widely adopted principal component analysis (PCA) technique for dimension reduction. PCA is performed on each data type separately, with the data of the m-th type arranged as a matrix $Z^{(m)}$. The goal is to find the projection that maximizes the variance of all samples, which can be expressed as:

$$\max_{W} \ \mathrm{tr}\left(W^{T} Z^{(m)} Z^{(m)T} W\right) \quad \text{s.t.} \quad W^{T} W = I \qquad (2)$$

The matrix $W = [w_1, w_2, \ldots, w_k]$ is an orthonormal basis of the low-dimensional space. Clearly, the solution of equation (2) is given by the top k eigenvectors of $Z^{(m)} Z^{(m)T}$. Let $\lambda_1 \ge \lambda_2 \ge \ldots \ge 0$ be the eigenvalues of $Z^{(m)} Z^{(m)T}$, where the eigenvector corresponding to $\lambda_i$ is $w_i$. Thus, the final result of PCA is calculated as $H^{(m)T} = W^{T} Z^{(m)}$.
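As an illustrative sketch of this step (not the implementation of the invention), the projection can be computed with NumPy through an eigendecomposition. The sketch assumes that the data matrix is stored as features × patients and that each feature is centered first; the function name `pca_project` and the toy data are hypothetical.

```python
import numpy as np

def pca_project(Z, k):
    """Project Z (features x patients) onto its top-k principal directions.

    Assumption: each feature (row) of Z is centered before the decomposition.
    Returns H with shape (patients, k), so that H.T == W.T @ Zc.
    """
    Zc = Z - Z.mean(axis=1, keepdims=True)      # center every feature
    evals, evecs = np.linalg.eigh(Zc @ Zc.T)    # eigenvalues in ascending order
    W = evecs[:, np.argsort(evals)[::-1][:k]]   # top-k eigenvectors as columns
    return (W.T @ Zc).T

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 20))    # toy data: 6 features, 20 patients
H = pca_project(Z, k=2)
```

For real omics matrices with many more features than patients, an SVD of the centered matrix would avoid forming the large features × features product; the eigendecomposition is kept here only because it mirrors the description.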

(3) The present invention builds a patient-to-patient graph in the PCA space to model the specific structure within each view. For the m-th graph $G^{(m)} = \{V^{(m)}, E^{(m)}\}$, the nodes $V^{(m)}$ represent the patients in that space and the edges $E^{(m)}$ represent the connections between these patients. The present invention first calculates the similarity matrix $W^{(m)}$ of the graph $G^{(m)}$. Each element $W^{(m)}_{ij}$ measures the similarity between patients i and j and is calculated by a formula in which the parameter t is a normalization factor. The higher the value of $W^{(m)}_{ij}$, the more similar the two patients are.
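The similarity formula itself is not reproduced in this text. A common choice with a single normalization parameter t is the Gaussian (heat) kernel, and the sketch below uses it purely as an assumption; the function name `similarity_matrix` and the toy coordinates are hypothetical.

```python
import numpy as np

def similarity_matrix(H, t=1.0):
    """Pairwise patient similarity from H (patients x k).

    Assumption: a Gaussian (heat) kernel exp(-||h_i - h_j||^2 / t); the source
    only states that t is a normalization factor.
    """
    sq_norms = (H ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * H @ H.T
    sq_dists = np.maximum(sq_dists, 0.0)   # clip tiny negative round-off
    return np.exp(-sq_dists / t)

H = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 4.0]])
W = similarity_matrix(H, t=1.0)
```

With this kernel the first two patients, which lie close together, receive a similarity near 1, while the distant third patient receives a similarity near 0.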

Next, the present invention keeps only the k nearest neighbors (k-NN) of each patient to preserve the local structure of each graph:

$$\tilde{W}^{(m)}_{ij} = \begin{cases} W^{(m)}_{ij}, & j \in N_i \\ 0, & \text{otherwise} \end{cases}$$

wherein $N_i$ consists of the k nearest neighbors of patient i. The parameter k depends on the sample size. Since different omics have different structures, the k-NN graph $\tilde{W}^{(m)}$ is more representative than the original $W^{(m)}$.
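The neighbor-preserving step can be sketched as follows. The symmetrization with an element-wise maximum is an assumption added so that the result remains an undirected graph; the text does not state it. The function name `knn_sparsify` and the toy matrix are hypothetical.

```python
import numpy as np

def knn_sparsify(W, k):
    """Keep, for every patient i, only the k largest similarities W[i, j], j != i.

    Assumption: the result is symmetrized with an element-wise maximum so that
    it stays a valid undirected graph; the source does not state this step.
    """
    n = W.shape[0]
    S = W.astype(float).copy()
    np.fill_diagonal(S, -np.inf)                 # never pick a patient as its own neighbor
    neighbors = np.argsort(-S, axis=1)[:, :k]    # indices of the k most similar patients
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), k), neighbors.ravel()] = True
    W_knn = np.where(mask, W, 0.0)
    return np.maximum(W_knn, W_knn.T)

W = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
W_knn = knn_sparsify(W, k=1)
```

In the toy matrix, only the two strong links (patients 0 and 1, patients 2 and 3) survive with k = 1, which is the local structure the step is meant to keep.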

(4) To further extract the key features of each omics, the present invention projects all the graphs into low-dimensional subspaces and obtains their embeddings in these spaces.

The invention first calculates the normalized graph Laplacian matrix $L^{(m)}$, defined as $L^{(m)} = I - \left(D^{(m)}\right)^{-1/2} \tilde{W}^{(m)} \left(D^{(m)}\right)^{-1/2}$, wherein $D^{(m)}$ is the degree matrix of the k-NN similarity matrix $\tilde{W}^{(m)}$, calculated by $D^{(m)}_{ii} = \sum_j \tilde{W}^{(m)}_{ij}$. Using the learned Laplacian matrix, the embedding $U^{(m)}$ can be calculated by solving the associated eigenvalue problem, following the spectral clustering method:

$$\min_{U^{(m)}} \ \mathrm{tr}\left(U^{(m)T} L^{(m)} U^{(m)}\right) \quad \text{s.t.} \quad U^{(m)T} U^{(m)} = I \qquad (4)$$

The solution of equation (4) is given by the k eigenvectors of the normalized Laplacian matrix $L^{(m)}$ with the smallest eigenvalues. Since each embedding is a basis of its space, the omics are more comparable through their embeddings than through the original graphs.
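A minimal sketch of this embedding step is given below. It assumes the symmetric normalization $I - D^{-1/2} W D^{-1/2}$ of the Laplacian, which is the form commonly used in spectral clustering; the function name `spectral_embedding` and the toy graph are hypothetical.

```python
import numpy as np

def spectral_embedding(W_knn, k):
    """Return the k eigenvectors of the normalized Laplacian with smallest eigenvalues.

    Assumption: the symmetric normalization L = I - D^{-1/2} W D^{-1/2} is used.
    """
    degrees = W_knn.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))   # guard isolated nodes
    L = np.eye(W_knn.shape[0]) - inv_sqrt[:, None] * W_knn * inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)     # ascending eigenvalues, orthonormal columns
    return evecs[:, :k]

# Toy graph with two disconnected pairs of patients.
W_knn = np.array([[0.0, 0.9, 0.0, 0.0],
                  [0.9, 0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.8],
                  [0.0, 0.0, 0.8, 0.0]])
U = spectral_embedding(W_knn, k=2)
```

Because `numpy.linalg.eigh` returns orthonormal eigenvectors, the columns of `U` form an orthonormal basis, which is what allows each embedding to be treated as a point on a Grassmann manifold in the next step.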

(5) For the M omics embeddings, minimizing the Euclidean distance between the integrated embedding and each of them is a natural way to obtain a fused representation.

However, this approach assumes that similar patients are close in Euclidean space, which is often not the case. Multi-omics data are complex and heterogeneous, so it is more suitable to measure their distance on a manifold, such as the Grassmann manifold, than in Euclidean space.

The Grassmann manifold G(k, n) is the set of k-dimensional linear subspaces of an n-dimensional space. Mathematically, each point of G(k, n) is represented by an orthonormal basis Y that spans a k-dimensional space span(Y). Thus, the distance between the spaces span(Y) and span(Y') can be defined in terms of the principal angles of all basis pairs:

$$d^2(Y, Y') = \sum_{i=1}^{k} \sin^2\theta_i = k - \sum_{i=1}^{k} \cos^2\theta_i$$

wherein $\theta_i$ is the principal angle between the basis vectors $Y_i$ and $Y'_i$.
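The principal angles and the resulting projection distance can be checked numerically. The sketch below relies on the standard fact that the singular values of $Y_1^T Y_2$ are the cosines of the principal angles when both bases are orthonormal; the function names and the toy subspaces are hypothetical.

```python
import numpy as np

def principal_angles(Y1, Y2):
    """Principal angles between span(Y1) and span(Y2), both with orthonormal columns."""
    cosines = np.linalg.svd(Y1.T @ Y2, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

def projection_distance_sq(Y1, Y2):
    """Squared projection distance: sum of sin^2 of the principal angles."""
    k = Y1.shape[1]
    return k - np.trace(Y1 @ Y1.T @ Y2 @ Y2.T)

# Two planes in R^3 sharing the x-axis, tilted by 30 degrees.
a = np.pi / 6
Y1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Y2 = np.array([[1.0, 0.0], [0.0, np.cos(a)], [0.0, np.sin(a)]])
theta = principal_angles(Y1, Y2)
d2 = projection_distance_sq(Y1, Y2)
```

For these two planes the principal angles are 0 and 30 degrees, so the squared distance equals $\sin^2(30°) = 0.25$, whichever of the two functions is used to obtain it.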

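Based on this measure, the distance between the integrated embedding and the individual embeddings can be expressed as:

$$\sum_{m=1}^{M} d^2\left(U, U^{(m)}\right) = \sum_{m=1}^{M} \left[ k - \mathrm{tr}\left(U U^{T} U^{(m)} U^{(m)T}\right) \right]$$

Thus, the objective function, equation (8), is minimized over U subject to the constraint $U^{T} U = I$.

Equation (8) forces the integrated representation U to approach all embeddings $U^{(m)}$ in terms of projection distance on the Grassmann manifold. Its solution is given by the top k eigenvectors of the modified Laplacian matrix $L_{mod}$.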
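The trace term can be related to the principal angles by a short derivation. The second line below additionally assumes that the objective consists only of the summed projection distances; equation (8) and the definition of the modified Laplacian matrix are not reproduced in this text, so that line is a sketch under this assumption rather than the formulation of the invention.

```latex
\begin{aligned}
\mathrm{tr}\left(U U^{T} U^{(m)} U^{(m)T}\right)
  &= \left\| U^{T} U^{(m)} \right\|_F^2 = \sum_{i=1}^{k} \cos^2\theta_i^{(m)}, \\
\sum_{m=1}^{M} \Bigl[ k - \mathrm{tr}\left(U U^{T} U^{(m)} U^{(m)T}\right) \Bigr]
  &= kM - \mathrm{tr}\Bigl( U^{T} \Bigl( \sum_{m=1}^{M} U^{(m)} U^{(m)T} \Bigr) U \Bigr).
\end{aligned}
```

Under the stated assumption, minimizing the second line subject to $U^{T} U = I$ is equivalent to maximizing the trace term, which is attained by the k eigenvectors of $\sum_{m} U^{(m)} U^{(m)T}$ with the largest eigenvalues.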

Finally, the cluster labels are obtained by applying the k-means algorithm to the eigenvectors of $L_{mod}$.
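The merging and clustering steps can be sketched end to end on toy data. Since the exact objective and the modified Laplacian matrix are not reproduced in this text, the sketch assumes the simplest case in which only the summed projection distances are minimized, so the merged basis consists of the top k eigenvectors of the sum of the projection matrices $U^{(m)} U^{(m)T}$. The k-means routine is a plain Lloyd iteration with farthest-point initialization, and all function names are hypothetical.

```python
import numpy as np

def merge_subspaces(embeddings, k):
    """Top-k eigenvectors of sum_m U_m U_m^T (assumed form of the merged basis)."""
    S = sum(U @ U.T for U in embeddings)
    evals, evecs = np.linalg.eigh(S)          # ascending eigenvalues
    return evecs[:, ::-1][:, :k]

def kmeans(X, n_clusters, n_iter=100):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, n_clusters):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])         # farthest point from chosen centers
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def indicator_basis(groups):
    """Orthonormal basis whose columns are normalized group indicator vectors."""
    groups = np.asarray(groups)
    B = np.stack([(groups == g).astype(float) for g in np.unique(groups)], axis=1)
    return B / np.linalg.norm(B, axis=0)

# Toy example: two "omics" views that agree on two groups of three patients.
U1 = indicator_basis([0, 0, 0, 1, 1, 1])
U2 = indicator_basis([0, 0, 0, 1, 1, 1])
U = merge_subspaces([U1, U2], k=2)
labels = kmeans(U, n_clusters=2)
```

In this toy case the rows of the merged basis coincide within each group, so k-means separates the first three patients from the last three. On real data the per-view embeddings would come from the spectral step rather than from known group labels.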

To verify the effectiveness of this method, the present invention compares it with Similarity Network Fusion (SNF) and Grassmann clustering using Cox survival p-values; the results are shown in Table 1. For a fair comparison, the same number of subtypes is used for SNF and Grassmann clustering for each cancer. The method of the present invention shows significant differences in survival time between the different subtypes and obtains a smaller p-value than SNF in three of the five cancers studied.

Table 1. Log-rank test analysis of survival for five cancers

Cancer type (number of subtypes)	Grassmann clustering	SNF	Method of the invention
BIC (5)	2.0×10^-4	1.1×10^-3	4.3×10^-5
GBM (3)	4.3×10^-3	2.0×10^-4	2.3×10^-4
KRCCC (3)	2.8×10^-2	2.9×10^-2	1.4×10^-1
LSCC (4)	1.6×10^-2	2.0×10^-2	2.7×10^-3
COAD (3)	4.2×10^-2	2.0×10^-2	2.7×10^-3

a sample acquisition module for acquiring sample data of each patient;

a merging module for merging the subspaces on the Grassmann manifold;

In this specification, the embodiments are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another.

The principles and embodiments of the present invention have been described herein with reference to specific examples, which are intended only to assist in understanding the method of the present invention and its core ideas. A person of ordinary skill in the art may, in light of these teachings, modify the specific embodiments and the scope of application. In view of the foregoing, this description should not be construed as limiting the invention.

Claims

1. A cancer subtype identification method based on multi-omics data sets, comprising:

acquiring sample data of each patient;

projecting each similarity graph into a low-dimensional subspace;

merging the subspaces on a Grassmann manifold;

based on the merged subspace, identifying cancer subtypes through a k-means clustering algorithm;

wherein merging the subspaces on the Grassmann manifold specifically comprises:

the distance between the spaces span(Y) and span(Y') is defined in terms of the principal angles of all basis pairs:

$$d^2(Y, Y') = \sum_{i=1}^{k} \sin^2\theta_i = k - \sum_{i=1}^{k} \cos^2\theta_i$$

wherein $\theta_i$ is the principal angle between the basis vectors $Y_i$ and $Y'_i$, i denotes the index of the i-th basis pair between span(Y) and span(Y'), k denotes the total number of basis pairs between span(Y) and span(Y'), and $\sum_{i=1}^{k} \cos^2\theta_i$ denotes the sum of squares of the cosines of the principal angles of all basis pairs;

based on this measure, the distance between embeddings can be expressed as:

$$\sum_{m=1}^{M} d^2\left(U, U^{(m)}\right) = \sum_{m=1}^{M} \left[ k - \mathrm{tr}\left(U U^{T} U^{(m)} U^{(m)T}\right) \right]$$

wherein M represents the total number of omics, $d^2(U, U^{(m)})$ represents the Grassmann manifold distance between the common subspace span(U) and the subspace span($U^{(m)}$), U denotes the basis of the common subspace of all omics, $U^{(m)}$ denotes the basis of the m-th omics-specific subspace, and $\mathrm{tr}(U U^{T} U^{(m)} U^{(m)T})$ represents the sum of squares of the cosines of the principal angles of all basis pairs between the subspace span($U^{(m)}$) and the common subspace span(U);

thus, the objective function is:

s.t. $U^{T} U = I$

wherein I represents the identity matrix;

forcing the integrated representation U to approach all embeddings $U^{(m)}$ in terms of projection distance on the Grassmann manifold.

2. The method of claim 1, wherein the sample data comprises gene expression, miRNA expression, and DNA methylation.

3. The cancer subtype identification method based on multi-omics data sets according to claim 1, characterized in that the expression of the similarity graph is as follows:

$G^{(m)} = \{V^{(m)}, E^{(m)}\}$

wherein $G^{(m)}$ represents the m-th similarity graph, the nodes $V^{(m)}$ represent the patients, and the edges $E^{(m)}$ represent the connections between patients.

4. The cancer subtype identification method based on multi-omics data sets according to claim 1, characterized in that, after constructing the similarity graph based on the dimension-reduced data, the method further comprises:

calculating a similarity matrix of the similarity graph;

5. A cancer subtype identification system based on multi-omics data sets, comprising:

a sample acquisition module for acquiring sample data of each patient;

a merging module for merging the subspaces on the Grassmann manifold;

the identification module is used for identifying the cancer subtype through a k-means clustering algorithm based on the combined subspaces;

wherein merging the subspaces on the Grassmann manifold specifically comprises:

based on this measure, the distance between embeddings can be expressed as:

$$\sum_{m=1}^{M} d^2\left(U, U^{(m)}\right) = \sum_{m=1}^{M} \left[ k - \mathrm{tr}\left(U U^{T} U^{(m)} U^{(m)T}\right) \right]$$

wherein M represents the total number of omics, $d^2(U, U^{(m)})$ represents the Grassmann manifold distance between the common subspace span(U) and the subspace span($U^{(m)}$), U denotes the basis of the common subspace of all omics, $U^{(m)}$ denotes the basis of the m-th omics-specific subspace, and $\mathrm{tr}(U U^{T} U^{(m)} U^{(m)T})$ represents the sum of squares of the cosines of the principal angles of all basis pairs between the subspace span($U^{(m)}$) and the common subspace span(U);

thus, the objective function is:

s.t. $U^{T} U = I$

wherein I represents the identity matrix;

6. The cancer subtype identification system based on multi-omics data sets of claim 5, wherein the sample data includes gene expression, miRNA expression, and DNA methylation.

7. The cancer subtype identification system based on multi-omics data sets of claim 5, wherein the expression of the similarity graph is as follows:

$G^{(m)} = \{V^{(m)}, E^{(m)}\}$

8. The cancer subtype identification system based on multi-omics data sets of claim 5, further comprising: