WO2020113673A1

WO2020113673A1 - Cancer subtype classification method employing multiomics integration

Info

Publication number: WO2020113673A1
Application number: PCT/CN2018/121838
Authority: WO
Inventors: 杨超; 殷鹏; 蒋佳新
Original assignee: 深圳先进技术研究院
Priority date: 2018-12-07
Filing date: 2018-12-18
Publication date: 2020-06-11
Also published as: CN111291777A; CN111291777B

Abstract

The present application provides a cancer subtype classification method employing multiomics integration. The method comprises: acquiring target multiomics data of each patient in a target cancer patient group; calculating to obtain an omics similarity matrix; performing prediction on each omics similarity matrix to obtain a predicted similarity matrix; using the omics similarity matrix to correct the predicted similarity matrix, and acquiring a corrected matrix; performing weighted fusion to obtain a fusion matrix; and performing spectral clustering on the fusion matrix, and establishing a cancer subtype category label corresponding to the fusion matrix of each patient. The present application improves the accuracy of classification evaluation of cancer subtypes, while also using a more flexible integration method to classify patients, thereby improving the efficiency of data analysis, and facilitating research on cancer subtypes.

Description

A cancer subtype classification method based on multi-omics integration

Cross-reference of related applications

This application claims priority to Chinese patent application No. 201811496363.3, titled "A cancer subtype classification method based on multi-omics integration" and filed with the Chinese Patent Office on December 07, 2018, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the technical field of cancer subtype classification and evaluation, and more specifically, to a cancer subtype classification method based on multi-omics integration.

Background

The identification of cancer subtypes is essential for cancer diagnosis and treatment. Classification that relies on single-omics information alone tends to be relatively unbalanced, and the resulting subtypes often differ widely in survival rate. Therefore, in recent years, many methods that identify cancer subtypes by integrating multi-omics data have been proposed.

Common methods for cancer target multi-omics data integration include feature extraction, dimensionality reduction, and similarity matrix calculation. Among them, feature extraction and dimensionality reduction methods are generally used in combination, such as latent variable factorization. Common clustering methods include: K-means, mean shift clustering, density-based clustering, and spectral clustering.

However, the existing methods do not consider the similarity deviation between samples and the weight of different omics data in the integration, resulting in poor accuracy and large errors in the classification results of cancer subtypes for patients.

Summary of the invention

In view of this, this application provides a cancer subtype classification method based on multi-omics integration, including:

Obtaining target multi-omics data for each patient in the target cancer patient group; and, calculating an omics similarity matrix corresponding to each omics in the target multi-omics data;

Predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

Using the omics similarity matrix to correct the predicted similarity matrix to obtain a correction matrix;

Performing weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

Performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.

Preferably, "predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices" includes:

Based on the linear regression method, taking the omics similarity matrix corresponding to each omics of the target multi-omics data in turn as the target matrix, and performing linear regression prediction on the target matrix using the omics similarity matrices corresponding to the other omics, to obtain a predicted value for each entry of each target matrix and thereby the predicted similarity matrix, composed of the predicted values, corresponding to each omics similarity matrix.

Preferably, the linear regression prediction is performed using the following formula:

$$r'_{k,ij} = \beta_0 + \sum_{t \neq k} \beta_t \, r_{t,ij}$$

where $\beta_0$ is the hyperparameter, $\beta_t$ is the parameter obtained by the linear regression learning model, and $r'_{k,ij}$ is the predicted value.

Preferably, "using the omics similarity matrix to correct the predicted similarity matrix to obtain a correction matrix" includes:

Summing and averaging the omics similarity matrix of each omics in the target multi-omics data and the corresponding predicted similarity matrix to obtain the correction matrix of each omics in the target multi-omics data.

Preferably, the summation average is calculated by the following formula:

$$W_k = \frac{M_k + M'_k}{2}$$

where $k$ indexes the omics in the target multi-omics data, $W_k$ is the correction matrix, $M_k$ is the omics similarity matrix, and $M'_k$ is the predicted similarity matrix.

Preferably, the “obtaining target multi-omics data for each patient in the target cancer patient group; and calculating the omics similarity matrix corresponding to each omics in the target multi-omics data” includes:

Determining the target multi-omics data for each patient in the target cancer patient group, and performing mean interpolation on the data corresponding to the missing omics;

Performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics of the target multi-omics data; the similarity calculation formula is:

$$S_{k,ij} = \frac{\sum_t (x_{k,it} - \bar{x}_{k,i})(x_{k,jt} - \bar{x}_{k,j})}{\sqrt{\sum_t (x_{k,it} - \bar{x}_{k,i})^2}\,\sqrt{\sum_t (x_{k,jt} - \bar{x}_{k,j})^2}}$$

where $x_{k,it}$ is the value of feature $t$ corresponding to cancer patient $i$ in the $k$-th omics, and $\bar{x}_{k,i}$ is the average value of cancer patient $i$ in the $k$-th omics.

Preferably, after "calculating the similarity of the target multi-omics data to obtain the similarity matrix corresponding to each of the target multi-omics data", the method further includes:

Data processing is performed on the similarity matrix by Fisher transformation to obtain the processed similarity matrix; wherein the Fisher transformation formula is:

$$r_{k,ij} = \frac{1}{2} \ln \frac{1 + S_{k,ij}}{1 - S_{k,ij}}$$

where $r_{k,ij}$ is the transformed similarity, and the similarity matrix $M_k$ is the matrix composed of the entries $S_{k,ij}$.

Preferably, "performing weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix" includes:

Using the differential search method to determine the omics weight corresponding to each omics in the target multi-omics data;

According to the omics weight of each omics, performing weighted fusion on the correction matrix corresponding to each omics of each patient to obtain a fusion matrix; wherein the weighted fusion is performed by the following formula:

$$W = \sum_k \omega_k W_k$$

where $W_k$ is the corrected similarity matrix of the $k$-th omics, $\omega_k$ is the weight corresponding to $W_k$, and $W$ is the final matrix after weighted fusion.

Preferably, after "spectral clustering the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method further includes:

Calculating the mean of each omics of all patients in each of the cancer subtype categories of the target cancer patient group as the mean of subtypes;

Calculating the center point of the subtype group for the mean value of all the subtypes in each of the cancer subtype categories of the target cancer patient group;

Obtaining the multi-omics data to be tested of the patients in the patient group to be analyzed, wherein the omics categories in the multi-omics data to be tested are the same as the omics categories of the target multi-omics data; and calculating the center point of the multi-omics data to be tested of each patient in the patient group to be analyzed as the center point to be analyzed;

Based on the Euclidean distance algorithm, calculating the distance between the center point to be analyzed of each patient in the patient group to be analyzed and the center point of each subtype group as a detection distance value;

Selecting, among all the detection distance values of each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest detection distance value as that patient's cancer subtype category.
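The assignment of new patients described in the steps above can be sketched as a nearest-centre rule. The text does not say how the per-omics means are combined into a "center point", so this minimal sketch assumes that each patient is represented by one feature vector with all omics concatenated, and that a subtype's centre is the mean of its patients' vectors. The function names and the toy data are hypothetical.

```python
import numpy as np

def subtype_centroids(features, labels):
    """Mean feature vector (centre point) of each subtype.

    features: (n_patients, n_features) array, all omics concatenated per patient (assumption)
    labels:   (n_patients,) integer subtype labels from the clustering step
    """
    return {int(c): features[labels == c].mean(axis=0) for c in np.unique(labels)}

def assign_subtype(new_features, centroids):
    """Label each new patient with the subtype whose centre is nearest (Euclidean)."""
    names = list(centroids)
    centres = np.stack([centroids[c] for c in names])   # (n_subtypes, n_features)
    # Rows are new patients, columns are subtype centres.
    dists = np.linalg.norm(new_features[:, None, :] - centres[None, :, :], axis=2)
    return [names[i] for i in dists.argmin(axis=1)]

# Toy data: two well-separated subtypes, labelled 0 and 1.
train = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
train_labels = np.array([0, 0, 1, 1])
centroids = subtype_centroids(train, train_labels)
predicted = assign_subtype(np.array([[0.3, 0.2], [4.8, 5.0]]), centroids)
```

Features measured on very different scales would dominate the Euclidean distance, so some per-feature scaling would likely be needed in practice; the text does not specify one.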

In addition, in order to solve the above problems, the present application also provides a cancer subtype classification device based on multi-omics integration, including:

An acquisition module configured to acquire target multi-omics data for each patient in the target cancer patient group; and, calculate and obtain an omics similarity matrix corresponding to each omics in the target multi-omics data;

A prediction module configured to predict each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

A correction module configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;

A fusion module configured to perform weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

A clustering module configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.

This application provides a cancer subtype classification method based on multi-omics integration. The omics similarity matrix corresponding to each omics in the target multi-omics data of each patient in the target patient group is calculated, and each omics similarity matrix is predicted using a linear regression method to obtain the corresponding predicted similarity matrix. The omics similarity matrix and the predicted similarity matrix are then combined to obtain a correction matrix, the correction matrices are fused according to their weights, and spectral clustering is performed to assign each patient a cancer subtype category number under the preset cancer subtype category labels. This application thus proposes a simple and effective similarity fusion model, built on the similarity matrix, for integrating target multi-omics data to identify cancer subtypes. To address the similarity deviation between samples within each omics, a linear model predicts the similarity between samples for correction, and the corrected matrices from the target multi-omics data are then integrated by weight so that patient samples are clustered into different subtype groups. This application improves the accuracy of classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates the study of cancer subtypes.

Brief description of the drawings

FIG. 1 is a schematic structural diagram of a hardware operating environment involved in an embodiment of a cancer subtype classification method based on multi-omics integration of the present application;

FIG. 2 is a schematic flowchart of a first embodiment of a cancer subtype classification method based on multi-omics integration;

FIG. 3 is a schematic flowchart of a second embodiment of a cancer subtype classification method based on multi-omics integration;

FIG. 4 is a schematic flowchart of a third embodiment of a cancer subtype classification method based on multi-omics integration;

FIG. 5 is a schematic flowchart of another implementation manner in the third example of the cancer subtype classification method based on multi-omics integration of the present application;

FIG. 6 is a schematic flow chart after step S50 of the fourth embodiment of the cancer subtype classification method based on multi-omics integration of the present application;

FIG. 7 is a comparison diagram of survival rate and survival time of glioblastoma cancer subtypes based on a multi-omics integrated cancer subtype classification method;

FIG. 8 is a schematic diagram of functional modules of a cancer subtype classification device based on multi-omics integration of the present application.

The implementation, functional characteristics and advantages of the present application will be further described in conjunction with the embodiments and with reference to the drawings.

Detailed description

The embodiments of the present application are described in detail below, in which the same or similar reference numerals indicate the same or similar elements or the elements having the same or similar functions throughout.

In addition, the terms "first" and "second" are for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, features defined as "first" and "second" may explicitly or implicitly include one or more of the features. In the description of this application, the meaning of "plurality" is two or more, unless otherwise specifically limited.

In this application, unless otherwise clearly specified and defined, the terms "installed", "connected" and "fixed" should be understood in a broad sense. For example, a connection can be a fixed connection, a detachable connection, or an integrated one; it can be a mechanical connection or an electrical connection; it can be a direct connection or an indirect connection through an intermediary; and it can be a connection between two components or an interaction between two components. Those of ordinary skill in the art can understand the specific meanings of the above terms in this application according to specific situations.

It should be understood that the specific embodiments described herein are only used to explain the present application, and are not used to limit the present application.

As shown in FIG. 1, it is a schematic structural diagram of a hardware operating environment of a terminal involved in a solution of an embodiment of the present application.

The computer device in the embodiment of the present application may be a PC, a smart phone, a tablet computer, a portable computer, or another mobile terminal device. As shown in FIG. 1, the computer device may include a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is configured to implement connection and communication between these components. The user interface 1003 may include a display screen and an input unit such as a keyboard or a remote controller, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001. Optionally, the terminal may further include an RF (Radio Frequency) circuit, an audio circuit, a WiFi module, and so on. In addition, the computer device can also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, which will not be described further here.

A person skilled in the art may understand that the structure shown in FIG. 1 does not constitute a limitation on the computer device, which may include more or fewer components than shown, combine some components, or arrange the components differently. As shown in FIG. 1, the memory 1005, as a computer-readable storage medium, may include an operating system, a data interface control program, a network connection program, and a cancer subtype classification program based on multi-omics integration.

This application provides a cancer subtype classification method based on multi-omics integration. The method improves the accuracy of classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates the study of cancer subtypes.

Example 1:

Referring to FIG. 2, the first embodiment of the present application provides a cancer subtype classification method based on multi-omics integration, including:

Step S10: Obtain target multi-omics data for each patient in the target cancer patient group; and, calculate and obtain an omics similarity matrix corresponding to each omics in the target multi-omics data;

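The omics similarity matrix is denoted $M_k$, where $M_k$ can be expressed in the following form:

$$M_k = \begin{pmatrix} S_{k,11} & \cdots & S_{k,1n} \\ \vdots & \ddots & \vdots \\ S_{k,n1} & \cdots & S_{k,nn} \end{pmatrix}$$

where $S_{k,ij}$ is the similarity between patients $i$ and $j$ in the $k$-th omics and $n$ is the number of patients in the group.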

The target cancer patient group is the set of patients whose data are to be analyzed so that cancer subtype classification can be performed in batch for all patients in the group. It includes the pathological data (physical and chemical index data, biochemical test results, and so on) of patients who have the same type of cancer but the same and/or different conditions.

The target cancer patient group includes multiple patients with the same type of cancer, and each patient has target multi-omics data including multiple omics.

The target multi-omics data are the combination of omics that each patient in the target cancer patient group has and that require data analysis.

For example, for lung cancer, a target cancer patient group is established in which all 400 patients are lung cancer patients. According to the importance of the different omics and their relevance to lung cancer, three omics, namely mRNA, methylation, and gene expression, are selected for analysis. These three omics together form the target multi-omics data corresponding to each patient, and mRNA, methylation, and gene expression are the individual omics in the target multi-omics data.

Step S20, predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

Step S30, using the omics similarity matrix to correct the predicted similarity matrix to obtain a correction matrix;

Existing cancer subtype classification techniques generally proceed as follows:

(1) Integrate target multi-omics data;

(2) Perform clustering;

(3) Analyze the survival rate of the clustering results;

(4) Evaluate the clustering results.

It can be seen that existing cancer subtype classification methods consider neither the similarity deviation between samples nor the weight of different omics data in the integration, a shortcoming common to existing classification methods. In addition, the data have too many dimensions, so the quality of feature selection affects the quality of the clustering results, which greatly reduces the credibility and accuracy of the results.

This embodiment addresses these shortcomings by using simple regression and linear fusion to integrate different omics data by weight, which avoids feature selection and dimensionality reduction. It considers both the similarity between patients in each type of data and the weight that each type of data should carry, and finally applies spectral clustering. Because the model is built on the similarity matrix, cancer subtype classification can be performed simply and effectively. The quality of the classification improves and the consistency within each subtype is stronger, which benefits follow-up research on the subtypes and cancer treatment.

In this embodiment, after the omics similarity matrices of the patients in the target cancer patient group are obtained, a prediction matrix corresponding to each similarity matrix is established by linear regression, and the similarity matrix is corrected by the prediction matrix. That is, the measured value and the predicted value are combined, so as to obtain a classification result with stronger consistency, higher accuracy, and higher data reliability.

It should be noted that linear regression is a statistical analysis method that uses regression analysis from mathematical statistics to determine the quantitative relationship of interdependence between two or more variables. When the regression analysis includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression analysis. In this embodiment, each entry of the omics similarity matrix is predicted through multiple linear regression analysis to obtain the predicted similarity matrix corresponding to that omics similarity matrix.

Step S40: Perform weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

According to the weight corresponding to each omics, the correction matrices corresponding to the multiple omics in the target multi-omics data are weighted and fused, thereby obtaining the fusion matrix corresponding to each patient.

Step S50: Perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.

The preset cancer subtype category labels are labels for classifying the target cancer type. For example, after clustering, lung cancer may be divided into three subtypes, type one, type two, and type three, which are the category labels of lung cancer. Correspondingly, the cancer subtype category numbers are C1 for type one, C2 for type two, and C3 for type three. In this embodiment, each patient is classified according to these category labels.

It should be noted that the spectral clustering algorithm is based on spectral graph theory. Compared with traditional clustering algorithms, it has the advantages of being able to cluster on a sample space of arbitrary shape and of converging to the global optimal solution. The algorithm first defines, from the given sample data set, an affinity matrix describing the pairwise similarity of the data points, then calculates the eigenvalues and eigenvectors of that matrix, and finally selects appropriate eigenvectors to cluster the data points.
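The eigenvector-based procedure just described can be illustrated with a minimal two-group sketch. The text does not state which spectral clustering variant is used, so this example assumes the normalized-cut form: it builds the symmetric normalized Laplacian of a non-negative affinity matrix and splits the samples by the sign of the eigenvector belonging to the second-smallest eigenvalue. More than two subtypes would need several eigenvectors followed by a step such as k-means. The function name and the toy matrix are hypothetical.

```python
import numpy as np

def spectral_bipartition(affinity):
    """Split samples into two groups from a symmetric, non-negative affinity matrix."""
    w = np.asarray(affinity, dtype=float).copy()
    np.fill_diagonal(w, 0.0)                        # ignore self-similarity
    degree = w.sum(axis=1)                          # assumed strictly positive
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(w)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    # Eigenvector of the second-smallest eigenvalue, rescaled by D^{-1/2}.
    fiedler = eigenvectors[:, 1] * d_inv_sqrt
    return (fiedler > 0).astype(int)

# Toy fused matrix: patients 0-2 resemble each other, as do patients 3-4.
affinity = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
])
labels = spectral_bipartition(affinity)
```

Which group receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.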

The target cancer is classified through spectral clustering, and after the classification, cancer subtype category labels corresponding to the different classes are established (for example, C1, C2, C3). That is, a relationship is established between each patient's fusion matrix and the cancer subtype category labels, so that patients with different cancer subtypes can be classified by those labels.

This embodiment provides a cancer subtype classification method based on multi-omics integration. The omics similarity matrix corresponding to each omics in the target multi-omics data of each patient in the target patient group is calculated, and each omics similarity matrix is predicted using the linear regression method to obtain the corresponding predicted similarity matrix. The omics similarity matrix and the predicted similarity matrix are combined to obtain a correction matrix, the correction matrices are fused according to their weights, and spectral clustering is performed to assign each patient a cancer subtype category number under the preset cancer subtype category labels. This embodiment thus proposes a simple and effective similarity fusion model, built on the similarity matrix, for integrating target multi-omics data to identify cancer subtypes. To address the similarity deviation between samples within each omics, a linear model predicts the similarity between samples for correction, and the corrected matrices from the target multi-omics data are then integrated by weight so that patient samples are clustered into different subtype groups. This embodiment improves the accuracy of the classification and evaluation of cancer subtypes, classifies patients through a more flexible integration method, improves the efficiency of data analysis, and facilitates the study of cancer subtypes.

Example 2:

Referring to FIG. 3, the second embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2 above, step S20, "predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices", includes:

Step S21, based on the linear regression method, taking the omics similarity matrix corresponding to each omics of the target multi-omics data in turn as the target matrix, performing linear regression prediction on the target matrix using the omics similarity matrices corresponding to the other omics, obtaining the predicted value corresponding to each entry of each target matrix, and thereby obtaining the predicted similarity matrix, composed of the predicted values, corresponding to each omics similarity matrix;

The linear regression prediction is performed using the following formula:

$$r'_{k,ij} = \beta_0 + \sum_{t \neq k} \beta_t \, r_{t,ij}$$

The target multi-omics data for each patient include multiple omics, and the corresponding omics similarity matrix is obtained by calculating the similarity for each omics. The linear regression method is then used to predict each omics similarity matrix to obtain a predicted similarity matrix.

Specifically, for each omics included in the target multi-omics data, the omics similarity matrix of that omics is used as the target matrix, and the omics similarity matrices of the other omics are used to perform linear regression prediction on it. This gives the predicted value of each entry in the target matrix, that is, the predicted similarity matrix corresponding to the target matrix. The same procedure is then applied to each of the other matrices in turn, so as to obtain the predicted similarity matrix corresponding to every omics similarity matrix.

For example, the target multi-omics data include three omics, with omics similarity matrices M1, M2, and M3. The linear regression prediction process is:

Use M2 and M3 to perform linear regression prediction on M1 to get M1';

Use M1 and M3 to perform linear regression prediction on M2 to get M2';

Use M1 and M2 to perform linear regression prediction on M3 to get M3'. The resulting M1', M2' and M3' are the predicted similarity matrices corresponding to M1, M2 and M3, respectively.
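The leave-one-out regression in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the regression is fitted by ordinary least squares over the patient pairs (i, j) with i < j, an intercept column plays the role of β₀, and the diagonal of each matrix is left unchanged. The function `predict_similarity` and the random toy matrices are hypothetical.

```python
import numpy as np

def predict_similarity(matrices):
    """For each matrix, regress its off-diagonal entries on those of the other matrices."""
    n = matrices[0].shape[0]
    upper = np.triu_indices(n, k=1)                 # each patient pair (i < j) once
    columns = [m[upper] for m in matrices]
    predicted = []
    for k, target in enumerate(columns):
        others = [c for t, c in enumerate(columns) if t != k]
        design = np.column_stack([np.ones_like(target)] + others)   # intercept + other omics
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        fitted = design @ beta
        m_pred = matrices[k].copy()                 # keep the original diagonal
        m_pred[upper] = fitted
        m_pred[(upper[1], upper[0])] = fitted       # mirror to keep the matrix symmetric
        predicted.append(m_pred)
    return predicted

rng = np.random.default_rng(0)

def random_symmetric(n):
    """Toy symmetric matrix with unit diagonal, standing in for a similarity matrix."""
    a = rng.uniform(-1, 1, size=(n, n))
    s = (a + a.T) / 2
    np.fill_diagonal(s, 1.0)
    return s

m1, m2 = random_symmetric(6), random_symmetric(6)
m3 = 0.5 * m1 + 0.5 * m2                            # M3 is an exact linear mix of M1 and M2
m1_pred, m2_pred, m3_pred = predict_similarity([m1, m2, m3])
```

Because the toy M3 is an exact linear combination of M1 and M2, its prediction reproduces it; with real data the fitted values differ from the measured ones, and that difference is what the subsequent correction step averages out.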

In step S30, "correcting the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix" includes:

Step S31: Summing and averaging the omics similarity matrix of each omics in the target multi-omics data and the corresponding predicted similarity matrix to obtain the correction matrix of each omics in the target multi-omics data;

The summation average is calculated by the following formula:

$$W_k = \frac{M_k + M'_k}{2}$$

After the omics similarity matrix of each omics is obtained, each omics similarity matrix is predicted based on the linear regression method to obtain the corresponding predicted similarity matrix. The omics similarity matrix and the corresponding predicted similarity matrix are then summed and averaged, that is, the measured similarity value is corrected by the predicted value, which improves the accuracy of the similarity value. Where there is similarity deviation between samples in an omics, the similarity predicted by the linear model compensates for it, so that the values of the resulting similarity matrix are more accurate and reliable.
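As a small worked example of this correction step, the sketch below averages a measured matrix and its prediction entry by entry. It assumes that "summing and averaging" means the plain element-wise mean of the two matrices; the function name and the numbers are hypothetical.

```python
import numpy as np

def correct_similarity(measured, predicted):
    """Element-wise average of a measured similarity matrix and its regression prediction."""
    return (np.asarray(measured, dtype=float) + np.asarray(predicted, dtype=float)) / 2.0

m_k = np.array([[1.0, 0.8], [0.8, 1.0]])        # measured similarity
m_k_pred = np.array([[1.0, 0.6], [0.6, 1.0]])   # similarity predicted from the other omics
w_k = correct_similarity(m_k, m_k_pred)         # off-diagonal entries move to about 0.7
```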

Example 3:

Referring to FIGS. 4-5, the third embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2 above, step S10, "obtaining target multi-omics data for each patient in the target cancer patient group; and calculating an omics similarity matrix corresponding to each omics in the target multi-omics data", includes:

Step S11: Determine the target multi-omics data of each patient in the target cancer patient group, and perform mean interpolation on the data corresponding to the missing omics.

The target multi-omics data contain multiple omics, but because of the large number of patients, not every patient is fully tested for every omics. Some patients therefore lack data for a certain omics, and the calculation cannot proceed. In that case, the missing omics data of those patients are interpolated with the average of the corresponding items of the other patients, which supplements the data without changing the measured values and preserves the statistical significance of the data.
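Mean interpolation of this kind can be sketched as below, assuming that missing measurements are marked as NaN in a patients-by-features array and that each gap is filled with the mean of the same feature over the patients who have it. The function name and the toy array are hypothetical.

```python
import numpy as np

def impute_feature_means(data):
    """Replace NaN entries with the mean of the same feature over patients who have it."""
    filled = np.array(data, dtype=float)            # copy; rows = patients, columns = features
    feature_means = np.nanmean(filled, axis=0)      # per-feature mean, ignoring NaN
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = feature_means[cols]
    return filled

omics = np.array([[1.0, 4.0],
                  [np.nan, 6.0],
                  [3.0, np.nan]])
completed = impute_feature_means(omics)             # gaps become 2.0 and 5.0
```

Measured entries are left untouched, which matches the requirement that the supplement does not change the existing values.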

Step S12: Perform similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics of the target multi-omics data; the similarity calculation formula is:

$$S_{k,ij} = \frac{\sum_t (x_{k,it} - \bar{x}_{k,i})(x_{k,jt} - \bar{x}_{k,j})}{\sqrt{\sum_t (x_{k,it} - \bar{x}_{k,i})^2}\,\sqrt{\sum_t (x_{k,jt} - \bar{x}_{k,j})^2}}$$

where $x_{k,it}$ is the value of feature $t$ corresponding to cancer patient $i$ in the $k$-th omics, and $\bar{x}_{k,i}$ is the average value of cancer patient $i$ in the $k$-th omics.

To build the similarity matrix, the data of one omics are tabulated for all patients, for example with gene expression levels along the horizontal axis and patient names or numbers along the vertical axis. From this table, the similarity between each patient's gene expression levels and those of every other patient is calculated, which establishes the omics similarity matrix of gene expression for the patients.
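The definitions given for the similarity formula (a per-patient feature value and a per-patient mean within one omics) are consistent with a Pearson correlation between patients, so the sketch below assumes that form. Rows are patients and columns are features of a single omics; the function name and the toy data are hypothetical.

```python
import numpy as np

def patient_similarity(data):
    """Pearson correlation between every pair of patients (rows) across features (columns)."""
    centred = data - data.mean(axis=1, keepdims=True)   # subtract each patient's own mean
    norms = np.linalg.norm(centred, axis=1)
    return (centred @ centred.T) / np.outer(norms, norms)

expression = np.array([
    [1.0, 2.0, 3.0, 4.0],   # patient 0
    [2.0, 4.0, 6.0, 8.0],   # patient 1: same profile as patient 0, scaled
    [4.0, 3.0, 2.0, 1.0],   # patient 2: reversed profile
])
similarity = patient_similarity(expression)
```

In this toy case patients 0 and 1 are perfectly correlated and patients 0 and 2 are perfectly anti-correlated. NumPy's `np.corrcoef` computes the same matrix directly.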

In another embodiment, after step S10, "calculating the similarity of the target multi-omics data to obtain the similarity matrix corresponding to each omics of the target multi-omics data", the method further includes:

Step S60: Perform data processing on the similarity matrix by Fisher transformation to obtain the processed similarity matrix; wherein the Fisher transformation formula is:

$$r_{k,ij} = \frac{1}{2} \ln \frac{1 + S_{k,ij}}{1 - S_{k,ij}}$$

After the omics similarity matrix corresponding to each omics is obtained, the similarity matrix is converted by the Fisher transformation. This pre-processes and normalizes the data in the omics similarity matrix, so that the pre-processed matrix can be handled more efficiently in subsequent data processing.
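A minimal sketch of the Fisher transformation, assuming the standard z-transform r = ½·ln((1 + s)/(1 − s)), which equals the inverse hyperbolic tangent. The transform is infinite at s = ±1 (for example on the diagonal of a correlation matrix), so this sketch clips the input slightly inside the open interval; the clipping and the `eps` parameter are assumptions, not something stated in the text.

```python
import numpy as np

def fisher_transform(similarity, eps=1e-6):
    """Fisher z-transform r = 0.5 * ln((1 + s) / (1 - s)), applied element-wise."""
    s = np.clip(similarity, -1.0 + eps, 1.0 - eps)   # keep |s| < 1 so the result stays finite
    return np.arctanh(s)

s_k = np.array([[1.0, 0.5], [0.5, 1.0]])
r_k = fisher_transform(s_k)                          # off-diagonal: 0.5 * ln(3), about 0.549
```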

Step S40, "performing weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix", includes:

Step S41, using a differential search method to determine the omics weight corresponding to each omics in the target multi-omics data;

A differential search with a step of 0.05 is used to determine the optimal weight corresponding to each omics of the cancer.

Step S42: Perform weighted fusion on the correction matrix corresponding to each omics according to the omics weight of each omics to obtain a fusion matrix, wherein the weighted fusion is performed by the following formula:

$$W = \sum_k \omega_k W_k$$

where $W_k$ is the correction matrix of the k-th omics, $\omega_k$ is the weight corresponding to $W_k$, and $W$ is the final matrix after weighted fusion.

As mentioned above, the correction matrices of all omics are fused according to the optimal weight corresponding to each omics, so that the fusion matrix for the patients is obtained.
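Steps S41 and S42 can be sketched as a search over a 0.05 grid of weights followed by a weighted sum of the correction matrices. The text does not state which criterion the search optimizes, so `score` below is a hypothetical callable supplied by the caller. The constraint that the weights are non-negative and sum to 1 is likewise an assumption, as are all function names.

```python
from itertools import product


def fuse(matrices, weights):
    """Weighted sum W = sum_k w_k * W_k of equally sized square matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices))
             for j in range(n)] for i in range(n)]


def weight_grid(n_omics, step=0.05):
    """Yield every non-negative weight tuple on the grid that sums to 1."""
    units = round(1 / step)  # 20 units of 0.05; integers avoid float drift
    for head in product(range(units + 1), repeat=n_omics - 1):
        rest = units - sum(head)
        if rest >= 0:
            yield tuple(u / units for u in head + (rest,))


def search_weights(matrices, score, step=0.05):
    """Return (weights, fused matrix) maximizing the caller-supplied score."""
    best = max(weight_grid(len(matrices), step),
               key=lambda w: score(fuse(matrices, w)))
    return best, fuse(matrices, best)


W1 = [[1.0, 0.2], [0.2, 1.0]]
W2 = [[1.0, 0.8], [0.8, 1.0]]
# Toy criterion for illustration only: prefer a large off-diagonal similarity.
weights, fused = search_weights([W1, W2], score=lambda w: w[0][1])
```

With three omics the grid contains 231 candidate weight tuples, so an exhaustive search at this step size stays cheap. In practice the score might be a measure of how well the resulting subtypes separate, but that choice is left open here.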

For example, in this embodiment, weighted fusion and single omics are compared. For details, see Table 1:

Table 1 Cox-log P-value comparison table of single-omics and weighted fusion in subtype survival analysis

Data	Gene expression	DNA methylation	miRNA expression	Weighted fusion
GBM	2.49×10^-3	5.71×10^-3	1.50×10^-3	2.66×10^-4

From this table, it can be seen that weighted fusion yields a smaller P value than any single omics, so the subtype classification is more reliable. Therefore, in this embodiment, a weighted fusion method is used to fuse the data from multiple omics, so that statistically more reliable and accurate calculation and analysis results can be obtained.

Example 4:

Referring to FIG. 6, the fourth embodiment of the present application provides a cancer subtype classification method based on multi-omics integration. Based on the first embodiment shown in FIG. 2 above, after step S50, "performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method also includes:

Step S70, calculating the mean of each omics of all patients in each of the cancer subtype categories of the target cancer patient group as the mean of subtypes;

After the cancer subtype category corresponding to each patient in the target cancer patient group has been determined, a general rule can be constructed from the determined categories. This general rule serves as a data analysis model for analyzing the case data of other individual patients or patient groups, so that rapid typing can be achieved.

In addition, for the target cancer patient group to serve as a data analysis model expressing a general rule, it must contain a sufficient number of patients; the larger the number, the higher the accuracy of the model. A preset threshold can therefore be set: once the number of patients in the target cancer patient group reaches the preset threshold, the group can be used as a data analysis model to classify the cancer subtypes of other patients from their omics data. For example, the preset threshold may be 300 cases, meaning that the target cancer patient group must contain no fewer than 300 patients.

As mentioned above, the target cancer patient group contains multiple patients, and each patient corresponds to one cancer subtype category; that is, after the data classification analysis, all patients in the target cancer patient group are divided into groups according to their cancer subtype categories.

Within each cancer subtype category of the target cancer patient group, the values of each omics are averaged over all patients in that category to obtain the subtype means. The number of sets of subtype means is the same as the number of cancer subtype categories.

Step S80: From all the subtype means in each cancer subtype category of the target cancer patient group, calculate the center point of the subtype group;

After step S70, the subtype means of each cancer subtype category in the target cancer patient group are available, and averaging the subtype means of a given category yields the center point of that subtype group.

Step S90: Obtain the to-be-tested multi-omics data of the patients in the patient group to be analyzed, wherein the omics categories in the to-be-tested multi-omics data are the same as the omics categories of the target multi-omics data; and calculate the center point of the to-be-tested multi-omics data of each patient in the patient group to be analyzed as the center point to be analyzed;

In the above, the step of "obtaining the to-be-tested multi-omics data of the patients in the patient group to be analyzed, wherein the omics categories in the to-be-tested multi-omics data are the same as the omics categories of the target multi-omics data" can be performed before step S70 or simultaneously with step S70, as long as it is completed before "calculating the center point of the to-be-tested multi-omics data of each patient in the patient group to be analyzed as the center point to be analyzed" is performed.

As mentioned above, the patient group to be analyzed is a set of patients different from the target cancer patient group, and it may consist of one patient or multiple patients. The omics categories of the to-be-tested multi-omics data of each patient in the patient group to be analyzed must be consistent with the omics categories in the target multi-omics data of each patient in the target cancer patient group. For example, if the target multi-omics data in the target cancer patient group includes mutations, methylation, and mRNA, each patient in the corresponding patient group to be analyzed must also have mutation, methylation, and mRNA data. Only when the omics data is consistent can the comparison and analysis be performed.

As mentioned above, the center point to be analyzed is obtained by averaging over all omics in the to-be-tested multi-omics data of each patient in the patient group to be analyzed.

Step S100: Based on the Euclidean distance algorithm, calculate the distance between the center point to be analyzed of each patient in the patient group to be analyzed and the center point of each subtype group as the detection distance value;

It should be noted that the Euclidean distance, also known as the Euclidean metric, is a commonly used definition of distance: the true distance between two points in m-dimensional space. In two-dimensional space, the Euclidean distance is the length of the straight line segment between two points.

Through the Euclidean distance algorithm, the Euclidean distance between the center point to be analyzed of each patient in the patient group to be analyzed and the center point of each subtype group is calculated as the detection distance value.

Step S110: Among all the detection distance values of each patient in the patient group to be analyzed, select the cancer subtype category corresponding to the smallest detection distance value as the cancer subtype category of that patient.

After all the detection distance values of a patient have been obtained, they are compared, and the cancer subtype category corresponding to the numerically smallest detection distance value is selected as the cancer subtype category of that patient. In this way, once all patients in the target cancer patient group have been classified, the group serves as a data analysis model expressing a general rule with which other patients can be classified quickly.

For example, for one or more cancer patients in a newly added patient group to be analyzed, the original cluster label data can be used to perform the classification calculation on a single sample or on a group of samples and directly determine the cancer subtype category.

There are 500 patients in the target cancer patient group, each patient has data for three omics, O1, O2 and O3, and the patient group is divided into two subtypes, C1 and C2, by the method of steps S10-S50. Let the patients newly added as a group (the patient group to be analyzed) be n1, n2, ..., nk. The details are as follows:

1. Calculate the mean of each omics (O1, O2, O3) over all patients in each cancer subtype category (C1, C2) of the target cancer patient group as the subtype means: $X_{1,1}$, $X_{1,2}$, $X_{1,3}$ and $X_{2,1}$, $X_{2,2}$, $X_{2,3}$. In the subscript of X, a 1 before the comma corresponds to C1 and a 2 before the comma corresponds to C2, while the 1, 2 and 3 after the comma correspond to O1, O2 and O3, respectively.

2. Calculate the center point of the subtype group from all the subtype means in each cancer subtype category (C1, C2) of the target cancer patient group:

$X_1 = (X_{1,1} + X_{1,2} + X_{1,3})/3$, corresponding to C1;

$X_2 = (X_{2,1} + X_{2,2} + X_{2,3})/3$, corresponding to C2.

3. Obtain the to-be-tested multi-omics data of the patients in the patient group to be analyzed, wherein the omics categories in the to-be-tested multi-omics data are the same as the omics categories of the target multi-omics data; and calculate the center point of the to-be-tested multi-omics data of each patient in the patient group to be analyzed as the center point to be analyzed.

Calculate the center points of the new samples n1, n2, ..., nk respectively:

$\mathrm{new}_1 = (n_{1,1} + n_{1,2} + n_{1,3})/3$;

$\mathrm{new}_2 = (n_{2,1} + n_{2,2} + n_{2,3})/3$;

...

$\mathrm{new}_k = (n_{k,1} + n_{k,2} + n_{k,3})/3$;

where $n_{k,1}$, $n_{k,2}$ and $n_{k,3}$ are the values of the k-th patient of the new sample (in this case, multiple patients) in omics O1, O2 and O3.

4. Classification of the new sample subtypes: among all the detection distance values of each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest detection distance value is selected as the cancer subtype category of that patient.

Using the Euclidean distance formula, all the detection distance values of each patient are found: for each subtype category i, the detection distance value $d_{i,k}$ between new sample k and the center point $X_i$ of that subtype group is calculated. The cancer subtype categories determined in this embodiment are C1 and C2, so two detection distance values, $d_{1,k}$ and $d_{2,k}$, must be calculated.

If $d_{1,k} < d_{2,k}$, the new sample k belongs to subtype C1; if $d_{1,k} > d_{2,k}$, it belongs to subtype C2. If there are more cancer subtype categories, for example five, the cancer subtype category corresponding to the smallest detection distance value is likewise selected as the cancer subtype category of the patient.
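Steps S70 to S110 amount to a nearest-center classification, which can be sketched as follows. This is a minimal illustration in plain Python under one loudly stated assumption: each omics of a patient is represented by a vector, and all omics vectors share a common length, so that they can be averaged into a single center point as in the formulas above. The function names and the sample data are hypothetical.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def euclidean(p, q):
    """Euclidean distance between two points of the same dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def subtype_centers(patients, labels):
    """Steps S70-S80: per-subtype omics means, then their center point.

    patients: one entry per patient, each a list of per-omics vectors
              (assumed to share a common length so they can be averaged).
    labels:   subtype label of each patient from the clustering step.
    """
    centers = {}
    for label in set(labels):
        members = [p for p, lab in zip(patients, labels) if lab == label]
        # Subtype mean of each omics (O1, O2, ...) over the members.
        omics_means = [mean_vector(omics) for omics in zip(*members)]
        centers[label] = mean_vector(omics_means)
    return centers


def classify(patient, centers):
    """Steps S90-S110: assign the subtype whose center point is nearest."""
    point = mean_vector(patient)
    return min(centers, key=lambda label: euclidean(point, centers[label]))


patients = [
    [[1.0, 1.0], [3.0, 3.0]],  # clustered as C1
    [[1.0, 3.0], [3.0, 1.0]],  # clustered as C1
    [[9.0, 9.0], [7.0, 7.0]],  # clustered as C2
]
centers = subtype_centers(patients, ["C1", "C1", "C2"])
new_label = classify([[2.0, 2.0], [4.0, 2.0]], centers)
```

Because `min` is taken over all subtype centers, the same code covers the case of more than two subtype categories.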

In this embodiment, for one or more cancer patients in a newly added patient group to be analyzed, the original cluster label data can be used to classify a single sample or a group of samples and directly determine the cancer subtype category. In this way, a general rule can be established as a data analysis model according to the cancer subtype classification method and used to analyze the data of other patients, which facilitates rapid typing and data analysis of target patients or patient groups in clinical research. In addition, the classification data of each patient added later can also be added to the model, so that the accuracy of the model analysis is continuously corrected and improved and the results remain statistically credible.

Statistical application experiment based on glioblastoma:

In order to better illustrate the cancer subtype classification method based on multi-omics integration provided in this application, an application comparison experiment was conducted.

First, a group of 215 glioblastoma patients was classified by the cancer subtype classification method based on multi-omics integration, giving the classification result shown in Table 2. Table 2 summarizes the result obtained after clustering, with the sex ratio, age and survival time tallied for each of the three subtypes. The analysis indicates that the C1 subtype and the C2 subtype have significantly different pathogenesis. Subsequent research can compare the treatment effects of clinical drugs and other interventions across the subtypes, in order to study the corresponding drugs and treatment methods for patients of each subtype.

Further, the obtained classification result is plotted against the corresponding survival rate, where the three subtypes Subtype1, Subtype2 and Subtype3 correspond to C1, C2 and C3. According to the analysis results, the survival rates of the three subtypes in the above glioblastoma patient group are compared against the corresponding survival times. The results are shown in Figure 7: there are significant differences in survival rate among the three subtypes, which demonstrates that the cancer subtype classification method based on multi-omics integration provided in this embodiment is accurate and effective, and that its results are statistically significant and credible.

Table 2 Comparison table of clinical characteristics of glioblastoma subtypes

Subtype ID	C1 (N=42)	C2 (N=112)	C3 (N=61)
Patients (male:female)	24:18	69:43	41:20
Average age (years)	46.4	58.8	54.8
Average survival time (days)	931.9	402.5	504.9

In addition, referring to FIG. 8, the present application also provides a cancer subtype classification device based on multi-omics integration, including:

The obtaining module 10 is configured to obtain target multi-omics data for each patient in the target cancer patient group; and, calculate and obtain an omics similarity matrix corresponding to each omics in the target multi-omics data;

The prediction module 20 is configured to use a linear regression method to predict each of the omics similarity matrices to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

The correction module 30 is configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;

The fusion module 40 is configured to perform weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

The clustering module 50 is configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
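The prediction and correction modules can be sketched together as follows. The exact regression formula is not reproduced in this text, so the model used here is an assumption: an intercept plus one coefficient per other omics, fitted by ordinary least squares over the off-diagonal patient pairs. The correction step is the element-wise mean of the original and predicted matrices. All function names are hypothetical, and the diagonal is simply kept as-is.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(rows, target):
    """Least-squares coefficients [b0, b1, ...] for target ~ b0 + sum b_t * x_t."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, target)) for i in range(p)]
    return solve(xtx, xty)


def corrected_matrix(matrices, k):
    """Predict omics k from the other omics, then average with the original."""
    n = len(matrices[k])
    others = [m for idx, m in enumerate(matrices) if idx != k]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    rows = [[m[i][j] for m in others] for i, j in pairs]
    beta = fit_ols(rows, [matrices[k][i][j] for i, j in pairs])
    predicted = [row[:] for row in matrices[k]]  # diagonal kept as-is (assumption)
    for (i, j), row in zip(pairs, rows):
        value = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        predicted[i][j] = predicted[j][i] = value
    # Correction: element-wise mean of the original and predicted matrices.
    return [[(matrices[k][i][j] + predicted[i][j]) / 2 for j in range(n)]
            for i in range(n)]


def symmetric(upper, n=4):
    """Build an n x n symmetric matrix with unit diagonal from upper-triangle values."""
    m = [[1.0] * n for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for (i, j), v in zip(pairs, upper):
        m[i][j] = m[j][i] = v
    return m


u1 = [0.1, 0.4, 0.2, 0.7, 0.5, 0.3]
u2 = [0.6, 0.2, 0.9, 0.1, 0.4, 0.8]
m1, m2 = symmetric(u1), symmetric(u2)
# Toy target that is an exact linear function of the other two omics.
m0 = symmetric([0.2 + 0.5 * a + 0.3 * b for a, b in zip(u1, u2)])
W0 = corrected_matrix([m0, m1, m2], 0)
```

In this toy case the target matrix is exactly linear in the other two, so the prediction reproduces it and the correction matrix equals the original up to rounding. With real data the two differ, and the averaging pulls each omics toward what the other omics predict.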

In addition, the present application also provides a computer device including a memory and a processor, the memory being configured to store a cancer subtype classification program based on multi-omics integration, and the processor running the cancer subtype classification program based on multi-omics integration to enable the computer device to perform the cancer subtype classification method based on multi-omics integration as described above.

In addition, the present application also provides a computer-readable storage medium on which a cancer subtype classification program based on multi-omics integration is stored; when the cancer subtype classification program based on multi-omics integration is executed by a processor, the cancer subtype classification method based on multi-omics integration as described above is realized.

The sequence numbers of the above embodiments of the present application are for description only, and do not indicate the relative merits of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as the ROM/RAM, magnetic disk or optical disk described above) and includes several instructions to enable a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to perform the method described in each embodiment of the present application. The above are only the preferred embodiments of the present application and do not limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made by using the description and drawings of this application, or any direct or indirect use in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A cancer subtype classification method based on multi-omics integration, which includes:

Obtaining target multi-omics data for each patient in the target cancer patient group; and, calculating an omics similarity matrix corresponding to each omics in the target multi-omics data;

Predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

Using the omics similarity matrix to correct the predicted similarity matrix to obtain a correction matrix;

Performing weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

Performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.
The cancer subtype classification method based on multi-omics integration according to claim 1, wherein "predicting each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices" includes:

Based on the linear regression method, taking the omics similarity matrix corresponding to each omics of the target multi-omics data of each patient in turn as the target matrix, and performing linear regression prediction on the target matrix using the omics similarity matrices corresponding to the other omics, so as to obtain the predicted value corresponding to each data item in each target matrix of the target multi-omics data, and thereby the predicted similarity matrix, containing the predicted values, corresponding to each omics similarity matrix.
The cancer subtype classification method based on multi-omics integration according to claim 2, wherein the linear regression prediction is performed using the following formula:

where $\beta_0$ is a hyperparameter, $\beta_t$ is a parameter obtained by the linear regression learning model, and $r'_{k,ij}$ is the predicted value.
The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that "using the omics similarity matrix to correct the predicted similarity matrix to obtain a correction matrix" includes:

Summing and averaging the omics similarity matrix of each omics in the target multi-omics data and the corresponding predicted similarity matrix to obtain the correction matrix of each omics in the target multi-omics data.
The cancer subtype classification method based on multi-omics integration according to claim 4, wherein the summation average is calculated by the following formula:

$$W_k = \frac{M_k + M'_k}{2}$$

where k denotes an omics in the target multi-omics data, $W_k$ is the correction matrix, $M_k$ is the omics similarity matrix, and $M'_k$ is the predicted similarity matrix.
The cancer subtype classification method based on multi-omics integration according to claim 1, wherein "obtaining target multi-omics data for each patient in the target cancer patient group; and calculating an omics similarity matrix corresponding to each omics in the target multi-omics data" includes:

Determining the target multi-omics data for each patient in the target cancer patient group, and performing mean interpolation on the data corresponding to the missing omics;

Performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data, wherein the calculation formula of the similarity $S_{k,ij}$ is:

$$S_{k,ij} = \frac{\sum_t (x_{k,it} - \bar{x}_{k,i})(x_{k,jt} - \bar{x}_{k,j})}{\sqrt{\sum_t (x_{k,it} - \bar{x}_{k,i})^2}\,\sqrt{\sum_t (x_{k,jt} - \bar{x}_{k,j})^2}}$$

where $x_{k,it}$ is the value of feature t corresponding to cancer patient i in the k-th omics, and $\bar{x}_{k,i}$ is the average value of cancer patient i in the k-th omics.
The cancer subtype classification method based on multi-omics integration according to claim 6, characterized in that, after "performing similarity calculation on the target multi-omics data to obtain the similarity matrix corresponding to each omics in the target multi-omics data", the method also includes:

Performing data processing on the similarity matrix by Fisher transformation to obtain the processed similarity matrix, wherein the Fisher transformation formula is:

$$r_{k,ij} = \frac{1}{2}\ln\frac{1 + S_{k,ij}}{1 - S_{k,ij}}$$

where $r_{k,ij}$ is the transformed similarity, and the similarity matrix $M_k$ is the matrix composed of the $S_{k,ij}$.
The cancer subtype classification method based on multi-omics integration according to claim 1, wherein "performing weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix" includes:

Using a differential search method to determine the omics weight corresponding to each omics in the target multi-omics data;

Performing weighted fusion on the correction matrix corresponding to each omics according to the omics weight of each omics to obtain a fusion matrix, wherein the weighted fusion is performed by the following formula:

$$W = \sum_k \omega_k W_k$$

where $W_k$ is the corrected similarity matrix of the k-th omics, $\omega_k$ is the weight corresponding to $W_k$, and $W$ is the final matrix after weighted fusion.
The cancer subtype classification method based on multi-omics integration according to claim 1, characterized in that, after "performing spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient", the method also includes:

Calculating the mean of each omics of all patients in each of the cancer subtype categories of the target cancer patient group as the subtype means;

Calculating the center point of the subtype group from all the subtype means in each of the cancer subtype categories of the target cancer patient group;

Obtaining the to-be-tested multi-omics data of the patients in the patient group to be analyzed, wherein the omics categories in the to-be-tested multi-omics data are the same as the omics categories of the target multi-omics data; and calculating the center point of the to-be-tested multi-omics data of each patient in the patient group to be analyzed as the center point to be analyzed;

Based on the Euclidean distance algorithm, calculating the distance between the center point to be analyzed of each patient in the patient group to be analyzed and the center point of each subtype group as the detection distance value;

Selecting, among all the detection distance values of each patient in the patient group to be analyzed, the cancer subtype category corresponding to the smallest detection distance value as the cancer subtype category of that patient.
A cancer subtype classification device based on multi-omics integration is characterized in that it includes:

An acquisition module configured to acquire target multi-omics data for each patient in the target cancer patient group; and, calculate and obtain an omics similarity matrix corresponding to each omics in the target multi-omics data;

A prediction module configured to predict each of the omics similarity matrices using a linear regression method to obtain a predicted similarity matrix corresponding to each of the omics similarity matrices;

A correction module configured to correct the predicted similarity matrix using the omics similarity matrix to obtain a correction matrix;

A fusion module configured to perform weighted fusion on the correction matrix corresponding to each omics to obtain a fusion matrix;

A clustering module configured to perform spectral clustering on the fusion matrix corresponding to each patient to determine the cancer subtype category corresponding to each patient.